Extending the Role of Metadata in a Digital Library System*
Alexa T. McCray, Marie E. Gallagher, Michael A. Flannick
National Library of Medicine
Bethesda, MD, USA
{mccray,gallagher,flannick}@nlm.nih.gov

* In: Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries '99

Abstract

We describe an approach to the development of a digital library system that is founded on a number of basic principles. In particular, we discuss the critical role of metadata in all aspects of the system design. We begin by describing how the notion of metadata is sometimes interpreted and go on to discuss some of our early experiences in a digital conversion project. We report on the Profiles in Science project, which is making the archival collections of prominent biomedical scientists available on the World Wide Web. We discuss the principles that are used in our system design, illustrating these throughout the discussion. Our approach has involved interpreting metadata in its broadest sense. We capture data about the items in our digital collection for a variety of purposes and use those data to drive the entire system. Further, we have designed our overall system architecture such that it can accommodate changes while still ensuring the persistence of the underlying data.

1. Introduction

Metadata in its broadest interpretation is data about data. The importance of metadata as an aid to resource discovery is acknowledged in the digital library community. The Dublin Core initiative is a metadata standardization effort whose goal it is "to define a core set of elements for resource discovery" [1], and, in particular, to develop a set that "provides adequate data for Web resource discovery and is simple for authors and content managers to create and maintain" [2:176]. Thiele, in a recent review article, says of the Dublin Core: "The object is to develop a simple metadata set and associated syntax that will be used by information producers and providers to describe their networked resources, thereby improving their chance of discovery." [3].

Metadata interoperability is a closely related issue and is also a focus of current metadata research. Daniel, in discussing the Warwick Framework, says:

Metadata efforts often fall into the trap of trying to create a universal metadata schema. Such efforts fail to recognize the basic nature of metadata: namely, that it is far too diverse to fit into one useful taxonomy...the creation, administration, and enhancement of individual metadata forms should be left to the relevant communities of expertise. Ideally this would occur within a framework that will support interoperability across data and domains. [4:277]

Roszkowski and Lukas describe an approach for linking distributed collections of metadata so that they are searchable as a single collection [5], and Baldonado et al. describe an architecture that facilitates metadata compatibility and interoperability [6]. Current developments in metadata standardization, including interoperability issues, are reported regularly on the Web [7-9].

2. Lessons learned from an early digital library project

Some years ago, as an experiment in document management and conversion, we developed a digital library system of historical materials. Though our work on this system, which we began in 1992, pre-dated recent research in digital libraries, we encountered many of the same issues that currently face digital library projects, particularly those that are involved in converting large collections of materials from paper to digital form. Often projects of this type bring together two worlds, the rich world of archival practice and the world of emerging technologies. While archivists generally operate at the level of an entire collection, digital conversion projects require careful attention to individual pages and documents. This has major implications for the way in which a collection is processed. Archivists traditionally sort, organize, and catalogue a collection, producing as a final product a finding aid. The finding aid imposes a structure on the collection and indicates, generally at the folder and box level, where the physical documents may be found. In digital conversion projects primary attention is paid to the identification and management of documents, with perhaps somewhat less attention being paid to the overall structure of the collection. The physical location of the documents becomes of secondary concern (in some cases, the physical documents are even destroyed), and of primary concern is the ability to locate the documents in a database or over a network. If the optical character recognition (OCR) is successful, then retrieval by key words can be somewhat effective. However, if, as is often the case with older materials, the OCR is inadequate, and if the item being converted is a photograph or some other non-textual item, then some other method is needed, in any case, for finding the individual items in the collection.

Our early project involved historical materials from the Regional Medical Programs (RMPs) initiative whose goal it was to establish regional centers of excellence for health care throughout the United States involving medical schools, research institutions, and hospitals. The RMP archival materials span the entire history of the project beginning with an initial report to President Johnson in 1964, through the active period of program implementation, and to its termination in 1976. In addition, materials from a conference held at the National Library of Medicine (NLM) in December 1991 are included. The material in the RMP collection presented us with a variety of challenges, either because the documents were of very poor quality (including mimeographs, and, in some cases, photocopies of mimeographs), or because they were oddly sized (including folded pamphlets, oversized books, loose-leafed binders, pages from memo pads, etc.). In addition to the scanned documents, interview transcripts, audio segments, photographs, and conference session transcripts are included in the database.

We digitized some 1,500 documents, representing about 40,000 pages, and developed what is now called metadata for each of the items in the database. The purpose of the metadata, which is made available as an "index" record, was to ensure that documents could be retrieved even if the OCR was inadequate (which it often was). The metadata also served to link the various forms of the same document (e.g., TIFF, OCR, etc.) to each other through the unique identifier that was assigned to each document. Metadata templates, which were used to standardize the information being collected, varied by document type. Thus, for example, published articles would have information about authors, journal, publisher, place of publication, etc., while unpublished letters would include information about the sender and the recipient. Common to all document types would be information about the contributor, number of pages, location of the physical document, scanning and index dates, and index terms from NLM's Medical Subject Headings (MeSH) thesaurus, together with a controlled terminology that was special to the RMP documents.

We scanned the documents, creating a digital master. The master copy is a high-quality, lossless TIFF image from which other formats may be derived over time. When the Web technology first became available, we created a Web-based version of the system. Our first challenge was to make the TIFF images available through the Web without requiring users to acquire additional viewing software. We experimented with GIF derivatives of the TIFF pages, but at full size these took an unacceptably long time to download, and at reduced size their quality was unacceptable. When the portable document format (PDF) became available, and, importantly, when the viewer became freely available as a browser plug-in, we derived PDF images from the original TIFFs and then made both versions available on the Web site [10].

3. A new challenge: metadata-driven conversion

Founded on our early experience with the RMP program materials, we began a project in the spring of 1997 whose goal it is to make the archival collections of prominent biomedical scientists available on the World Wide Web. The site is designed for scientists, scholars, and students, all of whom may gain an appreciation of the history of early scientific discoveries, and also share in the excitement of the scientific enterprise. The collections have been donated to the NLM and contain published and unpublished materials, including books, journal volumes, pamphlets, diaries, letters, manuscripts, photographs, audio tapes and other audiovisual materials. The site was officially launched in September 1998 [11]. The first collection on the site represents the work of Oswald Theodore Avery (1877-1955), one of this country's first molecular biologists, whose findings proved that the genetic material is DNA.

Underlying the Profiles Web site is a system that is designed to handle the entire life cycle of a large-scale conversion project. Metadata forms the core of the system. It is the major component of the data input stage; it is used for generating various views for display on the Web; and it serves as the basis for search and retrieval.

The primary principles underlying our system design are modularity, adherence to standards, and extensibility. We create high-quality original images and detailed metadata records. From these, we are able to automatically derive a variety of other image formats, and we are able to derive a variety of views for our Web site. We automate whatever it is possible to automate, hoping thereby not only to ensure accuracy, consistency and efficiency, but also to contribute to ease of use. Creating a digital archive is a labor-intensive effort, and we are attempting to design a system that minimizes the burden of routine data entry, allowing the archivists to concentrate instead on the intellectual aspects of the tasks

information and instructions for the scanner and feedback to the archivist about problems encountered during the scanning process.