Extending the Role of Metadata in a Digital Library System*

Alexa T. McCray, Marie E. Gallagher, Michael A. Flannick
National Library of Medicine
Bethesda, MD, USA
{mccray,gallagher,flannick}@nlm.nih.gov

* In: Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries '99

Abstract

We describe an approach to the development of a digital library system that is founded on a number of basic principles. In particular, we discuss the critical role of metadata in all aspects of the system design. We begin by describing how the notion of metadata is sometimes interpreted and go on to discuss some of our early experiences in a digital conversion project. We report on the Profiles in Science project, which is making the archival collections of prominent biomedical scientists available on the World Wide Web. We discuss the principles that are used in our system design, illustrating these throughout the discussion. Our approach has involved interpreting metadata in its broadest sense. We capture data about the items in our digital collection for a variety of purposes and use those data to drive the entire system. Further, we have designed our overall system architecture such that it can accommodate changes while still ensuring the persistence of the underlying data.

1. Introduction

Metadata in its broadest interpretation is data about data. The importance of metadata as an aid to resource discovery is acknowledged in the digital library community. The Dublin Core initiative is a metadata standardization effort whose goal is "to define a core set of elements for resource discovery" [1], and, in particular, to develop a set that "provides adequate data for Web resource discovery and is simple for authors and content managers to create and maintain" [2:176]. Thiele, in a recent review article, says of the Dublin Core: "The object is to develop a simple metadata set and associated syntax that will be used by producers and providers to describe their networked resources, thereby improving their chance of discovery." [3].

Metadata interoperability is a closely related issue and is also a focus of current metadata research. Daniel, in discussing the Warwick Framework, says:

"Metadata efforts often fall into the trap of trying to create a universal metadata schema. Such efforts fail to recognize the basic nature of metadata: namely, that it is far too diverse to fit into one useful [...] ...the creation, administration, and enhancement of individual metadata forms should be left to the relevant communities of expertise. Ideally this would occur within a framework that will support [...] across [...] and domains." [4:277]

Roszkowski and Lukas describe an approach for linking distributed collections of metadata so that they are searchable as a single collection [5], and Baldonado et al. describe an architecture that facilitates metadata compatibility and interoperability [6]. Current developments in metadata standardization, including interoperability issues, are reported regularly on the Web [7-9].

2. Lessons learned from an early project

Some years ago, as an experiment in document management and conversion, we developed a digital library system of historical materials. Though our work on this system, which we began in 1992, pre-dated recent research in digital libraries, we encountered many of the same issues that currently face digital library projects, particularly those that are involved in converting large collections of materials from paper to digital form. Often projects of this type bring together two worlds, the rich world of archival practice and the world of emerging technologies. While archivists generally operate at the level of an entire collection, digital conversion projects require careful attention to individual pages and documents. This has major implications for the way in which a collection is processed. Archivists traditionally sort, organize, and catalogue a collection, producing as a final product a finding aid. The finding aid imposes a structure on the collection and indicates, generally at the folder and box level, where the physical documents may be found. In digital conversion projects primary attention is paid to the identification and management of documents, with perhaps somewhat less attention being paid to the overall structure of the collection. The physical location of the documents becomes of secondary concern (in some cases, the physical documents are even destroyed), and of primary concern is the ability to locate the documents in a database or over a network. If the optical character recognition (OCR) is successful, then retrieval by key words can be somewhat effective. However, if, as is often the case with older materials, the OCR is inadequate, or if the item being converted is a photograph or some other non-textual item, then some other method is needed for finding the individual items in the collection.

Our early project involved historical materials from the Regional Medical Programs (RMPs) initiative, whose goal was to establish regional centers of excellence for health care throughout the United States, involving medical schools, research institutions, and hospitals. The RMP archival materials span the entire history of the program, beginning with an initial report to President Johnson in 1964, through the active period of program implementation, and to its termination in 1976. In addition, materials from a conference held at the National Library of Medicine (NLM) in December 1991 are included. The material in the RMP collection presented us with a variety of challenges, either because the documents were of very poor quality (including mimeographs, and, in some cases, photocopies of mimeographs), or because they were oddly sized (including folded pamphlets, oversized [...], loose-leafed binders, pages from memo pads, etc.). In addition to the scanned documents, interview transcripts, audio segments, photographs, and conference session transcripts are included in the database.

We digitized some 1,500 documents, representing about 40,000 pages, and developed what is now called metadata for each of the items in the database. The purpose of the metadata, which is made available as an "index" record, was to ensure that documents could be retrieved even if the OCR was inadequate (which it often was). The metadata also served to link the various forms of the same document (e.g., TIFF, OCR, etc.) to each other through the unique identifier that was assigned to each document. Metadata templates, which were used to standardize the information being collected, varied by document type. Thus, for example, published articles would have information about authors, journal, publisher, place of publication, etc., while unpublished letters would include information about the sender and the recipient. Common to all document types would be information about the contributor, number of pages, location of the physical document, scanning and index dates, and index terms from NLM's Medical Subject Headings (MeSH) thesaurus, together with a controlled terminology that was special to the RMP documents.

We scanned the documents, creating a digital master. The master copy is a high-quality, lossless TIFF image from which other formats may be derived over time. When Web technology first became available, we created a Web-based version of the system. Our first challenge was to make the TIFF images available through the Web without requiring users to acquire additional viewing software. We experimented with GIF derivatives of the TIFF pages, but at full size these took an unacceptably long time to download, and at reduced size their quality was unacceptable. When the portable document format (PDF) became available, and, importantly, when the viewer became freely available as a browser plug-in, we derived PDF images from the original TIFFs and then made both versions available on the Web site [10].

3. A new challenge: metadata-driven conversion project

Founded on our early experience with the RMP materials, we began a project in the spring of 1997 whose goal is to make the archival collections of prominent biomedical scientists available on the World Wide Web. The site is designed for scientists, scholars, and students, all of whom may gain an appreciation of the history of early scientific discoveries, and also share in the excitement of the scientific enterprise. The collections have been donated to the NLM and contain published and unpublished materials, including books, journal volumes, pamphlets, diaries, letters, manuscripts, photographs, audio tapes and other audiovisual materials. The site was officially launched in September 1998 [11]. The first collection on the site represents the work of Oswald Theodore Avery (1877-1955), one of this country's first molecular biologists, whose findings proved that the genetic material is DNA.

Underlying the Profiles Web site is a system that is designed to handle the entire life cycle of a large-scale conversion project. Metadata forms the core of the system. It is the major component of the data input stage; it is used for generating various views for display on the Web; and it serves as the basis for search and retrieval.

The primary principles underlying our system design are modularity, adherence to standards, and extensibility. We create high-quality original images and detailed metadata records. From these, we are able to automatically derive a variety of other image formats, and we are able to derive a variety of views for our Web site. We automate whatever it is possible to automate, hoping thereby not only to ensure accuracy, consistency and efficiency, but also to contribute to ease of use. Creating a digital library is a labor-intensive effort, and we are attempting to design a system that minimizes the burden of routine data entry, allowing the archivists to concentrate instead on the intellectual aspects of the tasks at hand.

3.1. Digitizing and loading the repository

Figure 1 illustrates the architecture of the Profiles in Science system. Items chosen for digitization include photographs, electronic documents (documents or photographs that are "born" digitally [12:4]), paper documents, audio recordings, and videos. The archivist uses the customized metadata entry system that we created to enter descriptive and administrative metadata. The descriptive metadata is typically externalized as Dublin Core, and all the metadata is intended to allow mapping to a variety of element sets as needed.

Figure 1. Architecture of Profiles in Science System

The system provides a number of document management capabilities, including tracking functions. It also has built-in quality control features and provides a variety of reports, including an automatically generated scan sheet. These scan sheets accompany the physical objects throughout the digitization process, providing information and instructions for the scanner and feedback to the archivist about problems encountered during the scanning process.

Once the metadata for a set of items has been entered, the process of creating the master digital object begins. High-resolution TIFF files are created as the digital master copies from which a variety of Web-accessible derivatives are created. Adobe PDF is derived from the master TIFF files for black and white documents, and two sizes of JPEG are derived from the greyscale and color TIFF files. Web-friendly streaming audio and video files (QuickTime and RealMedia) are produced from the video and audio files.

When document scanning is complete, the scanning technician deposits the scanned files into an incoming directory, indicating that the items are ready to go through the quality control process. At that point, the scanning technician returns the original items to the archivist.

The metadata entry program reads the incoming directory and performs some basic checking, for example, to see if pages are missing, or if a file is named according to a non-existent unique identifier. Items that pass this check are marked as "ready" and those that do not pass are marked as "incomplete" and are moved to a "redo" directory. The archivist checks each original item against the ready items, and uses the metadata entry system to change the status of each metadata record to either "final" (if the item passed quality control) or "redo" (if the item failed quality control). Items that failed either the automated or manual quality control are returned to the scanning technician. Information such as which technician scanned the item, and when the item was moved through each stage of quality control and by whom, is automatically logged by the DBMS.

Those digitized items that pass final quality control are moved to the archival image server or the Web server as appropriate. The database is exported and a suite of programs performs more validity checking, creates HTML pages for the metadata records which point to the Web-accessible derivatives, and creates sets of HTML pages which allow multiple views of the collection.

3.2. Entering and validating metadata

The metadata entry system is used to collect a sufficient amount of data about individual items to allow any item to be found during a search of the collection, and it performs a number of functions in addition to recording descriptive information about the items in the collection. The design of the system encourages correct, consistent, and standard collection of metadata. It allows multiple persons to enter metadata simultaneously, providing a common interface for all persons entering data, and enforcing the notion of required fields. Whenever possible, data are entered by choosing from enumerated lists. Data validation is performed by the system wherever appropriate, and warning messages are generated to alert the user to potential problems.

Since most items require that permission be sought before they can be made available in digital form to the public, capabilities are provided in each metadata record for recording information about the status of copyright permissions, as well as any special restrictions imposed by donors.

The metadata record also allows the archivist to enter information that relates to the intellectual organization of the collection. Thus, the particular series or sub-series into which a document falls can be entered. Our future plans are to extend this capability even further by incorporating additional elements found in encoded finding aids [13].

The permanent physical location of an item, such as a box or folder, is also recorded. Special information, such as the location of an item temporarily removed from the collection, can also be documented in the metadata record. Since various types of personnel work on various parts of the process, levels of access to the metadata entry system can be granted to different groups such as scanners, archivists, and supervisors. The access levels may apply to certain fields in a metadata record, or they may apply to entire collections.

Figures 2 - 4 are illustrative screen shots of our metadata entry system. Figure 2 shows the first screen an archivist sees.

Figure 2. Initial metadata entry screen

As each item is logged into the metadata entry system, it is assigned a unique identifier. When the archivist enters the system, the next available unique identifier for the collection being processed appears. The user would enter the Dublin Core information about the item as appropriate: Title, Subject (keywords), Relation, Coverage, Resource Type, Format, Creator (author), Contributors, Publisher, Source, Rights Management, Description, Language, and Date. These category names correspond to Dublin Core elements because they were close enough for our current purposes and made the mapping to the Dublin Core element set, currently the best available standard, quite easy. The unique identifier that stands in for the Dublin Core resource identifier is assigned by the system. Other types of information include quality control checks, instructions about disposition of the document after digitization, the physical condition of the document, and where the document falls in the physical and intellectual organization of the collection.

The archivist chooses from drop-down lists when entering data whenever possible. Depending on the type of entry chosen, a window pops up displaying the appropriate elements that would apply to that entry. Figure 3 illustrates.

Figure 3. Choosing items from a list

In Figure 3, the user has chosen "Journal Article" as the source type and would choose the correct journal name from the authoritative list. Then volume, issue, pages, and ISSN for the document would be added. These elements are stored separately in the database, but when the choices are made they are displayed in a standard combined form in the Source field. Changing the formula that creates the combined form would require no changes to the data since the elements are stored separately. Consistency is increased, since the combined form is based on a formula instead of free text.

Although Dublin Core specifies that all fields are optional and repeatable, we require that certain elements be entered for each record. This is so that a reasonable amount of minimal information is included for every item entered into the system, and it ensures that the basic information needed to create a view of the collection based on the metadata exists. The archivist may not save the record until the required data are entered (or designated as unknown). Figure 4 illustrates.

Figure 4. Warning message when obligatory data are missing

4. Roles of the metadata system

4.1. Input: framework for collection management

The metadata entry system manages all aspects of the digitization process. Once the scope and overall arrangement of a collection are determined, and items are chosen for digitization, the archivist begins entering metadata for each item. The unique identifier binds the digital master files and Web-accessible derivatives permanently to the metadata record. Thus, the TIFF, PDF, and OCR versions of a document are all linked by the same unique identifier, as is the full metadata record that has been created for that document. In some cases a word-processed form of a document may co-exist with a printed version of that same document. In that case, the same identifier for the [...] of the printed document will also link the word-processed form.

The archivist records information about the physical object, such as its location, including the method of organization, e.g., the folder and box in which the document resides and the series or sub-series of which it is a part. Information about the quality of the physical object is included. If the item is fragile, oversized, or needs special handling of any type, this is recorded in the system (for use in the subsequent scanning stage). Additionally, and importantly, information about the status of copyright permissions is also noted.

The metadata entry system enforces quality control. Pull-down menus, check boxes, and option buttons are used whenever possible, thereby eliminating spelling and other errors, and data cannot be saved unless all required fields have been filled. The system tracks whether all metadata elements have been checked, and if so by whom, when the item was scanned, by whom, and whether it has been checked for quality. Only when all quality control has been completed are the record and digital object released for inclusion in the digital library.

A number of reports can be created from the metadata entry system, which further manages the workflow. These reports can be displayed on the screen as well as printed for further use. Certain types of summary reports are also available. For example, full lists of the standard elements (e.g., journal names) can be displayed for further analysis. The user may wish to view the record just created, and in that case a metadata report is displayed. A summary of all items currently in the collection can be displayed which includes unique identifiers, titles, formats and document types. A complete status history can be printed that shows all the phases through which a document has passed up to that point. Various statistics can be gleaned from the system, including all work done in a recent month, or overall statistics on the number of items that have been processed in each collection to date.

Security measures are also in place. Entry to the system is by password only. Information may only be read, entered, modified, or deleted based on the user profile. Each user's rights can be further customized relative to the metadata fields, status information, printing, modification of standard lists, and system administration. Further, users may be restricted to certain collections and may not log on more than once at any given time. A log is kept of all individuals accessing the system, including time of entry and exit.

4.2. Display and organization: foundation for Web delivery

The metadata RDBMS is the foundation for the Web delivery system. A series of programs generates the HTML which allows the documents to be browsed over the Web. The programs read the data exported by the database, and first perform consistency checking among metadata records and within individual records, displaying warning and error messages about the metadata if problems arise. For example, if the archivist has pointed to the unique identifier of a related document in the relation field, and if that related document is marked as not yet being publicly accessible, then a warning message would be printed. Once all validation has been completed, the programs automatically generate HTML versions of metadata records for each item that will appear on the Web site. The metadata record points to an actual document or other digital object on the Web server, and the programs check to verify the existence of these objects. Figure 5 shows a metadata record on the Profiles site.

Figure 5. Sample metadata record from the Profiles in Science Web site

The elements shown in the metadata record that appears on the Web constitute a subset of the information that is stored in the metadata RDBMS and conform to the Dublin Core set of elements. We have also included information about the number of pages and image sizes as well as a note indicating that the item is a photocopy. In the case of the "Relation" field, the title that is printed there has been automatically generated by the programs. The archivist simply entered a unique identifier in the Relation field at data entry time; the system generates the full title.

In addition to automatically generating Web pages, the programs also generate specially formatted lists of URLs and a subset of the metadata elements which are fed to the search engine for indexing the collection's metadata records and digitized objects. Finally, the programs generate statistics that are added to the release history for that collection.

We have designed our programs such that they generate a series of alternate views based on information in the individual metadata records. The views vary depending on the collection. Each collection has an alphabetical (by item title) and chronological view of the items. Within these views the items are organized by resource type. One collection has been organized by "epoch", separated into folders, and assigned identifiers by the donor. Because this epoch, folder and identifier information is entered for each item, a view which reflects this organization is also created. As collections are processed and documents are assigned to permanent boxes and folders, it will be possible to create a view of the items that reflects their permanent locations. Since these views are generated automatically from the metadata, views are updated by re-running the programs using the latest version of the database. In general, any imaginable view can be generated by the program as long as the information exists in the metadata records. The digital images themselves are stored in only one place, according to the unique identifier, but the individual HTML metadata files and the views pointing to them can exist wherever needed. Being able to browse a collection through multiple points of view gives the user an understanding of the collection that might not be obvious from searching or sequential viewing of the items. For example, one can discern almost immediately whether a collection is composed primarily of correspondence or published items based on a view of the collection by resource types.

We have also implemented a variety of filtering mechanisms to address access management issues. Arms proposes a policy-based framework for access management: Each policy relates some group of users with some set of digital material and permits or denies certain types of operations on the material [14]. In our case, the system is being developed for three types of uses: 1) universal access (freely available on our Web site), 2) access within NLM's History of Medicine reading room only (if so desired by the donor), or 3) access to named individuals only, for some period of time. This latter might be access by digital library staff only until the full collection is ready for release, or it could be access by the donor only until the donor is satisfied that the material may be released. For example, the donor may wish to review the correspondence in the collection to ensure that the privacy of individuals mentioned in the letters is not violated. Other items may be publicly available only for a certain length of time, and renewed permission must be sought, or the item must be removed from public access after expiration. In these cases, a checkbox in the metadata record is used to indicate that the item should not be made available to the public.

4.3. Discovery: standards for resource description

As noted earlier, resource discovery has been the focus of the digital library community's standardization efforts in metadata. The Dublin Core working group has established desiderata for a core set of elements that would be easy for individual authors to use as well as being suitable for larger digital library projects. These are: simplicity, semantic interoperability (useable across potentially disparate subject domains), international consensus (since the Internet is a global resource), extensibility, and modularity on the Web (allowing co-existence of complementary schemes in an overall architecture, such as the Resource Description Framework). Recently, the working group published an informational Request for Comments on the fifteen basic Dublin Core elements [15]. The document reports on the consensus that has been reached by the individuals participating in the Dublin Core Metadata Workshop series that has been taking place since 1995.

Though the stability of the Dublin Core elements has not been guaranteed, and though the semantics of some of the elements have not always been clear, it has seemed important to us to stay consistent with the evolving standard. We have taken the same view here as in other areas of our system design. We capture a variety of basic information about each of our digital objects, and then we generate the Dublin Core elements from this information. The advantage of this method is that as the Dublin Core develops we can continue to map our generic format to whatever the current Dublin Core elements are. Since the Dublin Core strives to be simple, favoring minimal work by authors, it is unlikely, though possible, that new elements would be proposed for which we do not have the basic information in our system.

5. Discussion

The Digital Library Federation recently defined "digital libraries", as this notion is understood and agreed to by its members:

"Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities." [12:3].

While one may argue with the assertion that a digital library is synonymous with an organization or institution, it is nonetheless interesting to note the attributes of a digital library that are highlighted by this definition. Selection of material is important, particularly for retrospective projects where it may not be feasible or even desirable to digitize everything in the collection. It may well be the case that a full archival collection will consist of both paper and digital objects, and the digital objects themselves might have arisen from a conversion process, or they might have originated in digital form. The unifying structure for all of these objects might be an archival finding aid, which would point to both physical and electronic locations. The next three attributes listed in the definition are closely related to each other. Structuring, offering intellectual access to, and interpreting collections imply a process of organizing, cataloguing, and indexing material such that it can be more easily accessed and understood by users. Developing a finding aid is the traditional approach for physical archival collections, while assigning metadata to individual items in a collection addresses these functions for digital collections.

Preserving the integrity of and ensuring the persistence of digital works is, in our opinion, one of the thorniest issues in digital library work. There are the technical issues of persistence of hardware and software, and there are the organizational issues involving a commitment to the digital archive, or even the question of whether or not the organization itself survives. A very interesting case is the archive of the now defunct U.S. Congress Office of Technology Assessment (OTA) [16]. Over a twenty-three year history the office created a large number of in-depth reports on a variety of topics. When funding for the office was withdrawn in 1995, the survivability of these reports was in doubt. OTA staff worked to make the reports available electronically and created a fully searchable CD-ROM of the materials. Princeton University mirrored the former OTA Online site and continues to maintain it. What is of note here is that without the intervention of extraordinarily dedicated individuals, these documents and this legacy would surely have been lost.

Digital library projects involve extensive resources, both human and computational. As we design and implement such projects, we need to be mindful of the investment we are making and the commitment that this implies. If we design the system in such a way that it adheres to standards and is extensible, then we have a better chance of ensuring its integrity and persistence. We will still need to track constantly evolving standards, hardware, and software and modify our systems accordingly over time.

Our approach has involved interpreting metadata in its broadest possible sense. We capture data about the items in our digital collection for a variety of purposes and use those data to drive the entire system. The metadata record, together with the unique identifier that is assigned to it, is the basic unit in the system. Using this record we manage the digitization process; we automatically generate views of the collection for our Web site; and we extract a subset of the data, publishing it as Dublin Core, for use in network-based retrieval. Since future data and delivery formats are unknown, we have designed our system architecture such that it can easily accommodate changes, while still ensuring the persistence of the underlying data.

Acknowledgements

We gratefully acknowledge the following individuals in NLM's Library Operations Division who are our collaborators in the Profiles in Science project: Brian Aleibar, Margaret Byrnes, Phong Do, Ed Fishwick, Erin McLeary, Sheila O'Neill, Gregory Pike, Melissa Plotkin, and Monica Unger. These colleagues manage the archival collections, providing the necessary content expertise; they scan the items; they work with us on defining the set of metadata elements for the system, and they are the primary users of the metadata entry system. Quang Le provides valuable system administration for the project, and Becky Cagle designed the Profiles cover image and created the digital audio and video files. Nancy Bladen carefully reviewed and assisted in final preparation of the paper.

References

[1] Weibel, Stuart, Juha Hakala. DC-5: The Helsinki metadata workshop: A report on the workshop and subsequent developments. D-Lib Magazine, http://www.dlib.org/, February 1998.

[2] Weibel, Stuart, Carl Lagoze. An element set to support resource discovery. International Journal on Digital Libraries, Vol. 1, pp. 176-186, 1997.

[3] Thiele, Harold. The Dublin Core and Warwick Framework: A review of the literature, March 1995 - September 1997. D-Lib Magazine, http://www.dlib.org/, January 1998.

[4] Daniel, Ron Jr., Carl Lagoze, Sandra Payette. A metadata architecture for digital libraries. In: Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, pp. 276-88, 1998.

[5] Roszkowski, Michael, Christopher Lukas. A distributed architecture for resource discovery using metadata. D-Lib Magazine, http://www.dlib.org/, June 1998.

[6] Baldonado, Michelle, Chen-Chuan Chang, Luis Gravano, Andreas Paepcke. Metadata for digital libraries: architecture and design rationale. In: Proceedings of the 2nd ACM International Conference on Digital Libraries, pp. 47-56, 1997.

[7] The Dublin Core: A Simple Content Description Model for Electronic Resources. http://purl.org/dc/.

[8] W3C-Technology and Society Domain: Metadata and Resource Description. http://www.w3.org/Metadata/.

[9] International Federation of Library Associations and Institutions. Digital Libraries: Metadata Resources. http://www.ifla.org/II/metadata.htm

[10] Regional Medical Programs, US National Library of Medicine. http://rmp.nlm.nih.gov/.

[11] Profiles in Science, US National Library of Medicine. http://www.profiles.nlm.nih.gov/.

[12] Waters, Donald J. The Digital Library Federation: Program Agenda. Digital Libraries, A Program of the Council on Library and Information Resources, June 1, 1998.

[13] EAD-Encoded Archival Description Official Web Site. http://www.loc.gov/ead/.

[14] Arms, William Y. Implementing Policies for Access Management. D-Lib Magazine, http://www.dlib.org/, February 1998.

[15] RFC 2413, Dublin Core Metadata for Resource Discovery, Weibel, S., J. Kunze, C. Lagoze, M. Wolf. ftp://ftp.isi.edu/in-notes/rfc2413.txt

[16] United States Congress, Office of Technology Assessment. http://www.access.gpo.gov/ota/