Using XML for Long-Term Preservation: Experiences from the DiVA Project
Title: Using XML for Long-term Preservation
Subtitle: Experiences from the DiVA Project
Authors: Müller, Eva; Klosa, Uwe; Hansson, Peter; Andersson, Stefan; Siira, Erik
Organization: Uppsala University Library, Electronic Publishing Centre
Address: Box 510, 75 120 Uppsala, Sweden
Url: http://publications.uu.se
Keyword: long-term preservation, XML, XML Schema, DiVA, DiVA Document Format, DiVA Archive, URN, URN:NBN

Abstract

One of the objectives of the DiVA project is to explore the possibility of using XML as a format for long-term preservation. For this reason, the practical use of XML in different parts of the system was evaluated before deciding on the design. The DiVA Document Format, defined by an XML schema, has been developed to describe the inter-relationships amongst the various data elements and processes, and to support long-term preservation of the actual documents.

XML Schema provides a means for defining the structure, content and semantics of XML documents. It is an XML-based alternative to the XML Document Type Definition (DTD). Because one of the primary reasons for using XML was to support long-term preservation, the most popular DTDs for documents, DocBook and TEI, were evaluated. Limitations regarding metadata descriptions were found in both of these DTDs, so the decision was made to develop a new structure for DiVA using XML Schema. This schema combines the DocBook schema (derived from the DocBook DTD) for the textual parts of the document with an internal schema for all metadata (bibliographic and administrative data).

Several applications were developed that use the DiVA Document Format for content management and inter-process communication.
Some of their purposes are essential for long-term preservation:

• Make persistent National Bibliographic Numbers (NBN) available to the URN resolution service [1] at the Royal Library in Stockholm.
• Send MARC21 records in MARC-XML to the National Library.
• Create archival file packages for long-term preservation, checksum them, store them in the DiVA Archive and send a copy of them to the Swedish Royal Library.

Currently the file archives for long-term preservation contain the original full-text file in various formats and the DiVA Document Format file, which contains all the metadata about the document. Furthermore, the DiVA Document Format file contains all parts of the full-text file that can be converted into XML. In the future it might be possible to transfer the whole full text into XML, in which case the file archives would contain only DiVA Document Format files.

[1] http://urn.kb.se/resolve

Table of Contents

1 XML as Long-term Preservation Format
1.1 XML Schema
1.2 Comparison of DocBook and TEI
1.3 DiVA Document Format
2 Long-term Preservation in the DiVA Project
2.1 Uniform Resource Name (URN) and National Bibliographic Number (NBN)
2.2 The DiVA Archive
3 Conclusions

Preface

DiVA - Digitala vetenskapliga arkivet (DiVA Archive) - is the overall name for a searchable archive containing all documents published in electronic form at Uppsala University in Sweden. Other Swedish universities are also co-operating in the project within the DiVA framework. One part of this archive is the database containing theses published at Uppsala University from 1998 to date.

In September 2000 an Electronic Publishing Centre was established at Uppsala University Library. Its primary assignment was a project to create technical solutions and a well-functioning workflow for electronic posting and full-text publication of doctoral theses, essays, working papers and other types of scientific publications.
The first phase of the project was completed in 2002 and the result was the DiVA Publishing System, a system for electronic publishing of different types of publications.

One of the goals has been to create a long-term archive containing all digital documents published at Uppsala University. The assignment involves both technical and organisational issues. The development team faced many questions. How can the loss of data be avoided? What kind of descriptive and administrative metadata is useful for archiving? What is the appropriate metadata format for long-term preservation? How important is the layout of the objects and how is it to be handled? How can images and formulas be handled?

Because of those questions, XML was discussed early on as a format for storing descriptive and administrative metadata, as well as the complete content of the documents. XML is a format that both humans and machines can easily restore and understand.

This paper describes the current status of the XML implementation in the DiVA Archive and the surrounding applications, and explains why XML is an important format for long-term preservation.

1 XML as Long-term Preservation Format

One of the objectives of the DiVA project is to explore the possibility of using XML as a format for long-term archiving. There are several advantages to using XML-encoded documents for long-term archiving. XML is an open and established notation. XML documents are in a human-readable text format, and internationalised character sets are supported. These characteristics facilitate data migration and make it likely that the documents will have a long life. For these reasons XML seemed like a good choice, but to ensure success, the practical use of XML in different parts of the system was evaluated before a decision about the design was made.

In the DiVA project XML is not used only for archiving.
It is also used for communication between different processes within the system and for internal communication in the development team. XML also makes it possible to validate data against an XML schema. The dynamic web interface is built on XML and XSLT.

1.1 XML Schema

XML Schema provides a means for defining the structure, content and semantics of XML documents. XML Schema is an XML-based alternative to the XML Document Type Definition (DTD). Because the primary reason for using XML was to support long-term archiving, the most popular DTDs and schemas for documents, namely DocBook and TEI, were evaluated. Limitations regarding the metadata descriptions needed in the DiVA project were found. Because of the need to combine administrative metadata, descriptive metadata and content, a new schema was developed that meets the needs of the DiVA project. This schema combines the DocBook schema (derived from the DocBook DTD) for the textual parts of the document with the bibliographic metadata and administrative metadata for long-term preservation.

XML Schema was chosen over XML DTD because it is written in XML and supports many data types, self-defined data types and different namespaces. The support for different data types offers several advantages. It is possible to describe permissible document content, to validate the correctness of data, to define restrictions on data (data facets), to define data formats (data patterns) and to convert between different data types. It is also easier to work with data coming from a database.

During development, it was noticed that XML Schema facilitated communication between the developers by providing a simple mechanism for writing formal specifications of subsystem interfaces.

1.2 Comparison of DocBook and TEI

TEI [2] and DocBook [3] are two widely used recommendations for encoding textual material in electronic form.
These two recommendations were compared to find which is the more appropriate and convenient to use when representing full-text documents in the DiVA Archive.

A logical unit, i.e. a combination of XML elements and/or XML attributes that has a certain well-defined meaning, can be expressed differently in TEI and DocBook. A logical unit that consists of only one well-defined element in DocBook often corresponds to a general element plus an attribute in the TEI representation. Attribute values are not defined in the TEI recommendation and therefore have to be defined locally. It is therefore likely that others would not correctly interpret a TEI-encoded document without prior agreement on these values.

Elements that define the structure of documents, e.g. headers, chapters, lists and tables, are more specifically defined in DocBook than in TEI. For publication of documents like PhD theses or scientific papers it is therefore more convenient to use DocBook, because the relevant structure elements are well defined. But if a text is to be marked up in detail both semantically and structurally, for example in order to create scholarly archives of diverse kinds of historical sources or for linguistic purposes, the more general TEI scheme would be a better choice.

The main purpose in the DiVA project is to store the structure of the contents of the documents and not to store the semantics. Therefore DocBook was chosen to mark up the content.

Element      TEI                              DocBook
Heading 1    <div1 type="chapter" n='1'>      <chapter id="1">
               <head n="1">Heading 1</head>     <title>Heading 1</title>
             </div1>                          </chapter>
Superscript  <hi rend="sup">text</hi>         <superscript>text</superscript>
Lists        <list type="…"></list>           <orderedlist numeration="…">…</orderedlist>

Table 1: Some elements in TEI and DocBook

[2] See: http://www.tei-c.org/
[3] See: http://www.docbook.org/

1.3 DiVA Document Format

Version 1.0 of the DiVA Document Format, defined by an XML Schema, consists of 99 elements [4].
Administrative elements are combined with descriptive elements to make it possible to describe a publication in the same XML document file that contains its content. Many element names exist in both singular and plural form. The plural form is always used to name container elements. A container element contains one or more elements in its corresponding singular form. For example <creators> contains one or more <creator> elements, <titles> contains <title> elements and so on.
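
The plural/singular container convention can be illustrated with a short Python sketch. Only the element names <creators>, <creator>, <titles> and <title> come from the text above; the root element name <record>, the sample values, the absence of namespaces and the helper check_containers are assumptions made for illustration and are not taken from the DiVA schema. The sketch checks only the naming convention and is not a substitute for validation against the actual XML Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <record> root and the values are assumed;
# only the container/member element names appear in the paper.
SAMPLE = """<record>
  <creators>
    <creator>Müller, Eva</creator>
    <creator>Klosa, Uwe</creator>
  </creators>
  <titles>
    <title>Using XML for Long-term Preservation</title>
  </titles>
</record>"""

# Plural container name -> singular name of the elements it may hold.
CONTAINERS = {"creators": "creator", "titles": "title"}


def check_containers(root, containers=CONTAINERS):
    """Return a list of problems; an empty list means the convention holds."""
    problems = []
    for elem in root.iter():
        singular = containers.get(elem.tag)
        if singular is None:
            continue  # not a known container element
        children = list(elem)
        if not children:
            problems.append(f"<{elem.tag}> is empty")
        for child in children:
            if child.tag != singular:
                problems.append(
                    f"<{elem.tag}> contains unexpected <{child.tag}>"
                )
    return problems


if __name__ == "__main__":
    print(check_containers(ET.fromstring(SAMPLE)))
```

For the sample fragment the function reports no problems. A container that is empty, or that holds an element other than its singular form, is reported as a violation.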