Downloads/PDF for Archive.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
European Space Agency Symposium “Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data”, 5 - 7 October 2004, Frascati, Italy Providing Authentic Long-term Archival Access to Complex Relational Data Stephan Heuscher a, Stephan J¨armann b, Peter Keller-Marxer c, and Frank M¨ohle d Swiss Federal Archives, Archivstrasse 24, CH-3003 Bern, Switzerland [email protected], [email protected], [email protected], [email protected] (Dated: July 19, 2004) We discuss long-term preservation of and access to relational databases. The focus is on national archives and science data archives which have to ingest and integrate data from a broad spectrum of vendor-specific relational database management systems (RDBMS). Furthermore, we present our solution SIARD which analyzes and extracts data and data logic from almost any RDBMS. It enables, to a reasonable level of authenticity, complete detachment of databases from their vendor-specific environment. The user can add archival descriptive metadata according to a customizable schema. A SIARD database archive integrates data, data logic, technical metadata, and archival descriptive information in one archival information package, independent of any specific software and hardware, based upon plain text files and the standardized languages SQL and XML. For usage purposes, a SIARD archive can be reloaded into any current or future RDBMS which supports standard SQL. In addition, SIARD contains a client that enables ‘on demand’ reload of archives into a target RDBMS, and multi-user remote access for querying and browsing the data together with its technical and descriptive metadata in one graphical user interface. Keywords: Long-Term Digital Preservation, Digital Archives, Relational Databases, Value of Information The urgency of deep-infrastructure solutions for long- tific activities has drastically increased during the past term digital preservation and archiving was clearly for- decade. In particular, a recent international workshop mulated in the late 90’s of the past century by national on the long-term preservation of databases (16) revealed archives and libraries (1; 2; 3; 4; 5; 6; 7) as well as space that many archives do have a long-standing practice and agencies and institutions in earth observation, oceanog- experience in ingesting and preserving relational data, raphy, and astronomy (8; 9). The finding of problem but their daily work is constrained to the treatment of statements and strategies is still in progress and was re- rather simply structured data sets, requires extensive cently supported by a charta of the United Nations (10). manual work by archives personnel, and does not allow Deep and lasting solutions are still not available, and for for smooth and standardized integration of data and de- national archives and libraries (faced with very heteroge- scriptive meta data on a level required to ingest, preserve, neous types of content) it is broadly accepted that digital and provide access to complex relational databases. collections are growing at a rate that outpaces their abil- In this paper we present a method and applica- ity to manage and preserve them (11). tion named “Software-Invariant Archiving of Relational There is an ongoing process of recognition that long- Databases” (SIARD), developed at the Swiss Federal term digital preservation poses similar problems in di- Archives. It completely detaches typed relational data verse disciplines. Despite of different vocabularies used from almost any relational database management sys- by different communities, research and development in tem, while still retaining most of the original data logic arXiv:cs/0408054v1 [cs.DL] 24 Aug 2004 the field can only be successful in a joint effort. During and integrating data and metadata in one archival infor- the past decade, however, decisive progress was achieved mation package that is based on text files and standard- in analytical and conceptual work (12; 13; 14). Fur- ized technologies. In Section II we discuss the technical thermore, the Open Archival Information System (OAIS) and intellectual complexity of relational data in modern reference model became widely accepted in diverse dis- database systems, the resulting problems for long-term ciplines and covers a full range of archival information preservation, and its relevance to archives. The objec- preservation functions including ingest, archival storage, tives in the development of SIARD are described in Sec- data management, access, and dissemination. It has be- tion III, while Section IV covers SIARD’s system archi- come the international standard ISO 14721:2003 (15) and tecture, workflow, features, and development platform. may be applicable to any archive as it does not refer to specific implementations or archival strategies but merely provides a common terminology and functional frame- I. INTRODUCTION work to discuss different implementation approaches. Many current development projects focus on the Relational data is one of the oldest forms of structured archiving of digital images (digitized photographs or pa- information representation, intuitively used already cen- per documents) and sound recordings, while more com- turies before the “digital age”. With the rise of com- plex types of digital information have been neglected, puter technology, the introduction of mathematical for- even though their relevance for governmental and scien- mulations of the relational data model in the mid-20th 2 century, and the international standardization of a corre- widely used in modern databases. As for the linking of ta- sponding data definition and query language, relational bles, the connection between the data and its data types data has become an omnipresent method to organize data and domain definitions is not preserved when table data for electronic data processing in almost every field of is trivially exported to external plain text files. work, form business activities to scientific research and In addition to those entities and features mentioned government administration. above, modern RDBMS include, for example: During the past two decades, usage has developed from • processing single-table data files with specific applica- Check constraints and assertions: For a single col- tion software to generic relational database management umn of a table or a set of entire tables, assure that systems (RDBMS). These have internal mechanisms for changing or entering data does never violate de- logical and physical organization of arbitrary relational fined data types, quantitative restrictions, or value data models, are able to physically store terabytes of domains. In particular, it can be assured that data data, cover rich data types (including internal procedural items in a table column will never be empty. code), enable multi-user transactions, and provide inter- • Views: Assemble selected parts of several tables nal data life-cycle management. The definition, repre- and operate on them as customized, virtual tables. sentation, management, and query of relational data was thereby standardized and separated from specific appli- • Triggers: Force the RDBMS to initiate timed op- cation logic and application software that operates on the erations on data when user-specified conditions are database. As a result, RDBMS have become core com- met, for example log and audit user activities. ponents of almost any type of digital information system. • Functions (basic and user-defined): Perform nu- It is obvious that this development has decisive impacts merical calculations, conversions, or character op- on the work of those institutions which are charged to erations on data items or sets. collect or accept digital data from various data sources, to make it broadly accessible, and to preserve it over • Stored Procedures: Store and execute programs in- decades: national archives and libraries, science data side the RDBMS to perform common or critical archives, or business companies being under special legal tasks which are not part of the specific application regulations for long-term data retention (like, for exam- software outside the RDBMS. ple, the pharmaceutical sector (19)). • Foreign Keys: Ensure referential data integrity, i.e., automatically prevent that values can be stored in rows of one table if there are no corresponding val- II. COMPLEXITY AND RELEVANCE ues in referencing entries within the database. A. Technical Complexity of Relational Data • Grants and Roles: Define user profiles and assign or withdraw privileges, for example to create new One consequence of the progress in database technol- tables or access certain parts of the database. ogy is that relational data and relational databases be- come highly complex. They often consist of hundreds of For long-term preservation in national archives, rela- linked tables (i.e. physical representations of relational tional data is collected from many different database sys- entities1), which makes it impossible to handle and query tems and has to be retained and kept processable and table data outside an RDBMS if these links become bro- accessible for decades. It is therefore essential to store ken or cannot be managed automatically anymore. and maintain the databases independent of any specific Furthermore, any data item in a RDBMS has a pre- and short-lived products (or at least transferring them all cisely defined data type and domain. (Entire tables or into only one preferred product). In fact, most RDBMS integral parts of them may likewise have