Database Preservation Toolkit

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Universidade do Minho: RepositoriUM Database Preservation Toolkit A relational database conversion and normalization tool Bruno Ferreira Luís Faria José Carlos Ramalho KEEP SOLUTIONS KEEP SOLUTIONS University of Minho Rua Rosalvo de Almeida 5 Rua Rosalvo de Almeida 5 4710 Braga, Portugal 4710 Braga, Portugal 4710 Braga, Portugal [email protected] [email protected] [email protected] Miguel Ferreira KEEP SOLUTIONS Rua Rosalvo de Almeida 5 4710 Braga, Portugal [email protected] ABSTRACT intrinsically related technologies function together to per- The Database Preservation Toolkit is a software that au- form tasks such as information storage and retrieval, data tomates the migration of a relational database to the sec- transformation and validation, privilege management and even the enforcement of important business constraints. The ond version of the Software Independent Archiving of Re- 1 lational Databases format. This flexible tool supports the most popular databases are based on the relational model currently most popular Relational Database Management proposed by Codd. [5, 4] Systems and can also convert a preserved database back The migration of the relational database information into to a Database Management System, allowing for some spe- a format well suited for long-term preservation is one of the cific usage scenarios in an archival context. The conversion most accepted strategies to preserve relational databases. of databases between different formats, whilst retaining the This strategy consists in exporting the information of the databases' significant properties, poses a number of interest- relational database, including descriptive, structural and being implementation issues, which are described along with havioural information, and content, to a format suitable for their current solutions. long-term preservation. Such format should be able to main- To complement the conversion software, the Database Vi- tain all significant properties of the original database, whilst sualization Toolkit is introduced as a software that allows ac- being widely supported by the community and hopefully cess to preserved databases, enabling a consumer to quickly based on international open standards [7]. Few formats fit search and explore a database without knowing any query this criteria, being the SIARD format one of the main con- language. The viewer is capable of handling big databases tenders. and promptly present search and filter results on millions of The Software Independent Archiving of Relational Data- records. bases (SIARD) format was developed by the Swiss Federal This paper describes the features of both tools and the Archives and was especially designed to be used as a format methods used to pilot them in the context of the European to preserve relational databases. Its second version, SIARD Archival Records and Knowledge Preservation project on 2, retains the (most commonly agreed upon) database sig- several European national archives. nificant properties and is based on international open standards, including Unicode (ISO 10646), XML (ISO 19503), SQL:2008 (ISO 9075), URI (RFC 1738), and the ZIP file Keywords format. [6, 8, 9, 11, 14, 15] Preservation; Archive; Relational Database; Migration; Ac- The manual creation of SIARD files is impractical, the- cess; SIARD refore an automatic conversion system was developed { the Database Preservation Toolkit (DBPTK). This software can be used to create SIARD files from relational databases in 1. INTRODUCTION various DBMSes, providing an unified method to convert Databases are one of the main technologies that support databases to a database agnostic format that is able to re- information assets of organizations. They are designed to tain the significant properties of the source database. The store, organize and explore digital information, becoming software uses XML Schema Definition capabilities present such a fundamental part of information systems that most in the SIARD format to validate the archived data and can would not be able to function without them [5]. Very often, also be used to convert the preserved database back to a the information they contain is irreplaceable or prohibitively DBMS. expensive to reacquire, making the preservation of databases The digital preservation process is not complete if the a serious concern. The Database Management System (DBMS) is the soft- 1According to the ranking at ware that manages and controls access to databases, which http://db-engines.com/en/ranking (accessed on Apr. 2016), can be described as a collection of related data. These two where 7 of the top 10 DBMS use the relational model /................................................zip file root The SIARD format includes advanced features such as header/....................folder for database metadata the support for Large Objects (LOBs)4, and ZIP compres- metadata.xml sion using the deflate method [15]. SIARD 2 brings many metadata.xsd improvements over the original SIARD and other database version/ preservation formats, mainly the support for SQL:2008 stan- 2.0/.............empty folder signalling version 2 dard and data types, including arrays and user defined types; content/.....................folder for database content the strict validation rules present in XML Schema files to en- schemaM/............M is an integer starting with 0 force valid XML structure and contents; and allowing LOBs tableN/ .......... N is an integer starting with 0 to be saved in the tableN.xml file, saved as files in a folder tableN.xml.........same N used in tableN/ inside the SIARD, or saved as files in a location outside tableN.xsd.........same N used in tableN/ the SIARD file. Furthermore, the SIARD 2 specification allows LOB files to be saved in multiple locations or stor- Figure 1: Basic SIARD 2 directory structure. age devices outside the SIARD file, increasing support for databases which contain large amounts of LOBs. archived information cannot be accessed. To access and ex- 3. DATABASE PRESERVATION plore digitally preserved databases, the Database Visualiza- tion Toolkit (DBVTK) is being developed. This software TOOLKIT can load databases in the SIARD format and display their The DBPTK is an open-source project5 that can be exe- descriptive, structural and behavioural information and con- cuted in multiple operating systems and run in the command- tent. The viewer also provides the functionality to search line. It allows the conversion between database formats, in- and filter the database content as well as export search re- cluding connection to live Relational Database Management sults. Systems, for preservation purposes. The toolkit allows ex- traction of information from live or backed-up databases into preservation formats such as SIARD 2. Also, it can import 2. SIARD 2 back into a live DBMS, to provide the full DBMS function- The second version of the SIARD format emerged from ality, such as SQL6 querying, on the preserved database. lessons learnt by creators and users of database preservation This tool was part of the RODA project and has since formats. The SIARD format was originally developed by the been released as a project on its own due to the increasing Swiss Federal Archives in 2007 and is being used by many interest on this particular feature. It is currently being de- archives worldwide. In 2013, the SIARD format became veloped in the context of the European Archival Records and a Swiss E-Government Standard (eCH-0165). The SIARD- Knowledge Preservation (E-ARK) project together with the DK is a variation of the SIARD format created by the Danish second version of the SIARD preservation format { SIARD National Archives to fit their specific needs. The Database 2. Markup Language (DBML) format, created at University The DBPTK uses a modular approach, allowing the com- of Minho, was used by the Repository of Authentic Digi- bination of an import module and an export module to en- 2 tal Objects (RODA) software to preserve databases at the able the conversion between database formats. The import 3 Portuguese National Archives . The Archival Data Descrip- module is responsible for retrieving the database information tion Markup Language (ADDML) is the format used by the (metadata and data), whilst the export module transcribes Norwegian National Archives to describe collections of data the database information to a target database format. Each files. [3, 12, 13, 1] module supports the reading or writing of a particular data- The SIARD 2 format, in its most basic form, consists of a base format or DBMS and functions independently, making ZIP file that contains a hierarchy of folders and files of XML it easy to plug in new modules to add support for more and XSD (XML Schema) format, illustrated in figure 1. The DBMS and database formats. The conversion functionality XML files inside the SIARD file hold database metadata is provided by the composition of data import with data information and contents. export. The metadata.xml file contains database description in- Currently supported DBMSes include Oracle, MySQL, formation, such as the database name, description and ar- PostgreSQL, Microsoft SQL Server and Microsoft Access. chival date, the archivist name and contact, the institution All of these support import and export, except Microsoft or person responsible for the data; database structural in- Access where only import is available, i.e. conversion from formation, including schemas, tables and data types; and Microsoft Access is possible, but conversion to Microsoft Ac- behavioural information like keys, views and triggers. Such cess is not. All these modules use the Java Database Con- information is useful not only to document the database but nectivity (JDBC) modules as a generic starting point, and also to allow the reliable export of its structure on a different then deviate as much as needed to account for functionality DBMS. specific to each DBMS. The tableN.xml files correspond to each of the database The base JDBC import and export modules, given the tables and hold the content of the rows and cells from that correct configurations and dependencies, may enable con- table.

Database Preservation Toolkit

Solutions for the Chora of Metaponto Publication Series

NEWSLETTER February 2012

Preliminary Programme

Database Preservation DPC Training Course

Abstracts of Foreign Periodicals LESTER K

Editors' Notes

File Format Guidelines for Management and Long-Term Retention of Electronic Records

Curriculum Vitae KEITH W

How to Search

Preserving Geospatial Data

MEMORY of the WORLD REGISTER Sound Toll Registers

Arctic Geopolitics and Eartquake Monitoring