Digital Preservation: A Software Approach 8th Convention PLANNER 2012

Digital Preservation: A Software Approach

R K Joteen Singh

Abstract

In today's rapidly developing world, national heritage sites, the pride and identity of a country, appear to be deteriorating even though various preservation measures are being taken. What matters is to understand which preservation methods and techniques will serve us best in this technological, computerized era. Because technology advances continuously, its components are themselves transformed at both the software and hardware levels. One suitable preservation technique is digitization using Xena software. This paper attempts to show how this software can help preserve national heritage in the long run.

Keywords: Digital Preservation, Preservation Strategies, Preservation Software, Xena Software, Open Archival Information System

8th Convention PLANNER-2012, Sikkim University, Gangtok, March 01-03, 2012. © INFLIBNET Centre, Ahmedabad

1. Introduction

National heritage is the pride and identity of a nation and its people. Heritage that gives a people its identity needs to be preserved so that it can be passed down to the generations to come. Elements of national significance such as architecture, landscapes, documents and other artifacts should be preserved using advanced scientific technology and methods to save them from decay and disappearance. India is known throughout the world for its rich heritage and culture. It has been one of the main contributors in the fields of medicine, mathematics, science, technology, philosophy, theology, literature, linguistics, graphic arts, music, dance and many other disciplines. The means of documentation available to our pioneers was to record their ideas and thoughts by writing on palm leaves. One of the main forms of traditional Indian preservation is maintaining palm-leaf books, and the number of books India has contributed in all kinds of disciplines using this method is remarkable. However, these fascinating heritage materials, including paper, palm leaves and birch bark, are organic and have a limited life span. German scientists have determined that most Indian palm-leaf books will naturally decay within the next 50 to 100 years. This implies that after perhaps 100 to 200 years all books of this kind will have disappeared, which would be a loss to human civilization.

Preservation itself has undergone a great transformation over the ages with the advancement of science and technology. In this computerized era it is apt to use technology to preserve a country's heritage and, as many scientists have suggested, the best, most effective and most advantageous way is to digitize the materials and the national heritage sites to save them from being lost. Using sophisticated equipment and computers, ultra-sharp digital colour pictures have been produced of Indian national heritage such as palm-leaf books, paper manuscripts, books, newspapers, letters, birch-bark texts, drawings, paintings, sculptures, inscriptions and many other heritage objects, although the originality and essence of a heritage item may not be fully maintained. The concern now is how far these digitized materials will serve our purpose. Are they free from limitations? The answer, perhaps, is no: we are not safe from their loss. Technology is changing very fast, and present-day information formats may not be accessible after a few decades. Keeping this in mind, the main objective of this paper is to show the feasibility of applying Xena software to the long-term preservation of digitized objects.

2. Digital Preservation

Digital preservation is the process of maintaining accessibility to information and all kinds of records, including scientific and cultural heritage, existing in digital formats. Many national heritage items, including rare books, have been digitized, which promotes accessibility for organizations and even individuals irrespective of the location of the national heritage.

3. Necessity of Digital Preservation

Most of the media on which digital information is recorded have a shorter life span than some analog media such as paper. Although acid paper is prone to deterioration, the rate of deterioration is quite slow, and it is possible to retrieve the information without loss once deterioration is noticed. In the digital environment, data recording media deteriorate more rapidly, and once deterioration is detected it is a matter of chance whether even a piece of information can be retrieved from the particular medium. In addition, digital technology is developing at full speed, and retrieval and coding technologies can become obsolete within a very short period of time. When higher-performance and less expensive storage devices are developed, older versions may be abruptly replaced. When a software decoding technology is abandoned or particular hardware is no longer in production, records created with such technologies are at great risk because they are no longer accessible; this is known as digital obsolescence. In this scenario, the preservation of digital information requires more intense and timely attention than the preservation of other media [1].

4. Common Preservation Strategies

Regarding the long-term preservation of digital objects, the Online Computer Library Center (OCLC) has developed a set of strategies, which consists of:

• Assessing the risks for loss of content posed by technology variables such as commonly used proprietary file formats and software applications
• Evaluating the digital content objects to determine what type and degree of format conversion or other preservation actions should be applied
• Determining the appropriate metadata needed for each object type and how it is associated with the objects
• Providing access to the content [2].

Other strategies commonly used by individuals and organizations include refreshing, migration, replication and emulation. All these long-term preservation strategies are equally important; however, digital obsolescence is the main issue, simply because of the lack of established standards, protocols and methods for preserving digital information. Therefore, standardization of digital file formats is a basic requirement for the long-term preservation of digital objects.

5. Digital Preservation Standards

To standardize digital preservation practice and produce a set of recommendations for implementing preservation programmes, the Reference Model for an Open Archival Information System (OAIS) was developed. The reference model sets out the following responsibilities that an OAIS archive must abide by:

• Negotiate for and accept appropriate information from information producers
• Obtain sufficient control of the information provided to the level needed to ensure long-term preservation
• Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and therefore should be able to understand the information provided
• Ensure that the information to be preserved is independently understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information
• Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies and which enable the information to be disseminated as authenticated copies of the original or as traceable to the original
• Make the preserved information available to the Designated Community [3]

6. Digital Preservation Software

The Digital Preservation Software Platform (DPSP) is free and open source software developed by the National Archives of Australia. The DPSP is a set of software applications that support the process of digital preservation. It has four components:

• Xena: Xena stands for XML Electronic Normalising for Archives. Xena converts digital files to standards-based, open formats.
• Digital Preservation Recorder (DPR): DPR handles bulk preservation of digital files via an automated workflow.
• Checksum Checker: Checksum Checker is a piece of software that is used to monitor the contents of a digital archive for data loss or corruption.
• Manifest Maker: Manifest Maker produces a tab-separated list of digital files in a specified location. The manifest includes the checksum, path and filename of each digital file.

The digital preservation process is described in the following diagram:

[Figure: DPSP workflow, showing Transfer Files, Manifest Maker, Manifest, DPR, Xena, Xena file (.xena), Digital Archive and Checksum Checker]

1. Create manifest with Manifest Maker.
2. Process files and manifest in DPR. This transfers the files to the Digital Archive. During processing, DPR calls Xena to convert digital files to preservation formats.
3. Check integrity of files on the Digital Archive with Checksum Checker.

The new features of the latest version of Xena include:

• ability to normalise harvested websites;
• integration with Tesseract OCR and the ability to create raw text versions of file formats (such as Word, TIFF and PDF);
• support for audio files in OGG container format using Vorbis, FLAC or Speex codecs;
• improved MP3 guesser;
• support for more image formats (such as CUR, PCX and XPM);
• new character set detection library;
• automatic configuration of Xena output and log directories;
• ability to preserve directory structures;
• ability to handle files normalised with previous versions of Xena;
• major refactoring of the source code for the external libraries used by Xena and an update of the license to GPL version 3;
• creation of an automated installer for Microsoft Windows and Mac OS X versions of Xena.

6.1 Supported Formats

During the process of normalisation, Xena will convert the following file types to the specified open formats [4].

Archives and Compressed Files
GZIP: Files are extracted from the archive and normalised into separate Xena files.
JAR: Files are extracted from the archive and normalised into separate Xena files. A Xena index file is created which, when opened in a Xena Viewer, will display the files in a table.
MAC BINARY: Files are extracted from the archive and normalised into separate Xena files.
TAR: Files are extracted from the archive and normalised into separate Xena files. A Xena index file is created which, when opened in a Xena Viewer, will display the files in a table.
TAR.GZ: Works as a combination of ‘GZIP’ and ‘TAR’. All files are extracted from the archive and normalised into separate Xena files.
WAR: Files are extracted from the archive and normalised into separate Xena files. A Xena index file is created which, when opened in a Xena Viewer, will display the files in a table.
ZIP: Files are extracted from the archive and normalised into separate Xena files. A Xena index file is created which, when opened in a Xena Viewer, will display the files in a table.

Audio
AIFF: Audio Interchange File Format files are converted to FLAC.
FLAC: Free Lossless Audio Codec files are preserved and wrapped in XML.
MP3: MPEG-1 Audio Layer 3 files are converted to FLAC.
OGG: OGG container format files are converted to FLAC.
WAV: Waveform Audio files are converted to FLAC.

Databases
SQL: Structured Query Language files are preserved and wrapped in XML.

Documents
CSV/TSV: Comma- and Tab-Separated Values files are stored as a special case of plain text.
DOC/PPS/PPT/XLS: Microsoft Office documents are converted to the Open Document Format.
DOCX/PPTX/XLSX: Microsoft Office Open XML documents are converted to the Open Document Format.
HTML: Hypertext Markup Language files are converted to XHTML.
MPP: Microsoft Project documents are converted to XML.
ODS/ODP/ODT: Open Document files are preserved and wrapped in XML.
RTF: Rich Text Format is converted to Open Document Format.
SYLK: This spreadsheet format is converted to Open Document Format.
SXC/SXI/SXW: StarOffice formats are open, but are converted to the newer Open Document Format.
TXT: Text files are preserved and wrapped in XML.
WPD: WordPerfect files are converted to Open Document Format.
XHTML: Extensible Hypertext Markup Language files are preserved and wrapped in XML.
XML: Extensible Markup Language files are preserved and wrapped in XML.

Email
MBX/MBOX: Mailboxes are converted to individual XML files and a Xena index file is created which will display the files in a table when opened with Xena Viewer.
PST: Mailboxes from Microsoft Outlook are converted to individual XML files and a Xena index file is created which will display the files in a table when opened with Xena Viewer.
TRIM: Messages from TRIM are converted to XML and a Xena index file is created which will display the files in a table when opened with Xena Viewer.

Graphics
BMP: Bitmap image files are converted to PNG.
CUR: Windows cursor files are converted to PNG.
GIF: Graphics Interchange Format image files are converted to PNG.
JPEG: JPEG image files are preserved and wrapped in XML.
ODG: Open Office Document Drawings are preserved and wrapped in Xena XML.
PCX: Personal Computer eXchange image files are converted to PNG.
PDF: Portable Document Format files are preserved and wrapped in XML.
PNG: Portable Network Graphics are preserved and wrapped in XML.
PNM: Portable Anymap graphic bitmap files are converted to PNG.
PSD: Photoshop image files are converted to PNG.
RAS: Sun Raster graphics are converted to PNG.
SVG: Scalable Vector Graphics are preserved and wrapped in XML.
TIFF: Tagged Image File Format image files are converted to PNG. Embedded metadata is preserved in Xena XML.
XBM: X11 Bitmap Graphics are converted to PNG.
XPM: Unix Icon files are converted to PNG.

7. Conclusion

Digital obsolescence is exacerbated by the lack of established standards, protocols and methods for preserving digital information. Such problems can be minimized if open formats based on open standards are used for preserving digital information. Xena digital preservation software either converts files into an openly specified format or performs ASCII Base64 encoding on binary files and wraps the output with XML metadata headers and footers. The resulting .xena file is plain text, although the content of the data itself is not directly human-readable. The exact original file can be retrieved by stripping the metadata and reversing the Base64 encoding, using an internal viewer which includes an export function. In this way Xena software can be used, to some extent, for long-term digital preservation.

Acknowledgements

The author thanks Ms. Ningthoujam Somola Devi for her generous help and support.

References

1. McLeod, R., Wheatley, P. and Ayris, P. Lifecycle information for e-literature: full report from the LIFE project. http://eprints.ucl.ac.uk/ (accessed on 23/03/2010)
2. Online Computer Library Center, OCLC Digital Archive Preservation Policy and Supporting Documentation. http://www.oclc.org/ (accessed on 08/06/08/2011)
3. Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System (OAIS). http://public.ccsds.org/ (accessed on 14/04/08/2011)
4. National Archives of Australia, http://xena.sourceforge.net/ (accessed on 01/05/2011).

About Author

Dr. R K Joteen Singh, Information Scientist, Manipur University Library.
E-mail: [email protected]
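
Illustrative sketch: the Conclusion describes Base64 encoding a binary file, wrapping the result in XML metadata, and later recovering the exact original by stripping the metadata and reversing the encoding. The Python sketch below shows that round trip in miniature. It is not Xena's code or file format: the element names (package, meta, source-name, content) and the function names are hypothetical, chosen only for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(data: bytes, filename: str) -> str:
    """Wrap raw bytes as Base64 text inside a small XML envelope.

    The element names used here are hypothetical and are NOT the
    real Xena schema.
    """
    root = ET.Element("package")
    meta = ET.SubElement(root, "meta")
    ET.SubElement(meta, "source-name").text = filename
    content = ET.SubElement(root, "content", encoding="base64")
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Strip the metadata envelope and reverse the Base64 encoding."""
    root = ET.fromstring(xml_text)
    content = root.find("content")
    # An empty payload is serialised as an empty element, whose text is None.
    return base64.b64decode(content.text or "")


original = bytes(range(256))            # stand-in for any binary file
wrapped = wrap_binary(original, "sample.bin")
restored = unwrap_binary(wrapped)       # byte-for-byte copy of the input
```

The wrapped form is plain text that any XML-aware tool can open, which is the property the paper relies on, while the payload itself stays opaque until it is decoded.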

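
Illustrative sketch: Section 6 describes a manifest as a tab-separated list holding the checksum, path and filename of each file, and a checker that monitors the archive for loss or corruption. The Python sketch below shows one way such a manifest could be built and later re-verified. It is not the DPSP implementation: the paper does not name a checksum algorithm, so SHA-256 is an assumption, and the function names and column order are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path):
    """Return the SHA-256 hex digest of a file (the algorithm is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_manifest(directory):
    """Return tab-separated lines: checksum, folder path, filename."""
    lines = []
    for folder, _subdirs, names in sorted(os.walk(directory)):
        for name in sorted(names):
            full = os.path.join(folder, name)
            lines.append("\t".join([file_checksum(full), folder, name]))
    return lines


def find_changed(manifest_lines):
    """Return the files that are missing or no longer match their checksum."""
    changed = []
    for line in manifest_lines:
        checksum, folder, name = line.split("\t")
        full = os.path.join(folder, name)
        if not os.path.isfile(full) or file_checksum(full) != checksum:
            changed.append(full)
    return changed


with tempfile.TemporaryDirectory() as archive:
    target = os.path.join(archive, "record.txt")
    with open(target, "w", encoding="utf-8") as handle:
        handle.write("palm-leaf manuscript scan\n")
    manifest = make_manifest(archive)
    clean_result = find_changed(manifest)      # nothing has changed yet
    with open(target, "a", encoding="utf-8") as handle:
        handle.write("corruption")             # simulate damage to the file
    damaged_result = find_changed(manifest)    # the altered file is reported
```

A tab-separated layout keeps the manifest readable in any text editor, though this sketch would need an escaping rule for file names that themselves contain tab characters.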