Support for Digital Formats
Chapter 4
Support for Digital Formats

Library Technology Reports, www.techsource.ala.org, February/March 2008

Long-term renderability cannot be ensured without detailed knowledge about and documentation of digital file formats. In this respect, digital formats are at the heart of digital preservation activities.

Significant Properties

The term significant properties is used to refer to the properties of digital objects that must be preserved over time through preservation treatments such as migrations or emulations in order to ensure the continued usability and meaning of the objects. (Significant characteristics, essential characteristics, and essence are less commonly used synonyms.) The definition and determination of these properties constitute a critical and mostly unsolved issue in the field of digital preservation.

Significant properties are usually categorized as pertaining to content, context, appearance, structure, and behavior. If, for example, the digital object in question were a chapter of a book in PDF format, the content might be the text and pictures, the context would be the bibliographic description of the book and chapter, the appearance would be the layout of the pages, the structure would include any metadata relating the chapter to the book as a whole, and the behaviors could include internal and external hyperlinks. For this particular PDF, it might be decided that the content, context, and structure must be preserved, but that the appearance and behaviors could be sacrificed in the course of preservation treatment.

Significant properties may adhere to formats, genres, or individual objects, and in some cases may be in the eye of the beholder: the actionable links that you consider expendable may be critical to me. Currently, most research into significant properties is focused on formats. The InSPECT project of the U.K. Arts and Humanities Data Service is investigating the significant properties of raster images, structured text, digital audio, and e-mail messages, and new awards were recently granted to study e-learning objects, software, vector images, and moving images.¹

Readings
• Andrew Wilson, “Significant Properties Report,” Oct. 2007, www.significantproperties.org.uk/documents/wp22_significant_properties.pdf. A cogent review of work to date undertaken for the InSPECT project.
• Margaret Hedstrom and Christopher Lee, “Significant Properties of Digital Objects: Definitions, Applications, Implications,” in Proceedings of the DLM-Forum 2002, http://ec.europa.eu/transparency/archival_policy/dlm_forum/doc/dlm-proceed2002.pdf. Describes preliminary research taking a rather broad view of significant properties, although follow-up appears to be unavailable.

Representation Information and Registries

A good understanding of digital formats is essential for the execution of nearly all preservation strategies. Unfortunately, the concept of format is anything but straightforward. Informally we tend to think of formats as generic file types such as PDF or QuickTime, denoted by MIME type or file extension. These distinctions are not particularly useful for preservation purposes, which require more specific and granular information about version, compression, profile, and bitstream encoding. Is it PDF 1.5 or 1.6? Is the codec used in this QuickTime file Apple Video, Cinepak or MPEG-1? Detailed representation information (as defined by OAIS), preferably linked to the official format specification, is critical for carrying out preservation strategies such as migration and emulation.

A related type of information is sometimes called “environment” information, and concerns the software capable of creating or rendering a format and the hardware capable of supporting that software. For example, a PDF 1.4 file can be created by Adobe Acrobat 5.0 and rendered by Acrobat Reader 5.0 (as well as dozens of other applications). Acrobat 5.0 can run under Windows 98, Windows 2000, Windows Me, and Windows 95. Windows 98 in turn requires a 486 DX2, 66 MHz or higher processor and at least 16 MB of RAM.

By this point it should be obvious that it takes a huge amount of research and expertise to determine both representation and environment information, and that everyone would benefit from having this information accessible from centrally maintained registries. Currently there are at least three registries in some stage of production or development, all with overlapping scopes and different but overlapping content:

• The PRONOM registry developed and maintained by The National Archives (U.K.) has elements of representation and environment information, and is expanding to include preservation planning information.
• The Format Descriptions database maintained by the Library of Congress has detailed representation information as well as an analysis of sustainability factors for many formats.
• The Representation Information Registry Repository under development by the Digital Curation Centre aims to have extensive representation information closely tied to the OAIS model.

The Global Digital Format Registry (GDFR) initiative aims to unify these and other registries by defining a common data model and a common network protocol for distributed registries to communicate with each other and synchronize their format representation information. Registries such as PRONOM would then become nodes in the global network. The GDFR was initiated by Harvard University and received a major grant supporting development work from the Andrew W. Mellon Foundation in 2005; the technical work will be done by OCLC.²

PRONOM Registry
www.nationalarchives.gov.uk/pronom

Format Descriptions Database
www.digitalpreservation.gov/formats/fdd/descriptions.shtml

Representation Information Registry Repository
http://registry.dcc.ac.uk/omar

Global Digital Format Registry
http://hul.harvard.edu/gdfr/

Another use for representation and environment information is to help preservation managers assess the risk of format obsolescence for files in their repositories. This is another time- and knowledge-intensive task that would benefit from centralized support. One approach is the Automatic Obsolescence Notification System (AONS) being developed by the National Library of Australia and its partners.³ AONS is designed to extract data from existing registries and the future GDFR and provide the data as decision-support information to repository managers. A beta pilot service was launched in 2007. Other studies concerned with methodologies for assessing the risk of format obsolescence are noted below.

Readings
• “Risk Management of Digital Information: A File Format Investigation,” Council on Library and Information Resources, 2000, www.clir.org/PUBS/reports/pub93/pub93.pdf.
• Andreas Stanescu, “Assessing the Durability of Formats in a Digital Preservation Environment: The INFORM Methodology,” D-Lib Magazine 10, no. 11, Nov. 2004, www.dlib.org/dlib/november04/stanescu/11stanescu.html.
• Cornell University Library, Virtual Remote Control Web site, http://prism.library.cornell.edu/VRC.

Tools

Several Java-based open-source tools have been released in recent years to aid in format identification, validation, and characterization.

DROID

DROID (Digital Record Object Identification) is a software tool developed by The National Archives (U.K.) to identify file formats based on their binary signatures. It can be run on single files or batches of files, and has both a graphical interface and a command line interface. It uses signature files (information about the internal and external characteristics of a format) stored in the PRONOM registry and returns PRONOM format identifiers. DROID is available as an open-source application under a BSD license.

DROID Web site
http://droid.sourceforge.net/wiki/index.php/Introduction

Reading
• Adrian Brown, “Automatic Format Identification Using PRONOM and DROID,” 2006, http://droid.sourceforge.net/wiki/images/b/b4/Technical_Paper_1_-_Automatic_Format_Identification_v2.pdf.

JHOVE

JHOVE (JSTOR/Harvard Object Validation Environment) is a software tool that identifies, validates, and characterizes digital files. Like DROID, it has both graphical and command line interfaces. Validation checks that the file conforms to the appropriate file format specifications. Characterization returns technical metadata in an XML format including file name, modification date, byte size, format, format version, MIME type, format profiles, and checksums, as well as more detailed technical

Metadata Extraction Tool

The Metadata Extraction Tool was developed by the National Library of New Zealand. Like JHOVE, it identifies file formats and extracts technical metadata relevant to preservation, which it outputs in an XML format. Major differences from JHOVE are that this tool does not also perform validation, and this tool includes routines for a number of proprietary Microsoft Office formats that JHOVE does not handle. Version 3 was released as an open-source application under the Apache Public License (version 2).

Metadata Extraction Tool Web Site
http://meta-extractor.sourceforge.net

Xena

Xena (XML Electronic Normalising for Archives) was developed by the National Archives of Australia. Xena will identify the format of a source file and create a normalized version in a more open format. For example, audio files in AIFF, MP3, or WAVE will be normalized to FLAC