Chapter 4

Support for Digital Formats

Library Technology Reports www.techsource.ala.org February/March 2008

Long-term renderability cannot be ensured without detailed knowledge about and documentation of digital file formats. In this respect, digital formats are at the heart of preservation activities.

Significant Properties

The term significant properties is used to refer to the properties of digital objects that must be preserved over time through preservation treatments such as migrations or emulations in order to ensure the continued usability and meaning of the objects. (Significant characteristics, essential characteristics, and essence are less commonly used synonyms.) The definition and determination of these properties constitute a critical and mostly unsolved issue in the field of digital preservation.

Significant properties are usually categorized as pertaining to content, context, appearance, structure, and behavior. If, for example, the digital object in question were a chapter of a book in PDF format, the content might be the text and pictures, the context would be the bibliographic description of the book and chapter, the appearance would be the layout of the pages, the structure would include any metadata relating the chapter to the book as a whole, and the behaviors could include internal and external hyperlinks. For this particular PDF, it might be decided that the content, context, and structure must be preserved, but that the appearance and behaviors could be sacrificed in the course of preservation treatment.

Significant properties may adhere to formats, genres, or individual objects, and in some cases may be in the eye of the beholder: the actionable links that you consider expendable may be critical to me. Currently, most research into significant properties is focused on formats. The InSPECT project of the U.K. Arts and Humanities Data Service is investigating the significant properties of raster images, structured text, digital audio, and e-mail messages, and new awards were recently granted to study e-learning objects, software, vector images, and moving images.1

Readings

• Andrew Wilson, “Significant Properties Report,” Oct. 2007, www.significantproperties.org.uk/documents/wp22_significant_properties.. A cogent review of work to date undertaken for the InSPECT project.
• Margaret Hedstrom and Christopher Lee, “Significant Properties of Digital Objects: Definitions, Applications, Implications,” in Proceedings of the DLM-Forum 2002, http://ec.europa.eu/transparency/archival_policy/dlm_forum/doc/dlm-proceed2002.pdf. Describes preliminary research taking a rather broad view of significant properties, although follow-up appears to be unavailable.

Representation Information and Registries

A good understanding of digital formats is essential for the execution of nearly all preservation strategies. Unfortunately, the concept of format is anything but straightforward. Informally we tend to think of formats as generic file types such as PDF or QuickTime, denoted by MIME type or file extension. These distinctions are not particularly useful for preservation purposes, which require more specific and granular information about version, compression, profile, and bitstream encoding. Is it PDF 1.5 or 1.6? Is the codec used in this QuickTime file Apple Video, Cinepak, or MPEG-1? Detailed representation information (as defined by OAIS), preferably linked to the official format specification, is critical for carrying out preservation strategies such as migration and emulation.

A related type of information is sometimes called “environment” information, and concerns the software capable of creating or rendering a format and the hardware capable of supporting that software. For example, a PDF 1.4 file can be created by Adobe Acrobat 5.0 and rendered by Acrobat Reader 5.0 (as well as dozens of other applications). Acrobat 5.0 can run under Windows 98, Windows 2000, Windows Me, and Windows 95. Windows 98 in turn requires a 486 DX2, 66 MHz or higher processor and at least 16 MB of RAM.

By this point it should be obvious that it takes a huge amount of research and expertise to determine both representation and environment information, and that everyone would benefit from having this information accessible from centrally maintained registries. Currently there are at least three registries in some stage of production or development, all with overlapping scopes and different but overlapping content:

• The PRONOM registry developed and maintained by The National Archives (U.K.) has elements of representation and environment information, and is expanding to include preservation planning information.
• The Format Descriptions database maintained by the Library of Congress has detailed representation information as well as an analysis of sustainability factors for many formats.
• The Representation Information Registry Repository under development by the Digital Curation Centre aims to have extensive representation information closely tied to the OAIS model.

PRONOM Registry
www.nationalarchives.gov.uk/pronom

Format Descriptions Database
www.digitalpreservation.gov/formats/fdd/descriptions.shtml

Representation Information Registry Repository
http://registry.dcc.ac.uk/omar

Global Digital Format Registry
http://hul.harvard.edu/gdfr/

The Global Digital Format Registry (GDFR) initiative aims to unify these and other registries by defining a common data model and a common network protocol for distributed registries to communicate with each other and synchronize their format representation information. Registries such as PRONOM would then become nodes in the global network. The GDFR was initiated by Harvard University and received a major grant supporting development work from the Andrew W. Mellon Foundation in 2005; the technical work will be done by OCLC.2

Another use for representation and environment information is to help preservation managers assess the risk of format obsolescence for files in their repositories. This is another time- and knowledge-intensive task that would benefit from centralized support. One approach is the Automatic Obsolescence Notification System (AONS) being developed by the National Library of Australia and its partners.3 AONS is designed to extract data from existing registries and the future GDFR and provide the data as decision-support information to repository managers. A beta pilot service was launched in 2007. Other studies concerned with methodologies for assessing the risk of format obsolescence are noted below.

Readings

• “Risk Management of Digital Information: A File Format Investigation,” Council on Library and Information Resources, 2000, www.clir.org/PUBS/reports/pub93/pub93.pdf.
• Andreas Stanescu, “Assessing the Durability of Formats in a Digital Preservation Environment: The INFORM Methodology,” D-Lib Magazine 10, no. 11, Nov. 2004, www.dlib.org/dlib/november04/stanescu/11stanescu..
• Cornell University Library, Virtual Remote Control Web site, http://prism.library.cornell.edu/VRC.

Tools

Several Java-based open-source tools have been released in recent years to aid in format identification, validation, and characterization.

DROID

DROID (Digital Record Object Identification) is a software tool developed by The National Archives (U.K.) to identify file formats based on their binary signatures. It can be run on single files or batches of files, and has both a graphical interface and a command line interface. It uses signature files (information about the internal and external characteristics of a format) stored in the PRONOM registry and returns PRONOM format identifiers. DROID is available as an open-source application under a BSD license.

DROID Web site
http://droid.sourceforge.net/wiki/index.php/Introduction

Reading

• Adrian White, “Automatic Format Identification Using PRONOM and DROID,” 2006, http://droid.sourceforge.net/wiki/images/b/b4/Technical_Paper_1_-_Automatic_Format_Identification_v2.pdf.

JHOVE

JHOVE (JSTOR/Harvard Object Validation Environment) is a software tool that identifies, validates, and characterizes digital files. Like DROID, it has both graphical and command line interfaces. Validation checks that the file conforms to the appropriate file format specifications. Characterization returns technical metadata in an XML format including file name, modification date, byte size, format, format version, MIME type, format profiles, and checksums, as well as more detailed technical metadata for image and audio formats. The original release of JHOVE handles PDF and several open image, audio, and text-based formats. An enhanced and re-architected version called JHOVE2 is under development by Harvard University, Portico, and Stanford University. JHOVE2 will use DROID for format identification and will be designed to be more easily incorporated into other applications. JHOVE is available as an open-source application under the GNU Lesser General Public License.

JHOVE Web Site
http://hul.harvard.edu/jhove

Reading

• Martin Donnelly, “JSTOR/Harvard Object Validation Environment (JHOVE),” Digital Curation Centre Case Studies and Interviews, March 2006, www.dcc.ac.uk/resource/case-studies/jhove/case_study_jhove.pdf.

Metadata Extraction Tool

The Metadata Extraction Tool was developed by the National Library of New Zealand. Like JHOVE, it identifies file formats and extracts technical metadata relevant to preservation, which it outputs in an XML format. Major differences from JHOVE are that this tool does not also perform validation, and this tool includes routines for a number of proprietary formats that JHOVE does not handle. Version 3 was released as an open-source application under the Apache Public License (version 2).

Metadata Extraction Tool Web Site
http://meta-extractor.sourceforge.net

Xena

Xena (XML Electronic Normalising for Archives) was developed by the National Archives of Australia. Xena will identify the format of a source file and create a normalized version in a more open format. For example, audio files in WAVE, MP3, or AIFF will be normalized to FLAC (Free Lossless Audio Codec); office documents are converted to Open Office format; while GIF images are converted to PNG. Xena can be invoked manually or via an API (Application Program Interface). Xena version 4.0 was released in October 2007. It is available under a GNU General Public License version 2 and requires Sun’s Java Runtime Environment and OpenOffice.org 2.x in order to run.

Xena Web Site
http://xena.sourceforge.net

Notes

1. InSPECT Web site, www.significantproperties.org.uk (accessed Nov. 17, 2007).
2. Global Digital Format Registry Web site, http://hul.harvard.edu/gdfr (accessed Nov. 17, 2007).
3. Australian Partnership for Sustainable Repositories, AONS II, Automated Obsolescence Notification System, www.apsr.edu.au/aons2 (accessed Nov. 17, 2007).
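Illustrative sketch

The identification and characterization steps described for DROID and JHOVE can be illustrated with a short Python sketch. It matches a file's leading bytes against a small table of well-known signatures, reads the version from a PDF header (the "Is it PDF 1.5 or 1.6?" question raised earlier), and collects a few of the technical metadata fields named in the JHOVE description: file name, byte size, modification date, format, format version, and a checksum. This is a minimal sketch under stated assumptions and not how either tool is implemented. The signature table, the function names identify, pdf_version, and characterize, and the dictionary output are all hypothetical, and the signature files held in PRONOM describe far more than a fixed leading-byte match.

```python
import hashlib
import os
from datetime import datetime, timezone
from typing import Optional

# Hypothetical signature table: leading "magic" bytes for a few formats.
# The byte prefixes are well known, but this table is illustrative only
# and is not taken from PRONOM signature files.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"fLaC", "FLAC"),
]


def identify(data: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    return "unknown"


def pdf_version(data: bytes) -> Optional[str]:
    """Read the version from a PDF header such as b'%PDF-1.4'."""
    if not data.startswith(b"%PDF-"):
        return None
    # The three bytes after the b'%PDF-' prefix hold the version, e.g. '1.4'.
    return data[5:8].decode("ascii", errors="replace")


def characterize(path: str) -> dict:
    """Collect basic technical metadata for one file (assumed field names)."""
    with open(path, "rb") as handle:
        data = handle.read()
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "byte_size": info.st_size,
        "modified": modified.isoformat(),
        "format": identify(data),
        "format_version": pdf_version(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


if __name__ == "__main__":
    import sys

    # Usage: python characterize.py file1 [file2 ...]
    for name in sys.argv[1:]:
        print(characterize(name))
```

A file extension or MIME type never enters the decision here, which mirrors the point made under Representation Information and Registries: the bytes themselves, not the informal file type, carry the version-level detail that preservation work depends on.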