Agenda ......

Digital Preservation ƒ File formats ƒ Basics and issues ƒ Exercise/experiments File formats and characterisation ƒ Characterisation ƒ Identification and validation (DROID, JHove) ƒ registries ƒ Risk assessment of file formats Christoph Becker, Hannes Kulovits April 17, 2008 ƒ eXtensible Characterisation Languages (XCL)

Vienna University of Technology www.ifs.tuwien.ac.at/dp

......

Agenda ......

ƒ Definition of File/File Format Part 1: ƒ Representation File Formats

ƒ Elements of a file format Hannes Kulovits Institut für Softwaretechnik und Interaktive Systeme TU Wien ƒ File and Preservation

http://www.ifs.tuwien.ac.at/dp ƒ Challenges

Part of this presentation is based on slides by Prof. Manfred Thaller, DELOS Summer School 2007, Pisa ......

What is a file/file format? Plain Text ......

ƒ A file is nothing more than a sequence of bits ƒ De facto standard for Plain Text is ASCII – Uses 8 bits ƒ How to encode those bits is specified in a file format – Maximum of 256 different characters possible – Includes ƒ File format is a specification of how to interpret a bit • Letters of most alphabets (()lower and upper case) stream. • Arabic numerals • Punctuation marks • Standard symbols ƒ File format specifies 1. Whether the file is binary or ASCII ƒ Another important format is Unicode 2. How information is organized – Provides unique encoding for each character 3. ... – Uses multiple bytes to represent each character

......

1 Proprietary vs. Open File formats based on plain text ......

ƒ Proprietary ƒ For example: XHTML 1.1 – Documentation mostly not available – License and patent rules – License agreements subject to change ƒ In HTML plain text must obey certain rules – Restrictions for use and modifications may apply – se of tags – type sizes ƒ Open – color – Documentation available! – Unlimited use – No license fee – Open for modifications – No patent owners

......

Different types of File Formats Different types of File Formats (2) ......

ƒ Different kinds of formats for different kinds of ƒ Three-character file extension of DOS and Windows. information (Neither standardised nor unique.) [Rothenberg, 1995, Ensuring the Longevity of Digital Documents] ƒ Official categorisation of file formats is the IANA MIME ƒ Unix ‚magic numbers‘ type – Text documents – Databases ƒ Macintosh data-forks – Still and moving images – Audio ƒ MIME type, also not unique – Multipart – ... ƒ None of them is really satisfying – Better solution: PRONOM with Pronom Unique Identifier

......

An image An image ......

6 rows 5 columns

......

2 An image An image ......

11111 5 rows 1 == blue 10001 6 columns 0 == red 1 1 0 1 1 11011 11011 11111

......

An image An image ......

11111 Store: 11111 1,1,1,1,1, 1 == green 10001 1,0,0,0,1, 10001 0 == yellow 1 1 0 1 1 1,1,0,1,1, 1 1 0 1 1 1,1,0,1,1, 11011 1,1,0,1,1, 11011 11011 1,1,1,1,1 11011 11111 11111

......

An image An image ......

Store: 11111 Store: 11111 6,1,3,0,3, 6,1,3,0,3, 1,1,0,4,1,1, 10001 1,1,0,4,1,1, 10001 0,4,1,1,0, 11011 0,4,1,1,0, 11011 7,1 7,1 11011 11011 11011 11011 11111 11111

......

3 An image An image ......

Store: 11111 Store: 11111 6,1,3,0,3, 1,1,1,1,1, 1,1,0,4,1,1, 10001 1,0,0,0,1, 10001 0,4,1,1,0, 11011 1,1,0,1,1, 1 1 0 1 1 7,1 1,1,0,1,1, 11011 1,1,0,1,1, 11011 11011 1,1,1,1,1 11011 11111 Uncompressed 11111

......

An image An image ......

Store: 11111 Store: 1,1 2,1 3,1 4,1 5,1 6,1,3,0,3, SetSize: 5 by 6 SetBackgroundColor: Blue 1,2 2,2 3,2 4,2 5,2 1,1,0,4,1, 10001 SetForegroundColor: Red SetLetterHeight: 4 1,0,4,1,1, 1 1 0 1 1 MoveTo: 3,5 1,3 2,3 3,3 4,3 5,3 0,7,1 DrawLetter: T 11011 1,4 2,4 3,4 4,4 5,4 (Compressed) 11011 1,5 2,5 3,5 4,5 5,5 Run Length Encoded 11111 1,6 2,6 3,6 4,6 5,6

......

An image An image ......

6 rows dimensions 5 columns 1 == blue 1 == blue 0 == red 0 == red Uncompressed Uncompressed

......

4 An image An image ......

information> (implicit / explicit) information> (implicit / explicit) information> (implicit / explicit)

… and the data?

......

An image An image ......

(implicit / explicit) 1,1,1,1,1,1, 10001 1111011,1,1,1,0,1, 1 1 0 1 1 (implicit / explicit) 1,1,1,1,1,1 (implicit / explicit) 11011

… and the data? 11111

......

An image File Format ......

Data either as data stream 11111 ƒ Basic Information or as – What to do? processing instructions 10001 ƒ Rendering Information SetSize: 5 by 6 SetBackgroundColor: Blue 1 1 0 1 1 – How to do It? SetForegroundColor: Red SetLetterHeight: 4 11011 ƒ Storage Information MoveTo: 3,5 – How to move it from persistent form to deployed form? DrawLetter: T 11011 ƒ Data 11111 – What to deploy?

......

5 File Format (2) File Format - Definition ......

ƒ Basic Information ƒ A clearer definition of the term file form format: – Mandatory ‚[...] the internal structure and encoding of a digital ƒ Rendering Information object, which allows it to be processed, or to be – Useful rendered in human accessible form. A digital object may be a file, or a bit stream embedded within a file‘ ƒ Storage Information – Historical Brown, A. (2006). Digital Preservation Technical Paper 2.

ƒ Data – Mandatory

......

File as a composite object File format: TIFF ......

ƒ Image File Header ƒ Rather popular file formats at them moment are for instance HTML, XML and PNG ƒ Image File Directory – Information about image ƒ But all of them can be stored in the same file format! – Pointers to actual data DOC

ƒ IFD Entry . – TIFF Tag .doc – Value .png

ƒ Custom tags possible

......

File format: PDF File format: PDF ......

1 0 obj << /Type /Page 2 0 obj /Parent 281 0 R << /Resources 2 0 R /ProcSet [ /PDF /Text ] /Contents 3 0 R /Font << /TT2 292 0 R /TT4 288 0 R >> /StructParents 2 /ExtGState << /GS1 300 0 R >> /MediaBox [ 0 0 612 792 ] /ColorSpace << /Cs6 289 0 R >> /CropBox [ 0 0 612 792 ] >> /Rotate 0 endobj >> endobj

...... 36

6 File format: PDF File format: XML (SVG) ......

3 0 obj << /Length 4605 /Filter /FlateDecode >> stream H‰„WÛŽÛÈ}×Wô#Œ4jR”¨`±Àø ™Í" ¶(²5j›"¹lräý‘|oêÖ-j ‹±/[i²½Ö)}ÔÏö&ªÙH; ŠøL"È÷Û’Æ ¬JYØÂm]j¥Ýqõ¥ÏººÕ™·²ôÒ·Ûº¤–÷.u-kP0 4“øTxM<é識9uôøˆòLi¦Øo TÖ m–;ǯ ÷¤ÿlÕºvéU—Ë Leiste ±¤Lm°gŸˆu1Åëu5l3¯’¢O %òËTîü7?ìNdh endstream endobj

...... 37

File format: XML (SVG) Files and Preservation ......

1. Bit rot.

2. Obscolescence of software.

...... 39

Bit rot Bit rot ......

An Image file ... and after Undetectable before …. one byte is by software. changed.

......

7 Bit rot Bit rot ......

002 004 Processing dictionary 002 004 One byte is damaged, one byte cannot be displayed correctly. 234 123 234 123

234 156 Payload 234 156

127 178 127 xxx

221 221 221 221

......

Bit rot Challenges w.r.t. File Formats ......

ƒ Obsolescence 002 xxx – Software able to read does not exist anymore One byte is damaged, ten bytes – Format specification lost cannot be displayed correctly. 234 123 – Implied algorithm lost – Required object lost 234 156 ƒ Format is proprietary ƒ Format depends on obsolete hardware 127 178

221 221

......

Recommended formats? Recommended formats: text ......

ƒ XML High confidence Medium confidence Low confidence ™ Plain text (encoding: ™ Cascading Style Sheets ™PDF (*.pdf) (encrypted) ISO8859-1 - 9, UTF-8, (*.css) ™ Microsoft Word (*.doc) ƒ TXT UTF-16 with BOM) ™ DTD (*.dtd) ™ WordPerfect (*.wpd) ™ XML (includes ™ PDF (*.pdf) (embedded ™ DVI (*.dvi) XSD/XSL/XHTML, etc.; fonts) ™ All other text formats not ƒ PDF with included or accessible ™ Rich Text Format 1.x listed here schema and character (*.rtf) encoding explicitly ™ HTML 4.x (include a specified) DOCTYPE declaration) ™ PDF/A-1 (ISO 19005-1) ™ SGML (*.sgml) ™ Open Office (*.sxw/*.odt) ƒ ? ™ Office Open XML (*.docx)

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

......

8 Recommended formats: / raster Recommended formats: vector ...... image ......

High confidence Medium confidence Low confidence High confidence Medium confidence Low confidence ™TIFF (uncompressed) ™ BMP (*.bmp) ™MrSID (*.sid) ™SVG 1.1 (no Java ™Computer Graphic ™Encapsulated ™ PNG (*.png) ™ JPEG/JFIF (*.jpg) ™TIFF (in Planar binding) (*.svg) Metafile (CGM, Postscript (EPS) ™JPEG2000 (prefer format) WebCGM) (*.cgm) ™Macromedia Flash lossless or ™FlashPix (*.fpx) (*.) uncompressed)() (*.jp2) ™PhotoShop (*.psd) ™All other vector image ™TIFF (compressed) ™All other raster image formats not listed here ™GIF (*.) formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

......

Recommended formats: audio Recommended formats: video ......

High confidence Medium confidence Low confidence High confidence Medium confidence Low confidence ™AIFF (PCM) (*.aif, ™SUN Audio ™AIFC (compressed) ™Motion JPEG 2000 ™ (*.ogg) ™AVI (compressed) *.aiff) (uncompressed) (*.au) (*.aifc) (ISO/IEC 15444-4)( ™MPEG-1, MPEG-2 (*.avi) ™ WAV (PCM) (*.) ™Standard MIDI (*.mid, ™ NeXT SND (*.snd) *.mj2) (*.mpg, *.mpeg) ™QuickTime Movie *.midi) ™ RealNetworks 'Real ™ AVI (uncompressed) ™MPEG-4(*.mp4) (compressed) (*.mov) ™Ogg (*.ogg) Audio‚ (*.ra, *.rm, *.ram) (*.avi) ™RealNetworks 'Real ™Free Lossless Audio ™ Windows Media ™QuickTime Movie Video‚ (*.rv) Codec (*.) Audio (uncompressed)(*.mov) ™ ™ Advance Audio ™(*.wma) ™Motion JPEG (*.avi, (*.wmv) Coding (*.mp4, *.m4a, ™WAV (compressed) *.mov) ™All other video formats *.aac) (*.wav) not listed here ™ MP3 (MPEG-1/2, ™All other audio formats Layer 3)(*.) not listed here http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

......

Recommended formats: “data base” Recommended formats: 3D ......

High confidence Medium confidence Low confidence High confidence Medium confidence Low confidence ™Delimited Text (*.txt, ™DBF (*.dbf) ™Excel (*.xls) ™ (*.x3d) ™VRML (*.wrl, *.) ™All other *.csv) ™OpenOffice ™All other spreadsheet/ ™U3D ( file ™formats not listed here ™SQL DDL *.sxc/*.ods) database formats not format) ™Office Open XML listed here *.xlsx)

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

......

9 ......

Digital preservation and file formats: Some robustness experiments Thank you very much for your attention!

Questions?

Christoph Becker, Hannes Kulovits April 17, 2008

Vienna University of Technology www.ifs.tuwien.ac.at/dp

......

Exercise Exercise ......

ƒ www.ifs.tuwien.ac.at/~becker/shotgun.zip or USB ƒ Shoot and try to open the ‚dirty‘ file ƒ (Many thanks to Cologne University!) ƒ Try different intensities, flip vs. remove ƒ Shoot files ƒ Compare results with the FCLA recommendations ƒ Clean vs. Dirty ƒ Flip bits vs. Remove bits ƒ Thoughts? ƒ Size vs. Frequency of shots ƒ File types ƒ Target time: 30-45‘ max ƒ Images ƒ Documents ƒ Audio ƒ www.ifs.tuwien.ac.at/~becker/shotgun.zip or USB

......

Bit rot Bit rot ......

002 004 One byte is damaged, one byte 002 xxx One byte is damaged, ten bytes cannot be displayed correctly. cannot be displayed correctly. 234 123 234 123

234 156 234 156

127 xxx 127 178

221 221 221 221

......

10 Agenda ......

Part 2: ƒ File formats and issues ƒ File format identification ƒ Registries File formats and registries ƒ Characterisation tools ƒ DROID Characterisation ƒ JHove ƒ XCL Christoph Becker Vienna University of Technology www.ifs.tuwien.ac.at/~becker [email protected] www.ifs.tuwien.ac.at/dp

......

Requirements for DP Some file format requirements ...... ƒ Specifications available ƒ Digital preservation has to guarantee – Is an XML specification enough? – Integrity – Syntacs and semantics needed – Understandability ƒ Standardized (ISO, ANSI, ...) – Originality ƒ Accepted and widely used – Authenticity ƒ Not covered by patent – Accessibility ƒ Free of compression ƒ Free of any cryptographical techniques

ƒ Flexible and extensible? ƒ „Interoperability through time“

......

What file is this? What kind of file is this? ......

ƒ What’s wrong with file extensions? ƒ Not necessarily unique (e.g. wks) 1. „Clever software“ ƒ Granularity not sufficient inspects files to decide how to process them ƒ Can be altered by users

2. Format registries ƒ Formats vs. Format profiles ƒ PDF is not one format ƒ DOC is not one format ƒ TIFF is not one format

......

11 What’s Wrong with MIME Types? Why Do We Need a Registry? ......

ƒ Insufficient depth of detail – No requirements regarding syntax and semantic description ƒ Repository functions are performed on a format- – No requirement for complete disclosure, specific basis especially of proprietary formats ƒ Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is ƒ Insufficient granularity represented – Bo th til e d RGB G eoTIFF with LZW and st ri ped bi -tlTIFFtonal TIFF- FX with Group 4 are typed as “image/” ƒ Interchange requires mutual agreement of format – All of PDF 1.0 – 1.4, PDF/X-1, X-2, X-3, and PDF/A are typed syntax and semantics as “application/pdf” – These variants might require radically different workflows

Global Digital Format DSpace .User ...... Group, ...... Global Digital Format DSpace. . . User...... Group, ...... Registry March 2004 Registry March 2004

Use Cases File format registries ...... ƒ Identification – “I have a digital object; what format is it?” ƒ PRONOM: ƒ Validation http://www.nationalarchives.gov.uk/pronom/ – “I have an object purportedly of format F; is it?” ƒ Transformation ƒ Global Digital Format Registry – “I have an object of format F, but need G; how can I produce it?” http://hul.harvard.edu/gdfr ƒ Characterization – “I have an object of format F; what are its significant properties?” ƒ FileExt ƒ Risk assessment http://filext.com – “I have an object of format F; is it at risk of obsolescence?” ƒ Delivery – “I have an object of format F; how can I render it?”

Global Digital Format DSpace. . . User...... Group, ...... Registry March 2004

PRONOM Tools ......

ƒ DROID (Digital Record Object Identification) ƒ relies on PRONOM ƒ The National Archives, UK

ƒ JHOVE ƒ JSTOR/Harvard Object Validation Environment ƒ Validation and characterisation

ƒ eXtensible Characterisation Languages (XCL) ƒ Two XML meta-languages ƒ Goal: express complete informational content of an object in an abstract model

......

12 Signatures in DROID What kind of file is this? ...... ƒ External signatures ƒ File extensions (a) By external characteristics (file extensions) (b) By internal characteristics („magic number“, ƒ Internal signatures „signature“). ƒ Format indicators in the bitstream ƒ Byte sequences A TIFF file begins with … 1. Bytes 0-1: The byte order used within the file. Legal values are: “II” (4949.H) / “MM” (4D4D.H) 2. Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file.

......

Registry content ......

ƒ Descriptive information ƒ Identifiers ƒ MIME ƒ Pronom Unique Identifier (PUID) Demo: DROID, PRONOM ƒ Relationships to formats ƒ Technical environment ƒ References and links…

ƒ Risk factors

......

File format characteristics ......

Questions?

......

13 Use Case Coverage JHove ...... ƒ Identification ƒ JSTOR/Harvard Object Validation Environment ƒ Risk assessment ƒ Modular and extensible Java-based architecture ƒ Delivery – Image modules: GIF, JPEG, JPEG2000, TIFF – Document modules: ASCII,HTML,PDF, UTF-8, XML – “I have an object of format F; how can I render it?” – ... ƒ Transformation – “I h ave an o bjec t o f f ormat F, btbut need G; hIdit?”how can I produce it?” ƒ Three functions – Identification ƒ Validation – Validation – “I have an object purportedly of format F; is it?” – Characterisation ƒ Characterization – “I have an object of format F; what are its significant properties?”

......

The TIFF module… Validation ...... −Tagged Image File Format (TIFF) raster images TIFF 4.0, 5.0, and 6.0 [TIFF 4.0, TIFF 5.0, TIFF 6.0] ƒ A digital object is well-formed if it meets the purely −Baseline 6.0 Class B, G, P, and R [TIFF 6.0] syntactic requirements for its format. −Extension Class Y [TIFF 6.0] ƒ An object is valid if it is well-formed and it meets −TIFF/IT (ISO 12639:2003) [TIFF/IT] File types CT, LW, HC, MP, BP, BL, and additional semantic-level requirements. FP, and conformance levels P1 and P2 −TIFF/EP (ISO 12234 -2:2001) [ TIFF/EP] −Exif 2.0, 2.1 (JEIDA-49-1998), and 2.2 (JEITA CP-3451) [Exif 2.1, Exif 2.2] ƒ Validation use cases: −GeoTIFF 1.0 [GeoTIFF] – "I have an object that purports to of format F; is it?" −TIFF-FX (RFC 2301) [TIFF-FX] – "I have an object of format F; does it meet profile P of F?" −Profiles C, F, J, L, M, and S – "I have an object of format F and external metadata about F in schema S; are they consistent?" −Class F (RFC 2306) [Class F, RFC 2306] −RFC 1314 [RFC 1314] −DNG (Adobe Digital Negative) [DNG]

......

......

Questions? JHove Demo

......

14 Use Case Coverage Core requirement: Keep object intact ......

ƒ Identification ‰ Essential object characteristics ƒ Risk assessment ‰ Content ‰ Appearance ƒ Delivery ‰ Structure ƒ Transformation ‰ Behaviour ‰ Context ƒ Validation

ƒ Characterization – “I have an object of format F; what are its significant properties?”

......

Validating a migrated image Validating a migrated image ......

‰ Yes, it’s in JPEG 2000 format

‰ Yes, it’s wellformed ‰ Yes, it’s valid ‰ Yes, it still has the same dimensions ‰ …. But is it still the same image?

......

Validating a migrated image The XCL languages ......

‰ Yes, it’s in JPEG 2000 format

‰ The eXtensible characterisation description language ‰ Yes, it’s wellformed XCDL ‰ Yes, it’s valid ‰ describes properties of digital objects ‰ Yes, it still has the same dimensions ‰ …. But is it still the same image? ‰ The eXtensible characterisation extraction language XCEL ‰ extracts properties from files ‰ Creates a mapping from a file format to XCDL

‰ We need more characterisation.

......

15 Essential properties as described by file formats Properties described by the format ......

Image width: 277

Image length: 339

Compression: uncompressed

......

XCDL file.png Extracting properties: XCEL ......

‰ One generic XCEL processor instead of specific extractor for every file format ‰ Uniform description of XCDL file.tif properties and values ‰ Preprocessing instructions ‰ Configuration tasks ‰ Uniform structure file.gif ‰ Format description − Properties of different objects ‰ Defines the structure of an object are described using a single ‰ Templates vocabulary and grammar TIFF: imageLength PNG: imageHeight ‰ Describe recurring structures ? : ? ‰ eXtensible ‰ Postprocessing instructions ‰ On the results of processing

XCDL: imageHeight

......

XCDL example: ‘An important word’ Comparing migrated documents ......

An important word

Fontname Times-Bold XCLLabel ......

16 ......

Questions?

www.ifs.tuwien.ac.at/~becker [email protected]

www.ifs.tuwien.ac.at/dp

......

17