Digital Preservation
File formats and characterisation
Christoph Becker, Hannes Kulovits
April 17, 2008
Vienna University of Technology
www.ifs.tuwien.ac.at/dp

Agenda ......
File formats
Basics and issues
Exercise/experiments
Characterisation
Identification and validation (DROID, JHove)
File format registries
Risk assessment of file formats
eXtensible Characterisation Languages (XCL)
......
Part 1: File Formats
Hannes Kulovits
Institut für Softwaretechnik und Interaktive Systeme, TU Wien
http://www.ifs.tuwien.ac.at/dp
Part of this presentation is based on slides by Prof. Manfred Thaller, DELOS Summer School 2007, Pisa

Agenda ......
Definition of File/File Format
Representation
Elements of a file format
File and Preservation
Challenges
What is a file/file format? ......
A file is nothing more than a sequence of bits
How to encode those bits is specified in a file format
File format is a specification of how to interpret a bit stream.
File format specifies
1. Whether the file is binary or ASCII
2. How information is organized
3. ...

Plain Text ......
De facto standard for Plain Text is ASCII
– Defines 128 characters in 7 bits; commonly stored in 8 bits, and 8-bit extensions such as ISO 8859-1 allow a maximum of 256 different characters
– Includes
• Latin letters (lower and upper case)
• Arabic numerals
• Punctuation marks
• Standard symbols
Another important format is Unicode
– Provides unique encoding for each character
– Uses one or more bytes to represent each character, depending on the encoding (UTF-8, UTF-16, UTF-32)
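How the same text turns into different bit streams under different encodings can be sketched as follows. This is a minimal illustration using Python's built-in codecs; the sample word and the helper name `encode_all` are assumptions made for the example and do not come from the slides.

```python
def encode_all(text):
    """Encode the same text under several character encodings."""
    result = {}
    for name in ("ascii", "iso-8859-1", "utf-8", "utf-16-le"):
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            # This encoding has no code for at least one character.
            result[name] = None
    return result


encodings = encode_all("für")
for name, data in encodings.items():
    print(name, data, "-" if data is None else len(data))

# Reading the bytes with the wrong format specification garbles the text.
misread = encodings["utf-8"].decode("iso-8859-1")
print(misread)
```

The bytes alone do not say which encoding was used; that knowledge has to come from the format specification or from accompanying metadata.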
......
Proprietary vs. Open ......
Proprietary
– Documentation mostly not available
– License and patent rules
– License agreements subject to change
– Restrictions for use and modifications may apply
Open
– Documentation available!
– Unlimited use
– No license fee
– Open for modifications
– No patent owners

File formats based on plain text ......
For example: XHTML 1.1
In HTML plain text must obey certain rules
– use of tags
– type sizes
– color
......
Different types of File Formats ......
Different kinds of formats for different kinds of information
[Rothenberg, 1995, Ensuring the Longevity of Digital Documents]
Official categorisation of file formats is the IANA MIME type
– Text documents
– Databases
– Still and moving images
– Audio
– Multipart
– ...

Different types of File Formats (2) ......
Three-character file extension of DOS and Windows. (Neither standardised nor unique.)
Unix ‘magic numbers’
Macintosh data-forks
MIME type, also not unique
None of them is really satisfying
– Better solution: PRONOM with Pronom Unique Identifier
......
An image ......
[Figure: a pixel grid of 6 rows by 5 columns, drawn from the bit pattern below]
6 rows, 5 columns
1 == blue, 0 == red (or, just as well, 1 == green, 0 == yellow)
11111
10001
11011
11011
11011
11111
Store:
1,1,1,1,1,
1,0,0,0,1,
1,1,0,1,1,
1,1,0,1,1,
1,1,0,1,1,
1,1,1,1,1
......
An image ......
Store (uncompressed):
1,1,1,1,1,
1,0,0,0,1,
1,1,0,1,1,
1,1,0,1,1,
1,1,0,1,1,
1,1,1,1,1
Store (compressed, run length encoded):
6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1

An image ......
Store (as processing instructions):
SetSize: 5 by 6
SetBackgroundColor: Blue
SetForegroundColor: Red
SetLetterHeight: 4
MoveTo: 3,5
DrawLetter: T
[Figure: coordinate grid of pixel positions, column,row, from 1,1 to 5,6]
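The run length encoded "Store:" sequence on these slides can be reproduced with a short sketch. It assumes the pixels are read row by row and that each run is written as a count followed by the value; the function names are chosen for this example only.

```python
from itertools import groupby

ROWS = ["11111", "10001", "11011", "11011", "11011", "11111"]


def rle_encode(bits):
    """Return a flat list of count, value, count, value, ..."""
    encoded = []
    for value, run in groupby(bits):
        encoded.extend([len(list(run)), value])
    return encoded


def rle_decode(encoded):
    """Expand count/value pairs back into the original sequence."""
    bits = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        bits.extend([value] * count)
    return bits


pixels = [int(c) for row in ROWS for c in row]
compressed = rle_encode(pixels)
print(compressed)
print(len(pixels), "values uncompressed,", len(compressed), "values compressed")
```

Decoding only works if the reader knows that the numbers come in count/value pairs and that the image is 5 columns wide, which is exactly the kind of knowledge a file format specification has to carry.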
......
An image ......
Dimensions: 6 rows, 5 columns
1 == blue, 0 == red
Uncompressed
… and the data?

An image ......
Data either as data stream
11111
10001
11011
11011
11011
11111
or as processing instructions
SetSize: 5 by 6
SetBackgroundColor: Blue
SetForegroundColor: Red
SetLetterHeight: 4
MoveTo: 3,5
DrawLetter: T

File Format ......
Basic Information
– What to do?
Rendering Information
– How to do it?
Storage Information
– How to move it from persistent form to deployed form?
Data
– What to deploy?
......
File Format (2) ......
Basic Information
– Mandatory
Rendering Information
– Useful
Storage Information
– Historical
Data
– Mandatory

File Format - Definition ......
A clearer definition of the term file format:
‘[...] the internal structure and encoding of a digital object, which allows it to be processed, or to be rendered in human accessible form. A digital object may be a file, or a bit stream embedded within a file’
Brown, A. (2006). Digital Preservation Technical Paper 2.
......
File as a composite object ......
Rather popular file formats at the moment are for instance HTML, XML and PNG
But all of them can be stored in the same file format!
DOC
[Figure labels: .pdf, .doc, .png]

File format: TIFF ......
Image File Header
Image File Directory
– Information about image
– Pointers to actual data
IFD Entry
– TIFF Tag
– Value
Custom tags possible
......
File format: PDF ......
1 0 obj
<<
/Type /Page
/Parent 281 0 R
/Resources 2 0 R
/Contents 3 0 R
/StructParents 2
/MediaBox [ 0 0 612 792 ]
/CropBox [ 0 0 612 792 ]
/Rotate 0
>>
endobj
2 0 obj
<<
/ProcSet [ /PDF /Text ]
/Font << /TT2 292 0 R /TT4 288 0 R >>
/ExtGState << /GS1 300 0 R >>
/ColorSpace << /Cs6 289 0 R >>
>>
endobj

File format: PDF ......
3 0 obj
<< /Length 4605 /Filter /FlateDecode >>
stream
H‰„WÛŽÛÈ}×Wô#Œ4jR”¨`±Àø ™Í" ¶(²5j›"¹lräý‘|oêÖ-j
......
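The `/Filter /FlateDecode` entry says that the stream data is compressed with the zlib/deflate method, which is why it looks like noise in a text editor. The following sketch shows the round trip with Python's `zlib` module; the content stream used here is invented for the illustration and is not the one from the slide.

```python
import zlib

# Hypothetical page content stream (PDF text operators), made up for this sketch.
content = b"BT /TT2 12 Tf 72 720 Td (Digital Preservation) Tj ET\n" * 20

stored = zlib.compress(content)     # what a /FlateDecode stream would hold
restored = zlib.decompress(stored)  # what a reader recovers before rendering

print(len(content), "bytes of instructions,", len(stored), "bytes stored")
print(stored[:16])                  # opaque without knowledge of the filter
```

Without the information that the filter is Flate, the stored bytes cannot be interpreted, so the dictionary in front of the stream is as important for preservation as the stream itself.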
File format: XML (SVG) ......

Files and Preservation ......
1. Bit rot.
2. Obsolescence of software.
......
Bit rot ......
An image file before ... and after one byte is changed.
Undetectable by software.

Bit rot ......
002 004   Processing dictionary
234 123   Payload
234 156
127 178
221 221

Bit rot ......
002 004
234 123
234 156
127 xxx
221 221
One byte is damaged, one byte cannot be displayed correctly.

Bit rot ......
002 xxx
234 123
234 156
127 178
221 221
One byte is damaged, ten bytes cannot be displayed correctly.

Challenges w.r.t. File Formats ......
Obsolescence
– Software able to read does not exist anymore
– Format specification lost
– Implied algorithm lost
– Required object lost
Format is proprietary
Format depends on obsolete hardware
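Why a damaged byte in the processing dictionary hurts more than a damaged byte in the payload can be sketched with a toy indexed-colour file. The layout below (a two-entry palette followed by one index byte per pixel) is invented for this example and is not a real image format.

```python
# Toy indexed-colour file, invented for this sketch:
# bytes 0-5 hold a two-entry RGB palette (the processing dictionary),
# bytes 6-35 hold one palette index per pixel (the payload).
PALETTE = bytes([255, 0, 0,    # index 0 = red
                 0, 0, 255])   # index 1 = blue
ROWS = ["11111", "10001", "11011", "11011", "11011", "11111"]
PAYLOAD = bytes(int(c) for row in ROWS for c in row)
ORIGINAL = PALETTE + PAYLOAD


def render(data):
    """Turn the byte stream into a list of RGB colours, one per pixel."""
    palette = [data[0:3], data[3:6]]
    # An index outside the palette cannot be displayed at all (None).
    return [palette[i] if i < len(palette) else None for i in data[6:]]


def damaged_pixels(offset):
    """Invert one byte at `offset` and count pixels that render differently."""
    damaged = bytearray(ORIGINAL)
    damaged[offset] ^= 0xFF
    before, after = render(ORIGINAL), render(bytes(damaged))
    return sum(1 for a, b in zip(before, after) if a != b)


print(damaged_pixels(6 + 12))  # damage inside the payload
print(damaged_pixels(3))       # damage inside the palette
```

In this toy file a damaged payload byte spoils a single pixel, while a damaged palette byte spoils every pixel that refers to that palette entry.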
......
Recommended formats? ......
XML
TXT
PDF
?

Recommended formats: text ......
High confidence
– Plain text (encoding: ISO8859-1 - 9, UTF-8, UTF-16 with BOM)
– XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified)
– PDF/A-1 (ISO 19005-1)
Medium confidence
– Cascading Style Sheets (*.css)
– DTD (*.dtd)
– PDF (*.pdf) (embedded fonts)
– Rich Text Format 1.x (*.rtf)
– HTML 4.x (include a DOCTYPE declaration)
– SGML (*.sgml)
– Open Office (*.sxw/*.odt)
– Office Open XML (*.docx)
Low confidence
– PDF (*.pdf) (encrypted)
– Microsoft Word (*.doc)
– WordPerfect (*.wpd)
– DVI (*.dvi)
– All other text formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
......
Recommended formats: bitmap / raster image ......
High confidence
– TIFF (uncompressed)
– PNG (*.png)
Medium confidence
– BMP (*.bmp)
– JPEG/JFIF (*.jpg)
– JPEG2000 (prefer lossless or uncompressed) (*.jp2)
– TIFF (compressed)
– GIF (*.gif)
Low confidence
– MrSID (*.sid)
– TIFF (in Planar format)
– FlashPix (*.fpx)
– PhotoShop (*.psd)
– All other raster image formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Recommended formats: vector graphics ......
High confidence
– SVG 1.1 (no Java binding) (*.svg)
Medium confidence
– Computer Graphic Metafile (CGM, WebCGM) (*.cgm)
Low confidence
– Encapsulated Postscript (EPS)
– Macromedia Flash (*.swf)
– All other vector image formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
......
Recommended formats: audio ......
High confidence
– AIFF (PCM) (*.aif, *.aiff)
– WAV (PCM) (*.wav)
Medium confidence
– SUN Audio (uncompressed) (*.au)
– Standard MIDI (*.mid, *.midi)
– Ogg Vorbis (*.ogg)
– Free Lossless Audio Codec (*.flac)
– Advanced Audio Coding (*.mp4, *.m4a, *.aac)
– MP3 (MPEG-1/2, Layer 3) (*.mp3)
Low confidence
– AIFC (compressed) (*.aifc)
– NeXT SND (*.snd)
– RealNetworks 'Real Audio' (*.ra, *.rm, *.ram)
– Windows Media Audio (*.wma)
– WAV (compressed) (*.wav)
– All other audio formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Recommended formats: video ......
High confidence
– Motion JPEG 2000 (ISO/IEC 15444-4) (*.mj2)
– AVI (uncompressed) (*.avi)
– QuickTime Movie (uncompressed) (*.mov)
– Motion JPEG (*.avi, *.mov)
Medium confidence
– Ogg Theora (*.ogg)
– MPEG-1, MPEG-2 (*.mpg, *.mpeg)
– MPEG-4 (*.mp4)
Low confidence
– AVI (compressed) (*.avi)
– QuickTime Movie (compressed) (*.mov)
– RealNetworks 'Real Video' (*.rv)
– Windows Media Video (*.wmv)
– All other video formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
......
Recommended formats: “data base” ......
High confidence
– Delimited Text (*.txt, *.csv)
– SQL DDL
Medium confidence
– DBF (*.dbf)
– OpenOffice (*.sxc/*.ods)
– Office Open XML (*.xlsx)
Low confidence
– Excel (*.xls)
– All other spreadsheet/database formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Recommended formats: 3D ......
High confidence
– X3D (*.x3d)
Medium confidence
– VRML (*.wrl, *.vrml)
– U3D (Universal 3D file format)
Low confidence
– All other virtual reality formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
......
......
Thank you very much for your attention!
Questions?

Digital preservation and file formats: Some robustness experiments
Christoph Becker, Hannes Kulovits
April 17, 2008
Vienna University of Technology
www.ifs.tuwien.ac.at/dp
......
Exercise ......
www.ifs.tuwien.ac.at/~becker/shotgun.zip or USB
(Many thanks to Cologne University!)
Shoot files
Clean vs. Dirty
Flip bits vs. Remove bits
Size vs. Frequency of shots
File types
– Images
– Documents
– Audio

Exercise ......
Shoot and try to open the ‘dirty’ file
Try different intensities, flip vs. remove
Compare results with the FCLA recommendations
Thoughts?
Target time: 30-45 minutes max
www.ifs.tuwien.ac.at/~becker/shotgun.zip or USB
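For readers without access to the shotgun tool, the two kinds of damage in the exercise can be approximated with a small sketch. This is not the tool from the download above; the function names, the seed parameter and the byte-level removal (instead of removing single bits) are simplifying assumptions.

```python
import random


def flip_bits(data, shots, seed=0):
    """Flip `shots` distinct bits; the file keeps its length."""
    rng = random.Random(seed)
    damaged = bytearray(data)
    for position in rng.sample(range(len(data) * 8), shots):
        damaged[position // 8] ^= 1 << (position % 8)
    return bytes(damaged)


def remove_bytes(data, shots, seed=0):
    """Delete `shots` distinct bytes; everything after each hit shifts."""
    rng = random.Random(seed)
    hit = set(rng.sample(range(len(data)), shots))
    return bytes(b for i, b in enumerate(data) if i not in hit)


clean = bytes(range(256)) * 4   # stand-in for the contents of a clean file
flipped = flip_bits(clean, shots=10)
removed = remove_bytes(clean, shots=10)
print(len(clean), len(flipped), len(removed))
```

Flipping leaves every other byte where it was, whereas removal shifts all following offsets, which tends to be far more destructive for formats that rely on pointers or fixed record lengths.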
......
......
Part 2: File formats and registries, Characterisation
Christoph Becker
Vienna University of Technology
www.ifs.tuwien.ac.at/~becker
[email protected]
www.ifs.tuwien.ac.at/dp

Agenda ......
File formats and issues
File format identification
Registries
Characterisation tools
DROID
JHove
XCL
......
Requirements for DP ......
Digital preservation has to guarantee
– Integrity
– Understandability
– Originality
– Authenticity
– Accessibility
“Interoperability through time”

Some file format requirements ......
Specifications available
– Is an XML specification enough?
– Syntax and semantics needed
Standardized (ISO, ANSI, ...)
Accepted and widely used
Not covered by patent
Free of compression
Free of any cryptographical techniques
Flexible and extensible?
......
What file is this? ......
What’s wrong with file extensions?
– Not necessarily unique (e.g. wks)
– Granularity not sufficient
– Can be altered by users
Formats vs. Format profiles
– PDF is not one format
– DOC is not one format
– TIFF is not one format

What kind of file is this? ......
1. “Clever software” inspects files to decide how to process them
2. Format registries
......
What’s Wrong with MIME Types? ......
Insufficient depth of detail
– No requirements regarding syntax and semantic description
– No requirement for complete disclosure, especially of proprietary formats
Insufficient granularity
– Both tiled RGB GeoTIFF with LZW and striped bi-tonal TIFF-FX with Group 4 are typed as “image/tiff”
– All of PDF 1.0 – 1.4, PDF/X-1, X-2, X-3, and PDF/A are typed as “application/pdf”
– These variants might require radically different workflows

Why Do We Need a Registry? ......
Repository functions are performed on a format-specific basis
Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
Interchange requires mutual agreement of format syntax and semantics

Global Digital Format Registry, DSpace User Group, March 2004
Use Cases ......
Identification
– “I have a digital object; what format is it?”
Validation
– “I have an object purportedly of format F; is it?”
Transformation
– “I have an object of format F, but need G; how can I produce it?”
Characterization
– “I have an object of format F; what are its significant properties?”
Risk assessment
– “I have an object of format F; is it at risk of obsolescence?”
Delivery
– “I have an object of format F; how can I render it?”

File format registries ......
PRONOM: http://www.nationalarchives.gov.uk/pronom/
Global Digital Format Registry: http://hul.harvard.edu/gdfr
FileExt: http://filext.com
PRONOM ......

Tools ......
DROID (Digital Record Object Identification)
– relies on PRONOM
– The National Archives, UK
JHOVE
– JSTOR/Harvard Object Validation Environment
– Validation and characterisation
eXtensible Characterisation Languages (XCL)
– Two XML meta-languages
– Goal: express complete informational content of an object in an abstract model
......
Signatures in DROID ......
External signatures
– File extensions
Internal signatures
– Format indicators in the bitstream
– Byte sequences

What kind of file is this? ......
(a) By external characteristics (file extensions)
(b) By internal characteristics (“magic number”, “signature”).
A TIFF file begins with …
1. Bytes 0-1: The byte order used within the file. Legal values are: “II” (4949.H) / “MM” (4D4D.H)
2. Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file.
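The internal TIFF signature described above can be checked with a few lines of code. This is a minimal sketch of identification by magic number, not how DROID is implemented; the function name is chosen for this example.

```python
import struct


def tiff_byte_order(header):
    """Return 'little-endian', 'big-endian', or None if this is no TIFF signature."""
    if len(header) < 4:
        return None
    # Bytes 0-1: "II" means little-endian, "MM" means big-endian.
    order = {b"II": "<", b"MM": ">"}.get(header[0:2])
    if order is None:
        return None
    # Bytes 2-3: the number 42, stored in the byte order just announced.
    (magic,) = struct.unpack(order + "H", header[2:4])
    if magic != 42:
        return None
    return "little-endian" if order == "<" else "big-endian"


print(tiff_byte_order(b"II*\x00\x08\x00\x00\x00"))
print(tiff_byte_order(b"MM\x00*\x00\x00\x00\x08"))
print(tiff_byte_order(b"%PDF-1.4"))
```

A check like this says only that the first four bytes look like TIFF; whether the rest of the file is a well-formed and valid TIFF is a separate question.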
......
Registry content ......
Descriptive information
Identifiers
– MIME
– Pronom Unique Identifier (PUID)
Relationships to formats
Technical environment
References and links…
Risk factors

Demo: DROID, PRONOM
......
File format characteristics ......
Questions?
......
Use Case Coverage ......
Identification
Risk assessment
Delivery
– “I have an object of format F; how can I render it?”
Transformation
– “I have an object of format F, but need G; how can I produce it?”
Validation
– “I have an object purportedly of format F; is it?”
Characterization
– “I have an object of format F; what are its significant properties?”

JHove ......
JSTOR/Harvard Object Validation Environment
Modular and extensible Java-based architecture
– Image modules: GIF, JPEG, JPEG2000, TIFF
– Document modules: ASCII, HTML, PDF, UTF-8, XML
– ...
Three functions
– Identification
– Validation
– Characterisation
......
The TIFF module… ......
−Tagged Image File Format (TIFF) raster images TIFF 4.0, 5.0, and 6.0 [TIFF 4.0, TIFF 5.0, TIFF 6.0]
−Baseline 6.0 Class B, G, P, and R [TIFF 6.0]
−Extension Class Y [TIFF 6.0]
−TIFF/IT (ISO 12639:2003) [TIFF/IT] File types CT, LW, HC, MP, BP, BL, and FP, and conformance levels P1 and P2
−TIFF/EP (ISO 12234-2:2001) [TIFF/EP]
−Exif 2.0, 2.1 (JEIDA-49-1998), and 2.2 (JEITA CP-3451) [Exif 2.1, Exif 2.2]
−GeoTIFF 1.0 [GeoTIFF]
−TIFF-FX (RFC 2301) [TIFF-FX]
−Profiles C, F, J, L, M, and S
−Class F (RFC 2306) [Class F, RFC 2306]
−RFC 1314 [RFC 1314]
−DNG (Adobe Digital Negative) [DNG]

Validation ......
A digital object is well-formed if it meets the purely syntactic requirements for its format.
An object is valid if it is well-formed and it meets additional semantic-level requirements.
Validation use cases:
– "I have an object that purports to be of format F; is it?"
– "I have an object of format F; does it meet profile P of F?"
– "I have an object of format F and external metadata about F in schema S; are they consistent?"
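The difference between well-formed and valid can be illustrated with a small XML example. This sketch does not use JHOVE; the element name `image` and the rule that `width` and `height` must be whole numbers are hypothetical semantic requirements invented for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Purely syntactic check: can the text be parsed as XML at all?"""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def is_valid(document):
    """Well-formed plus the (hypothetical) semantic rules of this sketch."""
    if not is_well_formed(document):
        return False
    root = ET.fromstring(document)
    return (
        root.tag == "image"
        and root.get("width", "").isdigit()
        and root.get("height", "").isdigit()
    )


broken = "<image width='5' height='6'>"          # missing end tag
odd = "<image width='5' height='six'/>"          # parses, but height is no number
good = "<image width='5' height='6'/>"

for sample in (broken, odd, good):
    print(is_well_formed(sample), is_valid(sample))
```

The second sample is the interesting case: a parser accepts it, yet an application that expects a numeric height would fail on it.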
......
......
Questions? JHove Demo
......
Use Case Coverage ......
Identification
Risk assessment
Delivery
Transformation
Validation
Characterization
– “I have an object of format F; what are its significant properties?”

Core requirement: Keep object intact ......
Essential object characteristics
– Content
– Appearance
– Structure
– Behaviour
– Context
......
Validating a migrated image ......
Yes, it’s in JPEG 2000 format
Yes, it’s well-formed
Yes, it’s valid
Yes, it still has the same dimensions
…. But is it still the same image?
We need more characterisation.

The XCL languages ......
The eXtensible characterisation description language XCDL
– describes properties of digital objects
The eXtensible characterisation extraction language XCEL
– extracts properties from files
– Creates a mapping from a file format to XCDL
......
Essential properties as described by file formats ......
Properties described by the format ......
Image width: 277
Image length: 339
Compression: uncompressed
......
XCDL ......
Uniform description of properties and values
Uniform structure
− Properties of different objects are described using a single vocabulary and grammar
eXtensible
[Figure: file.png, file.tif and file.gif are each mapped to XCDL]
TIFF: imageLength
PNG: imageHeight
? : ?
XCDL: imageHeight

Extracting properties: XCEL ......
One generic XCEL processor instead of specific extractor for every file format
Preprocessing instructions
– Configuration tasks
Format description
– Defines the structure of an object
Templates
– Describe recurring structures
Postprocessing instructions
– On the results of processing
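The idea of a single vocabulary can be sketched as a lookup from format-specific property names to common names, so that an original and its migrated copy become comparable. The table and function names below are hypothetical and are not XCEL or XCDL syntax; only the imageLength/imageHeight pair comes from the slide, and the imageWidth entries are assumed for the example.

```python
# Hypothetical mapping: (format, native property name) -> common property name.
COMMON_NAMES = {
    ("TIFF", "imageLength"): "imageHeight",  # from the slide
    ("PNG", "imageHeight"): "imageHeight",   # from the slide
    ("TIFF", "imageWidth"): "imageWidth",    # assumed for this sketch
    ("PNG", "imageWidth"): "imageWidth",     # assumed for this sketch
}


def normalise(file_format, properties):
    """Rename format-specific properties to the common vocabulary."""
    return {
        COMMON_NAMES[(file_format, name)]: value
        for name, value in properties.items()
    }


def differences(original, migrated):
    """Return the properties whose values differ, as (original, migrated) pairs."""
    keys = sorted(set(original) | set(migrated))
    return {
        key: (original.get(key), migrated.get(key))
        for key in keys
        if original.get(key) != migrated.get(key)
    }


tiff = normalise("TIFF", {"imageWidth": 277, "imageLength": 339})
png = normalise("PNG", {"imageWidth": 277, "imageHeight": 339})
print(differences(tiff, png))
```

Once both descriptions use the same names, checking a migration reduces to comparing two dictionaries, whatever the source and target formats are.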
......
XCDL example: ‘An important word’ Comparing migrated documents ......
......
Questions?
www.ifs.tuwien.ac.at/~becker [email protected]
www.ifs.tuwien.ac.at/dp
......