PRONOM in Practice
iPRES 2018: The 15th International Conference on Digital Preservation
September 24th, 2018, 9am-12:30pm
Room 216, the Joseph B Martin Conference Centre

PRONOM in Practice
Creating File Format/System Signatures for Submission to PRONOM Technical Registry
David Clipsham, Nick Krabbenhoeft, Shira Peltzman, Justin Simpson, & Carl Wilson

INTRODUCTIONS
Facilitators
David Clipsham - Digital Archives Systems Manager, PRONOM Lead, National Archives, UK
Nick Krabbenhoeft - Head of Digital Preservation, New York Public Library
Shira Peltzman - Digital Archivist, UCLA Library
Justin Simpson - Archivematica Technical Director, Artefactual Systems, Inc
Carl Wilson - Technical Lead, Open Preservation Foundation

Agenda
9:15 - 9:35 am: Introduction to file format signatures
  How are file formats identified, overview of PRONOM, case studies
9:35 - 10:35 am: Signature development process
  Reading bytestreams (why to do it, how to do it), creating signatures
[break]
10:50 - 11:45 am: Signature development process (cont’d) & case studies
  Testing signatures, submitting to PRONOM
[break]
12:00 - 12:30 pm: Advanced signature development & open signature development workshop
  Container signatures, finding samples, troubleshoot existing signatures

Introduction to file format signatures

Why file format signatures?
Relevancy to digital preservation
● File format identification enables us to know what we’re dealing with
  ○ This happens early on in most workflows
  ○ The outcome of this process impacts downstream decision-making around activities like normalization for preservation and access
● File format identification tools are only as good as the file format signatures that have been developed by the community
  ○ The lack of a file format signature means that file identification cannot meaningfully take place
  ○ Tasks that should be straightforward, like disk image extraction and characterization, are sometimes difficult if not altogether impossible

PRONOM
[Image from Flickr via kevandotorg]

PRONOM
http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
File format registry for digital preservation planning.
Developed in 2001 to meet the National Archives’ digital record preservation planning needs.
● File format research: always ongoing, National Archives research guided primarily by UK Government needs
● File format identification signatures (for DROID originally)
● 1670 entries, aka PUIDs - PRONOM Unique Identifiers
● Format extensions, MIME/Media types, links to documentation
External contribution always welcome and encouraged

PRONOM Timeline
● 2001: Original internal version
● 2004: Opened up as externally browsable resource (aka PRONOM 3)
● 2005: PUIDs introduced; DROID launched alongside PRONOM 4
● ongoing: Continual PRONOM research and signature development

PRONOM Growth

PRONOM Contributors

PRONOM identification mechanisms
● Extension (.doc, .exe, .jpg)
● File format signature
  ○ Binary pattern matching
  ○ Created from elements of internal structure
  ○ May be simple ‘magic numbers’ - “CAFEBABE” for Java Class File
    ■ http://blog.nationalarchives.gov.uk/blog/cafed00ds-and-cafebabes/
  ○ May consist of complex patterns of variations, gaps and alternative values.
  ○ Driven by file format specification where possible
● Container signatures - formats made up of small files contained within a ‘ZIP’ or ‘OLE2’ wrapper (.doc, .xlsx, .odt, .epub)
  ○ http://openpreservation.org/blog/2016/01/07/droid-container-signature-files-what-they-are-and-how-to-create-them-a-template-and-an-example-or-few/

DROID Pattern Matching
● Scans internal file byte code
● Compares against known signatures in signature file
● Returns a Hit! where it gets a match
● We’re aiming for certainty – there should be an extremely low chance that a file could be of a different type to the format that DROID identifies
● So, a signature needs to be strong enough, but doesn’t need to encode all of the characteristics of a format

Magic Numbers (AKA Signatures)
● A specified sequence of characters/bytes that must be present
● Usually at the start of the file (not always)
● Explicitly stated within the format specification:
  ○ Java Class file – https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html
    Hex 0xCAFEBABE
  ○ PNG - https://www.w3.org/TR/PNG-Structure.html
    ASCII “‰PNG” (byte 0x89 followed by “PNG”), then hex 0x0D0A1A0A
  ○ Photoshop PSD - https://www.adobe.com/devnet-apps/photoshop/fileformatashtml
    ASCII “8BPS”

Inferred Signatures
● Sometimes formats may not have clearly defined signatures, but may have characteristics that must be present. This can be a good hook for a signature. This can get really complex!
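As a rough illustration of the explicit magic numbers listed under Magic Numbers (AKA Signatures) above, a minimal Python sketch can compare the first bytes of some data against them. The dictionary and function names here are hypothetical, and this is not how DROID itself matches signatures; it only checks the very start of the data.

```python
# Magic numbers as given on the Magic Numbers slide.
# MAGIC_NUMBERS and identify_by_magic are illustrative names, not part of any tool.
MAGIC_NUMBERS = {
    "Java Class File": bytes.fromhex("CAFEBABE"),
    "PNG": b"\x89PNG" + bytes.fromhex("0D0A1A0A"),
    "Photoshop PSD": b"8BPS",
}


def identify_by_magic(data):
    """Return the names of all formats whose magic number starts the data."""
    return [name for name, magic in MAGIC_NUMBERS.items() if data.startswith(magic)]


if __name__ == "__main__":
    # First bytes of a hypothetical PSD file.
    print(identify_by_magic(b"8BPS\x00\x01"))
```

To try it on a real file, read the first few bytes with `open(path, "rb").read(16)` and pass them to `identify_by_magic`.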
● Gatan DM3: http://www.er-c.org/cbb/info/dmformat/#dm3
  00000003{4}000000(00|01){6}(14|15){2-258}25252525
● Stata DTA 113: http://www.stata.com/help.cgi?dta_113
  71(01|02)01{105}00
● ASP ASAX: https://msdn.microsoft.com/en-us/library/es4ac4ek(v=vs.85).aspx
  3C2540204170706C69636174696F6E20(436F6465426568696E64|436F6D70696C65724F7074696F6E73|4465736372697074696F6E|496E686572697473|4C616E6775616765)3D

Format ‘Subsets’
● Sometimes file formats may be ‘subsets’ or subtypes of other formats. Major examples are:
● PDF/A – subtype of PDF (so is PDF/X)
● DNG – subtype of TIFF (so is NIKON Raw NEF)
● WAV – subtype of RIFF (so is AVI)
● SVG – subtype of XML (so is GML)
We manage these relationships with ‘priorities’ – for example, PDF/A has priority over PDF because it contains a more specific element

Notes on Format Identification
● Not all files are automatically identifiable (based on how PRONOM currently works) – see Wireless Bitmap (.wbmp) – but an extension-only entry is better than nothing!
● A 4 byte (32 bit) sequence has a 1 in ~4 billion chance of a clash with truly random data – this is usually strong enough
● A text editor can be better for viewing XML based formats than a Hex Editor (although you’ll need the hex editor for creating the byte sequences)
● We’re not trying to characterise a format or validate that it is well formed, we’re just trying to give ourselves a reasonable degree of certainty about the outcome
● Files that ID as OLE2 or ZIP are probably container sigs (for later!)

Signature development process

Reading bytestreams: Format signature tools
● Hexadecimal and binary: Binary and hexadecimal are number systems, Base 2 and Base 16 respectively. We usually work in Base 10 (decimal), i.e. 1, 2, 3 … 10. Binary for 144 is 10010000. In hex this is simply 0x90. Fewer digits to work with helps us read larger numbers more easily.
● Hex (hexadecimal) editors: allow for manipulation of the fundamental binary data that constitutes a file
● OP Format Corpus: an openly-licensed corpus of sample files
● DROID/Siegfried/FIDO: tools to match files to PRONOM format signatures
● PRONOM Submission Utility: an online form to submit information about file formats for PRONOM

Tools: Hex Editors
A program that allows for manipulation of the fundamental binary data that constitutes a file
Also called a binary file editor
For more info see: https://en.wikipedia.org/wiki/Hex_editor

Hex editors
● Windows - HxD https://mh-nexus.de/en/hxd/
● OS X - HexFiend http://ridiculousfish.com/hexfiend/
● Linux - Bless https://apps.ubuntu.com/cat/applications/precise/bless/

Hex editors
Online options:
● http://binvis.io
● https://hexed.it/
● http://icebuddha.com/

Resources: Format specification documents
A document that describes the set of requirements necessary for a given file format
LoC’s Sustainability of Digital Formats is a good place to look for these: http://www.loc.gov/preservation/digital/formats/fdd/browse_list.shtml
GIF specification: https://www.w3.org/Graphics/GIF/spec-gif89a.txt

Hands-on: Examining sample files in a hex editor

Case study: Developing a simple signature
TZX Spectrum Tapes

The TZX Tape Format
Creating a signature
● A format for archiving ZX Spectrum programs
● Used with ZX emulation programs
● Large hobbyist community – lots of information available
● An audio stream of the tape data
● World of Spectrum Archive: 10,000s of examples - https://www.worldofspectrum.org

The TZX Tape Format
The Format Specification - http://www.worldofspectrum.org/TZXformat.html

Resources: PRONOM terms, basic syntax and data model
BOF = Beginning of File. EOF = End of File. Var = Variable (anywhere in the file)
Offset/Max Offset = Exact or positional range in which a signature starts
Wildcards:
  ?? = single wildcard byte, e.g. AB??C3
  * = 0-many wildcard bytes, e.g. BC*D4
  {n} = specific number of wildcard bytes, e.g. A2{5}F3
  {n-m} = range of wildcard bytes, e.g. 4D{0-12}E4
Byte range: [hh:hh] = single byte value within the range, e.g. [00:FA]
Either/or: (hhhh|hhhh|hh) = either/any of these byte values, e.g. (0D|0A|0D0A)
Not: [!hh] = anything except this byte value, e.g. ABCD[!01]E1
https://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf

Tool: PRONOM Signature Development Utility
http://www.nationalarchives.gov.uk/pronom/sigdev/index.htm

Hands-on: Creating and editing a sample PRONOM signature

Break! Please be back at 10:50am

Signature development process cont’d

Tool: Format characterization tools
The process of file format characterization
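Looking back at the signature syntax summarised under "PRONOM terms, basic syntax and data model", one way to get a feel for it is to translate a byte sequence into a regular expression. The following is a minimal Python sketch under stated assumptions: `signature_to_regex` is a hypothetical helper, not part of DROID or the Signature Development Utility, it covers only the syntax elements listed on that slide, and it does not model BOF/EOF/Var anchoring or offsets beyond matching from the first byte.

```python
import re

# One alternative per syntax element from the slide; literal hex pairs come last.
TOKEN = re.compile(
    r"(?P<any>\?\?)"                                        # ??       one wildcard byte
    r"|(?P<star>\*)"                                        # *        zero or more wildcard bytes
    r"|\{(?P<min>\d+)(?:-(?P<max>\d+))?\}"                  # {n}, {n-m} counted wildcard bytes
    r"|\[(?P<lo>[0-9A-Fa-f]{2}):(?P<hi>[0-9A-Fa-f]{2})\]"   # [hh:hh]  byte range
    r"|\[!(?P<neg>[0-9A-Fa-f]{2})\]"                        # [!hh]    anything except this byte
    r"|(?P<punct>[()|])"                                    # ( | )    either/or groups
    r"|(?P<byte>[0-9A-Fa-f]{2})"                            # hh       literal byte
)


def _hex(pair):
    """Render a two-digit hex pair as a regex byte escape, e.g. 'AB' -> '\\xab'."""
    return "\\x" + pair.lower()


def signature_to_regex(signature):
    """Translate a PRONOM-style byte sequence into a compiled bytes regex."""
    parts, pos = [], 0
    while pos < len(signature):
        m = TOKEN.match(signature, pos)
        if m is None:
            raise ValueError("unsupported syntax at position %d" % pos)
        if m.group("any"):
            parts.append(".")
        elif m.group("star"):
            parts.append(".*")
        elif m.group("min") is not None:
            low, high = m.group("min"), m.group("max")
            parts.append(".{%s}" % low if high is None else ".{%s,%s}" % (low, high))
        elif m.group("lo"):
            parts.append("[%s-%s]" % (_hex(m.group("lo")), _hex(m.group("hi"))))
        elif m.group("neg"):
            parts.append("[^%s]" % _hex(m.group("neg")))
        elif m.group("punct"):
            parts.append("(?:" if m.group("punct") == "(" else m.group("punct"))
        else:
            parts.append(_hex(m.group("byte")))
        pos = m.end()
    # DOTALL so that "." also matches the newline byte 0x0A.
    return re.compile("".join(parts).encode("ascii"), re.DOTALL)


if __name__ == "__main__":
    # The Stata DTA 113 sequence from the Inferred Signatures slide,
    # tried against made-up sample bytes (not a real DTA file).
    stata = signature_to_regex("71(01|02)01{105}00")
    sample = bytes([0x71, 0x02, 0x01]) + bytes(105) + b"\x00"
    print(bool(stata.match(sample)))
```

Calling `.match()` on the result tests the sequence at offset 0 only, which roughly corresponds to a BOF signature with offset 0; `.search()` would behave more like a Var sequence.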