A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly

Michelle Lindlar
TIB - Leibniz Information Centre of Science and Technology
Welfengarten 1B, Hannover, Germany 30167

Yvonne Tunnat
ZBW - Leibniz Information Centre of Economics
Düsternbrooker Weg 120, Kiel, Germany 24105

Carl Wilson
OPF - Open Preservation Foundation
c/o The British Library, Boston Spa, United Kingdom LS23 7BQ

ABSTRACT

Digital preservation and active software stewardship are both cyclical processes. While digital preservation strategies have to be reevaluated regularly to ensure that they still meet technological and organizational requirements, software needs to be tested with every new release to ensure that it functions correctly. JHOVE is an open source format validation tool which plays a central role in many digital preservation workflows, and the PDF module is one of its most important features. Unlike tools such as Adobe PreFlight or veraPDF, which check against requirements at profile level, JHOVE's PDF-module is the only tool that can validate the syntax and structure of PDF files. Despite JHOVE's widespread and long-standing adoption, the underlying validation rules are not formally or thoroughly tested, leading to bugs going undetected for a long time. Furthermore, there is no ground-truth data set which can be used to understand and test PDF validation at the structural level. The authors present a corpus of light-weight files designed to test the validation criteria of JHOVE's PDF module against "well-formedness". We conclude by measuring the code coverage of the test corpus within JHOVE PDF validation and by feeding detected inconsistencies of the PDF-module back into the open source development process.

KEYWORDS

file format validation, PDF, test data, quality assurance, JHOVE

ACM Reference format:
Michelle Lindlar, Yvonne Tunnat, and Carl Wilson. 2017. A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly. In Proceedings of iPRES Conference, Kyoto, Japan, September 2017 (iPRES 2017), 11 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
iPRES 2017, Kyoto, Japan
© 2017 Copyright held by the owner/author(s). 978-x-xxxx-xxxx-x/YY/MM...$15.00

1 INTRODUCTION

File format validation is a central task in digital preservation processes, giving insight into the degree to which the digital object complies with the specification of the file format it purports to be. For complex formats such as PDF, which allow for a multitude of content types and variations, such as embedded AV material or embedded and non-embedded fonts, validation poses a challenging problem. While software to validate digital objects against PDF profile requirements such as PDF/A or PDF/X exists¹, such tools typically focus on the requirements of the profile and do not take the syntactical and structural requirements of the underlying PDF format into account [8]. As of today, the go-to validator for the PDF format is the open source tool JHOVE [23]². The initial development of JHOVE dates back to 2003-2008 and the tool has been widely used by digital archives since.

Digital preservation and active software stewardship are both cyclical processes. While digital preservation strategies have to be regularly reevaluated to ensure they continue to meet technological and organizational requirements, software needs to be tested with every new release to ensure that it functions correctly. Despite JHOVE's widespread and long-standing adoption, the underlying validation rules are not formally or thoroughly tested, leading to bugs which can go undetected for a long time. Formal testing for complex software such as file format validators has to be automated. However, a requirement for such automated testing processes is a ground-truth as a point of reference, ideally manifested in a light-weight test set. This test set can be used to check the validator's capability to enforce specific clauses in the format specification. In the case of PDF validation in general and JHOVE specifically, no such test-set has been available until now.

This paper describes the authors' efforts to narrow this gap by building a light-weight test-set for PDF validation. The test set focuses on the validation against structural and syntactical requirements³ of the PDF file format as described in the ISO 32000-1:2008 standard for PDF 1.7. It will not look at particular profile requirements such as those described in the ISO 19005 series for PDF/A. As the standard does not make a clear distinction between well-formed and valid requirements, these are derived by looking at required structural parts of any PDF object, namely: a header, a body consisting of a minimal set of objects, a cross-reference table and a trailer (see Figure 1). While JHOVE only supports PDF features up to version 1.7, the cases implemented in the test set are common to all PDF versions. The aim of the test set is threefold:

(1) to establish a ground truth for what is not well-formed
(2) to test the JHOVE software against that ground truth
(3) to improve automated regression testing

Section 2 of this paper will introduce the concept of file format validation and give insight into the development of JHOVE in general and the PDF module in particular to provide a contextual framework for the test set work. Section 3 will introduce the methodology used for the construction of the test set as well as for measuring and describing the automated regression testing gap. Section 4 describes the test set itself and the results of running the JHOVE PDF-module across the test set⁴. To introduce a second point of reference, each test file is also rendered using a suitable application⁵. While the ability to render a file correctly does not guarantee that it is well-formed, incorrectly displayed content or the failure to render often indicates that the file is not well-formed. Section 5 discusses the impact of the test set as a ground-truth and as an improving factor in current JHOVE code as well as in existing automated regression processes. We conclude with section 6, highlighting possibilities for further work building on the test set described in this paper.

¹ For PDF/A, e.g.: veraPDF, Callas pdfaPilot, PDFTron, 3-Heights
² While the JHOVE framework includes a variety of validation modules, this paper limits the scope to the PDF-module. Within the context of this paper JHOVE is therefore used as a synonym for the JHOVE framework's PDF-module.
³ Syntactical and structural requirements equal JHOVE's well-formedness criteria. Please refer to section 2.1 for further discussion.

Figure 1: Basic PDF structure

2 BACKGROUND AND RELATED WORK

File format validation is a challenging task. Section 2.1 describes the motivation behind and general approach to file format validation, sections 2.2 and 2.4 illustrate how this challenge was met in the development of the JHOVE framework and the PDF-module, respectively [...]

Hence, the digital preservation community largely relies on JHOVE for validation - despite known bugs⁷. The adoption of and ongoing work on JHOVE will be introduced in section 2.3, further motivating the relevance of and urgent need for thorough regression testing and ground-truth data.

2.1 File Format Validation

File format validation is the process of checking an object's conformance to syntactic and semantic rules of the format it purports to be. As such, it is closely related to file format identification. While most pattern based identification tools such as DROID or file rely on short signatures such as magic numbers, full identification requires an analysis of the entire bit-stream and a comparison to the structure and semantics prescribed by the file format's specification [1]. To illustrate, consider the following minimal PDF code of the file minimal test.pdf:

    %PDF-1.4
    %%EOF

Minimal test.pdf is identified as PDF 1.4 by standard file format identification tools⁸. JHOVE, however, recognizes that the object is Not well-formed, indicating problems at the basic structural level of the file format which the object purports to be.

Ideally, the normative syntactic and semantic rules used to check the validity of an object are taken from the file format's authoritative specification. However, in many cases a specification may not exist or may not be publicly available. Format specifications not written within an official standardization context present another problem. These are often ambiguous and therefore open to interpretation [2]. Ambiguities in the PDF specifications published by Adobe have led to a rather broad interpretation of the file format's syntactical and semantic makeup. This, in turn, has led to PDF rendering software being forgiving towards many violations, resulting in files which are strictly speaking invalid but are still renderable and usable [2].

Format validation is usually broken down into two conformance levels - determining whether an object is well-formed and valid. The W3C Extensible Markup Language Standard [31], for example, clearly defines the constraints of a well-formed XML object. In short, a well-formed XML document must contain exactly one root element, consist of one or more correctly nested and delimited elements and follow the regulations specified for entities.
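The gap between signature-based identification and structural checking, as illustrated by the minimal test.pdf example, can be sketched in a few lines of Python. This is a rough heuristic under stated assumptions, not JHOVE's validation logic: the function names `identify_by_signature` and `missing_structural_parts` are hypothetical, the check only looks for the four structural parts named in the introduction (header, body object, cross-reference table, trailer), and it assumes a classic `xref` table rather than a cross-reference stream.

```python
import re


def identify_by_signature(data: bytes):
    """Return the PDF version found in the header signature, or None."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


def missing_structural_parts(data: bytes):
    """List the basic structural parts that cannot be found in the bytes.

    Assumption: keyword presence is used as a stand-in for real parsing,
    so an empty result does NOT prove that a file is well-formed.
    """
    checks = {
        "header": re.match(rb"%PDF-\d\.\d", data) is not None,
        # An indirect object starts with "<number> <generation> obj".
        "body object": re.search(rb"\d+\s+\d+\s+obj\b", data) is not None,
        # Keywords must start a line, so "startxref" does not count as "xref".
        "cross-reference table": re.search(rb"(?:^|[\r\n])xref\b", data) is not None,
        "trailer": re.search(rb"(?:^|[\r\n])trailer\b", data) is not None,
    }
    return [name for name, present in checks.items() if not present]


# The two-line file from the text: a valid signature, but nothing else.
minimal = b"%PDF-1.4\n%%EOF\n"
print(identify_by_signature(minimal))     # prints: 1.4
print(missing_structural_parts(minimal))  # prints: ['body object', 'cross-reference table', 'trailer']
```

A signature match alone reports PDF 1.4, while even this shallow structural pass finds three of the four required parts missing, which is consistent with a "Not well-formed" verdict for such a file.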
