Proposal for Format Adoption: JPEG2000 (ISO/IEC 15444-1:2000) for Still Image Objects in RUcore

DRAFT

Introduction

Since its inception, the mission of the Rutgers Community Repository (RUcore) has been the long-term preservation of our digital content. This commitment requires an object architecture in which we select and use stable file formats to act as digital “containers” for this content. These format containers must be widely used and well documented, must facilitate easy access, and, in the case of preservation datastreams, must not in any way degrade the content through compression methods or the introduction of noise.

With the majority of objects in RUcore based on digital-surrogate still image content (scanned photographs, scanned books and documents), the preservation datastream of choice up to now has been overwhelmingly based on the Tagged Image File Format (TIFF). TIFF has fit the criteria of an ideal preservation format, but it has a number of disadvantages. Chief among them is the typically large file size of uncompressed (or even LZW-compressed) images, which requires a great deal of mass storage to maintain. Additionally, the age of the format, which originated in the mid-1980s, has led to difficulties in revising TIFF to support embedded image metadata, a trait that more image specialists and archivists are requiring for easy indexing. Finally, the excessive size and limited utility of TIFF require us to create multiple supplementary “presentation” streams (JPG, PDF, DjVu) that are heavily compressed and optimized for internet connections so that end users can access content.

While widely used, TIFF is clearly beginning to show its age, and the digital imaging community has worked in earnest to find a successor format. Although it has been long in coming and its development controversial at times, JPEG2000 (JP2) is now reaching critical mass in the archival community. Mainstream software vendors now readily support the standard, and numerous plug-ins exist to permit viewing of JP2 images from a web browser. The format readily accepts various embedded metadata fields, and Flash-enabled viewers permit us to present JP2 images in ways that rival our currently available PDF and DjVu viewers. Most importantly, JP2 incorporates sophisticated compression techniques that permit us to compress image files significantly without loss or degradation, and to a greater degree than previously available methods.

These recent developments indicate the need for RUcore to begin adopting, and ultimately migrating to, a new image standard for the preservation of still images. This proposal recommends a shift from TIFF to the JPEG2000 container format.

Why Switch? Advantages of migrating to JP2

1. Adoption mandatory for future grant eligibility
Potential RUcore partners have begun to encounter grants where JPEG2000 files are a required deliverable. Specifically, a number of grants sponsored by the National Endowment for the Humanities (NEH) in conjunction with the Library of Congress require that still images and document scans be stored in both TIFF 6.0 and JPEG2000 lossless formats, with embedded XMP metadata.1 It is expected that an increasing number of grant underwriters will require similarly exacting standards in order for digital projects to be eligible for funding. To meet the moderate-to-long-term needs of repository partners seeking grant funding, RUcore will need some form of JPEG2000 support built in.

1 National Endowment for the Humanities Digital Newspaper Program Guidelines (http://www.neh.gov/grants/guidelines/ndnp.html)

Beard, I, - 10/17/07 RUcore JPEG200 Proposal

1. Reduced file size
The most tangible benefit of JP2 migration would be a significant reduction in the disk space required to store preservation datastreams. Real-world tests have indicated that even very basic JP2 lossless conversions yield an average 20% reduction in file size.2 Tests on several image samples picked from within RUcore, using more processor-intensive techniques, suggested that average savings of up to 30-40% in disk space could be attained merely by converting the existing preservation datastreams to lossless JP2 format.

NJDH - Hoboken Historical Photos: “First National Bank Interior”
  TIFF file size (LZW compressed): 5.4MB
  JP2 lossless file size:          3.2MB
  Disk space savings:              40.74%

2. Scalable decompression
With the use of newly developed presentation tools, it may be possible for a Derivative Master JP2 file to be used as a presentation stream as well. JP2 permits a “lossy” decompression of an image to be scaled based on a user’s preference and the speed of the user’s internet connection. The faster the connection (or the longer a user with a slow connection is willing to wait), the more detail can be “pushed” to the user. This gives users the freedom to see an image at as coarse or as detailed a resolution as they choose.

3. Tiled viewing and enhanced viewer tools for users
The same features in JP2 that permit scalable decompression also permit enhanced viewing capabilities for end users. Vendors have developed streamlined interfaces that incorporate JP2 technology and do not require the installation of plug-ins beyond those available by default in most web browsers. Zoom in/out and multidirectional scrolling can be built into the web page without having to instruct users to install additional software.

Examples of items 2 and 3 have recently been demonstrated by Northwestern University. The demos can be found at: http://digital.library.northwestern.edu/imageviewer/.
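As a rough illustration of scalable decompression, the sketch below assumes the usual JPEG2000 behavior in which each discarded resolution level halves the image dimensions (rounding up). The function names, the default of five available levels, the sample image dimensions and the 1024-pixel viewport are hypothetical and chosen only for illustration; they do not describe any particular viewer.

```python
import math

def dimensions_at_level(width, height, discard_levels):
    """Pixel dimensions after discarding `discard_levels` resolution levels."""
    scale = 2 ** discard_levels
    return math.ceil(width / scale), math.ceil(height / scale)

def levels_to_discard(width, height, viewport_px, available_levels=5):
    """Pick the smallest decoded image whose longest side still fills the viewport.

    `available_levels` is an assumed encoder setting, not a RUcore value.
    """
    chosen = 0
    for level in range(1, available_levels + 1):
        w, h = dimensions_at_level(width, height, level)
        if max(w, h) < viewport_px:
            break  # any further reduction would be smaller than the viewport
        chosen = level
    return chosen

# Hypothetical 6000 x 4000 scan shown in a 1024-pixel viewer.
level = levels_to_discard(6000, 4000, 1024)
print(level, dimensions_at_level(6000, 4000, level))
```

A user who zooms in would simply be served fewer discarded levels, which is the sense in which more detail is “pushed” on request.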

4. Streamlining of presentation formats, and additional reduction of storage overhead
Items 2 and 3 also imply that, with the proper deployment of a JP2 viewer, it may be possible to streamline the number of presentation formats offered to site visitors. Currently, non-OCR still image objects are presented in three different presentation formats (JPG, DjVu, PDF). The effective implementation of a JP2 viewer could eliminate the need to generate and store these datastreams for still image objects that do not require OCR. Users would benefit from a simple, unified way to view these objects, while RUcore would benefit from additional resource savings.
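The storage savings quoted for the Hoboken sample image (5.4MB as LZW-compressed TIFF, 3.2MB as lossless JP2) can be checked with a short calculation. This is a minimal sketch; the function name is hypothetical.

```python
def savings_percent(original_mb, converted_mb):
    """Percentage of disk space saved by replacing the original file."""
    if original_mb <= 0:
        raise ValueError("original size must be positive")
    return (original_mb - converted_mb) / original_mb * 100.0

# Sample figures from the proposal: LZW-compressed TIFF vs. lossless JP2.
print(round(savings_percent(5.4, 3.2), 2))  # 40.74
```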


2 Fasolt, Al. “Trim those important images down to manageable file size with JPEG2000.” Technofile, March 20, 2005.

Drawbacks and pre-requisites to JP2

1. Significant modifications and development time required
A substantial amount of modification will be required to RUcore’s existing structure. In particular, changes will be necessary to the Workflow Management System (WMS) pipeline that processes still images. Ultimately, the WMS pipeline must both accept submitted JP2s and convert submitted TIFF images to JP2 format. Additionally, the final phases of JP2 adoption will require changes in object architectures for certain still image objects, and a JP2 viewer must be implemented for end users to reap the benefits. While a gradual phasing in of the JP2 standard should be possible, it is clear that a significant commitment of resources will be required to complete the process. This migration will likely require a gradual restructuring that spans three or more upcoming software releases.

2. Some still image objects will not immediately benefit
Currently, there is no widely available engine for Optical Character Recognition (OCR) and text extraction of JP2 images bearing text. As a result, our existing pipeline (dependent on LizardTech’s DjVu platform) will need to remain in place for objects requiring OCR for the foreseeable future. Additionally, we will need to continue presenting such objects (including multipage book objects) in the existing PDF and DjVu formats. Finally, the existing pipeline will need to be modified so that images with text submitted in JP2 format can be temporarily down-converted to TIFF. This way, user-submitted documents in JP2 can still be OCR’ed and presented as PDF/DjVu.

Additional objects not (or minimally) impacted by JP2 migration
Born-digital documents, such as ETDs and Faculty Submissions, will not be part of the JP2 migration process and should not be affected. Such items are already highly compacted and well preserved in their respective born-digital formats, so no benefit is gained from converting their preservation datastreams to JP2. Born-digital photographs, such as Digital Negatives (DNG), will also reap no benefit at this time from converting their preservation datastream to JP2 format, since DNGs incorporate their own advanced compression. However, users would benefit from seeing a presentation copy of these images as a JP2.

3. Capital spending is a likely requirement for full commitment
A number of freely available, open-source toolkits facilitate basic JP2 rendering. However, the full implementation of a JP2 viewer will require the use of software packages sold by commercial vendors. Such sophisticated viewers have been developed by Pegasus Imaging Corporation (www.pegasusimaging.com) and Aware, Inc. (www.aware.com). Funding sources will need to be found to purchase licenses for the appropriate toolkits. Additional server hardware may be required as well.


Proposed strategy for adoption and migration

Given the significant research, development and lead time required to fully commit to a JP2 migration, the proposed strategy is a gradual phase-in of features and capabilities, as follows:

Phase              Timeline       Summary
Phase One          6-12 months    Initial acceptance of user-supplied JP2 for ARCH MASTER only. DMASTER is still TIFF format.
Phase Two          12-18 months   Pipeline retrofit. Converts from JP2 to TIFF, and TIFF to JP2. MASTER and DMASTERs are all stored as JP2. DjVu/PDF/JPGs remain.
Phase Three        18-36 months   Full commitment. JP2 replaces DjVu/PDF/JPG for non-OCR images. JP2 viewer is implemented.
RETRO Conversion   (TBD)          Conversion of all non-OCR objects ingested before Phase Three to JP2 architecture.
FUTURE OCR         (TBD)          Research and implement OCR when the available technology becomes viable.

Phase One: Initial JP2 Acceptance
The first phase is expected to require only minor changes to WMS, while we prepare for the more involved modifications down the road. In Phase One, only the ARCH MASTER datastream is accepted as a user-generated lossless JP2. Submitting a JP2 during this phase will require that the WMS user also submit a DMASTER in TIFF format, so that the WMS can continue to process files and generate our current presentation formats. This mode should be similar in nature to the procedures developed for preserving Digital Negatives (DNG). Implementation of this phase will require that WMS accept MASTER files with a JP2, JPX or JPF extension and the MIME type “image/jp2.” There will be no modifications to the pipeline at this time, and provided a DMASTER in TIFF format is supplied, image processing should proceed normally.
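The Phase One acceptance rule could be sketched as a simple submission check. This is only an illustration under stated assumptions: the function name, the constants and the wording of the messages are hypothetical and are not part of the existing WMS code. The accepted extensions and MIME type are those named above.

```python
import os

ACCEPTED_JP2_EXTENSIONS = {".jp2", ".jpx", ".jpf"}  # extensions named in the proposal
JP2_MIME_TYPE = "image/jp2"                         # MIME type named in the proposal
TIFF_EXTENSIONS = {".tif", ".tiff"}

def check_jp2_master_submission(master_name, master_mime, dmaster_name=None):
    """Return a list of problems; an empty list means the submission is acceptable."""
    problems = []
    master_ext = os.path.splitext(master_name)[1].lower()
    if master_ext not in ACCEPTED_JP2_EXTENSIONS:
        problems.append("MASTER must have a JP2, JPX or JPF extension")
    if master_mime != JP2_MIME_TYPE:
        problems.append("MASTER must have MIME type image/jp2")
    # Phase One still needs a TIFF DMASTER to drive the unchanged pipeline.
    if dmaster_name is None:
        problems.append("a DMASTER in TIFF format must accompany a JP2 MASTER")
    elif os.path.splitext(dmaster_name)[1].lower() not in TIFF_EXTENSIONS:
        problems.append("DMASTER must be in TIFF format")
    return problems

print(check_jp2_master_submission("bank_interior.jp2", "image/jp2", "bank_interior.tif"))
print(check_jp2_master_submission("bank_interior.jp2", "image/jp2"))
```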

Phase Two: Pipeline Retrofit for JP2 <-> TIFF
The second phase will entail retrofitting the existing pipeline with an open-source SDK that permits the conversion of TIFF files to JP2, and vice versa. The goal of this phase is two-fold:
1. WMS users will have the option of submitting either TIFF files or user-generated lossless JP2s.
2. Regardless of the user’s choice, the pipeline will use TIFF files to generate presentation files, but will store JP2 files as MASTER (and, when required, TIFF as DMASTER).

To implement this phase, developers will need to use an SDK (open-source packages include JasPer and the J2K toolkits) to enable our current WMS pipeline to handle the JP2 format.

The expected pipeline behavior in Phase Two would be as follows:

- If a WMS user submits a JP2 MASTER/DMASTER file, the pipeline retains the originals and uses them for the ARCH datastream. From the DMASTER, the pipeline creates an intermediary TIFF in order to generate a JPG, PDF and DjVu. Unless retention of the TIFF is required (e.g., to satisfy grant requirements for a particular collection), the TIFF is discarded.

- If a WMS user submits a TIFF MASTER and/or DMASTER, the pipeline uses the originals to generate a lossless JP2, JPG, PDF and DjVu. Unless retention of the TIFF(s) is required (e.g., to satisfy grant requirements for a particular collection), the TIFF(s) are discarded, and the JP2 files are used as the MASTER and DMASTER.
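The two Phase Two cases above can be summarized in a small planning function. This is a sketch of the described behavior only; the class, function and field names are hypothetical, and the actual conversion step (for example through one of the SDKs mentioned above) is deliberately left out.

```python
from dataclasses import dataclass

PRESENTATION_FORMATS = ("JPG", "PDF", "DjVu")

@dataclass
class PhaseTwoPlan:
    archive_format: str   # format stored as MASTER/DMASTER
    conversion: tuple     # (source format, target format) the pipeline must perform
    presentation: tuple   # presentation files generated from the TIFF
    keep_tiff: bool       # True only when retention is required, e.g. by a grant

def plan_phase_two(submitted_format, retain_tiff=False):
    """Describe what the pipeline does with a submitted MASTER/DMASTER."""
    fmt = submitted_format.upper()
    if fmt == "JP2":
        conversion = ("JP2", "TIFF")   # intermediary TIFF made from the DMASTER
    elif fmt == "TIFF":
        conversion = ("TIFF", "JP2")   # lossless JP2 made from the originals
    else:
        raise ValueError("Phase Two accepts only JP2 or TIFF submissions")
    return PhaseTwoPlan("JP2", conversion, PRESENTATION_FORMATS, retain_tiff)

print(plan_phase_two("tiff"))
print(plan_phase_two("JP2", retain_tiff=True))
```

In both cases the archived format is JP2 and the presentation files are still produced from a TIFF; only the direction of the conversion differs.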

Phase Three: Full Commitment
Phase Three will bring the most comprehensive and dramatic changes to WMS and RUcore. Whereas JP2 is used only for preservation datastreams in the first two phases, this final phase will use the same JP2 images for presentation to end users as well.

Implementation of Phase Three will most likely require the research and selection of commercially available image server tools for JP2 images. Once it is selected and purchased, this package will replace the current JPG, PDF and DjVu datastreams with an embedded JP2 viewer. The expected behavior will be as follows:

1. A WMS user submits either a JP2 or TIFF file as a MASTER and/or DMASTER.
2. Based on user input, WMS determines whether the object requires OCR.
   o If the object requires OCR, the WMS pipeline continues operation as in Phase Two.
   o If no OCR is required, the pipeline converts all files to JP2 and stores them as a unified datastream for preservation AND presentation. TIFFs are generated or retained only if the user indicates it is required (e.g., to satisfy grant requirements for a particular collection).

3. In the public RUcore interface or partner portal, site visitors are presented JP2 datastreams in an embedded viewer. Zoom, magnify and scroll functions are available to the site visitor for these objects.
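The Phase Three decision in step 2 amounts to a single branch on the OCR requirement. The sketch below is illustrative only; the function name and the dictionary keys are hypothetical.

```python
def route_phase_three(requires_ocr, retain_tiff=False):
    """Return the datastreams produced for one submitted object."""
    if requires_ocr:
        # OCR objects keep the Phase Two behavior and the existing presentation formats.
        return {
            "pipeline": "phase_two",
            "preservation": "JP2",
            "presentation": ["JPG", "PDF", "DjVu"],
            "keep_tiff": retain_tiff,
        }
    # Non-OCR objects use one JP2 datastream for preservation and presentation.
    return {
        "pipeline": "phase_three",
        "preservation": "JP2",
        "presentation": ["JP2"],
        "keep_tiff": retain_tiff,
    }

print(route_phase_three(requires_ocr=False))
print(route_phase_three(requires_ocr=True, retain_tiff=True))
```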

Future phases: Retro-convert and OCR
Once deployment of Phase Three is complete, and after all governing committees have evaluated and are comfortable with the results, we may choose to develop methods to reconcile previously ingested objects. It is plausible that in the long term TIFF may fade as a widely used format, and we will be faced with the challenge of converting existing TIFF images to JP2 to retain their viability for preservation purposes. It would be prudent to prepare for that eventuality and develop a method to convert en masse the numerous objects that exist solely as TIFFs. Additionally, we should be prepared to explore OCR solutions that are compatible with the JP2 format. Once a viable solution becomes available, it would be wise to develop a timeline for its inclusion into WMS, so that complex book documents and similar objects can also benefit from end-to-end JP2 deployment.
