DIEPER LB-5632

Deliverable 13 Survey of current methodology in image capturing and document management

Date 08-07-99
Reference D13final/public/ABC, 29 pages
Produced by ABC Datenservice GmbH for UNIGOE
Workpackage 4, supervised by UBG
Distribution list All DIEPER partners
Contact person Reinhard Ecker, Am Wasserturm 6, D-60435 Frankfurt am Main, + 49 69 954031-30, + 49 69 954031-12, [email protected]

DIEPER Project: Deliverable 13 Survey of current methodology in image capturing and document management Version: Final Page 2 Date: 08.07.1999

Document history

Versions

Version  Date      Author    Comments
1        20/12/98  R. Ecker  Preliminary Draft of D13
2        16/02/99  R. Ecker  Draft 2 of D 13
3        09/07/99  R. Ecker  Final version of D 13

Updates Chapter Description of modifications Version

1. Introduction
2. Scanning
   Kind of printed materials
   Kind of intended use and further processing of the digitised materials
   Kind of intended access to the digitised materials
   Image processing
   Image compression
   Versions of image files for different applications
3. Indexing
   Categories of indexing
   Document identifier
   Document structure
4. Methods of full text + meta data capturing
   Manual text capturing
   Text capturing by OCR / ICR
   Download of catalogue data
5. Document storage
   Document storage formats
   Digital master file
   Application file formats
   Self-describing image files
   Storage media
6. Document management
   Electronic archiving and document management systems
   Basic functions of archiving and document management systems
   Document storage
   Document retrieval
   Document visualisation and reproduction
   Maintenance and administration
   Existing archiving and document management systems for digital libraries
   Online library catalogue software systems
   Local solutions
7. Relevant Standards
8. References, URLs etc.
Appendix: Dieper Questionnaire

1. Introduction

This document describes the current status of image capturing and document management methods, especially with respect to library documents. Some years ago several libraries started to retro-digitise printed materials, e.g. books or periodicals, and to distribute these digital documents via the Internet or other networks to users all over the world. We are now at the beginning of a development which will, as we hope, make all important information available immediately from anywhere and at any time. This new kind of information access has already met with enough enthusiasm from its users to act as a catalyst for starting additional projects. It is expected that information behaviour will be influenced considerably by direct access to digitised documents. Libraries, traditionally one of the significant groups of conventional information providers, will identify and use this chance to take on a leading role in the digital information society as well. The goal of the DIEPER project is to enhance these developments with respect to the digitisation, indexing and presentation of scientific periodicals.

This report gives an overview of the current state of methodology for the digitisation of printed library materials and for the electronic storage and administration of digital documents. In addition, a short overview is given of indexing and of the capturing of full text and meta data. (Deliverable 16, which is to be prepared later, will give more details on these items.) A list of relevant standards and a technical glossary of relevant terms are added, together with some references. The appendix to this report presents the results of a survey (“Dieper Questionnaire”) investigating the current methodology in image capturing and document management at the project partners and selected European libraries.

2. Scanning

To make a printed document available via the Internet it has to be converted into an electronic format. One of the first difficult issues that must be addressed in any digital conversion project is the selection of appropriate formats and technologies for storage, display and distribution of the material. A second question is which file format (images, PDF, SGML, HTML, etc.) should be used to deliver the content, and a third is whether to store and deliver the materials as images or as text. Given the technology available to web browsers, the most accurate way to replicate completely the originally published material, which is full of special characters, foreign languages, mathematical symbols, charts and pictures, is with scanned images. In addition, Optical Character Recognition software can be used to build a corresponding text file that allows the user to search the full text of the journals in the database. These (uncorrected) OCR text files should, however, not be made available to users. A distinction should be made between coded and non-coded information:

                  Coded information        Non-coded information
Files             Text                     Image
Capturing         Manually input, OCR/ICR  Scanning
Editing           Text editor              Pixel editor
Direct retrieval  Yes                      No

Basic scanning parameters

· Kind of the original document (printed text on paper, printed image, photograph, colour, microfilm, microfiche, ...)
· Size of the original document (micro form, A 4 to A 0)
· Scanning resolution (100 dpi, 300 dpi, 400 dpi, 600 dpi, ...)
· Image depth (pixel information: 1 bit, 8 bit, 12 bit, 3 x 12 bit, ...)
· Intended exploitation of the digital materials
· File size of the digital materials
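The parameters listed above interact: page size, scanning resolution and image depth together determine the uncompressed file size. The following is a minimal sketch of that arithmetic, not part of the original survey; the function name is hypothetical, the A4 dimensions (210 mm x 297 mm) are an assumed example, and row padding and file headers are ignored.

```python
def uncompressed_size_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Estimate the uncompressed raster size of one scanned page in bytes."""
    mm_per_inch = 25.4
    # Convert the physical page size to pixel dimensions at the given resolution.
    width_px = round(width_mm / mm_per_inch * dpi)
    height_px = round(height_mm / mm_per_inch * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return (total_bits + 7) // 8


# Assumed example: an A4 page at several of the resolutions and depths listed above.
for dpi, depth in [(300, 1), (600, 1), (300, 8), (300, 24)]:
    size = uncompressed_size_bytes(210, 297, dpi, depth)
    print(f"{dpi} dpi, {depth} bit: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the pixel count, which is why resolution and image depth have to be balanced against the intended use and the acceptable file size.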

Criteria for the definition of scanning parameters

The definition of scanning parameters depends on the
· kind of the printed materials
· kind of intended use
· kind of intended access

Kind of printed materials

Paper based materials

· Bound volumes
· Single sheets of paper
· Maps
· Library catalogue cards
· etc.

· One side – double sided

· Usual book format (~ A 4, A 5 ..)
· Small size
· Large size (A 0 or larger)

· Text
· Graphics
· Halftone
· Colour

Microfilm based materials

· Microfilm
· Microfiche
· Slides
· Professional reprofilm

Other materials

· 3-D objects
· etc.

Kind of intended use and further processing of the digitised materials

Presentation on the screen

Reproduction by a printer

· Local print of a small number of document pages
· Reprint of the complete document
· Reprint in professional quality
· Reprint of coloured posters in optimum true colour quality
· etc.

Further processing of documents

· Automatic OCR conversion to full text
· Automatic vectorisation of graphical information
· Automatic analysis of the document type
· Production of a CD-ROM
· etc.

Kind of intended access to the digitised materials

· Via Internet/Intranet
· Local access within the premises of the library

Categories of scanners

· Flat bed scanners
· Flat bed scanners with automatic feeders
· Camera scanners
· Specialised book scanners
· Microfilm scanners
· Other specialised scanners (x-ray, 3-D objects, ...)

· Black/white scanners
· Greyscale scanners
· Colour scanners

· Digital resolution (dpi)

Image processing

· Clipping: Separation of double pages into single pages (book scanners)
· Despeckle: Cleaning of the image from dirt (e.g. from marks caused by mould) by deleting single spots and grey background
· Deskew: Rotation of the image to the vertical
· Black border removal: Removal of the black border or any black background
· Contrast enhancement: Enhancement and addition (tracing) of lines in the document
· Level reducing: Reducing the pixel information (number of grey or colour levels)
· Resolution reducing: Reducing the image resolution (pixel resolution)
· Scaling to original size: Scaling the image file to the original size of the paper document
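Two of the steps listed above, level reducing and resolution reducing, can be illustrated on a small greyscale raster. This is a minimal sketch under stated assumptions: the image is a plain list of rows of 8-bit grey values, the function names are hypothetical, and block averaging is only one of several possible downsampling methods.

```python
def reduce_levels(image, levels):
    """Quantise 8-bit grey values (0-255) to the given number of levels."""
    # Integer arithmetic maps 0..255 onto 0..levels-1.
    return [[value * levels // 256 for value in row] for row in image]


def reduce_resolution(image, factor):
    """Downsample by averaging non-overlapping factor x factor pixel blocks."""
    height = len(image) // factor
    width = len(image[0]) // factor
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            block = [
                image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        result.append(row)
    return result


# Assumed sample raster: 2 rows of 4 grey values.
sample = [[0, 64, 128, 255], [0, 64, 128, 255]]
print(reduce_levels(sample, 2))      # two grey levels (bitonal)
print(reduce_resolution(sample, 2))  # half the resolution in each direction
```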

Image compression

Image compression reduces the size of the image file. There is a limit to the size of file one can expect a user to download over network links. Because of that limitation, it may not be possible to offer very high resolution colour or greyscale images.

Lossless compression

· CCITT G4 T6
· LZW
· etc.

Lossy compression

· JPEG
· Wavelet
· Fractal compression (synchronous or asynchronous)
· etc.

Versions of image files for different applications

· Digital master file
· Archive file
· Screen presentation
· Gallery / Thumbnail
· Local print
· Download format

The following tables give an overall view on the different formats and recommended parameters.

Text, Line-graphics

Version             Resolution         Format         Image depth
Scanning            (300)/400/600 dpi                 1 Bit
Storage                                TIFF/CCITT G4  1 Bit
Viewing             70-120 dpi         GIF            1-4 Bit
Gallery/Thumbnails  15 dpi             GIF            1 Bit
Download            300/400/600 dpi    PDF            1 Bit

If an OCR conversion is intended, scanning should be done with a resolution of 600 dpi.

Grey-scale graphics, Photographs

Version             Resolution / size     Format             Image depth
Scanning            300 dpi                                  8 Bit
Storage                                   TIFF uncompressed  8 Bit
Viewing             512x768 to 1024x1536  JPEG               4 Bit
Gallery/Thumbnails  ~ 100x150             JPEG               4 Bit
Download            2048x3072             JPEG               8 Bit

Manuscripts

Version             Resolution / size     Format             Image depth
Scanning            300 dpi                                  8 Bit
Storage                                   TIFF uncompressed  8 Bit
Viewing             512x768 to 1024x1536  JPEG               1-4 Bit
Gallery/Thumbnails  ~ 100x150             JPEG               < 8 Bit
Download            2048x3072             JPEG               8 Bit

Colour graphics

Version             Resolution / size     Format             Image depth
Scanning            200-300 dpi                              3x8 Bit
Storage                                   TIFF uncompressed  3x8 Bit
Viewing             512x768 to 1024x1536  JPEG               3x8 Bit
Gallery/Thumbnails  ~ 100x150             JPEG               8 Bit
Download            2048x3072             JPEG               3x8 Bit

2-D Representation of 3-D Objects

Version             Resolution / size     Format             Image depth
Scanning            200-300 dpi                              3x8 Bit
Storage                                   TIFF uncompressed  3x8 Bit
Viewing             512x768 to 1024x1536  JPEG               3x8 Bit
Gallery/Thumbnails  ~ 100x150             JPEG               8 Bit
Download            2048x3072             JPEG               3x8 Bit

Acceptable compression for JPEG files
Grey scale: maximum 10:1
Colour: maximum 15:1
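The maximum compression ratios above imply a lower bound on the size of a JPEG derivative. The following sketch works this out for the download size of 2048x3072 pixels given in the tables; the function and dictionary names are hypothetical, and real JPEG file sizes depend on image content and encoder settings, so this is only an estimate.

```python
# Maximum acceptable JPEG compression ratios quoted in the text.
MAX_JPEG_RATIO = {"grey": 10, "colour": 15}


def min_jpeg_size_bytes(width_px, height_px, bits_per_pixel, kind):
    """Smallest acceptable JPEG size if the maximum ratio is not to be exceeded."""
    raw_bytes = width_px * height_px * bits_per_pixel // 8
    return raw_bytes // MAX_JPEG_RATIO[kind]


# Download versions from the tables: 2048x3072 pixels, 8 bit grey or 3x8 bit colour.
grey = min_jpeg_size_bytes(2048, 3072, 8, "grey")
colour = min_jpeg_size_bytes(2048, 3072, 24, "colour")
print(f"grey scale: at least {grey / 1024:.0f} KB")
print(f"colour:     at least {colour / 1024:.0f} KB")
```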

3. Indexing

For more details see Deliverable 16.

Categories of indexing

· Bibliographic indexing
· Document identifier
· Document structure (SGML or XML representation of the document structure)
· Full text (complete full text or in part)

Bibliographic indexing

· Bibliographic data
· Catalogue data sets
· Dublin Core data sets
· Storage of bibliographic data in the TIFF header
· Storage of bibliographic data in the TEI header
· etc.

Document identifier

· DOI
· SICI
· URL
· PURL
· URN
· etc.

Document structure

· SGML/XML representation
· Ebind format
· Hyperlinks text ↔ image page

SGML is quite widely used for describing the structure of catalogue records. SGML is an international standard used for the formal definition of electronic text; it is thus a structure-driven meta language. HTML, for instance, is an application of SGML. The structure of an SGML set of documents is described in a single definition document referred to as the DTD, the Document Type Definition. HTML corresponds to a specific DTD, as does the Netscape browser (HTML viewer), which has its own DTD called Mozilla, a superset of HTML 2.0.
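A structural representation that links text entries to image pages could look roughly like the following sketch, which builds a small XML table of contents with Python's standard library. The element and attribute names (document, toc, entry, image) and the sample headings and file names are purely illustrative assumptions; they do not follow any DTD named in this report.

```python
import xml.etree.ElementTree as ET


def build_structure(title, entries):
    """Build an XML table of contents; entries are (heading, image file) pairs."""
    root = ET.Element("document", {"title": title})
    toc = ET.SubElement(root, "toc")
    for heading, image_file in entries:
        # Each entry carries a link from the text heading to its page image.
        item = ET.SubElement(toc, "entry", {"image": image_file})
        item.text = heading
    return root


# Made-up sample values for illustration only.
root = build_structure(
    "Sample periodical, volume 1",
    [("Preface", "00000001.tif"), ("First article", "00000005.tif")],
)
print(ET.tostring(root, encoding="unicode"))
```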

Full text (complete full text or in part)

· Complete documents
· Special parts of the document
· Summary, abstract
· Tables of contents
· Indexes
· Key words

Full text formats

· HTML
· ASCII
· .DOC
· .XLS
· TEX
· etc.

4. Methods of full text + meta data capturing

For more details see Deliverable 16.

Manual text capturing

This is often the cheapest way to convert scanned images to full text, particularly for handprint, old prints in “Fraktur” or material of low printing quality. The costs are approximately 1 EURO per 1,000 characters for “double keying” if the work is done in the Far East. The accuracy ranges from 99% (Fraktur) to 99.85% (typewritten text).

Text capturing by OCR / ICR

Optical Character Recognition is a method for the automatic conversion of scanned text pages to full text. Mid-range software (such as FineReader, Omnipage etc.) can convert well printed and well scanned material with an accuracy of 99.8% and a speed of up to 1,000 characters per minute. The price for a PC license of this software is less than 500 EUROs. It is recommended to check the text against special dictionaries (language, topic). “Intelligent” Character Recognition systems run interactive quality checks based on document structure analysis and on syntactic and semantic rules. Special software exists for handprint and Fraktur, but it must be stated that these products are not competitive with manual capturing in the Far East. Software for the automatic analysis and structured text conversion of tables of contents was recently developed by our own company. This Toccata software will be offered to the Dieper partners for testing free of charge.
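The figures quoted for manual keying and OCR can be turned into a rough per-page estimate. The sketch below is only a worked calculation under assumptions: the page length of 3,000 characters is an invented example, the function names are hypothetical, and the rates are the approximate values given in the text.

```python
def keying_cost_eur(characters, eur_per_1000=1.0):
    """Cost of manual double keying at the quoted rate per 1,000 characters."""
    return characters / 1000 * eur_per_1000


def expected_errors(characters, accuracy):
    """Expected number of wrong characters for a given accuracy (e.g. 0.998)."""
    return characters * (1 - accuracy)


def ocr_minutes(characters, chars_per_minute=1000):
    """Conversion time at the quoted OCR speed."""
    return characters / chars_per_minute


page = 3000  # assumed characters per page, for illustration only
print(f"manual keying cost:      {keying_cost_eur(page):.2f} EUR")
print(f"manual errors (99.85%):  {expected_errors(page, 0.9985):.1f}")
print(f"OCR errors (99.8%):      {expected_errors(page, 0.998):.1f}")
print(f"OCR time:                {ocr_minutes(page):.1f} min")
```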

Download of catalogue data

This is of course the easiest and cheapest way to “capture” bibliographic data of documents. Usually the catalogue data are downloaded as structured ASCII files.

5. Document storage

Document storage formats

The document format is mainly important regarding three issues:
- size of a document, which relates to the size of the required hard disk and the transfer time
- recommended exchange format as a common format between partners
- whether or not a special viewer is needed to display and print the document when retrieved

Tag(ged) Image File Format (TIFF)

TIFF is a widely supported format within the libraries community. The latest TIFF version is 6.0.

TIFF limitations: There are no provisions in TIFF for storing vector graphics and text annotation (although such items could be easily constructed using TIFF extensions). TIFF uses 4-byte integer file offsets to store image data, with the consequence that a TIFF file cannot have more than 4 Gigabytes of compressed raster data. This is not a problem for DIEPER, since this limit is far from being reached within a single document. An average document is assumed to have 10 pages, each with a compressed size of 100 KB. This makes the average size of a requested article roughly 1 MB.
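The distance between the 4-byte offset limit and the assumed average article can be checked with a short calculation. This is a sketch of the arithmetic only; it assumes that 1 KB means 1,024 bytes, and the variable names are hypothetical.

```python
OFFSET_BYTES = 4                            # TIFF stores file offsets as 4-byte integers
MAX_FILE_BYTES = 2 ** (8 * OFFSET_BYTES)    # largest addressable file size

PAGE_BYTES = 100 * 1024                     # assumed: 100 KB per compressed page
PAGES_PER_ARTICLE = 10                      # average article length from the text

article_bytes = PAGE_BYTES * PAGES_PER_ARTICLE
pages_within_limit = MAX_FILE_BYTES // PAGE_BYTES

print(f"TIFF limit:         {MAX_FILE_BYTES} bytes")
print(f"average article:    {article_bytes} bytes")
print(f"pages within limit: {pages_within_limit}")
```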

TIFF strengths: TIFF is primarily designed for raster data interchange. Its main strengths are that it is a highly flexible and platform-independent format which is supported by numerous image processing applications. Supported compression algorithms are: raw uncompressed, PackBits, LZW (Lempel-Ziv-Welch), CCITT Group 3 & 4 and JPEG compression.

Regarding the transfer time for an average document: suppose that an end-user is connected through a dial-up connection (28,800 bps). A 1 MB document then requires roughly 5 to 10 minutes to download. This seems to be acceptable to end-users compared to classical postal delivery.
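The download estimate can be reproduced with a simple calculation. The following sketch assumes 1 KB = 1,024 bytes and introduces a hypothetical efficiency factor to account for protocol overhead and line quality; the value 0.5 is an illustrative assumption, not a figure from the survey.

```python
def transfer_seconds(size_bytes, line_rate_bps, efficiency=1.0):
    """Time to transfer a file over a line, given the usable share of the line rate."""
    return size_bytes * 8 / (line_rate_bps * efficiency)


article_bytes = 10 * 100 * 1024   # 10 pages of 100 KB each
line_rate = 28_800                # dial-up modem speed from the text

best_case = transfer_seconds(article_bytes, line_rate)
half_rate = transfer_seconds(article_bytes, line_rate, efficiency=0.5)
print(f"full line rate: {best_case / 60:.1f} min")
print(f"half line rate: {half_rate / 60:.1f} min")
```

At the full line rate the article takes a little under 5 minutes; at half the rate it takes a little under 10 minutes, which is consistent with the range quoted above.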

Portable Document Format (PDF)

PDF is a file format used to represent a document independent of the application software, hardware, and operating system that were used to create it. A PDF file contains a PDF document and other supporting data. A PDF document contains one or more pages. Each page in the document may contain any combination of text, graphics, and images in a device- and resolution-independent format. This is the page description. A PDF document may also contain information possible only in an electronic representation, such as hypertext links.

PDF limitations: Printing a PDF document requires the fonts embedded in the article to be installed on the end-user's machine, and several steps are needed to convert the file to PostScript format. Pages are not necessarily stored in sequential order in the PDF file.

PDF strengths: PDF is primarily a portable format. To reduce file size, PDF supports a number of industry-standard compression filters: JPEG compression, CCITT Group 3 & 4, LZW. PDF viewers are available free of charge from Adobe and exist for all platforms: UNIX, Macintosh and PC environments. Support for hyperlinks is also a helpful feature. Editing a PDF document requires professional tools, which may help to guarantee the authenticity of the original document. HTML, by comparison, presents the content as free text that may easily be modified by a novice end-user.

PNG

The PNG format provides a portable, legally unencumbered, well-compressed, well-specified standard for lossless bitmapped image files.

Although the initial motivation for developing PNG was to replace GIF, the design provides some useful new features not available in GIF, with minimal cost to developers.

GIF features retained in PNG include:

- Indexed-color images of up to 256 colors.
- Streamability: files can be read and written serially, thus allowing the file format to be used as a communications protocol for on-the-fly generation and display of images.
- Progressive display: a suitably prepared image file can be displayed as it is received over a communications link, yielding a low-resolution image very quickly followed by gradual improvement of detail.
- Transparency: portions of the image can be marked as transparent, creating the effect of a non-rectangular image.
- Ancillary information: textual comments and other data can be stored within the image file.
- Complete hardware and platform independence.
- Effective, 100% lossless compression.

Important new features of PNG, not available in GIF, include:

- Truecolor images of up to 48 bits per pixel.
- Grayscale images of up to 16 bits per pixel.
- Full alpha channel (general transparency masks).
- Image gamma information, which supports automatic display of images with correct brightness/contrast regardless of the machines used to originate and display the image.
- Reliable, straightforward detection of file corruption.
- Faster initial presentation in progressive display mode.

GIF

Highly compressed image format. The limitation to 8 Bit (256 colours) may cause colour inhomogeneity. Suitable for screen presentation and for images in one colour.

JPEG

Compressed image format. Colour depth from 8 to 24 Bits. Lossy compression. It is possible to define the degree of information loss. Suitable for grey scale and colour images of low or medium quality.

Digital master file

The image file which results from the scanning is the so-called digital master file of the highest quality (resolution, image depth). This is the archive version, which should be stored in a standardised format with lossless compression on long-life media (WORM, CD-R, tape) in a secure place and should not be used for daily access. The preferred file format is TIFF. Printed text which has been scanned with 1 bit image depth is losslessly compressed according to the CCITT G4 T6 algorithm. Greyscale or colour images can be stored as uncompressed TIFF. Text files and other file formats are converted to TIFF in certain cases. Alternative formats may be PNG and (only for grey scale and colour) the GIF format. The formats described below are mostly derivatives of this master.

Application file formats

Archive file

This file is stored in the archive system (magnetic disc, RAID system, jukebox) for access by the user. Resolution and image quality depend on the kind of use and of user access. Formats can be Tiff, JPEG, GIF, PNG etc. File compression, even with loss of information, is possible. In addition application formats will be prepared for the screen presentation (e.g. GIF or JPEG, 75 – 100 dpi), for the download and for local printing (e.g. Postscript or PDF, 300 dpi). These application formats may also be stored in the archive system or can be prepared on-the-fly.

Screen presentation

This file can either be an additional format derived from the digital master and stored in the archive system, or a temporary file produced “on the fly” from the archive file. The same also applies to the formats below. As this file is to be displayed on the screen, the resolution corresponds to screen quality (approx. 72-150 dpi). Lossy file compression up to the range 1:10 (b/w) and 1:15 (colour) is possible. Preferred file formats are JPEG or GIF.

Gallery / Thumbnail

This is a reduced version of the screen presentation file. The goal is to give the user an overview and to help locate a specific image among a number of images. The resolution is normally in the range of 15 dpi. Extremely high and lossy compression is possible. Preferred file formats are JPEG or GIF.

Local print

This file is delivered to the user’s local printer when the print button is clicked. The resolution depends on the file size and the requirements for sufficient printing quality. Formats may be PDF, JPEG, GIF etc.

Download format

This file will be delivered to the user’s PC for local storage. The preferred format is PDF. The resolution should be a reasonable compromise between file size and quality.
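Since the application formats are derived from the digital master, their pixel dimensions follow from the ratio of target resolution to master resolution. The sketch below illustrates this scaling; the function name and the 600 dpi A4 master (4961 x 7016 pixels) are assumptions chosen for the example, and the actual resampling would be done by an image processing tool.

```python
def derivative_pixels(master_width_px, master_height_px, master_dpi, target_dpi):
    """Pixel dimensions of a derivative that shows the same page at target_dpi."""
    width = round(master_width_px * target_dpi / master_dpi)
    height = round(master_height_px * target_dpi / master_dpi)
    # A derivative must be at least one pixel in each direction.
    return max(1, width), max(1, height)


# Assumed master: an A4 page scanned at 600 dpi.
master = (4961, 7016, 600)
for name, dpi in [("screen presentation", 100), ("thumbnail", 15), ("download", 300)]:
    print(name, derivative_pixels(*master, dpi))
```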

Self-describing image files

Self-describing documents consist of two parts: The body, which contains the data, and the header with attributes describing the document and its format.

Header Format information and attributes

Body Images file (e.g. Tiff)

The working group “Technik”, established by the Deutsche Forschungsgemeinschaft for the preparation of its retro-digitisation programme for libraries, defined five TIFF header fields in addition to the existing standard fields. These additional categories are described below. For further details see: http://wwww.sub.uni-goettingen.de/ebene_2/vdf/einstieg.htm

Additional Tiff header categories

Category name     No.  Content                                           Sample
DocumentName      269  Characters 1-3: library short name;               UBG 12345678
                       characters 4-#: catalogue number
ImageDescription  270  bibliographic data, file structure                |PERIO|Journal of Mathematics|1874|Berlin|JourMath_33355577_V030_I005|
PageName          285  page count                                        00000172
(Scan)Software    305  name and version number of the scanning software  SRZ Proscan, Version 2.0
Artist            315  name (or short name) of the library               Universitätsbibliothek Graz
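The header categories above can be assembled programmatically before they are written into the image file. The following sketch only builds the field values as a dictionary keyed by category number; the function name is hypothetical, the single space between library short name and catalogue number and the eight-digit zero padding of the page count are assumptions read off the samples, and writing the values into an actual TIFF file is left to whatever imaging library is in use.

```python
def build_header_fields(library, catalogue_no, description_parts,
                        page_count, software, artist):
    """Return the five header values keyed by their TIFF category number."""
    if len(library) != 3:
        raise ValueError("library short name must have exactly 3 characters")
    return {
        269: f"{library} {catalogue_no}",              # DocumentName
        270: "|" + "|".join(description_parts) + "|",  # ImageDescription
        285: f"{page_count:08d}",                      # PageName (zero padded)
        305: software,                                 # (Scan)Software
        315: artist,                                   # Artist
    }


# Values taken from the sample column of the table.
fields = build_header_fields(
    "UBG", "12345678",
    ["PERIO", "Journal of Mathematics", "1874", "Berlin",
     "JourMath_33355577_V030_I005"],
    172, "SRZ Proscan, Version 2.0", "Universitätsbibliothek Graz",
)
for number, value in sorted(fields.items()):
    print(number, value)
```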

Storage media

For digital archive systems two alternative storage technologies exist:

· Magnetic storage
  - Magnetic disc
  - Magnetic tape
· Optical storage
  - Optical disc
  - Optical tape, card etc.
  - Holographic solid storage

Magnetic discs are preferred for data that are in permanent access. RAID systems contain several magnetic discs in one array. For large archives (> 100 Gbytes) optical disc jukeboxes are often used as a cheap mass storage device. The main disadvantage of optical disc jukeboxes is that the disc has to be inserted mechanically into the optical drive. This process takes several seconds. Magnetic tapes are sometimes used as back-up media for the master file.

There exist three categories of optical disc storage media:

· Rewritable optical discs
· WORM (Write Once Read Many)
· CD-R, DVD-R

Rewritable optical discs apply magneto-optic technology. WORMs are storage media for the permanent and irreversible archiving of image data. CD-R and DVD-R are special versions of WORMs using the CD-ROM / DVD-ROM formats. Optical discs are seen as the ideal media for long-term storage of image data. The expected physical durability of optical storage media is in the range of 100 years, which is much longer than the lifetime of the hardware and software for recording the media.

6. Document management

Document management is usually understood as the data base supported management of any kind of dynamic electronic documents during their full life cycle, from first production to permanent archiving of the final version. Document management systems are widely used for business documents. As digital library documents are, in contrast to business documents, static, the term “document management” does not really fit. It is suggested to understand “document management” in the context of retro-digitised library material as the administration of image files in a non-hierarchic system.

Electronic archiving and document management systems

Electronic archiving and document management systems are used for the permanent storage and electronic provision of images, text and meta data. Electronic archive systems administer documents and single pieces of information in data bases using a data base management system. Furthermore efficient electronic archive systems usually offer functions to manage jukeboxes for the storage (online, nearline or offline) of huge data collections on optical storage media.

Basic functions of archiving and document management systems

Archiving and document management systems should cover the following functionality.

· Image capturing and indexing
· Document storage
· Document retrieval
· Document visualisation and reproduction
· Maintenance and administration

Image capturing and Indexing

This covers scanning of analogue data (usually in paper form), import of image files and data, indexing (manual, semi-automatic or fully automatic) and preparation of protocols (logs).

Scanning

· Preparatory steps
· Inhouse scanning by the library itself
· External scanning
· Image processing and manipulation
· Image file compression

Import of image files and data

Automatic or semi-automatic import of image data from other sources. The data stream is processed according to defined rules. Any import must be traceable and free of data loss. It should be possible to stop and restart the process at every stage. During this process it is possible to change image formats and index data.

Indexing

Manual, semi-automatic or automatic indexing is possible. Index files can be copied from existing data bases. Semi-automatic indexing is possible, for instance, by preparing bar codes offline and reading the bar code information (document identifier) during scanning. The automatic methods apply OCR and document analysis technology. Index data are linked to the image file via their addresses or IDs. A special kind of index information is the representation of the document structure. The index terms of the document elements (tables of contents, headlines, categories, pagination, index, key words, ...) are hyperlinked to the corresponding image pages.
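The link between index terms and image pages via IDs can be pictured as a simple lookup table. The sketch below is only an illustration of that idea; the function name, the eight-digit image IDs and the sample terms are assumptions, and a production system would keep this information in a data base.

```python
from collections import defaultdict


def build_index(pages):
    """Map each index term to the IDs of the image pages it refers to.

    pages: dict mapping image ID -> list of index terms found on that page.
    """
    index = defaultdict(list)
    for image_id in sorted(pages):
        for term in pages[image_id]:
            # Terms are compared case-insensitively.
            index[term.lower()].append(image_id)
    return dict(index)


# Made-up sample data for illustration only.
pages = {
    "00000001": ["Table of contents"],
    "00000002": ["Introduction", "Index"],
    "00000003": ["index"],
}
index = build_index(pages)
print(index["index"])  # image pages linked to the term "index"
```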

Document storage

Storage of image files in different formats, index data, document structure information and document classification.

Document retrieval

Multiple retrieval and navigation tools (catalogue data, SGML/XML based tables of contents, indexes, lists of illustrations, full text, thumbnails, etc.). Document management systems for digital libraries must apply the web technology to give access via Internet and Intranet.

Document visualisation and reproduction

Standard viewers and browsers should be included, as well as tools for the conversion of image formats and scaling, and export tools for download, printing-on-demand and storage on CD-R.

Maintenance and administration

This includes
· set up and maintenance of the data base system
· set up and maintenance of user categories
· user administration
· statistics
· system administration and operating

Existing archiving and document management systems for digital libraries

Agora

Agora is the first complete digital library system containing all modules necessary for daily library business as described above. The main components are:

· Electronic Document Management and Administration System
· Batch import tools (archiving, meta data, utilization of the TIFF tags, etc.)
· Unique procedures to handle any kind of heterogeneous pagination systems
· Multiple retrieval and navigation tools (catalogue data, SGML/XML based tables of contents, indexes, lists of illustrations, full text, thumbnails, etc.)
· Tools for the conversion of image formats and scaling
· Export tools for printing-on-demand and storage on CD-R
· Internet/Intranet server, easy to use HTML templates

The Digital Library System was developed by the Satz-Rechen-Zentrum SRZ in collaboration with the Göttinger Digitalisierungszentrum for the storage of and access to digital and digitized documents, including their structural, bibliographic and content meta data.

IBM Digital Library

Xerox

Die Digitale Bibliothek NRW

This is mainly a common access system to several data bases and electronic documents repositories.

Bieblis

Bieblis is an electronic system for the archiving of documents (images, text files etc.) and Internet access to these documents. It is an integrated component of the IBIS library system of the University Library of Bielefeld.

Online library catalogue software systems

Several OPAC software systems offer functions for direct access to images from the catalogue data entries.

Local solutions

In addition to the systems mentioned several local solutions have been developed by libraries and universities.

7. Relevant Standards

Standard                         Year   Description
CD DA (Digital audio), Red Book  1980   Physical specification of the CD-ROM
CD ROM, Yellow Book              1983   Continuation of Red Book
CD-I (Interactive), Green Book   1986   Complete Multimedia system, ISO 9660
CD-R (Recordable), Orange Book   1995
CD-RW (Rewritable), Orange Book  1995
UDF                              1996   Universal Disc Format (DVD compatible, but not ISO 9660 compatible). Developed by the Optical Storage Technology Association; based on ISO 13346. UDF will replace ISO 9660.
ISO 9660                         1987   Standard for the CD file formats. Predecessor: High Sierra standard (HSF). Disc size: 120 mm
HSF                              1986   High Sierra Standard
ISO 9171-1                       1990   Specification of the disc format (5.25” = 130 mm)
ISO 9171-2                       1990   Specification of the writing format
ISO 10089                        1991   Specification of the disc format for MOs 130 mm
ISO 10090                        1992   Specification of the disc format for ROMs and MOs 80 mm
ISO 10091                        1995   Specification of the disc format for WORM 130 mm
ISO 10149                        1995   Specification of the disc format for CD-ROM 120 mm
ISO 10885                        1993   Specification of the disc format for 14” WORM (Kodak)
ISO 11560                        1992   Specification of the disc format for MO-WORM 130 mm
ISO 11694-4                      1996   Specification of the file structure for optical storage cards
ISO 12654                        1996   Hardware independent storage format (draft)
ISO 13403                        1995   Specification of the disc format for 12” WORM (CCS)
ISO 13481                        1993   Specification of 1 GB discs 130 mm
ISO 13549                        1993   Specification of 1.3 GB discs 130 mm
ISO 13490-1/-2                   1995   Specification of the file structure for ROMs and WORMs
ISO 13614                        1995   Specification of the disc format for 12” WORM (SSF)
ISO 13482                        1995   Specification of 2 GB discs 130 mm
ISO 14517                        1996   Specification of 2.6 GB discs 130 mm
ISO 15525                        Draft  Life Expectancy of CD ROM
ISO 8879                         1986   SGML

8. Technical Glossary and Acronyms

Access Provider: See Internet Service Provider
Alpha: A value representing a pixel's degree of transparency. The more transparent a pixel, the less it hides the background against which the image is presented. In PNG, alpha is really the degree of opacity: zero alpha represents a completely transparent pixel, maximum alpha represents a completely opaque pixel. But most people refer to alpha as providing transparency information, not opacity information, and we continue that custom here.
APDU: Application Protocol Data Unit. A unit of information transferred between a client and a server. Used in Z39.50 to define the data exchanged between a Z39.50 origin and a Z39.50 target.
API: Application Programming Interface
BC: Berne Convention
Bib-1: Bib stands for Bibliographic. Denotes the set of attributes that can be searched using Z39.50.
Bit depth: The number of bits per palette index (in indexed-colour PNGs) or per sample (in other colour types). This is the same value that appears in IHDR.
Bits per Second: Measure for the speed of data transfer through communication media.
Browser: Software for the interpretation and presentation of HTML documents.
Byte: Eight bits; also called an octet.
CANTATE: Computer Access to Notation and Text in Music Libraries. R&D project funded within the EU libraries programme.
CAS: Current Awareness Service
CCITT: Comité Consultatif International Téléphonique et Télégraphique (now ITU-T)
CERN: European centre for high energy physics (Geneva, Switzerland).
CGI: Common Gateway Interface. A technique that allows a Web server to interface to external applications such as databases.
Channel: The set of all samples of the same kind within an image; for example, all the blue samples in a true colour image. (The term "component" is also used, but not in this specification.) A sample is the intersection of a channel and a pixel.
Chromaticity: A pair of values x,y that precisely specify the hue, though not the absolute brightness, of a perceived colour.
Chunk: A section of a PNG file. Each chunk has a type indicated by its chunk type name. Most types of chunks also include some data. The format and meaning of the data within the chunk are determined by the type name.
CI data: Coded information; e.g. text files.
Composite: As a verb, to form an image by merging a foreground image and a background image, using transparency information to determine where the background should be visible. The foreground image is said to be "composited against" the background.
CRC: Cyclic Redundancy Check. A CRC is a type of check value designed to catch most transmission errors. A decoder calculates the CRC for the received data and compares it to the CRC that the encoder calculated, which is appended to the data. A mismatch indicates that the data was corrupted in transit.
Critical chunk: A chunk that must be understood and processed by the decoder in order to produce a meaningful image from a PNG file.
DBMS: Database Management System
DECOMATE: Delivery of Copyright Materials to End-users. R&D project funded within the EU libraries programme.
Datastream: A sequence of bytes. This term is used rather than "file" to describe a byte sequence that is only a portion of a file. We also use it to emphasise that a PNG image might be generated and consumed "on the fly", never appearing in a stored file at all.
Deflate: The name of the compression algorithm used in standard PNG files, as well as in zip, gzip, pkzip, and other compression programs. Deflate is a member of the LZ77 family of compression methods.
Document Provider: Organisation which provides online access to primary electronic material.
Document Server: Server that handles the secure electronic transmission of documents to the end user.
DOI: Digital Object Identifier
Download: The transfer of data (documents) from a server to a local computer.
DTD: Document Type Definition. A DTD describes an SGML document. For example, Mozilla is known as Netscape's DTD.
DVD: Digital Versatile Disc
DVD ROM: Read Only DVD
DVD-R: Recordable DVD (WORM technology)
DVD RAM: Erasable DVD
ECMS: Electronic Copyright Management System
ECUP, ECUP+: European Copyright User Platform.
EDD: Electronic Document Delivery. Generic term which involves the identification of the user, the search for bibliographic references and the requesting of documents (SOD or online delivery).
EDIFACT: Electronic Data Interchange For Administration, Commerce and Transport
EDIL: Electronic Document Interchange between Libraries. European project terminated on Dec. 31, 1995.
ELITE-Project: Electronic Library Teleservices. European project (1996-1997).
ELITE Service Provider: Organisation which manages the LEAS.
EUROPAGATE: European project that aims at interoperability of bibliographic catalogue systems using the search and retrieve protocols (Z39.50 / SR).
FASTDOC: Fast Document Ordering and Delivery. R&D project funded within the EU libraries programme (1994-1996).

Firewall: See Security Firewall
FTP: a) File Transfer Protocol. Defines how files are transferred from one computer to another; a reliable file transfer protocol used on top of the TCP/IP stack. b) Software for transferring files using the File Transfer Protocol.
Filter: A transformation applied to image data in hopes of improving its compressibility. PNG uses only lossless (reversible) filter algorithms.
GEDI: Group on Electronic Document Interchange. GEDI is a TLV format that defines how to encapsulate a TIFF article.
GIF: Graphics Interchange Format. File format for graphics, developed by CompuServe, Inc. GIF allows inline graphics to be included in HTML documents. See also XBM.
Greyscale: An image representation in which each pixel is represented by a single sample value representing overall luminance (on a scale from black to white). PNG also permits an alpha sample to be stored for each pixel of a greyscale image.
GUI: Graphical User Interface
GURL: Golden URL, a technique used in the WebDOC project in order to point to online documents.
HTML: HyperText Markup Language. The language that describes the contents of Web pages. HTML is derived from ISO/SGML using a specific DTD. See also SGML.
HTTP: HyperText Transfer Protocol.
HTTP / httpd: HyperText Transfer Protocol Daemon. This is the World Wide Web server.
Hyperlink: See Link
Hypermedia: See Hypertext
Hypertext: A document which contains links to other documents.
IAB: See Internet Architecture Board
IAS: Individual Article Supply
ICR: Intelligent Character Recognition
IDF: International DOI Foundation
IETF: See Internet Engineering Task Force
IMPRIMATUR: Intellectual Multimedia Property Right Model and Terminology for Universal Reference (EC Project)
Indexed colour: An image representation in which each pixel is represented by a single sample that is an index into a palette or lookup table. The selected palette entry defines the actual colour of the pixel.
Inline-Image: Graphic as part of a hypertext document. See also Linked Image.
Internet: a) In general, a number of single networks which operate together like one big network. b) The worldwide network of networks.

Internet Architecture Board (IAB): Committee for standardisation and other important decisions for the Internet.
Internet Engineering Task Force (IETF): Committee for the analysis and resolution of technical problems with respect to the Internet. The members of IETF report to the Internet Architecture Board (IAB).
Internet-Resources: Any kind of information accessible via the Internet.
Internet Service Provider (ISP): An organisation which offers Internet connections.
IPR: Intellectual Property Right
ISDN: Integrated Services Digital Network. The natural evolution of the PSTN towards a fully digital network. Allows data transfer speeds up to 64 KBPS for a basic rate interface (BRI).
ISP: See Internet Service Provider
JPEG: Joint Photographic Experts Group. Image compression standard.
Link: Reference to another document. If the link is used, the corresponding document will be loaded.
Lossless compression: Any method of data compression that guarantees the original data can be reconstructed exactly, bit-for-bit.
Lossy compression: Any method of data compression that reconstructs the original data approximately, rather than exactly.
Luminance: Perceived brightness, or greyscale level, of a colour. Luminance and chromaticity together fully define a perceived colour.
LZW: Lempel-Ziv-Welch compression algorithm, used in image file formats.
MARC: Machine Readable Cataloguing. MARC is an exchange format used to import/export bibliographic records, e.g. UNIMARC, USMARC.
Metadata: Data to describe the documents (e.g. library catalogue data). Metadata should guarantee a unique identification key for each document. Activities for standardisation: Dublin Core Set, Warwick Framework, PURL concept of OCLC.
MIME: Multipurpose Internet Mail Extensions. Enhancement that overcomes the 7-bit limitation of SMTP. Handles all types of content in e-mail (e.g. images).
MUMLIB: Multimedia Methodology in Libraries
MURIEL: Multimedia Education System for Librarians Introducing Remote Interactive Processing of Electronic Documents
NCI data: Non-coded information; e.g. image files.
NIST: National Institute of Standards and Technology
NLC: National Library of Canada. Implementor of CanSearch, a Z39.50 origin.
OCLC: On-line Computer Library Centre, Dublin, Ohio.
OCR: Optical Character Recognition
ONE: OPAC Network in Europe. European project that aims at investigating and evaluating Z39.50 implementations and search and retrieval APIs.
OPAC: On-line Public Access Catalogue

OSTA: Optical Storage Technology Association. World-wide association of producers of optical storage systems (represents > 70% of the market). Specification of the UDF based on ISO 13346 (1996).
Palette: The set of colours available in an indexed-colour image. In PNG, a palette is an array of colours defined by red, green, and blue samples.
Pixel: The information stored for a single grid point in the image. The complete image is a rectangular array of pixels.
PDF: Portable Document Format. PDF is the native format of Adobe Acrobat.
Perl: Practical Extraction and Report Language. An interpreted language, mostly used to implement CGI scripts for Web servers.
PIN-code: Personal Identification Number code, usually used to authenticate an end-user.
PNG: Portable Network Graphics. A standard format for lossless bitmapped image files. The intention was to replace GIF.
PNG editor: A program that modifies a PNG file and preserves ancillary information, including chunks that it does not recognise. Such a program must obey the rules given in the Chunk Ordering Rules of the PNG specification.
PSTN: Public Switched Telephone Network. Denotes the plain old telephone infrastructure. Allows data transfer speeds up to 28,800 BPS.
PURL: Persistent URL.
RAID: Redundant Array of Inexpensive Discs. Hard disc systems that store data at different security levels. The five-level system was defined at the University of California, Berkeley in 1987. The RAID 7 architecture (7 levels) gives several hosts access to one array system.
RAMA/CHIO: Remote Access to Museum Archives is a European project managed by Telis which aimed at interconnecting museum archives in Europe. Cultural Heritage Information On-line is a North American initiative identical to RAMA.
RPN: Reverse Polish Notation. A basic query language used in Z39.50 to issue search requests.
RRO: Reproduction Rights Organisation (IFRRO, VG Wort, CLA, CCC, etc.)
Scanline: One horizontal row of pixels within an image.
Secure Location Identifier: Unique identifier which protects the document from unauthorised access.
Service-Provider: See Internet Service Provider
SGML: Standard Generalised Mark-up Language. SGML is an ISO standard widely adopted by publishing professionals.
Shell-Account: See Dialup-Account
SMTP: Simple Mail Transfer Protocol. Internet e-mail.
SSL: Secure Sockets Layer. Low level packet encryption mechanism. Current version is SSL 3.0.
STM: Scientific, Technical and Medical publications (publishers)
SUTRS: Simple Unstructured Text Record Syntax. A Z39.50 record syntax that allows an origin to retrieve a result set of bibliographic records.
Tags: a) In HTML: the structure and presentation of documents are defined via tags. b) In TIFF: categories for the description of the TIFF file.
TCP/IP: Transmission Control Protocol / Internet Protocol. Denotes the required stack of software that allows a machine to connect to the Internet. This covers layers 3 and 4 of the 7-layer OSI model.
TEI: Text Encoding Initiative. Guidelines for Electronic Text Encoding and Interchange (final version 1994). Based on SGML.
TIFF: Tag(ged) Image File Format (Aldus Corp.). Graphics file format. Latest revision is 6.0.
TIFF Header: Contains the file description of a TIFF file.
Truecolour: An image representation in which pixel colours are defined by storing three samples for each pixel, representing red, green, and blue intensities respectively. PNG also permits an alpha sample to be stored for each pixel of a truecolour image.
UNIMARC: UNIversal MARC. Widely supported bibliographic MARC records.
URL: Uniform Resource Locator. A Web reference to a document (or object), e.g. http://www.telis-sc.fr denotes the URL of the Telis S&C Web server.
VAN EYCK: Visual Arts Network for the Exchange of Cultural Knowledge (EC project)
W3C: World Wide Web Consortium. An organisation jointly founded by MIT and CERN to manage the development of the WWW.
WAIS: See Wide Area Information Server
WAN: See Wide Area Network
WebDOC: Project where Pica is co-operating with publishers in order to provide document delivery on the Web (http://www.pica.nl).
White point: The chromaticity of a computer display's nominal white value.
WIPO: World Intellectual Property Organisation. Belongs to the UN. WIPO is seated in Geneva.
WORM: Write Once Read Multiple
WORM disc: Optical disc applying WORM technology.
World Wide Web: A hypertext based system for retrieval of and access to Internet resources.
WWW: See World Wide Web
X.400: CCITT Messaging System
XBM: X-Bitmap file format. Standard format for the storage of bit-map graphics under X-Windows. See also GIF.
X-Windows System: A network based window system which was originally developed by the Massachusetts Institute of Technology (MIT). X-Windows (also called "X") is mainly used for UNIX computers.
YAZ: Yet Another Z39.50 implementation (http://www.indexdata.dk/yaz)
Z39.50: ANSI and ISO standard. Z39.50 is a search and retrieval protocol widely accepted within the libraries community in the USA and Europe.
zlib: A particular format for data that has been compressed using deflate-style compression.
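
The glossary entries for CRC, Deflate, lossless compression and zlib describe mechanisms that can be illustrated briefly. The following Python sketch is only an illustration under stated assumptions: the function names append_crc, check_crc and roundtrip_is_lossless are hypothetical, the payload layout (data followed by a 4-byte check value) is a simplification and not the actual PNG chunk layout, and Python's standard zlib module stands in for whatever encoder or decoder a real system would use.

```python
import zlib


def append_crc(payload: bytes) -> bytes:
    """Encoder side (hypothetical helper): append a 4-byte CRC-32 to the data."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def check_crc(message: bytes) -> bool:
    """Decoder side (hypothetical helper): recompute the CRC over the received
    data and compare it with the value the encoder appended."""
    payload, received = message[:-4], message[-4:]
    computed = (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(4, "big")
    return computed == received


def roundtrip_is_lossless(data: bytes) -> bool:
    """Compress with deflate (zlib format) and check that decompression
    reconstructs the original bytes exactly, bit for bit."""
    return zlib.decompress(zlib.compress(data)) == data


if __name__ == "__main__":
    message = append_crc(b"sample scanline data")
    print("intact message passes check:", check_crc(message))

    # Flip one bit of the first byte to simulate a transmission error.
    corrupted = bytes([message[0] ^ 0x01]) + message[1:]
    print("corrupted message passes check:", check_crc(corrupted))

    print("deflate round trip is lossless:", roundtrip_is_lossless(b"abc" * 100))
```

A mismatch between the recomputed and the appended check value signals corruption in transit, as the CRC entry states, while the round trip shows the defining property of lossless compression.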

8. References, URLs etc.

· GIF Info: http://www.geocities.co.jp/SiliconValley/3453/gif_info/index_en.
· JPEG Home Page: http://www.jpeg.org/public/jpeghomepage.htm
· PNG Homepage: http://www.cdrom.com/pub/png/
· PDF: http://www.
· TIFF Revision 6.0: http://www.jgd.fhg.de/icib/it/defacto/company/aldus/read.html
· Wavelet Digest Home Page: http://www.wavelet.org/wavelet/index.html
· Report of the technical DFG working group “Verteilte digitale Forschungsbibliotheken”: http://wwww.sub.uni-goettingen.de/ebene_2/vdf/einstieg.htm
· Glossary of Digital Age Terms: http://strategy.gemconsult.com/resources/glossary/index.htm