11 (2008) Issue 3

Reprint

D 52614 ISSN 1435-7607

Journal for libraries, information and technology with an up-to-date internet presence: www.b-i-t-online.de


Mass digitisation workflow management

Book and digital scan workflows

By Martin Baumgartner, Michael Beer, Ingrid Bochmann, Berthold Gillitzer, Rosmarie Leichtl, Gabriele Messmer, Karsten Trzcionka, Thomas Wolf-Klostermann and Wilhelm Hilpert

In 2007, the Bavarian State Library concluded a much-acclaimed contract with Google relating to the digitisation of around one million out-of-copyright old books which are part of the Bavarian State Library's collection. The project will run for several years. One of the contract's stipulations is that the books are to be scanned within the Free State of Bavaria. The contract is based on the Bavarian State Library's pan-European “Call for Expression of Interest in Participation in the Negotiation Proceedings” published in Tenders Electronic Daily. It generated widespread positive press coverage and triggered a lively debate within the library community.

Fig. 1: With more than one million titles, the Bavarian State Library's historical collection may still hold many an undiscovered treasure for scholarship. (Photo: BSB/H.-R. Schulz)

Public private partnership agreements are always formulated to satisfy each party's confidentiality requirements and, naturally, confidentiality has to be observed. This gives private sector enterprises the assurance that their know-how is protected, particularly the know-how relating to company processes and technology, which is often a key competitive factor. “Confidential” is the word that most aptly describes Google's relationships with its partner libraries. That is the reason why this article does not disclose any information about Google's scanning technology, quality assurance or internal processes.

This article views Google as a popular service provider which does, however, differ in two important respects from all other scanning service providers to date. Instead of receiving payment for its services, Google gets (or, to be more precise, retains) a copy of each digital scan, and instead of a few hundred pages, it scans and processes many hundreds of books every day. Basically, the workflows described are typical of a mass digitisation work environment and they exist - in appropriately modified form - in the Bavarian State Library's other large-scale projects, such as the project to digitise 37,000 German 16th century prints. From another perspective, the Bavarian State Library doesn't incur any direct costs for digitisation and it receives an identical copy, the “library digital copy”, of the digital scan created by Google.

Cf.: Klaus Ceynowa: Massendigitalisierung für die Wissenschaft – Zur Digitalisierungsstrategie der Bayerischen Staatsbibliothek. In: Information – Innovation – Inspiration. 450 Jahre Bayerische Staatsbibliothek. Ed. Rolf Griebel and Klaus Ceynowa. Saur 2008, pp. 241-252.

The preparatory phase

The reason for this article is the conclusion of the preparatory phase and the commencement of the productive phase of this mass digitisation project. Some people may consider a preparatory phase of more than one year to be too long, but it is actually quite normal in a project of this size and the project team members view it as short rather than long. First of all, Google had to provide the necessary human and material resources for a project of this size. For its part, the Bavarian State Library had to create the tools for the management, monitoring and reliable archiving of the books and digital scans.

The logistical challenge

Some people have no inkling of how big a challenge it is to make over one million books available and process them in a project with a relatively compact timeframe. In Germany, people often used to talk about mass digitisation when a thousand books had to be scanned.

Just for clarity's sake, this quantity was processed in less than one week in the Google project. The project marks the beginning of a new era of digitisation in German libraries - the era of industrial mass digitisation.

Although we said at the outset that we weren't going to interfere in the business of the “scanning service provider” Google, there are two very important points that we would like to cover:
- The Bavarian State Library has and retains exclusive control over the application of conservation and maintenance criteria in every process step - even at an individual case level.
- The Bavarian State Library works in close collaboration with Google to ensure that the digital scans are of the highest possible quality.

Basically, this means that the scans match the quality achieved in publicly-funded projects. Google is committed to the continuous improvement of digital scan quality and the application of its quality assurance measures. The Bavarian State Library can count itself lucky not to have been among the first group of “Google Libraries”. Obviously, however, the quality of industrially mass-produced digital scans is not comparable to the quality of digital scans taken from a fifty-page incunabulum in a scanning process that takes several days to complete.

One consequence of mass digitisation is that its processes make it impossible, or at least very personnel-intensive, to select individual books, text sections or system groups. It is logistics which determines the digitisation processes used and their speed.

Even before the Google project commenced, around 6,000 books were removed every day from the Bavarian State Library's repository for local lending, reading room lending, inter-library lending, document delivery and official lending (mainly for digitisation projects and other maintenance measures). Naturally, this means that 6,000 books also have to be returned to the correct place in the repository every day. Although a library's service orientation isn't expressed by usage figures alone, they do to some extent reflect it. There are also around 5,000 volumes which are moved around within the repository every day for repository management optimisation purposes, and over 500 volumes arrive for storage at the repository every day. Additional daily book movements (retrievals, returns, transportation to and from the project and meta information optimisation teams) in the mass digitisation project increase the number of book movements every day to well above 20,000.

It is also worth remembering that numerous decisions have to be taken at several stages of the logistics process every day. Suitability in conservation terms, the completeness of meta information and its clear assignability to the book in question, as well as suitability in terms of size and the copyright situation are examples of issues which necessitate ad hoc decisions at individual book level. The project team members are prepared for this situation in training courses and intensive familiarisation phases so that they are able to make fast and accurate decisions.

Fig. 2: Overview of the workflow (LS = Lokalsystem, local system; PG = Projektarbeitsgruppe, project working group; WDB = Workflowdatenbank, workflow database)

The challenge facing the librarians

The challenge facing the library is far greater than the logistical challenge. The library staff have to ensure that the books are handed over to Google with clearly assignable meta information. When each production batch is handed over, the meta information pertaining to the batch in question has to be transferred with it and be clearly assignable to it. Most of the meta information pertaining to the out-of-copyright books in the digitisation project is available in electronic form, though the majority was obtained through the retro-conversion of hand-written catalogues. Quite a few of the old catalogues did not contain enough information for the book's inclusion in the electronic catalogue, and problematic cases could not be catalogued by autopsy (i.e. with the book in hand), so there are data errors, some of which require urgent correction before the book in question is handed over for mass digitisation. The three most frequent errors are incomplete and incorrect conversion images, incorrect book classifications and incorrect book numbers.

Many formal errors (e.g. an incorrect book classification) can be identified by automatic catalogue database queries and printed out as error lists. After thorough training, assistants are able to process some of these error lists. This enables the elimination of the errors in meta information pertaining to the entire collection so that they have no influence on the provisioning of the books for the digitisation process. However, the majority of errors have to be remedied by catalogue specialists who are very familiar with the Bavarian State Library's old collection. Although the experts only make corrections where absolutely necessary, it will take quite some time to process the meta information for every book before it can be digitised. We will be discussing the consequences of this later on.

The workflow database

With the process of “eliminating errors from the meta information”, we have already moved well into the project workflow. To explain the procedure, we have to provide a little more background information.
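The automatic formal checks that feed the error lists can be pictured with a small sketch. The following Python fragment is only an illustration under stated assumptions: the classification codes, the book-number pattern, the record fields and the function names are hypothetical placeholders, not the Bavarian State Library's actual shelfmark rules or catalogue queries.

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical placeholders: the article does not give the real
# classification codes or the rules for well-formed book numbers.
KNOWN_CLASSIFICATIONS = {"CLS-A", "CLS-B"}
BOOK_NUMBER_PATTERN = re.compile(r"\d+[a-z]?")


class CatalogueRecord(NamedTuple):
    classification: str
    book_number: str
    title: str


def formal_errors(record: CatalogueRecord) -> List[str]:
    """Return the formal errors found in one record (empty list = clean)."""
    errors = []
    if record.classification not in KNOWN_CLASSIFICATIONS:
        errors.append("unknown classification")
    if not BOOK_NUMBER_PATTERN.fullmatch(record.book_number):
        errors.append("malformed book number")
    return errors


def build_error_list(
    records: List[CatalogueRecord],
) -> Tuple[List[CatalogueRecord], List[Tuple[CatalogueRecord, List[str]]]]:
    """Split records into clean ones and an error list for manual follow-up."""
    clean, error_list = [], []
    for record in records:
        errors = formal_errors(record)
        if errors:
            error_list.append((record, errors))
        else:
            clean.append(record)
    return clean, error_list
```

In such a scheme, the clean records could go straight on to order creation, while the error list would be printed for the trained assistants or passed to the catalogue specialists. Errors that cannot be detected formally, such as incomplete conversion images, are outside the scope of this sketch.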

At the outset of the project, we soon realised that we needed a tool that would give us an overview of the entire workflow and provide us with answers to many foreseeable questions:
- Is book xy with the service provider?
- Which books are currently with the service provider?
- Is book xy back and has it been digitised by the service provider?
- Have we also received the library digital copy of book xy?
- Why wasn't book xy digitised?
- How long will user X have to wait for book xy?
- How long will user Y from Würzburg have to wait for the digital scan to be available on the internet?
- Precisely how many books have been processed to date?
- Does book xy have a fold-out map or graphics that couldn't be scanned straight away?
- Which books will be handed over to the service provider tomorrow morning?

It was soon clear that this instrument did not exist in the required form or only met some of our requirements. That's why we chose two software tools: ImageWare's MyBib software for library document delivery order processing, and the Bavarian State Library's workflow tool and repository architecture, ZEND (Zentrale Erfassungs- und Nachweisdatenbank). We can build on these tools, and both of them are in productive use at the Bavarian State Library. ZEND and MyBib were further developed by the Bavarian State Library's Digitisation Centre in Munich and ImageWare into a tool for the management of mass digitisation projects. Both software tools were harmoniously integrated and connected via open interfaces. Together, they are called the “Bavarian State Library Workflow Database” (WDB) in this publication. The MyBib-based part has the provisional name MyBib-WDB and is predominantly used for the management and documentation of logistical processes. The upgraded ZEND that is used for industrial mass digitisation is called Google-ZEND. This part controls the collection of digitised scans, the creation of WWW-compatible image formats, provisioning and long-term archiving.

One key process effectively demonstrates how closely interlinked MyBib-WDB, Google-ZEND and the catalogue system are. It is the process for the catalogue verification of the digital copy. When the digital copy has been provided to the repository by Google-ZEND, automatic feedback is sent by Google-ZEND to MyBib-WDB, which initiates verification in the catalogue of the Bavarian Library Association's (BVB) inter-library system and in the local catalogue by the entry of a persistent link at the relevant cataloguing entry.

In this process, over 30 workflow steps are performed and documented by the two software tools. The most important functions are listed below:
- Guarantee of the correct assignment of the scanned object, meta information and digital copy
- Creation of digitisation order sets and print out of the appurtenant order slip
- Documentation of the reasons why a book cannot be processed
- Management of subsequent processing tasks, e.g. any necessary meta information corrections
- Compilation of a production batch and documentation of all works in a batch
- Controlled hand-over of books in batches, including the appurtenant meta information, to the scanning service provider
- Acceptance of the books back from the scanning service provider and completeness checks
- Acceptance of the library digital copy back from the scanning service provider
- Publication of the digital copies on the World Wide Web
- Verification in local and national catalogues through the automatic creation of URNs and further information (resolving URL; reference to freedom from costs etc.) and their entry in the catalogues
- Transmission of image files, structural data and text data on processed works to the long-term archiving system of the Leibniz Data Centre (LRZ)

Only a very high degree of automation enables the daily processing of many hundreds of digital copies. The LRZ hosts the scalable hardware with adequate memory and the server cluster for the image conversion of the digital copies for the Bavarian State Library, and it provides the high speed data lines that are necessary to process the high daily production loads.

An overview of the workflow

The Bavarian State Library's collection of old works is categorised in just over 200 different subject classifications. The subject classification numbers were retained until 1936 and they are still used to classify all works published before this date. It therefore makes sense to use these classification numbers to select all the works which are suitable for digitisation. The sequence in which the classification numbers are processed is established many months before the digitisation process begins. The sequence is selected on the basis of projected meta information error frequency, maintenance measures to be performed or already performed, format, location of the repository and the repository's mid-term relocation plan.

Note on formats: The Bavarian State Library's pre-1935 works are mainly divided into the three formats of folio (≥35 cm), quarto (25–35 cm) and octavo (≤25 cm).

Note on repositories: The Bavarian State Library has three repositories: the main repository in Ludwigstrasse, the repository in Garching and an alternative repository in Munich's Euroindustriepark.

Sections containing several tens of thousands of books are selected. The meta information, particularly the book numbers, is automatically subjected to formal checks. Error lists are printed out and processed by trained assistants as explained above. The book number sections are then selected again in the local system, divided up and exported. When they have been stored in MyBib-WDB, the order sets are created. A specific number, the “digID”, is allocated to each order set, which is then supplemented by the book titles contained in the Bavarian State Library's local catalogue in a check back process. The digID is then the most important connecting element between the local catalogue database and the MyBib-WDB data. It is later a decisive element in the persistent link and therefore the main link between the archiving system and the catalogues. It is printed as a bar code on the order slips which accompany the book through the digitisation process until it is returned to its storage place in the Bavarian State Library's repository. The digID also enables the library digital copy to find the “right place” in the Bavarian State Library's archiving system.

Around six weeks before we print the order slips in MyBib-WDB, the volumes to be digitised are taken out of circulation and books that are out on loan are officially reserved. By the way, if they are essential for research purposes, these books can be loaned for use in the Bavarian State Library's reading rooms until they are handed over to the service provider. The process of digitisation, our completeness checks and the return of the books takes a few weeks. The entire digitisation workflow makes the books unavailable for loan for around three months in total. However, they are only absolutely inaccessible for around half of this time. This is not much longer than the standard four week loan period. We would like to specifically emphasise that this is an impressive time compared with other digitisation and maintenance projects and it is an indication of quality based on thorough process planning.

The order slips are printed out and the books are retrieved from the repository on the basis of the order slips around two weeks before they are handed over to the service provider. This is a very complex process which is inadequately described by the word “retrieve”.


Fig. 3: System architecture of the provisioning and archiving system (PURL = Persistent URL; TSM = Tivoli Storage Manager)

First of all, the books have to be checked to ensure that they are suitable for scanning (condition, size, year of publication etc.). Then, one of three different things can happen:

1. The order slip is placed inside the book, which is then sent off for digitisation. The order is entered in MyBib-WDB by scanning the digID bar code with handheld scanners and integrated mobile data storage units. The digIDs are centrally input from the storage unit into the MyBib-WDB database at a later time. The individual orders are then assigned a status which permits them to be included in a batch for handing over to the scanning service provider.
2. If books are obviously unsuitable or missing - sometimes the books are put on the wrong shelf - the reason is specified on the order slip. The order slip is then scanned into MyBib-WDB to store information about the reasons why the book has not been scanned. If the situation changes or follow-up projects are implemented, it is then easier to select these works again.
3. When the person retrieving the book isn't sure whether it is suitable for scanning or discovers meta information ambiguities, the book either remains on the shelf (in which case a reprocessing note is made on the order slip) or it is shown to the local project group manager for his or her opinion.

The project group has to decide whether the books that it has received can be processed quickly within the group or whether they have to be passed on to the quality assurance and media processing teams for complex meta information corrections. Some of the books are sent to the maintenance department or the Bavarian State Library's Institute for Book and Document Restoration (IBR). All the decisions about the cultural artefacts which have been entrusted to us have to be efficient, objective and responsible. The Bavarian State Library's workflow database provides the necessary support and documentation in this process so that each book's progress can be tracked at all times.

Working through the repository systematically, shelf-by-shelf, has the major benefit that all books for which no order slip was printed out, because their meta information was incorrect or incomplete, can be found.

As a side effect, this process will fulfil two long-standing desiderata, although this was not the original intention. All of the library's out-of-copyright and valuable old works will be subjected to a thorough audit, and books which cannot be digitised as a result of serious damage will be documented and then repaired. The difference between this and other damage audits is that instead of only a small portion of the collection being audited and the result extrapolated, all books will be audited one-by-one and the findings in each case will be documented in MyBib-WDB.

The books provided to Google therefore come from three sources - books which do not require any corrections and are sent for digitisation straight away, books prepared by the local project group and a small volume of works which will come from the media processing team, a book binder or the Institute for Book and Document Restoration. MyBib-WDB then creates a “batch” consisting of a precisely defined number of books and delivers them for scanning.

So what happens to all the books with meta information that can't be corrected in time to send them for digitisation? In a few years' time, before the defined project term ends, we will be implementing the above-described process of “going through” the Bavarian State Library's entire old book repository for a second time. Obviously, all the books which have already been scanned (the majority) will not be included in the process, which will considerably reduce the time required. Now you will see why we were able to be relatively generous to users who requested book loans during the “first round”. Any books that were loaned out instead of being sent for digitisation will be included in the second or possibly even a third round of digitisation. This process also ensures that books which had lengthy maintenance performed on them can also be digitised.

As previously mentioned, the Google processes are not described here. The books are definitely scanned, scan quality and completeness are checked and - if possible - classification data and indexing data are generated for full text searches. Then the books are returned in complete batches to the Bavarian State Library. Batch-by-batch returns and MyBib-WDB simplify book counts to ensure that all books have actually been returned.

After their return, the books are put back in the repository by experienced staff who can identify any missing books by gaps on the shelves.
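The batch hand-over and the count on return described above amount to a comparison of two sets of digIDs. The following Python sketch shows the idea under stated assumptions: the function names are hypothetical, and the batch size is a free parameter because the article only says that a batch contains “a precisely defined number of books”.

```python
from typing import Iterable, List, Tuple


def compile_batch(ready_dig_ids: List[str],
                  batch_size: int) -> Tuple[List[str], List[str]]:
    """Take the next batch_size digIDs; return (batch, remaining orders)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return ready_dig_ids[:batch_size], ready_dig_ids[batch_size:]


def check_return(handed_over: Iterable[str],
                 returned: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Compare a handed-over batch with the slips scanned on return.

    Returns (missing, unexpected): digIDs that went out but did not come
    back, and digIDs that came back without belonging to the batch.
    """
    out, back = set(handed_over), set(returned)
    return sorted(out - back), sorted(back - out)
```

A batch would count as complete only when both result lists are empty; any digID in the first list points to a book that still has to be traced.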

One final completeness check is made by scanning all the order slips removed from the books that were replaced on the shelves into MyBib-WDB.

Provisioning of the digital copies

The provisioning and long-term archiving of the library digital copies is organised by the Munich Digitisation Centre (MDZ) in collaboration with the Leibniz Data Centre (LRZ). An efficient technical infrastructure was set up and tested for this purpose in the months prior to project start-up. It was based on the Bavarian State Library's existing, tried-and-tested ZEND system which has been in operation at the MDZ since 2004. It was adapted for mass digitisation purposes and upgraded to include several additional functions. The central feature is a scalable repository for the storage of all project data. A central production server retrieves the data packages from the scanning service provider and transfers them to a server cluster for conversion. The data is then processed in the Munich Digitisation Centre's standard workflow. Presentation derivatives and structural data are created and a leaf-through version of each book is prepared for use. A second production server then archives the files in the TSM (Tivoli Storage Manager) system at the LRZ. The archiving system is directly connected to the production system. At the end of the entire process, the digital scan is made available on the Bavarian State Library's web server. Then, automatic feedback is provided to the WDB and, from there, to the catalogue system. Only when the catalogue link is entered is the digital scan visible and accessible to users. Finally, full text indexing makes whole book searches possible.

The system is designed so that all digitised books can be processed and provisioned soon after scanning and so that no “backlog” builds up. The entire processing chain, from data collection through image conversion to provisioning and archiving, would not be possible without a high degree of automation in all production stages. Due to the vast quantity of data to be processed - several hundred books have to be processed and provisioned on every normal production day - manual processing wouldn't just slow the process down, it would also introduce errors. Fast availability for our users and the reliable archiving of all data take priority.

After comprehensive preparatory work, the BSB is well equipped to cope with the challenges of this large-scale project. In a few years' time, the majority of the Bavarian State Library's out-of-copyright historical collection will be available online. Then, it will only take users a few mouse clicks to display around one in ten of the Bavarian State Library's books on their screens at any time of day or night, from any location and without having to wait. Although these will “only” be virtual copies, all the books will be returned safely to their shelves in the BSB's repositories where they'll be available to book lovers who like the feel of the genuine article.

Authors

Martin Baumgartner, Cooperative Data Management
Michael Beer, Head of Quality Assurance, Cataloguing
Dr. Berthold Gillitzer, Deputy Head of User Services
Dr. Wilhelm Hilpert, Head of User Services
Rosmarie Leichtl, Head of the Google Workflow Team
Gabriele Messmer, Coordination, Digital Library
Karsten Trzcionka, Head of Document Management
Dr. Thomas Wolf-Klostermann, Digital Library, Thomas.Wolf-Klostermann@bsb-muenchen.de

All authors: Bayerische Staatsbibliothek, 80328 München
