HathiTrust Digital Library

Update On February Activities

March 12, 2010

In This Newsletter

Top News
• Ingest
• New Programmers for Non-Google Ingest
• Data API Demonstration

Working Groups
• Quality
• Discovery Interface
• Development Environment

Ingest
• University of Minnesota

Development Updates
• Shibboleth
• Large-scale Search
• PageTurner

Top News

In the last several months, the HathiTrust partners have made steady progress in expanding the repository’s ability to support the variety of digital outputs produced at their local institutions. While the bulk of content in HathiTrust currently is the result of Google’s digitization efforts, preserving and delivering content from libraries’ non-Google sources is an important part of HathiTrust’s mission to meet the needs of […] broadly, and assemble a comprehensive […] of published materials that is co-owned by libraries themselves. Three items in this month’s update highlight our efforts in this area: our progress in ingesting materials digitized by the Internet Archive, the hiring of two new programmers to focus on the transformations and normalizations involved in bringing in diverse content, and the creation of a demonstration application that uses the HathiTrust Data API to deliver master repository content from non-Google sources to users. We will be highlighting developments such as these in the coming months.

Internet Archive Ingest – Ingest of UC volumes digitized by the Internet Archive was delayed in late February due to a validation error that UM staff encountered, but ingest of more than 200 pilot volumes began in early March. Following quality review of the volumes by UC staff and the resolution of any associated issues, download of UC’s Internet Archive-digitized volumes will begin in earnest. Staff at UC and UM are compiling technical and procedural documentation related to Internet Archive ingest to share with partner institutions and the community at large.

New Programmers for Non-Google Ingest – UM has hired two new programmers, for a total of 1.7 FTE, to concentrate on developing ingest routines and common workflows for non-Google-produced materials. These will include materials digitized by the Internet Archive and through local digitization efforts at partner institutions.

Data API – The interface to the Data API demonstration application that was undertaken by Michigan in January is available at http://www.lib.umich.edu/two-over-threehundred/. The goal of the application was to use HathiTrust’s Data API to facilitate the location and download of complete packages for volumes not digitized by Google. The code used to produce the demonstration is also available. The application is still processing the HathiTrust data files, and so will only display a subset of the full data.

New Growth

Number of volumes added:

                       Month of Feb.        Total
Indiana Univ.                 23,066      174,882
Penn State                       128        5,144
Univ. of California            5,976    1,162,315
Univ. of Michigan             50,873    3,781,841
Univ. of Minnesota            64,966       64,966
Univ. of Wisconsin            35,683      303,727
Total                        108,977    5,434,537

54,555 public domain volumes were added in February, bringing the total number to 818,886 (about 15% of total content).

Working Groups

Quality, Ingest, and Error Rate – The working group kicked off activities under its recently revised charge in February, and will be meeting on a monthly basis. At this stage, the group is gathering information and planning work items, including building a framework for defining quality principles and developing a varied set of scenarios under which content would be gated from entering HathiTrust. This work will help to spur discussion and identify larger issues that are at play. Members of the group include Paul Fogel (California Digital Library), Peter Gorman (University of Wisconsin), Bryan Skib (University of Michigan), and Paul Soderdahl (University of Iowa).

Discovery Interface – The HathiTrust-OCLC team made significant strides in February towards the version 1 catalog beta implementation, with some adjustments to the projected timeline. Due to changes in OCLC’s product release cycles, the catalog is now expected to be complete in May 2010. The HathiTrust library team is now exploring strategies and requirements for the catalog’s public release, with the guidance of both the HathiTrust Strategic Advisory Board and Executive Committee. The load of HathiTrust bibliographic metadata into WorldCat remains on schedule. OCLC is currently testing the first batch of records, and large-scale loading will take place throughout the month of March. Preliminary user testing is currently underway at Penn State and will be complete in mid-March, thanks to the collaborative efforts of OCLC and HathiTrust’s usability group.

Collaborative Development Environment – The working group reconvened via conference call in February to discuss strategies for version control. All agreed that the version control tools used should facilitate development at local sites as well as within the environment itself, and allow public availability of the source code. Modern distributed version control systems, including some third-party systems such as GitHub, fit well with these needs, and UM staff will propose an architecture to the group at their next meeting in early March for approval. The group also discussed building logical divisions in the environment to segregate its use for various purposes, such as active code development, integration testing and staging for production release, the presentation of relatively stable “beta” versions of software systems, and replicating and troubleshooting issues live in production.

March Forecast
• Deploy the new page image server and related changes to the HathiTrust PageTurner
• Release Shibboleth authentication support
• Continue large-scale search performance monitoring
• Complete quality assurance processes for pilot ingest of Internet Archive-digitized materials
• Begin ingest of all UC’s Internet Archive-digitized volumes

Ingest

University of Minnesota – Ingest of content from the University of Minnesota began in February, with nearly 65,000 volumes being deposited. All of these volumes are government documents, and are part of a larger effort of the Committee on Institutional Cooperation (the Big Ten plus the University of Chicago) in partnership with Google, to digitize more than 1 million U.S. federal documents from their combined collections. The Minnesota documents themselves can be found by clicking on the University of Minnesota facet in the HathiTrust Catalog.

Development Updates

Shibboleth – UM is in the process of finalizing Shibboleth attribute release requirements for HathiTrust applications in coordination with partner institutions, and is registering HathiTrust as a service with the InCommon Shibboleth federation. The release of this enhancement to HathiTrust applications is still planned for a March timeframe.

Large-scale Search – The large-scale search index grew to the point in February that it exceeded the Solr/Lucene limit of 2.1 billion unique terms. Core Lucene developer Michael McCandless graciously provided a patch raising the limit to 274 billion unique terms. Michigan continued performance tests aimed at identifying optimal shard sizes. Staff at Michigan also led team members at CDL on a walk-through of the large-scale search implementation in mid-February.
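
The two term limits quoted for the large-scale search index can be related with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated in this update: that the 2.1 billion figure reflects the largest signed 32-bit integer, and that the 274 billion figure reflects the 32-bit range multiplied by a term index interval of 128. The names INT32_MAX, ASSUMED_INDEX_INTERVAL, and patched_term_limit are hypothetical and do not come from the Lucene code base.

```python
# Illustrative arithmetic only; not taken from Lucene or HathiTrust code.

# Assumption: the original limit is the largest signed 32-bit integer.
INT32_MAX = 2**31 - 1

# Assumption: the patched limit is the 32-bit range times an index
# interval of 128 (only every 128th term would need a 32-bit slot).
ASSUMED_INDEX_INTERVAL = 128


def patched_term_limit(interval: int = ASSUMED_INDEX_INTERVAL) -> int:
    """Return the assumed term limit for a given index interval."""
    return 2**31 * interval


def in_billions(n: int) -> float:
    """Express a count in billions (10**9)."""
    return n / 1e9


if __name__ == "__main__":
    print(f"original limit: {in_billions(INT32_MAX):.1f} billion terms")
    print(f"patched limit:  {in_billions(patched_term_limit()):.1f} billion terms")
```

Under these assumptions the first figure rounds to 2.1 billion and the second truncates to 274 billion, matching the numbers reported above.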


Four new redundant servers for large-scale search indexing arrived at Indiana and will be installed once additional power and networking infrastructure work has been completed, probably in late March. Two new servers for index building arrived in Michigan and are tentatively scheduled for March installation as well, pending staff availability.

PageTurner – Michigan revamped the PageTurner code that generates PDFs from the repository in February, optimizing it for high performance delivery of full-book PDF files containing full-resolution page images. The ability to download full PDF files of HathiTrust public domain volumes will be available to partner institutions when Shibboleth is implemented. Michigan also explored pipelines for fast on-the-fly generation of scaled, rotated, and watermarked page images and developed a prototype image server. Once completed, it will serve all individual page images not encapsulated in PDF.

Outages – There were no outages in February.