HathiTrust Digital Library

Update On January Activities

February 12, 2010

In This Newsletter

Top News
• New Cost Model
• Disaster Recovery Plans
• Digital Library Profile
Working Groups
• Quality
• Discovery Interface
• Collaborative Development Environment
• Storage
Ingest
• Ingest
• Non-Google Ingest
Development Updates
• Shibboleth
• Data API
• Large-scale Search
• PageTurner

New Growth

Number of volumes added:

                       January       Total
Indiana Univ.           38,344     151,816
Penn State                   0       5,016
Univ. of California        972   1,156,339
Univ. of Michigan       71,094   3,730,968
Univ. of Wisconsin         691     268,044
Total                  104,342   5,312,183

5,384 public domain volumes were added in December, bringing the total number to 764,331 (about 14% of total content).

February Forecast

• Complete and deploy Shibboleth authentication support
• Complete quality assurance processes for pilot of UC’s Internet Archive-digitized materials and begin ingest into the repository
• Continue large-scale search performance monitoring
• Make progress toward the integration of Collection Builder functionality in full-text search results

Presentations

NISO Webinar Feb 10

Please see http://www.hathitrust.org/papers for links to all HathiTrust presentations, papers, and reports.

Top News

New Cost Model – The HathiTrust executive committee approved a new cost model for partnership in December that will be adopted by all partners beginning in 2013. In the new model, partners will share in the cost of public domain and open access volumes preserved in HathiTrust, and in the cost of in-copyright volumes that they hold, or have held, in their physical collections. The model will distribute the costs of curating and managing the digital collections in a way that more accurately reflects the benefits each partner receives from deposited volumes. It will also allow institutions to join HathiTrust that do not necessarily have content to deposit, but that wish to support and benefit from the long-term curation and access services that HathiTrust provides. Such institutions are eligible for partnership effective immediately, and do not need to wait for the 2013 general adoption. Details of the new cost model are available at http://www.hathitrust.org/documents/hathitrust-cost-rationale-2013. Please contact hathitrust-info@umich.edu for additional information and inquiries about partnership.

Disaster Recovery Planning – Following an evaluation of disaster preparedness performed last summer by an IMLS-funded intern, and the hiring of a preservation librarian in November, steps are being taken to formalize and expand HathiTrust’s policies and practices relating to disaster recovery. The UM preservation librarian is leading a process to form a Disaster Recovery Planning Committee and, with support of a winter intern from the UM School of Information, has begun to gather key inventory, personnel, and workflow documentation. Guided by industry standards such as TRAC and best practices in the community, the committee will ensure a high level of preparedness for known and unknown risks to the long-term integrity and use of materials in the repository. A preliminary meeting of key staff will occur in February, and membership in the Disaster Recovery Planning Committee will be finalized soon thereafter.

Digital Library Profile – As part of its participation in an NSF EAGER grant awarded in September 2009, HathiTrust completed a technological profile of its repository based on two frameworks developed by Johns Hopkins University. The profile can be found at http://www.hathitrust.org/technology.

Working Groups

Quality – In July 2009, the Strategic Advisory Board (SAB) assembled a working group to investigate issues surrounding the quality of partner institution volumes downloaded from Google. The working group was asked to research and provide recommendations on a quality threshold HathiTrust uses to limit ingest of poor quality volumes. The working group presented its recommendations to the SAB in January, and the SAB decided to continue the working group with a revised and expanded charge. The new charge is to a) develop a set of quality principles for HathiTrust, b) monitor quality control as related to user experience, c) track developments in a separate quality working group established by Google and Google library partners following the Google partner summit in October, and d) evaluate HathiTrust practices with regard to thresholding or limiting ingested content. Membership in the new group, called the HathiTrust Quality Ingest and Error Rate Working Group, is currently being determined.

Discovery Interface – With the version 1 catalog beta release only a few months away, the Discovery Interface Working Group is turning its focus to the usability of the catalog and its integration with existing HathiTrust Digital Library services (Collection Builder, Page Turner, and Full-Text Search). The Working Group formed a usability subgroup, which will collaborate with staff at OCLC to begin usability testing of the catalog before it is released. Testing will also be performed in post-release phases. Aspects of the pre-release analysis will include verifying accurate functionality and fulfillment of agreed-upon requirements.

In preparation for loading HathiTrust volumes into Worldcat for the version 1 release, staff at UM provided an API that will allow OCLC to display HathiTrust information in Worldcat records.

Collaborative Development Environment – UM staff have been gathering specific topics for the working group to discuss when it reconvenes (now planned for late February), and have developed a draft timeline for the steps ahead. A message to reassemble the group was sent in early February, and scheduling is underway. The area the group will address first is the design of a version control system. UM staff have also begun to research the GlusterFS cluster file system as a storage back-end for the environment.

Storage – The working group tasked with making recommendations on a third instance of storage for HathiTrust presented its final report to the Executive Committee in January. The group concluded that although there were significant benefits to implementing a third instance of storage, given the high level of preservation confidence in HathiTrust and the absence of economic conditions favorable for acquiring and operating new storage, there was no urgency in establishing a new instance. The group noted, however, that HathiTrust should be prepared to establish a third instance of storage if such a course becomes more economically feasible.

The Executive Committee would like to solicit broader feedback from partner institutions regarding these recommendations (especially from a collection development perspective), and requests that thoughts on the report and a third instance of storage be sent by email to [email protected]. Those who wish to remain anonymous should indicate this in their email. The full report of the working group is available at http://www.hathitrust.org/projects#wg_storage.

Ingest

General – Ingest rates were low in January, due in part to challenges UC experienced in retrieving bibliographic records from one of its systems. UM loaded the first set of bibliographic records for Minnesota, but could not begin ingest because of problems with Google’s delivery of the content files. Ingest numbers from other institutions were also low because HathiTrust caught up with the rate at which partner volumes were made available from Google.

Internet Archive Ingest – UM began testing validation routines on a batch of 200 Internet Archive-digitized volumes from the University of California in January. The teams are revising validation strategies based on the findings of these tests and the results of quality assurance performed by UC staff on transformed, but not yet ingested, objects. UM and UC will proceed with the ingest pilot in February, testing all aspects of bibliographic and content loading, validation, and access. Completion of the pilot is projected for late February.

New Programmer For Non-Google Ingest – UM extended the bidding period for the new programmer position through mid-January, and several new qualified candidates have been interviewed. UM staff are in the final stages of selecting candidates, and expect to have a new full-time staff member and a new part-time staff member on board by the end of February.

Development Updates

Shibboleth – Shibboleth implementation in HathiTrust is nearly complete. Major portions of the code are in place and UM staff have begun to contact partner institutions to exchange information that will allow individuals from partner institutions to authenticate into HathiTrust. Initial benefits to partners will be increased facility in creating personal collections in Collection Builder and full-PDF download of all public domain volumes. Non-partners will still be able to create collections using the University of Michigan “friend account” system. Deployment of Shibboleth is planned for March.

Data API – In January, staff at the University of Michigan began work on a web application that will use the Data API to facilitate the location and download of complete packages for public domain volumes not digitized by Google. The application is being created entirely with data and services available to the general public and is meant to demonstrate uses that can be made of the API. The first step of crawling the repository for eligible volumes is in progress, and release of a beta version of the application is expected in February.

Large-scale Search – UM improved logging and log analysis in January, enabling staff to monitor search performance in a way that more closely resembles the user’s experience. UM staff documented changes to large-scale search hardware in a new blog post entitled “Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond”.

New index servers were ordered for the Indiana site and are scheduled to be in service before the end of March. The current index release process already synchronizes an updated version of the index to be stored in Indiana on a daily basis. Acquisition of the new hardware will provide full redundancy of the large-scale search application servers as well. Two additional servers that will be used exclusively for index building are on their way to the Michigan site, and one server originally purchased for production service is being re-purposed for testing and development.

PageTurner – PageTurner development was slowed in January but will pick up in February and March as staff time devoted to the ingest of materials from the Internet Archive decreases.

Outages – There were no outages in January.