
HathiTrust Digital Library

Update On June Activities
July 9, 2010

In This Newsletter

Top News
• Shibboleth and Full-PDF Download
• SEASR
Working Groups
• Discovery Interface
• Development Environment
• Quality
Development Updates
• Large-scale Search
• PageTurner
• Collection Builder
• Storage Upgrade

New Growth

Number of volumes added:

Institution              Month of June      Overall
Indiana Univ.                      236      177,333
Penn State                         328       22,824
Univ. of California                616    1,509,169
Univ. of Michigan               34,605    4,056,835
Univ. of Minnesota                 173       73,856
Univ. of Wisconsin              10,073      353,639
Total                           46,031    6,193,386
Total (~20% of total)           32,805    1,208,351

Top News

Shibboleth and Full-PDF Download – HathiTrust released Shibboleth as a mechanism for partner authentication in June. Authenticated users can now download full-PDFs of all public domain volumes in HathiTrust, and access the Collection Builder feature through local sign-on. Shibboleth also lays the groundwork for future augmented services to partner institutions, potentially including the ability to make uses of digital volumes allowed by Section 108 of U.S. copyright law, and to allow full access to in-copyright volumes for users with print disabilities.

Full-PDF Download: The release of Shibboleth was made in conjunction with improvements to PageTurner that enabled delivery of high-resolution PDF files with embedded OCR for entire volumes. While only individuals at member institutions have access to this service across the repository, all public domain volumes that were not digitized by Google are available for full-PDF download to members and non-members alike. Right now these include nearly 100,000 [...]-digitized volumes that have been contributed by the [...], and thousands of volumes digitized locally by the [...]. The partners are poised to significantly increase the amount of non-Google-digitized content preserved in HathiTrust in the near future, making many more public domain volumes freely available for download and distribution.

SEASR – HathiTrust is investigating SEASR, the Software Environment for the Advancement of Scholarly Research, as a means to provide computational access to materials stored in the repository. Staff at the University of Michigan began installation of SEASR in the HathiTrust development environment in June, and expect to gain more knowledge about SEASR and what would be involved in applying it to HathiTrust over the next several weeks.

Working Groups

Discovery Interface – As of the end of June, there are nearly 3.1 million HathiTrust records in WorldCat. Record loading is now continuing at a quicker pace, and is nearly complete. Meanwhile, the working group is configuring the HathiTrust-OCLC catalog interface to make branding and design consistent with the existing HathiTrust system. OCLC is also making several alterations to the catalog’s functionality to fully meet HathiTrust’s requirements. This work is expected to extend into early August, after which time the interface will be reviewed for public beta release.

With the working group’s charge expanding to include development of the HathiTrust Full Text Search, the group plans to restructure its membership in order to specifically target different areas of focus. While the new structure is still being finalized, the goal is to form various task forces to address different aspects of the HathiTrust Discovery Interface: full text search, bibliographic data management, and the HathiTrust-OCLC catalog interface.

Development Environment – University of Michigan staff continued the migration of HathiTrust applications into the new development environment in June, and also tested and configured the GlusterFS distributed file system that will be used as the storage back-end for the environment. Michigan staff are setting up and testing the virtual MySQL and web service provisions of the new environment. An initial version of the development environment is currently being used by staff at California and at Michigan to make improvements to the existing PageTurner application. When configuration is complete, the environment will support HathiTrust development efforts broadly across the partnership.

Quality, Ingest, and Error Rate – The quality working group is still working through a set of scenarios for gating volumes of poor quality from entering HathiTrust, and developing a justification and recommendation for the best approach to follow. A set of larger issues around quality has also been identified, some of which deal with larger policy considerations.

Development Updates

Large-scale Search – The full text search index in Indiana was put into production by Michigan staff in early June, making the infrastructure for full text search fully redundant. Two new index build servers were also put into production in Michigan. All of the new systems have been functioning well, and the new build servers have substantially improved the performance of index building and maintenance.

Michigan staff began running tests in June to determine the effects of cache-warming on performance, as well as tests relating to scaling strategy and indexing speed. The goal of scaling tests is to determine the optimum size to use for index shards (the sections of the search index that are stored on each index server), the optimum number of shards per server, and the optimum memory allocation per server. Indexing speed is of critical importance for deploying new searching features, which often requires the entire search index to be rebuilt.

Michigan staff also developed a Lucene utility in June (Solr uses Lucene) to read an index and print out the total number of occurrences of a term. The code has been contributed and committed to the stable Lucene development branch (3.x).

PageTurner – Additional progress was made on GnuBook integration with the current HathiTrust PageTurner. In particular, Michigan investigated ways to optimize the serving of thumbnails. Performance optimization for the new page image server also continued, with a focus on common CGI performance mechanisms, including FastCGI.

Collection Builder – Integration of Collection Builder functionality with large-scale search is in the final stages of testing and will be deployed in July.

Storage Upgrade – Michigan staff have ordered and received additional storage for the Indiana and Michigan sites and will be putting it into service during July and August. The upgrade requires the installation of a new, larger storage network switch, so staff will be using the opportunity to introduce a new cabling layout for the entire system. In Indiana, the upgrade and recabling work will be combined with a recommended relocation of all server equipment to another area of the data center for improvements in air handling and a transition to high-voltage power distribution. No outage is expected for this maintenance work.

Outages – HathiTrust services were unavailable on Monday, June 7 from 7:10-10:00am and on Tuesday, June 8 from 5:00-5:30pm due to a connectivity problem with one of the web servers; and on Saturday, June 25 from 8:30-10:00am due to a database server disk space shortage.

July Forecast

• Explore capabilities and requirements of SEASR
• Continue configuration of the new development environment and migration of current development activities
• Install storage upgrade at Indiana site

Presentations

ALA NISO/BISG Forum    June 25

Please see http://www.hathitrust.org/papers for links to all HathiTrust presentations, papers, and reports.