Collaborative Librarianship

Volume 8 Issue 3 Article 5

2016

Building a U.S. Federal Government Documents Collection in HathiTrust

Heather Christenson HathiTrust, christeh@.org


Part of the Collection Development and Management Commons

Recommended Citation Christenson, Heather (2016) "Building a U.S. Federal Government Documents Collection in HathiTrust," Collaborative Librarianship: Vol. 8 : Iss. 3 , Article 5. Available at: https://digitalcommons.du.edu/collaborativelibrarianship/vol8/iss3/5

This From the Field is brought to you for free and open access by Digital Commons @ DU. It has been accepted for inclusion in Collaborative Librarianship by an authorized editor of Digital Commons @ DU.

Building a U.S. Federal Government Documents Collection in HathiTrust

Heather Christenson, Program Officer for Federal Documents and Collections, HathiTrust

Abstract

The HathiTrust Digital Library encompasses over 760,000 federal documents digitized from print. HathiTrust has recently begun to focus attention on further developing this collection via the U.S. Federal Documents Program. The program will leverage the power of HathiTrust infrastructure, services, and member contributions and will focus not only on collection building, but also on the enrichment of discovery and access for end users. This article provides a history of HathiTrust’s investment in federal documents, background on the program, a description of current goals and activities, and a brief look at the future.

Background

Launched in 2008, HathiTrust is well known as a collaborative digital library composed primarily of texts digitized from print. At this writing in late 2016, HathiTrust has over 120 member libraries and almost 15 million volumes in its shared collection.1 HathiTrust offers a variety of services for users including catalog and full text search, distributed user support, and computational analysis via the HathiTrust Research Center. HathiTrust members participate in a shared governance structure that guides development and services for the shared collection. Via committees and working groups, members collaborate on important areas such as collection development, rights, quality, and metadata policy.

HathiTrust’s collection is largely the result of mass digitization projects conducted since 2005 by U.S. research libraries in partnership with Google and, to a lesser extent, the Internet Archive. Mass digitization has been focused on a library or collection at a time rather than being selective at a finer level. Mass digitization of U.S. federal documents dates back to the beginnings of the Google Library Project, well before the founding of HathiTrust. Early partners with Google, especially the Big Ten Academic Alliance (BTAA) universities (then known as the Committee on Institutional Cooperation),2 worked with Google to prioritize digitization of federal documents beginning in 2005, and were later joined by additional library partners such as the University of and Cornell University. Other libraries, including the University of Florida and the Library of Congress, have partnered with the Internet Archive to digitize federal documents. Large efforts have expanded in recent years to include federal agencies and collaborations such as the Center for Research Libraries’ Technical Report Archive and Image Library (TRAIL).3

HathiTrust’s initiative to create a U.S. federal documents collection dates from 2011, when members at the “Constitutional Convention”, a gathering of the membership, approved a proposal to build on previous work and create a comprehensive collection of these materials.4 Since then, HathiTrust has tackled the challenge of building this collection on a number of fronts: inventorying the universe of U.S. federal documents by building a database known as the U.S. Federal Documents Registry,5 focusing on member deposit of mass-digitized federal documents into the repository, and convening a group of member library experts who articulated a strategy for federal documents6 leading to the recent establishment of the HathiTrust U.S. Federal Documents Program in 2016.

The HathiTrust U.S. Federal Documents Program will leverage the power of HathiTrust infrastructure, services, and member contributions and will focus not only on collection building, but also on enriching discovery and access for end users. HathiTrust has appointed a new advisory committee to consult with the Program Officer and ensure that program activities serve the interests and needs of the partnership. As the program develops, the focus will be on working within the library community to solve shared problems.

With a large and invested membership community, a growing digital collection, infrastructure for discovery, access, and preservation, along with the U.S. Federal Documents Registry database, HathiTrust is well-positioned as a locus of collaboration to improve digital access to U.S. federal documents.

HathiTrust’s Investment in Federal Documents

By virtue of its membership, HathiTrust is committed to the inclusion of federal documents in its collections. Eighty-four HathiTrust member libraries also participate in the Federal Depository Library Program (FDLP),7 and most other member libraries include federal documents in their collections. In addition to participation in the FDLP, many HathiTrust members also belong to consortia and organizations that have made significant contributions to the digital documents landscape. Among these are ASERL’s (Association of Southeastern Research Libraries) Centers of Excellence for cataloging documents,8 TRAIL’s digitization program, BTAA’s digitization progress in Google partnerships, and the University of California’s FedDocArc project to archive print and digital versions of federal documents.9 HathiTrust member libraries have played a role in all of these collaborative activities.

Several important factors have driven the creation of a digital collection within HathiTrust. Over time, sizeable collections of documents have accumulated in libraries, taking up costly shelf space, and there has been a strong feeling from the libraries that documents are underused compared to their value. This state of affairs is described in a recent paper by Mike Furlough, HathiTrust’s Executive Director:

These collections... are notoriously challenging for general users to access due to complexities of publication history, cataloging, and format. The historic run of these print publications contains an enormous trove of information about US and international history, policy, economics, science, and law.10

To solve these challenges, the HathiTrust libraries have focused on digitization and aggregation to improve access and provide more flexibility to manage print collections.

HathiTrust currently includes over 760,000 digitized federal documents that will serve as a base for future expansion. Although this enormous collection has accumulated as a result of mass digitization projects, it has also grown from the inclusion of a large number of documents digitized in collaboration with TRAIL, and from individual libraries that have digitized their collections locally and deposited them into HathiTrust.

U.S. Federal Documents Registry

The collection continues to grow via mass digitization, but in order to reach the goal of comprehensiveness, more focused collection development will be necessary. Due to varying cataloging practices, the biggest challenge to building a comprehensive collection of federal documents is understanding the full spectrum of documents that exist. As described in the recent paper Detecting US Federal Documents to Expand Access, “a major component of HathiTrust’s program has been the development of the US Federal Documents Registry, envisioned as a reliable inventory of items published at the expense of the US government.”11

The Registry database began with a set of over twenty million records contributed by forty libraries. It is intended to provide a full inventory of titles and volumes associated with those titles, and now includes 5.3 million records that have been consolidated via bibliographic analysis to de-duplicate and detect relationships. A primary use case for the Registry is to identify U.S. federal documents held in libraries but not yet digitized and deposited into the HathiTrust repository. The Registry holds promise for comparison of library holdings to HathiTrust and to the full inventory of federal documents, as well as support for HathiTrust’s ability to create definitive collections. A user interface has been developed for the Registry, enabling librarians or end users to search the database. Future Registry use cases and development are currently being evaluated, including those related to metadata remediation and enhancement.

Discovery and Access Benefits

The size of the collection and aggregation within HathiTrust provide significant benefits to libraries and end users. HathiTrust provides a number of end-user services across its entire collection including basic and advanced bibliographic search, full text search, a collection-building tool, and services for blind and print-disabled users. In particular, the ability to search the full text of over 750,000 federal documents is a major benefit that dramatically increases the research value of federal documents to users.

In keeping with its public service mission, HathiTrust takes a broad approach to access, opening up materials to the extent permitted by law. In general, the viewability of digital volumes within HathiTrust is based on bibliographic metadata:

All objects in the archive are either in the public domain, have the necessary permissions to support the level of access afforded, or are simply archived in such a way as to ensure an enduring copy of the content. HathiTrust only provides access to those publications where permitted by law or by the rights holder.12

Most U.S. federal documents within HathiTrust are fully viewable by readers, with the exception of a small number of government entities whose publications are subject to copyright, such as the Smithsonian Institution.

HathiTrust is architected so that external services can incorporate and rely upon connections into its collection and services. Every item and every page in HathiTrust has a persistent URL, enabling reliable linking, either individually in web , , or , or programmatically within services. Similarly, HathiTrust’s “search widgets” allow a search box to be embedded in another location. HathiTrust also offers freely available data that may be incorporated into other tools and services such as library catalogs, discovery services, and link resolvers, enabling wider discovery. For example, via this data, the HathiTrust collection, including federal documents, is surfaced within the Digital Public Library of America (DPLA),13 alongside the DPLA materials.

Current Goals and Activities

Collection Building

The U.S. Federal Documents Program is an endeavor to leverage the enormous collaboratively built mass-digitized HathiTrust corpus to develop a more specific collection. In order to build a comprehensive collection of U.S. federal documents, HathiTrust needs to both draw upon the mass-digitized collection and extend it. The program’s primary collection development focus is digitized versions of U.S. federal documents distributed in print by the U.S. Government Publishing Office (GPO), but documents distributed by federal agencies outside the GPO will also be included. HathiTrust intends to expand the collection via digitization and deposit of already-digitized documents, and, later, inclusion of born-digital documents.

The program is beginning to tackle collection development by undertaking an analysis to profile the HathiTrust federal documents collection as it stands today. An overall goal of the project is to test the ability to differentiate and characterize the collection based on the current state of bibliographic and other data within HathiTrust and the Registry. We are analyzing the collection on characteristics indicating provenance of the digital object (such as contributing institution, digitization agent), bibliographically determined characteristics (such as agency, SuDoc (Superintendent of Documents) number, publication date, languages), and usage data. The analysis will establish benchmarks for the collection, and enable HathiTrust staff to identify specific opportunities for collaborative collection building and refinement. Another intended outcome is to determine the best method for providing regular updates of the descriptive analysis of the collection, for the benefit of HathiTrust members and ultimately the greater library community.
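As a minimal illustration of the kind of collection profile described above, the following Python sketch tallies a handful of invented records by contributing institution, digitization agent, agency, and publication decade. The records, field names, and function names are assumptions made for illustration only; they do not reflect HathiTrust’s or the Registry’s actual data model or tooling.

```python
from collections import Counter

# Hypothetical sample records. The field names are illustrative only and are
# not HathiTrust's or the Registry's actual schema.
SAMPLE_RECORDS = [
    {"contributor": "Library A", "agent": "google",
     "agency": "Geological Survey", "pub_year": 1952},
    {"contributor": "Library A", "agent": "google",
     "agency": "Geological Survey", "pub_year": 1958},
    {"contributor": "Library B", "agent": "internet archive",
     "agency": "Bureau of Mines", "pub_year": 1931},
    {"contributor": "Library B", "agent": "local",
     "agency": None, "pub_year": None},
]


def profile_by(records, field):
    """Count records per value of one field; missing values count as 'unknown'."""
    return Counter(record.get(field) or "unknown" for record in records)


def profile_by_decade(records):
    """Count records per publication decade, e.g. 1952 -> '1950s'."""
    counts = Counter()
    for record in records:
        year = record.get("pub_year")
        label = f"{year // 10 * 10}s" if isinstance(year, int) else "unknown"
        counts[label] += 1
    return counts


if __name__ == "__main__":
    for field in ("contributor", "agent", "agency"):
        print(field, dict(profile_by(SAMPLE_RECORDS, field)))
    print("decade", dict(profile_by_decade(SAMPLE_RECORDS)))
```

Counting missing values as "unknown" rather than dropping them keeps gaps in the bibliographic data visible in the profile, which matters when the quality of the metadata is itself one of the things being measured.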


Through the Federal Documents Program, HathiTrust will look for ways to coordinate digitization efforts to ensure that the resulting digitized documents continue to aggregate in HathiTrust. The decision to deposit digitized collections rests with individual HathiTrust member libraries; however, the goal of a comprehensive federal documents collection is a powerful incentive to mobilize those libraries to fill in the gaps. HathiTrust also plans to seek additional paths to collaborate on large-scale digitization projects, keeping in mind economy of scale and cost-effectiveness. Additionally, the program will seek to encourage and coordinate local digitization projects at HathiTrust member libraries.

In order to build a comprehensive collection, in the future we may also look for possibilities to incorporate already digitized and born-digital documents into the HathiTrust Digital Library. To this end, HathiTrust may investigate collaborative projects to incorporate digital documents held by federal agencies or other entities, as well as opportunities related to web-archived documents.

Shared Print

In addition to , HathiTrust has initiated print preservation efforts by recently launching a Shared Print Program.14 The initial phase of the Shared Print Program, focused on monographs, encompasses monographic federal documents. As the two programs progress, they will coordinate to address similar collection analysis challenges. We will also look for synergies between programs in areas such as building infrastructure for databases, services and tools for libraries, as well as in developing analysis techniques and practices.

In addition to focused collection development, there are other issues that deserve and need attention as we build the corpus; many of these are important for the entire HathiTrust collection but are of particular significance for federal documents.

Metadata Challenges

The limitations of existing federal documents metadata present the biggest challenge to targeted collection building, since comparisons and analysis must be based on bibliographic data. For federal documents in particular, cataloging has not been consistent over time or across institutions, and is missing entirely for many documents that were originally organized on the shelf by GPO-assigned SuDoc numbers. Compounding this problem is the fact that the libraries that were likely to catalog documents were also not likely to organize their materials by SuDoc numbers, so might not have coded the record as a federal document. For documents published after 1976, the prospects improve, as that is when the GPO began cataloging and sharing metadata via OCLC.15

HathiTrust staff who created the Registry have learned a great deal by bringing together the many millions of records thought to be federal documents and comparing them to distill into Registry records for each document. We will apply our expertise in bibliographic data patterns and analysis, gained in Registry work, to the process of deduping between collections and HathiTrust, and identifying gaps. Examples of this expertise include matching by identifiers and reconciling enumeration and chronology for serial records.

The Registry itself, and the metadata within, will also continue to be refined. We have identified a variety of use cases for the Registry such as outward-facing services or tools for libraries to use, or as a basis for enrichment of metadata within the HathiTrust digital library and HathiTrust Research Center. An assessment of the Registry, currently underway, will determine a path for potential future development.

Improving Discovery and Access

HathiTrust’s current discovery environment offers a variety of services, but to a large extent, discovery and access within HathiTrust rely on the quality of existing metadata and digitization. Metadata-supported features in HathiTrust’s user interface include catalog search, search facets, and listing of serials in results. However, as previously noted, metadata quality is a particular concern for federal documents more so than for the broader HathiTrust collection. In addition to existing features, HathiTrust’s current user interface has room for improvement to better support federal documents, and better metadata could pave the way. For example, a SuDoc search, or the ability to filter search results to include or exclude federal documents, would greatly improve discovery. Additionally, better authority control for names of federal agencies and other entities would make it easier for end users to zero in on specific document sets.

Metadata remediation is also likely to open up more federal documents to end users. Since viewability in HathiTrust is based on bibliographic data, inaccurate and incomplete data can result in a limited view for end users. It is estimated that approximately 93,000 federal documents are currently in limited view in HathiTrust. This set of documents may include a large number of items without enough cataloging to distinguish them as federal publications. HathiTrust has a distributed user support service that fields inquiries from HathiTrust end users. Anecdotally, an overwhelming majority of the user support issues regarding federal documents involve identification of items as federal documents where HathiTrust does not designate them so.

Quality of digitization is also an important factor for access, and for discovery, since indexed text is derived via optical character recognition (OCR) from the images. HathiTrust has taken steps to address quality issues by forming several new groups. In September 2016 we launched a new Quality Assurance and Standards Working Group that will develop strategies, processes and techniques to make scalable improvements of “digital surrogate fidelity at the item/object level.”16 A new focused user support group will handle digitization and digital object composition errors such as missing pages. This work will have benefits for the full digital library collection and for the HathiTrust Research Center corpus, both encompassing federal documents.

Conclusion

HathiTrust is a community of libraries with shared needs and interests, pooling their resources for shared collections and services. By virtue of serving the interests of its membership, HathiTrust also creates a public good in making millions of digitized volumes available to end users. In a landscape where new U.S. federal information now originates primarily in digital form, many of the retrospective print publications remain to be digitized, and libraries continue to serve an essential role in collectively preserving and providing access to federal documents. HathiTrust members have designated federal documents as strategically important and look to HathiTrust to preserve both print and digital versions comprehensively, through this transition, and beyond.

HathiTrust’s development of shared digital and print collections gives member libraries a basis to make collection decisions, and sets the stage for documents to be discoverable in whatever context is needed. In addition, computational access to a collection of federal documents provides new opportunities for research and learning for scholarly purposes, and also may improve access within the digital library if research results in richer metadata or other new access methods.

A venue for digital versions of federal documents that exists outside of the government, and is developed and maintained collaboratively by libraries, is in itself an important public resource. Through its partners, HathiTrust has developed a valuable complement to the Federal Depository Library Program, expanding access beyond print distribution, and helping to realize the goal of wider access to federal information. The HathiTrust U.S. Federal Documents Program’s goal of a comprehensive collection is extremely ambitious, but ultimately will ensure that federal documents will be digitally preserved and made accessible to library patrons, online users, and citizens.


1 “Statistics and Visualizations,” HathiTrust, accessed October 21, 2016, https://www.hathitrust.org/statistics_visualizations
2 “Member Universities,” Big Ten Academic Alliance, accessed October 21, 2016, https://www.btaa.org/about/member-universities
3 “About TRAIL,” Center for Research Libraries, accessed October 25, 2016, https://www.crl.edu/grn/trail/about-trail
4 “Constitutional Convention Ballot Proposals,” HathiTrust, accessed October 21, 2016, https://www.hathitrust.org/constitutional_convention2011_ballot_proposals#proposal4
5 “Creating a Registry of U.S. Federal Government Documents,” HathiTrust, accessed October 21, 2016, https://www.hathitrust.org/usgovdocs_registry
6 “Government Documents Initiative Planning and Advisory Group Charge,” HathiTrust, accessed October 25, 2016, https://www.hathitrust.org/usgovdocs_planning_charge
7 “Federal Depository Library Program,” U.S. Government Publishing Office, accessed October 25, 2016, https://www.gpo.gov/libraries/
8 “Collaborative Federal Depository Program (CFDP): ASERL’s Plan for Managing FDLP Collections in the Southeast,” Association of Southeastern Research Libraries, accessed November 14, 2016, http://www.aserl.org/programs/gov-doc/
9 “UC Federal Documents Archive Report,” University of California Libraries, accessed November 14, 2016, http://libraries.universityofcalifornia.edu/content/uc-federal-documents-archive-report
10 Mike Furlough and Valerie Glenn, “Detecting US Federal Documents to Expand Access” (paper presented at the IFLA World Library and Information Congress 2016. Wednesday, June 29, 2016)
11 Furlough and Glenn, “Detecting US Federal Documents to Expand Access”
12 “Copyright,” HathiTrust, accessed October 25, 2016, https://www.hathitrust.org/copyright
13 “About,” Digital Public Library of America, accessed November 14, 2016, https://dp.la/info/
14 “Shared Print Program,” HathiTrust, accessed November 14, 2016, https://www.hathitrust.org/shared_print_program
15 “GPO Historic Shelflist,” U.S. Government Publishing Office, accessed November 14, 2016, https://www.fdlp.gov/project-list/gpo-historic-shelflist
16 “HathiTrust Quality Assurance and Standards Working Group,” HathiTrust, accessed November 14, 2016, https://www.hathitrust.org/qaswg_charge

Collaborative Librarianship 8(3): 124-129 (2016) 129