CHECKING IN WITH GOOGLE BOOKS, HATHITRUST, AND THE DPLA

By Naomi Eichenlaub

BIG DEVELOPMENTS OVER THE PAST 12 MONTHS HAVE INCLUDED THE LAUNCH OF THE DPLA, A PARTNERSHIP BETWEEN HATHITRUST AND THE DPLA, [AND] A SETTLEMENT BETWEEN PUBLISHERS AND GOOGLE BOOKS. …

COMPUTERS IN LIBRARIES | NOVEMBER 2013 | infotoday.com

Google Books and HathiTrust have been making headlines in the library world and beyond for years now, while a new player, the Digital Public Library of America (DPLA), has only recently entered the scene. This article will provide a "state of the environment" update for these projects, including project history and background. It will also examine some challenges common to all three projects, including copyright, orphan works, metadata, and quality issues.

Google Books

Let's begin with a bit of background on the Google Books project, which was officially launched in October of 2004 at the Frankfurt Book Fair. The Google Print Library Project, also known as Google Book Search, was announced 2 months later in December 2004. It included partnerships with a number of high-profile university and public libraries, including the University of Michigan, Harvard University, Stanford University, the Bodleian Library at the University of Oxford, and The New York Public Library.

The project quickly became controversial because of Google's plan to digitize not only works in the public domain but also titles still under copyright. Within a year of the project's launch, two lawsuits were filed against Google: a class-action suit on behalf of the Authors Guild and a civil lawsuit brought forward by the Association of American Publishers (AAP).

Despite the controversy surrounding Google Books, in 2006 and 2007 additional libraries announced partnerships with Google. In 2006, these included the University of California system, the University of Wisconsin–Madison, the University of Virginia, and the University Complutense of Madrid, which became the first Spanish-language partner for the project. In 2007, eight more libraries joined, including libraries in Germany, Switzerland, Belgium, and Japan, as well as the University of Texas–Austin, Cornell University, and Columbia University.

In October 2008, a hefty settlement decree, hundreds of pages in length and worth $125 million, was announced between Google, publishers, and authors in response to both lawsuits; it would eventually be rejected. The settlement would permit Google to sell entire books, offer subscription access to the full database, and allow up to 20% of the book to be viewed for free. In 2010, Google announced that it would launch a digital bookstore to be called Google Editions with all content hosted online in the cloud. At this point, Google had scanned more than 12 million books. This was also the year that it made the ambitious proclamation that it intended to scan all known existing books by the end of the decade, which, at the time of the announcement, numbered just less than 130 million.

In March 2011, a federal judge rejected the 2008 settlement on the basis of multiple objections. One year later, Google had scanned approximately 20 million books. In October 2012, Google and AAP finally reached a settlement in their 7-year copyright dispute with an agreement that allows users to browse up to 20% of a book's content and purchase digital copies through the Google Play service. At the time of writing this article, the legal dispute with the Authors Guild is still outstanding. However, a July 2013 development in the case saw a ruling that the Authors Guild's lawsuit cannot proceed as a class-action suit and that another trial is needed to determine the validity of Google's assertion that displaying excerpts or snippets of whole books online should be deemed fair use under U.S. copyright law. As of the time of writing, the Google database was rumoured to contain approximately 30 million scanned books.

'The time is close at hand when any student, in any part of the world, will be able to sit with his projector in his own study at his or her convenience to examine any book, any document, in an exact replica.'
H.G. Wells, World Brain, Methuen & Co. Limited, London, 1938

[Figure: Movie poster for the documentary that depicts Google as doing no good]

Google and the World Brain

Making the rounds as an official selection for a number of film festivals in 2013 is a documentary about the Google Books project titled Google and the World Brain, which is a reference to the H.G. Wells book published in 1938. The film, produced by Polar Star Films and B.L.T.V. and directed by Ben Lewis, is a Spain-U.K. co-production that premiered in January 2013 at the Sundance Film Festival. It also won Best Documentary at the Rincón International Film Festival in Puerto Rico in May 2013. It tells the story of the Google Books scanning project saga, which the film's website describes as "[t]he most ambitious project ever conceived on the Internet" and which the trailer describes as "[a] battle between the people of the book and the people of the screen." Framed from a vantage point that definitely leans toward depicting Google as an evil entity, the film illustrates the potential dangers inherent in Google's plan to scan the universe of knowledge. The reviews are favorable, and you can check the website (worldbrainthefilm.com) for a list of screenings near you.

HathiTrust

With Google scanning millions of books from the collections of research libraries, an inevitable question arose: What will happen to the scanned collections of the Google Books library partners if Google disappears? Enter HathiTrust.

HathiTrust Digital Library began in October 2008 as a collaboration of the 12 universities of the Committee on Institutional Cooperation (CIC), the University of California system, and the University of Virginia. According to the HathiTrust website, the focus of the initial collaboration was "preserving and providing access to digitized book and journal content from the partner library collections," which included materials digitized by Google, the Internet Archive, and Microsoft (both in copyright and public domain materials), as well as works digitized locally through in-house initiatives. It allowed institutions to build a repository to preserve and distribute digitized collections and develop "shared strategies for managing […] digital and print holdings in a collaborative way [in order to] ensure that the cultural record is preserved and accessible long into the future." Today, there are more than 80 institutions participating in the project, and membership is open to institutions worldwide.

[Figure: The HathiTrust website]

In terms of content in HathiTrust, we know that while Google scanned the contents of a number of large research libraries, Google Books also contains a large number of trade and more popular titles. According to an overview handout published by HathiTrust, "Many works that are available in
HathiTrust are not present in Google Books because they were not digitized by Google, or not available in Google Books because of differing rights determination processes. The largest categories of these include U.S. federal government documents and public domain works published in the United States after 1923." HathiTrust also asserts that its subject representation is similar to that of any large North American research library, and it estimates that it holds digital versions of roughly 50% of the print holdings of every large research library in North America. Data visualizations on its website show graphical representations of subject coverage by Library of Congress call number, as well as language and date coverage for the collection. At the time of writing, HathiTrust hosted more than 10.7 million total volumes and more than 5.6 million book titles.

According to HathiTrust's overview handout, in terms of its copyright status, its content is approximately 68% "in copyright" and 32% in the public domain. Of that 32%, 21% of the collection is public domain worldwide (of which approximately 4% comprises U.S. federal government documents), and 11% is public domain in the U.S. Approximately 12,000 volumes or 0.1% of the content is licensed as open access (OA), including Creative Commons-licensed materials.

To clarify, when HathiTrust uses the term "public domain worldwide," it means in the public domain for anyone anywhere in the world. In general, these are texts that were published in the U.S. prior to 1923 or published outside of the U.S. before 1873. The category also includes U.S. federal government documents. Works that are in the public domain only in the U.S. are available only from U.S. IP addresses.

Digital Public Library of America

The DPLA, launched in the spring of 2013, is building a national digital library that will collocate the metadata of millions of publicly accessible digital assets. Conceived by Robert Darnton of Harvard University, in part as a response to the more commercial Google Books endeavour, the DPLA aims to unify previously siloed large collections such as the Library of Congress, the Internet Archive, and various academic collections as well as to collocate the metadata of smaller institutions and historical societies. In an article for The New York Review of Books in April 2013, Darnton described the goal of the DPLA as "to make the holdings of America's libraries, archives, and museums available to all Americans, and eventually to everyone in the world, online and free of charge."

[Figure: DPLA's homepage]

According to Darnton's article, the DPLA comprises a distributed system of content hubs and service hubs, where the former are "large repositories of digital materials" and the latter are physical centers that will help "local libraries and historical societies to scan, curate, and preserve local materials." In June 2013, the DPLA announced a partnership with HathiTrust, which becomes one of its newest and largest content hubs with the addition of more than 3 million volumes.

Moreover, the DPLA describes itself as a platform that facilitates "new and transformative uses of […] digitized cultural heritage" with an "application programming interface (API) and open data [that] can be used by software developers, researchers, and others to create novel environments for learning, tools for discovery, and engaging apps." Ultimately, the DPLA could link with national collections around the globe. In fact, the DPLA infrastructure was designed to be interoperable with the Europeana cultural database, an aggregator of the digital cultural collections of more than 2,200 institutions across Europe. Darnton envisions that "[w]ithin a generation, there should be a worldwide network that will bring nearly all the holdings of all libraries and museums within the range of nearly everyone on the globe."

Common Challenges

Building these large collections of digital content is a massive undertaking and is, of course, not without major challenges. Both the HathiTrust and Google Books projects have faced many challenges already, including a number of common issues that we will look at now. Hopefully, the DPLA will be able to take advantage of lessons learned and avoid some of these issues.

Copyright. The Google Books lawsuits are described earlier. To recap, AAP and Google formally resolved their lawsuit in October 2012, but litigation between the Authors Guild and Google continues. A recent victory for Google was the July 2013 ruling that the Authors Guild's suit against Google cannot proceed as a class action.

HathiTrust has faced legal challenges as well. In September 2011, the Authors Guild filed a federal copyright infringement suit against HathiTrust, the University of Michigan, the University of California, the University of Wisconsin system, Indiana University, and Cornell University for storing digital copies of millions of books. In October 2012, a judge ruled against the Authors Guild in favor of the libraries. HathiTrust has a statement posted on its website regarding the ruling with a quote from Harold Baer, Jr., the presiding judge:

    I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants' MDP [Mass Digitization Project] and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts that at the same time effectuates the ideals espoused by the ADA [Americans With Disabilities Act].

The Authors Guild, however, filed an appeal in November 2012 and, consequently, litigation in the Authors Guild v. HathiTrust case is ongoing as of the time of writing. Meanwhile, academic authors have filed a brief in the case in support of the work of HathiTrust.

Orphan works. Another massive challenge, and one related to copyright, is the issue of orphan works. An orphan work is a copyrighted work for which the copyright owner cannot be contacted. For example, the copyright owner may have died, may be unaware of their ownership, or may even be a company that has gone out of business. In 2011, the University of Michigan Library's copyright office announced the launch of a new HathiTrust-funded research project to identify in-copyright orphan works in the repository and to begin making some of these titles available to members of the HathiTrust community. The program was halted by the University of Michigan shortly thereafter, however, and the orphan works identification process is currently being redesigned. At this point, the University of Michigan and HathiTrust have not made any works identified as orphans publicly available, and they have no plans to do so. The DPLA is also struggling with the challenge of orphan works, and the issue of how orphan works were handled by Google was a significant stumbling block in the rejected 2008 Google settlement.

Metadata. Another challenging area with large-scale digital initiatives is metadata. The Google Books project metadata has been described in the past, by Geoff Nunberg in a now-infamous 2009 blog post, as a "[m]etadata train wreck" (languagelog.ldc.upenn.edu/nll/?p=1701). In Google's defense, however, the extremely large scale of the project means that the number of errors will be high. A response in a comment thread from a Google Books manager, Jon Orwant, states that Google has "learned the hard way that when you're dealing with a trillion metadata fields, one-in-a-million errors happen a million times over."

The HathiTrust project has, not surprisingly, put great emphasis on providing metadata for its collection. Since its metadata originates from partner libraries, the libraries have the expertise and opportunity (more so than is the case with Google Books) to enhance existing print cataloguing and to optimize this bibliographic metadata for the digital world. An example of this is the data visualizations for call number, date, and language that are available on the HathiTrust website (hathitrust.org/statistics_info). The DPLA has a two-page metadata policy available on its website that details its "commitment to freely sharable metadata to promote innovation" (dp.la/info/wp-content/uploads/2013/04/DPLAMetadataPolicy.).

Quality. In terms of quality of large-scale digital libraries and digitization initiatives, the Google Books project in particular has been criticized over its digitization quality, as well as the quality of its metadata. Attention has been drawn to the poor quality of some page scans and to the unreliable and error-laden optical character recognition (OCR), which is the process that makes text machine readable. A 2012 paper published in Literary and Linguistic Computing, written by Paul Gooding (titled "Mass Digitization and the Garbage Dump: The Conflicting Needs of Quantitative and Qualitative Methods"), attributes the quality issues not only to the scale of these projects but also to the desire to "digitize first and worry about quality later."

HathiTrust has a statement about its dedication to quality on its website. It is committed to ensuring optimum quality of the content in its repository by applying formal quality review to all content submitted. Discussions around quality and digital repositories underscore the importance of certification for repositories. Digital repository certification is becoming increasingly important as libraries put more of our collections online. TRAC (Trustworthy Repositories Audit and Certification) is a process of audit and certification for digital repositories. The criteria were developed in part by OCLC and the Center for Research Libraries, and version 1.0 was published in 2007. HathiTrust was certified in March 2011. In Canada, the Ontario Council of University Libraries' Scholars Portal project, a platform that preserves and provides access to the information resources collected and shared by Ontario's 21 university libraries, became the first certified trustworthy digital repository in Canada in early 2013.

Conclusion

This article has looked at three large-scale digitization initiatives: Google Books, HathiTrust, and the DPLA. They are all unique projects with unique goals that, at the same time, struggle with common challenges. Certainly, there is an enormous amount of value in having massive collections of digital books at our fingertips, despite the aforementioned challenges. The year 2013 provides a useful vantage point for looking into these projects, especially since they range in age from nascent to nearly 1 decade. Big developments over the past 12 months have included the launch of the DPLA, a partnership between HathiTrust and the DPLA, a settlement between publishers and Google Books, and the release of a documentary about the Google Books project. Robert Darnton writes that "the DPLA took inspiration from Google's bold attempt to digitize entire libraries, and [DPLA] still hopes to win Google over as an ally in working for the public good." There are undoubtedly further developments just around the corner.

Naomi Eichenlaub (neichenl@ryerson.ca) is the electronic resources access and discovery librarian at Ryerson University Library, Toronto. She has been managing access to ebooks at the college since 2008 as catalogue librarian. She is currently covering a maternity leave as e-resources librarian.
