
An Examination of Massive Digital Libraries' coverage of materials: Issues of multi-lingual accessibility in a decentralized, mass-digitized world

Andrew Weiss
Oviatt Library
California State University, Northridge (CSUN)
Los Angeles, CA
[email protected]

Ryan James
Hamilton Library
University of Hawai'i at Manoa
Honolulu, HI, United States
[email protected]

Abstract—Google has been shown to have limitations in the coverage and access of Hawaiian and Pacific books. This paper investigates the results of an ensuing study that examines the coverage and accessibility of Spanish language books in four Massive Digital Libraries (MDLs): Google Books, HathiTrust, Internet Archive, and Open Library. A random sample of 1,200 books was taken from a Master's-degree awarding university library in the United States. In this study, 400 titles from the library's general collection were compared to 400 Spanish-language titles. Levels of accessibility for each title in the four MDLs were recorded, ranging from no record, record only, and partial view to full view. Results showed little difference in accessibility between Spanish and English books in Google Books (7% Spanish fully accessible vs. 8% English). However, differences in accessibility were found in the Internet Archive (2% vs. 8%), HathiTrust (6% vs. 11%), and Open Library (5% vs. 16%). Issues related to accessibility and non-English collections in MDLs are also addressed.

Keywords—massive digital libraries; accessibility; collection development; multiculturalism

I. INTRODUCTION

A. Introducing Massive Digital Libraries (MDLs)

In a prior study, Weiss and James introduce the term Massive Digital Libraries (MDLs) to describe the extremely large digital collections (>1,000,000) of digitized print content aggregated from multiple mass-digitization projects.[1] Google Books, HathiTrust, Internet Archive and the Open Library (formerly part of [...]'s now-defunct Open Content Alliance) are four current exemplars of the trend toward the mass-digitization and online dissemination of previously-published print book material.

Each project offers a slightly different view of the Massive [...]. Google Books, which includes 20 institutional and consortia partners from the United States and seven other countries, focuses on accessibility of all texts of all subjects equally like a “Super-Public Library”; the HathiTrust primarily focuses on accessibility of academic library texts; the Internet Archive focuses on collecting digital books in the public domain (like [...]); and the [...] functions like an online catalog along the lines of OCLC's WorldCat, but with access to almost 48,000 digitized books for at-large users and subscribers.

MDLs have certain advantages. MDLs collect and provide access to disparate collections of content. They allow access with no pay walls. They provide a post-modern mix of high and low culture without passing judgment on content, leaving the search algorithm to frame results in terms of "relevance" rather than cultural importance.

MDLs, however, currently display certain flaws. Copyright concerns have especially impacted [...] and the HathiTrust, with publishers and authors filing suit against these projects for [...].[2][3] Furthermore, metadata in Google Books has been shown to be error-prone.[4][5] Google Books also shows problems in quality control of digitized output.[6]

B. Online Accessibility & MDLs

In a previous study by the authors, Google Books Search also showed a particular lack of diversity in Hawaiian and Pacific subject matter.[1] This lack of diversity points towards the flaws of collection development policies that fail to see the problems of a diffuse and largely unknown audience.

Google Books relies entirely on its partners for the metadata and the digital files. This can be problematic in terms of diverse collections since the aggregation of content amplifies gaps in the source collections, a problem Jeanneney predicted as early as 2005.[7]

After finding a noticeable lack of coverage of Hawaiian and Pacific content in Google Books, the authors endeavored to look further into the subject of diversity by examining the coverage of non-English languages.

C. Problem scope

The researchers asked whether sufficient access to collections would extend to languages other than English. Spanish was chosen for this study as it is the second-most spoken language in the United States and is also a primary language of the largest historically-underrepresented group, Hispanics.[8] California State University, Northridge (CSUN) was chosen as a collection to study due to its designation as a Hispanic-Serving Institution (HSI).[9] Its Spanish-language collection would likely be large and deep enough to provide useful comparisons.

Ultimately, it was posited that looking at how these four MDLs, with their decidedly different missions and values (some are profit-driven, some are education-driven), handle diversity with Spanish-language collections will help end users, librarians and information specialists better judge their efficacy and overall usefulness.

II. METHODOLOGY

An initial random sample of 600 books was taken from each of two collections (1,200 total), one general collection and one Spanish collection, at the library of California State University, Northridge (CSUN), a large state comprehensive master’s-granting university in Los Angeles, California. This study includes 400 titles from each sample. The authors will subsequently expand to look at the remaining 200 in each sample. Selected items were queried in four Massive Digital Libraries: Google Books, HathiTrust, Internet Archive, and Open Library. Searches were then conducted in each MDL using the titles, author names, publication dates and occasionally publisher names in order to determine exact editions of a text.

Matching item records found in each MDL were evaluated on their level of accessibility: ‘No Record,’ ‘Record Only,’ ‘Snippet’ (Partial view), ‘Preview’ (more extensive but still limited view of content), and ‘Full View’. ‘No Record’ signifies any item records that could not be found during the queries. ‘Record Only’ signifies that the MDL provides access only to a metadata record and not the book’s text. ‘Snippet’ or ‘Partial View’ allows for full-text searching but displays only a small part of the text, usually a few lines. In the case of HathiTrust, no text at all is visible, but searches can reveal the “word count” for the number of times a keyword is found in the text. ‘Preview’ allows for full-text searching but provides a larger amount of readable, though still not unlimited, text. ‘Full View’ provides full-text search as well as access to the entire item for unrestricted viewing.

Selected item records were recorded at their highest level of access in each MDL for the specific edition found in the CSUN library catalog and categorized once. Additionally, if a different edition of the book was available, that edition’s highest level of access was also noted. Multi-volume works were also noted for their levels of access in each MDL; if different editions were found, these were also noted for their accessibility.

There are no inter-coder statistics available because all coding and categorizing of the data were done by one of the investigators.

III. RESULTS

The following accessibility levels, with a 4.9% margin of error at .95 confidence (N=400), were found in the four MDLs for both English and Spanish-language books.

A.1: Google Books – English texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      34     8.5%      | 12     3%
Record Only:    86     21.5%     | 77     19.25%
Partial:        250    62.5%     | 265    66.25%
Preview:        11     2.75%     | 13     3.25%
Full View:      19     4.75%     | 33     8.25%

A.2: Google Books – Spanish texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      50     12.5%     | 17     5%
Record Only:    75     18.75%    | 65     16%
Partial:        269    68.5%     | 287    73%
Preview:        3      0.75%     | 4      1%
Full View:      3      0.75%     | 27     7%

B.1: HathiTrust – English texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      113    28.25%    | 92     23%
Record Only:    n/a    0%        | n/a    0%
Partial:        250    62.5%     | 259    64.75%
Preview:        n/a    0%        | n/a    0%
Full View:      37     9.25%     | 49     12.25%

B.2: HathiTrust – Spanish texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      126    31.5%     | 97     24.25%
Record Only:    n/a    0%        | n/a    0%
Partial:        267    66.75%    | 275    68.75%
Preview:        n/a    0%        | n/a    0%
Full View:      6      1.5%      | 28     7%

C.1: Internet Archive – English texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      370    92.5%     | 357    89.25%
Record Only:    n/a    0%        | n/a    0%
Partial:        n/a    0%        | n/a    0%
Preview:        n/a    0%        | n/a    0%
Full View:      30     7.5%      | 43     10.75%

C.2: Internet Archive – Spanish texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      393    98.25%    | 377    94.25%
Record Only:    n/a    0%        | n/a    0%
Partial:        n/a    0%        | n/a    0%
Preview:        n/a    0%        | n/a    0%
Full View:      7      1.75%     | 23     5.75%

D.1: Open Library – English texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      57     14.25%    | 52     13.5%
Record Only:    320    80%       | 282    70.5%
Partial:        n/a    0%        | n/a    0%
Preview:        n/a    0%        | n/a    0%
Full View:      23     5.75%     | 64     16%

D.2: Open Library – Spanish texts
                Exact edition    | Any edition
Accessibility   N      %         | N      %
No Record:      80     19.5%     | 72     18%
Record Only:    318    80%       | 309    77.25%
Partial:        n/a    0%        | n/a    0%
Preview:        n/a    0%        | n/a    0%
Full View:      2      0.5%      | 19     4.75%

Fig. 1. The four MDLs examined each show varying results in terms of accessible English and Spanish language texts. Exact editions are also compared with any edition for a book title, revealing marked improvements in coverage.
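The 4.9% margin of error quoted for N=400 at .95 confidence is consistent with the usual worst-case formula for a sampled proportion. The paper does not state how the figure was derived, so the following is only a minimal sketch under stated assumptions: simple random sampling, a critical value of z = 1.96, the most conservative proportion p = 0.5, and no finite-population correction. The function names `margin_of_error` and `percent` are hypothetical helpers introduced here for illustration.

```python
import math


def margin_of_error(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a proportion estimated from n sampled items.

    Assumptions (not stated in the paper): simple random sampling,
    z = 1.96 for .95 confidence, and the worst-case proportion p = 0.5.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def percent(count: int, n: int) -> float:
    """Share of the sample, in percent, that falls in one accessibility level."""
    return 100.0 * count / n


if __name__ == "__main__":
    n = 400  # titles examined per language sample
    print(f"margin of error: {margin_of_error(n):.3f}")   # 0.049
    # Full View, exact edition, Google Books English texts (table A.1): 19 of 400
    print(f"full view share: {percent(19, n):.2f}%")      # 4.75%
```

Under these assumptions, each percentage in Fig. 1 could be read as accurate to within roughly plus or minus 4.9 percentage points; differences smaller than that between the English and Spanish samples may not be distinguishable from sampling noise.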

IV. DISCUSSION

A. Accessibility Issues in MDLs

The results show little difference between the English and Spanish samples in the overall number of books that have been digitized. The Internet Archive and the Open Library do not provide Partial or Preview access to non-public domain digitized books; HathiTrust and Google Books each show that roughly 70% of the sampled books had been digitized. Yet, among these sampled books, 85-90% ultimately fall under limited access or no access. In most cases limited access appears to be as good as no access, with either only a few sentences available for users to sample or only blind text searching. This practice provides less accessibility than if one were to physically browse books on a library shelf.

As for full-text accessibility, however, there were some differences between the four MDLs. Google Books showed little difference in the amount of openly accessible English or Spanish language books once multiple editions were accounted for. HathiTrust shows a significant difference in the amount of full-text English and Spanish texts. Internet Archive shows a large difference along the same lines as the HathiTrust. The largest difference appears in the Open Library’s accessibility of these texts.

There are differences in the rates of accessibility for exact editions and other editions. Google Books showed the largest increase among Spanish texts in terms of accessing full-text items, growing from 1% (N=3) to nearly 7% (N=27). This increase suggests that the larger number of digitization partners could improve the user’s likelihood of finding the text in some form or another. Likewise, the Open Library also showed significant increases in the amount of fully readable English texts when multiple editions were factored into the count, especially those available in DAISY format for the blind.

B. Multiculturalism and MDLs

While the authors' previous study of Google Books displayed a significant difference in the amount of Hawaiian/Pacific materials, this study does not conclusively show that there is a difference in the coverage between English and Spanish books in that MDL. The other three MDLs show differences in the amount of full-text accessible English and Spanish texts.

There are some likely reasons for this. Google Books has partnered with several Spanish universities as well as universities in the United States with significant holdings in the Spanish language, including the University of Texas at Austin, National Library of Catalonia and Complutense University of Madrid.[10] The amount of material in the Spanish language from these partnering institutions should be sufficient to cover the needs of the typical U.S.-based user.

The HathiTrust, however, partners primarily with U.S. universities, though there is overlap with some of Google's partners, including Complutense University of Madrid.[11] While Spanish is a major language world-wide, it is still not a primary language in academia, unlike French and German. The Open Library has also based most of its partnerships on U.S. partners, with only a handful of partners in other countries.[12] The Internet Archive does have numerous partnering institutions and projects, but a very limited number (<3,000) of Spanish-language texts.[13]

Multicultural collections will likely develop if MDLs make conscious decisions to partner with enough institutions that have deep collections of the target languages. The HathiTrust’s differences in coverage could be improved by working with more international partners. The coverage of a language will then depend on which countries are also represented as partners. Google Books has a greater number of partners in the international arena, especially in the Spanish and German languages.

C. Multi-volume works & accessibility in MDLs

One issue that complicates the study, however, is the accessibility of multi-volume works. With the exception of the HathiTrust, the MDLs showed a lack of organization and often an inability to distinguish between multiple volumes of a single work in both languages. The HathiTrust lists all volumes of a work within one record, allowing end users to see the full extent of the work. However, Internet Archive, Open Library and Google Books often do not.

The least organized of the MDLs is Google Books, in which it is nearly impossible to determine the true number of volumes of a multi-volume work. Each volume is given its own record page within Google Books. The situation is muddled further when multiple editions of multi-volume works are searched. The result set is often a mix of two or more volumes and editions, with end users unable to distinguish between editions. The most egregious example of this confusion occurs when querying the fifty volumes of the Harvard Classics and their multiple editions.

V. CONCLUSION

A. Four MDLs, four models

The four MDLs represent four different examples of how large collections of aggregated material can be used to provide users with digital versions of previously printed books. The model breaks down, however, when source materials are lacking. In comparing Spanish and [...] materials, Google Books showed the most even coverage, while the HathiTrust, Internet Archive and Open Library showed larger differences in full-text accessibility. However, all four MDLs often provide access to more than one version of a text. This is encouraging for long-term usability.

B. Limitations

The study investigates only a small sample of works in the overall corpus of digitized books. Current estimates suggest that close to 30 million books have been digitized. Larger samples would yield more accurate rates of accessibility.

Other university catalogs with large Spanish-language holdings could also be investigated and compared to the CSUN catalog's findings. Finally, using broader catalogs might alter results. Using a university catalog as a quasi-control could skew results in favor of the HathiTrust, given that the source material for that MDL is primarily from universities.

C. Future directions

Future studies will focus on other languages and will thus allow the researchers to reexamine their conclusions in this context. In particular, the researchers intend to examine the coverage of materials in [...] MDLs and the Google partnership with Keio University. Furthermore, it will be important to study how under-represented groups in multi-cultural countries can begin to utilize, impact and demand greater access to mass aggregations of content.

Copyright and orphan works remain an obstacle to the growth of MDLs. A clearer investigation of public domain, orphan works and MDLs is necessary.

ACKNOWLEDGMENT

Andrew Weiss thanks Eric Willis for his work in procuring the sample from the CSUN Oviatt Library's Integrated Library System.

REFERENCES

[1] A. Weiss and R. James, "Assessing the coverage of Hawaiian and Pacific books in the Google Books digitization project," OCLC Systems and Services, vol. 29, iss. 1, pp. 13-21, January 2013.
[2] J. Howard, "Publishers Settle Long-Running Lawsuit Over Google's Book-Scanning Project," Chronicle of Higher Education, October 4, 2012, http://chronicle.com/article/Publishers-Settle-Long-Running/134854/
[3] A. Hufford, "‘U’ wins copyright lawsuit against Hathitrust digitalization project," The Michigan Daily, October 11, 2012, http://www.michigandaily.com/news/10-hathitrust-ruling-11
[4] G. Nunberg, "Google Books: a metadata train wreck," Language Log, http://languagelog.ldc.upenn.edu/nll/?p=1701
[5] R. James and A. Weiss, "An assessment of Google Books' metadata," Journal of Library Metadata, vol. 12, iss. 1, pp. 15-22, February 2012.
[6] R. James, "An assessment of the legibility of Google Books," Journal of Access Services, vol. 7, iss. 4, October 2010.
[7] J. Jeanneney, Google and The Myth of Universal Knowledge, Chicago, [...] Press, 2007.
[8] United States Census Bureau, "Top languages other than English spoken in 1980 and changes in relative rank, 1990-2010," Census.gov, February 14, 2013, http://www.census.gov/dataviz/visualizations/045/
[9] K. Dabbour, "Hispanic Serving Institutions (HSI) Grant Project," Oviatt Library, 2013, http://library.csun.edu/About/HSI
[10] "Library Partners," Google Books, 2013, http://books.google.com/googlebooks/library/partners.html
[11] "Partnership Community," HathiTrust, 2013, http://www.hathitrust.org/community
[12] "Participating Libraries," Open Library, 2013, http://openlibrary.org/libraries
[13] "Community Spanish Texts," Internet Archive, 2013, ://archive.org/details/opensource_Spanish