Clair / J Zhejiang Univ-Sci C (Comput & Electron) 2010 11(11):919-922 919

Journal of Zhejiang University-SCIENCE C (Computers & Electronics) ISSN 1869-1951 (Print); ISSN 1869-196X (Online) www.zju.edu.cn/jzus; www.springerlink.com E-mail: [email protected]

Personal View: Challenges in sustaining the Million Book Project, a project supported by the National Science Foundation

Gloriana St. CLAIR
Director, Universal Digital Library Project
Dean, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
E-mail: [email protected]
doi:10.1631/jzus.C1001011

© Zhejiang University and Springer-Verlag Berlin Heidelberg 2010

One of the main roles I have played as a director of the Universal Digital Library has been to write grant proposals to support our work. Both for this project and for another project, Olive.org, an archive of executable content, how to sustain the final product is the most difficult challenge. This paper discusses the various models that might be adopted to sustain a large corpus of digital material, such as that of the Million Book Project. Methods discussed here include government funding, foundations and nonprofits, university homes, and joining existing projects. All individuals working with large digital projects should be concerned about how their work will be kept available to the public.

Government funding. Many of the partners in this project have benefited liberally from government funding. The Chinese partners have had significant government support through several successive Ministry of Education five-year plans. The Indian government has supported the project with funding for language translation research projects. The Egyptian government funded the creation of the Bibliotheca Alexandrina and continues to contribute to it. In the U.S., the National Science Foundation supported equipment, travel, and meetings.

This support has been essential to the creation of this large corpus of material. The governments very wisely invested in bringing educational and cultural resources to a large segment of their constituents. The disadvantage is that, as government budgets tighten, the funding necessary to sustain a project can be lost. One great advantage of government funding is that the government wants to serve the whole public. Beginning this year, in the U.S., the National Science Foundation now requires principal investigators to explain how the data they have collected will be made available to the larger research community and how it will be sustained. In the U.S., the government also wants free-to-read access and at the same time allows creators to charge for enhanced versions.

Foundations and other not-for-profit organizations. Foundations, like the government, are excellent sources of support for the initiation of a large digital project. They have the vision to see what could be accomplished by increasing progress in selected disciplines, such as high-energy physics and astrophysics, and broadening the availability of educational resources. JSTOR and ArtStor are two resources initially supported by the A.W. Mellon Foundation.

The Qatar Foundation gave funding to create the Qatar Arabic and Islamic Heritage digital collections. Because that collection so actively reflects the country and region’s culture and because the Qatar Foundation is so focused on educational goals, they are more likely than other foundations to sustain it. Other foundations, such as A.W. Mellon, require that sustainability models be explained before they will fund the initial project. Mellon has been particularly focused on the issue of sustainability.

Some electronic products and services found in U.S. academic libraries are licensed through consortia and some come from not-for-profit organizations. One of the more popular ones is JSTOR, a database of articles in journals in a wide variety of fields.

Originally, all the articles in this database were five years old or older, but this year, some publishers have begun putting more current material into JSTOR. The Online Computer Library Center (OCLC) is another prominent not-for-profit organization. Each of these organizations does realize enough ‘profit’ to grow and to maintain a significant reserve.

These organizations fund themselves by selling subscriptions, services, and products. In OCLC’s case, a membership fee also exists. This approach has been most successful because libraries need the content provided and can pay the fees necessary. Chinese partners have created a licensed resource, and the inclusion of the Million Book Project in that resource provides a good sustainability plan for that part of the corpus. Of course, when materials become licensed, they are often no longer free to read. The challenge of a licensed database is that a significant organization may be required to select and administer the resource, unless the corpus can be placed with an existing organization.

University homes. The initial vision we had for sustaining the Million Book Project was that it would have a permanent home in the School of Computer Science. The Universal Digital Library (UDL) directors observed that the price of storage was falling steeply and thought that, even though the corpus was large, funding would be available to purchase storage. However, storage was not the only resource needed to sustain the corpus. A system manager to curate the data (to ingest, back up, regularly review, and respond to queries) was also needed. When that position was lost, graduate students began to fill in, but their primary attention is elsewhere. The result did not meet standards for persistent access. To date, the libraries, which are committed to long-term, 24/7 access, have not had the resources to be able to step up to this challenge.

One particularly successful example of a large, extremely popular digital resource is arXiv, a repository of preprint articles in high-energy physics and related fields. With the leadership of Paul Ginsparg, the repository was originally created at Los Alamos with government funding. The free-to-read nature of this article repository does foster efficient progress in the field. Librarians who were concerned for its sustainability were relieved when Cornell University gave arXiv a more permanent home. However, this year, arXiv has begun aggressively asking for academic libraries to contribute to arXiv’s upkeep. Thus, this project, initially funded by the government, then hosted by a university, now appears to be moving towards a subscription-like model.

Universities have much to offer as homes for digital projects because historically they have been stable. As creators of new knowledge, which inevitably is related to and derives from older learning, universities, and especially their libraries, care about the preservation of knowledge. Nevertheless, resources for funding are scarce and are expected to continue to be scarce.

Joining existing projects. Another option is to join an existing digital project that has already solved the sustainability problem. Three alternatives are Wikibooks, the Open Content Alliance, and the Google Books Project.

1. Wikibooks. According to its Web page, Wikibooks is a collection of textbooks. If they are ingesting only textbooks as content, then only a small fraction of the existing million book corpus would be ingested. As part of [...], Wikibooks is a nonprofit and appears to rely on contributions to sustain it. As long as it remains the preeminent online ‘pedia’, it may be sustainable. The free-to-read model is characteristic of Wiki resources.

2. Open Content Alliance (OCA). OCA is also a nonprofit, associated with the Internet Archive. [...] has long been a partner and fellow traveler with the Million Book Project. At the founding of OCA, he ingested materials collected from [...] and those materials are still part of OCA. At our 2007 Pittsburgh meeting, the partners agreed to become a part of OCA, but OCA has not actively followed up on that decision. Certainly, the Internet Archive does plan on sustaining itself long term.

3. Google Books. The U.S. directors of the UDL project all believe that giving Google non-exclusive access to our corpus is the best alternative. We believe that not only would the corpus be maintained long term but also that the materials would receive maximum use because of the popularity of the Google search engine. Many research studies show that U.S. students and faculty both go directly to the Web and a majority of them directly to the Google search engine as their first source of information. Placing our content where it can be most easily found and used will be the most successful means of achieving our original goal.

Google is an extremely successful for-profit company whose corporate philosophy mirrors that of the Million Book Project. Their aim is “to organize the world’s information and make it universally accessible and useful” (Google Books Mission, available from http://books.google.com/googlebooks/agreement/#6).

They do make money through advertising from the over five million volumes they have already digitized. This revenue stream provides both an incentive and a practical resource for the sustenance of Google Books. The Google collection grew from digitizing books at their partner libraries, with the University of Michigan contributing the most volumes. Our Chinese and Indian language books would complement the existing western-focused collection.

The consensus is that there are perhaps 100 million books in the world. This figure is based on the size of the OCLC Worldcat database and a perception that worldwide, many, many books exist which are not in U.S. and European libraries, whose collections are in Worldcat. A non-exclusive partnering with Google Books at this time would be significant as compared with a later time when the database is much larger.

Google and the Google Books Project have many critics. Three major issues, discussed briefly below, are privacy, net neutrality, and the Google Book Settlement.

Privacy is a valid major concern because the success of Google’s marketing of advertising revolves around the company’s ability to target ads to those who have demonstrated interest in the product. Google does track, analyze, and profit from its knowledge of individual interests. One of the values of U.S. libraries has been to protect individual privacy. U.S. libraries typically do not reveal the searching and check-out behavior of their constituents; we are even careful to erase check-out records for our integrated library system so that we cannot be forced to divulge this private information. Perhaps many would find these values and practices old fashioned.

Certainly, the advent of social networking tools, such as Facebook, Twitter, and YouTube, represents a different approach to personal information. These tools encourage the sharing of private information at a level some would consider both profligate and tedious. Societal norms around privacy issues are changing, and in that changed environment, individuals seem willing to exchange personal information for focused information, including advertising, on areas of interest.

Net neutrality is a stance that libraries and computing organizations have taken vis-a-vis the governance of the Web. These organizations argue that research libraries and higher education institutions are enormous providers of content and applications. The information thus provided fosters research, creativity, and education, and should be allowed to flow freely. They believe that Verizon and Google would like to prioritize content from their affiliated and fee-paying sources, relegating other content to a slower delivery system. They believe that Google and Verizon want to establish a second Internet with expensive, discriminatory wireless services for those who can pay, with content deriving primarily from their paying sources.

Conversely, an Engadget article by Nilay Patel (August 5, 2010) reports that Google CEO Eric Schmidt said repeatedly on the call that “Google will never pay for prioritized access and Google products would remain on the public Internet” (http://www.engadget.com/1020/08/09/google-and-verizons-net-neutrality-proposal-explained). Schmidt portrayed Google as a watchdog on Verizon to make sure that nothing untoward happens to the public Internet.

The Google Book Settlement is still unresolved as of September 2010. Several issues continue to concern those engaged around the creation of digital libraries. My colleague, Denise Troll Covey, identified these concerns about the settlement:
(1) Library partners signed on to pursue the legality of snippets as fair use, yet Google now proposes a different schema. Fair use may be weakened.
(2) Machine (non-consumptive) use is restricted to research and researchers that Google approves.
(3) Google continues to make machine (non-consumptive) use of content they scanned but do not include in the Google Books database because the copyright owners opted out or brought lawsuits (e.g., France, Germany).
(4) Google’s proposed solution to the orphan works problem makes it unlikely that Congress will pass orphan works legislation that will be equitable; the proposed orphan works fiduciary will NOT have the power to make orphan works available open access.
(5) Proposed settlement gives Google, and only Google, a license to break copyright laws, in effect creating a copyright regime for Google and another copyright regime for everyone else.
(6) Academics were not adequately represented in the class action settlement (email between St. Clair and Troll Covey, August 2010).
Each of these points has validity and should be considered carefully.

Perhaps the larger issue around sustainability is the long-term prospects of Google. That issue is central to Google as a recommended solution to the Million Book Project’s sustainability issues. The financial health of Google may revolve around the outcome of the book settlement with the publishers. If the ruling were to be adverse, Google’s current robust financial situation could be affected. The company might no longer be a favored choice for sustaining the corpus of the Million Book Project.

Conclusions

This paper offers several choices of types of institutions as long-term sustainers: governments, foundations and nonprofits, universities, and commercial companies. Some of these institutional types have established records of being able to accommodate societal change. Governments have a mixed record around longevity. Nonprofits and foundations also have less robust track records, although the Catholic Church is over 2000 years old. Universities, for instance, have existed in a recognizable form since the founding of the University of Bologna in 1088. Similarly, commercial entities can also demonstrate longevity. According to Wikipedia, several firms in Japan began in the 700s and one construction company claims 578 as its founding date. Yet many companies, and especially technology companies, such as Google, are short-lived. Some fail, some merge, and some are bought out. These casual historical examples would suggest that either a company or a university institution type would be suitable for long-term sustainability.

The Million Book Project represents the cooperative work of hundreds of people in several different countries. Our vision was to demonstrate to the world that large-scale digitization could increase the amount of useful knowledge available on the Web free to read for students and scholars around the world. If our vision is to continue, then we must select a good model to sustain our work.