Architectural Choices in LOCKSS Networks
Total Page:16
File Type:pdf, Size:1020Kb
Architectural Choices in LOCKSS Networks David S. H. Rosenthal Stanford University Libraries Abstract 2 Existing Networks The LOCKSS digital preservation technology collects, To illustrate the differences between the various net- preserves and disseminates content in peer-to-peer net- works we describe three examples, the GLN and the works. Many such networks are in use. The Global CLOCKSS network, both preserving e-journals and e- LOCKSS Network (GLN) is an open network with many books, and Data-PASS, a PLN preserving social science nodes in which libraries preserve academic journals and data. Social science data has to be preserved in confor- books that they purchase. The CLOCKSS network is mity with one set of laws, governing for example the a closed network, managed by a non-profit consortium storage and dissemination of personal information. E- of publishers and libraries to form a dark archive of e- journals and e-books have to be preserved in conformity journal content. There are also many Private LOCKSS with a different legal framework, copyright law. This is Networks (PLNs) preserving various genres of content. one major reason for the differences between the config- Each of these networks is configured to meet the urations of these networks. specific requirements of its community and the content it preserves. This paper describes these architectural 2.1 The Global LOCKSS Network choices and discusses a development that could enable other configurations. The LOCKSS technology was invented in 1997 as a re- sponse to librarians’ concerns about long-term access to the journal content to which they subscribe [10]. The transition of this content from paper to the Web, while 1 Introduction greatly improving the reader’s experience, forced li- braries to switch from owning a copy of the content to The LOCKSS (Lots Of Copies Keep Stuff Safe) digital leasing access to the publisher’s copy. Future readers’ preservation technology collects, preserves and dissemi- access would be at the mercy of the publisher’s whim. nates content in peer-to-peer networks. Many such net- The idea behind the LOCKSS technology was to repli- works are in use. The Global LOCKSS Network (GLN) cate in the Web world the way that the system worked in is an open network with many nodes in which libraries the paper world. Each library would run a LOCKSS box, preserve academic journals and books that they purchase. the Web analog of its stacks. The library’s box would ob- The CLOCKSS network is a closed network, currently of tain a copy of the subscribed content from the publisher 12 nodes, managed by a non-profit consortium of pub- and keep it for as long as the library wanted. If, for any lishers and libraries to form a dark archive of e-journal reason, the library’s readers were unable to access the content. There are also many Private LOCKSS Networks publisher’s copy they would instead be provided access (PLNs) preserving various genres of content. to the library’s copy. Each of these networks is configured to meet the spe- In principle, the engineering to implement this model cific requirements of its community and the content it was simple. The Web pipeline from the publisher to preserves. We describe these architectural choices and the reader’s browser already contained a number of Web discuss a development under investigation in a number caches with temporary copies of Web content. All that of countries that could enable alternative configurations was needed was to build a Web cache with two additional that might be more suitable in some circumstances. properties: 1 • It would be pre-loaded; content would arrive in avoided this. The LOCKSS team negotiated with the the cache not because a human had requested their publishers on behalf of all libraries using the system. By browser to display it, but because a specialized Web amortizing one negotiation across all the libraries wish- crawler designed to exhaustively visit every page ing to preserve a publisher’s content, the system was eco- of each subscribed journal had requested it. This nomically viable even for small and open access publish- was necessary to ensure that the library acquired a ers. It did not have to charge the publishers fees to have copy of all the content to which it subscribed, much their content preserved. of which the library’s readers would never actually Very early in the LOCKSS Program’s history it was read. In the paper world libraries had a process one of six initial e-journal projects funded by the Mel- called ”claiming the serials” in which staff ensured lon Foundation. The other five all envisaged a central, that the library had received every issue of every third-party archive obtaining a copy from the publisher journal they purchased. and supplying it to a library’s reader if, for example, the library’s subscription had lapsed. Obtaining permission • It would be persistent; content that arrived in the to take the publisher’s content and re-publish it for the cache would stay there instead of being displaced financial benefit of the third-party archive proved very by more recent content. This was necessary to en- difficult. Eventually, the difficulty was overcome with sure the library continued to own a copy of the con- the advent of the Portico archive [9]. tent to which it subscribed; it meant that the cache On the other hand, because the LOCKSS system re- would grow steadily in size as new content was pub- quired each library to use the content they collected only lished. In the paper world journal content gradually for their own readers, publishers understood that this did accumulated along shelves in the stacks. not compete with them or materially increase the risk of Even in 1998, Web crawlers and persistent storage were leakage. This made the negotiation much easier, and less off-the-shelf techniques. The innovative part of the expensive. LOCKSS technology was to combine them in a way that Publishers agreeing to their journals being preserved minimized the cost to a library of deploying the technol- put a statement on the journal’s Web site, visible only ogy. It did so in two ways: by reducing the up-front cost to readers with a valid subscription, granting permis- of obtaining permission from the publisher for the sys- sion. Each LOCKSS box, before attempting to collect tem to make and keep copies of their copyright content, any content from the journal, checks that the permission and by reducing the capital and operational costs of sub- statement is visible, and preserves it along with the con- sequently keeping those copies safe from harm. tent to which it relates as evidence that those copies were made with permission. 2.1.1 Obtaining Permission In this way, the collected content at each library serves as a database showing which libraries had valid access In the US the Digital Millennium Copyright Act to which content when, in addition to its use in provid- (DMCA) means that libraries and, more to the point, the ing the library’s readers with access if they cannot obtain developers of the software, would be criminally liable if it from the publisher. There is no need for an external they kept a copy of content obtained from the Web longer database of access rights. This was essential, since at than specified by the HTTP Cache-Control header. This that time (and in most cases even now) no such database threat was the most important factor in the architecture was available. of the LOCKSS software. There are two ways to avoid the DMCA liability. Web archives such as the Internet Archive use the ”safe har- 2.1.2 Preserving the Copies bor” provision, under which they are required to delete Libraries subscribe to journals that serve their readers’ the copy if the copyright owner sends them a take-down interests. Because these interests overlap, library sub- 1 notice . This provision would not provide libraries ade- scriptions overlap. The overlap is greater for important quate assurance against disputes with journal publishers. journals, and lower for less important journals. As each The alternative is to obtain permission from the pub- library kept its own copy of the content to which it sub- lisher. A system in which each library had to negotiate scribed, the system as a whole had many copiesof impor- individually with each journal would clearly be too ex- tant journals, and fewer for less important journals. But pensive to be viable. The architecture of the GLN in overall, in comparison to the levels of replication found which each library’s LOCKSS box obtains its content in other reliable distributed systems, the LOCKSS sys- from the publisher using the library’s subscription access tem had an abundance of replicas of the content it was 1In practice, anyone representing themselves as the copyright owner preserving. can do this; the archive lacks the resources to verify ownership. Both because libraries’ resources are under stress, and 2 because the system envisaged an abundance of copies, it a strong business model, their journals are at low risk of was important to minimize the per-copy cost of the sys- loss. tem. As the number of replicas in a system increases, The less important journals are typically published in- the effect on the reliability of the system of an individual dividually from an idiosyncratic Web server. Thus one replica’s failure decreases. Because the GLN had a large negotiation and one technical effort results in one jour- numberof replicas, it did not need each replica to be very nal being preserved.