ISSN 1747-1524

DCC | Digital Curation Manual

Instalment on

"Open Source for Digital Curation" http://www.dcc.ac.uk/resource/curation-manual/chapters/open-source/

Andrew McHugh Humanities Advanced Technology and Information Institute (HATII) University of Glasgow, Glasgow G12 8QJ http://www.hatii.arts.gla.ac.uk

July 2005 ver 1.6

DCC - http://www.dcc.ac.uk Page 2 DCC Digital Curation Manual

Legal Notices

The Digital Curation Manual is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License.

© in the collective work - Digital Curation Centre (which in the context of these notices shall mean one or more of the University of Edinburgh, the University of Glasgow, the University of Bath, the Council for the Central Laboratory of the Research Councils and the staff and agents of these parties involved in the work of the Digital Curation Centre), 2005.

© in the individual instalments – the author of the instalment or their employer where relevant (as indicated in catalogue entry below).

The Digital Curation Centre confirms that the owners of copyright in the individual instalments have given permission for their work to be licensed under the Creative Commons license.

Catalogue Entry

Title DCC Digital Curation Manual Instalment on Open Source for Digital Curation

Creator Andrew McHugh (author)

Subject Information Technology; Science; Technology--Philosophy; Computer Science; ; Digital Records; Science and the humanities.

Description Instalment on the role of open source software within the digital curation life-cycle. Describes a range of explicit digital curation application areas for open source, some examples of existing uses of open source software, a selection of open source applications of possible interest to the digital curator, some quantifiable statistics illustrating the value of open source software and some advice and pointers for institutions planning on introducing these technologies into their own information infrastructures.

Publisher HATII, University of Glasgow; University of Edinburgh; UKOLN, University of Bath; Council for the Central Laboratory of the Research Councils.

Contributor Seamus Ross (editor)

Contributor Michael Day (editor)

Date 1 August 2005 (creation)

Type Text

Format Adobe Portable Document Format v.1.2

Resource Identifier ISSN 1747-1524

Language English

Rights © HATII, University of Glasgow

Citation Guidelines McHugh A, (July 2005), "Open Source for Digital Curation", DCC Digital Curation Manual, S.Ross and M.Day (eds), Retrieved <date>, from http://www.dcc.ac.uk/resource/curation-manual/chapters/open-source/

Andrew McHugh, Open Source for Digital Curation Page 3

About the DCC

The JISC-funded Digital Curation Centre (DCC) provides a focus on research into digital curation expertise and best practice for the storage, management and preservation of digital information to enable its use and re-use over time. The project represents a collaboration between the University of Edinburgh, the University of Glasgow through HATII, UKOLN at the University of Bath, and the Council of the Central Laboratory of the Research Councils (CCLRC). The DCC relies heavily on active participation and feedback from all stakeholder communities. For more information, please visit www.dcc.ac.uk. The DCC is not itself a data repository, nor does it attempt to impose policies and practices of one branch of scholarship upon another. Rather, based on insight from a vibrant research programme that addresses wider issues of data curation and long-term preservation, it will develop and offer programmes of outreach and practical services to assist those who face digital curation challenges. It also seeks to complement and contribute towards the efforts of related organisations, rather than duplicate services.

DCC - Digital Curation Manual Editors

Seamus Ross Director, HATII, University of Glasgow (UK)

Michael Day Research Officer, UKOLN, University of Bath (UK)

Peer Review Board

Neil Beagrie, JISC/British Library Partnership Manager (UK)
Georg Büechler, Digital Preservation Specialist, Coordination Agency for the Long-term Preservation of Digital Files (Switzerland)
Filip Boudrez, Researcher DAVID, City of Antwerp (Belgium)
Andrew Charlesworth, Senior Research Fellow in IT and Law, University of Bristol (UK)
Robin L. Dale, Program Manager, RLG Member Programs and Initiatives, Research Libraries Group (USA)
Wendy Duff, Associate Professor, Faculty of Information Studies, University of Toronto (Canada)
Peter Dukes, Strategy and Liaison Manager, Infections & Immunity Section, Research Management Group, Medical Research Council (UK)
Terry Eastwood, Professor, School of Library, Archival and Information Studies, University of British Columbia (Canada)
Julie Esanu, Program Officer, U.S. National Committee for CODATA, National Academy of Sciences (USA)
Paul Fiander, Head of BBC Information and Archives, BBC (UK)
Luigi Fusco, Senior Advisor for Earth Observation Department, European Space Agency (Italy)
Hans Hofman, Director, Erpanet; Senior Advisor, Nationaal Archief van Nederland (Netherlands)
Max Kaiser, Coordinator of Research and Development, Austrian National Library (Austria)
Carl Lagoze, Senior Research Associate, Cornell University (USA)
Nancy McGovern, Associate Director, IRIS Research Department, Cornell University (USA)
Reagan Moore, Associate Director, Data-Intensive Computing, San Diego Supercomputer Center (USA)
Alan Murdock, Head of Records Management Centre, European Investment Bank (Luxembourg)
Julian Richards, Director, Archaeology Data Service, University of York (UK)
Donald Sawyer, Interim Head, National Space Science Data Center, NASA/GSFC (USA)
Jean-Pierre Teil, Head of Constance Program, Archives nationales de France (France)
Mark Thorley, NERC Data Management Coordinator, Natural Environment Research Council (UK)
Helen Tibbo, Professor, School of Information and Library Science, University of North Carolina (USA)
Malcolm Todd, Head of Standards, Digital Records Management, The National Archives (UK)

Preface

The Digital Curation Centre (DCC) develops and shares expertise in digital curation and makes accessible best practices in the creation, management, and preservation of digital information to enable its use and re-use over time. Among its key objectives is the development and maintenance of a world-class digital curation manual. The DCC Digital Curation Manual is a community-driven resource, from the selection of topics for inclusion through to peer review. The Manual is accessible from the DCC web site (http://www.dcc.ac.uk/resource/curation-manual). Each of the sections of the DCC Digital Curation Manual has been designed for use in conjunction with DCC Briefing Papers. The briefing papers offer a high-level introduction to a specific topic; they are intended for use by senior managers. The DCC Digital Curation Manual instalments provide detailed and practical information aimed at digital curation practitioners. They are designed to assist data creators, curators and re-users to better understand and address the challenges they face and to fulfil the roles they play in creating, managing, and preserving digital information over time. Each instalment will place the topic on which it is focused in the context of digital curation by providing an introduction to the subject, case studies, and guidelines for best practice(s). A full list of areas that the curation manual aims to cover can be found at the DCC web site (http://www.dcc.ac.uk/resource/curation-manual/chapters). To ensure that this manual reflects new developments, discoveries, and emerging practices, authors will have a chance to update their contributions annually. Initially, we anticipate that the manual will be composed of forty instalments, but as new topics emerge and older topics require more detailed coverage more might be added to the work.
To ensure that the Manual is of the highest quality, the DCC has assembled a peer review panel including a wide range of international experts in the field of digital curation to review each of its instalments and to identify newer areas that should be covered. The current membership of the Peer Review Panel is provided at the beginning of this document. The DCC actively seeks suggestions for new topics and suggestions or feedback on completed Curation Manual instalments. Both may be sent to the editors of the DCC Digital Curation Manual at [email protected].

Seamus Ross & Michael Day. 18 April 2005

Table of Contents

1 Executive Summary
2 Introduction and Scope
2.1 Advantages of Open Source for Digital Curators
2.2 Proprietary Software Development and Distribution
2.3 Commercial Retention of Control
2.4 The Philosophy of Open Source and Free Software
2.5 Facilitating Preservation Through Transparency
2.6 Open Source Within the Digital Curation Life-cycle
3 Background and Developments to Date
3.1 The Origins of Free and Open Source Software
3.2 Open Source – Free Software with Different Emphases
3.3 Licensing
3.4 Generic and Specialist Benefits of Open Source
4 How does Open Source Apply to Digital Curation?
4.1 Life-cycle as User Perspectives
4.2.1 Cost of Software
4.2.2 Availability of Assistance
4.2.3 Developer Advantages
4.3.1 Customisable Functionality
4.3.2 Peer Reviewed Software Integrity
4.3.3 Users Assume a Strong Legal Position
4.3.4 Increased Security of Digital Resources
4.4.1 Longevity of Digital Information
4.4.2 The Relationship Between Open Source and Open Standards
4.4.3 Portability of Information
4.4.4 Preservation Through Transparency
4.4.5 Legal Issues for Long-term Access
4.4.6 Later Stages
5 Open Source and Free Software In Action
5.1.1 Government and Public Sector
5.1.2 Humanities Institutions
5.1.3 Science
5.1.4 HE/FE Institutions
5.1.5 Commercial Organisations
5.2.1 The GNU/Linux Operating System
5.2.2 Emulation Applications for Open Source
5.2.3 Server and Development
5.2.4 The Apache Web Server
5.2.5 Databases
5.2.6 The GRID
5.2.7 Programming Languages
5.2.8 Others
5.2.9 Desktop and Productivity
5.2.10 OpenOffice.org
5.2.11 The Mozilla Project
5.2.12 Specific Open Source Applications for Digital Curation
(a) Fedora Digital Object Repository Management System
(b) DSpace
(c) FreeBXML
(d) JHOVE
(e) LOCKSS
(f) Xena
5.2.13 Other Institutional Repository Implementations
6 Quantitative Issues
6.1 Financial Costs of Open Source Software
6.2 Software Acquisition and Upgrade Costs
6.3 License Management and Litigation Costs
6.4 Hardware Costs
6.5 Support and Training
6.6 Total Cost of Ownership
6.7 Longer-term Considerations
6.8 Performance and Reliability
6.9 Market Share
7 Future Developments
8 Conclusion
Bibliography
Glossary of Terms
Acronyms and Abbreviations
About the Author

1 Executive Summary

Throughout this Curation Manual a number of individual practices, principles, techniques and technologies are suggested as being particularly appropriate throughout the digital curation life-cycle. Some are uniquely associated with issues of use and longevity, while others are more generic in their application areas, but with identifiable and significant benefits for those charged with the creation, curation and re-use of digital materials. With a range of ubiquitous advantages, open source and free software can offer tangible benefits throughout the digital curation life-cycle. By its nature the adoption of open source represents a broadly affecting cultural measure, which underpins and influences the outcome of numerous other decisions within any digital curation endeavour.

Open source software is frequently available without cost, which from a management perspective facilitates the creation of digital content, and its legal status frees data creators from the onerous licensing restrictions associated with proprietary software. In terms of active use, there are rarely any ongoing upgrade costs, so there are fewer concerns associated with ensuring that software is maintained to the latest version. Furthermore, a range of quantifiable evidence suggests that open source applications can match and often better the performance, security and reliability of commercial alternatives.

It is in the areas of long-term access and preservation that the open source approach offers some of its most relevant benefits to the digital curator. With its intrinsic transparency it is more straightforward to ensure future accessibility through migration or emulation, free of the legal entanglements that, with proprietary commercial software, may make such activities problematic. Re-use is also facilitated, with open source licenses1 explicitly permitting the integration, alteration and redistribution of program code. Closely associated with (although by no means synonymous with) open source are open standards and open formats; both are frequently embraced by the open source community. Encoding digital information in a manner which is documented, commonly understood and not linked to an individual commercial product or intended to help pursue a corporate goal is a high priority for many open source developers and distributors.

This Curation Manual instalment2 discusses at some length the relevant strengths of open source software from a digital curation perspective, as well as detailing some of its more general advantages, which must be understood in order to accept the viability of an institutional or cultural shift to open software products. Through a series of sections it describes a range of explicit digital curation application areas for open source, some examples of existing uses of open source software, a selection of open source applications of possible interest to the digital curator, and some quantitative statistics illustrating the value of open source software.

1 For convenience and consistency, the American English spelling of the noun “license” is used throughout, since this spelling is most commonly used within discussions of this topic.

2 This instalment adapts and builds upon materials originally published as part of "Digicult Technology Watch Report 3", 2005, Seamus Ross, Martin Donnelly, Milena Dobreva, Daisy Abbott, Andrew McHugh and Adam Rusbridge, http://www.digicult.info/pages/techwatch.php [Accessed: 7 April 2005, 11:30].

2 Introduction and Scope

2.1 Advantages of Open Source for Digital Curators

The problems implicit in digital curation can be mitigated at every stage of a digital object’s life-cycle by adopting appropriate strategies and exploiting particular technologies. The open source software movement represents and characterises a thirty-year-old software development and distribution philosophy, and offers several valuable advantages to the digital curator. In recent years, the open source ethos has come to fruition within a range of commercial and public sector environments. Core beliefs in the principle of free software availability, and of a community-based approach to software development, have increasingly established free and open source software within the mainstream, where its numerous applications now reside as realistic and competitive alternatives to proprietary commercial software that has been produced within a more traditional ‘behind closed doors’ development model. While advantages can be identified with open source in a range of application areas, there are several intrinsic qualities that lend themselves particularly well to digital curation activities, and that make the use of open source tools an excellent starting point for data creators, curators and re-users seeking to facilitate the use and preservation of digital materials.

2.2 Proprietary Software Development and Distribution

The traditional, commercial software development model has a number of key characteristics. When a software application is created, it is written in a programming language, a human-readable syntax that broadly corresponds to the way in which a computer understands and processes information. However, for a computer to make sense of a program it has to be offered in a much ‘lower level’ format - ultimately the 1s and 0s of binary. In order to transform a program from human-readable form to binary, many languages require the code to undergo a process called ‘compiling’. The original, or ‘source’ code is passed through an intermediate program and translated into computer-readable syntax, which to human eyes bears little relation to the original. Within a proprietary model, developers will typically perform the compiling process behind closed doors before distributing the binary results to customers, who can run the program and enjoy its benefits without establishing a sense of how the program works, and without any means of finding out. Users are unable to change the way the program runs, other than by using the program’s inbuilt tools. Often, such utilities offer significant scope for modification – an example is the macro functionality incorporated within Microsoft’s Office suite of applications. However, complete control over functionality is withheld, and ultimately, changes can only be made at the publisher or distributor’s behest.

2.3 Commercial Retention of Control

Proprietary software companies are motivated by commercial concerns, and are naturally keen to strengthen their own position, often at the expense of consumer freedom. By restricting customers to binary files, vendors control the functionality of their applications, and can impose limitations based on their own distribution or upgrade policies and plans. If problems occur with software, new features are sought or versions are required for alternative hardware/software platforms, they must all be negotiated with the software vendor. Similarly, the customer is quite powerless to fix bugs that are identified within the system, since to do so will generally require some familiarity or interaction with the software at source code level. Because source code access is likely to be limited to a small group of developers, changes take time to implement, and the addition of specialist functionality may be overlooked or considered commercially non-viable. From a digital curation perspective this model means that end-users are unable to identify the characteristics of the software and formats they use, and consequently are limited in the ways in which they can inject additional functionality or preservation qualities into their digital information. Preservation strategies are likely to be hampered by legal and technical barriers related to restrictive license terms and the software’s closed nature.

2.4 The Philosophy of Open Source and Free Software

In contrast, open source software is developed and released in a more transparent fashion. Instead of concentrating on the financial advantages of limiting access to source code and tightly guarding knowledge, open source software is motivated by community concerns. Source code is openly shared, contributions are welcome from competent users anywhere in the world and software is distributed free from the onerous end-user agreements that characterise a great deal of proprietary software. By empowering users with access to source code the open source methodology encourages and rewards modification, re-use, redistribution and understanding. Institutions and organisations are empowered to choose appropriate tools to achieve their intended outcomes. This helps to limit the dangers posed by relying upon specific commercial proprietary software solutions. The most obvious of these dangers is the surrendering of IT infrastructure control to commercially motivated technology vendors – often at significant cost in terms of the ‘curatability’ of one’s digital information.

2.5 Facilitating Preservation Through Transparency

Open source technologies are no longer the marginalised preserve of bedroom hobbyists, with several open source applications among the most proven and reliable of all digital solutions. By regularly embracing the concept of open standards, these technologies further remove the mystery from information storage and use over the longer term. As with source code availability, open standards aim to excise the opaque veneer that threatens and disrupts digital preservation, limits and curtails access to long-term stored documents, and hampers the straightforward exchange and interchange of digital content. An understanding of the structures that underlie the software and formats we use, and the legal rights to recreate, modify and re-distribute these structures, are of great benefit to everyone: from desktop users seeking an application specifically tailored to their needs to large-scale memory institutions that need to ensure that the software format they select to encode their digital [...] will not become obsolete, unsupported and impenetrable within a few years’ time.

2.6 Open Source Within the Digital Curation Life-cycle

The adoption of open source software provides several diverse benefits throughout the entire scope of the digital curation life-cycle. When determining the ‘curatability’ of an application or software format several important criteria must be considered. These include its longevity; the ease of its re-creation or emulation; its adherence to and use of open standards; the level of legal freedom associated with its use; its associated costs; its ubiquity; its support for [...]; and its stability. From the very conception of digital information open source presents some immediate advantages. Software acquisition costs are certainly lower than those associated with equivalent proprietary products, and although other hidden costs are involved in introducing and maintaining an open source infrastructure, several studies agree that the total cost of ownership is also significantly lower.3 Furthermore, transparency through source code availability and the frequent association between open source and open standards facilitates long-term comprehension and re-use, enabling creators, curators and re-users to effectively and explicitly present their digital materials alongside their underlying descriptive infrastructures. In addition, the absence of the onerous licensing restrictions that accompany proprietary products and stipulate acceptable conditions for use, redistribution, transfer and reverse engineering removes many of the problems often associated with the management and redeployment of software.

3 David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30].
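The compiling step described in section 2.2 can be illustrated with a small sketch. This is an analogy only, written for this instalment rather than taken from it: Python compiles source text to bytecode for its own virtual machine rather than to native binary, and the names `SOURCE`, `compile_source` and `run_compiled` are hypothetical. The contrast between readable source and opaque compiled output is nonetheless the same one a recipient of a binary-only distribution faces.

```python
# Illustrative analogy (assumption): Python bytecode stands in for the
# native binary that a proprietary vendor would distribute.

SOURCE = "def add(a, b):\n    return a + b\n"  # human-readable 'source' code


def compile_source(source: str):
    """Translate source text into a compiled code object (the 'compiling' step)."""
    return compile(source, "<example>", "exec")


def run_compiled(code_obj) -> dict:
    """Execute the compiled code and return the names it defines."""
    namespace = {}
    exec(code_obj, namespace)
    return namespace


code_obj = compile_source(SOURCE)
namespace = run_compiled(code_obj)

# The raw compiled form of the 'add' function: a short run of bytes that
# bears little visible relation to the source text above.
compiled_bytes = namespace["add"].__code__.co_code

print("Source code:")
print(SOURCE)
print("Compiled form (hex):", compiled_bytes.hex())
print("Running the compiled program: add(2, 3) =", namespace["add"](2, 3))
```

The compiled program still runs and can be used without access to `SOURCE`, but anyone wishing to understand or change its behaviour is far better served by the source text, which is the point the open source model turns on.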

3 Background and Developments to Date

3.1 The Origins of Free and Open Source Software

The open source and free software movements share a common goal, but differ subtly in their emphases. While both oppose proprietary closed-source software development and distribution, their motivations for doing so are contrasting. Nonetheless, both movements believe in the same general ethos: that software should be made universally available in its entirety, with everyone afforded the opportunity to understand, change and re-distribute it.

The free software movement, spearheaded by the Free Software Foundation (FSF) and characterised by the writings of Richard Stallman, has at its core a predominantly political, social, and moral agenda.4 From its origins in the late 1970s and early 1980s, the free software school grew out of frustration with the barriers imposed by the secrets and non-disclosure agreements surrounding proprietary software. Uncompromising in its philosophy, the movement argues that a number of fundamental human freedoms depend on the ability to access software without obstruction. The definition of free software is displayed prominently on the FSF Web pages, and can be broken down into four parts, each relating to one of four essential freedoms:5

1. The freedom to run a program, for any purpose;
2. The freedom to study how a program works, and adapt it to individual needs, implying access to the underlying source code;
3. The freedom to re-distribute copies;
4. The freedom to improve the program and release improvements to the public so that the whole community benefits, again implying source code access.

These simple definitions offer a comprehensive insight into the priorities and motivations of the free software movement. In the absence of a suitably unambiguous word in the English language, the classic definition is free as in free speech, not as in free beer. While there is no stipulation that free software should be made available without cost, it must be possible to re-distribute bought software at no cost if it is to qualify. Stallman rejects the traditional legal ownership arguments for proprietary software. He claims that traditional property law concepts are irrelevant since they relate to the problems caused by taking away someone else’s property, not simply making a copy. According to Stallman, since programs are not consumed in the same way as other types of property, they should not be subject to the same values. The free software movement is committed to a culture whereby users can benefit from the time already spent by others solving problems, negating the need to ‘reinvent the wheel’ themselves.

3.2 Open Source – Free Software with Different Emphases

Some have cited the uncompromising idealism of the free software movement as one of the main reasons for its continued marginalisation within the computing community, despite its indisputably impressive track record in terms of software development. In the mid 1990s a new movement evolved in reaction to the unease often provoked by Stallman’s favoured socio-political arguments. Seeking to characterise and promote the more ‘sellable’ aspects of free software (giving much less emphasis to the arguments favoured by the FSF), this movement was dubbed ‘open source,’ and is represented by the Open Source Initiative (OSI) with the programmer and writer Eric S. Raymond at its helm.6 The OSI’s foremost argument is that the unique development model underpinning free software leads to better software than that developed behind closed doors by the paid employees of commercial companies. For the purposes of this Curation Manual instalment the most notable way in which this superiority manifests itself is in terms of the increased ‘curatability’ of open source software. Significantly, the open source definition is not structured in terms of individual ‘human freedoms’, instead bearing more relation to a legal document of the kind familiar to users of commercial software products. Among its ten individual requirements are that open source software must be freely distributed; that source code must be available along with any compiled binaries; and that modifications and derived works must be permitted and re-distributable under the same license as the original software.7

3.3 Licensing

The most common software license under which free and open source software is distributed is called the GNU General Public License (GPL).8 Originally conceived to describe the legal status of the GNU operating system, this has become the generic standard free software license. It is a copyleft license, and establishes and seeks to protect the freedom of its associated software quite strictly.

Copyleft is a concept of the Free Software Foundation and serves as an alternative to traditional copyright restrictions. The coining of the term came in the light of concerns that without some kind of protection, free and open source software could be taken by proprietary software developers, changed and then re-distributed under a proprietary non-free software license. The copyleft requirement makes it impossible to “strip off the freedom” in this fashion. Copyleft says that anyone who re-distributes the software, with or without changes, must pass along the freedom to further copy and change it. Some opponents of free software have dubbed this “a viral clause” because it means that any new software that incorporates existing copyleft code automatically inherits the same license. Thus, the code and the freedoms of the license become legally inseparable. From a digital curation perspective, copyleft offers a level of assurance that any measures taken to limit software obsolescence and to facilitate use are likely to persist within an application irrespective of any subsequent revisions or redevelopment that takes place.

One criticism that is often levelled at the General Public License is that its terms and content were conceived mainly in accordance with the United States legal system. Several commentators have suggested that the GPL is weaker or even completely inapplicable within alternative, non-US jurisdictions.9 Consequently various regionally specific licenses have been developed. A good example is the CECILL license drafted by the French scientific research community in response to inadequacies of the GPL in the French legal context.10 However, controversy still looms to an extent, since the Open Source Initiative is yet to approve CECILL as a conforming license. Although the intentions behind it are good, European legal reaction to it has remained somewhat wary. Several different open source and free software licenses exist. Open source licenses are those explicitly acknowledged by the OSI, and these currently number around fifty individual licenses, each with their own profile. The Free Software Foundation is responsible for identifying those licenses that qualify as free software.

3.4 Generic and Specialist Benefits of Open Source

Notwithstanding the moral and ethical arguments in favour of free and open source software, most readers will expect some insights into their more pragmatic merits before being tempted to use them, or to develop applications under their terms. There are several persuasive arguments in favour of using open source software, and of releasing under an open source license. Many of these are generic, and equally applicable to computer users in any field, activity or industry, but there are several advantages of particular relevance to the digital curation community. Clearly, a software infrastructure that facilitates digital curation can only be viable if it also offers functionality, value and reliability on a par with or in excess of that offered by alternative proprietary tools. The remainder of this Curation Manual instalment will therefore concentrate on both the general qualities of open source software and those aspects that make an open source software environment extremely useful from the specific perspective of digital curators.

4 http://www.fsf.org [Accessed: 7 April 2005, 11:42]; http://www.stallman.org [Accessed: 7 April 2005, 11:30].

5 “The Free Software Definition,” http://www.fsf.org/licensing/essays/free-sw.html [Accessed: 7 April 2005, 11:40].

6 http://www.opensource.org [Accessed: 7 April 2005, 11:42]; http://www.catb.org/~esr/ [Accessed: 7 April 2005, 11:43].

7 Needless to say, these requirements have a great deal in common with those outlined for Free Software. However, the Free Software Foundation generally places higher ethical demands on software licenses, so while most, if not all, Free Software approved licenses will be open source, the opposite is not necessarily true.

8 GNU is a recursive acronym for ‘GNU’s Not Unix’.

9 Alex Thurgood, January 2005, "The GPL and non-U.S. law", Open Source Law Blog, http://www.oslawblog.com/2005/01/gpl-and-non-us-law.html [Accessed: 7 April 2005, 11:43].

10 http://www.cecill.info/licences/Licence_CeCILL_V1.1-US.html [Accessed: 7 April 2005, 11:43].

4 How does Open Source Apply to Digital Curation?

4.1 Life-cycle as User Perspectives

In considering the value of open source for digital curation it is convenient and worthwhile to consider its merits through every stage of the data life-cycle model. Across a life-cycle encompassing creation, active use, archiving, preservation, access and re-use, and disposal or transfer, one can identify three main user roles: data creator, data curator and data re-user. The following sections will consider the advantages offered by open source in the context of each of these roles.

4.2 Perspective 1 - Data Creator

4.2.1 Cost of Software

The first, and perhaps most often cited benefit of open source software relates to the issue of acquisition cost. Open source and free software need not be distributed without charge, but the definitions ensure that while a vendor could sell an open source product for a fee, its purchaser could then re-distribute it for free. Total cost of ownership is a more complex issue, incorporating a number of often more hidden costs across the entire data life-cycle, and this is explored in more depth below. Nonetheless, it can be indisputably stated that from a financial perspective, open source software empowers users to begin working with digital materials and create digital content quickly and with few onerous responsibilities.

4.2.2 Availability of Assistance

Help for data creators is also widely available within the open source community. With vast documentation projects often coexisting alongside software development, comprehensive and useful information for using and understanding open source software is increasingly available. In addition to formal help materials, an Internet-wide community of expertise contributes to an ever expanding knowledge pool consisting of discussion fora, FAQs and “how to” guides.

4.2.3 Developer Advantages

Similarly, there are numerous additional advantages in favour of creating new digital information within the open source community. From a practical perspective, a great deal of software is released under an open source license because it is the only way to legally integrate existing free software code or libraries. Because of the vast range of well-written, standards compliant and commonly understood software that is currently only available under copyleft licenses like the GPL, it may be a more attractive prospect to build on the work of others than start from scratch, replicating functionality that already exists and is freely available. In addition, basing work on mainstream ‘accepted’ code automatically expands the pool of developers and users with an interest in ensuring its longevity. An open source approach also offers the opportunity to consult and collaborate with a wide Internet-based developer community to facilitate testing and improvements. This is particularly useful for solo developers, and small groups for whom outside intervention, assistance and feedback are beneficial and otherwise unavailable. In addition, several popular web sites exist to promote and distribute open source software materials. For applications that work well, success, prominence and large-scale adoption usually follow with word spreading around the community, fuelled by exposure on sites like Sourceforge.net.11 Umbrella resources like the Open Source Technology Group offer a platform for shared ideas and knowledge interchange, and at the same time provide mechanisms for the promotion and distribution of open source applications.12

4.3 Perspective 2 – Data Curator

4.3.1 Customisable Functionality

The overall depth of functionality offered by a particular program is an immediate and obvious indicator of its value from an active use perspective. Since open source involves potential users at every stage of the development process, the functionality requirements and expectations that users have can be identified and implemented effectively. Unique or marginalised functionality can be incorporated into existing applications straightforwardly, due to the availability of source code. Features that would not be worthwhile for a commercial company to implement due to the lack of overall user demand can be introduced, and new projects can be started when it seems that a particular functional requirement is unlikely to be met. The open source model empowers users to either develop their own specifically required applications or to add their own functionality to those that already exist. For example, it is possible for developers to incorporate additional digital curation functionality such as metadata support or compatibility with additional file formats into existing open source software. The culture of customisation ensures that software and digital content can be altered to suit specific requirements and expectations. The commercial software world cannot match this level of flexibility. It will often charge a fee to add a particular feature at the request of an individual client or, more likely, assure the customer that the functionality will be integrated when they pay to upgrade to the next version of the software.

Needless to say, many users of open source software are ill-equipped in terms of expertise or time to individually implement every change they require, or to effect the modifications in house. However, by facilitating development on a global basis, the open source model enables organisations to outsource freely, or to motivate others within the community who are equipped to modify or build upon code to do so. It is likely that open source users will continue to seek assurances that software developers are committed to their products long-term and that ongoing maintenance will be undertaken, new features added and bugs fixed. However, although the developer-customer relationship can continue to exist in this fashion, it is not the only one that will ensure these ends are satisfied. If the developer reneges on a commitment that he or she has made then the unique status of open source software ensures that another individual or organisation can intervene and ensure the product’s sustainability.

4.3.2 Peer Reviewed Software Integrity

The identification and correction of bugs from programs is another strength of the open source development model. All but the most straightforward of software programs will contain bugs: they are a regrettable, but inevitable part of software development. With traditional software development, it is common for a team of programmers to complete an application and, through a period of evaluation, to identify and fix errors. However, software companies face great pressure to get their products on the shelves to begin generating income, and therefore bug-fixing schedules are often necessarily limited. Users will often find flaws, but without access to source code it is impossible for them to personally remedy these; the only recourse is to notify the relevant software publisher. If a bug is sufficiently serious then the company will probably issue a software patch to repair the flaw. However, since they must rely on error reports from users with no access to source code it can be difficult to trace bugs; this represents a failure to exploit the application users’ programming and debugging abilities and dramatically lengthens the process. In addition, there is usually nothing but goodwill to guarantee that companies make any corrections available free of charge and they might simply refuse to address the flaw at all, leaving users with no option but to learn to accept deficiencies in their applications.

Relying on the philosophy of releasing software early and often, open source projects are likely to receive users’ bug reports before a program is even close to the level of maturity that a commercial company would deem acceptable for release. With access to source code, collaborators can fix problems themselves, or offer detailed accounts in programming terminology of where and in what circumstances bugs manifest themselves. “Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging,” writes Eric Raymond, his mantra: “Given enough eyeballs, all bugs are shallow.”13 In addition, open source projects are becoming increasingly well documented, with the rapid growth and mainstream proliferation of open source leading to the generation and prioritisation of good quality documentation. As well as facilitating debugging the peer review system can also be used to ensure that digital information or content is sufficiently functionally rich throughout its development and maintenance. For instance, from a digital curation perspective, collaboration can take place to ensure that the information infrastructures are optimised for longevity, continued accessibility and re-use. While poorly written code is by no means the exclusive preserve of the proprietary software development world, the chances of a bad open source program continuing to be developed badly are mitigated somewhat by the community’s watchful eyes.

4.3.3 Users Assume a Strong Legal Position

An argument often raised by those opposed to open source is that if something goes wrong with the software there is no one to direct the blame towards. From the digital curator’s point of view this might provoke concerns: if one relies upon a particular software package or data format to ensure the curatability of digital resources then there are certainly grounds for dissatisfaction and an expectation of recompense if this is not achieved. While this is true, the majority of proprietary licenses will include terms absolving responsibility for problems caused by flaws or shortcomings in software. For instance, no one has successfully sued Microsoft for downtime or information lost as a direct consequence of security loopholes in their Windows operating system.14 Due to the financial models underpinning a variety of open source software it is likely that vendors will offer support contracts and software guarantees as a commercial service quite distinct from the distribution of software itself, which may incorporate rights to compensation in the event of a failure to meet their commitments. In addition, as discussed above, the transparency of open source enables individual developers or administrators to independently implement solutions to overcome the shortcomings in the programs that they use.

4.3.4 Increased Security of Digital Resources

The Internet remains a dangerous place, with the potential for virus infection, denial of service attacks and interception of personal details presenting serious concerns. Security is therefore something that must be taken into consideration during any digital curation work flow. For instance, where materials are stored in remote repositories, security must be assured in order to be confident that retrieved information has not been compromised, or altered from its initially deposited form. It is often argued that by making source code available it will be easier for malicious individuals to identify and exploit security vulnerabilities in open source software. This is dismissed by open source advocates as a false argument, and typical of the “Fear, Uncertainty and Doubt” strategies frequently employed by the proprietary software world. The sense of ‘security through obscurity’ promoted by proprietary software companies is considered to be rather dangerous: not only does it create a false sense of safety, but it also limits opportunities for identifying existing security loopholes. Malicious ‘crackers’ are motivated and determined, and will uncover any vulnerabilities that exist whether code is freely available or not. Opening the source ensures that those who are interested in finding problems to fix and secure (rather than to exploit) can do so as straightforwardly as possible. Also, while it should certainly not lead to complacency, open source users are less likely to find themselves the target of malicious attacks than those using proprietary tools, since a great deal of destructive code is motivated by distrust and resentment directed towards large corporations. This situation could change if open source continues to establish itself within the mainstream, since the intention of malicious crackers may be to simply attack the biggest targets, irrespective of political factors.

4.4 Perspective 3 – Data Re-users

4.4.1 Longevity of Digital Information

The expected lifetime of open source software compares favourably with proprietary alternatives, although arguments can be presented to suggest that either is assured of greater longevity. Ubiquity is an appealing characteristic, and it is frequently maintained that mainstream commercial applications with large distributions are more likely to be accessible in the future due to the sheer number of people who have a vested interest in ensuring that this is the case. Few would question the fact that increasing the number of stakeholders is likely to increase the demand for an application or file format’s sustained accessibility (the so-called ‘follow the crowd’ approach). Particular concerns could be levelled at open source projects that are marginalised within the overall digital community, and formats that although open are less well supported and less frequently used than commercial alternatives, such as the OpenOffice 1.1 document or Ogg digital audio formats. However, one must make a clear distinction between the size of the user community that is interested in ensuring an application or data-set’s longevity and the straightforwardness with which this can be achieved. Digital curators face both sociological and technical challenges. It is suggested that the latter are better addressed by the use of open source. Transparency is at the very foundation of open source and free software, promoting understanding and facilitating its curation. In addition, such software is far more likely to embrace open formats and standards in favour of proprietary alternatives. Therefore, although there may be more voices demanding the curation of commercially distributed proprietary software, it is likely that a comparatively modest number of open source users can achieve the same goal with less effort and with significantly less expense incurred. People power can make it easier to overcome the barriers to digital curation, but although currently less well used, open source software itself overcomes a number of the most problematic obstructions that the digital curator is likely to face.15 Furthermore, while it lacks comparable user numbers in many disciplines and application areas (particularly the desktop domain), there are several areas in which open source software, such as the Apache Web Server, the Bind DNS server and prominent institutional repository implementations like DSPACE, Fedora and GNU EPrints hold dominant positions, even over commercial alternatives.

Figure 1 © Digital Curation Centre

4.4.2 The Relationship Between Open Source and Open Standards

The philosophies underpinning open source software have close associations with the concepts of open standards that are vital for successful exchange, re-use and preservation of documents and data. Open standards are those that, by virtue of their transparency and accepted nature, offer a degree of protection against obsolescence and inaccessibility. Technologist Bruce Perens suggests a definition of the principles and practices surrounding open standards, and offers detailed insights into what significant details elevate a common specification to the status of open standard.16

11 http://sourceforge.net/ [Accessed: 7 April 2005, 11:44].
12 http://www.ostg.com [Accessed: 7 April 2005, 11:44].
13 Eric S. Raymond, "The Cathedral and the Bazaar", http://ot.op.org/cathedral-bazaar.html [Accessed: 7 April 2005, 16:28]
14 In fact, Microsoft’s Windows XP license includes the clause “In no event shall Microsoft…be liable for any…damages whatsoever…even in the event of fault…(including negligence).” Todd Bishop, September 2003, "Should Microsoft Be Liable For Bugs?" Seattlepi.com, http://seattlepi.nwsource.com/business/139286_msftliability12.html [Accessed: 7 April 2005, 11:45].
15 See Figure 1
16 Bruce Perens, "Open Standards: Principles and Practice" http://perens.com/OpenStandards/Definition.html

According to Perens’ criteria an open standard offers the freedom to view and implement it, prevents customers from being ‘locked in’ to a particular vendor or group, and ensures that there is no associated royalty or fee and no favouring of one implementer over another. Although it should be possible to extend open standards or offer them in subset form, controls must exist to prevent dominant vendors from implementing the standard with extensions that are incompatible with other systems. Close parallels can be drawn between these principles and those expressed within the open source definition, particularly in terms of the concepts of freeness and unencumbered total access that each promotes. Examples of open standards for file formats include the OASIS Open Document Format, the World Wide Web Consortium's (X)HTML and Adobe’s PDF format. These represent reasonably safe starting points for storing content for future retrieval since all are understood, documented and published. There are no associated licensing costs and no charge can be levied for their use and distribution.

Commercial software companies tend to assume that if they can propagate their own particular file formats widely enough, people will soon become reliant upon them. The most obvious example is Microsoft’s Office suite, which uses the core file formats .doc, .xls, .mdb and .ppt to encode word-processed documents, spreadsheets, database files and slide show presentations. None of these are open standards, and therefore it is impossible to gain a thorough understanding of how they work. Consequently there is no way to confidently read and write to these formats with programs other than Microsoft’s own.17 Recent reports from Microsoft suggest that future versions of their document formats will be defined in XML, which should theoretically introduce a greater degree of transparency in their structure. However, within a community suspicious of Microsoft (following for instance their extremely limited ‘Shared Source’ scheme18) few expect these plans to lessen the opacity intrinsic to Microsoft’s products to any significant degree. These expectations are galvanised with Microsoft’s failure to offer confirmation that future versions of their software will support OASIS's Open Document Format. David Rosenthal argues that the incompatibility of data formats within the Office suite is a deliberate and quite integral part of Microsoft’s business model. By distributing its software at low cost with new computers pressure is placed on users to pay for expensive ‘essential’ upgrades that are subsequently introduced. The case for upgrading can be persuasive – ongoing support may depend on it and poor backwards compatibility in many products may render collaboration impossible if one’s peers are running a newer version. The implications that this has for accessibility need little further explanation. By its nature open source software can be understood and if necessary replicated at a later date, and open formats boast similar long-term lucidity. Developers and users can continue to access open digital resources in the future more straightforwardly than proprietary assets with missing digital jigsaw pieces.

4.4.3 Portability of Information

In terms of ease of emulation and potential portability open source carries a significant advantage. Without having to painstakingly reverse engineer existing applications, software environments and data formats it is theoretically straightforward to take information in a particular form or structure and recreate or repackage it as required.19 Binary-only, non-documented and non-standard software is shrouded in mystery, with even sophisticated decompiler software unable to offer reliable and definitive insights into fundamental underlying qualities. The efforts of OpenOffice.org to create a standards compliant productivity suite supporting a range of both proprietary and open formats offer a somewhat trite, but insightful example of the kinds of barriers faced when dealing with commercially encoded digital assets. The contemporary problems associated with these are likely to be amplified many times in the future. Without an intimate understanding of the format for instance, it is impossible to adequately and confidently render all the information contained within a .doc file in any non-Microsoft endorsed environment.20 As long as one has to rely upon an individual private corporate organisation to access and use one’s digital content, it can never be effectively curated, and its longevity can never be assured. Technology journalist David Berlind pulls no punches: “Putting the vendor in control of your IT costs is not a good position to be in. Unfortunately, that’s where a lot of us are.”21

16 (continued) [Accessed: 7 April 2005, 11:46].
17 Numerous projects, such as OpenOffice.org have tried to remedy this, with some success. See also summary of CAMiLEON working papers in: The DigiCULT Report Full Report “Technological Landscapes for tomorrow’s cultural economy: Unlocking the value of cultural heritage”, (January 2002), p. 212. Available online at http://www.digicult.info/pages/report.php, [Accessed: 7 April 2005, 11:46]
18 ‘Shared Source’ is a Microsoft initiative under which enterprise users, academics and others can get controlled access to select parts of Microsoft’s source code. Heavily criticised for its toe-in-the-water conservatism, The UK Register web site described it as nothing more than a "worthless PR exercise", Andrew Orlowski, 2004, "Why Microsoft ‘Shared Source’ Can Never Be Trusted", http://www.theregister.co.uk/2004/03/17/why_microsoft_shared_source_can/, [Accessed: 6 July 2005, 15:30]
19 S. Ross and A. Gow, 1999, "Digital archaeology? Rescuing Neglected or Damaged Data Resources", (London & Bristol: British Library and Joint Information Systems Committee), ISBN 1900508516, http://www.ukoln.ac.uk/services/elib/papers/supporting//p2.pdf [Accessed: 7 April 2005, 11:47].
20 Maria Guercio and Cinzia Cappiello, 2004, "File Formats Typology and Registries for digital preservation", (DELOS, WP6 D6.3.1), http://www.dpc.delos.info [Accessed 7 April 2005, 11:48].
21 David Berlind, July 2002, "Who Gave Microsoft Control of Your IT Costs? You did", http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2875958,00.html [Accessed 7 April 2005, 11:48].

These costs are likely to be more than simply financial. The combination of proprietary software and formats puts the software distributor in an unhealthily powerful position, and exposes the customer to the cost of ‘essential’ upgrades and even greater problems should the technology be discontinued or the developer become insolvent. Every time that a user records data in a closed format it tightens the grip held by its proprietary developer. The chain becomes increasingly more difficult and more expensive to break away from. It should be clear that when preservation issues and future content access are considered, the problem is only exacerbated. The OAIS reference model speaks of the importance of the availability of Representation Information to ensure that the informational value of our preserved bit-streams remains available even into the future. The most frequently cited items of Representation Information include software and format specifications, which with an open approach to software management, acquisition and distribution are likely to be much more readily available than within a proprietary commercial model.

4.4.4 Preservation Through Transparency

The digital curation community is no stranger to costly projects that have failed because of technological choices that were overly proprietary or marginalised. The BBC’s Domesday project is perhaps the most frequently cited example.22 In 1986, to celebrate the 900th anniversary of the Domesday Book, a project was undertaken to incorporate a diverse range of materials contributed by UK schoolchildren within a multimedia resource. Unfortunately, the project’s technological choices led to certain subsequent problems. For storage media, the project team chose to use Philips’ proprietary LaserVision LVROM disc, which could only be played on the associated LVROM player. The multimedia application itself was written in a language called BCPL, a precursor to C which ran on the BBC Model B platform, which had to be modified to interface with the proprietary discs, increasing costs and limiting the chances of viability and uptake. Regrettably the system soon became obsolete and, less than twenty years later, very few players or discs remain. Only the sustained efforts of the CAMiLEON project to rescue the application and implement an emulation strategy have ensured that future generations can access this valuable resource.23

Domesday need not have come up against such problems if a more future-conscious series of decisions had been made at its conception. One of the biggest single issues for the project (and for digital curation more generally) was its inability to ensure that at an unknown time in the future users would still be able to access the stored digital materials.24 An analogy with the original Domesday Book is offered on the CAMiLEON project web site. If the Latin language in which the original Domesday Book was written became somehow incomprehensible, accessing the information it holds would be impossible.25 A Latin dictionary could be used to overcome this problem and the OAIS reference model would call this Representation Information. What must be remembered is that the remit of the digital curator is not limited to simply maintaining the physical materials themselves, something which is conceptually quite straightforward; it is also necessary to ensure that a method of understanding and exploiting their full usefulness continues to exist in perpetuity. Instead of dwelling on the curation or preservation of data one must strive for long-term access to information. By using open, standardised formats, one can more feasibly limit the problems caused by the passage of time. If the Domesday project had used an open standardised structure to describe and encode its multimedia components, together with open formats for incorporated sound and video, it is likely that a clearer understanding of the data structures could be established in the future, with less guesswork or painstaking reverse engineering procedures. Furthermore, if the source code of the application was made freely available it too could continue to be understood and broken down into more easily migrated algorithmic chunks.

New digital hardware will inevitably be introduced, and it is likely that the machines we use ten years from now will operate quite differently from those we are familiar with today. But the hardware level is just one of several potential areas where problems can occur for the digital curator. Assuming that data are encoded in an open file format, and that the programs that read, access and write to these formats are open source, any hardware-related preservation problems can be more straightforwardly overcome. The knowledge conferred by the use of open source applications and open standards empowers future users, enabling resources to be more straightforwardly manipulated within a future hardware configuration. The INFORM methodology proposed by Andreas Stanescu26 suggests several classes of risk, including those originating from the digital object’s format, its associated software, and organisations and communities related to the preservation plans for the object. Open source software and open, standardised formats are likely to fare very well in each of these categories. In a closed source environment, future preservationists will face the onerous tasks of reverse engineering software to run on new platforms, developing emulators without a clear understanding of systems that need to be replicated and continuing to maintain otherwise obsolete hardware upon which resources are known to operate.

4.4.5 Legal Issues for Long-term Access

The practical problems inherent in dealing with proprietary software represent only part of the problem. In addition there are likely to be legal obstacles to the emulation or porting of commercially distributed, proprietary software. Terms of use, usually strictly defined in software license agreements, generally operate on a can-do basis, with an implicit assumption that if a particular type of use is not mentioned it is forbidden. It is therefore common for many commercial software licenses to prohibit the emulation, porting, migration and reverse engineering of application information or datasets. Similarly, restrictive terms of use may infringe upon one’s ability to collect and maintain appropriate Representation Information to prolong the useful life of a digital object. For software to qualify as open source or free however, there are assumptions to the contrary, in favour of freedom of use, re-use and redistribution. License agreements associated with OSS make it easier to take preservation measures without fear of violating the intellectual property claims of the original developers. While proprietary vendors may have little interest in continuing to support their software indefinitely and seldom offer the means to ensure its longevity, they often have a tendency to legally challenge anyone else who attempts to do so. This places a further burden on the user to administer licensing and terms of use documentation to ensure that the correct infrastructure is in place and that no infringements can take place. Open source has no such problems – in contrast, only steps taken to limit free access to open source software are likely to fall foul of licensing agreements.

4.4.6 Later Stages

Later stages of the digital curation life-cycle, such as disposal or transfer of stewardship, are further facilitated and simplified within an open source infrastructure. With none of the legal barriers to redistribution that are often imposed under proprietary licenses, open source materials are comparatively straightforward to disseminate and transfer.

22 http://www.atsf.co.uk/dottext/domesday.html [Accessed 7 April 2005, 11:48].
23 http://www.si.umich.edu/CAMILEON/domesday/domesday.html [Accessed 7 April 2005, 11:49]. See Daisy Abbott, "Overcoming the Dangers of Technological Obsolescence: Rescuing the BBC Domesday Project", DigiCULT.Info 4, Page 4, http://www.digicult.info/pages/newsletter.php [Accessed 7 April 2005, 11:49].
24 For an example of an open source tool which ‘normalises’ file formats for preservation see Adam Rusbridge, April 2004, “XENA: Electronic Normalising Tool”, DigiCULT.Info, Issue 7, page 32, http://www.digicult.info/pages/newsletter.php [Accessed 7 April 2005, 11:49].
25 http://www.si.umich.edu/CAMILEON/domesday/faq.html, Accessed 7 April 2005, 11:50
26 Andreas Stanescu, 2005, "Assessing the durability of formats in a digital preservation environment: The INFORM methodology", (OCLC Systems and Services, International Digital Library Perspectives, Vol 21, Number 1, 2005, pp. 61-81)

5 Open Source and Free Software In Action

5.1 Areas of Use

From its origins in laboratories, technology centres and student dormitories, open source has exploded in popularity over the last few years, and now performs a key role in the IT policy and infrastructure of many organisations, institutions and companies. Various reasons are cited for the adoption of open source, and these tend to vary across the sectors in which it enjoys exposure. Many users are attracted to the traditional stability and reliability of the software, others to its straightforward integration with heterogeneous system environments, and others, notably the digital curation community, to the empowering transparency and freedom that are both intrinsic to open source. Many more may be convinced by the financial savings that might be achieved from using these technologies.

5.1.1 Government and Public Sector

The increasing success of open source in the public and government sector has been one of the more significant developments of recent times in terms of technology take-up. Public sparring between proprietary software companies and the open source movement for lucrative governmental IT contracts emphasises the significance of this market, particularly for the subsequent dissemination of technologies throughout the public and social hierarchy, within the new era of e-government and digital public administration. Numerous reasons can be identified to explain the enthusiasm with which open source has been embraced by many public bodies. Perhaps the most obvious are related to the financial savings it affords, which in governmental terms may be a vote-winner. However, this is only part of the story. There can be little doubt that government bodies and agencies are to some extent wary about the potential consequences of trusting their entire IT infrastructure to one or two private (usually foreign) companies who are likely to guard their software secrets closely. Open source software is able to nullify this problem. In addition, governments are invariably charged with the responsibility of ensuring that provisions are in place to preserve a wide range of information for future generations. Open source facilitates this in a way that proprietary infrastructures cannot. Governments are formally charged by their electorate with the responsibility to maintain public records long-term, and legislation such as the Freedom of Information Act in the United Kingdom offers persuasive arguments for a move towards a more open environment.

There are numerous examples of large-scale public sector migrations to open source within Europe and further afield. The French Government has decided that central administration should terminate most of its agreements with proprietary vendors for the supply and use of software, meaning that national and local authorities are to use open source software as far as possible. The Agency for Information and Communication Technologies in Administration (ATICA) was set up in August 2001 to support this decision, and to coordinate the various governmental agencies and bodies towards the intended outcome. Similarly, the German central administration in June 2002 entered into a framework agreement with IBM and SUSE on the supply of open source products based on Linux, making it possible for German public administration to acquire Linux-based systems at a reduced price from IBM.27 The agreement incorporates the supply of servers and workstations, as well as ongoing support from IBM. While this promotion of open source is not a law, it represents a tempting incentive to open source decision-makers within the German public sectors. The United Kingdom has shown a clear commitment too, and the British Office of the e-Envoy issued an open source policy at the end of October 2004 which states that British Government and authorities will in future consider open source, declaring in particular a concern about being ‘locked-in’ to the products of single private commercial companies.28

The International Institute of Infonomics report entitled ‘Free/Libre and Open Source Software: Survey and Study’ (2002) recommended and reported a widespread deployment of open source tools throughout European government.29 This document contains several accounts of the public sector embracing open source systems. The French Ministry of Culture migrated 400 servers from Unix and Windows NT to Linux and intends to have comprehensive Linux server solutions by 2005. The Ministry of Justice and national crime register use a combination of open source tools such as the Apache Web Server, Perl, Samba, and fetchmail, with an imminent migration envisaged from proprietary Unix to Linux, PHP, and MySQL. Finally, the Ministry of Defence has FreeBSD, an open source operating system comparable to GNU/Linux, installed within its infrastructure.

In what is regarded as one of the most significant developments in the lifetime of open source software, the City of Munich in Germany officially confirmed in June 2004 that it would be transferring 14,000 municipal desktop computers from Microsoft Windows to open source, combining Linux server software, desktop software, and virtual machine technology from VMware to provide interoperability among heterogeneous systems.30 Bloomberg News described this as Microsoft’s “biggest PC loss yet,” and the decision has led analysts to predict that Linux-powered PCs will grow 25%-30% in 2004, and that Linux will account for 6% of desktop operating system shipments by 2007.31 Bergen, Norway’s second city, has followed in the footsteps of Munich in choosing Linux to underpin its technology infrastructure, moving away from proprietary UNIX and Microsoft Windows platforms and applications.

27 http://www.ibm.com [Accessed 7 April 2005, 11:50]; http://www.suse.com [Accessed 7 April 2005, 11:51].
28 See Danish Board of Technology, October 2002, “Open Source Software in e-government”, http://www.tekno.dk/pdf/projekter/p03_opensource_paper_english.pdf [Accessed: 7 April 2005, 11:51] and UK Government, October 2004, "Open Source", http://www.govtalk.gov.uk/policydocs/policydocs_document.asp?docnum=905 [Accessed: 8 July 2005, 16:27]
29 Except where otherwise stated, accounts are from International Institute of Infonomics, 2002, "Free/Libre and Open Source Software: Survey and Study", http://www.infonomics.nl/FLOSS/report/ [Accessed: 7 April 2005, 11:51].
30 http://www.vmware.com/, [Accessed: 7 April 2005, 11:53]
31 June 2004, "Munich Linux Decision Final", DesktopLinux.com, http://www.desktoplinux.com/news/NS7137390752.html [Accessed: 7 April 2005, 11:55].

5.1.2 Humanities Institutions

Like public sector institutions, the cultural heritage sector has a vested interest in both the financial cost and the openness and accessibility of the software it uses. Many institutions have discovered that open source solutions offer advantages to facilitate their requirements. A prominent example is the National Library of Australia, which now deploys a range of open source applications across its server space, reflecting a willingness to invest in the skills of the library and a commitment to standardisation in general.32 The Library’s Director of IT Business Systems, Mark Corbould, described how the institution has hesitated, however, to replace its 700 proprietary workstations with open source, since “Windows is so entrenched in the desktop space that it would take a nuclear war to remove it.”33 This seems to be an argument in favour of change sooner rather than later, and could be read as a firm assertion of the difficulties posed to effective digital curation by the current proprietary configuration.

32 http://www.nla.gov.au/ [Accessed 7 April 2005, 11:56].
33 Nadia Cameron, September 2003, "Open Source Bookmarks Australian Heritage", http://www.computerworld.com.au/index.php?id=522130461&fp=16&fpid=0 [Accessed 7 April 2005, 11:56].

5.1.3 Science

With bleeding edge innovation evident throughout every scientific discipline, it is unsurprising that software developed within the open source model has been embraced wholeheartedly by the science community. Significant institutions and organisations such as NASA have displayed a commitment to distributing their endeavours under open source licenses, with the NASA Open Source Agreement34 conceived as a license determining legal usage for a range of applications. Relevant projects include artificial intelligence software systems (Livingstone2), dynamic 3-D world environments (World Wind), a simulation toolkit for planetary exploration vehicles (the Mission Simulation Toolkit) and an evolutionary simulation (JavaGenes). NASA cites four main motivators for their adoption of open source technologies and release habits. Increasing software quality via community peer review, accelerating development via community contributions, maximising the awareness and impact of NASA research and increasing dissemination of NASA software in support of the education mission are together identified as sufficiently worthwhile ends to justify utilising the open source development model.

A range of scientific applications under open source licenses are made available by projects like OpenScience, which gathers a diverse selection of applications intended to facilitate the work of various communities. Examples include tools for conducting studies of forensics, acoustics, astronomy, life sciences, nanotechnology and chemistry. In addition, several initiatives exist to promote the open source ethos more generally throughout the sciences. BIOS (Biological Innovation for Open Society) aims to "extend the metaphor and concepts of open source and distributive innovation to biotechnology and other forms of innovation in biology" and to facilitate "the cooperative invention, improvement and sharing of biological technologies".35 Taking a sceptical view about the wide proliferation of restrictive patents within biological sciences, the initiative has identified a requirement for more transparency to encourage knowledge dissemination and the progression of long-term communities of understanding. Other work like Science Commons,36 an off-shoot of Creative Commons,37 carries similar emphases, with its intention to promote innovation through knowledge sharing. In addition to developing open source software and promoting its ideals, the science community has exhibited a consistent enthusiasm for using existing open source tools. A good example is the NASA Acquisition Internet Service, which in 2000 was moved without a hitch to the open source MySQL database, which has continued to provide a robust foundation to this service ever since.38

34 http://www.opensource.org/licenses/nasa1.3.php [Accessed 7 April 2005, 11:56].
35 http://www.bios.net/daisy/bios/15 [Accessed: 7 April 2005, 11:57].
36 http://science.creativecommons.org [Accessed: 7 April 2005, 11:57].
37 http://creativecommons.org/ [Accessed: 7 April 2005, 11:57].
38 Paula Shaka Trimble, December 2000, "Open Minds on Open Source", FCW.com, http://www.fcw.com/fcw/articles/2000/1204/pol-nasa-12-04-00.asp [Accessed: 7 April 2005, 11:57].

5.1.4 HE/FE Institutions

Notwithstanding the benefits of open source from a digital curation perspective, Richard Stallman expresses a passionate belief that all educational institutions should use free software for several additional reasons. He cites the financial savings, moral influence, and additional learning opportunities that source code availability affords as major motivators.39 Many schools and universities have responded to these and other justifications by implementing open source solutions within their IT environments. Projects such as OSS Watch, funded by the UK’s Joint Information Systems Committee (JISC), offer advice, support, and expertise to higher and further education institutions interested in deploying open source solutions.40 An Insight Special Report entitled ‘Why Europe Needs Free and Open Source Software and Content in Schools’ indicates areas in education in which open source can be deployed. The report concludes that OSS provides “a beneficial way to transfer knowledge and best practice.”41

A recent study among IT specialists in thirty-seven tertiary education institutions in the UK and Antipodes showed that free and open source software is already in place in 94% of surveyed institutions.42 A number of commonly used Virtual Learning Environment packages (VLEs) are open source, including the popular moodle,43 designed to facilitate the creation of online courses.44 Teaching and learning materials, which represent some of the most valuable resources generated in Higher and Further Education institutions, can be more effectively managed and maintained within such open source infrastructures.

Developers face a range of difficulties in Higher and Further Education institutions where intellectual property fruits of employees’ research activities are often retained by the institution itself. In such cases decisions about redistributing the IPR rest with the owner, and it is vital that employees and researchers familiarise themselves with the terms of their employment to ensure that their participation in open source development is legitimate and acceptable. The copyleft requirement within many open source licenses compels those who adapt, build upon or integrate the licensed code to release the fruits under the same license. Therefore, it is vital that employees are aware of the implications within their own institution of utilising copylefted open source products.

39 http://www.gnu.org/philosophy/schools.html [Accessed: 7 April 2005, 11:57].
40 http://www.jisc.ac.uk/ [Accessed: 7 April 2005, 11:57]; http://www.oss-watch.ac.uk/ [Accessed: 7 April 2005, 11:57].
41 http://www.eun.org/insight-pdf/special_reports/Why_Europe_needs_foss_Insight_2004.pdf [Accessed: 7 April 2005, 11:57].
42 David G. Glance, Jeremy Kerr and Alex Reid, January 2004, "Factors Affecting the Use of Open Source Software in Tertiary Education Institutions", http://www.firstmonday.org/issues/issue9_2/glance/ [Accessed: 7 April 2005, 11:57].
43 http://moodle.org, [Accessed: 7 April 2005, 11:57]
44 Open source can also be used for content management in educational institutions. See Paul Conway, December 2003, “Zope at Duke University: Open Source Content Management in a Higher Education Context”, DigiCULT.Info, Issue 6, p. 10, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:57].

5.1.5 Commercial Organisations

Given the quality of the software and the financial savings available, the enthusiasm with which some corporations have embraced open source is indisputable.45 Prevalent now in a range of often mission-critical applications, open source performs a range of roles, from the generic to the highly specialist, within both small to medium enterprises and multi-national mega-corporations. Notable examples of companies with open source deployments or interests include IBM, Novell, Hewlett Packard and Yahoo! These deployments range from web servers and database infrastructures to large-scale distributed computing projects. Facilitating commercial interactions and the management of corporate information long-term, open source and open standards are likely to continue to perform a vital role within the commercial sector.

45 IBM is a good example. See http://www-136.ibm.com/developerworks/opensource/ [Accessed: 7 April 2005, 11:57] for more details.

5.2 Open Source Applications

Since the early development of the GNU/Linux operating system, the open source software library has grown at an impressive rate. From an initial emphasis on server and software infrastructure code, an increasing number of open source projects now commit their resources and efforts to the development of desktop applications in a range of areas, including general office productivity, multimedia development, sound and video editing and manipulation, scientific analysis and desktop publishing. As wide-ranging as open source software is in terms of functionality, it also varies greatly in terms of maturity, stability and performance. Given the vast array of projects currently at different stages of development, identifying the most valuable, useful or technologically worthwhile applications can be difficult. Similarly, with such a broad range of software, locating a particular application can be intimidating, despite the range of excellent web and repository search tools currently available. These problems can be addressed using a number of open source resources. Two of the most prominent web-based examples, SourceForge and Freshmeat, serve distinct but similar functions.46 SourceForge provides free hosting and Web space for thousands of individual open source projects, offering centralised search tools, distribution across several worldwide mirrors and a large community of users offering advice, feedback and impressions of software projects. Freshmeat essentially comprises a massive index of “preferably” open source applications and tools for a range of platforms, together with links to each project’s own pages where the software itself can normally be downloaded. Popularity details, ratings and vitality statistics are maintained and presented, offering novice users clear insights into the success and level of use of individual applications. With an application's ubiquity offering one insight into its curatability, it is important to be able to identify just which are the most used applications and software formats.

Simply browsing these impressive resources offers insights into the range of open source tools that exist, as well as the diversity of the application areas that are covered. In the realms of infrastructure, server and development software, several programs are commonly held to be as good as or better than their proprietary alternatives. Open source software like the Linux Operating System, Apache Web Server and Bind DNS Server, combined with a range of open standards, are all integral parts of the Internet, and one cannot overstate the role that the concept of openness played during the Web’s conception. Recent times have seen an increasing number of more mainstream desktop applications move towards this level of excellence. In addition to continuing to refine the usability and functionality of general desktop applications, it is also a priority for the open source world to expand to meet the functional requirements of more specialist users. To this end, a number of excellent applications specifically aimed at the field of digital curation are now available under open source licenses.

A concern that is often vocalised is that the open source development model, based as it is upon the concept of collaboration, is likely to result in monolithic software infrastructures, with single choices that represent a compromised community consensus. In many cases there is some truth in this. ‘Forking’ is a term that describes the process of branching the development of source code over two or more separate, perhaps incompatible paths: this is strongly discouraged within the open source community. Many open source advocates will argue that when several individual alternative projects condense into a single unified effort it is a good thing, since although competition within the software industry is good for business, it doesn’t necessarily lead to the development of the best software. Since the open source model pools expertise and doesn’t set programmers the task of usurping one another it can achieve a great deal quite quickly. However, this argument is not quite sufficient to quash these concerns. Diversity is welcome within software to overcome systemic failures when they arise. As in many disciplines, mistakes are made, and spreading the intellectual effort more widely is likely to ensure the non-fatality of any problems that are encountered. Nonetheless, although a great deal of open source work finds itself expressed in just one or two applications within each domain area, there still exists some welcome diversity. For instance, as the examples below illustrate, there are a number of individual and distinct institutional repository implementations currently being developed and released under open source licenses.

46 http://sourceforge.net [Accessed: 7 April 2005, 11:57]; http://freshmeat.net [Accessed: 7 April 2005, 11:57]

5.2.1 The GNU/Linux Operating System

The operating system is the central software program within any computer system, communicating at a low level with the microprocessor and other hardware, and organising the execution and run-time of each installed program. Among the most common and familiar examples of proprietary operating systems within the personal computer market are Microsoft’s Windows and Apple’s OS X. The free software movement would have had no foundations if it had had to rely upon a central proprietary program, and this realisation led to the conception of the GNU (GNU’s not Unix) project in 1984, to develop a suite of applications that would represent a free operating system. Identifying early on that multi-platform support was a desirable characteristic, Stallman chose to base his system on the widely published core concepts that are shared by Unix computer systems, the only platform at the time to offer a degree of portability. Following success with a number of applications, eventually GNU lacked only a kernel to make it into a complete, fully functional operating system. The kernel represents the heart of the operating system and manages system memory, the file system and disk operations. In a timely coincidence, a young Finnish programmer, Linus Torvalds, was concurrently building his own Unix compatible kernel, Linux, and this was swiftly integrated into the existing GNU code, resulting in what is now known as GNU/Linux.47

Since its initial release in the early 1990s, GNU/Linux has undergone almost constant refinement, and now represents a mature, stable and usable platform, incorporating most of the features of expensive proprietary Unices, and representing a viable solution for both server and desktop deployment.48 A number of companies distribute the system with straightforward installation packages and a variety of fully integrated applications. Some of the most popular ‘distributions’ include Red Hat, SUSE, Mandrake and Debian.49 Each of these can be downloaded for free from its associated web site or purchased on CD or DVD for a small sum, fully packaged and documented. Most distributions also offer corporate packages, with full support structures more akin to commercial proprietary systems.

Along with hardware, the operating system represents one of the most significant environmental factors in determining the operability of digital objects. Establishing an understanding of the systems that are required to interpret the information encoded within our data streams is essential to facilitate our digital curation endeavours, and GNU/Linux offers the opportunity for anyone to do so.

47 Although the entire operating system is often referred to as “Linux,” Torvalds’s contribution represents only a part (albeit a very significant one) of the overall system. Open source advocates tend to favour the abbreviated terminology, probably because it is more concise and catchy, which helps in its promotion.
48 Unices is the plural of Unix.
49 http://www.redhat.com/ [Accessed: 7 April 2005, 11:57]; http://www.SUSE.com/, [Accessed: 7 April 2005, 11:57]; http://www.mandrakesoft.com/ [Accessed: 7 April 2005, 11:57]; http://www.debian.org/ [Accessed: 7 April 2005, 11:57].

5.2.2 Emulation Applications for Open Source

Many information tasks that can be undertaken using proprietary tools can also be achieved with open source. If an appropriate application is not available, however, the WINE package (Wine Is Not an Emulator) is a “Windows Compatibility Layer” for Linux/Unix which can be used to install and run many Windows applications.50 However, the shortcomings of this project offer some insights into the problems posed when attempting to recreate proprietary, unpublished software infrastructures. Despite the WINE project’s vintage51 it still suffers from instability problems and lacks support for the full range of Windows applications. Legal and technological impediments associated with Windows’ proprietary nature have been significant obstacles to the WINE project’s success. In the event of WINE offering insufficient levels of performance or reliability, commercial Linux applications like VMware allow an alternative operating system to be installed within Linux, and run as an internal application.52 This means that full software support and performance is retained, although VMware is not available under a free license, and any installed operating systems must also be licensed. Another proprietary alternative is to purchase CodeWeavers’ CrossOver Office application, which builds upon WINE technology and offers full, robust and supported Linux compatibility for a very small range of Windows applications, including Lotus Notes, negating the need to purchase an additional Windows license.

For those wishing to run Linux or other Unix applications within a Windows environment, Cygwin offers a "Linux-like environment for Windows", consisting of a Linux API layer and a selection of tools to provide Linux look and feel.53

50 http://www.winehq.org/ [Accessed: 7 April 2005, 12:10].
51 The WINE project’s origins can be traced to June 1993.
52 http://www.vmware.com/support/linux/ [Accessed: 7 April 2005, 12:10].
53 http://cygwin.com [Accessed: 13 July 2005, 11:49]

5.2.3 Server and Development

The success enjoyed by the open source software movement can be directly attributed to a number of infrastructure, server and development technologies that have been wholeheartedly embraced by the technological community. It is in this area that open source has traditionally been most prominent, and within this domain, open source products are well established. Since most of these tools are offered at no cost and offer levels of performance, reliability and security comparable with proprietary alternatives, they appeal to many enterprises, organisations and institutions. Market share is considered in more depth in the quantitative section below.

5.2.4 The Apache Web Server

Alongside GNU/Linux, the Apache Web Server project represents one of the most prominent success stories of the open source movement.54 A web server is an application used to make World Wide Web resources available. In April 2005, Apache had a 69% market share of all those on the Web.55 Straightforwardly configurable, well-documented, secure and available for a wide range of platforms, Apache pushes into second place its closest rival, Microsoft’s Internet Information Server. Apache can be modified to suit particular deployments, allowing system administrators to customise the services they offer, effectively creating new web servers based on the Apache model. For digital curators the transparency offered by Apache is welcome, given the vast numbers of web services and web-deployed applications currently in use that are closely integrated with the web server. An understanding of these applications can only be obtained in many circumstances by understanding the software infrastructure that facilitates their delivery.

54 http://www.apache.org [Accessed: 7 April 2005, 12:10].
55 http://news.netcraft.com/archives/web_server_survey.html [Accessed: 7 April 2005, 12:10].

5.2.5 Databases

Databases are central to most of our interactions with digital technologies, offering storage opportunities as structured as individual applications require. Several open source packages are available. Three particular examples enjoy great prominence, with their own respective individual strengths, each offering a maturity and depth of functionality elevating them above many proprietary packages. MySQL is perhaps the best known, and compares extremely favourably with most proprietary equivalents, particularly in terms of speed and stability.56 It is particularly valuable when deployed on the Web due to its quick handling of multiple connections. PostgreSQL and Firebird are other notable examples, and tend to be regarded as more functionally complete than MySQL.57 None matches the heavyweight functionality offered by the leading proprietary database (Oracle), but unless an application has special requirements PostgreSQL in particular is likely to incorporate most if not all of the necessary features.58 All three packages run natively on a range of platforms including Linux and Windows. MySQL’s significantly larger user base accounts for its more comprehensive documentation and help structures, as well as its increased stability.59 Prominent MySQL users include Google, Cisco, Sabre Holdings, Hewlett Packard, NASA and Yahoo!60

56 http://www.mysql.com [Accessed: 7 April 2005, 12:10].
57 http://www.postgresql.org/ [Accessed: 7 April 2005, 12:10]; http://firebird.sourceforge.net/ [Accessed: 7 April 2005, 12:10].
58 http://www.oracle.com [Accessed: 7 April 2005, 12:10].
59 A fuller comparison of the relative merits of each can be found in "PostGreSQL or MySQL?", http://www-css.fnal.gov/dsg/external/freeware/pgsql-vs-mysql.html [Accessed: 7 April 2005, 12:10] and Ian Gilfillan, December 2003, "PostgreSQL vs MySQL: Which is better?", DatabaseJournal.com http://www.databasejournal.com/features/postgresql/article.php/3288951 [Accessed: 7 April 2005, 12:10].
60 http://www.mysql.com/customers [Accessed: 7 April 2005, 12:10].

5.2.6 The GRID

Of interest to many working in the scientific data disciplines is the use of GRID computing, which uses the processing power of several computers connected via a network to solve complex and large-scale computational problems. CERN61 describes the GRID as "a service for sharing computer power and data storage capacity over the Internet". The responsibility for defining specifications for grid computing is held by the Global Grid Forum (GGF),62 and these are implemented through the Globus Toolkit by the Globus Alliance,63 a group of individuals and organisations developing fundamental technologies behind the GRID. The Globus Toolkit is an open source toolkit used for building Grid systems and applications. A growing number of projects and companies are using this package to unlock the potential of grids for their own specific purposes. It has become the de facto standard for grid middleware and provides a standard platform for services to build upon. As they did during the development of TCP/IP, open source tools are playing a fundamental role in development within this area, facilitating its growth and the creation of new tools and application possibilities. Similarly, open standards and collaboration are intrinsic characteristics of and fundamental requirements for the GRID.

61 Conseil Européen pour la Recherche Nucléaire (European Laboratory for Particle Physics)
62 http://www.gridforum.org, [Accessed: 11 July 2005, 15:43]
63 http://www.globus.org, [Accessed: 11 July 2005, 15:48]

5.2.7 Programming Languages

Part of the GNU/Linux operating system, the GNU C Compiler (GCC) is an open source implementation of a C language compiler. Modules are also available to add support for a range of additional languages such as C++. Furthermore, a number of other languages operate under open source licenses, and several have been relied upon consistently in a range of computing areas. Three of the most prominent are the scripting languages PHP (PHP Hypertext Preprocessor), Perl (Practical Extraction and Report Language), and Python.64 Because all three are traditionally interpreted languages (that is, they need not be compiled prior to execution), their code, when it is made available, is usually in a human-readable form. This also facilitates multi-platform interoperability. PHP is most widely used in the development of dynamic Web pages. Popular among Web developers due to its fast parsing and flexibility, PHP is also versatile and comes with many built-in and modular interfaces. Database connectivity is straightforward; while PHP is most commonly associated with MySQL it can connect to any ODBC-enabled database.

Perl offers similar functionality to PHP, but has traditionally been deployed more generally. Perl is frequently used to add dynamic functionality to Web pages, but it is also used to handle a range of other tasks involved in system administration and data processing. Perl’s rich support for regular expressions also makes it useful as a text manipulation language. Like both Perl and PHP, Python is frequently used in a Web environment. Combining powerful capabilities with a simple syntax, Python also has interfaces to numerous system calls and libraries. Additional modules can be developed using C or C++, extending the functionality to suit individual requirements. All three programming languages are portable, supporting a range of platforms including Linux, Windows, Mac, and OS/2. Because these languages are developed in an open source environment, even if they should fall into decline or obsolescence, existing code can be re-purposed for a future system configuration and executed with mitigated risk of loss.65

64 http://www.php.net [Accessed: 7 April 2005, 12:10]; http://www.perl.com [Accessed: 7 April 2005, 12:10]; http://www.python.org/ [Accessed: 7 April 2005, 12:10].
65 Sun Microsystems are currently involved in a public debate over the merits of opening up their currently closed-source Java programming language. Developments here will be well worth watching, given Java’s prominence as a platform-independent, web-friendly language.

5.2.8 Others

To further describe the various open source software packages that have established major footholds within the Internet infrastructure would take a great deal of space; suffice to say that OSS applications exist for almost every subject area, from eGovernment to gaming. Further examples of popular and successful tools include Sendmail, the world’s most used email server, Bind, the most commonly used domain name system (DNS) server, and Apache Jakarta Tomcat, one of the most popular Java Servlet and Java Server Pages containers in use on the Web, providing an infrastructure for the delivery of Web-based Java programs.66

66 http://www.sendmail.org/ [Accessed: 7 April 2005, 12:10]; http://www.isc.org/index.pl?/sw/Bind/ [Accessed: 7 April 2005, 12:10]; http://jakarta.apache.org/tomcat/ [Accessed: 7 April 2005, 12:10].

5.2.9 Desktop and Productivity

While its traditional arena of dominance since the early 1990s has been in servers and network infrastructure, the recent upsurge in popularity of open source has led to the development of a number of mature and functionally rich desktop applications that measure up well against their proprietary peers. Although notable gulfs continue to exist in some areas, several key open source desktop applications have introduced innovative and practically useful features that have been subsequently adopted into commercial proprietary applications.

5.2.10 OpenOffice.org

With the numerous problems associated with the proprietary and hidden nature of Microsoft’s Office formats, institutions should be extremely wary of regarding Office-encoded data as curated. OpenOffice is a large-scale project, backed by Sun Microsystems, to develop a comprehensive and transparent suite of tools, including word processor, spreadsheet and presentation software, that writes to open standard formats defined in XML.67 With its current version (1.1.4), the project has enjoyed success.

unique features such as built-in PDF-writing support and a user interface that integrates each of the individual applications, it lacks the polish and some of the more advanced functionality of
‘Read’ and ‘write’ support for the latest version of Office. OpenOffice 1.1 is Microsoft Office formats is included, but since regarded as functionally comparable to these are not publicly documented, errors are Microsoft’s Office 97, although the open source occasionally encountered, particularly when product is thought to offer several additional dealing with complex structures such as tables. features and a greater level of reliability. Many Objects such as images, plugins, videos and ordinary users are unlikely to have requirements charts can be embedded as straightforwardly as that extend beyond the features that are with Microsoft tools. There is no support for included. However, when Jack Wallen Jr. of Visual Basic for Applications (VBA) macros, ZDNet Australia writes “if you can do it in since this is a proprietary Microsoft technology, Microsoft Office, you can do it in but scripting is possible using the integrated OpenOffice.org… for free,” it should be borne StarBasic syntax. The imminent new release of in mind that OpenOffice still has some way to OpenOffice will include support for the OASIS68 go before matching all of the functionality of OpenDocument format, which is likely to be Microsoft’s flagship application.70 That it adopted by the European Commission as the exceeds Microsoft’s efforts in terms of recommended format for document interchange implementing a system for the creation, editing within the European public sector. and rendering of preservable documents is An eWeek survey comparing OpenOffice however, unquestionable. 1.1 and Microsoft’s Office 2003 illustrates the relative merits of the two application suites.69 5.2.11 The Mozilla Project The general consensus is that while OpenOffice Comprising a number of individual represents a good free package, with several programs, the Mozilla project represents one of the most successful open source desktop 67 http://www.openoffice.org [Accessed: 7 April 2005, 12:13]. 
application projects, offering a level of maturity, 68 Organisation for the Advancement of Structured functionality and innovation that matches and Information Standards, http://www.oasis- surpasses much equivalent proprietary open.org/home/index.php , [Accessed: 7 July 2005, software.71 At its forefront is the Mozilla 14:39] 69 Jason Brooks, April 2004, "Office 2003 vs. 70 http://www.zdnet.com.au/insight/0,39023731,202703 Openoffice.org", Eweek.com, 00,00.htm [Accessed: 7 April 2005, 12:10]. http://www.eweek.com/article2/0,1759,1571626,00.as 71 http://www.mozilla.org/ [Accessed: 7 April 2005, p [Accessed: 7 April 2005, 12:10]. 12:10]. Page 40 DCC Digital Curation Manual package, which offers Web browsing, email, system Bugzilla, which facilitates the reporting newsgroup and IRC access and a Web of errors from a wide number of applications, development application within a single, ensuring their continued development and customisable interface. It is particularly in the refinement.74 area of security that Mozilla outshines Microsoft’s frequently vulnerable Internet 5.2.12 Specific Open Source Applications for Explorer and Outlook Express, but it also offers Digital Curation greater compliance to web standards and The increasing maturity of open source improved functionality, with a built-in in pop-up software has led to the development of a range blocker and integrated search facility that unless of tools designed for the achievement of patched with Microsoft’s Windows XP Service specific, specialist goals. A number of factors Pack 2, Internet Explorer does not offer. make the open source model particularly The Mozilla project has also developed a suitable for the digital curation community, and range of sister products. Mozilla Firefox is a this has led to concentrated development in this stripped down, lightweight Web browser with area. 
The requirements for openness and support for the addition of separate modules of ‘future-proofing’ within software and the often functionality, or extensions.72 It aims to be fully physically distributed nature of organisations customisable, with optional features that and projects are issues in which open source can introduce exciting navigational and development be profoundly beneficial. possibilities. Thunderbird is another project, The following sections detail a number of essentially promising the same things for email prominent open source applications aimed as Firefox offers for the Web.73 Lightweight and specifically and indirectly at meeting digital secure, it supports all major protocols and can be curation requirements. fully customised to suit environment or user (a) Fedora Digital Object Repository preferences. All of these Mozilla tools are Management System available for Windows, Linux, and Mac OS X, emphasising the interoperability of these The Fedora project (not to be confused solutions. with Red Hat’s Fedora Linux distribution), The Mozilla project has inspired and originally developed by the Digital Library enabled the development of a number of other Research Group at Cornell University is one of applications, including the defect-tracking several digital object repository architectures that have been proposed in recent years. The 72 http://www.mozilla.org/products/firefox/ [Accessed: 7 Fedora structure is based on object models that April 2005, 12:10]. 73 http://www.mozilla.org/projects/thunderbird/ 74 http://bugzilla.mozilla.org/ [Accessed: 7 April 2005, [Accessed: 7 April 2005, 12:15]. 12:15]. Andrew McHugh, Open Source for Digital Curation Page 41 each form the template for individual units of notable qualities, in particular the fact that the content, called data objects. 
These can contain software is implemented as a set of Web various digital content, associated metadata and Services and that its full functionality is exposed references to representation information. through a series of well defined web service Behaviour objects describe tools and services . The D-Lib article describes four use case that can be used by the repository to provide scenarios for Fedora, illustrating its usefulness. access to the data objects. The system has a From the first “low barrier to entry” scenario to three-layered architecture. The Web Services a full fledged digital library or repository for Exposure layer defines interfaces for distributed objects Fedora is sufficiently administration and access, the Core Subsystem flexible to meet the requirements of many layer implements their subsystems and the institutions. Undoubtedly innovative, and Storage Layer implements the storage subsystem functionally rich, Fedora offers a good that handles reading, writing, and removal of illustration of what open source development data from the repository. Digital Objects are can achieve. stored as XML files corresponding to an extension of the Metadata Encoding and (b) DSpace Transmission Standard (METS). Among The DSpace Institutional Repository Fedora’s “noteworthy” features identified by D- System, developed jointly by MIT and Hewlett 75 Lib Magazine are its XML submission and Packard offers capture, storage, indexing, storage, its access control and authentication, preservation and redistribution functionality. It 76 searching, OAI-PMH Metadata harvesting and aims to satisfy the definition of an Institutional a batch utility supporting the mass creation and Repository offered by Clifford Lynch, as “an loading of data objects. 
A recent paper entitled organisational commitment to the stewardship “Fedora: An Architecture for Complex Objects of digital materials, including long-term 77 and Their Relationships” describes further preservation where appropriate, as well as organisation and access or distribution”.78 75 Thornton Staples, April 2003, "The Fedora Project: An Open-source Digital Object Repository Wilper, (rev v.4, March 2005), "Fedora: An Management System", D-Lib Magazine, Volume 9, Architecture for Complex Objects and Their Number 4, Relationships" , http://www.dlib.org/dlib/april03/staples/04staples.html http://www.arxiv.org/pdf/cs.DL/0501012 [Accessed: [Accessed: 7 April 2005, 12:15]. 7 April 2005, 12:15]. 76 Open Archive Initiative Protocol for Metadata 78 Clifford A. Lynch, February 2003,"Institutional Harvesting, Repositories: Essential Infrastructure for Scholarship http://www.openarchives.org/OAI/openarchivesprotoc in the Digital Age" ARL, no. 226: 1-7, ol.html [Accessed: 7 July 2005, 14:53] http://www.arl.org/newsltr/226/ir.html [Accessed: 7 77 Carl Lagoze, Sandy Payette, Edwin Shin, Chris April 2005, 12:15]. Page 42 DCC Digital Curation Manual
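OAI-PMH, cited above among Fedora's features, is a simple HTTP-based protocol: a harvester sends a request whose "verb" parameter names an operation, and the repository replies in XML. The Python sketch below is an illustration only and is not part of Fedora or of any repository's own tooling. The base URL and the sample response are hypothetical; the ListRecords verb, the oai_dc metadata prefix and the Dublin Core namespace are taken from the OAI-PMH specification. It builds a harvesting request and extracts titles from a response without contacting a network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the unqualified Dublin Core elements carried by the oai_dc format.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Return the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting by set
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Collect every dc:title value found in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter("{%s}title" % DC_NS)]


# Hypothetical, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


if __name__ == "__main__":
    # The host name below is a placeholder, not a real repository.
    print(build_list_records_url("http://repository.example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also need to follow the resumption tokens that repositories use to page through large result sets, which this sketch omits.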

DSpace can be configured to accept a diversity of digital content ranging from documents, books and theses to data sets, computer programs, visual simulations and models. These are organised according to the groups that contribute content, called 'communities', and 'collections', which house individual content items and files. The primary goal of DSpace is digital preservation: to provide long-term physical storage and management of materials in a secure and effectively administered environment. Persistent identifiers are allocated to each stored item in the interests of ensuring their longevity, with preservation conducted in both bit and functional terms. The latter is achieved using emulation or migration strategies for supported open formats and by relying on the third party tools that are expected to emerge for popular proprietary formats. It is conceded that for unknown or one-off proprietary data functional preservation is difficult, but by preserving the bit-stream too it is hoped that future digital archaeologists will at least have the opportunity to retrieve and reproduce information. Ancillary DSpace functionality allows the implementation of access controls, versioning and search and retrieval based on Dublin Core metadata which can be applied to each submitted object. DSpace is promoted as a flexible solution, equipped to effectively handle the diversity of materials and expectations implicit within a multi-disciplinary archive, mainly through its use of communities in the organisation of its information. Built-in Java APIs allow the interoperation of stored content with other systems that an institution may maintain.

(c) FreeBXML

ebXML (Electronic Business using eXtensible Markup Language)[79] is a suite of specifications that facilitate the exchange of business information over the Internet by organisations of any size. Using these specifications it is possible for organisations to exchange messages, trade, communicate in common terminology and define and register processes relevant to their business. Started in 1999 by OASIS and the UN/ECE agency CEFACT, it was based upon five layers of substantive data specification, which are realised in XML standards for business processes, core data components, collaboration protocol agreements, messaging, and registries and repositories.[80] freebXML is an initiative aiming to promote the ebXML specifications through the sharing of software, expertise and experience. Its web site (http://www.freebxml.org) offers centralised access to relevant code and applications as well as a forum for discussion about developments and deployments using ebXML. Among the most relevant programs available from the web site under open source licenses are freebXML CC, a set of tools developed to facilitate the management of data dictionaries, and freebXML Registry, built upon an extensible information model, allowing the storage of any kind of data in its repository, with arbitrary associations created between registry entries.

(d) JHOVE

The result of collaboration between JSTOR[81] and Harvard University Library, the JSTOR/Harvard Object Validation Environment[82] aims to provide functions to perform format-specific identification, validation and characterisation of digital objects. In essence, this offers the opportunity to automatically determine a particular object's format if unknown, to confirm or deny the validity of a purported example of a particular format by assessing whether it meets the syntactic and semantic requirements of that format, and to identify the intrinsic properties of a particular object based on its format. Standard format modules are distributed with the system and include AIFF, ASCII, GIF, HTML, JPEG, PDF, TIFF and XML.

(e) LOCKSS

LOCKSS stands for "Lots of Copies Keep Stuff Safe"[83] and is an open source peer-to-peer application with its core raison d'être to offer persistent access to preserved digital materials. Initiated by Stanford University Libraries, LOCKSS runs on standard desktop workstations and offers librarians and information managers the opportunity to create low-cost, persistent and accessible copies of digital content as it is published. A secure peer-to-peer polling and reputation system ensures that the integrity and accuracy of LOCKSS materials are maintained.

(f) Xena

The National Archives of Australia originally developed the XML Electronic Normalising of Archives[84] project to meet the challenges posed by preserving electronic records into the future in a constantly changing hardware and software culture. The application aims to resolve these concerns by converting electronic records in proprietary formats to a standardised XML format that can be read by future technology. Xena's current version supports a range of formats that can be converted with no information loss to the standard XML. These include Microsoft's Word, Excel and Powerpoint, the OpenOffice.org suite of formats, RTF, relational database files, JPG, GIF, TIFF, PNG and BMP image files, HTML and plain text. Furthermore, with its plug-in based architecture, Xena can conceivably be extended to support any other formats.[85]

5.2.13 Other Institutional Repository Implementations

Developed at the University of Southampton, GNU EPrints facilitates the creation of online archives, with the default configuration being a repository of the research output of an academic institution. EPrints servers are designed to help dissemination of research publications by sharing associated metadata using Open Archives Initiative (OAI) standards. Further open source alternatives include MyCoRe, developed in Germany, the Dutch ARNO and also CERN Document Server Software. According to a recent DPC Technology Watch Report by Paul Wheatley into "Institutional Repositories in the Context of Digital Preservation",[86] none of these four solutions cites digital preservation as a key aim. With the range of repository solutions available there is increasing interest in how multiple repositories might cooperate within a global curation network. The transparent nature of open source facilitates the implementation of systemic connections offering a range of potential benefits. Cooperation on the selection of content and optimisation of technical infrastructures can take place, and duplication of effort can be minimised.

Open source carries other advantages in the context of digital resource registries and repositories. For instance, representation information registries can benefit from its flexible legal status by offering direct access to rendering, management and conversion applications that are distributed under open source licensing agreements.

[79] http://www.ebxml.org/.
[80] For further information see Brian Gibb, Suresh Damodaran, 2002, "ebXML: Concepts and Application", Wiley, ISBN: 076454960X, or Alan Kotok and David R.R. Webber, 2001, "ebXML: The New Global Standard", New Riders, ISBN: 0735711178.
[81] The Scholarly Journal Archive, http://www.jstor.org [Accessed: 25 July 2005, 11:43].
[82] http://hul.harvard.edu/jhove/ [Accessed: 25 July 2005, 11:43].
[83] http://lockss.stanford.edu/ [Accessed: 25 July 2005, 11:43].
[84] http://xena.sourceforge.net/ [Accessed: 25 July 2005, 11:43].
[85] For an example of an open source tool which 'normalises' file formats for preservation see Adam Rusbridge, April 2004, "XENA: Electronic Normalising Tool", DigiCULT.Info, Issue 7, page 32, http://www.digicult.info/pages/newsletter.php [Accessed: 7 April 2005, 11:49].
[86] Paul Wheatley, 2004, "Institutional Repositories in the context of Digital Preservation", DPC Technology Watch Series Report 04-02.
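The three functions described for JHOVE in Section 5.2.12(d), identification, validation and characterisation, can be illustrated with a small Python sketch. This is not JHOVE's code or API (JHOVE is a Java application with far more thorough format modules); the function names are hypothetical and the checks are deliberately shallow. It assumes only the well-known leading byte signatures of four of the formats listed above and the layout of the GIF header.

```python
import struct

# Leading byte sequences ("magic numbers") for a few formats named in the text.
SIGNATURES = (
    ("GIF", (b"GIF87a", b"GIF89a")),
    ("PDF", (b"%PDF-",)),
    ("JPEG", (b"\xff\xd8\xff",)),
    ("TIFF", (b"II*\x00", b"MM\x00*")),
)


def identify_format(data):
    """Identification: guess the format from the first bytes, or return None."""
    for name, prefixes in SIGNATURES:
        if data.startswith(prefixes):
            return name
    return None


def gif_is_well_formed(data):
    """Validation (very shallow): signature, screen descriptor and trailer byte."""
    # 6-byte signature + 7-byte logical screen descriptor + 1-byte trailer (0x3B).
    return identify_format(data) == "GIF" and len(data) >= 14 and data.endswith(b";")


def gif_dimensions(data):
    """Characterisation: read canvas width and height from the screen descriptor."""
    width, height = struct.unpack("<HH", data[6:10])
    return width, height


if __name__ == "__main__":
    # A minimal synthetic GIF-like byte string, built here purely for illustration.
    sample = b"GIF89a" + struct.pack("<HH", 640, 480) + b"\x00\x00\x00;"
    print(identify_format(sample), gif_is_well_formed(sample), gif_dimensions(sample))
```

Signature matching only establishes what a file claims to be; a production validator would parse the complete structure before declaring an object valid.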

6 Quantitative Issues

6.1 Financial Costs of Open Source Software

It is important to consider financial implications in terms of total cost of ownership (TCO), particularly when software is distributed free of charge. However, since a definitive list of the cost factors that should be taken into account in a TCO study has yet to be settled upon, a number of conflicting accounts exist. It is no doubt possible to identify a persuasive TCO study in favour of most software configurations, but the actual figures will always depend on a specific combination of environment and requirements. An accurate picture can only be drawn following consideration of all the relevant individual cost elements, from software and hardware purchases to administration, and from technical support to staff training.

As well as ambiguities across surveys it is also clear that no one has yet conducted any formal studies as to the relative cost implications of conducting a digital curation strategy with open source or proprietary tools. This is true both when considering the relative costs of curating digital resources that are open source or proprietary and when considering using open source tools to conduct our digital curation activities. While it seems likely that the use of open tools and software formats will prolong the longevity of our digital assets, it is hard to present the cost implications of this hypothesis in quantitative terms. Instead we must rely on the figures that do exist, which describe the relative cost implications of using open source and proprietary tools in more general ways. Inevitably there will be unique implications relating to the cost of digital curation, but these are difficult to assess at this time.

6.2 Software Acquisition and Upgrade Costs

The initial acquisition cost of open source software will usually be less than any proprietary alternative. Of course, it need not be free (i.e. gratis) under the terms of its license, and other additional costs may be incurred for documentation, storage media, and support contracts. Taking these factors into account, a 2001 study by Cybersource Consulting found the following acquisition cost results, illustrating the scalability of an open source solution over three increasingly sized installation environments:[87]

             Microsoft              GNU/Linux OS    Savings
50 users     $69,987 (€56,615)      $80 (€65)       $69,907 (€56,550)
100 users    $136,734 (€110,609)    $80 (€65)       $136,654 (€110,544)
250 users    $282,974 (€228,907)    $80 (€65)       $282,894 (€228,842)

The reason why open source software scales so well is that it is only necessary to purchase or obtain a single license, which covers unlimited subsequent installations. Proprietary software, on the other hand, is typically licensed on a per-installation or per-user basis. This is worth bearing in mind if the intention is to deploy a large number of workstations. Network World Fusion News reported in 2001 that a major part of the reason for an increase in Linux's deployment in finance, healthcare, banking and retail was its scalability in cost and technical terms when large numbers of identical sites and servers are needed. The journal calculated that for a 2,000-site deployment SCO UnixWare would cost $9 million (€7.3m), Windows $8m (€6.5m), and Red Hat Linux just $180 (€146).[88]

Upgrade costs also compare favourably using open source applications. Proprietary upgrades will typically cost around half the amount of the original application. Users can subsequently find themselves at the mercy of the proprietary companies, who have a monopoly on the distribution of their software. To upgrade an open source application one simply has to download the latest version, or pay the original cost once again and redeploy across as many machines as required.

6.3 License Management and Litigation Costs

Needless to say, open source software is again favourable in the context of license management and litigation. For users of proprietary software, failure to adhere to the strict terms of software licenses can lead to extremely heavy fines and even custodial sentences. It is therefore in users' interests to manage licenses effectively, undertaking regular software audits and even installing license-tracking software. Under an open source license such procedures, costly in terms of both time and money, are rendered unnecessary. Similarly, the costs involved in migrating to and from open source formats, emulating existing open source software architectures and supplying open source tools to render or convert particular file formats are likely to be dramatically lower than with proprietary alternatives.

[87] Cybersource, 2004, "Linux vs. Windows Total Cost of Ownership Comparison", http://www.cyber.com.au/cyber/about/linux_vs_windows_tco_comparison.pdf [Accessed: 7 April 2005, 12:18].
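The scaling argument made in Section 6.2 can be checked with simple arithmetic. The following Python sketch uses only the US dollar figures quoted in the acquisition cost table; the function names are illustrative. It recomputes the savings column and shows how the cost per user behaves under the two licensing models: roughly proportional to the number of seats for the proprietary configuration, and shrinking towards zero for the single open source purchase.

```python
# Acquisition costs in US dollars, as quoted in the table in Section 6.2.
PROPRIETARY_COST = {50: 69987, 100: 136734, 250: 282974}
OPEN_SOURCE_COST = 80  # one purchase, installable on any number of machines


def savings(users):
    """Difference between the proprietary and open source acquisition cost."""
    return PROPRIETARY_COST[users] - OPEN_SOURCE_COST


def cost_per_user(total_cost, users):
    """Average acquisition cost carried by each user."""
    return total_cost / users


if __name__ == "__main__":
    for users in sorted(PROPRIETARY_COST):
        proprietary = cost_per_user(PROPRIETARY_COST[users], users)
        open_source = cost_per_user(OPEN_SOURCE_COST, users)
        print(f"{users:>3} users: savings ${savings(users):,}; "
              f"per user ${proprietary:,.2f} vs ${open_source:,.2f}")
```

The calculation covers acquisition only; the support, training and administration costs discussed in the rest of Section 6 would have to be added for a full TCO comparison.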

6.4 Hardware Costs

As far as hardware is concerned, it is generally acknowledged that open source software, such as the GNU/Linux operating system, can run effectively on a lower specification machine than its Windows equivalent. The latest version of Windows, XP Professional, recommends a 300 MHz Intel compatible processor, 128MB of RAM and a minimum of 1.5GB of hard disk space. A comparable version of Mandrake Linux (version 9.1) requires only a Pentium class processor, 128MB of RAM and 150MB of disk space. As Linux desktops and user interfaces have become more graphically complex the margin between the two has narrowed. But since a Linux system tends to be much more configurable, it is more straightforward to install only those parts of a system that are really required, saving processor power and disk space.[89] Furthermore, unlike existing consumer Windows systems, most Linux distributions are available in both 32bit and 64bit Intel-based versions, offering users the opportunity to exploit the advantages of newer, more sophisticated hardware. Further emphasising the interoperability intrinsic to open source, many GNU/Linux distributions are available for non-Intel hardware, such as PowerPC, Sun SPARC and Alpha. The transparency at Linux's core facilitates porting to limitless alternative hardware environments, and numerous endeavours are constantly being undertaken to run Linux on devices as varied as Apple's iPod digital music device[90] and Microsoft's X-Box gaming console.[91] With the uncertainty surrounding tomorrow's computer hardware environments it is comforting to know that such migration remains possible, irrespective of the nature of hardware products and their original intended purposes. Suffice to say, from a cost perspective the flexibility of Linux mitigates the likelihood of hardware obsolescence, offering users more opportunity to make the most of their existing resources. Furthermore, rumours suggest that future versions of Microsoft Windows will incorporate hardware-tied security, essentially limiting the user's ability to configure the hardware environment on which the software operates.[92] This may have problematic consequences for future migration and reuse.

6.5 Support and Training

For other, less explicit up-front costs it becomes more difficult to find consensus. Technical support and administration is one such area. Microsoft claims that it is more straightforward to find trained administrators and technicians for its platforms, and that they therefore cost less. However, the open source community rebuts this, arguing that with GNU/Linux fewer administrators are required, because it is possible to automate a great deal and the systems are more reliable. Support for open source products may be less available in a formal commercial capacity, but many problems encountered in one's digital curation efforts can be mitigated by turning to a vast web-based community of users and experts that has demonstrated a regular and enthusiastic commitment to offer assistance.

A further question is that of training. Anecdotal evidence suggests that the costs involved are fairly modest, thanks to the proliferation of modern GUI desktops within Linux systems. It remains to be seen whether this is demonstrably true across the board, although it could be argued that training costs should be no more than those incurred for Windows training. However, retraining experienced Windows users will inevitably be more challenging, and will involve higher associated costs.

6.6 Total Cost of Ownership

The Robert Frances Group's July 2002 study found that the TCO of GNU/Linux is roughly 40% of that of Windows, and 14% of Sun Microsystems' Solaris.[93] The group used actual costs of production deployments of web servers at fourteen Global 2000 enterprises, basing its analysis on software and hardware purchases and maintenance, upgrade and administrative costs. This study also found that although Windows administrators cost less individually, each Linux or Solaris administrator could cover many more machines, making Windows administration more expensive. It was also revealed that Windows administrators spent twice as much time patching systems and dealing with security issues as the others.

There is also a great deal of persuasive testimonial evidence from a range of companies and public institutions that have used open source successfully and saved money. For instance, Amazon.com was able to cut US$17m (€13.8m) in technology expenses in a single quarter by moving to Linux. The city of Largo in Florida saved $1m (€811,000) by using GNU/Linux and 'thin clients,' and Intel Vice President Doug Busch reported savings of $200m (€162.4m) by replacing proprietary UNIX servers with GNU/Linux alternatives.[94]

While TCO studies are useful for interest purposes, the source of their commissioning should be carefully noted. At least one (albeit Microsoft sponsored) study has suggested that Windows is cheaper than Linux,[95] although technology writer Joe Barr has discussed and criticised some of the problems inherent in the report, such as assuming no upgrades over a five year period, costing for an older operating system, and not using the current Microsoft Enterprise license. Barr concludes his report by stating that "TCO is like fine wine: it doesn't travel well. What may be true in one situation is reversed in another. What gets trumpeted as a universal truth ('Windows is cheaper than Linux') may or may not be true in a specific case, but it is most certainly false when claimed universally."[96] Conversely, it is very unlikely that for every configuration Linux represents a more cost-effective solution.

6.7 Longer-term Considerations

Of particular interest to the digital curator are the long-term cost implications of using open source software. As suggested above, there are few (if any) conclusive sources offering quantitative savings information, but the reusability and accessibility implicit in open source will result in inevitable cost savings when long-term access to digital materials is required. The costs of recovery of information are becoming increasingly well known. Several insightful examples come from the US legal system, where information discovery legislation compels litigants to produce electronic materials prior to trial, at the defendant's expense.[97] Proprietary systems can cause problems in the face of these kinds of requirements, hampering straightforward access to digital materials and bottlenecking the legal process. In the case of Zubulake v. UBS Warburg,[98] the retrieval and presentation of content from a single email stored on backup tapes was priced at $175,000. In another similar example, in the case of Murphy Oil Corporation v Fluor Daniel,[99] it was stated that the recovery of email content into a presentable form would cost some $6.2 million and take more than six months, excluding attorney time. UK Freedom of Information legislation creates similar obligations to provide information, which once again can be problematic within a proprietary or opaque system. These sums represent the cost of inaccessible digital resources. By introducing transparency at every level of one's information infrastructure, in terms of both technology and comprehension, such costs can inevitably be mitigated. Irrespective of legal obligations to present information, long-term use inevitably requires the repackaging and migration of digital materials to accommodate the technological environments and standards of the day. Only by establishing and effectively documenting an understanding of our digital resources can such activities be more straightforwardly, and more cost-effectively, undertaken.

[88] Deni Connor, March 2001, "Linux Slips Slowly into the Enterprise Realm", http://www.nwfusion.com/news/2001/0319specialfocus.html [Accessed: 7 April 2005, 12:15].
[89] This also means that hardware can be used for longer, without the need to upgrade so frequently.
[90] http://neuron.com/~jason/ipod.html [Accessed: 7 April 2005, 14:35].
[91] http://www.xbox-linux.org/ [Accessed: 7 April 2005, 14:57].
[92] http://www.theregister.com/2003/11/03/ms_to_intro_hardwarelinked_security/ [Accessed: 25 July 2005, 11:55].
[93] David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30].
[94] Lynn Haber, April 2002, "City Saves with Linux, Thin Clients", ZDNet, http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2860180,00.html [Accessed: 7 April 2005, 12:15]; David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30]; Stephen Shankland, Margaret Kane and Robert Lemos, October 2001, "How Linux Saved Amazon Millions", News.com, http://news.com.com/2100-1001-275155.html?legacy=cnet&tag=owv [Accessed: 7 April 2005, 12:15].
[95] IDC, "Windows 2000 Versus Linux in Enterprise Computing", http://www.microsoft.com/windows2000/docs/TCO.pdf [Accessed: 7 April 2005, 12:15].
[96] David A. Wheeler, 2005, "Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!", http://www.dwheeler.com/oss_fs_why.html [Accessed: 7 April 2005, 11:30].
[97] National Electronic Commerce Coordinating Council, 2004, "Effectively Managing the Discovery of Electronic Records", http://www.ec3.org/Downloads/2004/Effectively_Man_Discovery_of_El_Records.pdf [Accessed: 7 April 2005, 12:15].
[98] Zubulake v. UBS Warburg LLC, 217 F.R.D. 309 (S.D.N.Y. 2003) [Note: Zubulake I, Opinion of 13 May 2003]; Zubulake v. UBS Warburg LLC, 216 F.R.D. 280 (S.D.N.Y. 2003) [Note: Zubulake II, Opinion of 24 July 2003].
[99] Murphy Oil v. Fluor Daniel, Inc., 2002 WL 246439 (E.D. La. 19 Feb 2002).

6.8 Performance and Reliability

Performance and reliability are important

additional 24% of applications hung when presented with valid keyboard and mouse input. When the same test was undertaken five years earlier using a then current Linux distribution, the failure rate was just 9%; and since then the reliability of open source software has improved. Comparable studies by IBM and Bloor Research have had similar results.[101]

An eWeek survey in 2002 found that MySQL was comparable to the proprietary market leader, Oracle, and offered better performance than a number of other proprietary applications, including Sybase Inc's ASE, IBM's DB2 and Microsoft's SQL Server 2000 Enterprise Edition.[102]

As far as performance is concerned, the results for open source are also promising.
As factors in determining the contemporary and with TCO, performance benchmarks are often long-term usefulness of our digital assets, and dependent on environment, as well as whatever further empirical evidence suggests that open assumptions the tester has made; the only real source technologies compare favourably in these benchmark that can be of value to an individual terms with their proprietary peers. Again, there user is the one that most closely mirrors the are few formal quantitative studies based work actually being done. PC Magazine found explicitly around the performance of digital in November 2001 that Linux with SAMBA curation applications or curation processes, but significantly outperformed Windows 2000. At one may regard the more generic examples that do exist as a useful barometer. According to a nt.pdf [Accessed: 7 April 2005, 12:15]. study undertaken at the University of Wisconsin 101 Li Ge, Linda Scott and Mark VanderWiele, 2003, in 2000, 21% of Windows 2000 applications "Putting Linux Reliability to the Test", http://www- 106.ibm.com/developerworks/linux/library/l-rel/ crashed when presented with random testing 100 [Accessed: 7 April 2005], 12:15; using valid keyboard and mouse input. An http://gnet.dhs.org/stories/bloor.php3 [Accessed: 7 April 2005, 12:15]. 100 Justin E. Forrester and Barton P. Miller, 2000, "An 102 Timothy Dyck, February 2002, "Server Databases Empirical Study of the Robustness of Windows NT Clash", eWeek, Applications Using Random Testing", http://www.eweek.com/article2/0,3959,293,00.asp ftp://ftp.cs.wisc.edu/paradyn/technical_papers/fuzz- [Accessed: 7 April 2005, 12:15]. Andrew McHugh, Open Source for Digital Curation Page 51 one stage in the test, using a 1Ghz Pentium 3 Serving Web pages is one of several areas with 512MB of RAM and handling thirty client where open source software is dominant. 
connections, the Linux software was 78% faster According to Netcraft’s statistics on web than Microsoft’s.103 In February 2003, a team of servers the Apache Web Server was responsible physicists broke the Internet2 Land Speed for some 70% of all Web pages in July 2005, Record using GNU/Linux, sending 6.7GB of with the closest rival Microsoft’s Internet uncompressed data from Sunnyvale, California Information Server responsible for just 23%. to Amsterdam in the Netherlands in just 58 GNU/Linux is the second most prevalent 104 seconds. operating system for web servers with 29%, behind Windows which has just under half of 6.9 Market Share the entire market share. Other open source The market share enjoyed by open source operating systems (such as FreeBSD) comprise software is significant from a digital curation around 6% of all those that serve Web pages.105 perspective since it is likely to determine to a Open source software enjoys prominence great extent the demand from within the in other areas of the Internet too. Sendmail is the community for preservation of these digital leading email server, with 42% of the market assets and their associated information. While share.106 The DNS server Bind, an application not the only important factor, it is often that translates human-readable Web site names advantageous in the interests of longevity to go into a format understandable by computers, had with what most people are using. Generally a 95% market share in 2000.107 Further speaking, it is marginalised or minimally emphasising the web-based prominence of open adopted hardware, software and standards that source, PHP is the most commonly used Web are more likely to become irretrievably lost. 
programming language in the world, running on Several examples exist where open source over eighteen million sites during January 2005, software leads the field, or enjoys prominence outstripping its primary rivals ASP.NET, Java on a commensurate level with its proprietary Server Pages, and Cold Fusion.108 equivalents.

103 Oliver Kaven, November 2001, "Performance Tests: File Server Throughput and Response Times", Pcmag.com, http://www.pcmag.com/article2/0,1759,16227,00.asp [Accessed: 7 April 2005, 12:15].
104 Katie Dean, February 2003, "Data Flood Feeds Need for Speed", Wired, http://www.wired.com/news/infostructure/0,1377,57625,00.html [Accessed: 7 April 2005, 12:15].
105 http://news.netcraft.com/ [Accessed: 7 April 2005, 12:25].
106 http://cr.yp.to/surveys/smtpsoftware6.txt [Accessed: 7 April 2005, 12:25].
107 Bill Manning, "in-addr version distribution", http://www.isi.edu/~bmanning/in-addr-versions.html [Accessed: 7 April 2005, 12:25].
108 http://www.php.net/usage.php [Accessed: 7 April 2005, 12:25].

At a more general level, the Free/Libre and Open Source Software (FLOSS): Survey and Study, published in June 2002, found that 43.7% of German establishments, 31.5% of British establishments, and 17.7% of Swedish establishments reported using open source or free software.109 Open source is well established within the infrastructure of the Internet, but it is only in relatively recent times that appropriate software has become available to make Linux a viable choice for desktop computer users. Improvements in graphical user interfaces (GUIs) and the potential to use Linux without recourse to unfamiliar command line instructions have made it a more appealing prospect for casual or non-expert users. A number of companies and organisations are planning or beginning migration. Public sector institutions, which require autonomy over their software infrastructures and straightforward digital preservation, are understandably among the most enthusiastic.

109 http://www.infonomics.nl/FLOSS/ [Accessed: 7 April 2005, 12:25].
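
The administrator-cost argument in section 6.6 and the transfer figure in section 6.8 can be checked with simple arithmetic. The Python sketch below is illustrative only. The salary and machines-per-administrator values are hypothetical and are not taken from the Robert Frances Group study, and the function names are assumptions introduced here. Only the 6.7GB and 58 second figures come from the text above, with 1GB taken to mean 10**9 bytes.

```python
def admin_cost_per_machine(salary: float, machines_per_admin: int) -> float:
    """Annual administration cost attributable to a single machine."""
    if machines_per_admin <= 0:
        raise ValueError("machines_per_admin must be positive")
    return salary / machines_per_admin


def throughput_gbit_per_s(gigabytes: float, seconds: float) -> float:
    """Average transfer rate in gigabits per second (1 GB = 10**9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gigabytes * 8 / seconds


# Hypothetical figures: a cheaper administrator who covers fewer machines
# can still cost more per machine than a dearer one who covers many.
windows = admin_cost_per_machine(salary=50_000, machines_per_admin=10)
linux = admin_cost_per_machine(salary=70_000, machines_per_admin=40)
print(f"Windows: {windows:.0f} per machine, Linux: {linux:.0f} per machine")
# prints "Windows: 5000 per machine, Linux: 1750 per machine"

# Figures quoted in section 6.8: 6.7GB transferred in 58 seconds.
print(f"Internet2 record: {throughput_gbit_per_s(6.7, 58):.2f} Gbit/s")
# prints "Internet2 record: 0.92 Gbit/s"
```

The first calculation shows why individual salary is a poor proxy for administration cost: the per-machine figure depends just as much on how many systems each administrator can cover.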

7 Future Developments

It seems likely that the future of open source is assured, as projects continue to gather momentum, established applications mature and more and more specialist tools become available. There is a growing realisation within the software development world that openness is a desirable quality and that relying too much on the commercially driven solutions distributed by proprietary developers may threaten the longevity of one’s digital assets. An open source software infrastructure, dealing in open and standardised formats, ensures that transparency is maintained from the conception of information until its long-term storage. Of course, a range of factors, both financial and behavioural, mean that the digital realm is unlikely to be homogenised in the near future, and it is certain that digital curators will have to continue to find imaginative ways to successfully manipulate, migrate and re-use digital information that can’t be so straightforwardly comprehended. While in principle the adoption of open standards is of great value to all kinds of organisations, it is unrealistic to expect all important business decisions to be made based solely on archival considerations. The open source community must continue to develop solutions that offer immediate benefits in every sense, not just in terms of digital curation, but for all levels of digital creators and users, from IT administrators, managers, research scientists, and digital librarians to students, teachers and home computer enthusiasts. Continued innovation, constant refinement and an emphasis on the financial savings and legal freedom associated with open source will help to convince users of the intrinsic benefits. It is unlikely that any battle will be won simply by emphasising the superior archival characteristics of open source software and digital assets encoded with open formats. In many cases it will require even more than just functional superiority to convince users of the benefits of open source, due to the high impact marketing that accompanies and champions many commercial products. A consumer-oriented example is the case of portable digital audio. Despite many experts agreeing that the open Ogg Vorbis format offers a better encoding algorithm in terms of sound quality per byte, most users are still encoding millions of hours of their music collections in the proprietary and patented MP3 format, buoyed in their endeavours by television and web publicity, to the extent that now “digital audio” is almost synonymous in the popular consciousness with MP3. Only by continuing to offer innovative, usable and functionally rich solutions can open source and its inherent digital curation qualities expect to be fully exploited.

8 Conclusion

The open source and free software development and distribution philosophies are now well established, and offer several benefits throughout digital curation work flow. The conceptual key to understanding the value of these technologies and their associated standards is to understand the significance of transparency. By understanding the precise nature of our own digital assets and those we seek to integrate with and exploit elsewhere we are able to take more effective steps towards ensuring their continued and long-term accessibility. Removing restrictive legal barriers to this endeavour facilitates the digital curation process further still. Open source solutions arrive free from commercially motivated opacity and represent a consensus in favour of continued accessibility, comprehension and reusability. By their nature they facilitate integration and interfacing with existing infrastructures. Standardisation of formats is an unprofitable concept for those within a competitive market economy who are mainly interested in promoting their own unique product path. Unfortunately, it is also a great facilitator for straightforward long-term digital curation, its promotion and presence rendering the process significantly less irksome. Open source technology is founded on an unwillingness to reinvent or monopolise the wheel; a sentiment that through active collaboration our software and digital assets can be more effectively structured, injected with greater functionality and made more sustainable over the long-term. There is money to be made through open source software, but it is a consequential thing, and seldom do economic concerns drive or direct the development and release agenda. By embracing open source tools where they are available and functionally sufficient, our digital materials will be more easily comprehensible in the future. Many creators, custodians and re-users of digital information have a responsibility to ensure that the materials they create offer maximum usability and accessibility in the contemporary and are guarded against the problems of obsolescence in the future. Using open source software and open standards as a facilitator to this is a worthwhile starting point, and combined with other digital curation strategies can be effective.


Glossary of Terms

Creative Commons - A non-profit organisation devoted to expanding the range of creative work available for others to legally build upon and share.

Curatability - A measure of the ease with which a digital resource can be curated.

Distribution - A software release representing a packaging up of several individual programs; most commonly a version of GNU/Linux with associated applications.

Emulation - The process of recreating existing hardware or software environments with software.

Free Software - Software released under terms conforming to the four fundamental freedoms established by the Free Software Foundation, which demand transparency and the legal and practical freedoms to change, re-use and re-distribute code.

Freedom of Information - UK legislation compelling public bodies to release information on request.

Institutional Repository - A software infrastructure designed to store digital resources for the facilitation of their management and/or preservation.

Migration - The process of moving digital resources to alternative hardware or software environments to facilitate their use in the face of obsolescence and ensure their longevity.

Proprietary Software - Examples of software where the user cannot control functionality or study or edit the code.

Representation Information - The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol. In order to keep things manageable, Representation Information can be factored in distinct types, such as structure, semantics and others. The latter can include software and standards, among other things. This normalisation allows one, for example, to describe two sets of information which are identical, but which are held in different structures (formats), by combining the same Semantic description with different Structure descriptions.

Software License - An agreement distributed alongside computer software determining acceptable legal use for that software.

Source Code - Pre-compiled, human-readable program code.

Standard - An accepted practice, technology or specification.

Total Cost of Ownership - The financial costs associated with a particular activity or policy, incorporating all costs incurred, including acquisition, maintenance, staff and retraining costs.

Acronyms and Abbreviations

BBC - British Broadcasting Corporation.

BIOS - Biological Innovation for Open Source.

BSD - Berkeley Software Distribution.

CVS - Concurrent Versions System.

ebXML - Electronic Business Using XML.

FAQ - Frequently Asked Questions.

FLOSS - Free, Libre or Open Source Software.

FS - Free Software.

FSF - Free Software Foundation.

FUD - Fear, Uncertainty and Doubt.

GNU - GNU’s Not Unix.

GPL - General Public License.

HTML - Hypertext Markup Language.

IBM - International Business Machines.

LOCKSS - Lots of Copies Keeps Stuff Safe.

METS - Metadata Encoding and Transmission Standard.

OSI - Open Source Initiative.

OSS - Open Source Software.

PDF - Portable Document Format.

PHP - PHP: Hypertext Preprocessor.

Perl - Practical Extraction and Report Language.

SQL - Structured Query Language.

TCO - Total Cost of Ownership.

VLE - Virtual Learning Environment.

WINE - Wine Is Not an Emulator.

XENA - XML Electronic Normalising of Archives.

XML - eXtensible Markup Language.

About the Author

Since graduating with a Scots Law Degree with Honours from Glasgow University in 2000, Andrew McHugh has concentrated on developing a wide range of skills mainly in the digital realm. He collected his Masters in Information Technology, again from Glasgow University in 2001 and since then has been employed within HATII (the Humanities Advanced Technology and Information Institute) at this University in various capacities. Within the institution’s Department of Music he revolutionised the information infrastructure, applying and honing several skills and performing a diverse range of roles, with responsibilities ranging from database and server administration to web programming, application development and desktop cluster design and management. In late 2004 he joined the Digital Curation Centre in the position of Advisory Services Manager, leading a world-class team of digital curation practitioners in offering leading-edge expertise and insight in a range of issues to a primarily HE and FE audience. In his spare time Andrew maintains an aptitude and enthusiasm for software development and continues to develop web-based solutions for a range of customers including commercial and heritage clients. He is a keen user of open source technologies with several GNU/Linux distributions including Fedora Core, Gentoo and SUSE distributed across the hard disk partitions of his various systems, including a Microsoft X-Box games console.