State of the Art Report (StoAR)

Name:

Work Package:

Responsible:

Document history

Date          Version  Status         Author(s)  Changes / comments
<20.10.2004>  1        In progress    ARA
<13.01.2005>  1        First version  ARA

1. Definition of the subject

Persistent identification is regarded as an increasingly important aspect for strategies aimed to ensure effective management of digital information resources and their long-term access. The assignment of unique identifiers allows digital objects to be labelled or referenced in such a way that they can be reliably found over time in a dynamic distributed information environment.

To date, various schemes have been suggested as a global approach to persistent identification, including the URN (Uniform Resource Name), the DOI (Digital Object Identifier), the PURL (Persistent URL), the Handle System, the ARK (Archival Resource Key) and the XRI (eXtensible Resource Identifier). Each of these approaches emerged from ongoing research and has succeeded, to varying degrees, in building a community around it.

This document aims to provide an overview of current activity in the field of persistent identifiers and an understanding of the diverse persistent identification strategies being used and developed. It is to serve as a reference for the review and evaluation of these schemes and as a basis for recommendations of relevance to the eSciDoc project.

2. Keywords

Persistent identifier, Persistent identification, Persistent link, Digital identifier, Persistence, Uniform Resource Name, Uniform Resource Identifier, Uniform Resource Locator, National Bibliography Number, Persistent URL, Handle, Digital Object Identifier, Archival Resource Key, eXtensible Resource Identifier, URI, URL, URN, NBN, PURL, DOI, ARK, infoURI, XRI.

Table of contents:

Name
Work Package
Responsible
Document history
1. Definition of the subject
2. Keywords
3. Summary
4. Third party activities
5. State of the art
  5.1 Introduction
    5.1.1 Scope
    5.1.2 Definitions
    5.1.3 Background
    5.1.4 Aspects of persistent naming
  5.2 Functional and Organisational Requirements
  5.3 Persistent Identification Systems
    5.3.1 Uniform Resource Identifiers (URIs)
      5.3.1.1 Uniform Resource Locators (URLs)
      5.3.1.2 Uniform Resource Names (URNs)
    5.3.2 Persistent Uniform Resource Locators (PURLs)
    5.3.3 The Handle System
    5.3.4 Digital Object Identifiers (DOIs)
    5.3.5 Archival Resource Keys (ARKs)
    5.3.6 Other Identifier Schemes
      5.3.6.1 info Uniform Resource Identifier (info URI)
      5.3.6.2 eXtensible Resource Identifier (XRI)
    5.3.7 Comparative Summary
  5.4 Evaluation of Persistent Identification Schemes
    5.4.1 Using the URN scheme
    5.4.2 Using PURLs
    5.4.3 Using Handles
    5.4.4 Using DOIs
    5.4.5 Using ARKs
    5.4.6 Summary
  5.5 Ensuring Persistent Access to Resources
6. Relevance and conclusions for the project
7. Open questions
8. References

3. Summary

Persistent identifiers and an associated resolver service are an attempt to solve the common problem of broken links that occurs when resources on the web are moved to a new location or removed completely. Many persistent identification approaches have been proposed to tackle this problem by providing both a consistent naming scheme for online resources and a resolver service that redirects users to the current location of a resource based on its persistent identifier.

The main purpose of this report is to present the current state of technologies that address the issue of persistent identification. Three aspects are taken into consideration: the semantics of the identifier itself; the resolution of the identifier to a resource or to further information on how to access the resource (metadata, another file, an HTML file, etc.); and, finally, the importance of encouraging others to share responsibility for keeping this access persistent.

The schemes that are discussed and that constitute potential options are: the Uniform Resource Name (URN) of the Internet Engineering Task Force, the Persistent URL (PURL) of the Online Computer Library Center, the Handle System of the Corporation for National Research Initiatives, the Digital Object Identifier (DOI) of the International DOI Foundation and the Archival Resource Key (ARK) of the California Digital Library.

The report concludes that the problem of persistence has to be dealt with in two dimensions. First, persistence is supported by a technical infrastructure, implemented on one of the available systems that meet the requirements. Secondly, it needs to be governed by policies and guidelines that help promote awareness of the commitment to persistence and that encourage institutions and individuals to take responsibility for guaranteeing the persistence of the resources they own.

IMPORTANT:

The report makes a couple of recommendations, but these are to be taken with caution, as they are neither founded on a sufficiently thorough evaluation against a set of requirements nor based on convincing practical tests. They are, rather, based on impressions and limited experience gained while dealing with this topic for the first time.

4. Third party activities:

Organisation: The Internet Engineering Task Force (IETF), The URN Working Group
Identification scheme: Uniform Resource Names (URNs). URN is a namespace of namespaces for URIs. This includes urn:ISBN (namespace for books), urn:ISSN (namespace for journals) and urn:NBN (namespace for national bibliographic items).
Status of activity: Available / In progress. urn:NBN (RFC 3188), in particular, is being used by a number of national libraries across Europe.

Organisation: Online Computer Library Center (OCLC)
Identification scheme: Persistent Uniform Resource Locators (PURLs). PURLs are simply URLs which use no new protocols, but have a set of tools that provide assistance to maintain URLs with a commitment to persistence. PURLs use the inherent redirection facility of the http protocol and provide persistence not of the resource but of the name.
Status of activity: Available. A number of resolution servers have been put in place. By May 2004, over 500,000 PURLs had been registered and more than 86 million had been resolved with the OCLC server.

Organisation: Corporation for National Research Initiatives (CNRI)
Identification scheme: The Handle System. The Handle System is a comprehensive system for assigning, managing, and resolving persistent identifiers, known as "handles", for digital objects and other resources on the Internet.
Status of activity: Available. Used by a number of different organisations, including the IDF, DSpace, ADL SCORM, and various digital library production and research projects.

Organisation: The International DOI Foundation (IDF)
Identification scheme: Digital Object Identifier (DOI). The DOI system of unique identifiers is based on the Handle System and allows the allocation of a unique digital identifier to commercial digital publications.
Status of activity: ANSI/NISO Z39.84-2000 (Syntax). Phenomenal take-up by publishers, with CrossRef providing the drive for this success. Citation linking is the most important factor in the success of DOI/CrossRef.

Organisation: California Digital Library (CDL)
Identification scheme: Archival Resource Key (ARK). An ARK's actionability ties to 3 targets: the digital object, the metadata describing the object, and a persistence commitment statement. The commitment statement is to reveal the degree of permanence of an object that is guaranteed in terms of availability and stability.
Status of activity: Work in progress. A final review is currently undergoing a test phase. As of June 2004, ARK had registered 10 name assigning authorities.

Organisation: Organization for the Advancement of Structured Information Standards (OASIS)
Identification scheme: eXtensible Resource Identifier (XRI). The XRI defines a URI-compatible identifier scheme and resolution protocol for which the core specification is based in part on the Internationalised Resource Identifiers (IRIs).
Status of activity: Work in progress. The committee approved a draft version of XRI’s Generic Syntax and Resolution.

Organisation: Online Computer Library Center (OCLC) jointly with Los Alamos National Laboratory (LANL)
Identification scheme: Info Uniform Resource Identifier (info URI). The info URI scheme is for information assets with identifiers in public namespaces. Namespaces participating in the "info" URI scheme are regulated by an "info" Registry mechanism.
Status of activity: Work in progress. Adoption and use will ultimately determine info URI’s future.

5. State of the art

5.1 Introduction

5.1.1 Scope

This report is concerned with persistent digital identifiers and their current state of development and deployment. The paper looks at the following issues involved in assuring the long-term availability of references and citations made to digital objects in the context of open access on the Internet:

§ The role of persistent identifiers in ensuring consistent availability and accessibility of resources in an open networked environment.
§ Existing schemes for persistent identifiers and the long-term considerations involved in establishing naming systems and the local and global considerations which must be addressed.
§ Immediate options available and recommended courses of action.

5.1.2 Definitions

Digital Identifier. Digital identifier is a generic term for a label or name composed of a sequence of characters that can be transmitted electronically. Digital identifiers can be associated with electronic, non-electronic, or abstract entities, such as books, images, reports, metadata records or events. Within this report the terms ‘name’ or ‘identifier’ mean ‘digital identifier’.

Persistence. A persistent identifier will permanently name the same resource, and will never be reused to name a different resource. This persistence refers primarily to the permanence of the identifier rather than to the resource itself. Persistence is provided through governance rather than through purely technical constraints, being the responsibility of the organisation that creates the identifier.

Namespace. A namespace is a domain/scope in which an identifier is created and is valid.

Global and Local Identifiers. A global identifier is valid within the overall digital network in which it is used. A global identifier is interoperable and may be used across different systems. A local identifier is valid only within a local application or domain. A local identifier may be extended into a potentially global identifier by the inclusion of the label of its namespace.

Resolution and Actionable Identifiers. Resolution determines the location of a resource from its identifier. A resolver is a software application that enables a “click” on an identifier link to yield the internet location of, and thus display, the resource, and possibly supplementary services related to the resource. Such an identifier is said to be ‘actionable’.

Metadata. Metadata is data that describes a resource. It may be used both to discover a resource and to provide a detailed record about the resource. Metadata may include details such as the title, the author, the subject matter, the resource type and the location of a resource. It may also capture relationships between resources, such as: the identifiers of related resources; the source from which a particular resource was derived; whether a resource is part of a collection; and the version of a resource.

Interoperability. If metadata follows an agreed standard then it is possible for systems to exchange metadata easily, hence making the metadata interoperable.

5.1.3 Background

One of the keys to the development of a distributed digital library or archive is the ability to rely on the links between digital resources remaining stable over time, whether these are links embedded as citations in other resources or links embedded in metadata in bibliographies, indexes, and catalogues or in bookmarks and gateways. Much of the value of digital resources for scholarly communication lies in enabling resources to be cited reliably with actionable links over long periods of time. Libraries and archives collect for the long term to support scholarly research and need to support access to materials beyond the life of current technology and possibly beyond the organisational structures which exist today.

Therefore libraries, archives, academic institutions and publishers have an interest in the persistence of resource identification. Persistent naming is a characteristic of all open architecture digital library experiments. Persistent identifiers and their associated metadata are needed to support a variety of transactions including the search, retrieval, selection and use of digital objects.

In this context, we may identify two core elements constituting the task of maintaining the availability of resources in an open access repository:

• To preserve, or archive, the resource in order to ensure that it survives;
• To maintain its accessibility to ensure that it can be located and displayed by those who wish to use it and that citations and links made to it continue to provide access to it.

The issue of persistent identification is primarily concerned with the second, but it is also closely related to the first. William Arms sums up the importance of persistent identifiers in his article “Key concepts in the architecture of the digital library” [1]:

“Names are a vital building block for the digital library. Names are needed to identify digital objects, to register intellectual property in digital objects, and to record changes of ownership. They are required for citations, for information retrieval, and are used for links between objects”.

Why not URLs as persistent identifiers? Setting the Scene

Most resources currently on the World Wide Web are described using a Uniform Resource Locator (URL) and the URL, as generally used, describes a resource in terms of its current location. In other words, it describes the domain name of the server hosting the resource, the directory path and the document name.

This location-based nature of the URL is its major weakness for document naming, since the identifier is subject to change whenever the resource is moved either to a different server or to a different place in the file structure of the same server. When this happens, all references to this resource embedded in other documents or databases will be broken. The volatility of material on the web generated by movement of files and re-organisation of sites, not to mention deletion of materials, has highlighted the difficulty in guaranteeing continuous access to resources in a system which uses a naming structure based on file location to provide the key to its access.

5.1.4 Aspects of persistent naming

There are three important practical aspects to persistent identification of online resources:

§ Choice of an identification system. (What to name the resources?)
§ A system of resolution to map the identifier to the resource identified. (How will users get to the resources, or resource descriptions, given their identifiers?)
§ Maintenance of access to the resource through continued association of the current location of the resource with the identifier. (How to make sure that the links continue to work over time?)

5.2 Functional and Organisational Requirements

Persistent digital identifiers have to conform to a set of commonly agreed minimal requirements. These can be summarised as follows:

§ The primary purpose of a digital identifier is to name and guarantee an indirect look-up to a known object, so that it can be referred to unambiguously.
§ A ‘persistent’ digital identifier, especially when used for archival data, must of necessity be capable of outlasting any systems and protocols that are currently in use.
§ A ‘persistent’ digital identifier must be capable of being registered globally, be unique and offer durability of service over time.
§ Registered and global persistent digital identifiers must assist the accurate referencing of resources for re-use across different systems, thus enabling interoperability.
§ Locally unique resources will need to be registered for global reference if being used beyond their local namespace or scope.

In addition to these requirements, there are a number of considerations that should be made when choosing and using a particular identification scheme:

§ Whether there is a need for an association between an identifier and the location of the resource. An identifier that can be resolved in some way may be actionable. Provision of a resolvable identifier probably implies the existence of an intermediate registry.
§ Whether there will be multiple locations associated with a resource. This requirement places further demands on a resolution system.
§ Whether there is a requirement to associate metadata describing a resource with its identifier. Provision of metadata associated with an identifier implies the existence of a registry.
§ Backwards compatibility to support existing legacy naming conventions.

Finally, processes for the management of digital identifiers need to be well defined. Some related issues are:

§ The organisation that defines an identifier is responsible for its uniqueness and persistence and for maintaining up-to-date details of its location. Active management processes should be in place to encourage this responsibility.
§ Lifecycle management of persistent digital identifiers (naming, assignment, publication, resolution and maintenance) to guarantee their global registration and long-term validity.
§ The investment and cost overhead in creating and maintaining central registries and systems to support publication and discovery.

5.3 Persistent Identification Systems

A persistent identifier is a name, or identifier, given to a resource so that it uniquely identifies it and will be forever associated with it. It will never be reassigned to any other resource and will not change regardless of where the resource is located or whatever protocol is used to access it.

There are many formal identifier or naming schemes which have been discussed in the context of the naming of digital resources (e.g. URIs, URNs, Handles, DOIs, ARKs, ISBNs, ISSNs, SICIs, BICIs, PII), although very few of these will be fully effective as persistent identifiers facilitating access to online resources in a distributed system unless they are either registered as URI naming schemes and supported by a resolution system or they are incorporated into another naming scheme which has some form of resolution system associated with it.

The importance of resolution cannot be overestimated. Without a resolution system, requests using the identifier cannot be routed to the appropriate server and used to retrieve the resource, or a reasonable substitute for it in the form of metadata.

5.3.1 Uniform Resource Identifiers (URIs)

URI is the generic term for all the ways currently used to identify resources on the Internet. URIs can be further classified into two broad categories: Locators (URLs) and Names (URNs).

“A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable.” (RFC 2396 [4])

5.3.1.1 Uniform Resource Locators (URLs)

A URL (Uniform Resource Locator) is a global identification mechanism for a resource on the World Wide Web. URL addresses use a set of protocols or namespaces, which specify how the resource is to be accessed. These include:

- http, e.g. http://www.bbc.co.uk/
- ftp (file transfer protocol), e.g. ftp://ftp.freeware.com
- mailto (the email addressing scheme), which could be used to identify a person or organisation, e.g. mailto:[email protected]
- news, file, etc.
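As a minimal sketch of how these schemes carry location information, the Python snippet below splits a few URLs into their scheme, host and path using the standard library. The mailto address is a hypothetical placeholder and the helper name describe is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return the scheme, host and path components of a URL."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


# The mailto address below is a made-up placeholder, not taken from the report.
for url in ("http://www.bbc.co.uk/", "ftp://ftp.freeware.com", "mailto:someone@example.org"):
    print(describe(url))
```

The host and path are exactly the parts that change when a resource is moved, which is why such an identifier cannot by itself be persistent.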

For a resource that is available on the web, a global identifier may be created simply within the URI http or ftp namespace. However, as stated earlier, this type of identifier provides no guarantee of persistence. If the location (the URL) of the resource is the same as its URI, which is a likely scenario when the resource is first created and identified, then resolution is achieved simply via HTTP. However, if the resource is later moved so that its URL is changed, this straightforward resolution will fail, generally with no warning or alternative offered to a potential user.

5.3.1.2 Uniform Resource Names (URNs)

The need for persistence in the naming of networked resources was recognised early in the life of the Internet and has been the subject of discussion for several years within the Internet Engineering Task Force (IETF). Work on the development of a unique identifier that would be independent of location and would remain permanent even though resources might move or disappear began at the same time as work was proceeding on the formalisation of URL schemes [2]. The identifier was named the URN (Uniform Resource Name) [3]. At the end of 1995, the implementers reached a broad consensus on the syntax of a URN and agreed to work towards a technical solution that would accommodate the needs of most users. Nevertheless, it was not until May 1997 that the URN syntax was formally agreed upon and published in RFC 2141, “URN Syntax” [5].

The syntax of the URN as expressed in RFC 2141 is as follows:

urn:<NID>:<NSS>

where <NID> is the Namespace Identifier and <NSS> is the Namespace Specific String.

The NID, which must be registered and approved by IETF to avoid duplication, ensures the global uniqueness of the identifier. The namespace specific string can take any form specified by the naming authority provided that it is unique within that namespace and avoids the use of a small number of restricted characters as specified in RFC 2141, [5].

Because the local, or namespace specific, string can be in any form, this structure allows maximum flexibility in the identifier while providing a mechanism to assure global uniqueness and facilitating interoperability between discrete systems.

Besides, the simple structure of the identifier reflects recognition of the need to accommodate different requirements and different schemes. Therefore, URN could easily incorporate existing legacy naming schemes such as ISBNs, ISSNs and NBNs [6]:

§ urn:ISBN: the namespace of book identifiers, e.g. urn:ISBN:0262531283
§ urn:ISSN: the namespace of serial publication identifiers, e.g. urn:ISSN:0302-9743
§ urn:NBN: the namespace of national bibliographic numbers, e.g. urn:NBN:fi-fe19981001
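The structure of these identifiers can be illustrated with a short Python sketch that splits a URN into its namespace identifier and namespace specific string. It is a simplified reading of RFC 2141, not a complete validator: the character restrictions on the namespace specific string are not checked, and the function name parse_urn is an assumption made for illustration.

```python
import re

# Simplified reading of the RFC 2141 form urn:<NID>:<NSS>.
# NID: 1 to 32 letters, digits or hyphens, not starting with a hyphen.
# NSS: any non-empty string here; the RFC's character rules are NOT checked.
URN_PATTERN = re.compile(r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace specific string)."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a well-formed URN: {urn!r}")
    nid, nss = match.groups()
    # The "urn" prefix and the NID are case-insensitive, so normalise the NID.
    return nid.lower(), nss


for example in ("urn:ISBN:0262531283", "urn:ISSN:0302-9743", "urn:NBN:fi-fe19981001"):
    print(parse_urn(example))
```

Because only the namespace identifier is interpreted globally, each naming authority remains free to define the form of the string that follows it.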

These identifier schemes (except NBN) pre-dated the web and digital identifiers. Their namespaces were registered as URNs, following the IETF approval, recognising that they are useful as digital identifiers. There are underlying assumptions in the URN-namespace registration set-up that the assignment and namespace in a URN is a managed process, although there is allowance for experimental, formal and informal namespaces.

Resolution of URNs

Despite the differences expressed over the form and utility of URNs during the course of their development, there was general agreement to distinguish strictly between naming schemes and resolution systems. That is to say that a naming scheme, as a procedure for creating unique URNs that conform to a specific syntax, is independent of the resolution service which resolves them to locate the resource. The IETF concluded that the URN as a naming scheme should not be tied to any specific resolution system.

A variety of solutions have been proposed, including a new DNS resource record, NAPTR (Naming Authority PoinTeR), as described in [7] (RFC 2915), that provides rules for mapping parts of URIs to domain names. This was made obsolete by a revision and renaming of the proposal issued as an Internet draft in February 2001. The new proposal is named the Dynamic Delegation Discovery System (DDDS, RFC 3401) [8]. The DDDS approach is to use the DNS to locate "resolvers" that can provide information on individual resources, potentially including the resource itself. The use of http proxy servers is also a possibility. At present, the organisational and technical levels are still being continuously reviewed. In October 2002, several components necessary for URN resolution using the DDDS technology were standardised. However, practical applications have not yet been implemented within the URN context.

In short, there is little prospect of an early conventional solution to the resolution issue. All URN systems implemented in the near to medium future will require proxy servers to enable them to be used by standard web browsers and to route requests to the appropriate host server. This means that the identifiers will have to be encapsulated in a URL using the proxy server address.
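A minimal sketch of this encapsulation in Python is given below. The proxy address resolver.example.org is hypothetical, as is the convention of appending the URN directly to the proxy's base address; real proxy resolvers may define their own URL layout.

```python
from urllib.parse import quote

# Hypothetical proxy resolver address; not taken from the report.
PROXY_BASE = "https://resolver.example.org/"


def urn_to_proxy_url(urn, proxy_base=PROXY_BASE):
    """Wrap a URN in an HTTP URL that points at a proxy resolver."""
    # Keep ':' and '/' readable; percent-encode any other unsafe character.
    return proxy_base + quote(urn, safe=":/")


print(urn_to_proxy_url("urn:NBN:fi-fe19981001"))
```

A standard web browser can follow the resulting URL, while the URN inside it stays independent of the location of the resource; only the proxy address ties the link to a particular server.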

National Bibliography Numbers (NBNs)

The National Bibliography Number (NBN) is a URN Namespace Identifier (NID) which was developed and registered by the National Library of Finland [15]. It was subsequently considered and accepted in principle by the Persistent Identifier subcommittee of the CDNL (Committee of Directors of National Libraries) as a potential vehicle for a common URN-based identifier for national libraries.

The NBN as originally conceived by the National Library of Finland was simply an electronic National Bibliography Number, and was essentially a running number preceded by a country code assigned to a single digital resource. The specification was subsequently further developed to accommodate a variety of needs in a later registration proposal published in October 2001 as RFC 3188, “Using National Bibliography Numbers as Uniform Resource Names” [6].

Several European national libraries use NBN-based URNs for identification of electronic resources in their digital archives. One such example is the DiVA system (The Digital Scientific Archive) developed at Uppsala University Library in Sweden [16]. DiVA, which has been in full operation since January 2003, is a publishing system used by various institutes belonging to 10 universities in 3 countries (mainly in Sweden, Denmark and Norway). URN:NBNs were chosen as persistent identifiers; they are assigned by the National Library of Sweden or by Uppsala University, which acts as a delegated naming authority. The same URN:NBN is intended to be used for different manifestations of the same content. DiVA's aim for the future is that the co-ordination of electronic academic publishing at Swedish universities, including long-term access and preservation, will be central and that the resolution service will be further developed on an international level. Another example is the E-depot system, DIAS (Digital Information Archiving System), completed in December 2002 at the Dutch national library [17].

Efforts from “Die Deutsche Bibliothek”, in their EPICUR (Enhancement of Persistent Identifier Services - Comprehensive Method for Unequivocal Resource Identification) project, are underway in order to establish a national URN co-ordination agency [13]. “Die Deutsche Bibliothek” aims to have a corporate strategy for URNs to guarantee high technical availability of a digital object. The strategy includes assignment, registration, administration and resolution of URNs. “Die Deutsche Bibliothek” recognised that creating appropriate organisational framework conditions in the URN scope is as essential as the technical realisation of a Persistent Identification Service. These efforts include the provision of active support for embedding URNs in bibliographic presentation formats as a permanent component, the expansion of the guidelines that define the structure of URN strings, and co-operation with representatives of library associations towards establishing standards for the technical realisation of a global resolving mechanism based on a uniform structure of the URN string.

Within “Die Deutsche Bibliothek”, URNs are assigned only to items which have a prospect of long-term preservation, for example the objects archived by the library itself or those archived by certified repositories. “Die Deutsche Bibliothek” functions as the naming authority (delegations are also possible). A resolution service, based on the use of proxies, has been put in place. URNs can be mapped to multiple URLs, allowing different copies of the same object to be retrieved. The naming authority and the institution applying URNs have a contract stating that the URN-URL relations must be registered and, more importantly, that the URLs must be kept up to date. An administration interface is available to manage the URN-URL relationships. The system is currently used by over 60 university libraries.
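The URN-URL relationship described here can be sketched as a small in-memory registry in Python. This is an illustration only and not the library's actual system: the class UrnRegistry, its method names, the URN and the example.org URLs are all assumptions.

```python
class UrnRegistry:
    """In-memory stand-in for a URN-to-URL registry (illustration only)."""

    def __init__(self):
        self._urls = {}  # URN -> list of current URLs of its copies

    def register(self, urn, url):
        """Associate one more copy (URL) with a URN."""
        urls = self._urls.setdefault(urn, [])
        if url not in urls:
            urls.append(url)

    def update(self, urn, old_url, new_url):
        """Record that one copy has moved; the URN itself never changes."""
        urls = self._urls[urn]
        urls[urls.index(old_url)] = new_url

    def resolve(self, urn):
        """Return all URLs currently registered for the URN."""
        return list(self._urls.get(urn, []))


registry = UrnRegistry()
urn = "urn:nbn:de:example-0001"  # hypothetical URN
registry.register(urn, "http://archive.example.org/obj/1")
registry.register(urn, "http://mirror.example.org/obj/1")
registry.update(urn, "http://mirror.example.org/obj/1", "http://mirror.example.org/new/1")
print(registry.resolve(urn))
```

The update step corresponds to the contractual duty to keep URLs current: citations continue to use the unchanged URN while the registry entry follows the resource.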

5.3.2 Persistent Uniform Resource Locators (PURLs)

PURLs (Persistent URLs) were developed by the Online Computer Library Center (OCLC) in the mid-1990s, primarily to reduce the maintenance burden of the URLs contained in catalogue records created for internet resources [18]. OCLC was an active participant in the IETF working groups on URNs and was fully aware of how far the groups were from consensus on a standard solution for persistent identification. It therefore developed PURLs as an interim solution to address the lack of progress in persistent naming for Internet resources.

A PURL looks like a URL, simply because it is a URL. Hence the syntax:

http://<resolver address>/<name>

A PURL uses the http protocol and has two additional parts: (1) a resolver address, and (2) a name. Note that the resolver address is the IP address or domain name of the PURL Resolver. This portion of the PURL is resolved by the Domain Name Server (DNS). The name is user-assigned and resolved by the associated PURL Resolver.

Example: http://purl.oclc.org/OCLC/OLUC/32127398/1

However, instead of addressing directly the location of an Internet resource, a PURL addresses an intermediate resolution service or PURL server. This PURL Resolution Service associates the PURL with the actual URL and returns that URL to the client software. The client then completes the URL transaction to obtain the resource. The transaction is a standard Hypertext Transfer Protocol (HTTP) redirect, thus no acceptance of new protocols or modifications to client software are required. The process is depicted in figure 1 below.

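[Diagram: Client, PURL Server, Resource Server. The client sends http://purl.oclc.org/OCLC/OLUC/32127398/1 to the PURL server, which answers with an HTTP redirect; the client then issues an HTTP GET to the resource server, which returns the resource.]

Figure 1: PURL’s Resolution Mechanism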
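The resolution step can be sketched in a few lines of Python. This is not OCLC's implementation: the lookup table, the target URL on example.org, the function name resolve_purl and the choice of status code 302 for the redirect are assumptions made for illustration.

```python
# Hypothetical PURL table: PURL name -> current URL of the resource.
PURL_TABLE = {
    "/OCLC/OLUC/32127398/1": "http://www.example.org/docs/report.html",
}


def resolve_purl(name, table=PURL_TABLE):
    """Return (status, headers) the way a PURL server might answer a GET."""
    target = table.get(name)
    if target is None:
        return 404, {}
    # A standard HTTP redirect; the client follows Location with a second GET.
    return 302, {"Location": target}


print(resolve_purl("/OCLC/OLUC/32127398/1"))
print(resolve_purl("/no/such/purl"))
```

When the resource moves, only the table entry has to be updated; every document that cites the PURL keeps working unchanged.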

PURLs are created and maintained using a standard Web browser to access a forms based service hosted by the relevant PURL server.

PURL servers also support the concept of partial redirection. They resolve as much of a PURL as they can find in the database and append the remainder (the unresolved portion) to the end of the resolved URL. This is ideal if the hierarchical structure at a given site is relatively stable, in that only the site’s root needs to be registered and maintained in the PURL server.
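Partial redirection amounts to a longest-prefix match, which can be sketched as follows in Python. The table contents, the example.org target and the function name resolve_partial are assumptions made for illustration, not the behaviour of any particular PURL server release.

```python
def resolve_partial(name, table):
    """Resolve the longest registered prefix of a PURL name and append the rest."""
    segments = name.strip("/").split("/")
    # Try the full name first, then progressively shorter prefixes.
    for cut in range(len(segments), 0, -1):
        prefix = "/" + "/".join(segments[:cut])
        if prefix in table:
            remainder = "/".join(segments[cut:])
            base = table[prefix].rstrip("/")
            return base + "/" + remainder if remainder else table[prefix]
    return None  # no registered prefix matches


# Only the site's root is registered (hypothetical entry).
table = {"/NET/site": "http://www.example.org/archive/"}
print(resolve_partial("/NET/site/2004/report.pdf", table))
```

With this single entry, every document below the registered root is reachable through the PURL server, so a relocation of the whole site requires one update only.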

The servers come with a set of tools to create users and groups and to associate roles with them. These roles, held in a form of access control list, specify the privileges that users or groups have for each individual PURL. A role can be that of a reader (only able to resolve PURLs), a writer (able to create PURLs in a given domain) or a maintainer (able to edit existing PURLs and what they resolve to). PURL maintenance can be performed by connecting to a PURL Resolver using a Web browser and then using the PURL Resolver's maintenance forms to make the appropriate changes to the desired PURL. Only authorised PURL maintainers can modify a PURL.

OCLC has made the PURL source code available to enable institutions to make use of the technology [18]. Since the introduction of the PURL model and services, a large number of institutions have installed their own PURL servers. Within the U.S. Government, for example, the Government Printing Office and the Department of Energy’s Office of Scientific and Technical Information use PURLs and their own installations of the PURL Resolver to manage their connections to the full text of documents. The National Library of Australia (NLA) also previously ran its own PURL server before taking it down due to very low take-up. Recent statistics (May 2004) show that over 500,000 PURLs have been registered and more than 86 million have been resolved via the OCLC PURL server.

5.3.3 The Handle System

The Handle System is a distributed persistent naming system developed for digital library applications [21]. It was developed by the Corporation for National Research Initiatives (CNRI) and had its origin in a computer science technical reports project, the Networked Computer Science Technical Reports Library (NCSTRL), funded by the Defense Advanced Research Projects Agency (DARPA) in the US. Part of this project was to develop an architecture for the underlying infrastructure of an open distributed digital library [20].

The high level architecture developed by the project is described in a seminal paper “A framework for distributed digital object services” by Robert Kahn and Robert Wilensky published in May, 1995 and subsequently by William Arms in an article “Key Concepts in the Architecture of the Digital Library” published in D-Lib Magazine, July 1995 [1].

The Handle System is more than a simple naming scheme: it is supported by a resolution system consisting of a distributed system of global, local, and caching servers.

Figure 2: Handle System Overview

A Global Handle Registry run by CNRI registers the top level naming authorities, both to ensure the uniqueness of the names and to route requests for handle resolution. It is unique among handle services only in that it provides the service used to manage naming authorities, all of which are managed as handles. The naming authority handle is a special handle that provides information to be used by clients to access and utilise the local handle service for handles under the naming authority in question. Local Handle Services are run by organisations and/or individuals. They resolve the requests routed to them and return the current address(es) or other information about the resource sought. They therefore hold handles that provide information about resources as registered by a respective naming authority. A Local Handle Service can itself be composed of a number of servers. Finally, caching servers associated with local servers allow frequently accessed handles to be resolved without need to request the address from the Global Registry.

The “handle” itself is a persistent identifier consisting of two parts [9]. The syntax is as follows:

<naming authority>/<name of the resource>

where the naming authority is an administrative unit authorised to create and maintain handles and the name of the resource is a string which must be unique to that authority but which has no prescribed syntax.

Example: loc.pnp/123456t

The naming authority may contain sub-authorities. The sub-authority identifier is separated from the name of the higher authority by a full stop. In the example above “loc” is the identifier for the Library of Congress and “pnp” is a sub-authority that identifies the Prints and Photographs Division.
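The two-part structure of a handle can be illustrated with a short parsing sketch. It assumes only what is stated above (a "/" separates the naming authority from the name, and full stops separate sub-authorities); the class and function names are hypothetical and are not part of the CNRI software.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Handle:
    naming_authority: str        # e.g. "loc.pnp"
    authority_parts: List[str]   # e.g. ["loc", "pnp"]
    local_name: str              # e.g. "123456t"


def parse_handle(handle: str) -> Handle:
    """Split a handle at the first '/' into authority and local name."""
    authority, sep, name = handle.partition("/")
    if not sep or not authority or not name:
        raise ValueError(f"not a valid handle: {handle!r}")
    # Sub-authorities are separated by full stops, highest level first.
    return Handle(authority, authority.split("."), name)
```

Because the name of the resource has no prescribed syntax, the sketch does not inspect it beyond requiring that it is not empty.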

Resolution of Handles

Figure 3 below shows an example of a typical handle resolution process. A client such as a web browser encounters a handle, typically as a hyperlink or other kind of reference. The client first extracts the handle’s naming authority and queries the Global Handle Registry, which returns the service information, including the location, of the Local Handle Service responsible for resolving handles issued by the naming authority in question. The service information allows the client to communicate with the Local Handle Service to resolve the handle. Each handle can be associated with one or more pieces of typed data, including the resource’s URL. Note that it is also possible to associate multiple instances of the same data type, e.g. multiple URLs, with a single handle. This capability is referred to as multiple resolution.

[Figure: the client (1) queries the Global Handle Registry with the naming authority, (2) receives the location of the Local Handle Service, (3) queries the Local Handle Service with the local name, (4) receives the handle data, and (5) issues an HTTP GET to the resource server.]

Figure 3: Handles’ Resolution Mechanism
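The resolution steps of Figure 3 can be sketched as two successive lookups. The following is a simplified in-memory model, not the Handle System protocol: the dictionaries stand in for the Global Handle Registry and a Local Handle Service, and all identifiers, host names and values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for the two kinds of service.
# Global Handle Registry: naming authority -> local handle service id.
GLOBAL_REGISTRY: Dict[str, str] = {"loc.pnp": "lhs-loc"}

# Local Handle Services: service id -> {local name: [(type, value), ...]}.
LOCAL_SERVICES: Dict[str, Dict[str, List[Tuple[str, str]]]] = {
    "lhs-loc": {
        "123456t": [
            ("URL", "http://server-a.example.org/pnp/123456t"),
            ("URL", "http://server-b.example.org/pnp/123456t"),
            ("EMAIL", "curator@example.org"),
        ]
    }
}


def resolve_handle(handle: str, wanted_type: str = "URL") -> List[str]:
    """Return all values of the wanted type associated with a handle."""
    authority, _, local_name = handle.partition("/")
    # Steps 1-2: ask the global registry which local service is responsible.
    service_id = GLOBAL_REGISTRY[authority]
    # Steps 3-4: ask that local service for the handle's typed data.
    records = LOCAL_SERVICES[service_id][local_name]
    # Multiple resolution: several values of one type may be returned.
    return [value for rtype, value in records if rtype == wanted_type]
```

In this model the client would then carry out step 5 itself, choosing one of the returned URLs and issuing an ordinary HTTP GET.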

The Handle software is made publicly available and can be downloaded from the CNRI Handle site. CNRI makes available local service software, client software and simple management tools, a caching handle server, tools for the creation and administration of handles and naming authorities, and a proxy server to enable Web clients to resolve handles. To enable web browsers to resolve handles without the use of a proxy, CNRI has developed a Handle Resolver plug-in, available for download.

Security and Privacy

The Handle System allows secured name resolution and administration over the public Internet. The Handle System protocol defines standard mechanisms for both client and server authentication, as well as service authorisation. It also provides security options to assure data integrity and confidentiality.

Each handle may define its own administrator(s) or administrator group(s). Ownership of each handle is defined in terms of its administrator or administrator group(s) and administration can be supported in a distributed environment. By default, most handle data stored in the Handle System is publicly accessible, unless otherwise specified by the handle administrator. Handle administrators must pay attention when adding handle values that contain private information. They may choose to mark these handle values readable only by the handle administrator(s), or to store these as encrypted handle values, so that these values can only be read within a controlled audience.

The Handle System is actively being advocated as a realisable persistent identifier infrastructure by a number of organisations. These include the Library of Congress, the International DOI Foundation for the resolution of DOIs (more in the next section), the US Defense Technical Information Center (DTIC) together with the US Department of Defense (DoD), the MIT Libraries jointly with the Hewlett-Packard Company, the Content ID Forum (cIDF) and several others.

The Library of Congress has implemented the Handle System in its National Digital Library Program. It is a broad based program to convert collections of historic materials to digital formats and to make them available across the Internet. It has naming authorities under CNRI’s Global Handle Resolver for its major units. Over 400,000 Handles have been assigned since 1995 to digital objects maintained by the Library. Handles are used as identifiers in the Library’s American Memory Collection (a large digital library collection), as identifiers for electronic finding aids for the Library’s archival collections, and as persistent links to digital content described by the Library’s distributed cataloging records and electronic finding aids. There is a current project to investigate the assignment of persistent identifiers to the Library’s XML schemas.

DTIC uses the Handle System to manage digital objects from all the US DoD agencies, and operates across the USA as a Handle Naming Authority. Handles are assigned to DTIC’s full text, publicly releasable technical reports available on the internet. In addition, a separate DTIC Central Handle Service Directory, stored in an Oracle database, contains searchable key metadata for each Handle resource. A search of key metadata (i.e., Title, Corporate Author, Personal Author, Report No., DTIC AD No., Publication Year) returns a results list from which a Handle can be selected. The central directory serves two purposes: 1) to provide handle resolution when a Handle prefix and suffix are known; 2) to ‘discover’ a Handle when some information is known about a resource, but not its Handle.

The recently announced ADL CORDRA development, at DoD, is also based on the use of handles as the native identification system. CORDRA will be based on existing learning content, repository and digital library standards and specifications. The focus is on selecting and adopting existing standards and specifications, and customising these if needed with community-specific profiles and extensions, not creating new standards.

MIT Libraries and the Hewlett-Packard Company have jointly developed DSpace. DSpace is an open source software platform that enables institutions to provide open archives of their publications, including pre-prints and reports. It is increasingly being used by several institutions world-wide for this purpose. DSpace uses the Handle System to assign and resolve persistent identifiers for all the digital material that enters the repository’s archive. In addition, the institution hosting the DSpace system must stand behind the commitment statement that the assigned identifier is intended to remain consistently valid and resolvable over time. The original host institution may at some point assign that commitment statement to another institution (conveying responsibility for maintenance/administration of the submitted material to the other institution). The DSpace open source software package includes the handle software.

5.3.4 Digital Object Identifiers (DOIs)

The DOI was developed by the Association of American Publishers (AAP) as a response to the growing concern over rights protection in open electronic networks. It was officially launched at the Frankfurt Book Fair in 1997. A year later, the International DOI Foundation was created solely to manage the DOI system. In the fall of 2000, the DOI syntax was registered as an American National Standard, ANSI/NISO Z39.84-2000.

The intention was to develop an industry wide, standard identifier that would be assigned to a work or manifestation of a work at the time of creation. Its development was seen as a first step in the development of rights management infrastructure in the digital environment. The identifier is intended to support systems to control transactions, provide a key for rights management systems and facilitate communication between publishers and their clients.

To achieve this, the DOI has been developed as a complete system consisting of four components:

§ The syntax and assignment rules of the DOI identifier itself
§ A resolution system (based on the Handle system)
§ A metadata structure to identify the item (based on the INDECS metadata set)
§ A set of policies to ensure that the system behaves in ways that are predictable and consistent.

The syntax of the DOI is a subset of handle syntax. The DOI is composed of a Prefix and a Suffix. Within the prefix are the Directory Code and the Registrant Code. The suffix is named the DOI Suffix String:

<Directory Code>.<Registrant Code>/<DOI Suffix String>

The International DOI Foundation (IDF) assigns the Directory Code and the Registrant Code. The Directory Code is currently always “10” for a DOI and the Registrant Code is the code assigned by the IDF to a particular publisher, rights holder or registrant. The DOI Suffix String can be any string that meets the requirements of the registrant to whom the Registrant Code has been assigned.

Example: 10.1001/PUBS.JAMA(278)3,JOC7055-ABST

The DOI syntax is an American National Standard ANSI/NISO Z39.84-2000. The flexibility of the syntax makes it possible to accommodate any other identifier system.
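The DOI structure described in this section can be illustrated with a small parsing sketch. It relies only on the rules stated here (a prefix made of a directory code and a registrant code, a "/" separator, and a free-form suffix); the names `Doi` and `parse_doi` are hypothetical, and the check against "10" reflects the current convention rather than a permanent rule.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    directory_code: str   # currently always "10"
    registrant_code: str  # assigned by the IDF
    suffix: str           # chosen freely by the registrant


def parse_doi(doi: str) -> Doi:
    """Split a DOI into directory code, registrant code and suffix."""
    # Only the first '/' separates prefix and suffix; the suffix itself
    # may contain further '/' characters.
    prefix, sep, suffix = doi.partition("/")
    directory, dot, registrant = prefix.partition(".")
    if not (sep and dot and suffix and registrant):
        raise ValueError(f"not a valid DOI: {doi!r}")
    if directory != "10":
        raise ValueError(f"unexpected directory code: {directory!r}")
    return Doi(directory, registrant, suffix)
```

Applied to the example above, the sketch separates the prefix "10.1001" from the suffix without interpreting the suffix in any way, which is what allows other identifier systems to be embedded in it.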

Resolution of the DOI

The DOI is supported by a resolution system. It is an application of the Handle system developed by CNRI (see previous section). A record of the DOI together with a brief metadata record and the current location of the object is registered in the DOI system server. This server acts as a resolver to the objects themselves, which continue to reside on the publisher’s site or on a site licensed by the publisher. Changes of location are registered with the central server. DOIs are resolvable through a native handle server for those with a handle-enabled browser, or via a proxy server that forwards requests to the handle server for those with standard browsers.

DOI Metadata

The metadata is an important component of the DOI system:

§ For reverse lookup to find the identifier for citation purposes.
§ To support multiple resolution (to allow choice between multiple versions of the same work, or copies located at different sites, especially for licensing purposes).
§ To determine which work, version, or manifestation of an object is represented without recourse to inspection of the item, to facilitate automated selection and transactions.

To support these processes, registration of a minimum set of standard metadata with the item is required. For this purpose, the DOI has adopted the INDECS (Interoperability of Data in E-Commerce Systems) metadata set. The INDECS project, designed to provide a common metadata framework to support E-commerce in intellectual property, meshes neatly with the requirements of DOI. Eight mandatory elements are prescribed in the minimum, or kernel set of metadata, three of which (Genre, Type and origination) are to be derived from standard lists.

DOI implements the indecs approach, which has at its heart the concept of rights management. IDF has introduced the concepts of DOI and indecs into many digital rights management activities such as MPEG-21, OEBF, TV-Anytime, etc. However, it is worth noting that DOI does not mandate a single metadata standard. Implementers may use any existing metadata standard; for full interoperability, however, the metadata set must be mapped to the indecs Data Dictionary.

CrossRef and Other Registration Agencies

CrossRef is the largest user of the DOI system. CrossRef is a collaborative reference linking service that functions as a sort of digital switchboard. It holds no full text content itself, but rather links users to digital content through DOIs that are tagged to object-level metadata supplied by the participating publishers, including the URL for the digital content. The end result is an efficient, scalable linking system through which a researcher can click on a reference in a citation database, for example, and access the cited article. More than 200 publishers deposit and maintain DOIs in the CrossRef System to link citations and references to the full text of journal articles and books. To date over 9 million identifiers have been registered in the CrossRef system and more are being added each day.

CrossRef is a DOI Registration Agency. The primary role of Registration Agencies is to provide services to Registrants: allocating DOI prefixes, registering DOIs, and providing the necessary infrastructure to allow Registrants to declare and maintain metadata and state data. The Registration Agency concept allows the Handle System to be a distributed system. At present, the DOI system has a total of 10 Registration Agencies, each specialising in a well-defined sector of coverage on behalf of specific user communities. CrossRef, for example, provides citation-linking services for the scientific publishing sector, so publishers will choose CrossRef as their Registration Agency if they wish to use the specific service or services offered by CrossRef.

The TSO (The Stationery Office) in the UK is also a DOI Registration Agency. TSO publishes on behalf of the UK’s parliaments and assemblies and is the UK's definitive source of official and regulatory information. The TSO recently joined the International DOI Foundation in order to help its government and private clients better manage digital resources. With a major emphasis in the UK on E-government and with support from the E-government Envoy, the TSO is seeking to provide a series of services that use the DOI to support management of government information over time.

In addition to their role explained above, the DOI Registration Agencies are expected to actively promote the widespread adoption of the DOI, and to co-operate with the IDF in the development of the DOI System as a whole. Like CrossRef, Registration Agencies may choose to provide other DOI-related services to Registrants, without limitation (so long as they conform with IDF Policy). These services may include any combination of value added services in, for example, data, content, rights management or further exploitation of the metadata that they collect.

Other Considerations

The DOI system is run on a cost recovery basis. It is an open system, free at the point of use and anyone may use a DOI to link to services. The DOI can be integrated into local environments and databases without charge. The cost of assigning a DOI is borne by the publisher or resource rights holder when it is assigned.

The majority of libraries and academic institutions are not in the same league as the major publishers and there are fears that the system may not be affordable for them. In addition to the charges associated with membership and the payments to registration agencies for the purchase of prefixes and assignment of identifiers, there is the concern that the rights management aspect may have the effect of making it easier to close off materials by making them unaffordable.

5.3.5 Archival Resource Keys (ARKs)

One of the most recently proposed identifier schemes is ARK (Archival Resource Key), which has been developed by John Kunze in his work for the US NLM (National Library of Medicine). The ARK proposal is still an internet draft, of which the latest version was issued in July 2004. It is currently being tested and implemented by the California Digital Library (CDL) for collections that it manages.

The ARK is a scheme intended to facilitate the persistent naming and retrieval of information objects and is developed specifically to meet the needs of the keepers of archival digital objects.

A founding principle of the ARK is that persistence is purely a matter of service. Persistence is neither inherent in an object nor conferred on it by a particular naming syntax. Rather, persistence is achieved through a provider's successful stewardship of objects and their identifiers.

In addition to an identifier syntax and a resolution procedure, the total scheme comprises three services to which ARKs are bound via the resolution procedure. These services are:

§ To give users a link from an object to a promise of stewardship, which identifies the provider’s set of responsibilities, including a “commitment to persistence” statement.
§ To give users a link from an object to its metadata. The metadata descriptive record is to act partly as a "receipt" with which users and archivists can verify an object's identity after brief inspection.
§ To give users a link to the object itself (or to a copy).

An ARK is represented by a sequence of characters (a string) that contains the label "ark:", optionally preceded by the beginning part of a URL:

[http://Name Mapping Authority/]ark:/Name Assigning Authority Number/Name

The URL, or the "Name Mapping Authority" (NMA), is optional and replaceable. The significance of the NMA is that it is not a part of the identifier and has no effect from the point of view of object identification. In other words, ARKs that differ only in the optional NMA component are considered to identify the same object.

Example: The following ARKs would identify the same resource:

http://ark.cdlib.org/ark:/13030/1234567
http://some.other.org/ark:/13030/1234567
ark:/13030/1234567

A Name Assigning Authority (NAA) is an organisation that creates (or delegates creation of) long-term associations between identifiers and the object services described above. Examples of NAAs include national libraries, national archives, and publishers. The ARK namespace reserved for an NAA is the set of names bearing its particular NAAN. For example, all strings beginning with "ark:/13030/" are under control of the NAA registered under 13030, which defines the California Digital Library.

NAANs are registered in a manner similar to URN Namespaces, but they are pure numbers consisting of 5 digits or 9 digits. Registration of NAAs and their associated NAANs is currently the responsibility of the California Digital Library.
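The rule that the NMA is not part of the identifier can be sketched as a normalisation step that strips the optional "http://NMA/" part before comparing two ARKs. This is an illustrative sketch based only on the syntax given in this section; the function names are hypothetical and the regular expression is a simplification of the draft specification.

```python
import re

# Optional "http://NMA/" part, then "ark:/", a NAAN of 5 or 9 digits,
# then "/" and the assigned name.
_ARK_RE = re.compile(r"^(?:http://[^/]+/)?ark:/(\d{5}|\d{9})/(.+)$")


def normalize_ark(ark: str) -> str:
    """Strip the optional NMA part and return the bare 'ark:/NAAN/Name'."""
    match = _ARK_RE.match(ark.strip())
    if match is None:
        raise ValueError(f"not a valid ARK: {ark!r}")
    naan, name = match.groups()
    return f"ark:/{naan}/{name}"


def same_object(ark_a: str, ark_b: str) -> bool:
    """ARKs differing only in their NMA part identify the same object."""
    return normalize_ark(ark_a) == normalize_ark(ark_b)
```

With this sketch, the three example ARKs given earlier all normalise to "ark:/13030/1234567" and are therefore treated as naming the same object.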

Each NAA is free to assign names from its namespace according to its own policies and conforming to the ARK’s specified guidelines. These guidelines specify naming rules that express inter-relatedness between objects, including object hierarchies and objects variants.

Resolution of ARKs

Figure 4 below illustrates the ARK’s resolution process. In order to derive an actionable identifier from an ARK, a working Name Mapping Authority (NMA) must be found. An NMA is a service that is able to respond to the three basic ARK service requests. Upon encountering an ARK, a user (or client software) looks inside it for the optional NMA part. If it contains an NMA that is working, the NMA discovery step may be skipped. If a new NMA needs to be found, the client looks inside the ARK again for the NAAN. Querying a global database, it then uses the NAAN to look up all current NMAs that service ARKs issued by the identified NAA. Once the NMA is located, the resolution of the name can proceed to look up the object-mapped services described above.

[Figure: the client (1) asks the Global ARK Database for a working NMA, (2) receives an NMA able to serve the ARK, (3) asks the Name Mapping Authority to resolve the name, and (4) receives the mapping data for the resource.]

Figure 4: ARKs’ Resolution Mechanism
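The NMA discovery step of Figure 4 can be sketched as follows. This is a simplified model rather than the ARK client software: the dictionary stands in for the global database, the `is_working` callback stands in for an actual availability check, and all names and hosts are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical copy of the global database: NAAN -> candidate NMA hosts.
GLOBAL_ARK_DB: Dict[str, List[str]] = {
    "13030": ["ark.cdlib.org", "some.other.org"],
}


def find_working_nma(
    ark: str,
    is_working: Callable[[str], bool],
    db: Dict[str, List[str]] = GLOBAL_ARK_DB,
) -> Optional[str]:
    """Return an actionable URL for an ARK, or None if no NMA responds."""
    nma = None
    if ark.startswith("http://"):
        # An NMA is embedded in the ARK: split it off from the identifier.
        nma, _, ark = ark[len("http://"):].partition("/")
    if nma and is_working(nma):
        # Discovery can be skipped when the embedded NMA still works.
        return f"http://{nma}/{ark}"
    # Otherwise use the NAAN to look up the current NMAs.
    naan = ark.split("/")[1]
    for candidate in db.get(naan, []):
        if is_working(candidate):
            return f"http://{candidate}/{ark}"
    return None
```

The sketch shows why a stale NMA in a cited ARK need not break resolution: the identifying part "ark:/NAAN/Name" is kept, and only the host in front of it is replaced.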

ARK comes with complete client software as well as a copy of the Global ARK Database (file-based) to look up NMAs. It is desirable that the file may be stored in a local file system for efficient access, but it needs to be reloaded periodically to incorporate updates.

ARK is still in the testing phase and currently accounts for 9 Name Assigning Authorities served by a handful of Name Mapping Authorities. These are the National Library of Medicine, the Library of Congress, the National Agriculture Library, the California Digital Library, the World Intellectual Property Organization, the University of California San Diego, the University of California San Francisco, the University of California Berkeley and the Rutgers University Libraries. ARK has also been proposed to DSpace in an attempt to extend its repertoire of persistent identifiers, so that DSpace adopters will be able to install, configure and use ARKs and will not be restricted to Handles.

5.3.6 Other Identifier Schemes

In addition to the above mentioned systems, there are other identification schemes which are still undergoing development. The most notable ones are the info Uniform Resource Identifier (info URI) and the eXtensible Resource Identifier (XRI).

5.3.6.1 info Uniform Resource Identifier (info URI)

The info URI scheme is being developed from within the library and publishing communities. Representing the library communities are Herbert Van de Sompel of LANL (Los Alamos National Laboratory) Research Library and Stu Weibel of OCLC (Online Computer Library Center), while representing the publishing communities are Tony Hammond, formerly of Elsevier, and Eamonn Neylon of Manifest Solutions, a publishing technology consultancy. The info URI has been created because it was recognised that many of the existing public identifiers commonly used by the library and publishing communities are not ‘official’ URIs. The sole purpose of the info URI is the disclosure of the identity of an information asset from a public namespace. That means that an info URI has no representations associated with the resource that it identifies, and as such no retrieval actions can be performed. Therefore, unlike URN, the info URI makes no claims of persistence.

The info URI scheme is being promoted by ANSI/NISO (the North American National Information Standards Organisation). The scheme is supported by a registry of namespaces that is run by a Maintenance Agency appointed by NISO. A Namespace Authority, which is the body that owns and manages a public namespace, can request registration of the public namespace it owns. Examples of such are the widely deployed National Library of Medicine PubMed Identifiers (pmid), or Source Identifiers (sid) as used by the OpenURL Framework.

5.3.6.2 eXtensible Resource Identifier (XRI)

OASIS members have formed a new Technical Committee whose intention is to create a URI-compatible scheme, eXtensible Resource Identifier (XRI), that would enable the identification of resources (including people and organisations) and the sharing of data across distributed directory services, domains, enterprises and applications. The Technical Committee will also define an XRI secure resolution protocol and a special set of identifiers for XRI metadata (identifiers that describe other identifiers). An early version of the scheme’s specifications was approved as a Committee Draft, “XRI Generic Syntax and Resolution 1.0”, in January 2004, and the Technical Committee is currently working on version 1.1 of these specifications, which will also include a Trusted Resolution specification. The current draft specifies XRIs as similar to URIs, but with additional syntactic elements and an extended unreserved character set that includes characters beyond those allowed in generic URIs. To accommodate applications that expect generic URIs, the specification defines rules for transforming an XRI into a valid URI.

5.3.7 Comparative Summary

To sum up, the table below summarises the characteristics of each of the identification systems addressed in this report:

System    Standard             Object type  Resolution    Resolution type  Persistence  Service cost  Globally unique
URL       RFC2616              Any          Yes (DNS)     Single           No           No            Yes
URN:ISSN  ISO3297 for ISSN     Journal      No            n/a              Yes          No            Yes
URN:ISBN  ISO2108 for ISBN     Book         No            n/a              Yes          No            Yes
URN:NBN   RFC3188              Any          No            n/a              Yes          No            Yes
PURL      No                   Any          Yes           Single           Yes          No            Yes
Handle    RFC3650              Any          Yes           Multiple         Yes          Yes           Yes
DOI       Z39.84-2000, Syntax  Any          Yes (Handle)  Multiple         Yes          Yes           Yes
ARK       No                   Any          Yes           Multiple         Yes          unknown       Yes
info URI  RFC3668              Any          No            n/a              Yes          No            Yes
XRI       No                   Any          Not yet       n/a              Yes          unknown       Yes

5.4 Evaluation of Persistent Identification Schemes

As outlined in the previous section, there are a number of options currently available to implement a persistent identifier system. All of the options provide varying degrees of location independence (i.e. the name of the item is unrelated to the physical location of the item) and of persistence, provided the links are properly managed. It is important to note in this context that persistence is a function of organisation and management, not technology. Whatever system is used will require commitment to manage the naming and maintain the relationship between the name and location of the resource. This section is an attempt to evaluate each of the schemes in terms of its technical capabilities.

5.4.1 Using the URN scheme

As described earlier, there is no built-in resolution infrastructure that can support URNs. If URNs were implemented in the near to medium term, the identifiers could only be resolved using an http proxy server. Nevertheless, the use of URN style identifiers would have some advantages.

Advantages
§ It has a flexible naming system that can accommodate existing identifier schemes easily.
§ It is independent of any particular protocol and of the resource location.
§ Provided an appropriate resolution system is put in place, a URN can be used to refer to multiple instances of a resource. This has been demonstrated by “Die Deutsche Bibliothek”. The resolution process can be implemented in either a centralised or a decentralised model.

Disadvantages
§ No significant deployment within large scale systems, due to the absence of an agreed-upon resolution service.
§ Not accessible using standard browsers except through a proxy server, or with the help of self-implemented plug-in software.
§ No administration and management tools currently available. Users of the URN scheme have had to develop their own management tools to meet their requirements.
§ If using URNs requires the registration of a new namespace, the burden may prove to be high.

5.4.2 Using PURLs

Use of the PURL option would be the simplest option and would require little effort beyond active promotion and a closer examination of its use together with the development of implementation guidelines.

Advantages
§ It is a simple solution with minimal investment, as it relies on existing technologies.
§ It can be deployed in a central or decentralised model with contributing institutions/organisations installing their own PURL server(s).
§ A PURL server has an advantage over standard redirects in that it comes with a collection of utilities that support the management of URLs by allowing users to update their own resources, a critical aspect of assuring the integrity and maintenance of links.

Disadvantages
§ Total location independence is not guaranteed when partial redirection is used, since the unresolved remainder of the PURL is passed through to the target URL unchanged.
§ PURL is protocol dependent. It is a URL implementation using a standard http redirect.
§ It can only support one-to-one resolution, as it is based on a redirect; it would be difficult to support one-to-many resolution, e.g. multiple instances or multiple versions.
§ Standard browsers return the URL in the location field. This would in many cases become the citation unless the PURL is clearly indicated on or in the object itself or browsers are modified to display the persistent URL that was requested rather than the actual URL representing the current location of the resource.

5.4.3 Using Handles

The Handle System offers a naming scheme which is flexible and extensible and which will support distributed naming and resolution with an option to resolve centrally. It has the capability for multiple resolution, which has been further developed and exploited by the International DOI Foundation, and one may be able to take advantage of these developments. It also has management, security, registration and resolution capabilities. As a whole, it is a technically feasible option. The Handle System is an open standard, so anyone can build or use one; both source code and APIs are public. It does rely on the top level Global Handle Registry being in place somewhere, just as the Internet assumes that there will always be a root server and directory available. CNRI has a commitment to funding and maintaining these; were that to fail, there are enough large scale implementers of Handles to ensure that the registry would be picked up by someone who would continue its development and support.

Advantages
§ Flexibility of the naming system. It allows for further guidelines and syntactic restrictions to be introduced by naming authorities. In addition, the naming system is URN compliant and can accommodate existing schemes easily.
§ A handle can refer to multiple instances of a resource and/or multiple data types associated with the resource.
§ The complete client/server software and management tools are available from CNRI.
§ The system is proven. Handles have been implemented in major applications which share similar requirements of open access repositories, like DSpace.
§ The system has a built-in resolution infrastructure which supports both a centralised and a decentralised model.
§ It is independent of any particular protocol and independent of the resource location.

Disadvantages
§ More complex than the other systems to establish.
§ Not yet useable through standard browsers without plug-in software for the client - but can be used with a proxy server as an interim measure.

5.4.4 Using DOIs

As a Handle implementation, the DOI inherits all of the advantages of the Handle System. In addition, the DOI’s additional components, including its own metadata structure and policy framework, make it a completely managed system from which many additional benefits could be derived. One can also argue that the real advantages of using the DOI lie in becoming a Registration Agency. The requirement to lodge metadata with the registered DOIs and the ability to define application profiles for that metadata would make it possible to build value added services on the metadata.

Advantages
§ DOI metadata mapping is based on the indecs (interoperability of data in e-commerce) framework. Conformance with this framework facilitates the use of DOIs with MPEG- and ONIX-compliant tools for multimedia content management and digital rights management, and with other schemes following the same principles.
§ DOIs and associated metadata ensure that accurate, interoperable and efficient product information is available both externally and internally, reducing costs in data management and handling.
§ Involvement in a system which is becoming widely deployed for digital resources of all kinds.

Disadvantages
§ The high start-up costs (fees, negotiations, business plan, IT and business systems set-up).
§ Not yet usable through standard browsers without plug-in software for the client, although a proxy server can be used as an interim measure.
§ There could be a degree of loss of autonomy, and a level of uncertainty about how far the interests of large commercial publishers will impact adversely on those of archival institutions. On the other hand, close work with these organisations may lead to increased mutual understanding and co-operation.

5.4.5 Using ARKs

If it is accepted as a standard, the ARK system is a promising avenue for the identification of digital archival materials. As the system is only in its final stages of development, it is too early to determine what will become of it. The concept of making the Name Mapping Authority (NMA) optional and mutable, and recognising only the NAAN and assigned name components as the ARK identifier, is elegant in its simplicity. If it works and is accepted, it can be used today, and only minor adjustments to browsers would be required to enable them to substitute the “http” string for the “ark” string for resolution.

Advantages
§ The identifier is location and protocol independent.
§ Resolution can proceed with standard browsers today, if the NMA is known.
§ It can support multiple resolution.
§ It explicitly tackles the issue of commitment to persistence by clearly separating the role of original name assignment from the role of ongoing support for access.

Disadvantages
§ Too new to judge the likelihood of acceptance.
§ Too new to realistically assess the costs of implementing such a system, which would undoubtedly require considerable investment.
§ Work on supporting tools for maintenance, management and administration is not yet concluded.
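The idea that only the NAAN and the assigned name make up the ARK, while the NMA is optional and replaceable, can be sketched as follows. This is a simplified illustration, not a full implementation of the ARK specification: the regular expression, the function names, the host names and the sample identifier are all assumptions made for the example.

```python
import re

# Optional NMA (scheme and host), followed by the core identifier:
# the "ark:" label, the NAAN and the assigned name.
ARK_RE = re.compile(
    r"^(?:(?P<nma>https?://[^/]+)/)?"
    r"(?P<core>ark:/(?P<naan>[^/]+)/(?P<name>.+))$"
)


def core_ark(ark: str) -> str:
    """Return the persistent part of an ARK, dropping any NMA prefix."""
    match = ARK_RE.match(ark.strip())
    if match is None:
        raise ValueError(f"not an ARK: {ark!r}")
    return match.group("core")


def with_nma(ark: str, nma: str) -> str:
    """Make an ARK actionable by prefixing it with a (new) NMA."""
    return nma.rstrip("/") + "/" + core_ark(ark)


# Hypothetical hosts and identifier, for illustration only.
old = "http://old.example.org/ark:/12345/x6np1wh8k"
print(core_ark(old))
print(with_nma(old, "http://new.example.org"))
```

Two URLs that differ only in their NMA reduce to the same core identifier, which is what allows the resolving host to change while citations of the ARK remain valid.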

5.4.6 Summary

The options at present are limited. There is an inherent risk in embarking on a particular course of action when the scheme and the processes around it are not settled, or when the technology is not sufficiently tested, as is the case with XRI and infoURI. In addition, any system without resolution support would require a dedicated resolution service to be implemented that meets the requirements, and would possibly need to operate through an HTTP proxy server, at least in the short to medium term, as is the case with URN. Furthermore, the absence of software tools supporting the identifier’s life cycle, from assignment to administration and maintenance, reduces the value of a system and does not work in favour of URNs. Simple resolution that relies on plain redirects does not meet the expectations of a large-scale system, which excludes PURL, even though its freely downloadable and customisable software offers greater flexibility than any of the other systems.

Of the systems available at present, only the Handle and the DOI Systems deal efficiently with all of the above issues and seem to offer the best option, for the reasons outlined above (please refer back to the sections Using Handles and Using DOIs). However, it is also important to mention that ARK, if sufficiently tested, adopted, accepted and further developed, may constitute a better alternative. While it is acknowledged that ARK is new to the scene, it addresses some needs that are not met by other strategies. For example, in addition to being actionable and globally unique, ARKs provide end-users with a statement of organisational commitment on the part of the assignor. However, as ARK is a relatively new strategy, it may still need to be proven effective by a range of users before widespread uptake is observed.

Recommendation 1

It is recommended to consider, test and/or evaluate further the Handle System and the DOI System, as they offer the most complete solutions. It is also recommended to actively monitor the development of ARK, whose prospects look bright. If later requirements specify the need for a system free of service costs, then the URN seems to be better positioned, and the experience gained by “Die Deutsche Bibliothek” through its own development would be of great benefit. In all cases, the option of having a repertoire of persistent identifiers, rather than being restricted to one scheme, should not be excluded.

5.5 Ensuring Persistent Access to Resources

A system of persistent identification is only one tool that can assist in achieving persistent access to resources. Sound information management practices and guidelines are a prerequisite for the effective use of persistent identifiers and a necessity for achieving the goal of persistence. Name persistence is not a technological problem. Some schemes will provide more reliable tools to achieve such persistence, but whatever system is used, persistence of citation will be achieved through the commitment of the institutions responsible for assigning the names and maintaining the resolution services, not by the inherent qualities of the identifiers themselves.

It is important to avoid building the expectation among resource owners that assigning a persistent identifier somehow assures permanence and persistence. Persistence will always be a managed process, both at the global level, in terms of the resolver services used to route identifier-based requests to the organisations keeping the resources and metadata, and at the local level, to ensure that the identifier is associated with the resource and the request can be met.

The implementation of global registers and associated persistent identification resolvers requires ongoing maintenance and, therefore, an ongoing permanent allocation of resources. The resolution services are only as up to date as the physical locations to which the persistent identifiers point. While some of this updating can be automated, responsibility for the updating and for ensuring its reliability must be assigned within each agency, program or office, or to a trusted third party. It is not sufficient to create identifiers and leave them without maintenance; active management is needed in order to gain the benefits.
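The part of this maintenance that can be automated might look like the following sketch, which scans a register of identifiers and reports those whose recorded location no longer responds. The register layout, the function name and the injected reachability check are assumptions made for illustration; a real service would plug in an actual network check and its own register storage.

```python
from typing import Callable, Dict, List


def find_stale_entries(
    register: Dict[str, str],
    is_reachable: Callable[[str], bool],
) -> List[str]:
    """Return the identifiers whose recorded location fails the check."""
    return sorted(pid for pid, url in register.items() if not is_reachable(url))


# Hypothetical register mapping identifiers to current locations.
register = {
    "pid:0001": "http://repo.example.org/items/1",
    "pid:0002": "http://repo.example.org/items/2",
}

# Stand-in for a real network check (for example an HTTP HEAD request).
live_locations = {"http://repo.example.org/items/1"}

print(find_stale_entries(register, lambda url: url in live_locations))
```

Such a report only flags candidates for correction; deciding on the new location and updating the register remains the responsibility of the assigned maintainer.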

Recommendation 2

Whatever Persistent Identification System is opted for, it has to be supported by guidelines and policies that encourage institutions and individuals to take responsibility for the persistence of the resources they own, rather than fostering the assumption that once resources are assigned persistent identifiers, all responsibility for their management is discharged.

6. Relevance and conclusions for the project

A Persistent Identification Service is a core infrastructure component that relates directly to most of the eSciDoc services defined in the project’s working hypothesis. For instance, persistent identifiers play an important role in the archiving platform. There is a strong association between persistent identification and preservation, in the sense that persistence of naming is essential for reliable long-term access. Conversely, the value of preserving material would be substantially reduced if the material could not be reliably identified, found and accessed when cited or referenced.

Persistent identifiers are also essential for general linking in the catalogues and indexes of the digital library service. In addition to irritating users who are trying to access material, constant change and movement of material can impose an unacceptable burden on the maintainers of lists, indexes, gateways or catalogues of the library’s material. Centralising this maintenance through a Persistent Identification Service, supported by automatic monitoring tools, would reduce this burden considerably.

There are many categories of material that need to be identified and managed for persistence. The best rule of thumb is that any item (or collection of items) which has been made publicly available online, and which is likely to have been cited or referenced in other digital objects or online resources, should be a candidate for persistent identification.

A critical part of persistence is establishing a clear definition that allows an institution to communicate to its user community the level of commitment it is making to maintaining the online availability of (access to) a resource. In other words, the successful implementation of a Persistent Identification Service requires ongoing maintenance and, therefore, policies and guidelines should be established for the creation and maintenance of persistent identifiers and the management of related services.

7. Open questions

1. Costs of the system:

The evaluation did not look at cost issues because no requirements had been defined in this respect.

2. Identification/Resolution:

What to identify?
What objects or object categories need to be addressed with identifiers? Which of these need to be managed for persistence? Do we identify only archived material? When to identify?
Note: DSpace only identifies communities, collections and items with persistence, as they enter the archive. Bundles and bitstreams are assigned globally unique ids that are not managed for persistence.

Resolution
What an identifier identifies (the object) is not necessarily what the identifier resolves to (object, metadata, versions, etc.). In the context of multiple resolution, what information do we expect the identifier to resolve to? What kind of metadata (metadata set) needs to be stored with, or linked to, a persistent identifier?

3. Objects/Relationships:

These questions run deep, and very little has been done in this context. They are still frequently asked at research level, which means that communities are far from reaching practical standard solutions.

Relationship concerns seem to fall into at least two categories: granularity and versioning issues.

Granularity
Granularity has to do with the naming of sub-objects. Are sub-objects identified (and/or resolved) separately from their parents? What about naming a chapter or a section in a digital book? Should internal content parts be given names of their own?

Versioning
Versioning concerns creating names that reveal to others that one object is a version of another. There are different dimensions of versioning: language of the content, format of the content, revision of the content (the idea of different drafts or releases of an object), etc. Are object versions (copies) identified (and/or resolved) separately from their original object?

4. Commitment to persistence:

Ensuring persistence is a matter of service and maintenance. How can the concept of persistence be promoted? Perhaps by developing best practice guidelines for the creation of persistent identifiers, their maintenance and the management of related services? Or by establishing policies to govern the commitment of the different participating parties to persistence?

8. References

[1] Key Concepts in the Architecture of the Digital Library William Y. Arms; CNRI (D-Lib Magazine July, 1995) http://www.dlib.org/dlib/July95/07arms.html

[2] Functional Requirements for Uniform Resource Names (RFC 1737) Masinter, Larry; Sollins, Karen (Date Created: Dec 1994) http://www.ietf.org/rfc/rfc1737.txt

[3] Functional Recommendations for Internet Resource Locators (RFC 1736) Kunze, John A (Date Created: Feb 1995) http://www.ietf.org/rfc/rfc1736.txt

[4] Uniform Resource Identifiers (URI): Generic Syntax (RFC 2396) Berners-Lee, Tim; Fielding, Roy T; Masinter, Larry (Date Created: Aug 1998) http://www.ietf.org/rfc/rfc2396.txt

[5] URN Syntax (RFC 2141) Moats, Ryan (Date Created: May 1997) http://www.ietf.org/rfc/rfc2141.txt

[6] Using National Bibliography Numbers as Uniform Resource Names (RFC 3188) Hakala, Juha (Date Created: October 2001) http://www.ietf.org/rfc/rfc3188.txt

[7] The Naming Authority Pointer (NAPTR) DNS Resource Record (RFC 2915) M. Mealling, R. Daniel (Date Created: September 2000) http://www.ietf.org/rfc/rfc2915.txt

[8] Dynamic Delegation Discovery System (DDDS): The Comprehensive DDDS (RFC 3401) M. Mealling (Date Created: October 2002) http://www.ietf.org/rfc/rfc3401.txt

[9] Handle System Overview (RFC 3650) S. Sun, L. Lannom, B. Boesch- CNRI (Date Created: November 2003) http://www.ietf.org/rfc/rfc3650.txt

[10] ARK Persistent Identifier Scheme (Internet Draft) Kunze, John A.; Rodgers, R.P.C. (Date Created: 20 Jul 2004) http://www.ietf.org/internet-drafts/draft-kunze-ark-08.txt

[11] The "info" URI Scheme for Information Assets with Identifiers in Public Namespaces (Internet Draft) H. Van de Sompel, T. Hammond, E. Neylon, S. Weibel (Date Created: 09 Jul 2004) http://www.ietf.org/internet-drafts/draft-vandesompel-info-uri-02.txt

[12] Application of Persistent Identifiers as One Approach to Ensure Guarantee Long-Term Availability of Online Theses : the Established Uniform Resource Name (URN) Management at Die Deutsche Bibliothek Schroeder, Kathrin (Date Created: 2003) http://edoc.hu-berlin.de/etd2003/schroeder-kathrin/HTML/index.html

[13] Enhancement of Persistent Identifier Services- A Comprehensive Method for unique Resource Identification. Schroeder, Kathrin (Date Created: April 2003) http://www.persistent-identifier.de

[14] URN User Guide Hakala, Juha; Koch, Traugott (Finland) http://www.lib.helsinki.fi/meta/URN-help.html

[15] Uniform Resource Names Hakala, Juha In: Tietolinja News (Date Created: 15 Mar 1998) http://www.lib.helsinki.fi/tietolinja/0199/urnart.html

[16] DiVA Project : Development of an Electronic Publishing System Andersson, Stefan; Hansson, Peter ; Klosa, Uwe; Muller, Eva In: D-Lib magazine (Date Created: Nov 2003) http://www.dlib.org/dlib/november03/muller/11muller.html

[17] Preservation Requirements in a Deposit System Dr. Raymond J. Van Diessen (Date Created: December 2002) (Netherlands) http://www.kb.nl/hrd/dd/dd_onderzoek/reports/3-preservation.pdf

[18] PURL home page Online Computer Library Center Inc. (OCLC) (Regularly Updated) http://purl.oclc.org/

[19] Getting a Handle on Federal Information : Persistent Identification Using Handles CENDI (Last Updated: 18 Feb 2003) (United States of America) http://cendi.dtic.mil/activities/01_29_03_handles_program.html

[20] Handle System (Overview) http://www.handle.net/overviews/overview.html

[21] Handle System Corporation for National Research Initiatives (Regularly Updated) (United States of America) http://www.handle.net/index.html

[22] International DOI Foundation home page International DOI Foundation (Regularly Updated) http://www.doi.org/

[23] DOI Handbook (Version 4 April 2004) Paskin, Norman (Date Created: Jul 2003) http://www.doi.org/hb.html

[24] DOIs : the Key to Interoperability O'Neill, Shane (Date Created: 2002) (United Kingdom) http://www.cilip.org.uk/update/issues/dec02/article3dec.html

[25] Syntax for the Digital Object Identifier National Information Standards Organization (NISO) (Date Created: 12 Dec 2000) http://www.niso.org/standards/standard_detail.cfm?std_id=480 ANSI/NISO Z39.84-2000

[26] Towards Electronic Persistence Using ARK Identifiers Kunze, John A. ARK specification is available. http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf

[27] ‘info’ URI Scheme http://www.info-uri.info

[28] OASIS Extensible Resource Identifier (XRI) Technical Committee OASIS http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xri

[29] Persistent Identifiers : ERPANET Seminar 17 - 18 Jun 2004 , University College, Cork, (Ireland) http://www.erpanet.org/events/2004/cork/index.php

[30] At the Event: ERPANET Seminar on Persistent Identifiers Monica Duke, UKOLN (17 - 18 Jun 2004) http://www.ariadne.ac.uk/issue40/erpanet-ids-rpt/
