
ARC file Revision 3.0 Proposal

Steen Christensen, Det Kongelige Bibliotek
Michael Stack
Edited by Michael Stack

Revision History

Revision 1, 09/09/2004: Initial conversion of the wiki working doc. [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionProposal] to docbook. Added edits suggested by Gordon Mohr (others made are still up for consideration). This revision is what is being submitted to the IIPC Framework Group for review at their London meeting of 09/20/2004.

Table of Contents

1. Introduction
1.1. IIPC Archival Data Format Requirements
1.2. Input
1.3. Scope
1.4. Acronyms, Abbreviations and Definitions
2. ARC Record Addressing
2.1. Reference
2.2. The ari Scheme
2.3. Discussion
3. ARC changes
3.1. ARC Record Metadata Line
4. ARC Record Metadata (IIPC Archival Data Format Requirement 2.4)
4.1. Metadata ARC Record Types
5. Recording of the Complete Request (IIPC Archival Data Format Requirement 2.6)
5.1. Discussion
6. Duplicate Reduction (IIPC Archival Data Format Requirement 2.13)
6.1. Use Case
6.2. Proposed Implementation
7. Format Transformations (IIPC Archival Data Format Requirement 2.11)
7.1. Use Case
7.2. Transformation Attributes
7.3. Proposed Implementation
8. For Consideration
8.1. Up the default ARC file size
8.2. ARC writers should record length in custom GZIP header 'extra' header
9. Miscellaneous
9.1. av_* toolset
9.2. GZIPping of ARC Records
9.3. Recording unfetched content
9.4. ARC File EBNF

Abstract

Proposed Revision to the Internet Archive ARC file format to add metadata, recording of the fetching request and support for content transforms.

1 ARC file Revision 3.0 Proposal

1. Introduction

This document proposes a set of changes to the Internet Archive (IA) ARC file format that directly address requirements drawn up by the International Internet Preservation Consortium (IIPC) [http://www.netpreserve.org/]. In the main, the IIPC requirements call for the ARC file to support the writing of content metadata, the recording of the request made when fetching content, and content transformations. The IA, the formulator of the current ARC file format, is a member of the IIPC and participated in the development of the IIPC Archival Data Format requirements.

1.1. IIPC Archival Data Format Requirements

Below are listed the key requirements from Section 2 of IIPC Archival Data Format Requirements [http://netarkivet.dk/website/publications/Archival_format_requirements-2004.pdf] (TODO: This links to a copy. Update):

• 2.1 Open Archival Information System (OAIS) [http://www.rlg.org/longterm/oais.html] compatible

• 2.3 The format must support all Internet protocols

• 2.4 The format must support metadata

• 2.5 Data integrity must be easy to verify and maintain

• 2.6 It must be possible to retrieve the original bitstream (Request and response).

• 2.11 Support format transformations

• 2.13 Support duplicate reduction

• 2.14 The format should be efficient

1.1.1. Other "Requirements"

Riders on the above IIPC requirements listing that the IA wants the ARC revision to support include:

• Recording of metadata to support writing of arbitrary crawltime metadata such as operator journal notes.

• Recording of response SSL certificates and authentication credentials used logging into a site.

1.2. Input

Below we list key documents that fed the development of this proposal:

• Report covering Discussion of ArcRevision Proposals [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionCopenhagenDiscussion], June 10th, 2004 at the Heritrix Copenhagen Workshop. A listing of ARC proposed changes was presented. This document summarizes what came of the ensuing discussion.

• ARC file versions 1.0 and 2.0 are described in Arc File Format [http://www.archive.org/web/researcher/ArcFileFormat.php]. Version 2.0 was never implemented. An amendment to ARC file version 1.0 (1.1), adding XML metadata to the head of the ARC, is described in Internet Archive ARC files [http://crawler.archive.org/articles/developer_manual.html#arcs]. Heritrix, currently at version 1.0.0, writes ARC 1.1 files. The Alexa crawler, the harvester responsible for the bulk of the archive.org repository, writes ARC 1.0 files.

• Open Archival Information System (OAIS) Resources [http://www.rlg.org/longterm/oais.html], a “...conceptual framework for an archival system dedicated to preserving and maintaining access to digital information over the long term”, informs the IIPC Archival Data Format Requirements document.

1.3. Scope

1.3.1. Suggested Timeline

Finished proposal: 09/2004, in time for the London IIPC meeting. Review and edit over the winter. Implementation, at least by IA, in the first quarter of 2005.

1.3.2. Key Stakeholders

• International Internet Preservation Consortium (IIPC) [http://netpreserve.org/]

• Internet Archive (IA) [http://archive.org/]

1.3.3. Other Stakeholders

Users of Heritrix [http://crawler.archive.org/], an open-source crawler that writes its fetchings as ARCs, other than the Internet Archive.

1.3.4. Constraints

The constraints below are copied forward from the ArcRevision [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevision] Copenhagen Discussion document:

• Any revision must consider the 400 terabytes of version 1 legacy Internet Archive ARCs stored in San Francisco, with a (dated) copy in Alexandria, Egypt and a recent copy in Amsterdam (IA Europe).

• The proprietary software used manipulating and playing back this ARC legacy, the av_* tools and the Wayback Machine [http://web.archive.org/collections/web.html], are not easily changed.

1.4. Acronyms, Abbreviations and Definitions

1.4.1. ARC Record

An ARC file is made up of concatenated ARC Records. An ARC Record begins with a single line, known as the URL-record or ARC Record metadata line. Here is the ARC Revision 1.0 [http://www.archive.org/web/researcher/ArcFileFormat.php] definition of the ARC Record metadata line:

This ARC Record metadata line is immediately followed by the raw, unadulterated content byte-stream, usually HTTP response headers, an empty line, and then the requested page. An ARC Record is the ARC Record metadata line plus recorded content.


2. ARC Record Addressing

There is a need for uniquely addressing individual ARC Records. For example, this proposal talks of being able to record a pointer to an extant Archive ARC Record in place of content if archiving software determines the content already present in the archive. Such an ARC Record pointer mechanism would take the address of an ARC Record. An ARC Record address would also be used tying metadata to the content described.

To date, the Internet Archive, the maintainer of the largest repository of ARCs, has queried particular ARC Records using a combination of date and URL. A CGI, the Wayback Machine [http://www.archive.org/web/web.php], parses the passed URL path -- e.g http://web.archive.org/web/20010202083300/http://archive.org/ -- to extract date and URL components. Armed with these path-portions, it does a lookup into an index to find which ARC file contains the sought-for ARC Record and at which offset the ARC Record resides.
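As a rough illustration of the date-plus-URL extraction described above, the following Python sketch splits a Wayback-style URL into its two components. It is not the Wayback Machine's own code; the function name `split_wayback_url` is hypothetical, and the sketch assumes the `/web/<14-digit date>/<url>` path layout shown in the example.

```python
from datetime import datetime
from urllib.parse import urlsplit


def split_wayback_url(wayback_url):
    """Split a Wayback-style URL into (capture datetime, original URL).

    Assumes the path layout /web/<YYYYMMDDhhmmss>/<original url>.
    """
    parts = urlsplit(wayback_url)
    prefix = "/web/"
    if not parts.path.startswith(prefix):
        raise ValueError("not a /web/ path: %r" % wayback_url)
    # Everything up to the next slash is the date, the rest is the URL.
    date_text, _, original = parts.path[len(prefix):].partition("/")
    if parts.query:
        # urlsplit moved the original URL's query out of the path; restore it.
        original += "?" + parts.query
    captured = datetime.strptime(date_text, "%Y%m%d%H%M%S")
    return captured, original


captured, original = split_wayback_url(
    "http://web.archive.org/web/20010202083300/http://archive.org/")
```

With the two components in hand, an index lookup keyed on (URL, date) would yield the ARC file name and record offset; that lookup is outside this sketch.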

Going forward, date plus URL will be insufficient for distinguishing records. At a minimum, a scheme based on these two attributes alone does not allow for uniquely addressing different ARC Record versions of the same content (see Section 7, “Format Transformations (IIPC Archival Data Format Requirement 2.11)” below). Other issues are that multiple record writes of the same URL in the same moment become more likely with what is proposed below, and how to ensure identifiers do not clash as ARC Record queries cross institutions.

2.1. Reference

The below documents were consulted during development of an ARC Record Addressing Scheme.

• RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax [http://www.faqs.org/rfcs/rfc2396.html]

• "duri" and "tdb" URN namespaces based on dated URIs [http://larry.masinter.net/duri.html]

• Persistent Identifiers [http://www.nla.gov.au/padi/topics/36.html]

• The ARK Persistent Identifier Scheme [http://www.ietf.org/internet-drafts/draft-kunze-ark-08.txt]

• URIs, URLs, and URNs: Clarifications and Recommendations 1.0 [http://www.w3.org/TR/uri-clarification/]

• New URI Schemes: 99% Harmful [http://infomesh.net/2001/09/urischemes/]

• I am a name and a number [http://www.ariadne.ac.uk/issue24/metadata/intro.html]

2.2. The ari Scheme

For writing the unique address of an ARC Record, we propose a new URI scheme, ari, for arc record identifier or archive resource identifier. An ari is defined as a compound of elements found in ARC Record metadata lines. In words, an ari is the ARC Record metadata URL plus the ARC Record metadata (GMT) date of the ARC Record recording plus a serial number. The serial number is currently yet-to-be-specified, but for the purposes of this revision of the proposal, let the serial number be three hex digits wide and have it roll over at 0xFFF to begin again at 0x000. The serial number is meant to distinguish records written in the same moment (the ARC Record serial number is not currently present on the ARC Record metadata line; its addition is proposed later below).

In RFC 'layout form': ari:<date>;<serial>;<url>

TODO: Formal RFC BNF-like specification which describes what date looks like (RFC 822 [http://www.faqs.org/rfcs/rfc822.html] dates?), legal characters and width in IP, etc.
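Pending a formal specification, the provisional serial number rule stated above (three hex digits, rolling over from 0xFFF to 0x000) might be sketched as follows. The class name `SerialCounter` and the choice of upper-case, zero-padded text are assumptions made for illustration only.

```python
class SerialCounter:
    """Three-hex-digit serial that rolls over from 0xFFF back to 0x000."""

    WIDTH = 3           # hex digits, per this revision of the proposal
    MODULUS = 0x1000    # 0xFFF + 1

    def __init__(self, start=0):
        self._next = start % self.MODULUS

    def next_serial(self):
        """Return the next serial as upper-case, zero-padded hex text."""
        value = self._next
        self._next = (value + 1) % self.MODULUS
        return format(value, "0%dX" % self.WIDTH)


counter = SerialCounter(0xFFE)
serials = [counter.next_serial() for _ in range(3)]  # FFE, FFF, then 000
```

A writer would presumably hold one such counter per ARC-writing process; whether the counter should be shared across writers is left open by the text.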


2.2.1. Example aris

1. ari:20040724100457;1FE;http://www.archive.org/index.html

2. ari:20040728195737;1FF;http://www.dh.gov.uk/PolicyAndGuidance/OrganisationPolicy/Modernisation/NHSPlan/fs/en?CONTENT_ID=4082690&chk=/DU1UD
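To make the layout concrete, here is a minimal Python sketch that composes and parses aris of the shape shown in the examples. The names `Ari`, `format_ari` and `parse_ari` are hypothetical, and the 14-digit date check is an assumption drawn from the examples rather than from any formal grammar (which is still a TODO).

```python
from collections import namedtuple

Ari = namedtuple("Ari", "date serial url")

ARI_PREFIX = "ari:"


def format_ari(date, serial, url):
    """Compose an ari from the date, serial number and URL."""
    return "%s%s;%s;%s" % (ARI_PREFIX, date, serial, url)


def parse_ari(text):
    """Split an ari into its date, serial and URL parts."""
    if not text.startswith(ARI_PREFIX):
        raise ValueError("not an ari: %r" % text)
    # Split on the first two semicolons only: the URL part may itself
    # contain semicolons, or even be another ari.
    fields = text[len(ARI_PREFIX):].split(";", 2)
    if len(fields) != 3:
        raise ValueError("expected date;serial;url in %r" % text)
    date, serial, url = fields
    if len(date) != 14 or not date.isdigit():
        raise ValueError("expected a 14-digit date in %r" % text)
    return Ari(date, serial, url)


example = parse_ari("ari:20040724100457;1FE;http://www.archive.org/index.html")
```

The serial part is deliberately not validated here, so that an empty serial remains parseable.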

2.3. Discussion

The URN-like ari has the following properties:

1. An ari is a globally unique ARC Record identifier. The likelihood that two ARC writers are both writing the same URI at the exact same moment using the same ARC Record serial no. is considered too rare an occurrence to matter.

2. An ari is not "fetchable"/"actionable". You cannot copy an ari URI into your browser location bar and have the browser fetch the pointed-to resource.

3. While the ARC Record URI is effectively redundant information (the date plus serial no. could be made sufficient to guarantee uniqueness), it is included as a means of verifying addresses and so that, without having to resort to metadata, linkages between records can be traced. For example, it is possible to look at a metadata ari and see the URI, date, and serial number of the content it describes even if the described record goes missing. Such a scheme, where key information about the referenced resource is contained inside the pointer, makes an indexer's work of associating records easier (e.g. if the original record is missing, all subsequent transforms, re-arc'ings, metadata, request headers, etc., can still be found by an indexer just by reading the ARC Record metadata line).

4. ari is similar to the Persistent Identifier used by the National Library of Australia's PANDORA project. See Appendix: PANDORA Persistent Identifier Standard of Archiving the Web: The PANDORA Archive at the National Library of Australia [http://www.nla.gov.au/nla/staffpaper/2001/cathro3.html].

5. '2.3.1 Proxy into HTTP/HTML' of RFC2718 [http://www.ietf.org/rfc/rfc2718.txt] suggests a gateway between HTTP/HTML for any proposed new scheme so common browsers can fetch the new scheme resources. A version of the Internet Archive Wayback Machine [http://www.archive.org/web/web.php] expanded to accommodate the ARC Record serial number field can be imagined (currently, if a date is not specified in a Wayback Machine query, the latest is returned, and a '*' for date returns all; the same could be the case for the extra serial number field). Another possible wayback-like mapping scheme would rewrite an ari as a fetchable URI as follows. Given a gateway named ari.gw.com with a DNS wildcard record in place such that any *.ari.gw.com resolves to ari.gw.com, and given an ari of ari:20040724100457;1FE;http://www.archive.org:80/index.html, the mapping rewrite would produce: http://www.archive.org.80.20040724100457.1FE.ari.gw.com/index.html

Rewriting the domain portion of the ari URI component removes the need for the gateway to rewrite all but absolute links present in the ari record. Here is another example using the same gateway and the long ari given in ari example no. 2 above: http://www.dh.gov.uk.20040728195737.1FF.ari.gw.com/PolicyAndGuidance/OrganisationPolicy/Modernisation/NHSPlan/fs/en?CONTENT_ID=4082690&chk=/DU1UD

Such a gateway would work best for programs. A REST-like webservice for delivering ARC Record content might look like this. Humans will prefer the Wayback query interface.

6. Only ARC Records of the proposed version 3.0 can be addressed using aris (Though we might allow that an ari with an empty serial number means pre-3.0 addresses).

7. Here are some criticisms of the ari scheme:


a. There will be a tendency to read an ari as "The state of resource X at time Y". For example, the example no. 1 ari address from above can be read as "The state of the index.html page on archive.org on 07/24/2004 10:04:57". While this reading holds for example no. 1, it falls down when an ARC Record's URL is itself an ari (see the samples below where we write about metadata and linkages to content).

b. Aris are cumbersome, particularly in their redundant URL data. Having the URL in the ari is like putting a picture of the addressed building on an envelope alongside the address so the postman can say, "yep, that's it", when she arrives. The argument for carrying the date and URL in the identifier is that, without having to resort to metadata and even if intermediary records are lost, it is still possible to find any referenced records by querying with portions of the identifier. If we relaxed this somewhat and instead said that an ARC parser can rely on the ARC Record metadata line content rather than on the ari URI alone, then adding the referred-to URL as a new field ('Location' or 'Reference') would still let an ARC parser reconcile records, and aris would be more amenable. If we did this, the aris cited as examples above would look like ari:20040724100457;1FE and ari:20040728195737;1FF. If we did this, we might as well use The ARK Persistent Identifier Scheme [http://www.ietf.org/internet-drafts/draft-kunze-ark-08.txt] (Problem: using such a terse ari or ARK, how would we refer to legacy records in pre-3.0 ARCs?).
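The gateway rewrite described in item 5 of the list above can be sketched in Python as follows. This is only one reading of the mapping: the function name `ari_to_gateway_url` is hypothetical, ari.gw.com is the placeholder gateway from the text, and the sketch assumes the port label is emitted only when the ari's URL carries an explicit port. Note that the original URL's scheme is not carried through the rewrite.

```python
from urllib.parse import urlsplit

GATEWAY = "ari.gw.com"  # placeholder gateway host used in the text


def ari_to_gateway_url(ari, gateway=GATEWAY):
    """Rewrite ari:<date>;<serial>;<url> as a fetchable gateway URL."""
    prefix = "ari:"
    if not ari.startswith(prefix):
        raise ValueError("not an ari: %r" % ari)
    date, serial, url = ari[len(prefix):].split(";", 2)
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("ari URL has no host to rewrite: %r" % url)
    # Host labels come first so that relative links in the served page
    # resolve against the same rewritten host.
    labels = [parts.hostname]
    if parts.port is not None:
        labels.append(str(parts.port))
    labels += [date, serial, gateway]
    rewritten = "http://" + ".".join(labels) + parts.path
    if parts.query:
        rewritten += "?" + parts.query
    return rewritten


rewritten = ari_to_gateway_url(
    "ari:20040724100457;1FE;http://www.archive.org:80/index.html")
# rewritten == "http://www.archive.org.80.20040724100457.1FE.ari.gw.com/index.html"
```

An ari whose URL part is itself an ari has no host to rewrite, so the sketch rejects it; how a gateway should expose such nested records is not addressed in the text.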

3. ARC file format changes

3.1. ARC Record Metadata Line

Each ARC Record is introduced by a single line of the following format for version 1 ARC files (From [Arc File Format] [http://www.archive.org/web/researcher/ArcFileFormat.php]):

Version 2 was never used but the spec. had the line looking like this:

This proposal for revision 3.0 is for an ARC Record metadata line that looks like:

The added serial number is needed for composing the ari address of a record.

The checksum will include its type, as in urn:sha1:5RT35RT35RT35RT35RT35RT35RT35RT3. What has been hashed will be described in the first record of the ARC, in its metadata: e.g. content only, or content plus request headers, etc.

4. ARC Record Metadata (IIPC Archival Data Format Requirement 2.4)

ARC Record metadata will be written as distinct new ARC Records. There may be multiple metadata instances for any particular ARC Record instance (and even metadata about metadata); each will be written as a new ARC Record.

The metadata format is not specified in this proposal. Metadata specification is considered out of scope. In scope is where to add metadata and how it is associated with the described content.


It is entertained that metadata may be written RFC 822 [http://www.faqs.org/rfcs/rfc822.html]-header style, using only ASCII characters with long lines continued on the next line begun with a space -- See Section 3.1 -- but it could be done as RDF (Resource Description Framework) [http://www.w3.org/RDF/]. "The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web. It is particularly intended for representing metadata about Web resources, such as the title, author, and modification date of a Web page, copyright and licensing information about a Web document, or the availability schedule for some shared resource."

Metadata could include:

• Referrer page.

• List of the pages' outbound links.

• The certificate volunteered by the server if communication done over SSL (https) or authentication information used logging in if page required authentication.

• Content hash with specification of the hashing mechanism used either by prefix or in a separate distinct element.

• The SSL certificate received setting up a secure connection.
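As one illustration of the RFC 822 header style entertained above, the following Python sketch parses 'Name: value' lines in which a line beginning with a space continues the previous value. The field names in `SAMPLE` (Referrer, Outlink, Content-Hash, Operator-Note) are invented for this example; the proposal does not specify a metadata vocabulary.

```python
def parse_header_metadata(text):
    """Parse RFC 822-style 'Name: value' lines into (name, value) pairs.

    A line that begins with a space or tab continues the previous value.
    Repeated names are kept, so a page can list several outlinks.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        if line[0] in " \t":
            if not pairs:
                raise ValueError("continuation line with nothing to continue")
            name, value = pairs[-1]
            pairs[-1] = (name, value + " " + line.strip())
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError("not a header line: %r" % line)
            pairs.append((name.strip(), value.strip()))
    return pairs


# Hypothetical field names, for illustration only.
SAMPLE = (
    "Referrer: http://www.archive.org/\n"
    "Outlink: http://www.archive.org/about/\n"
    "Content-Hash: urn:sha1:5RT35RT35RT35RT35RT35RT35RT35RT3\n"
    "Operator-Note: recrawled after the site came back\n"
    " from maintenance\n"
)

pairs = parse_header_metadata(SAMPLE)
```

Binary items such as an SSL certificate would not fit this form without encoding, which is one practical argument for keeping the metadata format open.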

While metadata will often contain the address of the content described (e.g. in the RDF about attribute), it is proposed that the linking of metadata and content described be done on the ARC Record metadata line, so that it is not necessary for an ARC parser to interpret metadata when reconciling content and its metadata. The proposed technique is to have the metadata ARC Record metadata line URL point at the ARC Record for the content described. For example, given an ARC Record:

    http://www.archive.org/index.html 201.201.201.111 20040724100457 text/html 1FE 3043
    HTTP/1.1 200 OK
    Date: Tue, 03 Aug 2004 01:26:42 GMT
    Server: Apache/2.0.47 (Unix) mod_ssl/2.0.47 OpenSSL/0.9.7c PHP/4.3.4
    ...

Its metadata would look like this:

ari:20040724100457;1FE;http://www.archive.org/index.html 192.168.201.167 20040724100888 application/rdf+xml;arcrecordtype=metadata 1FF 5096 ...

Subsequent metadata records that describe the same content will be distinguished by date and serial no.

A perhaps absurd example, metadata about the metadata record above, would have an ARC Record metadata line URL of ari:20040724100888;1FF;ari:20040724100457;1FE;http://www.archive.org/index.html, and so on.
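The linking technique above can be sketched in Python. The sketch assumes the field order seen in the example lines (URL, IP, date, mimetype, serial, length) and does not model a checksum field; the names `RecordLine`, `parse_record_line`, `ari_of` and `link_records` are hypothetical.

```python
from collections import namedtuple

RecordLine = namedtuple("RecordLine", "url ip date mimetype serial length")


def parse_record_line(line):
    """Parse a metadata line laid out as in the examples above."""
    # Split from the right so the five trailing fields are taken first.
    url, ip, date, mimetype, serial, length = line.rsplit(" ", 5)
    return RecordLine(url, ip, date, mimetype, serial, int(length))


def ari_of(record):
    """Compose the ari that addresses this record."""
    return "ari:%s;%s;%s" % (record.date, record.serial, record.url)


def link_records(lines):
    """Map each record's ari to the aris of the records that point at it."""
    records = [parse_record_line(line) for line in lines]
    links = {ari_of(r): [] for r in records}
    for r in records:
        if r.url.startswith("ari:"):
            # setdefault keeps the link even if the target record is missing.
            links.setdefault(r.url, []).append(ari_of(r))
    return links


content = ("http://www.archive.org/index.html 201.201.201.111 "
           "20040724100457 text/html 1FE 3043")
metadata = ("ari:20040724100457;1FE;http://www.archive.org/index.html "
            "192.168.201.167 20040724100888 "
            "application/rdf+xml;arcrecordtype=metadata 1FF 5096")
links = link_records([content, metadata])
```

Nothing here reads the record bodies: the association is made from metadata lines alone, which is the point of the proposed technique.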

Resolution of record type and their interlinking is not dependent on the presence or content of metadata.

4.1. Metadata ARC Record Types

Metadata ARC Records are distinguished by their ari URL and mimetype. For example, application/rdf+xml [http://www.aaronsw.com/2002/rdf-mediatype.html] would be the mimetype for RDF metadata and (though a slight perversion) text/rfc822-headers [http://www.faqs.org/rfcs/rfc1892.html] for RFC 822 [http://www.faqs.org/rfcs/rfc822.html]-type metadata.
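A parser might tell record kinds apart from the metadata line alone, along the following lines. This Python sketch is an assumption-laden reading, not a specified algorithm: the names `split_mimetype` and `classify` are hypothetical, the `arcrecordtype` parameter is the qualifier seen in the metadata example earlier, and the `message/http` branch anticipates the request records of Section 5.

```python
METADATA_TYPES = {"application/rdf+xml", "text/rfc822-headers"}


def split_mimetype(mimetype):
    """Split 'type/subtype;name=value;...' into (base type, parameters)."""
    base, *params = [part.strip() for part in mimetype.split(";")]
    parsed = {}
    for param in params:
        if not param:
            continue  # tolerate a trailing semicolon
        name, _, value = param.partition("=")
        parsed[name.strip().lower()] = value.strip()
    return base.lower(), parsed


def classify(url, mimetype):
    """Guess the record kind from the metadata line's URL and mimetype."""
    base, params = split_mimetype(mimetype)
    explicit = params.get("arcrecordtype")
    if explicit:
        return explicit  # an explicit qualifier wins
    if base == "message/http":
        return params.get("msgtype", "http-message")
    if url.startswith("ari:") and base in METADATA_TYPES:
        return "metadata"
    if url.startswith("ari:"):
        return "other"  # points at a record, but not recognisably metadata
    return "content"


kind = classify("ari:20040724100457;1FE;http://www.archive.org/index.html",
                "application/rdf+xml")
```

The "other" bucket shows why an explicit qualifier may be worth requiring: without it, some ari-addressed records cannot be told apart without reading their bodies.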


An invented, optional qualifier, arcrecordtype, may be required (TBD). Its optional presence might save parsers having to dip into the metadata to distinguish certain like-looking types (more on these below). Possible values would include metadata, duplicate, and transform. For example, a metadata record might have the following type: application/rdf+xml;arcrecordtype=metadata.

5. Recording of the Complete Request (IIPC Archival Data Format Requirement 2.6)

It is proposed that the full request be recorded in a distinct ARC Record, just as we propose doing for metadata. Linking of requests and the content fetched will be done as described above for metadata. Requests will be distinguished by the message/http mimetype (see "19.1 Internet Media Type message/http and application/http" of RFC2616 (HTTP) [http://www.faqs.org/rfcs/rfc2616.html]). The RFC allows msgtype qualifiers, so we can actually write message/http;msgtype=request as the mimetype on requests (later in this proposal we describe a case where we'll want to make use of the sister message/http;msgtype=response mimetype).

5.1. Discussion

Requests (and responses) cannot be recorded easily INSIDE of metadata. If the metadata form is XML then, to guard against XML-illegal characters appearing in the request, the request headers would need to be encoded/escaped or the request data would need to be BASE64 encoded, a transformation done on the original byte stream. Both schemes violate the requirement that we record the original bytestream. While it might be possible to do the headers in RFC 822 [http://www.faqs.org/rfcs/rfc822.html]-type metadata, it seems easier recording the raw requests as their own ARC Record of their own mimetype.

6. Duplicate Reduction (IIPC Archival Data Format Requirement 2.13)

6.1. Use Case

A number of websites are harvested on a daily basis, and the majority of pages remain unchanged between successive harvests. To save storage space, it should not be necessary to save unchanged pages. It must, however, be possible to record that the page was visited and verified identical to the existing version in the archive.

6.2. Proposed Implementation

Recording a pointer to content already extant in the archive will be done by recording the current request and response into distinct ARC Records -- as a record of the attempted fetch -- with a metadata ARC Record that has a pointer back to the archive record we want to avoid duplicating. The below illustration uses pseudo RDF as the metadata format.

The first time a page is visited 3 records are created:

• The response (Currently response headers followed by the content):

    http://www.archive.org/index.html 201.201.201.111 20040724100457 text/html 1FE 3043
    HTTP/1.1 200 OK
    Date: Tue, 03 Aug 2004 01:26:42 GMT
    Server: Apache/2.0.47 (Unix) mod_ssl/2.0.47 OpenSSL/0.9.7c PHP/4.3.4
    ...


• The metadata:

    ari:20040724100457;1FE;http://www.archive.org/index.html 192.168.201.167 20040724100888 application/rdf+xml;arcrecordtype=metadata 1FF 5096
    ...

• The request:

    ari:20040724100457;1FE;http://www.archive.org/index.html 192.168.201.167 20040724100999 message/http;msgtype=request 200 3333
    200 OK ...

The next time the page is visited, AND the crawler determines the page already extant in the repository, 3 records are created:

• The response only (No content. Note special response mimetype):

    http://www.archive.org/index.html 201.201.201.111 20040724104444 message/http;msgtype=response 210 3043
    HTTP/1.1 200 OK
    Date: Tue, 03 Aug 2004 01:26:42 GMT
    Server: Apache/2.0.47 (Unix) mod_ssl/2.0.47 OpenSSL/0.9.7c PHP/4.3.4
    ...

• The metadata, including a description that this is a pointer to a duplicate:

    ari:20040724104444;210;http://www.archive.org/index.html 192.168.201.167 20040724105555 application/rdf+xml;arcrecordtype=duplicate 211 5096
    ... http://www.archive.org/index.html ...

• The request:

    ari:20040724104444;210;http://www.archive.org/index.html 192.168.201.167 20040724106666 message/http;msgtype=request 212 3333
    200 OK ...
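The decision a writer makes between the two cases above might be sketched as follows in Python. How a crawler determines that a page is already extant is not specified in this proposal; the sketch assumes a comparison of content digests, written in the urn:sha1 form used for checksums earlier (Base32 of the SHA-1 digest is an assumption suggested by the 32-character example). The names `sha1_urn` and `plan_records` and the in-memory `seen` index are hypothetical.

```python
import base64
import hashlib


def sha1_urn(content):
    """Return a typed digest such as urn:sha1:<32 Base32 characters>."""
    digest = hashlib.sha1(content).digest()
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii")


def plan_records(url, date, serial, mimetype, content, seen):
    """Decide what to write for one fetch.

    `seen` maps (url, digest) to the ari of the copy already stored.
    Returns (response mimetype, metadata arcrecordtype, ari of the
    earlier copy or None).
    """
    key = (url, sha1_urn(content))
    original = seen.get(key)
    if original is None:
        # First sighting: store the full response and remember its ari.
        seen[key] = "ari:%s;%s;%s" % (date, serial, url)
        return mimetype, "metadata", None
    # Already archived: headers only, plus metadata pointing at the original.
    return "message/http;msgtype=response", "duplicate", original


seen = {}
page = b"<html>unchanged</html>"
first = plan_records("http://www.archive.org/index.html",
                     "20040724100457", "1FE", "text/html", page, seen)
second = plan_records("http://www.archive.org/index.html",
                      "20040724104444", "210", "text/html", page, seen)
```

In both cases the request is still recorded in its own ARC Record, so the sketch only has to choose the response mimetype and the metadata record type.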

7. Format Transformations (IIPC Archival Data Format Requirement 2.11)


7.1. Use Case

A document is harvested in a proprietary text processing format. Support and development of the text processing application stops. In order to maintain the ability to read the information in the stored document, a transformation program is used to create a copy of the document in a supported document format.

7.2. Transformation Attributes

Transformations should result in a freestanding, complete record. To guard against the transformed record becoming unusable on loss of the original, there should be no dependency on the original record surviving.

7.3. Proposed Implementation

Here is how transformations will work using the metadata vocabulary proposed above. The below illustration uses pseudo RDF as the metadata format.

• The initial version:

    http://www.archive.org/mydocument.doc 201.201.201.111 20040724100457 application/msword 1FE 3043
    HTTP/1.1 200 OK
    Date: Tue, 03 Aug 2004 01:26:42 GMT
    Server: Apache/2.0.47 (Unix) mod_ssl/2.0.47 OpenSSL/0.9.7c PHP/4.3.4
    ...

The transformation process creates 2 new records:

• The transformed version:

    ari:20040724100457;1FE;http://www.archive.org/mydocument.doc 192.168.201.143 20040724100888 application/openoffice.org 200 1024
    ...

• Metadata on the transformation:

    ari:20040724100888;200;ari:20040724100457;1FE;http://www.archive.org/mydocument.doc 192.168.201.143 20040724109999 application/rdf+xml;arcrecordtype=transform 211 5096
    ... ari:20040724100457;C0A8C9A7.1FE;http://www.archive.org/mydocument.doc ...
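Because each record's URL embeds the ari of the record it derives from, an indexer can recover the whole lineage from a metadata line without consulting the records in between. A minimal Python sketch of that unwinding follows; the function name `ari_chain` is hypothetical.

```python
def ari_chain(url):
    """Unwind a possibly nested ari.

    Returns the (date, serial) pairs from outermost to innermost, followed
    by the innermost non-ari URL.
    """
    hops = []
    prefix = "ari:"
    while url.startswith(prefix):
        # Split on the first two semicolons; the remainder is the next URL,
        # which may itself be an ari.
        date, serial, url = url[len(prefix):].split(";", 2)
        hops.append((date, serial))
    return hops, url


hops, root = ari_chain(
    "ari:20040724100888;200;ari:20040724100457;1FE;"
    "http://www.archive.org/mydocument.doc")
# hops == [("20040724100888", "200"), ("20040724100457", "1FE")]
# root == "http://www.archive.org/mydocument.doc"
```

Grouping records by `root` would gather the original, its transforms, and their metadata together, even when the original record itself has been lost.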

8. For Consideration

8.1. Up the default ARC file size

Currently the default ARC file size is 100 MB. We should consider upping the default size to 500 MB or to 600 MB (CD size). Disks are bigger now, and a larger default would mean fewer files to sling.


8.2. ARC writers should record GZIP length in custom GZIP header 'extra' header

ARC writers will need to be amended to write Revision 3.0 arcs. ARC readers will need to be amended to parse the new ARC format. While revisiting both, it has been proposed by Tom Emerson that we add an optional gzip header 'extra' field that has in it the GZIP'd record length, to facilitate skipping over ARC Records. See GZIPed Records: Stashing the Length [http://www.dreamersrealm.net/tree/blog/2004/07/18/#gzip_size_proposal].

9. Miscellaneous

9.1. av_* toolset

Notes on changes needed in the av_* toolset to accommodate new scheme types and the new ARC Record metadata line as well as the new arc size.

9.2. GZIPping of ARC Records

TODO: Describe the way arcs are gzipped; that they are not one big gzip member, but a member per record.

9.3. Recording unfetched content

Describe the Los Alamos Labs use case where an ARC Record was not the result of fetching -- i.e. no request headers and the mimetype describes exactly what follows (Los Alamos Labs: See Message 530 [http://groups.yahoo.com/group/archive-crawler/message/530]) -- and how aris could be used recording these record types.

9.4. ARC File EBNF

TODO: Include formal EBNF description of ARC (To byte level).
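Returning to the gzip matters raised in Sections 8.2 and 9.2, the following Python sketch shows one gzip member per record, with each member's total length stashed in the gzip header's 'extra' field so a reader can hop from record to record without inflating anything. It is a sketch under assumptions, not the layout from the cited proposal: the two-byte subfield id `LX` is invented, and recording the whole member length (header, compressed body and trailer) is one possible interpretation of "GZIP'd record length".

```python
import gzip
import struct
import zlib

SUBFIELD_ID = b"LX"  # hypothetical subfield id, not taken from any spec
FEXTRA = 0x04        # gzip header flag: an 'extra' field is present


def gzip_member(data):
    """Return one gzip member whose extra field holds its own total length."""
    deflater = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = deflater.compress(data) + deflater.flush()
    # 10-byte fixed header + 2-byte XLEN + 4-byte subfield header + 4-byte
    # length value, then the deflate body and the 8-byte CRC32/ISIZE trailer.
    total = 10 + 2 + 4 + 4 + len(body) + 8
    fixed = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    extra = SUBFIELD_ID + struct.pack("<HI", 4, total)
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return fixed + struct.pack("<H", len(extra)) + extra + body + trailer


def member_length(buf, offset=0):
    """Read the stashed member length from the gzip header at `offset`."""
    id1, id2, _method, flags = struct.unpack_from("<BBBB", buf, offset)
    if (id1, id2) != (0x1F, 0x8B) or not flags & FEXTRA:
        raise ValueError("no gzip extra field at offset %d" % offset)
    (xlen,) = struct.unpack_from("<H", buf, offset + 10)
    pos, end = offset + 12, offset + 12 + xlen
    while pos + 4 <= end:
        (sub_len,) = struct.unpack_from("<H", buf, pos + 2)
        if buf[pos:pos + 2] == SUBFIELD_ID:
            (total,) = struct.unpack_from("<I", buf, pos + 4)
            return total
        pos += 4 + sub_len
    raise ValueError("length subfield not found")


def member_offsets(buf):
    """List the start offset of every member by hopping header to header."""
    offsets, pos = [], 0
    while pos < len(buf):
        offsets.append(pos)
        pos += member_length(buf, pos)
    return offsets


records = [b"first record\n", b"second record, a bit longer\n", b"third\n"]
arc = b"".join(gzip_member(record) for record in records)
offsets = member_offsets(arc)
```

Because each record is a complete gzip member, a standard gzip reader still inflates the concatenation as one stream and ignores the extra field, while a length-aware reader can seek straight to any record.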
