ARC File Revision 3.0 Proposal
Steen Christensen, Det Kongelige Bibliotek <ssc at kb dot dk>
Michael Stack, Internet Archive <stack at archive dot org>
Edited by Michael Stack

Revision History

Revision 1, 09/09/2004: Initial conversion of the wiki working document [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionProposal] to docbook. Added edits suggested by Gordon Mohr (other suggested edits are still up for consideration). This revision is being submitted to the IIPC Framework Group for review at their London meeting of 09/20/2004.

Table of Contents

1. Introduction
    1.1. IIPC Archival Data Format Requirements
    1.2. Input
    1.3. Scope
    1.4. Acronyms, Abbreviations and Definitions
2. ARC Record Addressing
    2.1. Reference
    2.2. The ari Scheme
    2.3. Discussion
3. ARC file format changes
    3.1. ARC Record Metadata Line
4. ARC Record Metadata (IIPC Archival Data Format Requirement 2.4)
    4.1. Metadata ARC Record Types
5. Recording of the Complete Request (IIPC Archival Data Format Requirement 2.6)
    5.1. Discussion
6. Duplicate Reduction (IIPC Archival Data Format Requirement 2.13)
    6.1. Use Case
    6.2. Proposed Implementation
7. Format Transformations (IIPC Archival Data Format Requirement 2.11)
    7.1. Use Case
    7.2. Transformation Attributes
    7.3. Proposed Implementation
8. For Consideration
    8.1. Up the default ARC file size
    8.2. ARC writers should record GZIP length in custom GZIP header 'extra' header
9. Miscellaneous
    9.1. av_* toolset
    9.2. GZIPping of ARC Records
    9.3. Recording unfetched content
    9.4. ARC File EBNF

Abstract

Proposed Revision to the Internet Archive ARC file format to add metadata, recording of the fetching request and support for content transforms.

1. Introduction

This document proposes a set of changes to the Internet Archive (IA) ARC file format that directly address requirements drawn up by the International Internet Preservation Consortium (IIPC) [http://www.netpreserve.org/]. In the main, the IIPC requirements call for the ARC file to support the writing of content metadata, the recording of the request made when fetching content, and support for content transformations. The IA, the formulator of the current ARC file format, is a member of the IIPC and participated in the development of the IIPC Archival Data Format requirements.

1.1. IIPC Archival Data Format Requirements

Below are listed the key requirements from Section 2 of IIPC Archival Data Format Requirements [http://netarkivet.dk/website/publications/Archival_format_requirements-2004.pdf] (TODO: This links to a copy.
Update):

• 2.1 Open Archival Information System (OAIS) [http://www.rlg.org/longterm/oais.html] compatible
• 2.3 The format must support all Internet protocols
• 2.4 The format must support metadata
• 2.5 Data integrity must be easy to verify and maintain
• 2.6 It must be possible to retrieve the original bitstream (request and response)
• 2.11 Support format transformations
• 2.13 Support duplicate reduction
• 2.14 The format should be efficient

1.1.1. Other "Requirements"

Riders on the above IIPC requirements listing that the IA wants the ARC revision to support include:

• Recording of metadata to support writing of arbitrary crawl-time metadata such as operator journal notes.
• Recording of response SSL certificates and of the authentication credentials used when logging into a site.

1.2. Input

Below we list key documents that fed the development of this proposal:

• Report covering Discussion of ArcRevision Proposals [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionCopenhagenDiscussion], June 10th, 2004 at the Heritrix Copenhagen Workshop. A listing of proposed ARC changes was presented. This document summarizes what came of the ensuing discussion.
• ARC file versions 1.0 and 2.0 are described in Arc File Format [http://www.archive.org/web/researcher/ArcFileFormat.php]. Version 2.0 was never implemented. An amendment to ARC file version 1.0 (1.1) adding XML metadata to the head of the ARC is described in Internet Archive ARC files [http://crawler.archive.org/articles/developer_manual.html#arcs]. Heritrix, currently at version 1.0.0, writes ARC 1.1 files. The Alexa crawler, the harvester responsible for the bulk of the archive.org repository, writes ARC 1.0 files.
• Open Archival Information System (OAIS) Resources [http://www.rlg.org/longterm/oais.html], a “...conceptual framework for an archival system dedicated to preserving and maintaining access to digital information over the long term”, informs the IIPC Archival Data Format Requirements document.

1.3. Scope

1.3.1. Suggested Timeline

Finished proposal: 09/2004, in time for the London IIPC meeting. Review and edit over the winter. Implementation, at least by IA, in the first quarter of 2005.

1.3.2. Key Stakeholders

• International Internet Preservation Consortium (IIPC) [http://netpreserve.org/]
• Internet Archive (IA) [http://archive.org/]

1.3.3. Other Stakeholders

Users, other than the Internet Archive, of Heritrix [http://crawler.archive.org/], an open-source crawler that writes what it fetches as ARCs.

1.3.4. Constraints

The constraints below are copied forward from the ArcRevision [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevision] Copenhagen Discussion document:

• Any revision must consider the 400 terabytes of version 1 legacy Internet Archive ARCs stored in San Francisco with a (dated) copy in Alexandria, Egypt and a recent Amsterdam (IA Europe) copy.
• The proprietary software used to manipulate and play back this ARC legacy, the av_* tools and the Wayback Machine [http://web.archive.org/collections/web.html], is not easily changed.

1.4. Acronyms, Abbreviations and Definitions

1.4.1. ARC Record

An ARC file is made up of concatenated ARC Records. An ARC Record begins with a single line, known as the URL-record or ARC Record metadata line. Here is the ARC Revision 1.0 [http://www.archive.org/web/researcher/ArcFileFormat.php] definition of the ARC Record metadata line:

    <URL> <IP-address> <Archive-date> <Content-type> <Archive-length><nl>

This ARC Record metadata line is immediately followed by the raw, unadulterated content byte-stream, usually HTTP response headers, an empty line, and then the requested page.
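As an illustration only, the record layout just described (one metadata line of five space-separated fields, then <Archive-length> bytes of content) can be sketched in Python. This is a minimal sketch and not the reference reader: the names ArcRecord and read_arc_record are hypothetical, the sample values are made up, and the sketch ignores anything else a real ARC file contains (such as a leading file-description record, any separator between records, or GZIP compression).

```python
import io
from typing import BinaryIO, NamedTuple, Optional


class ArcRecord(NamedTuple):
    """Hypothetical container for one ARC 1.0 record (illustration only)."""
    url: str
    ip_address: str
    archive_date: str  # kept as the raw string; no date format is assumed
    content_type: str
    length: int        # number of content bytes following the metadata line
    content: bytes


def read_arc_record(stream: BinaryIO) -> Optional[ArcRecord]:
    """Read one metadata line, then exactly `length` bytes of content."""
    line = stream.readline()
    if not line:
        return None  # end of stream
    text = line.rstrip(b"\r\n").decode("utf-8", errors="replace")
    fields = text.split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    url, ip_address, archive_date, content_type, length_text = fields
    length = int(length_text)
    content = stream.read(length)
    if len(content) != length:
        raise ValueError("record content is shorter than Archive-length")
    return ArcRecord(url, ip_address, archive_date, content_type, length, content)


# Made-up sample record: the URL, IP address and date are placeholders.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
meta = f"http://example.org/ 192.0.2.1 20040909120000 text/html {len(body)}\n"
record = read_arc_record(io.BytesIO(meta.encode("ascii") + body))
```

Because the content length is carried in the metadata line, a reader never has to scan the content for a terminator; it can copy or skip the content bytes untouched, which is what keeps the recorded byte-stream unadulterated.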
An ARC Record is the ARC Record metadata line plus recorded content.

2. ARC Record Addressing

There is a need for uniquely addressing individual ARC Records. For example, this proposal talks of being able to record a pointer to an extant Archive ARC Record in place of content if archiving software determines that the content is already present in the archive. Such an ARC Record pointer mechanism would take the address of an ARC Record. An ARC Record address would also be