ARC File Revision 3.0 Proposal


Steen Christensen, Det Kongelige Bibliotek <ssc at kb dot dk>
Michael Stack, Internet Archive <stack at archive dot org>
Edited by Michael Stack

Revision History

Revision 1, 09/09/2004: Initial conversion of the wiki working doc [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionProposal] to docbook. Added edits suggested by Gordon Mohr (other suggested edits are still up for consideration). This revision is the one being submitted to the IIPC Framework Group for review at their London meeting, 09/20/2004.

Table of Contents

1. Introduction
  1.1. IIPC Archival Data Format Requirements
  1.2. Input
  1.3. Scope
  1.4. Acronyms, Abbreviations and Definitions
2. ARC Record Addressing
  2.1. Reference
  2.2. The ari Scheme
  2.3. Discussion
3. ARC file format changes
  3.1. ARC Record Metadata Line
4. ARC Record Metadata (IIPC Archival Data Format Requirement 2.4)
  4.1. Metadata ARC Record Types
5. Recording of the Complete Request (IIPC Archival Data Format Requirement 2.6)
  5.1. Discussion
6. Duplicate Reduction (IIPC Archival Data Format Requirement 2.13)
  6.1. Use Case
  6.2. Proposed Implementation
7. Format Transformations (IIPC Archival Data Format Requirement 2.11)
  7.1. Use Case
  7.2. Transformation Attributes
  7.3. Proposed Implementation
8. For Consideration
  8.1. Up the default ARC file size
  8.2. ARC writers should record GZIP length in custom GZIP header 'extra' header
9. Miscellaneous
  9.1. av_* toolset
  9.2. GZIPping of ARC Records
  9.3. Recording unfetched content
  9.4. ARC File EBNF

Abstract

Proposed revision to the Internet Archive ARC file format to add metadata, recording of the fetching request, and support for content transforms.

1. Introduction

This document proposes a set of changes to the Internet Archive (IA) ARC file format that directly address requirements drawn up by the International Internet Preservation Consortium (IIPC) [http://www.netpreserve.org/]. In the main, the IIPC requirements call for the ARC file to support the writing of content metadata, the recording of the request made when fetching content, and support for content transformations. The IA, the formulator of the current ARC file format, is a member of the IIPC and participated in the development of the IIPC Archival Data Format requirements.

1.1. IIPC Archival Data Format Requirements

Below are listed the key requirements from Section 2 of IIPC Archival Data Format Requirements [http://netarkivet.dk/website/publications/Archival_format_requirements-2004.pdf] (TODO: This links to a copy.
Update):

• 2.1 Open Archival Information System (OAIS) [http://www.rlg.org/longterm/oais.html] compatible
• 2.3 The format must support all Internet protocols
• 2.4 The format must support metadata
• 2.5 Data integrity must be easy to verify and maintain
• 2.6 It must be possible to retrieve the original bitstream (request and response)
• 2.11 Support format transformations
• 2.13 Support duplicate reduction
• 2.14 The format should be efficient

1.1.1. Other "Requirements"

Riders on the above IIPC requirements listing that the IA wants the ARC revision to support include:

• Recording of metadata to support writing of arbitrary crawl-time metadata such as operator journal notes.
• Recording of response SSL certificates and of the authentication credentials used to log into a site.

1.2. Input

Below we list key documents that fed the development of this proposal:

• Report covering Discussion of ArcRevision Proposals [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevisionCopenhagenDiscussion], June 10th, 2004 at the Heritrix Copenhagen Workshop. A listing of proposed ARC changes was presented. This document summarizes what came of the ensuing discussion.
• ARC file versions 1.0 and 2.0 are described in Arc File Format [http://www.archive.org/web/researcher/ArcFileFormat.php]. Version 2.0 was never implemented. An amendment to ARC file version 1.0 (1.1) that adds XML metadata to the head of the ARC is described in Internet Archive ARC files [http://crawler.archive.org/articles/developer_manual.html#arcs]. Heritrix, currently at version 1.0.0, writes ARC 1.1 files. The Alexa crawler, the harvester responsible for the bulk of the archive.org repository, writes ARC 1.0 files.
• Open Archival Information System (OAIS) Resources [http://www.rlg.org/longterm/oais.html], a “...conceptual framework for an archival system dedicated to preserving and maintaining access to digital information over the long term”, informs the IIPC Archival Data Format Requirements document.

1.3. Scope

1.3.1. Suggested Timeline

Finished proposal: 09/2004, in time for the London IIPC meeting. Review and edit over the winter. Implementation, at least by IA, in the first quarter of 2005.

1.3.2. Key Stakeholders

• International Internet Preservation Consortium (IIPC) [http://netpreserve.org/]
• Internet Archive (IA) [http://archive.org/]

1.3.3. Other Stakeholders

Users, other than the Internet Archive, of Heritrix [http://crawler.archive.org/], an open-source crawler that writes its fetchings as ARCs.

1.3.4. Constraints

The following constraints are copied forward from the ArcRevision [http://crawler.archive.org/cgi-bin/wiki.pl?ArcRevision] Copenhagen Discussion document:

• Any revision must consider the 400 terabytes of version 1 legacy Internet Archive ARCs stored in San Francisco, with a (dated) copy in Alexandria, Egypt and a recent Amsterdam (IA Europe) copy.
• The proprietary software used for manipulating and playing back this ARC legacy, the av_* tools and the Wayback Machine [http://web.archive.org/collections/web.html], is not easily changed.

1.4. Acronyms, Abbreviations and Definitions

1.4.1. ARC Record

An ARC file is made up of concatenated ARC Records. An ARC Record begins with a single line, known as the URL-record or ARC Record metadata line. Here is the ARC Revision 1.0 [http://www.archive.org/web/researcher/ArcFileFormat.php] definition of the ARC Record metadata line:

  <URL> <IP-address> <Archive-date> <Content-type> <Archive-length><nl>

This ARC Record metadata line is immediately followed by the raw, unadulterated content byte-stream, usually HTTP response headers, an empty line, and then the requested page.
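To make the layout of the version 1.0 metadata line concrete, here is a minimal Python sketch that splits such a line into its five fields and then reads the content that follows it. The names ArcRecordHeader, parse_metadata_line and read_record are hypothetical and exist only for this illustration. The sketch assumes that the fields are separated by single spaces, that Archive-length is the number of content bytes following the line, and that the date can be carried as an opaque string; it does not handle the file-level version block or any separators between records, so it should be read as an outline rather than a reference reader.

```python
import io
from typing import BinaryIO, NamedTuple, Optional, Tuple


class ArcRecordHeader(NamedTuple):
    """The five fields of a version 1.0 ARC Record metadata line (hypothetical type)."""
    url: str
    ip_address: str
    archive_date: str    # kept as an opaque string; no date format is assumed here
    content_type: str
    archive_length: int  # assumed: count of content bytes after the metadata line


def parse_metadata_line(line: str) -> ArcRecordHeader:
    """Split '<URL> <IP-address> <Archive-date> <Content-type> <Archive-length><nl>'."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 space-separated fields, got {len(fields)}")
    url, ip_address, archive_date, content_type, length = fields
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


def read_record(stream: BinaryIO) -> Optional[Tuple[ArcRecordHeader, bytes]]:
    """Read one metadata line plus its content; return None at end of stream."""
    raw_line = stream.readline()
    if not raw_line:
        return None
    header = parse_metadata_line(raw_line.decode("ascii"))
    content = stream.read(header.archive_length)
    return header, content


# Usage with a made-up record (example.org and 192.0.2.1 are placeholder values).
sample = io.BytesIO(b"http://example.org/ 192.0.2.1 20040909120000 text/html 5\nhello")
record = read_record(sample)
```

Keeping the header as a small immutable tuple mirrors the fixed, positional nature of the version 1.0 line; a revised format with additional or named fields would likely call for a different structure.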
An ARC Record is the ARC Record metadata line plus the recorded content.

2. ARC Record Addressing

There is a need for uniquely addressing individual ARC Records. For example, this proposal talks of being able to record a pointer to an extant ARC Record in place of content if the archiving software determines that the content is already present in the archive. Such an ARC Record pointer mechanism would take the address of an ARC Record. An ARC Record address would also be