
Management of academic multimedia content for long-term access and inter-institutional exchange

Work package 4: Archival strategies / Handling and persistent identification of objects

Glossary

AIP   Archival Information Package
DIP   Dissemination Information Package
IANA  Internet Assigned Numbers Authority
ISO   International Organisation for Standardisation
MIME  Multipurpose Internet Mail Extensions
MPEG  Moving Picture Experts Group
OAIS  Open Archival Information System
PDI   Preservation Description Information
SIP   Submission Information Package
URL   Uniform Resource Locator
URN   Uniform Resource Name

Table of contents

D 4.1 Report on file formats and quality standards of multimedia documents
    Introduction
    General: container format vs. codec
    Video containers
        MPEG-4 Part 14 (MP4)
        Audio Video Interleave (AVI)
        Flash Video (FLV)
        OGG
    Video codecs
        MPEG-4 Part 10 (H.264 / MPEG-4 Advanced Video Coding (MPEG-4 AVC))
        MPEG-4 Part 2 / MPEG-4 Visual (MPEG-4 Advanced Simple Profile)
        Theora
        HuffYUV
    Audio codecs
        MPEG-1 Audio Layer 3 (MP3)
        Advanced Audio Coding (AAC)
        Vorbis
    Summary of formats and supported codecs
    Browser support for formats and codecs
    Streaming
D 4.2 Report on the Swiss community multimedia needs in terms of distribution formats and long-term preservation
    Introduction
    Media distribution
    Archiving
D 4.3 Report on a general strategy allowing long-term preservation of multimedia content
    Introduction
    Open Archival Information System (OAIS)
        Mandatory responsibilities of an OAIS
        Functional entities
        Digital migration
    General strategy
        Goals of a long-term preservation strategy
        Elements of a long-term preservation strategy
        Developing a long-term preservation strategy
    Best practices
        Multimedia quality
        File formats
        Archival vs. distribution
        Persistent identification
References
Annex A: Open Archival Information System and Long-term preservation of electronic document-based information
    Open Archival Information System – reference model (ISO 14721)
        Scope
        Basics of an OAIS
        Types of information packages
        Mandatory responsibilities of an OAIS
        Functional entities
        Sustainability and digital migration
        Types of migrations
    Long-term preservation of electronic document-based information (ISO 18492)
        Scope
        Goals of a long-term preservation strategy
        Elements of a long-term preservation strategy
        Developing a long-term preservation strategy

Table of figures

Figure 1: Containers and supported codecs
Figure 2: Browser video support
Figure 3: Simple model of an Open Archival Information System (OAIS)
Figure 4: Data interpreted using Representation information produces Information
Figure 5: Information package associated to descriptive information

D 4.1 Report on file formats and quality standards of multimedia documents

Introduction

In the world of multimedia content, there is a very large number of formats and codecs. This study concentrates on the most widely used ones and on those best suited to the long-term preservation of multimedia content.

General: container format vs. codec

When speaking of multimedia files, we often say “avi” or “mp4” files. However, “avi” and “mp4” are just container formats that define how to store video and audio data streams inside a file.

A container can store several tracks (usually one video track and one or more audio tracks). The way individual tracks are represented is defined by the respective codec. A codec is an algorithm that encodes and decodes a stream of audio or video data.

Examples of codecs are, among others: H.264 (video), AAC (audio), MP3 (audio).

A track can have metadata attached to it, for example, the resolution and aspect ratio for a video. Containers can also have metadata, with information such as the title, author and date of the video.

Although a container does not define how the audio and video streams themselves are encoded, not all containers support all codecs.
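The relationship between containers, tracks and codecs can be sketched as a small data model. This is only an illustration: the class names, the `SUPPORTED_CODECS` table and the sample metadata are assumptions made for this example, and the table lists only container/codec pairs that are stated elsewhere in this report.

```python
from dataclasses import dataclass, field

# Illustrative subset of container/codec support, taken from this report.
SUPPORTED_CODECS = {
    "MP4": {"H.264", "MPEG-4 ASP", "MP3", "AAC", "Vorbis"},
    "AVI": {"MPEG-4 ASP", "MP3"},
    "F4V": {"H.264", "MP3", "AAC"},
    "OGG": {"Theora", "Vorbis", "FLAC"},
}


@dataclass
class Track:
    kind: str                      # "video" or "audio"
    codec: str                     # how the stream is encoded
    metadata: dict = field(default_factory=dict)


@dataclass
class Container:
    format: str                    # how the tracks are stored in the file
    tracks: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def add_track(self, track):
        # A container only accepts tracks whose codec it supports.
        if track.codec not in SUPPORTED_CODECS[self.format]:
            raise ValueError(f"{self.format} does not support {track.codec}")
        self.tracks.append(track)


# Hypothetical lecture recording: one video track and one audio track.
lecture = Container("MP4", metadata={"title": "Lecture 1"})
lecture.add_track(Track("video", "H.264", {"resolution": "1280x720"}))
lecture.add_track(Track("audio", "AAC"))
```

Trying to add a Theora track to the MP4 container above raises a `ValueError`, which mirrors the point that not every container supports every codec.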

The following sections will list the most widely used video and audio containers, along with the codecs they support.

Video containers

MPEG-4 Part 14 (MP4)

This file format is based upon the international standard ISO/IEC 14496-12:2004 (MPEG-4 Part 12: ISO base media file format). This standard stems directly from Apple’s Quicktime container format (MOV). MP4 (MPEG-4 Part 14) is fundamentally identical to the MOV format, with the addition of formal specifications and MPEG features.

MP4 supports streaming. Besides audio and video data, it can also contain subtitles (MPEG‐4 Timed Text).

Summary

Filename extension: .mp4*
MIME types: video/mp4, audio/mp4, application/mp4
Developed by: ISO
Container for: audio, video, text
Extended from: Quicktime .mov and MPEG-4 Part 12
Standard: ISO/IEC 14496-14

*Non-standard but widely used extensions: .m4a for audio-only files and .m4v for video-only files.

Audio Video Interleave (AVI)

AVI is a format developed by Microsoft that can contain audio and video. It suffers from several limitations, which can be worked around by using additional software or avoided by using more recent container formats. For example, there is no official support for video metadata such as the aspect ratio, and AVI is not intended to contain data encoded with a variable frame rate.

This container format remains widely used as of today.

Summary

Filename extension: .avi
MIME types: video/vnd.avi, video/avi, video/msvideo, video/x-msvideo
Developed by: Microsoft
Container for: audio, video

Flash Video (FLV)

Flash Video is a container format for audiovisual content and is used by the Adobe Flash Player. Flash Video can be embedded in Shockwave Flash files (SWF) and is the most used format for embedded video on the web. It is notably used by sites such as YouTube, Google Video and many news websites.

There are two different file formats specified for Flash Video: FLV and F4V. In FLV, audio and video data are encoded in the same way as in SWF files. F4V is based upon the ISO base media file format (ISO/IEC 14496-12) and its technical specification is publicly available. The format was, however, not registered with the MP4 registration authority. F4V can contain video encoded with H.264, as well as MP3 and AAC audio data and still images (GIF, JPG, PNG).

Flash Video can be viewed using the Flash Player web browser plug-in. It is usually delivered using progressive download over HTTP, and can also be streamed via a streaming server (Flash Media Server, WebORB, …) over the RTMP protocol.

Summary

Filename extensions: .flv, .f4v, .f4p, .f4a, .f4b
MIME types: video/x-flv, video/mp4, video/x-m4v, audio/mp4a-latm, video/, video/, audio/mp4
Developed by: Adobe Systems (originally Macromedia)
Container for: audio, video, text, data
Extended from: ISO/IEC 14496-12 (for F4V, not officially registered)

F4V formats

File extension   MIME type   Description
.f4v             video/mp4   Video
.f4p             video/mp4   Protected video
.f4a             audio/mp4   Audio
.f4b             audio/mp4   Audio book

OGG

OGG is a container format that is free and open. It is maintained by the Xiph.org foundation. The design goal of this format is to provide high quality multimedia content and efficient streaming.

It can contain several audio and video streams, as well as text (for subtitles). It also supports metadata, which is carried in the streams stored inside OGG and is not built into the OGG container itself. This allows the use of any metadata specification.

OGG is free and not restricted by any patents.

Summary

Filename extensions: .ogg, .ogv, .oga, .ogx, .spx
MIME types: video/ogg, audio/ogg, application/ogg
Developed by: Xiph.org foundation
Container for: audio, video, text
Standard: free, open standard maintained by the Xiph.org foundation

Video codecs

MPEG-4 Part 10 (H.264 / MPEG-4 Advanced Video Coding (MPEG-4 AVC))

This codec, known as H.264, MPEG-4 Part 10 or MPEG-4 Advanced Video Coding (MPEG-4 AVC), is a standard developed by the MPEG group or, rather, a family of standards separated into several profiles. Its formal name is “ISO/IEC 14496-10 - MPEG-4 Part 10, Advanced Video Coding”.

The profiles implement different optional features that provide, for example, better image quality or smaller file sizes.

Using these profiles, the goal of this codec is to ensure flexibility so that it can be used in a wide variety of applications, ranging from streaming of video with low to high bitrates and resolutions, to high quality content, DVD storage and more.

H.264 video is streamable and can be embedded in an MP4 container.

This codec is patented and licensing is brokered by the MPEG LA group.¹

MPEG-4 Part 2 / MPEG-4 Visual (MPEG-4 Advanced Simple Profile)

This codec, MPEG-4 Part 2, implements a different video compression algorithm from MPEG-4 Part 10 and should not be confused with it.

It is a standard developed by the MPEG group (ISO/IEC 14496-2), in the same way as the earlier MPEG-1 and MPEG-2 standards.

Widely used implementations of this standard are DivX, Xvid and 3ivx. Xvid is open-source, whereas DivX and 3ivx are closed-source.

MPEG‐4 ASP video can be embedded in an AVI and an MP4 container.

This codec is patented and licensing is brokered by the MPEG LA group.¹

Theora

Theora is a free video codec maintained by the Xiph.org foundation. It is streamable, and is most often used in an OGG container. It can supposedly be embedded in almost any container format (needs confirmation).

1 http://www.mpegla.com/


Mozilla Firefox version 3.5 and later supports Theora in an OGG container without the need of any plugin. An open‐source decoder is made available by Xiph.org.

Theora is a free codec that is not encumbered by any patents.

HuffYUV

HuffYUV is a lossless video codec published under the GNU General Public License (GPL). The algorithm in use is similar to that of lossless JPEG.

The advantage of this codec is that it is lossless, i.e. the output of the decompressor is exactly the same as the original input before compression.

Audio codecs

MPEG-1 Audio Layer 3 (MP3)

MP3 is the most widely used audio codec. It was developed by ISO and the corresponding standards are ISO 11172-3 and ISO 13818-3.

The design goal of this codec is to represent audio by using less data and to still give an appropriate sound output compared to the uncompressed audio data.

It allows for both constant and variable bitrate compression. Constant bitrate encoding is limited to a fixed set of rates, ranging from 32 to 320 kbit per second. MP3 can contain up to 2 channels.
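To give a feeling for what these constant bitrates mean for storage, the following sketch estimates the audio payload size of a one-hour recording at the lowest and highest rates. It is a rough approximation only: the function name is made up for this example, 1 kbit is taken as 1000 bits, and container overhead and metadata are ignored.

```python
def cbr_size_bytes(bitrate_kbps, duration_s):
    """Approximate payload size of a constant-bitrate stream, in bytes."""
    return bitrate_kbps * 1000 * duration_s // 8


ONE_HOUR = 3600  # seconds

low = cbr_size_bytes(32, ONE_HOUR)    # 14_400_000 bytes, about 14.4 MB
high = cbr_size_bytes(320, ONE_HOUR)  # 144_000_000 bytes, about 144 MB
```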

MP3‐encoded audio can be embedded in any video container.

A noteworthy audio encoder for MP3 is the LAME project, which is open-source.

This codec is patented.

Advanced Audio Coding (AAC)

AAC is a standardised audio codec whose goal is to be the successor of MP3. It is designed to provide better sound quality than MP3 for streams encoded at the same bitrate.

This compression format corresponds to the ISO standards ISO 13818‐7 and ISO 14496‐3, as part of the MPEG‐2 and MPEG‐4 specifications.

It also uses the concept of different profiles depending on the use, for example a “low‐complexity” profile designed to be fast and playable in real‐time with limited CPU resources, and higher profiles with better quality but increased encoding and decoding times.

AAC can encode audio at any given bitrate; it is not limited to a set of predefined bitrates. This codec can contain up to 48 audio channels.

Audio encoded using AAC can be embedded in an MP4 container.

This codec is patented; however, no license fee is required to stream or distribute audio in AAC.²

Vorbis

Vorbis is a free, open standard audio codec maintained by the Xiph.org foundation.

Audio encoded using the Vorbis codec can be embedded inside OGG and MP4 containers. It can theoretically contain an arbitrary number of audio streams; there is no set limit.

Audio quality tests showed that Vorbis provides better quality than MP3 and AAC at medium and high bitrates, whereas at low bitrates HE-AAC (high efficiency profile) gives better quality than Vorbis.

The Mozilla Firefox 3.5 web browser can play Vorbis audio (stand-alone or with a video in an OGG container) natively, without the need for additional plugins.

This codec is not restricted by any patents. The reference implementation of the standard is open source. There are additional free and open source implementations of the Vorbis codec as well.

Free Lossless Audio Codec (FLAC)

FLAC is, as the name suggests, a free and lossless audio codec. Audio data is compressed without discarding any data from the input source. Sound compressed using FLAC can be decompressed to an audio stream identical to the original sound. It is maintained by the Xiph.org foundation.

For audio content, the compression rate achieved by FLAC typically ranges from 30% to 60%, whereas general lossless schemes such as ZIP or GZIP only reach 10%‐20%. In comparison, lossy codecs usually display compression rates of more than 80% by discarding parts of the original audio.

The codec is free: anyone can implement the specification of this format, and it is not restricted by any patent. The sources for the reference implementation are published under the General Public License (GPL).

Being lossless, this audio format is ideal for archival and preservation purposes.

2 http://www.vialicensing.com/Licensing/AAC_FAQ.cfm?faq=5#5


Summary of formats and supported codecs

                 Video codecs                   Audio codecs
Container   H.264   MPEG-4 ASP   Theora    MP3   AAC   Vorbis   FLAC
MP4           X         X                   X     X      X
AVI                     X                   X
FLV           X                             X     X
F4V           X                             X     X
OGG                                X        X            X       X

Figure 1: Containers and supported codecs

Browser support for formats and codecs

Container + codecs       Firefox   Opera   Chrome   Safari
MP4: H.264 + AAC                             X        X
OGG: Theora + Vorbis       X         X       X

Figure 2: Browser video support

Streaming

Real Time Streaming Protocol (RTSP)

RTSP is a protocol for streaming multimedia content. It is notably used by applications such as Quicktime and RealPlayer.

Real Time Messaging Protocol (RTMP)

RTMP is a proprietary protocol that was developed by Adobe Systems for streaming multimedia content between a server and a Flash player.

D 4.2 Report on the Swiss community multimedia needs in terms of distribution formats and long-term preservation

Introduction

This document describes the needs of the Swiss institutions for both distribution formats and long-term preservation, by analysing the results of deliverables D1.1 and D1.2 of work package 1.

Most institutions currently operate a recording system or plan to do so in the near future. The nature of the content is mainly audio and video (projection and/or camera feed) for the following type of events: lessons, lectures, seminars, conferences, etc.

Media distribution

Media delivery is done mainly through HTTP download, as well as streaming (RTSP or other protocols).

Most of the time, end‐user viewing of media content is done with a Flash player, integrated into an HTML page. The Quicktime player plugin is also used.

Archiving

The survey revealed that no institution archives high-resolution versions of its media.

To provide long-term access to the media content they produce and host, the institutions need to implement archival strategies and follow the best practices described in D4.3, so that the content is preserved and kept up to date and remains readable by users long after it was created.

D 4.3 Report on a general strategy allowing long-term preservation of multimedia content

Introduction

This section describes the requirements and best practices for a long-term preservation strategy for multimedia content.

The general strategy proposed is described in detail in Annex A: “Open Archival Information System and Long-term preservation of electronic document-based information”. This study is based on the international standards ISO 14721:2003 and ISO 18492:2005.

A set of best practices is then listed based on the study detailed in the aforementioned annex.

Open Archival Information System (OAIS)

Together, the data object and the representation information form the information content (or information object). In order to preserve this information, it must be accompanied by Preservation Description Information (PDI).

The PDI contains four different types of information:

- Origin: the origin of the information object, who has been in charge of it since its creation, and the history of all changes applied to it.
- Context: describes relationships between the information object and other information packages. For example, the context can describe why and how the information object is linked to another information object.
- Identification: provides one or several identifiers that uniquely identify the information object. An identifier can be composed of one or several attributes.
- Integrity: a mechanism ensuring that the information object is protected against unauthorised and undocumented modifications, for example a checksum computed over the information object.

The information object and its Preservation Description Information are packed together to be stored in the archive as one logical unit (Figure 5).
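As an illustration of how the information object and its PDI can travel together as one logical unit, here is a minimal sketch. The class and field names, the `make_package` helper and the choice of SHA-256 as the integrity mechanism are assumptions made for this example; the OAIS model does not prescribe them.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PDI:
    origin: list          # who handled the object and what was changed
    context: dict         # relationships to other information packages
    identification: list  # one or several unique identifiers
    integrity: str        # here: SHA-256 hex digest of the data object


@dataclass
class InformationPackage:
    data_object: bytes
    representation_info: dict
    pdi: PDI

    def is_intact(self):
        # Recompute the checksum and compare it with the recorded one.
        digest = hashlib.sha256(self.data_object).hexdigest()
        return digest == self.pdi.integrity


def make_package(data, representation_info, identifier, producer):
    pdi = PDI(
        origin=[f"submitted by {producer}"],
        context={},
        identification=[identifier],
        integrity=hashlib.sha256(data).hexdigest(),
    )
    return InformationPackage(data, representation_info, pdi)


# Hypothetical package for a short audio clip.
package = make_package(
    b"\x00\x01\x02\x03",
    {"container": "OGG", "codec": "FLAC"},
    identifier="clip-0001",
    producer="Example University",
)
```

Calling `package.is_intact()` returns `True` as long as the data object is unchanged and `False` after any undocumented modification of the bytes.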

Mandatory responsibilities of an OAIS

Here is a summary of the most important and mandatory responsibilities of an archival system. Each of these points is explained in further detail in the annex.

A. Negotiate with Producers and accept their appropriate information

B. Acquire sufficient proficiency to guarantee sustainability

C. Determine which communities are the target audience that can understand the provided information

E. Apply a strategy and documented procedures that guarantee the preservation of information

Functional entities

The following section lists the main functions of an OAIS that are of interest in the context of this project.

a. Generate an AIP (Archival Information Package)

This function transforms one or several SIPs (Submission Information Packages) into one or several AIPs that conform to the archive's internal data formatting standards.

This may imply conversions to other file formats or representations, or a reorganisation of the information content of the SIPs.

b. Technological watch

This function’s goal is to follow the evolution of emerging digital technologies, information standards as well as software and hardware platforms, and detect technologies that could cause the environment of the OAIS to become obsolete.

Based on this technological watch, prototyping can be used to better evaluate the impact of technologies and design migration plans accordingly.

Digital migration

There are several types of migrations of digital content in an archive. For multimedia content, the main type of migration used is transformation.

Transformation

Changes are made to the information object and/or the PDI, and the representation information is therefore altered accordingly.

The goal of a transformation is a maximal preservation of the information. The resulting AIP replaces the original AIP that underwent the transformation. The new AIP is seen as being a new version of the original content, which can be kept for preservation history control.

The representation information plays a key role in transformations. A transformation can be either reversible or irreversible. With a lossless codec, the transformation is considered reversible, because the entire original content can be recovered by decoding. With lossy codecs, however, the transformation is irreversible, because some parts of the original data are discarded during compression.
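The difference between a reversible and an irreversible transformation can be shown with a toy example. Note the assumptions: `zlib` is a general-purpose lossless compressor used here as a stand-in for a lossless codec, and the "lossy" step simply drops the four low bits of every sample, which is far cruder than any real audio or video codec.

```python
import zlib

# Stand-in for raw 8-bit samples of a media stream.
samples = bytes(range(256)) * 4

# Reversible: decoding the lossless encoding restores the input exactly.
restored = zlib.decompress(zlib.compress(samples))
is_reversible = restored == samples

# Irreversible: a toy lossy step that discards the 4 low bits per sample.
quantised = bytes(b & 0xF0 for b in samples)
lost_information = quantised != samples
```

Both flags end up `True`: the lossless round trip gives back every byte, while the quantised stream can no longer be turned back into the original.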

An OAIS should keep all different versions of a multimedia document in storage.

General strategy

Based on the standards for archival and preservation of electronic document-based information, a general strategy comprising the following main points can be described.

This section presents a summary of the main points defining a preservation strategy. More detail is given in Annex A.

Goals of a long‐term preservation strategy

Readable electronic document-based information

Data must be readable into the future. Besides maintaining the physical supports, the data must be formatted in a way that users can read and understand it.

Intelligible electronic document-based information

Representation data is needed to tell the computer how to interpret the preserved information. For multimedia content, this means that information should be present that states exactly which codecs, implementations and versions were used to produce the content.

An executable version of the coders and decoders should be stored in the repository as well. If possible, the source code should also be included, so that software obsolescence does not leave the data unreadable. Using codecs defined by international standards and open implementations is a way of achieving this goal.

Identifiable electronic document-based information

Each multimedia document must be identifiable in a unique way, using one or several attributes.

Authentic electronic document-based information

An archive should ensure the authenticity of the preserved information, and that it has not been modified, deleted or corrupted over time. Mechanisms for access restriction must be set up and enforced. Techniques such as CRCs and hash functions can help verify authenticity of the content.
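A periodic check based on CRCs and hash functions could look like the sketch below. It is only an illustration: the function names and the dictionary layout are invented for this example, and the choice of CRC-32 and SHA-256 is an assumption, not something the standards mandate. A matching digest shows that the bytes have not changed; it does not by itself prove who produced them.

```python
import hashlib
import zlib


def file_digests(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest and a CRC-32 of a file, reading it in chunks."""
    sha = hashlib.sha256()
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
            crc = zlib.crc32(chunk, crc)
    return {"sha256": sha.hexdigest(), "crc32": f"{crc:08x}"}


def verify(path, recorded):
    """Return True if the file still matches the digests recorded at ingest."""
    return file_digests(path) == recorded
```

The digests would be recorded when a file enters the archive and `verify` run during later audits; a `False` result signals modification or corruption.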

Elements of a long-term preservation strategy

Migrating electronic document-based information

Inevitably, a long‐term strategy for electronic information will involve migrations to newer formats.

Storage repositories are recommended to migrate all electronic information from the wide variety of formats used by content creators to a smaller number of “standardised” formats upon its transfer to the repository.

Specifically, proprietary formats should be avoided. Among the technology-neutral formats that should be taken into consideration are the available ISO standards and other open standards for multimedia formats and compression techniques.

Addressing software dependence

It may be difficult to provide long‐term access to electronic information that can only be used within a specific software application or environment, especially if a vendor discontinues support or does not provide new versions for said software.

It may be possible to eliminate software dependence by accepting some loss of structure. For example, a text document in a proprietary format can be migrated to plain text, thus losing aspects of its physical representation. This is, however, not applicable to audiovisual content. A solution in this case is the migration to standard formats.

Metadata

Metadata stores information about the context, processing and use of electronic document-based information. Some metadata, such as file size, file format, length and hash digest, can be extracted automatically by software applications.

However, manual intervention may sometimes be needed for additional metadata such as classification, keywords, and so on.
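The automatically extractable part of the metadata can be gathered with a few lines of code. The sketch below is illustrative only: the function name and the dictionary keys are made up, the MIME type is merely guessed from the file extension, and the playing length is left out because it would require a media-parsing tool. Descriptive fields such as keywords are left empty for manual entry.

```python
import hashlib
import mimetypes
import os
from datetime import datetime, timezone


def extract_technical_metadata(path):
    """Collect metadata that software can derive without human input."""
    stat = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    mime_type, _ = mimetypes.guess_type(path)  # may be None if unknown
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "file_size": stat.st_size,
        "mime_type": mime_type,
        "modified": modified.isoformat(),
        "sha256": digest,
        "keywords": [],          # to be filled in manually
        "classification": None,  # to be filled in manually
    }
```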

Metadata consists of information about the context, processing and use that supports the identification, retrieval and preservation of authentic electronic document-based information.

Interoperability should be kept in mind for interactions with other entities and repositories.

Developing a long‐term preservation strategy

Quality control

The repository should set up controls and rules that document how and when the electronic information has been managed and maintained.

This documentation should include the procedures and policies, description of events such as losses of data during migrations, as well as the results of periodical audits for quality control that have been made to verify that the repository’s policies have been respected.

Security: Application/software access control

There should be automated procedures that control the modification and deletion of information inside the repository. When an electronic document is modified or deleted, it should be automatically logged by the software, with the name of the person and the reason for the change.
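A minimal sketch of such automatic logging is given below. The `AuditedStore` class, its method names and the log record layout are hypothetical; a real repository would persist the log in tamper-resistant storage and authenticate the user rather than trust a supplied name.

```python
from datetime import datetime, timezone


class AuditedStore:
    """Toy in-memory repository that logs every modification and deletion."""

    def __init__(self):
        self._objects = {}
        self.log = []

    def _record(self, action, object_id, user, reason):
        # Refuse any change that does not name a person and a reason.
        if not user or not reason:
            raise ValueError("user and reason are required")
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "object_id": object_id,
            "user": user,
            "reason": reason,
        })

    def put(self, object_id, data, user, reason):
        action = "modify" if object_id in self._objects else "create"
        self._record(action, object_id, user, reason)
        self._objects[object_id] = data

    def delete(self, object_id, user, reason):
        if object_id not in self._objects:
            raise KeyError(object_id)
        self._record("delete", object_id, user, reason)
        del self._objects[object_id]


store = AuditedStore()
store.put("video-42", b"v1", user="alice", reason="initial ingest")
store.put("video-42", b"v2", user="bob", reason="migration to new format")
store.delete("video-42", user="alice", reason="withdrawn by producer")
```

After these three calls, `store.log` holds one entry per operation with the actions `create`, `modify` and `delete`, each carrying the user name and the stated reason.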

Access to the stored information should be restricted to a thoroughly tested and documented entry point, so that unauthorised access is prevented.

Security policy

The following procedures should be described and set up to ensure the repository’s security:

• security measures for the transfer of electronic document-based information to the repository
• procedures describing access control and monitoring
• the location of the physical storage facility, chosen to minimise the risks of disasters
• a disaster recovery plan
• a backup system

Best practices

Multimedia quality

To archive multimedia content, the highest possible resolution and quality of both audio and video streams should be kept.

If the codecs used are not lossless, the encoding settings (typically the bitrate) must be chosen so that no perceptible quality of the multimedia document is lost during the encoding.

File formats

As mentioned in Annex A, proprietary file formats and codecs should be avoided as much as possible and international standards should be used instead. The following discussion is based on Figure 1: Containers and supported codecs.

Among the file formats studied in this document, we retain the most interesting ones, namely the MP4 and OGG containers because they are international or open standards, as well as the F4V container format for its widespread use and its support for modern high quality codecs (H.264, AAC).

MP4, OGG and F4V are flexible container formats that support high quality video and audio codecs (H.264, AAC, Theora, Vorbis, FLAC), suitable for archival storage.

MP4 is itself an ISO standard, which is a great advantage from the perspective of long-term preservation. A disadvantage of MP4 is that its corresponding codecs (H.264, MPEG-4 ASP, AAC) are patented and thus subject to licensing fees for use and distribution of content.

F4V is based on an ISO standard as well, even though it was not officially registered by the competent authority. Its technical specification is publicly available.

The OGG container, as well as the supported codecs Theora, Vorbis and FLAC are open standards as well, though from a separate foundation and not from ISO. Particularly interesting is FLAC, which is a lossless audio codec capable of attaining good compression rates while keeping the original data intact.

Using open standards in an archival system is a great way of removing software dependency and facilitating future migration to newer formats.

It is also noteworthy that MP3 audio encoding is supported by all the containers mentioned in this document. However, it is generally advised to use its more efficient successor AAC.

Archival vs. distribution

Annex A mentions three different information packages:

- Submission Information Package (SIP): information package as it is submitted to the archive by a Producer
- Archival Information Package (AIP): information package as it is stored inside the archive
- Dissemination Information Package (DIP): information package as it is displayed to the user

We can thus describe a typical scenario for an archival system:

The SIPs are the multimedia files that are submitted to the archival repository, the AIPs are the files that are stored in the archive, and the DIPs are the files that are distributed and displayed to the end users. In such a case, the archive will very probably allow producers to submit multimedia content in a variety of formats, for example AVI, DivX, MPEG-2, FLV, MP4 and OGG (the SIPs). It will then convert those files and store them in a normalised, standard format, for example Theora and FLAC streams embedded in an OGG container (the AIPs). Finally, it will make the content available to users through a Flash player receiving F4V data (H.264 + AAC) from a streaming server using the RTMP protocol (the DIPs).
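The scenario above can be sketched as a small planning function that derives the AIP and DIP file names from a submitted file and builds the corresponding conversion commands without running them. Everything here is an assumption made for illustration: the report does not name a conversion tool, the function names are invented, and the commands presume an ffmpeg build with the libtheora and libx264 encoders enabled. Delivery over RTMP by a streaming server is outside the scope of this sketch.

```python
def normalise_command(sip_path, aip_path):
    """Build (not run) an ffmpeg call: submitted file -> Theora + FLAC in OGG."""
    return [
        "ffmpeg", "-i", sip_path,
        "-c:v", "libtheora", "-q:v", "10",  # highest Theora quality setting
        "-c:a", "flac",                     # lossless audio
        aip_path,
    ]


def dissemination_command(aip_path, dip_path):
    """Build (not run) an ffmpeg call: archived file -> H.264 + AAC in F4V."""
    return ["ffmpeg", "-i", aip_path, "-c:v", "libx264", "-c:a", "aac", dip_path]


def plan(sip_path):
    """Derive AIP/DIP names and the commands that would produce them."""
    stem = sip_path.rsplit(".", 1)[0]
    aip_path = stem + ".ogg"
    dip_path = stem + ".f4v"
    return {
        "SIP": sip_path,
        "AIP": aip_path,
        "DIP": dip_path,
        "commands": [
            normalise_command(sip_path, aip_path),
            dissemination_command(aip_path, dip_path),
        ],
    }


# Hypothetical submission.
workflow = plan("lecture01.avi")
```

Keeping the plan as data makes it easy to record the exact conversion parameters in the package's history before any command is executed.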

In this context, the focus was on the AIPs and the way to store and preserve this multimedia data.

For information, Figure 2: Browser video support shows the current video support of web browsers.

Persistent identification

In most cases, it is desirable for a multimedia archival system to have a persistent identification mechanism for its available content. This way, users can access the resources using a single, unique and immutable identifier.

To this end, the most suitable solution appears to be the Uniform Resource Name (URN). It allows one or several Uniform Resource Locators (URLs) to be registered under a specific and globally unique name.

This system relies on a resolving service that maps each URN to its corresponding URLs. The resolving service can be delegated to a third party server or directly incorporated as an additional functionality into the archive repository. Both alternatives should be considered:

• either each individual archive repository implements a resolving service for its own persistent resources, or
• a nation-wide centralised resolving service is set up, where all multimedia content is registered

The second alternative appears able to provide more consistency. Furthermore, this would resemble an already used scheme in Switzerland, where the Swiss National Library handles and manages URN policies for bibliographical documents in the country. In its current implementation, the actual URN resolving service is provided by the German National Library.

Since the archive already assigns internal unique identifiers for its resources, expanding these internal identifiers to global ones (by adding archive and content specific prefixes, for example) is no difficult task.

Assignment of URLs

In this system, one or several URLs can be assigned to a single URN. This way, several alternative URLs can be provided to ensure constant access if one of the resource providers fails. This can also be used to link to several different types or versions of the same resource: for example, a video URL and its corresponding, identical archived version stored in a multimedia archival system. In this scheme, it is the responsibility of the archival system to guarantee the long-term preservation of the content for which URNs are assigned.
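A resolving service of this kind can be reduced to a table that maps each URN to an ordered list of URLs. The sketch below is a toy in-memory version: the class name, the URN and the URLs are hypothetical, and the availability check is passed in as a function so that the example does not need network access.

```python
class UrnResolver:
    """Toy resolver mapping a URN to one or several URLs."""

    def __init__(self):
        self._urls = {}

    def register(self, urn, url):
        urls = self._urls.setdefault(urn, [])
        if url not in urls:
            urls.append(url)

    def resolve_all(self, urn):
        return list(self._urls.get(urn, []))

    def resolve(self, urn, is_available=lambda url: True):
        # Return the first registered URL whose provider is reachable.
        for url in self._urls.get(urn, []):
            if is_available(url):
                return url
        return None


resolver = UrnResolver()
urn = "urn:example:archive-a:video-42"  # hypothetical identifier
resolver.register(urn, "https://media.example.org/video-42.f4v")
resolver.register(urn, "https://archive.example.org/aip/video-42.ogg")

# Simulate a failure of the first provider.
fallback = resolver.resolve(urn, is_available=lambda u: "media." not in u)
```

With the first provider marked as unavailable, `fallback` is the archived copy, which illustrates how several URLs behind one URN keep the resource accessible.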

Syntax

The syntax of a URN is defined as follows by RFC 2141:³

<URN> ::= "urn:" <NID> ":" <NSS>

NID: namespace identifier
NSS: namespace specific string

The namespace identifier determines the syntactic interpretation of the namespace specific string. The officially registered NIDs are defined by IANA.⁴ The standard list includes the International Standard Audiovisual Number (ISAN). It should be investigated further whether an archive is allowed to use the ISAN namespace identifier as is, or whether it needs to comply with a set of specifications defined by the standard. Registration of a new NID is also possible; IANA mentions registration procedures that operate on a first-come, first-served basis after a review period.

The NSS can be chosen freely. The system providing URNs has the responsibility of ensuring their uniqueness.
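The syntax can be checked with a short parser. This is a simplified sketch, not a full RFC 2141 validator: the NID pattern (a letter or digit followed by up to 31 letters, digits or hyphens) follows the RFC, but the NSS is accepted as any non-empty string without whitespace, and the function name is made up for this example.

```python
import re

# "urn:" and the NID are case-insensitive; the NSS is kept as written.
URN_PATTERN = re.compile(
    r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(\S+)$",
    re.IGNORECASE,
)


def parse_urn(urn):
    """Split a URN into (NID, NSS); raise ValueError if it is malformed."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a valid URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss


nid, nss = parse_urn("urn:mpeg:mpeg7:schema:2001")
```

Here `nid` is `"mpeg"` and `nss` is `"mpeg7:schema:2001"`: only the first colon after the NID separates the two parts, and any further colons belong to the namespace specific string.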

Examples (source: [5])

Below are two examples using the ISAN and the MPEG namespace identifiers, respectively.

urn:isan:0000-0000-9E59-0000-O-0000-0000-2
The URN for "Spider-Man (film)", identified by its audiovisual number.

urn:mpeg:mpeg7:schema:2001
Default Namespace Rules for MPEG-7 video metadata.

Alternative

An alternative to the URN persistent identification scheme is Digital Object Identifier (DOI). DOI implements the URN concept and augments it with an additional data model.

3 http://tools.ietf.org/html/rfc2141
4 http://www.iana.org/assignments/urn-namespaces/

It is a priori not obvious whether this data model adds useful functionality to the interaction between URNs and the archive repository, since the archival system already manages metadata about its content objects: information packages with Preservation Description Information (PDI) and associated descriptive information (see Figure 5).

References

[1] Open Archival Information System – reference model (ISO 14721)
[2] Long-term preservation of electronic document-based information (ISO 18492)
[3] http://diveintohtml5.org/video.html
[4] http://xiph.org/
[5] http://en.wikipedia.org/
[6] http://www.iana.org
[7] http://tools.ietf.org/

Annex A: Open Archival Information System and Long-term preservation of electronic document-based information

Open Archival Information System – reference model (ISO 14721)

Scope

ISO 14721:2003 specifies a reference model for an open archival information system (OAIS). The purpose of ISO 14721:2003 is to establish a system for archiving information, both digitized and physical, with an organizational scheme composed of people who accept the responsibility to preserve information and make it available to a designated community.

Basics of an OAIS

An OAIS is interfaced with Producers, Users and Management, as shown in Figure 3.

The term Producer represents the people or systems that submit information to an OAIS for preservation.

Management defines the global policies of the OAIS, including what information to preserve.

Users represent the people or client systems that interact with the OAIS to search and consult preserved information. The target community of users is defined as the users who have an interest in the stored information and are able to understand it.

[Figure 3: Simple model of an Open Archival Information System (OAIS); diagram labels: Producer, User]

The massive and rapid growth of digital data is a challenge for archives: digital data is easily lost or corrupted, and the evolution of technology makes software and hardware obsolete after a few years.

Definition of information

Information is always represented by a certain type of data. To successfully preserve information, it is essential that the OAIS understands the data using its representation information.

The representation information specifies how to interpret the stored data and thus extract its meaning (Figure 4 below).

[Figure 4: A data object interpreted using representation information produces an information object]

In the context of archiving multimedia content, such as a video, the file (the data) must be interpreted using all the necessary codecs (the representation information) to produce and play back the audiovisual content (the information).

In an archive, multimedia content should always be accompanied by precise representation information describing the file format and the codecs (with their versions) used by the content in question. If possible (depending on licensing, for example), the OAIS should also store the software needed to encode and decode the multimedia file.

Together, the data object and the representation information form the information content (or information object). In order to preserve this information, it must be accompanied by Preservation Description Information (PDI).

The PDI is divided into four categories:

- Origin: describes the source of the information object, who has been in charge of it since its creation, and the history of all changes applied to it.
- Context: describes relationships between the information object and other information packages. For example, the context can describe why and how the information object is linked to another information object.
- Identification: provides one or several identifiers that uniquely identify the information object. An identifier can be composed of one or several attributes.
- Integrity: a mechanism ensuring that the information object is protected against unauthorised and undocumented modifications. It can be, for example, a checksum on the information object.

The information object and its Preservation Description Information are packed together to be stored in the archive as one logical unit (Figure 5).
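The structure of such a package can be sketched as a pair of small record types. This is a minimal illustration, assuming that the integrity element is a SHA-256 checksum of the data object; the class names, field names and the `make_package` helper are hypothetical and are not defined by ISO 14721.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class PDI:
    origin: List[str]       # custody and change history of the object
    context: List[str]      # relationships to other information packages
    identifiers: List[str]  # one or several unique identifiers
    integrity_sha256: str   # assumption: integrity is a SHA-256 checksum


@dataclass
class InformationPackage:
    data: bytes               # the data object (e.g. the video file)
    representation_info: str  # e.g. description of container and codecs
    pdi: PDI

    def is_intact(self) -> bool:
        """Compare the stored checksum with a freshly computed one."""
        return hashlib.sha256(self.data).hexdigest() == self.pdi.integrity_sha256


def make_package(data: bytes, representation_info: str,
                 identifier: str, origin_note: str) -> InformationPackage:
    """Build a package and record the checksum at creation time."""
    pdi = PDI(
        origin=[origin_note],
        context=[],
        identifiers=[identifier],
        integrity_sha256=hashlib.sha256(data).hexdigest(),
    )
    return InformationPackage(data, representation_info, pdi)
```

Any later change to the data object that is not accompanied by an update of the PDI makes `is_intact()` return `False`, which is the purpose of the integrity category.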

A package is associated with descriptive information about the information object (metadata) so the community of users can search the archive and retrieve information of interest. This metadata can be extracted from the information object and the PDI or come from another source.

[Figure 5: Information package associated with descriptive information. Package 1 contains the information object (data object + representation information), the PDI and the packaging information, and is linked to descriptive information.]

Types of information packages

An OAIS can distinguish between three different types of information packages:

- Submission Information Package (SIP): information package as it is submitted to the archive by a Producer
- Archival Information Package (AIP): information package as it is stored inside the archive
- Dissemination Information Package (DIP): information package as it is displayed to the user

Mandatory responsibilities of an OAIS

This section presents the minimal responsibilities an OAIS must assume.

A. Negotiate with Producers and accept their appropriate information

An OAIS negotiates with Producers to agree upon the kind of content to be preserved, ensuring that it respects its mission and that it responds to the needs of the target community of users.

The OAIS must also extract or obtain sufficient descriptive data to allow the target users to find the information they are looking for efficiently.

B. Acquire sufficient control of the information to guarantee its sustainability

An OAIS must obtain sufficient control of the information content to be able to preserve it. Namely, it must handle issues in the following three categories:

i. Copyright, intellectual property or other legal restrictions. An archive must respect all legal restrictions in effect. It can establish directives for the distribution and duplication of its information.

ii. Authorisation to modify the representation information. If the information content is no longer stored in a form suitable for the target users, or no longer complies with internal requirements for its conservation, the OAIS must be allowed to migrate the data to a new format. This means that the corresponding representation information must be updated as well.

iii. Protocols with external organisations (if necessary). In order to fulfil its mission, an OAIS can establish protocols with external organisations, for example to share or delegate the storage of some common representation information with another organisation. Such protocols must be reviewed to ensure that they are respected and that they remain useful.

C. Determine which communities are the target audience that can understand the provided information

It is very important to clearly define the community or communities of users that are the audience of the content to be stored in the archive, and to take the necessary steps so that they can understand this information.

The potential evolution of the definition of the target community of users must also be taken into account. During the definition of the preservation policies, starting by choosing a community broader than the current community minimises the difficulty of later extending it.

As such, the archive must take into account the evolution of the way multimedia content is consulted by its users, and take steps to adapt the way the information is made available when needed.

D. Ensure the information to preserve is immediately comprehensible by the target community of users (no need for experts)

It can happen that the archive must adapt the representation information to ensure that the information objects remain easily understandable by the users, because the knowledge base of the target users has evolved.

E. Apply a strategy and documented procedures that guarantee, within reasonable limits, the preservation of information in spite of unexpected events, and that allow the information to be disseminated as an authenticated copy of the original or as traceable to the original

It is essential that the OAIS defines and applies well‐documented procedures and methods for the preservation of its information packages.

For example, migrations that modify information objects and/or PDIs must be carefully verified to guarantee that no information is lost or corrupted.

A long-term plan of the technologies used must be established and kept up to date with technological developments, so that migrations of information content can be prepared in advance and executed in time.

F. Make the preserved information available to the target community of users

By definition, an OAIS makes its information content available to the target community of users. It can present it in different ways and provide search tools to browse the collections of information objects.

The access to some information packages may be restricted and thus only displayed to authorised users. The OAIS’ access policies must be made known publicly.

Functional entities

The following section lists the main functions of an OAIS that interest us in the context of this project.

a. Generate an AIP (Archival Information Package)

This function transforms one or several SIPs (Submission Information Packages) into one or several AIPs that conform to the internal standards of data formatting inside the archive.

This may imply conversions to other file formats or representations or a reorganisation of the information content of the SIPs.

Example: if a submitted file is in a proprietary format and cannot be described, even visually, it is recommended to migrate it to a non-proprietary format to guarantee its preservation.
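The decision taken when generating an AIP can be sketched as follows. The sketch only models the decision and the event that should be recorded; it does not transcode anything. The set of accepted formats, the target format and all class and function names are hypothetical placeholders for an archive's own policy.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical archive policy (assumption of this sketch): formats that
# are stored as submitted, and the format everything else is migrated to.
ACCEPTED_FORMATS = {"video/mp4", "video/ogg"}
TARGET_FORMAT = "video/mp4"


@dataclass(frozen=True)
class SIP:
    identifier: str
    mime_type: str


@dataclass(frozen=True)
class AIP:
    identifier: str
    mime_type: str
    events: Tuple[str, ...]  # what happened to the content during ingest


def generate_aip(sip: SIP) -> AIP:
    """Describe the AIP that results from ingesting `sip`.

    A real system would perform the conversion and update the
    representation information; here only the outcome is recorded.
    """
    if sip.mime_type in ACCEPTED_FORMATS:
        return AIP(sip.identifier, sip.mime_type, ("ingested unchanged",))
    event = f"migrated from {sip.mime_type} to {TARGET_FORMAT}"
    return AIP(sip.identifier, TARGET_FORMAT, (event,))
```

Recording the migration as an event keeps the change visible in the history of the object, in line with the origin category of the PDI.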

b. Technological watch

The goal of this function is to follow the evolution of emerging digital technologies, information standards as well as software and hardware platforms, and to detect technologies that could cause the environment of the OAIS to become obsolete.

Based on this technological watch, prototyping can be used to better evaluate the impact of technologies and design migration plans accordingly.

Sustainability and digital migration

Digital migration is defined as the transfer of digital information inside an OAIS with the goal of preservation.

There are three characteristics that distinguish digital migration from a transfer in general:

- a goal of preservation of the entirety of the information content,
- the perspective that the new implementation, or version, of the content will replace the previous one inside the archive system,
- all the aspects of the transfer are integral parts of the OAIS.

The main factors driving the need for a digital migration of AIPs are the following:

- improved efficiency: the fast evolution of hardware and software allows significant gains in terms of storage capacity and performance
- new requirements for user service: users adopt new technologies and their expectations of service evolve alongside technologies

Digital migrations are time-consuming, costly and expose the OAIS to a higher risk of data loss. Hence the OAIS must plan them carefully and take into consideration the problems and the different alternatives in terms of migration.

Types of migrations

There are four main types of digital migration in an OAIS. In the context of preservation of audiovisual content, the emphasis will be on the last one, namely transformation.

A. Storage media renewal

One or several pieces of storage media, containing one or several AIPs, are replaced by media of the same type through a bit-by-bit copy.

B. Duplication

AIPs are transferred bit-by-bit to storage media of the same or of another type. Packaging information, information objects and PDIs remain unchanged.

C. Repackaging

The packaging information undergoes changes.

D. Transformation

Changes are made to the information object and/or the PDI, and the representation information is altered accordingly.

The goal of a transformation is a maximal preservation of the information. The resulting AIP replaces the original AIP that underwent the transformation. The new AIP is seen as being a new version of the original content, which can be kept for preservation history control.

The representation information plays a key role in transformations. A transformation can be either reversible or irreversible.

A transformation is reversible when the new representation of the information univocally corresponds to the original representation. It is then always possible to apply the inverse transformation to obtain the original representation of the information.

A transformation is irreversible when we cannot guarantee that the transformation can be inverted. Typically, this can happen when the information is migrated to a richer, more elaborate format than the original format.

When migrating audiovisual content to a new format, some information can be lost if it is encoded with insufficient quality. This issue does not occur with lossless codecs.
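The difference between a reversible and an irreversible transformation can be illustrated with a round-trip check on a sample: the content is transformed, transformed back, and compared bit-for-bit with the original. In this sketch `zlib` stands in for a lossless codec and a byte-dropping function stands in for a lossy one; both are assumptions chosen for brevity, not real audiovisual codecs, and a passing check on one sample is evidence rather than proof of reversibility.

```python
import hashlib
import zlib
from typing import Callable

Transform = Callable[[bytes], bytes]


def round_trip_ok(original: bytes, forward: Transform, inverse: Transform) -> bool:
    """Return True if inverse(forward(original)) is bit-identical to original."""
    restored = inverse(forward(original))
    return hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()


# Stand-in for a lossless codec (assumption of this sketch).
def lossless_encode(data: bytes) -> bytes:
    return zlib.compress(data)


def lossless_decode(data: bytes) -> bytes:
    return zlib.decompress(data)


# Stand-in for a lossy codec: every second byte is discarded.
def lossy_encode(data: bytes) -> bytes:
    return data[::2]


def lossy_decode(data: bytes) -> bytes:
    # Repeats each kept byte; the discarded bytes cannot be recovered.
    return bytes(value for value in data for _ in range(2))
```

A migration procedure could run such a check on the migrated content before the original AIP is replaced.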

Long-term preservation of electronic document-based information (ISO 18492)

Scope

This chapter provides practical guidance for long-term preservation and retrieval of electronic document-based information, when the retention period exceeds the expected life of the technology (hardware or software) used to create and maintain the information. To this end, it makes use of technology-neutral information standards for supporting long-term preservation and access.

Goals of a long‐term preservation strategy

This section briefly presents five key issues for long‐term preservation of electronic documents.

Readable electronic document-based information

A strategy for long-term preservation needs to ensure that the data remains readable in the future. Besides the problem of maintaining physical storage media, an important aspect of ensuring data readability is data formatting. The data needs to be formatted in a way that enables users to process it in the future, using technology-neutral formats.

Intelligible electronic document-based information

The computer needs to have access to information describing how to interpret the stored bit sequences, i.e. it needs to know how to extract meaning from them. This is the representation information described in ISO 14721 (see previous chapter).

Identifiable electronic document-based information

The electronic documents should be organized in such a way that users and information systems can distinguish between information objects using a unique attribute.

In the context of persistent identification, the repository can incorporate methods such as URN or DOI to ensure the content is uniquely identifiable not only internally, but also to the end users, and in a persistent way.


Understandable document-based information

A long-term preservation strategy should ensure the electronic information it stores is understandable by both computers and humans. The meaning of electronic document-based information is not solely determined by its content: additional information about its context of use and of creation should be conveyed in the form of descriptive metadata associated with it.

Authentic electronic document-based information

A repository for preservation of electronic documents needs to guarantee that its information is authentic, i.e. that over time it has not been modified or corrupted.

The strategy should apply mechanisms to restrict access to its information and protect it from deliberate or accidental alteration, corruption or deletion.

Said mechanisms include secure client-server architectures for access restriction and techniques such as cyclic redundancy checks (CRC) and one-way hash functions (e.g. SHA-1) for the verification of the authenticity of the electronic information.
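Both verification techniques are available in the Python standard library. The sketch below computes a CRC-32 and a hash digest over a stream in chunks, so that large media files do not have to be loaded into memory at once. The function names and the chunk size are illustrative choices; SHA-1 is the default only because it is the example named above, and the algorithm is a parameter so that a stronger function such as SHA-256 can be used instead.

```python
import hashlib
import zlib
from typing import BinaryIO, Dict


def fingerprint(stream: BinaryIO, algorithm: str = "sha1",
                chunk_size: int = 65536) -> Dict[str, str]:
    """Compute a CRC-32 and a hash digest of `stream`, chunk by chunk."""
    crc = 0
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        crc = zlib.crc32(chunk, crc)  # running CRC over all chunks so far
        digest.update(chunk)
    return {"crc32": f"{crc:08x}", algorithm: digest.hexdigest()}


def verify(stream: BinaryIO, recorded: Dict[str, str],
           algorithm: str = "sha1") -> bool:
    """Return True if the stream still matches the recorded fingerprint."""
    return fingerprint(stream, algorithm) == recorded
```

The fingerprint would be recorded when an object enters the repository and recomputed during periodic audits; a mismatch indicates that the object has been altered or corrupted.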

Elements of a long‐term preservation strategy

An accurate and reliable long-term preservation strategy for electronic information must ensure that the information:

• can be read and interpreted by a computer application
• can be rendered in a format understandable by humans
• has the structure and context that existed at the time of the creation or receipt of the information

A strategy implementing these goals is divided into two main activities: information migration addresses the first two goals, and the last one is handled through the use of appropriate metadata.

Migrating electronic document-based information

Inevitably, a long‐term strategy for electronic information will involve migrations to newer formats. To ensure preservation and access over time, a repository needs to address three challenges:


1. In the foreseeable future, it will be very difficult for storage repositories to have access to or support all software packages and formats for creating and using electronic information.
2. Some information is likely to be software-dependent and thus only available in a specific software environment.
3. Operating systems and applications will inevitably be rendered obsolete by technological progress, and repositories will have to periodically update their systems and software and migrate their information content to the new environment.

Migration of electronic content can successfully address these challenges.

Addressing software dependence

It may be difficult to provide long‐term access to electronic information that can only be used within a specific software application or environment, especially if a vendor discontinues support or does not provide new versions for said software.

It may be possible to eliminate software dependence at the cost of some loss of structure: for example, a text document in a proprietary format can be migrated to plain text, thus losing aspects of its physical representation. This is however not applicable to audiovisual content. A solution in this case is the migration to standard formats.

Software upgrades and new software installation

The installation of new software or upgrades is inevitable over the lifetime of an electronic repository for long-term access.

When software is upgraded, and compatibility between the upgrade and the old software is guaranteed, all electronic information should be automatically moved to the new representation and environment.

When new software replaces existing software, the electronic information should be migrated by using the export feature of the old system and the import feature of the new system.

Migration to standard formats

It is recommended for storage repositories to migrate all the electronic information from the wide variety of formats used by content creators to a smaller number of "standardized" formats upon their transfer to the repository.

The "standardized" formats can be chosen by consensus among widely used formats and are likely to cover the majority of a particular class of data (for example, audiovisual content).

Specifically, proprietary formats should be avoided. Among the technology-neutral formats that should be taken into consideration are the available ISO standards and other open standards for multimedia formats and compression techniques. Examples: PDF/A-1, XML, TIFF, JPEG and MPEG.

Metadata

Metadata stores information about the context, processing and use of electronic document-based information. To a certain extent, metadata such as file size, file format, length and hash digest can be extracted automatically by software applications.
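Automatic extraction of this kind of technical metadata can be sketched with the Python standard library alone. The function name and the field names are hypothetical; the file format is guessed from the file extension only, SHA-256 is an arbitrary choice of hash function, and the length (duration) of a recording is left out because it requires a media analysis tool.

```python
import hashlib
import mimetypes
import os
from typing import Dict, Union


def extract_metadata(path: str) -> Dict[str, Union[str, int]]:
    """Collect technical metadata that needs no manual intervention."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large media files are not loaded at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    # Guess based on the file extension only, not on the file content.
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "file_size": os.path.getsize(path),  # size in bytes
        "file_format": mime_type or "application/octet-stream",
        "sha256": digest.hexdigest(),
    }
```

A production system would identify formats from the file content rather than from the extension, since extensions can be missing or wrong.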

However, manual intervention may sometimes be needed for additional metadata such as classification, keywords, and so on.

Metadata (data about data) consists of information about the context, processing and use that supports the identification, retrieval and preservation of authentic electronic document-based information.

Interoperable metadata

Organisations designing the capture and use of metadata that may be used in the future in an interoperable environment should take ISO/TS 23081‐1 into consideration.

Developing a long-term preservation strategy

Quality control

The repository should set up controls and rules that document how and when the electronic information has been managed and maintained.

This documentation should include the procedures and policies, descriptions of events such as losses of data during migrations, as well as the results of the periodic quality control audits carried out to verify that the repository's policies have been respected.


Security: Application/software access control

There should be automated procedures that control the modification and deletion of information inside the repository. When an electronic document is modified or deleted, it should be automatically logged by the software, with the name of the person and the reason for the change.
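Such automated logging can be sketched as a repository class whose modify and delete operations cannot be called without a user name and a reason, and which records every change with a timestamp. The class and field names are hypothetical, and the in-memory storage is only a stand-in for a real repository back end.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class AuditEntry:
    timestamp: str  # UTC time of the change, ISO 8601
    action: str     # "modify" or "delete"
    object_id: str
    user: str
    reason: str


class Repository:
    """In-memory stand-in for a repository with mandatory change logging."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self.audit_log: List[AuditEntry] = []

    def store(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data

    def _log(self, action: str, object_id: str, user: str, reason: str) -> None:
        if object_id not in self._objects:
            raise KeyError(object_id)
        if not user or not reason:
            raise ValueError("a user name and a reason are required")
        now = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(AuditEntry(now, action, object_id, user, reason))

    def modify(self, object_id: str, data: bytes, user: str, reason: str) -> None:
        self._log("modify", object_id, user, reason)  # raises before any change
        self._objects[object_id] = data

    def delete(self, object_id: str, user: str, reason: str) -> None:
        self._log("delete", object_id, user, reason)  # raises before any change
        del self._objects[object_id]
```

Because the log entry is written before the change is applied, a request without a name or a reason is rejected and leaves the stored object untouched.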

Access to the stored information should be restricted to a thoroughly tested and documented entry point, so that unauthorized access is prevented.

Security policy

The following procedures should be described and set up to ensure the repository's security:

• security measures for the transfer of electronic document-based information to the repository
• procedures describing access control and monitoring
• the location of the physical storage facility to minimise the risks of disasters
• a disaster recovery plan
• a backup system
