<<

Digital Preservation at the Church of Jesus Christ of Latter-day Saints 28th IEEE Mass Storage Systems and Technology Conference April 2012 — Gary T. Wright

Introduction to the Church of Jesus Church Audiovisual Capabilities Christ of Latter-day Saints Over the last two decades, the Church The Church of Jesus Christ of Latter-day has developed state-of-the-art digital Saints is a worldwide Christian church with audiovisual capabilities to support its vast, more than 14.4 million members and 28,784 worldwide communications needs. To congregations. With headquarters in Salt illustrate one such need, twice a year the Lake City, (USA), the Church operates Church holds General Conference in its three universities, a business college, 136 remarkable Conference Center, which seats temples, and thousands of seminaries and 21,000. institutes of religion around the world that Church members from all over the world enroll more than 700,000 students in travel to attend these conferences in order religious training. to hear timely, relevant counsel from The Church has a scriptural mandate to inspired ecclesiastical leaders. keep records of its proceedings and However, since the majority of the preserve them for future generations. members are unable to attend in person, the World-famous Accordingly, the Church has been creating meetings are broadcast in high definition and Orchestra at and keeping records since 1830, when it video via satellite to more than 7,400 was organized. A Church Historian’s Office Church buildings in 102 countries. The was formed in the 1840s, and in 1972 it was broadcasts are simultaneously translated renamed the Church History Department. into 32 languages. In addition, the meetings Today, the Church History Department are streamed live on the Church’s website has ultimate responsibility for preserving (lds.org) and on the Mormon Channel records of enduring value that originate (radio.lds.org). from its ecclesiastical leaders and within the Ultimately, surround sound digital various Church departments, the Church’s audio tracks for 96 languages are created educational institutions, and its affiliations. (comprising 1.75 TB of audio content) to With such a broad range of record augment the digital video taping of each sources, the array of digital record types meeting—making the Church of Jesus requiring preservation is also extensive. Christ of Latter-day Saints the world’s However, the vast majority of storage largest language broadcaster. capacity in the Church History Because of their exalting and enduring Department’s digital preservation archive is value, all General Conference meeting allocated to audiovisual records. videos, along with associated audio tracks, Salt Lake Temple are preserved digitally. photographed by Henok Montoya

1

The Church’s advanced audiovisual Typically, about 40% is targeted for capabilities are also used to support weekly preservation. In addition, a multi-petabyte broadcasts of Music and the Spoken Word— backlog is currently being ingested into the the world’s longest continuous network digital preservation system. broadcast (now in its 83rd year). In just ten years, the Media Services Each broadcast features an inspirational Department anticipates that it will have message and music performed by the generated a cumulative archival capacity of Mormon Tabernacle Choir and the more than 100 petabytes for a single copy. Orchestra at Temple Square. The broadcast This means that the Church History is aired live by certain radio and Department’s digital preservation archive for stations and is distributed to approximately audiovisual data will become one of the largest 2000 other stations for delayed broadcast. in the world within a decade. And all the other The Mormon Tabernacle Choir was Church records of enduring value will add to named “America’s Choir” by President the total capacity of this world-class digital and was awarded the archive. (the United States’ highest honor for artistic excellence) by The Family History Department President George W. Bush. And in 2010, the In 1894, the Church of Jesus Christ of Music and the Spoken Word broadcast was Latter-day Saints established the inducted into the National Radio Hall of Genealogical Society of Utah as a nonprofit, Fame. It is no wonder that the Church of educational institution. The Society began Jesus Christ of Latter-day Saints considers to gather family history records from broadcasts of Music and the Spoken Word archives around the world, and then started worthy of digital preservation. to microfilm many of them in 1938. Today, As a gift to the world, the Church of the Society is known as FamilySearch—the Jesus Christ of Latter-day Saints launched a world’s largest family history organization. new website (biblevideos.lds.org) on Operated by the Family History December 4, 2011 that provides free Department of the Church of Jesus Christ of videos of the birth, life, death, and Latter-day Saints, FamilySearch provides resurrection of the Lord Jesus Christ. services free of charge to the public. Viewable with a free mobile app, these Millions of people use FamilySearch videos are faithful to the biblical account. records, resources, and services to learn The text also accompanies each video scene. about their family history. This remarkable gift provides a new and Comprising more than 2.4 million rolls, meaningful way to learn about Jesus Christ. the FamilySearch collection of microfilmed Of course, all these digital videos will be records is the largest collection of family preserved by the Church. history records in the world. The collection The Church’s Media Services contains more than 13.1 billion images of Department, which created the Bible historical and vital records created in more videos, generates multiple petabytes of than 100 countries. production audiovisual data annually.

2

Since 1999, FamilySearch has been Departments collaborated on digital digitizing this enormous microfilm preservation requirements, architectures, collection in order to make the records and potential solutions to meet their more accessible via the Internet. Also, tens individual requirements. of millions of additional records are being In 2008, it was discovered that the photographed around the world with National Library of New Zealand had digital cameras every year. developed a thorough set of business If authorized, these records (both requirements for digital preservation. Steve digitized and digitally photographed) are Knight, Program Director for Preservation published on the FamilySearch website Research and Consultancy at the Library, (.org) as they become was kind enough to share these available. requirements with the Church—which are Once the images are published, they are probably the most comprehensive available indexed by hundreds of thousands of anywhere, even today. The National volunteers. This global team indexes Library of New Zealand’s requirements millions of names each month. As provided an excellent basis from which permitted, published indexes, along with both Church departments developed their images of records, may be viewed free of own business requirements. charge at the FamilySearch website. One shared requirement was developing Between its microfilm digitization a solution that largely conforms to the pipeline, ongoing digital capture of family Reference Model for an Open Archival history records, and associated indexes, Information System (OAIS).1 This ISO FamilySearch is generating multiple Standard (ISO 14721:2003) highlights petabytes of data each year. All this digital functions and processes that are needed to information must be preserved for future implement an effective digital preservation generations because of its priceless and operation, but does not cover actual enduring value. implementation. It also introduces Within ten years, FamilySearch expects terminology that has become a useful de that it will have generated a cumulative facto standard in the digital preservation archival capacity of more than 100 community. OAIS is universally considered petabytes for a single copy. the industry’s definitive model for Clearly, the FamilySearch digital structuring a digital preservation system. preservation archive will also become one of the Both Church departments recognized the largest in the world within a decade— wisdom of using OAIS as a reference model independent of the Church History to guide both solution evaluation and Department’s digital archive. system architecture. Consequently, the digital preservation systems employed by Architecting the Church’s Digital the Family History and Church History Preservation Systems Departments today are OAIS-compliant.

For several years culminating in 2010, the Family History and Church History

3

Another shared requirement was After several discussions with qualified, minimizing the total cost of ownership of relevant people, concerns over the ability of archival storage. In 2008, an internal study open source repositories to adequately scale was performed to compare the costs of eliminated these potential solutions from acquisition, maintenance, administration, consideration by both Church departments. data center floor space, and power to Alternative solutions considered were archive hundreds of petabytes of digital Tessella Safety Deposit Box (SDB) and Ex records using disk arrays, optical disks, Libris Rosetta. In order to determine if SDB virtual tape libraries, and automated tape and Rosetta would be able to scale to meet cartridges. The model also incorporated Church needs, scalability proof of concept assumptions about increasing storage tests were conducted for both products. densities of these different storage The Family History Department technologies over time. conducted a pilot program, with Tessella Calculating all costs over a ten year assistance, to evaluate SDB capabilities. The period, the study concluded that the total Rosetta evaluation included joint scalability cost of ownership of automated tape testing between Ex Libris and the Church cartridges would be 33.7% of the next History Department. Results of this testing closest storage technology (which was disk have been published on the Ex Libris arrays). Concerns about system costs and website (exlibrisgroup.com). The white long term bit rot on hard disks eliminated paper is titled “The Ability to Preserve a MAID (Massive Array of Idle Disks) Large Volume of Digital Assets—A Scaling technology as a potential archival storage Proof of Concept.” medium for the Church. Results of the scalability tests indicated Today, the Church History Department that both products would be able to meet and the Family History Department both Church needs. use automated tape cartridges for archival Next, the Family History Department storage. LTO-4, LTO-5, IBM TS1140, and made a side-by-side comparison of product StorageTek T10000C tape drives are features and capabilities. After summing currently in production. the comparison results, it was determined A third requirement the two that either product would meet the needs of departments shared was scalability. As the Family History Department. By then, mentioned previously, both departments the Church History Department had expect to have accumulated more than 100 implemented CHIPS (Church History petabytes for a single copy of digital Interim Preservation System) using Rosetta records within a decade. for a more comprehensive test. Such large archives require a system In the end, the choice between Rosetta architecture that enables rapid scaling of and SDB was made by each department automated ingest, archive storage capacity, according to its own business requirements access, and periodic validation of archive (not technical requirements). data integrity.

4

To illustrate, the Family History After investigating a host of potential Department had a goal of developing a storage layer solutions, the Church History digital preservation reference platform that Department chose NetApp StorageGRID to could be shared with FamilySearch partner provide the Information Lifecycle archives as open source software. The Ex Management (ILM) capabilities that were Libris business model did not support such desired. In particular, StorageGRID’s data a joint endeavor, but Tessella indicated a integrity, data resilience, and data willingness to participate. replication capabilities were attractive. As a result, the Family History In order to support ILM migration of Department decided to use SDB as the AIPs from disk to tape, StorageGRID foundation for its Digital Preservation utilizes IBM Tivoli Storage Manager (TSM) System (DPS) Version 3. DPS Version 1 was as an interface to tape libraries. the SDB proof of concept, and DPS Version DRPS also employs software extensions 2 (also called iPRES) was an internally developed by Church Information and developed interim preservation system for Communications Services (shown in the FamilySearch. iPRES tapped into the red boxes below). These software FamilySearch Digital Pipeline to make dual, extensions will be discussed later. high resolution copies of family history record collections coming down the pipeline. Meanwhile, the Church History Department completed the CHIPS proof of concept test with successful results and so decided to move forward with Rosetta as the foundation for its Digital Records Preservation System (DRPS). Neither SDB nor Rosetta provides a storage layer to function as a digital archive with replication capabilities. Both products provide configurable preservation workflows and advanced preservation planning functions, but they only write a single copy of an Archival Information Architecture of the Church History Department’s Package1 (AIP—the basic archival unit) to Digital Records Preservation System (DRPS) an NFS storage device (assumed to be online disk storage) for permanent storage. DRPS is a dark archive—meaning that An appropriate storage layer must be delivery of requested records is only integrated with each product in order to provided to the department or institution provide the full capabilities of a digital that produced the records. Likewise, preservation archive, including AIP records in the archive may only be replication. discovered by the organization that

5 produced the records. Furthermore, estimation of the amount of data corruption authorized access requests are serviced by that will occur in a digital preservation tape DRPS staff; thus producers of records do archive. Therefore, working solutions must not have direct access to the DRPS archive. be implemented to detect and correct these This arrangement enhances security of the bit errors. archive. DRPS Solutions to Data Corruption Data Corruption in a Digital In order to continuously ensure data Preservation Archive integrity of its tape archive, DRPS employs A critical requirement of a digital fixity information. preservation system is the ability to Fixity information is a checksum (i.e., continuously ensure data integrity of its integrity value) calculated by a secure hash archive. This requirement differentiates a algorithm to ensure data integrity of an AIP tape archive from other tape farms. file throughout preservation workflows and Modern IT equipment, including servers, after the file has been written to the archive. storage, network switches and routers, By comparing fixity hash values before incorporate advanced features to minimize and after records are written, transferred data corruption. Nevertheless, undetected across a network, moved or copied, DRPS errors still occur for a variety of reasons. can determine if data corruption has taken Whenever data files are written, read, place during the workflow or while the AIP stored, transmitted over a network, or is stored in the archive. DRPS uses a variety processed, there is a small but real of hash values, cyclic redundancy check possibility that corruption will occur. values, and error-correcting codes for such Causes range from hardware and software fixity information. failures to network transmission failures In order to implement fixity information and interruptions. Bit flips (also called bit as early as possible in the preservation rot) within data stored on tape also cause process, and thus minimize data errors, data corruption. DRPS provides ingest tools developed by Recently, data integrity of the entire Church Information and Communications DRPS tape archive was validated. This Services (ICS) that create SHA-1 fixity validation run encountered a 3.3x10-14 bit information for producer files before they error rate. are transferred to DRPS for ingest (see the Likewise, the USC Shoah Foundation DRPS architecture shown previously). Institute for Visual History and Education Within Rosetta, SHA-1 fixity checks are has observed a 2.3x10-14 bit error rate within performed three times—(i) when the its tape archive, which required the deposit server receives a Submission preservation team to flip back 1500 bits per Information Package1 (SIP), (ii) during the 8 petabytes of archive capacity.2 SIP validation process, and (iii) when AIP These real life measurements—one taken files are moved to permanent storage. from a large archive and the other from a relatively small archive—provide a credible

6

Rosetta also provides the capability to and alteration of files that are written to the perform fixity checks on AIP files after they grid. have been written to permanent storage, The highest level domain utilizes the but the ILM features of StorageGRID do not SHA-1 fixity information discussed utilize this capability. Therefore, previously. A SHA-1 hash value is StorageGRID must take over control of the generated for each AIP (or object) that fixity information once the files have been Rosetta writes to permanent storage (i.e., to ingested into the grid. StorageGRID). Also called the Object Hash, By collaborating with Ex Libris on this the SHA-1 hash value is self-contained and process, ICS and Ex Libris have been requires no external information for successful in making the fixity information verification. hand off from Rosetta to StorageGRID. Each object contains a SHA-1 object hash This is accomplished with a web service of the StorageGRID formatted data that developed by ICS that retrieves SHA-1 hash comprise the object. The object hash is values generated independently by generated when the object is created (i.e., StorageGRID when the files are written to when the gateway node writes it to the first the StorageGRID gateway node. Ex Libris storage node). To assure data integrity, the developed a Rosetta plug-in that calls this object hash is verified every time the object web service and compares the StorageGRID is stored and accessed. SHA-1 hash values with those in the Furthermore, a background verification Rosetta database, which are known to be process uses the SHA-1 object hash to correct. verify that the object, while stored on disk, StorageGRID provides an HTTP API that has neither become corrupt nor has been automatically returns its SHA-1 hash values altered by tampering. when called, but this API cannot be used at Underneath the SHA-1 object hash the present time because Rosetta only domain, StorageGRID also generates a writes to permanent storage using POSIX Content Hash when the object is created. commands. However, Ex Libris has Since objects consist of AIP content data committed to the Church a future plus StorageGRID metadata, the content enhancement to Rosetta that will utilize the hash provides additional protection for AIP StorageGRID HTTP API and thus eliminate files. Because the content hash is not self- the need for the ICS-developed web service. contained, it requires external information This will be a more elegant solution to fixity for verification, and therefore is checked information hand off between Rosetta and only when the object is accessed. StorageGRID. Each StorageGRID object has a third and Turning now to the storage layer of fourth domain of data protection applied, DRPS, StorageGRID is constructed around and two different types of protection are the concept of object storage. To ensure utilized. object data integrity, StorageGRID provides First, a cyclic redundancy check (CRC) a layered and overlapping set of protection checksum is added that can be quickly domains that guard against data corruption computed to verify that the object has not

7 been corrupted or accidentally altered. This The process begins when the TSM client CRC provides a verification process that appends a CRC value to file data that is to minimizes resource use, but is not secure be sent to the TSM server during a client against deliberate alteration. session. As part of this session, the TSM Second, a hash-based message server performs a CRC operation on the authentication code (HMAC) message data and compares its value with the value authentication digest is appended. This calculated by the client. message digest can be verified using the Such CRC value checking continues until HMAC key that is stored as part of the the file has been successfully sent over the metadata managed by StorageGRID. network to the TSM server—with its data Although the HMAC message digest takes integrity validated. more resources to implement than the CRC Next, the TSM server calculates and checksum described above, it is secure appends a CRC value to each logical block against all forms of tampering as long as of the file before transferring it to a tape the HMAC key is protected. drive for writing. Each appended CRC is The CRC checksum is verified during called the “original data CRC” for that every StorageGRID object operation—i.e., logical block. store, retrieve, transmit, receive, access, and When the tape drive receives a logical background verification. But, as with the block, it computes its own CRC for the data content hash, the HMAC message digest is and compares it to the original data CRC. If only verified when the object is accessed. an error is detected, a check condition is Once a file has been correctly written to a generated, forcing a re-drive or a StorageGRID storage node (i.e., its data permanent error—effectively guaranteeing integrity has been ensured through both protection of the logical block during SHA-1 object hash and CRC fixity checks), transfer. StorageGRID invokes the TSM Client In addition, as the logical block is loaded running on the archive node server in order into the tape drive’s main data buffer, two to write the file to tape. other processes occur— As this happens, the SHA-1 (object hash) (1) Data received at the buffer is cycled fixity information is not handed off to TSM. back through an on-the-fly verifier that Rather, it is superseded with new fixity once again validates the original data CRC. information composed of various cyclic Any introduced error will again force a re- redundancy check values and error- drive or a permanent error. correcting codes that provide TSM end-to- (2) In parallel, a Reed-Solomon error- end logical block protection when writing the correcting code (ECC) is computed and file to tape. appended to the data. Referred to as the Thus the DRPS fixity information chain “C1 code,” this ECC protects data integrity of control is altered when StorageGRID of the logical block as it goes through invokes TSM. Nevertheless, validation of additional formatting steps—including the the file’s data integrity continues seamlessly addition of an additional ECC, referred to until it is written to tape. as the “C2 code.”

8

As part of these formatting steps, the C1 that described previously, to ensure code is checked every time data is read integrity of the data as it is written to the from the data buffer. Thus, protection of the StorageGRID storage node. original data CRC is essentially From there, StorageGRID fixity checking transformed to protection from the more occurs, as explained previously for object powerful C1 code. access—including content hash and HMAC Finally, the data is read from the main message digest checking—until the data is buffer and is written to tape using a read- transferred to Rosetta for delivery to its while-write process. During this process, the requestor, thus completing the DRPS data just written data is read back from tape and integrity validation cycle. loaded into the main data buffer so the C1 code can be checked once again to verify DRPS Data Integrity Validation the written data. SHA-1 DRPS Ingest Tools SHA-1 created for producer files A successful read-while-write operation control assures that no data corruption has SHA-1 SHA-1 checked upon ingest control and write to permanent storage occurred from the time the file’s logical Web service retrieves StorageGRID block was transferred from the TSM client Storage Extensions SHA-1, then Rosetta plug-in compares with Rosetta SHA-1 until it is written to tape. And using these SHA-1 StorageGRID SHA-1 created for ingested files control SHA-1 and other fixity checked ECCs and CRCs, the tape drive can validate during write to storage nodes CRCs, IBM TSM end-to-end logical block logical blocks at full line speed as they are ECCs Tivoli Storage Manager protection being written! During a read operation (i.e., when Rosetta accesses an AIP), data is read from Summary of the DRPS data integrity validation cycle the tape and all three codes (C1, C2, and the original data CRC) are decoded and Ensuring Ongoing Data Integrity checked, and a read error is generated if Unfortunately, continuously ensuring any process indicates an error. data integrity of a DRPS AIP does not end The original data CRC is then appended once the AIP has been written correctly to to the logical block when it is transferred to tape. Periodically, the tape(s) containing the the TSM server so it can be independently AIP needs to be checked to uncover errors verified by that server, thus completing the (i.e., bit flips) that may have occurred since TSM end-to-end logical block protection the AIP was correctly written. cycle. Fortunately, IBM LTO-5 and TS1140 tape This advanced and highly efficient TSM drives can perform this check without end-to-end logical block protection having to stage the AIP to disk, which is capability is enabled with state-of-the-art clearly a resource intensive task—especially functions available with IBM LTO-5 and for an archive with a capacity measured in TS1140 tape drives. hundreds of petabytes! When the TSM server sends the data over the network to a TSM client, CRC checking is done once again, in reverse of

9

IBM LTO-5 and TS1140 drives can What is needed is the ability to transform perform data integrity validation in-drive, “ordered by AIP” to “ordered by file per which means a drive can read a tape and tape.” concurrently check the AIP logical block ICS, Ex Libris, NetApp, and IBM are CRC and ECCs discussed above (C1, C2, currently collaborating on a solution to and the original data CRC). Good or bad provide such optimization. The solution status is reported as soon as these internal will eliminate tape thrashing and thus checks are completed. And this is done optimize tape as an effective archival without requiring any other resources! storage medium for DRPS. Clearly, this advanced capability enhances the ability of DRPS to perform DPS V3 Storage Layer and the periodic data integrity validations of the FamilySearch Digital Pipeline entire archive more frequently, which will After exploring in detail several potential facilitate the correction of bit flips that storage layer solutions, the Family History occur after AIPs are written correctly to Department decided to build its DPS V3 tape. storage layer from scratch. This SDB tape Optimizing the Use of Tape with adapter is currently being developed by FamilySearch software engineers. It is a DRPS single-tier storage layer custom built to Since Rosetta was originally designed to support DPS V3 integration with the work with permanent storage implemented FamilySearch Digital Pipeline and to as a network file system on spinning disks, provide tape optimization as discussed its process automation module relies on above. AIP files being readily available for online, Introduced previously, the FamilySearch random retrieval. Hence Rosetta automated Digital Pipeline deals with all aspects of processes are ordered (i.e., scheduled) by digitizing and capturing digital images; AIP, since there is no need to optimize also transporting them via truck or digital access to files stored on disk. transmission; plus auditing image quality; This approach is somewhat problematic and performing image skew and cropping, for the DRPS tape-based archive, however. file format conversion, and metadata The reason is that the most efficient way to capture. Fixity information is generated at automate a process using streaming media image capture time, for both microfilm is to read and/or write any tape only once. scanners and field cameras, and is This requires a different scheduling scheme maintained throughout the pipeline so the than ordering by AIP, since multiple AIPs images can be correctly written to the DPS may be written to the same tape, or an AIP archive. may have its files spread over multiple Hundreds of millions of images are tapes. Either situation will cause processed every year through the undesirable tape thrashing as the AIP- FamilySearch Digital Pipeline, which is ordered process runs. largely automated.

10

During pipeline processing, a digital CRC fixity information discussed below, to artifact transfer service writes the images 5 TB cartridges (native capacity) using and required metadata (including fixity StorageTek T10000C tape drives. information) to transfer tapes. Transfer tapes are then transported to data center facilities across North America for the purposes of (i) publishing lower resolution versions of the images to the FamilySearch website and (ii) preserving high resolution versions of the images in DPS V3. When a preservation transfer tape is received at a Church preservation facility, it is inserted into a StorageTek SL8500 Modular Library System and the operator runs a program to instruct DPS V3 to load the tape into a drive and identify it in the Architecture of the Family History Department’s DPS transfer tape database. Digital Preservation System Version 3 Once identified as a new transfer tape, a (DPS V3) DPS V3 software process known as a As the AIPs are written, the drives TapeWorker reads the contents of the tape operating in Data Integrity Validation into temporary disk storage and starts a Mode check the CRC fixity information Safety Deposit Box (SDB) ingest workflow included with the AIPs, thus providing for each collection of images on the tape. SBD tape adapter end-to-end data integrity Multiple TapeWorker processes may run in protection as the AIPs are written to tape. parallel to scale throughput. Using the T10000C Verify command, Like Rosetta, SDB utilizes configurable in-drive data validation can be performed workflows. Multiple SDB JobQueue for a single file, multiple files, an AIP, or processes run simultaneously to ingest the the entire tape. Thus, T10000C drives incoming SIPs and package them as AIPs. provide DPS V3 with the ability to As part of the ingest process, file validation periodically check data integrity of DPS V3 services perform fixity checks using the AIPs—even the entire archive—without fixity information that was created having to stage the AIPs to disk. previously in the pipeline. The validation In summary, DPS V3 provides data services also perform file format integrity validation capabilities similar to characterization for individual files. Church History’s DRPS. When a JobQueue process completes, it Public access to records in the DPS V3 creates an entry in a Java Message Service open archive will be governed by digital queue and tells SDB that the AIPs have access rights of the individual records. The been written to disk. The SDB tape adapter FamilySearch Digital Pipeline’s Role Based then runs a TapeWriter to pull AIPs from Access Control services will enable this disk and write them, along with additional governance.

11

Such access requests are expected to be (hundreds of gigabytes in size) produced relatively infrequent, however, since by Media Services. FamilySearch publishes most records that These audiovisual files are packaged will be archived in DPS V3 on its website, with the MXF (Material Exchange Format) albeit at lower resolution. container format. As explained previously, a single General Conference session MXF Other Preservation Solutions SIP consists of the Conference video plus 96 A digital preservation best practice is to audio tracks. Also, an American Sign preserve records at the highest resolution Language video is included. affordable and practical. Most of the still Ideally, the Church History Department images produced by the Church, whether would like DRPS to do file validation of the they come from the FamilySearch Digital components within the MXF wrapper and Pipeline or the Media Services Department, extract technical metadata from them. are created as TIFF (Tagged Image File Unfortunately, no tools are known to exist Format) files. While TIFF provides very at the present time that will do all this. high resolution, it also consumes The open source tool MediaInfo can considerable archive storage capacity. extract technical metadata, but only from To provide image resolution equivalent the MXF wrapper itself. Therefore, ICS to TIFF while significantly reducing archive engineers developed an MXF Extraction capacity, both the Family History and Tool that allows extraction of technical Church History Departments preserve still metadata from the components within the images in the lossless JPEG 2000 file format. MXF wrapper. This tool is currently being Tests conducted by FamilySearch used to ingest MXF SIPs. software engineers have shown that the Rosetta does not presently support original bit stream of a TIFF image can be ingest of repeating tracks (such as multiple recreated from its lossless JPEG 2000 audio and/or video tracks). In order to archival version. Yet storage capacity ingest MXF SIPs such as General reduction benefits ranging from 50% to 60% Conference sessions that have such or more are consistently realized with repeating tracks, the MXF Extraction Tool lossless JPEG 2000. described above concatenates data from With more than 300 million images each track in the metadata it extracts. This archived annually in the lossless JPEG 2000 method of ingest is acceptable to the format, FamilySearch is likely the world’s Church History Department until Ex Libris largest user of the JPEG 2000 format, provides Rosetta ingest support for according to industry expert Dr. Robert repeating tracks. Buckley.3 In the meantime, to help other While the Family History Department is institutions deal with this challenge, the focused on increasing the number of still Church History Department has made a images it preserves annually, the Church modified version of the MXF Extraction History Department is focused on Tool available to Ex Libris for distribution preserving very large audiovisual files to Rosetta customers. The modified version

12 limits extraction to one video and one audio growing collection of priceless digital track, which meets the needs of these records to be preserved. institutions. The active archive will be the repository used to send copies of preserved Dual, State-of-the-Art Digital records to authorized requestors. Both Archive Facilities facilities will archive the same records, but the deep archive will be used for access Many Church physical records and only if the active archive is unable to artifacts of priceless value, including the service an authorized access request. FamilySearch microfilm collection, are These redundant, state-of-the-art archive currently stored in the Granite Mountain facilities will help the Church safely and Records Vault. This unique facility features securely preserve digital records of exalting and six tunnels that are bored into the side of a priceless value so they can be shared with the solid granite mountain (one of several world—now and into the future. prominent mountains that surround and protect the Salt Lake Valley). The tunnels Conclusion have ambient conditions that are naturally conducive to records preservation. The Church of Jesus Christ of Latter-day In order to efficiently preserve the Saints is making a considerable investment burgeoning capacity of digital records the to build the complementary, world-class Church is generating, plans have been digital archives and facilities described in completed to renovate the Granite this paper. The benefits of preserving the Mountain Records Vault in order to equip Church’s exalting and inspiring records two of the vaults exclusively for digital cannot be measured in financial terms, preservation storage media. however. Those benefits include building These special vaults, which will become character, strengthening families, linking the Church’s “deep” digital archive, will be people of all nationalities, races, and physically isolated from the other four religions with their ancestors, and vaults in the facility. Such isolation will preserving the heritage of mankind—all of allow archival environmental conditions for which are designed to foster both personal magnetic tape to be maintained consistently and family happiness. for the automated tape cartridges that will be housed in the vaults. Isolation will also References provide a high degree of physical security. 1 CCSDS 650.0-B-1BLUE BOOK, “Reference An “active” preservation facility, to be Model for an Open Archival Information located in a different disaster zone, is also System (OAIS),” Consultative Committee for planned. This building will provide Space Data Systems (2002) environmental, fire protection, and security 2 Private conversation with Gustman facilities similar to the deep archive in the (CTO) at the USC Shoah Foundation Institute Granite Mountain Records Vault, and will August 19, 2009 allow more copies of the Church’s rapidly 3 Private conversation with Dr. Robert Buckley August 23, 2011 ([email protected]) 13