The Recovery and Preservation of Critical Exploration Datasets for a Large Multinational Oil Company
Guy C. Holmes, BSc, MBA
Chief Executive Officer, SpectrumData
Suite 1, 14 Brodie Hall Drive, Bentley WA 6102
[email protected]

Introduction

In February 2002, a large multinational oil company commissioned a project to consolidate, and in many cases reconstruct, a dataset of approximately 80,000 original magnetic tapes of various ages, formats, media types, and conditions. The collection contained data acquired during 30 years of oil and gas exploration in over 50 countries.

The project requirements were unique for a number of reasons, the most interesting and challenging being that this was the second attempt at the project: a first attempt by another party had failed, leaving portions of the data in jeopardy of being permanently lost, corrupted, or disassociated from their invaluable metadata.

The project involved reading the tapes, consolidating the data into logical datasets, converting the various data types to an industry-standard format, and writing the data to a new set of high-density data cartridges in triplicate. The vast majority of the collection was seismic survey data, the product of the principal methodology used in oil and gas exploration. The tape collection consisted of the following tape types:

− 9-track reel-to-reel tape
− 3480 cartridge
− 3490E cartridge
− 8mm helical scan cartridge
− 4mm DDS DAT cartridge
− Digital Linear Tape (DLT)
− A variety of smaller, less common media types, including DC2120 and DC6150 cartridges and 7-track magnetic tape

The Consequences of Removing or Modifying Blocking Structures From Magnetic Tape Files

As the project had already been attempted once by another party, the first essential element of the task was to establish exactly what had been done prior to our involvement. An initial review found that most of the low-density tapes still to be read were severely damaged and deteriorated. In most cases, the tapes that had not been converted in the previous, failed project represented small portions of a larger dataset that had been successfully copied to higher-density media. For example, a dataset originally recorded on 800 9-track tapes might now reside on 10 DLT IV cartridges, with the exception of 40 of the original 9-track tapes that had not been read due to deterioration or damage.

The higher-density DLT IV cartridges created in the previous project were not one-to-one copies of the original 9-track tapes. Instead, each DLT IV cartridge contained the contents of many individual 9-track tapes, written in an altered, de-blocked format, with only a file mark between the end of one original 9-track tape and the start of the next.

To fully appreciate the complexity of this restoration and migration project, one needs a basic understanding of how data is structured when stored on magnetic tape. Magnetic tape is a linear recording medium: locating a specific record requires reading, or passing over, every record recorded on the tape before it. To reach a record near the end of a tape, the drive may have to read through almost the entire spool. To get to the fifth record on a tape, for example, the drive must read the first four records before it can read the fifth, as the sketch below illustrates.
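As a minimal sketch (not the project's actual tooling), the following Python model treats a tape as an ordered list of records; reaching record N requires passing over every record before it, which is exactly why low-density tape reads are so time-consuming at scale.

    # Minimal model of linear tape access: records can only be reached
    # by moving sequentially past every earlier record on the tape.
    from typing import List, Optional

    def read_record(tape: List[bytes], target_index: int) -> Optional[bytes]:
        """Return record `target_index`, counting the records passed over."""
        passed = 0
        for index, record in enumerate(tape):
            if index == target_index:
                print(f"Reached record {index} after passing {passed} records")
                return record
            passed += 1  # every earlier record must move under the head first
        return None  # target lies beyond the end of the recorded data

    tape = [f"record-{i}".encode() for i in range(5)]
    read_record(tape, 4)  # passes records 0-3 before returning record 4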
To write data to tape, the drive writes sequentially, one record after another along the length of the tape. Data cannot be written to linear tape at an arbitrary location without the risk of overwriting existing data; for all pre-existing data on a tape to remain intact, new data must be written after the end of the existing datasets.

Tape drives write data to tape in blocks. Each block consists of a number of bytes, and the software controlling the tape drive typically determines how many bytes per block it will write. Blocks are separated by inter-record gaps (effectively blank tape). A group of blocks followed by a marker called a file mark constitutes a logical file on the tape, and tape drives use these file marks and inter-record gaps to seek to particular locations for specific data. More than one logical file can be written to a tape, and each may contain many physical files. Logical files contain at least one block of data but typically contain many hundreds or thousands of blocks. In most cases, the software used to read data from tape requires that the data match a defined file and blocking structure before it can be successfully read and interpreted.

To further appreciate the complexity of this project, it is important to understand how even the smallest modification to the blocking structure of a specified data format can directly affect the ability of software to interpret the data. As this project required the conversion of a vast amount of seismic data, I have chosen a tightly specified seismic data format known as SEGB to demonstrate that a small change in blocking structure can have a very large impact on data integrity.

Field Seismic Recording

Exploration companies use the seismic method as their primary means of geo-scientific investigation when exploring for oil. A seismic survey essentially consists of a seismograph, an array of seismic receivers known as geophones, and a synthetic source of seismic energy. When released, this synthetic seismic energy travels through the different layers of the earth and is eventually reflected back to the surface. The geophones measure the time the energy takes to return to the surface and the wavelength of the returning seismic energy. For each burst of seismic energy, a seismic shot record is created and written to tape as a single file. This shot file is typically multiplexed and is generally written to tape as either one or two blocks of data per logical file.

As discussed earlier in this paper, many of the tapes received for this project were duplicates, where data had been copied from many original 9-track tapes onto a single new DLT IV cartridge. Because the capacity of a DLT IV cartridge is much greater than that of an original 9-track tape, it was not uncommon to find that several hundred original 9-track tapes had been copied onto a single cartridge. A critical issue created by this transfer during the first, failed attempt at the project is that none of the original file and blocking structure stored on the 9-track tapes was carried over to the new DLT IV cartridges. Essentially, data from a single 9-track tape consisting of many files, each containing many blocks, was transferred into a single file on a new tape with a different block structure. The sketch below models this file-and-block structure and the effect of such de-blocked copying.
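As a minimal illustration using made-up records, the following sketch models a tape as a stream of data blocks punctuated by file marks, reads one logical file, and then shows what de-blocked copying effectively does: boundaries vanish while the bytes remain.

    # A tape modelled as a stream of data blocks punctuated by file marks.
    # Reading a logical file means consuming blocks until the next file mark;
    # flatten() mimics de-blocked copying, which discards that structure.
    from typing import Iterator, List, Optional

    FILE_MARK = None          # sentinel standing in for a physical file mark
    Block = bytes
    TapeItem = Optional[Block]  # a data block, or FILE_MARK

    def read_logical_file(tape: Iterator[TapeItem]) -> List[Block]:
        """Consume blocks up to (and including) the next file mark."""
        blocks = []
        for item in tape:
            if item is FILE_MARK:
                break
            blocks.append(item)
        return blocks

    def flatten(tape: List[TapeItem]) -> bytes:
        """What de-blocked copying does: boundaries vanish, bytes remain."""
        return b"".join(block for block in tape if block is not FILE_MARK)

    tape = [b"header", b"data", FILE_MARK, b"header", b"data", FILE_MARK]
    print(read_logical_file(iter(tape)))  # [b'header', b'data'] -- one shot file
    print(flatten(tape))                  # b'headerdataheaderdata' -- structure lost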
The removal of the original blocking and file structure during the previous attempt at this project created some interesting and challenging technical issues. True preservation required that the data first be returned to its original recording format, including all vital file and blocking structures. Each seismic shot could then be identified, validated, and preserved before any conversion or migration processes were applied.

For most SEGB seismic field data, the first block of a shot file is referred to as the "header" block and the second as the "data" block. Most software applications that read field seismic data require that the header block be correctly formatted and a specific number of bytes in length. The header block often contains vital information about the data block that follows it on the linear tape, and in most cases a data block in isolation (without a header block) cannot be interpreted by software. The length in bytes of a SEGB header or data block may vary from one shot file to another.

Because the data was binary and had lost its original blocking structure during copying, the resulting file was a stream of bytes that no longer contained the vital blocking structures needed to delineate one shot from another, or one header from another. Instead of 100 seismic shot files, each 960,240 bytes long (a 240-byte header block followed by a 960,000-byte data block), a single new file of 96,024,000 bytes (100 original 960,240-byte files concatenated together) had been created on tape, written with a block length of 10,240 bytes. The original blocking structure of the data was lost, and what had once been only two blocks per file had become a single logical file of over 9,000 blocks. See Figure 1.

Figure 1 – Blocking Structure Changes Through Migration Process

To conventional seismic software, this new data structure would have been completely uninterpretable, as the interpretation of tape data by software depends heavily on the blocking structure of the data itself. SpectrumData developed software routines that navigated the new blocking and file structure of the data and converted it back to its original format; a simplified sketch of the underlying re-blocking idea follows.
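The sketch below assumes the fixed shot geometry of the example above (a 240-byte header followed by a 960,000-byte data block). That is a simplifying assumption: as noted, real SEGB block lengths vary from shot to shot, so the production routines had to derive each shot's geometry by parsing its header. All names here are illustrative, not SpectrumData's actual code.

    # Simplified re-blocking: split a de-blocked byte stream back into
    # (header, data) shot pairs, assuming fixed lengths per the example.
    HEADER_LEN = 240
    DATA_LEN = 960_000
    SHOT_LEN = HEADER_LEN + DATA_LEN  # 960,240 bytes per shot file

    def reblock(stream: bytes):
        """Yield (header, data) pairs recovered from a concatenated stream."""
        if len(stream) % SHOT_LEN != 0:
            raise ValueError("stream length is not a whole number of shots")
        for offset in range(0, len(stream), SHOT_LEN):
            header = stream[offset:offset + HEADER_LEN]
            data = stream[offset + HEADER_LEN:offset + SHOT_LEN]
            yield header, data  # each pair can now be validated and rewritten

    # 100 concatenated shots -> a 96,024,000-byte stream, as in the text
    stream = (b"H" * HEADER_LEN + b"D" * DATA_LEN) * 100
    shots = list(reblock(stream))
    print(len(shots))  # 100 recovered shot files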