FILE FORMATS

South Carolina Department of Archives & History
www.state.sc.us/scdah/erg/erg.htm
January 2005, Version 1 — FF

Summary

Rapid changes in technology mean that file formats can become obsolete quickly and cause problems for your records management strategy. A long-term view and careful planning can overcome this risk and ensure that you can meet your legal and operational requirements.

Legally, your records must be trustworthy, complete, accessible, admissible in court, and durable for as long as your approved records retention schedules require. For example, you can convert a record to another, more durable format (e.g., from a nearly obsolete software program to a text file). That copy, as long as it is created in a trustworthy manner, is legally acceptable.

The software in which a file is created usually has a default format, often indicated by a file name suffix (e.g., *.PDF for portable document format). Most software allows authors to select from a variety of formats when they save a file (e.g., document [DOC], rich text format [RTF], text [TXT]). Some software, such as Adobe Acrobat, is designed to convert files from one format to another.

Legal Framework

For more information on the legal framework you must consider when selecting digital file formats, refer to the chapter Records Management in an Electronic Environment in the Electronic Records Management Guidelines and Appendix A6 of the Trustworthy Information Systems Handbook. Also review the requirements of the:

◆ Public Records Act [PRA] (Code of Laws of South Carolina, 1976, Section 30-1-10 through 30-1-140, as amended) available at www.scstatehouse.org/code/t30c001.htm, which supports government accountability by mandating the use of retention schedules to manage records of South Carolina public entities. This law governs the management of all records created by agencies or entities supported in whole or in part by public funds in South Carolina. Section 30-1-70 establishes your responsibility to protect the records you create and to make them available for easy use. The act does not discriminate between media types. Therefore, records created or formatted electronically are covered under the act.

Proprietary, Non-proprietary, Open Standard and Open Source File Formats

◆ Proprietary formats. Proprietary file formats are controlled and supported by just one software developer. The Microsoft Word (.DOC) format is an example.
◆ Non-proprietary formats. These formats are supported by more than one developer and can be accessed with different software systems. For example, eXtensible Markup Language (XML) is becoming an increasingly popular non-proprietary format.
◆ Open Source formats. In general, open source refers to any program whose source code is made available for use or modification as users or other developers see fit. Open source software may be developed, modified and distributed by independent software companies for profit. The Linux operating system is an example.
◆ Open Standard formats. Open standard software formats are created using publicly available specifications. Although software source codes remain proprietary, the availability of the standard increases compatibility by allowing other developers to create hardware and software solutions that interact with, or substitute for, other software. The Portable Document Format (.PDF) is based on an open standard.

Types

There are hundreds of file formats used to encode digital information. Below are brief descriptions of the basic files you are likely to encounter. Use the resources in the Annotated List of Resources for more detailed information on specific file formats. Basic file format types include:

◆ Text files. Text files are most often created in word processing software programs. Common file formats for text files include:
  — Proprietary formats, such as Microsoft Word files and WordPerfect files, which carry the extension of the software in which they were created.
  — Rich Text Format (RTF) files, which are supported by a variety of applications and saved with formatting instructions (such as page layout).
  — Portable Document Format (PDF) files, which contain an image of the page, including text and graphics. PDF files are widely used for read-only file sharing and printing. Adobe Acrobat is by far the most popular PDF software, although other types are available. Reading a PDF file requires viewer software such as Acrobat Reader, which is available at no charge.

◆ Graphics files. Graphics files store an image (e.g., photograph, drawing) and are divided into two basic types:
  — Vector-based files that store the image as geometric shapes described by mathematical formulas, which allow the image to be scaled without distortion. Common types of vector-based file formats include:
    – Drawing Interchange Format (DXF) files, which are widely used in computer-aided design software programs, such as those used by engineers and architects.
    – Encapsulated PostScript (EPS) files, which are widely used in desktop publishing software programs.
    – Computer Graphics Metafile (CGM) files, which are widely used in many image-oriented software programs (e.g., Photoshop) and offer a high degree of durability.
    – Shapefiles (SHP), which ESRI GIS applications use to store non-topological geometry and attribute information for spatial features as vector coordinates.
  — Raster-based files that store the image as a collection of pixels. Raster graphics are also referred to as bitmapped images. Raster graphics cannot be scaled without distortion. Common types of raster-based file formats include:
    – Bitmap (BMP) files, which are uncompressed, relatively low-quality files used most often in word processing applications.
    – Tagged Image File Format (TIFF) files, which are widely usable in many different software programs. TIFF files are either uncompressed or compressed using a lossless algorithm.
    – Graphics Interchange Format (GIF) files, which are widely used for Internet applications. GIF is a lossless compression format but is limited to 256 colors or fewer.
    – Joint Photographic Experts Group (JPEG) files, which are used for full-color or gray-scale images. Used primarily for photographs, the standard JPEG format uses a lossy compression algorithm that discards some information to achieve a smaller file size.
    – Portable Network Graphics (PNG) files, which use a lossless compression format designed to replace GIF. PNG is completely patent and license free and is of higher quality than GIF.

◆ Data files. Data files are created in database software programs. Data files are divided into fields and tables that contain discrete elements of information. The software builds the relationships between these discrete elements. For example, a customer service database may contain customer name, address, and billing history fields. These fields may be organized into separate tables (e.g., one table for all customer name fields). You may convert data files to a text format, but you will lose the relationships among the fields and tables. For example, if you convert the information in the customer database to text, you may end up with ten pages of names, ten pages of addresses, and a thousand pages of billing information, with no indication of which information is related.

◆ Spreadsheet files. Spreadsheet files store the value of the numbers in their cells, as well as the relationships of those numbers. For example, one cell may contain the formula that sums two other cells. Like data files, spreadsheet files are most often in the proprietary format of the software program in which they were created. Some software programs can import and export data from other sources, including software programs designed for such data sharing (e.g., Data Interchange Format [DIF]). Spreadsheet files can be exported as text files, but the value and relationship of the numbers are lost.

◆ Video and audio files. These files contain moving images (e.g., digitized video, animation) and sound data. They are most often created and viewed in programs and stored in proprietary formats. Common file formats in use include QuickTime, Moving Picture Experts Group (MPEG) formats, and Real Video.

◆ Markup languages. Markup languages, also called markup formats, contain embedded instructions for displaying or understanding the content of the file. They provide the means to transmit and share information over the web. The World Wide Web Consortium (W3C) (www.w3c.org) supports these standards. Common markup language file formats include the following:
  — Standard Generalized Markup Language (SGML), a common markup language used in government offices worldwide, is an international standard. HTML and XML are derived from SGML.
  — Hypertext Markup Language (HTML) is used to display most of the information on the World Wide Web. Because presentation is combined with content through the use of pre-defined tags, HTML is simple to use but limited in scope. Other markup languages such as XHTML and XML offer greater flexibility.
  — eXtensible Hypertext Markup Language (XHTML) combines the flexibility found in XML with the ease of use associated with HTML. Strict XHTML rules improve consistency and provide the ability to create your own markup tags. Because they share similar rules, converting XHTML into XML is easier than converting HTML into XML.
  — eXtensible Markup Language (XML) is a relatively simple language based on SGML that is gaining popularity for managing and sharing information. XML provides even greater flexibility and control than XHTML while avoiding the complexities associated with SGML.

For additional information on file formats see the Digital Imaging guidelines.
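
The warning about exporting spreadsheet files to text can be illustrated with a small sketch. It is not taken from any particular spreadsheet program: the three-cell sheet, the `evaluate` helper, and the `export_as_text` function are hypothetical names made up for this example, and the formula handling assumes only simple additions of the form `=A1+B1`.

```python
import csv
import io

# Hypothetical three-cell spreadsheet: C1 holds a formula that sums A1 and B1.
cells = {"A1": 100, "B1": 250, "C1": "=A1+B1"}


def evaluate(cell_value, sheet):
    """Return what a cell displays, resolving a simple '=X+Y' formula."""
    if isinstance(cell_value, str) and cell_value.startswith("="):
        # Assumption: formulas are plain sums of cell references.
        return sum(sheet[ref] for ref in cell_value[1:].split("+"))
    return cell_value


def export_as_text(sheet):
    """Write only the displayed contents of each cell as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([evaluate(sheet[name], sheet) for name in sorted(sheet)])
    return buffer.getvalue()


text_copy = export_as_text(cells)
print(text_copy.strip())
```

The text copy holds only the characters that were displayed in each cell. Nothing in it records that the third figure was calculated from the first two, so changing the first figure in the text file would not update the third. This is the kind of relationship loss to weigh before choosing text as a conversion target.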

Table 1 summarizes the common file formats.

Table 1: Common File Formats (contains both proprietary and non-proprietary formats)

File Format Type | Common Formats | Example Applications | Description
Text | PDF, RTF, TXT, DOC, WPD | Letters, reports, memos, e-mail messages saved as text | Created or saved as text (may include graphics)
Vector graphics | DXF, EPS, CGM, SHP | Architectural plans, complex illustrations, GIS | Store the image as geometric shapes in a mathematical formula for undistorted scaling
Raster graphics | TIFF, BMP, GIF, JPEG, PNG | Web page graphics, simple illustrations, photographs | Store the image as a collection of pixels which cannot be scaled without distortion
Data file | Proprietary to software program | Human resources files, mailing lists | Created in database software programs
Spreadsheet file | Proprietary to software program, DIF | Financial analyses, statistical calculations | Store numerical values and calculations
Video and audio files | QuickTime, MPEG, Real Networks, WMV, WAV, MP3 | Short video to be shown on a web site, recorded interview to be shared on CD-ROM | Contain moving images and sound
Markup languages | SGML, HTML, XHTML, XML | Text and graphics to be displayed on a web site | Contain embedded instructions for displaying and understanding the content of a file or multiple files
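
As a rough illustration of how Table 1 might be used when taking stock of existing records, the sketch below sorts file names into the table's format types by their file name suffix. The `FORMAT_TYPES` mapping and the `classify` function are hypothetical names for this example. The mapping treats each abbreviation in Table 1 as a file name suffix, which is an assumption, and it leaves out entries that the table does not give an abbreviation for (QuickTime, Real Networks, and formats proprietary to a software program).

```python
from pathlib import Path

# Format types and abbreviations from Table 1, treated here as file name
# suffixes. That reading is an assumption made for illustration.
FORMAT_TYPES = {
    "text": {"pdf", "rtf", "txt", "doc", "wpd"},
    "vector graphics": {"dxf", "eps", "cgm", "shp"},
    "raster graphics": {"tiff", "bmp", "gif", "jpeg", "png"},
    "spreadsheet file": {"dif"},
    "video and audio files": {"mpeg", "wmv", "wav", "mp3"},
    "markup languages": {"sgml", "html", "xhtml", "xml"},
}


def classify(filename):
    """Return the Table 1 format type for a file name, or 'unknown'."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    for format_type, suffixes in FORMAT_TYPES.items():
        if suffix in suffixes:
            return format_type
    return "unknown"


for name in ["site-plan.DXF", "memo.rtf", "budget.dif", "notes"]:
    print(name, "->", classify(name))
```

A suffix is only a hint about the format, since files can be renamed, so a result of "unknown" or an unexpected type is a prompt to inspect the file rather than a final answer.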

Preservation: Conversion and Migration

Your most basic decision about file formats will be whether you want to convert and/or migrate your file formats. If you convert your records, you will change their formats, perhaps to a software-independent format. If you migrate your records, you will move them to another platform or storage medium, without changing the file format. However, you may need to convert records in order to migrate them to ensure that they remain accessible. For example, if you migrate records from a Macintosh operating system to a Microsoft Windows operating system, you need to convert the records to a file format that is accessible in a Windows operating system (e.g., RTF, Word 2000).

You will face three basic types of loss when determining your course of action:

◆ Data. If you lose data, you lose, to a varying degree, the content of the record. Bear in mind that, legally, your records must be complete and trustworthy.
◆ Appearance. You also risk loss of the structure of the record. For example, if you convert all word processing documents to RTF, you may lose some of the page layout. You must determine if this loss affects the completeness of the record. If the structure is essential to understanding the record, this loss may be unacceptable.
◆ Relationships. Another risk is the loss of the relationships of the data in the file (e.g., spreadsheet cell formulas, database file fields). Again, this loss may affect the legal requirement for complete records.

Keep in mind that a copy of a record is legally admissible only if it is created in a trustworthy manner and is accurate, complete, and durable.

Compression

As part of your strategy, you may choose to compress your files. The pros and cons are summarized in Table 2 below.

Table 2: Pros and Cons of File Compression

Pros | Cons
Saves storage space | May result in data loss
More quickly and easily transmittable | Introduces an additional layer of software dependency (the compression software)

The greatest challenge in compressing files is that you may lose data. Compression options vary in their degree of data loss. Some are intentionally "lossy," such as the JPEG format, which relies on the human eye to fill in the missing detail. Others are designed to be "lossless." You may choose to compress some files and not others.

Importance of Planning

The challenges of preservation can be overcome with good planning. Use the resources in the Annotated List of Resources. Review the decision tree on page 29 in the Guidelines on Best Practices for Electronic Information white paper for preliminary planning and use the CLIR workbook in Risk Management of Digital Information: A File Format Investigation to assess your unique situation and risk. Thoroughly discuss the "Suggestions for Better File Format Decisions" listed below, to weigh the specific pros and cons of each suggestion for your agency.

Suggestions for Better File Format Decisions

◆ Accessibility. The file format must enable staff members and the public to find and view the record. In other words, you cannot convert the record to a format that is highly compressed and easy to store, but inaccessible.
◆ Longevity. Developers should support the file format long-term. If the file format will not be supported long-term, you risk having records that are not durable, because the software to read or modify the file may not be available. Records can be migrated or converted if you determine a file format is no longer supported. Open source, open standard and non-proprietary formats are preferable to completely proprietary ones.
◆ Accuracy. If you convert your records, the file format you convert to should result in records that have an acceptable level of data, appearance, and relationship loss.
◆ Completeness. If you convert your records, the file format you convert to should meet your operational and legal objectives for acceptable degree of data, appearance, and relationship loss.
◆ Flexibility. The file format needs to meet your objectives for sharing and using records. For example, you may need to frequently share copies of the records with another agency, use the records in your daily work, or convert and/or migrate the records later. If the file format can only be read by specialized hardware and/or software, your ability to share, use, and manipulate the records is limited.

Annotated List of Resources

Primary Resources

DLM Forum. Guidelines on Best Practices for Electronic Information. Luxembourg: European Communities, 1997.
europa.eu.int/ISPO/dlm/documents/guidelines.
This white paper was published by the DLM Forum, an organization of records management experts from the Member States of the European Union and the European Commission. The paper provides a basic overview of the file formats in use worldwide. Topics include the information life cycle; the design, creation, and maintenance of electronic records; short-term and long-term access; and accessing and sharing information.

Lawrence, G.W., W.R. Kehoe, O.Y. Rieger, et al. Risk Management of Digital Information: A File Format Investigation. Washington, D.C.: Council on Library and Information Resources, 2000.
www.clir.org/pubs/abstract/pub93abst.html
This publication provides an overview of file format issues related to records management strategies. The publication also provides a comprehensive workbook for users to help them develop a records management strategy.

Additional Resources

Electronic Recordkeeping Resources.
www-personal.si.umich.edu/~calz/ermlinks/ermlinks.htm
This web site provides a comprehensive list of links to other Internet resources related to electronic records management. The site is managed by Cal Lee, who originally constructed it while employed at the Kansas State Historical Society. Topics include security, preservation, access, and technology infrastructure.

South Carolina Department of Archives and History. Trustworthy Information Systems Handbook. Version 1, July 2004.
www.state.sc.us/scdah/erg/tis.htm
This handbook provides an overview for all stakeholders involved in government electronic records management. Topics center around ensuring accountability to elected officials and citizens by developing systems that create reliable and authentic information and records. The handbook outlines the characteristics that define trustworthy information, offers a methodology for ensuring trustworthiness, and provides a series of worksheets and tools for evaluating and refining system design and documentation.

State of Australia. "Management of Electronic Records, 4.0 Electronic Records Format." In Standard for the Management of Electronic Records. Version 1.0. North Melbourne, Australia: State of Victoria, 2000.
www.prov.vic.gov.au/vers/standards/pros9907/99-7s4.htm
www.prov.vic.gov.au/vers/standards/pros9907/99-7toc.htm
This portion of the Australian standards for electronic records management summarizes the desirable characteristics of a file format and what the file format must be able to support. The second URL provides the table of contents for the entire electronic document that discusses all the standards.

World Wide Web Consortium (W3C).
www.w3.org
W3C is a consortium of organizations around the world that develops and promotes common web protocols. The site contains news, specifications, guidelines, software, and tools for web development on a wide variety of topics, including markup languages and transfer protocols.

Cornell University. "Digital Preservation Management: Selecting Short Term Strategies For Long Term Problems."
www.library.cornell.edu/iris/tutorial/dpm/index.html
An online tutorial available from Cornell University Library. The tutorial provides basic information including terms and concepts related to digital preservation.