DIGITAL DILEMMAS: ARCHIVING E-MAIL
PRE-CONFERENCE WORKSHOP
ASSOCIATION OF CANADIAN ARCHIVISTS
JUNE 10, 2008
Presented By: COLLABORATIVE ELECTRONIC RECORDS PROJECT TEAM Nancy Adgent, Steve Burbeck, Ricc Ferrante, Lynda Schmitz Fuhrig, Darwin Stapleton
DIGITAL DILEMMAS: ARCHIVING E-MAIL WORKSHOP June 10, 2008
9:00 – 10:30
CERP Inception, funding, goals (Darwin Stapleton)
Need for e-mail preservation, why an issue (Ricc Ferrante)
Identifying the issues, developing guidelines (Nancy Adgent)
Results of testing (Nancy Adgent, Lynda Schmitz Fuhrig)
Review workflow and tools (Lynda Schmitz Fuhrig)
Questions
10:30 – 10:45 Break
10:45 – 12:00
Exercise 1: Complete accession and processing forms (Nancy Adgent)
Exercise 2: Convert msg to mbox via Aid4Mail (Nancy Adgent)
Exercise 3: Convert pst to mbox via MessageSave (Lynda Schmitz Fuhrig)
Exercise 4: Start AIP (Lynda Schmitz Fuhrig)
12:00 – 12:15 Questions
12:15 – 1:30 Lunch on your own (McConnell Hall cafeteria is a short walk from the class location)
1:30 – 3:00
Overview of technical issues (Ricc Ferrante)
AIP post parsing (Ricc Ferrante)
Why XML (Ricc Ferrante)
Overview of parser (Steve Burbeck)
How testbed message oddities contributed to development
Collaboration with NC
Demonstration of parser (Steve Burbeck)
Questions
3:00 - 3:15 Break
3:15 – 4:15
Exercise 5: Convert mbox via parser (Steve Burbeck)
Exercise 6: Complete AIP (Lynda Schmitz Fuhrig)
DSpace Introduction (Ricc Ferrante)
Exercise 7: Parse attendee’s messages
Summary (Ricc Ferrante)
4:15 – 4:30 Questions
TABLE OF CONTENTS
Exercise 1: Complete accession and processing forms
Exercise 2: Convert msg to mbox via Aid4Mail
Exercise 3: Convert pst to mbox via MessageSave
Exercise 4: Start AIP
Exercise 5: Convert mbox via parser
Exercise 6: Complete AIP
Exercise 7: Parse attendee’s messages
Appendix A: Forms and Guidelines on CERP website
Appendix B: Application to Use Materials Produced by CERP
Appendix C: CERP Processing Workflow Model
Appendix D: Metadata Narrative Template
Appendix E: METS Sample
Appendix F: EAD Sample
Appendix G: Resources and Related Projects
Appendix H: Software Download Links
This documentation is released by the Collaborative Electronic Records Project under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License, 2008 and 2009. This license can be viewed at http://creativecommons.org/licenses/by-nc-sa/3.0/us/
Citation CERP. (2008). “Digital Dilemmas: Archiving E-Mail.” Sleepy Hollow, NY and Washington DC: The Collaborative Electronic Records Project.
Digital Dilemmas: Archiving E-Mail Collaborative Electronic Records Project
Association of Canadian Archivists June 10, 2008
Dr. Darwin H. Stapleton
Executive Director Rockefeller Archive Center
15 Dayton Avenue Sleepy Hollow, NY 10591 914-631-4505 [email protected] [email protected]
Overview
Funding Purpose Collaboration/Management
Accomplishments "We're good, but not perfect" Serendipity
Riccardo Ferrante
IT Archivist and Electronic Records Program Director
Smithsonian Institution Archives
Capital Gallery Building 600 Maryland Avenue, SW, Suite 3000 Washington, D.C. 20024-2520 202-633-5906 [email protected]
Nancy Adgent
Project Archivist Rockefeller Archive Center
15 Dayton Avenue Sleepy Hollow, NY 10591 914-366-6355 [email protected]
ROCKEFELLER ARCHIVE CENTER
RAC DEPOSITOR CHART
[Chart: RAC depositors. Rockefeller: R. Foundation, R. Family, R. University, R. Brothers Fund, Individuals. Rockefeller Related: General Education Board, Population Council, China Medical Board. Other: Commonwealth Fund, Markle Foundation, Russell Sage Foundation, Foundation Center, Foundation for Child Development.]
Key Survey Findings
No records management policy
No naming standards
No procedures for organizing or saving
Some have no on-site IT staff
Inbox Folder Organization
Inbox – Non Standard File Names
Suggested Subject Names: “Staff Meeting.Minutes.2006.08.02”
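The suggested name reads as topic, document type, and date joined by periods. A minimal Python sketch of that pattern follows; the pattern is inferred from the single example above, and the function name and parameters are hypothetical rather than part of the CERP guidelines.

```python
from datetime import date


def suggested_subject(topic: str, doc_type: str, when: date) -> str:
    """Join topic, document type, and a YYYY.MM.DD date with periods."""
    return f"{topic}.{doc_type}.{when:%Y.%m.%d}"


# prints Staff Meeting.Minutes.2006.08.02
print(suggested_subject("Staff Meeting", "Minutes", date(2006, 8, 2)))
```

Putting the year first in the date means names sort chronologically within a folder.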
ISSUES
Unknown formats
Deteriorating media
Data on portable devices
Native format vs. converting
Upgraded hardware/old media
Obsolete or unsupported software
Duplicates, personal, junk mingled
Information quantity & rate increase
Traditional archival concepts/new era
Best Practices Guidance
E-MAIL GUIDELINES
TRANSFER GUIDELINES
Prepared by the Collaborative Electronic Records Project Rockefeller Archive Center January 2007 This document may be freely used and modified by any non-profit organization.
Retention Guidelines
Records Disposition Schedule
Forms
Accession Administrative & Descriptive Metadata
Transfer
Verification
Migration/Refresh
METS AIP Metadata
Accession Administrative Metadata
Descriptive Metadata
Completed METS AIP Form
Transfer Documentation Form
Verification Documentation
Migration / Refresh Schedule
From CD to server
From Word to PDF
From preservation copy CD to new CD
Testbed Findings
WYSIWYG?
Internet Header Metadata
Header Metadata Return-Path:
Change of Author
Change of Creation Date
Original Capture mbox
Web Browser Display
--======_-1211362437==_======_E2mXatt
Content-Disposition: attachment; filename="XXXXX.doc"
Content-Type: application/octet-stream; name="XXXXX.doc"
Content-Transfer-Encoding: base64
0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAJQ
AAAAAEAAAJwAAAAEAAAD+////AAAAACQAAAD////////////////////////////////////
14 pages of character strings were in this space.
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
--======_-1211362437==_======--
From ???@??? Mon Sep 17 16:44:44 2001
Return-Path:
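The capture above shows an attachment carried inside the message as a base64-encoded MIME part. The sketch below uses Python's standard `email` package to recover such attachments as bytes. It is an illustration only, not the tool used in the CERP testbed; the sample message, addresses, boundary string, and the function name `extract_attachments` are all made up for the example.

```python
import base64
from email import message_from_string

# Hypothetical two-part message: a short text body plus one base64 attachment.
RAW = """\
From: sender@example.org
To: archivist@example.org
Subject: Report
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="us-ascii"

See attached.
--BOUNDARY
Content-Type: application/octet-stream; name="XXXXX.doc"
Content-Disposition: attachment; filename="XXXXX.doc"
Content-Transfer-Encoding: base64

{payload}
--BOUNDARY--
"""


def extract_attachments(raw_message: str) -> dict:
    """Map each attachment file name to its decoded bytes."""
    msg = message_from_string(raw_message)
    found = {}
    for part in msg.walk():
        name = part.get_filename()  # None for parts that are not attachments
        if name:
            # decode=True reverses the Content-Transfer-Encoding (base64 here)
            found[name] = part.get_payload(decode=True)
    return found


sample = RAW.format(payload=base64.b64encode(b"hello archive").decode("ascii"))
attachments = extract_attachments(sample)
```

In this sketch, `attachments` holds the original bytes under the file name given in the Content-Disposition header, which is what an archivist would write out as the native-format attachment.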
Aid4Mail Conversion
Missing Attachment
Attachment Conversion
Lynda Schmitz Fuhrig
Project Archivist Smithsonian Institution Archives
Capital Gallery Building 600 Maryland Avenue, SW, Suite 3000 Washington, D.C. 20024-2520
202-633-5917 [email protected]
SIA testbed relationship
Three SI units agreed to participate
SIA
• Collects, preserves, and makes available the official records of the Smithsonian Institution
• Carries out a records management program for Smithsonian offices, advising them on the disposition of records and pertinent documentary materials in analog and digital form.
Administrative
• Deposits records regularly with SIA.
• Archives formalized records disposition schedule for office
Financial
• Deposits made irregularly.
• Records disposition schedules being created and/or updated
Public-Research
• Has very active relationship with SIA.
• One of our largest depositors.
Bad transfer
Transferred email
36,000+ messages
Archival Information Package
CERP model
* The SIP is the submission information package. It contains the email collection (variety of formats possible) received from the depositor and metadata narrative (both information supplied by the depositor and updated by the archivist).
* The AIP is the archival information package. It contains the source email from the depositor, metadata (manually created METS, narrative, and other), finding aid (manually created), .mbox files, parsed XML file, parsed attachments, bad messages from parser, and parser subject-sender log.
* The DIP is the dissemination information package. Package could include the entire package for viewing/downloading or a specific email message/s for viewing. The AIP remains in its original form.
CERP model continued
SIP *
SIP to AIP
• Archivist converts the collection to the .mbox format (a generic email format), if not already in this format.
• Archivist runs the parser to convert the .mbox file/s to an XML preservation file with encoded attachments.
• Archivist creates a package of all components (metadata, source, outputs, finding aids) in the zip format and submits it to a digital repository.
AIP *
AIP to DIP
The researcher queries the digital repository (DSpace) to find and retrieve the email collection results.
DIP *
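The packaging step in the SIP-to-AIP workflow (zipping components before submission) can be sketched with Python's standard `zipfile` module. This is a hedged illustration, not the procedure used at SIA or RAC; the function name `zip_component` and the folder layout in the usage note are assumptions.

```python
import zipfile
from pathlib import Path


def zip_component(source_dir: Path, zip_path: Path) -> list:
    """Zip every file under source_dir, keeping relative paths.

    Returns the names stored in the archive. zip_path should sit
    outside source_dir so the archive does not try to include itself.
    """
    stored = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(source_dir).as_posix()
                zf.write(path, arcname)
                stored.append(arcname)
    return stored
```

For example, `zip_component(Path("Parser directory tree"), Path("Parser directory tree.zip"))` would produce one of the AIP's zipped components, assuming a folder of that name exists.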
Proprietary format
Workflow at SIA
Retrieve or receive email account.
Methods include ExMerge, FTP, email attachment, server transfer
Conduct virus scan
Burn original to CD or other media
Make use copy
Review potential accession
Metadata narrative
Metadata narrative cont’d
Attachment extraction
Attachments in native format
Finding aid - EAD
MBOX files to parser
METS
Archival Information Package
Archival Information Package
METS.xml – doorway into DSpace containing Dublin Core tags
.pst – source email as is
Metadata narrative.zip – narrative about processing applied to collection and other pertinent information, format reports
.xml – email collection from parser
.xslt – stylesheet for rendering email within DSpace
EAD.zip – html and xml versions of finding aid
Parser directory tree.zip – .mbox files, bad messages, encoded attachments greater than 25K
Subject sender log.zip – output from parser
EAD zip
Metadata narrative zip
DROID, JHOVE reports, metadata narrative
Parser directory tree zip
Encoded attachments, bad messages, MBOX files within folder structure
Subject-sender log zip
Output from parser
DSpace
Exercise 1
Complete Processing Forms
Accession Process
Transfer Verify
EXERCISE 1
Using the information below, excerpted from the Narrative Finding Aid, complete the METS AIP Metadata Form and the Accession Administrative and Descriptive Metadata Forms.
Accession #2008-CERP 1
Scope & Content Note
This accession totals 7.15 megabytes in native (.msg) format. Information in this accession spans nine months overlapping the first and second years of CERP and Adgent’s employment with Rockefeller Archive Center. It reflects the typical content described in the Series Description. Messages consist of selected correspondence related to CERP and were chosen to illustrate a range of e-mail preservation challenges including incoming, outgoing, and forwarded with Word, Excel, WordPerfect, PowerPoint, jpg, and Publisher attachments. Senders and recipients are external and inter-office.
Container List / Submission Information Package (SIP)
Folders:
Demo set.msg 2006-03-08 to 2006-12-12
Demo set.Aid4Mail.mbox 2006-03-08 to 2006-12-12
Processing Note
Messages were selected and copied from the e-mail account holder’s desktop folder, “Nancy’s CERP Mail”, in .msg format and were converted via Aid4Mail to .mbox format for parsing and preservation as .xml. While Archive Center staff has attempted to remove personal, confidential, and sensitive messages from the e-mail used for demonstration purposes, it is possible that some such content remains, and the Archive Center cannot assume responsibility for any that remains. Archivists discovering this type of content should exercise discretion in using it.
Provenance
E-mail messages were copied from the e-mail account holder’s active Microsoft Outlook imap.rockefeller.edu/Records/E Rec Project folder onto her desktop by the account holder in April 2008. Fourteen messages were selected for CERP demonstration use and were copied to new folders as shown in Container List above.
Related Materials
Other records in the Rockefeller Archive Center Collection include material from various departments and are arranged in the following Record Groups (Sub-Communities):
Administration and Operations
Exhibits
Library
Memorabilia
Photographs
Press and Publications
Projects
Website
Access Notes
RESTRICTIONS: Closed. Accessible only by CERP team.
ACCESSION ADMINISTRATIVE METADATA for E-MAIL
Accessioning Archivist: Today’s Date:
Depositing organization
Type of Accession
Type of Material
Size
Terms/Restrictions
Access
Retain until
Copyright: (If yes, describe)
Sensitive Data:
Encrypted: (If yes, provide key)
From Server/Type Media
Location, physical environment, security of records since creation:
If information being deposited has been altered since creation, indicate changes:
Altered by:
Date(s):
If migrated/refreshed/virus scanned before accessioned, provide information on an attached Migration/Refresh Schedule Form.
*Attach Completed Electronic Records Transfer, Verification, and Migration Forms*
ACCESSION ADDITIONAL DESCRIPTIVE METADATA for E-MAIL
Creator’s E-mail Address:
Computer ID#
Creator’s Title:
Types Information (e.g. lab notes, staff meeting minutes, etc.):
E-mail application(s):
Attachment formats: Text:
Graphics & Images:
Database:
Web/HTML:
Architectural/Engineering:
Scientific:
Proprietary:
Other:
Removable Media ID Name / # Title: Byte size # Folders
Folder Name Byte Size # Files Date Range
File Name Byte Size # Messages # Attachments
*Attach Narrative Finding Aid *
AIP METS FOR DSPACE FORM
Archival Term/DSpace Term/
Collection/Community Name
Record Group/Sub-Community
Series/Collection
E-Mail Account Holder’s Name
Department
Accession Number
Accession/Bundle Name
Date Range, inclusive
E-Mail Collection Folder Name(s)
Subject Headings
ACCESSION ADMINISTRATIVE METADATA for E-MAIL
Accessioning Archivist: Nancy Adgent Today’s Date: April 23, 2008
Depositing organization
Type of Accession
Type of Material
Size
Terms/Restrictions
Access
Retain until
Copyright: (If yes, describe) No
Sensitive Data: Yes
Encrypted: (If yes, provide key) No
From Server/Type Media
Location, physical environment, security of records since creation: Outlook Inbox folder, then copied to desktop
If information being deposited has been altered since creation, indicate changes:
Altered by:
Date(s):
If migrated/refreshed/virus scanned before accessioned, provide information on an attached Migration/Refresh Schedule Form.
*Attach Completed Electronic Records Transfer, Verification, and Migration Forms*
ACCESSION ADDITIONAL DESCRIPTIVE METADATA for E-MAIL
Creator’s E-mail Address: [email protected]
Computer ID#
Creator’s Title: Project Archivist
Types Information (e.g. lab notes, staff meeting minutes, etc.): Correspondence
E-mail application(s): Microsoft Outlook
Attachment formats: Text: MS Word; pdf
Graphics & Images: jpg; PowerPoint; Publisher
Database:
Web/HTML:
Architectural/Engineering:
Scientific:
Proprietary:
Other: Excel
Removable Media ID Name / # N/A Title: Byte size # Folders
Folder Name Byte Size # Files Date Range
File Name Byte Size # Messages # Attachments
*Attach Narrative Finding Aid *
AIP METS FOR DSPACE FORM
Archival Term/DSpace Term/
Collection/Community Name
Record Group/Sub-Community
Series/Collection
E-Mail Account Holder’s Name
Department
Accession Number
Accession/Bundle Name
Date Range, inclusive
E-Mail Collection Folder Name(s)
Subject Headings
Electronic Records Transfer Form
Collection:
Record Group/Series | Creator(s)/Inbox Owner | Position(s) | Date Range | Date Created | Content Type | Content Format & Version | Media | # | Source | Retain To This Date
NOTES:
Transferring Archivist: Date:
Electronic Records Verification Form
Collection:
Record Group/Series | # Folders | Folder Name | # Files | File Name | Size in Bytes | Copyright | Encrypted | Virus Scan | NOTES (e.g. Information kept/deleted & reasons)
Verified by: Title: Date:
Electronic Records Media Refresh/Migrate/Destroy Schedule
YEAR | DATE | COLLECTION | RG | SERIES | FROM MEDIA | TO MEDIA | FROM FORMAT | TO FORMAT | Virus Scan | Size After | Archivist
Electronic Records Transfer Form
Collection: Rockefeller Archive Center
Record Group/Series | Creator(s)/Inbox Owner | Position(s) | Date Range | Accession Date | Content Type | Content Format & Version | Media | # | Source | Retain To This Date
CERP | Nancy Adgent | Project Archivist | 3/8/06-12/12/06 | 04/11/08 | Email & att | Outlook 2003 | Desktop | n/a | Inbox* | 12/31/08
NOTES: *Copy/paste from Inbox to desktop folder by e-mail account owner
Transferring Archivist: Nancy Adgent Date: 4/23/08
Electronic Records Verification Form
Collection: Rockefeller Archive Center
Record Group/Series | # Folders | Folder Name | # Files | File Name | Size in Bytes | Copyright | Encrypted | Virus Scan | NOTES (e.g. Information kept/deleted & reasons)
CERP/Demo Mail | 1 | In E Rec Project | 15 | Demo Set.msg | 7.15 MB | N | N | Y | No virus; deleted 1 sensitive; today’s date at top of every message; original date in headers
CERP/Demo Mail | 1 | In E Rec Project | 1 | Demo set.Aid4Mail.mbox | 9.08 MB | N | N | Y | No virus found; today’s date at top of every message; original date in headers
Verified by: Nancy Adgent Title: Project Archivist Date: 4/23/08
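The "# Files" and "Size in Bytes" columns of the Verification Form can be tallied by script instead of by hand. The following Python sketch shows one way to do that; the function name `tally_folder` and the returned field names are hypothetical, and it counts every file under the folder, including files in subfolders.

```python
from pathlib import Path


def tally_folder(folder: Path) -> dict:
    """Count files and total bytes under a folder, subfolders included."""
    files = [p for p in folder.rglob("*") if p.is_file()]
    return {
        "folder": folder.name,
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
    }
```

Running the tally once at transfer and again after any copy or migration gives a quick check that no files were lost along the way.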
Electronic Records Media Refresh/Migrate/Destroy Schedule
YEAR | DATE | COLLECTION | RG | SERIES | FROM MEDIA | TO MEDIA | FROM FORMAT | TO FORMAT | Virus Scan | Size After | Archivist
2008 | 12.31 | RAC | CERP | Demo Mail | Server | Destroy | All | destroy | n/a | 0 | NA
2010 | 09.01 | RAC | CERP | All | Server | TBD | All | TBD | | | NA
ACCESSION METADATA MATRIX
Exercise 2
Aid4Mail Conversion
Copy/paste files from Inbox to Desktop
Convert from msg to mbox via Aid4Mail
Rename file to messages.mbox
Select & Copy From Inbox
Copy Inbox Files to Desktop
Pasted Files in Desktop Folder
Aid4Mail Conversion – Step 1
Aid4Mail Conversion – Step 2
Aid4Mail Conversion – Step 3
Aid4Mail Conversion – Step 4
Aid4Mail Conversion – Step 5
Aid4Mail Conversion – Step 6
Aid4Mail Done
Aid4Mail Conversion – Rename File
Exercise 3
MessageSave Conversion
Bad processing
MessageSave in Outlook
Create a Mail folder on your desktop. Launch Outlook. Open the .pst.
Go to the MessageSave tool in the upper right-hand corner. Click on Options to expand the box. Be sure you have the .pst opened in Outlook.
Select Include Subfolders.
Select MBOX for format.
Browse to Desktop to select the Mail folder.
Select Save Now.
Processing
Folder should open with MBOX files
File name should be main folder-subfolder
Create folders with the file names
Rename files “messages.mbox” and place into corresponding folders. This is needed in order for the parser to process the email collection correctly.
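The folder-and-rename steps above are easy to get wrong by hand when an account has many folders. A minimal Python sketch of the same steps follows. It assumes the exported files carry a `.mbox` extension and sit directly in the Mail folder; the function name `arrange_for_parser` is hypothetical.

```python
import shutil
from pathlib import Path


def arrange_for_parser(mail_dir: Path) -> list:
    """Move each exported <name>.mbox to <name>/messages.mbox.

    Returns the new paths. Refuses to overwrite an existing messages.mbox.
    """
    moved = []
    for mbox_file in sorted(mail_dir.glob("*.mbox")):
        target_dir = mail_dir / mbox_file.stem  # folder named after the file
        target_dir.mkdir(exist_ok=True)
        target = target_dir / "messages.mbox"
        if target.exists():
            raise FileExistsError(target)
        shutil.move(str(mbox_file), str(target))
        moved.append(target)
    return moved
```

It is safest to run this on a working copy of the Mail folder, since the original export files are moved rather than copied.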
Exercise 4
Begin AIP
Archival Information Package
METS
Finding aid
DROID output in XML
Partial AIP
LUNCH
12:15 – 1:30
Technical Review
Riccardo Ferrante IT Archivist Smithsonian Institution Archives [email protected]
General digital preservation concerns
Obsolescence is a 3-fold risk
- Physical
- Format
- Technical
The heart of the problem
The computing industry
Rapid advancements
Proprietary formats
1954
Rand Home Computer Photo courtesy of Smithsonian Institution
Email Issues
Structure
Relationships, at multiple levels
Attachments
Standards for email formats trail behind new email features
Volume
Access
Email Challenges
Owner’s organization of email
Email Challenges
Email with attachments
Email Challenges
Embedded emails and attachments
Standards v. Features
Timeline (1960 to 2010)
1961: Electronic messaging between mainframe accounts
1972: First email between two networked computers using @ symbol
1982: RFC 822, Standard for the Format of ARPA Internet Text Messages
1995: HTML and other MIME types supported by email clients
1999: Y2K event propels retooling to support 3- and 4-digit year notations
2001: RFC 2822, Internet Message Format
Volume
Access
Acquisition delays
Years after the email account was active
Technical environment
Original email system and version obsolete
Long term usability
File format considerations
Viability of having email servers as staging platforms for accessioned email accounts
Addressing the challenges
An email account model:
- preserves the full body of correspondence
- maintains the user-created relationships
- eliminates redundant metadata
XML as a preservation format
Good prospects for format longevity
Self-describing, flexible
Non-proprietary standard
Can determine if results are well-formed
Base is ASCII
Strong open-source
Large range of tools available
Commercial support as well
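The point that XML output can be checked for well-formedness can be tried with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name and the sample element names are made up. Note that this tests only well-formedness (properly nested and closed tags), not validity against a schema, which needs a schema-aware validator.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<Account><Message/></Account>"))        # prints True
print(is_well_formed("<Account><Message></Account>"))         # prints False
```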
The Email Account Schema
Enables validation of successful preservation of email accounts by different systems
Supports
CERP Parser – multiple formats, no original systems
EMCAP parser – single format, active original systems
Final schema produced by NC staff fully addresses a complete email account at all levels.
Email Account Schema – Summary Level
Anatomy of an email message
Steve Burbeck
IT Consultant
(910) 458-2056 [email protected]
http://www.runningempty.org/Steve/
Parsing E-Mail: Lessons Learned
ACA Workshop June 10, 2008
Overview of Topics
Loose Email standards
Issues with native email formats
The impact of the wide variety of email clients in use
Commercial tools vs. open source
The CERP Parser, how it is constructed and how to use it.
Loose Email “Standards”
RFC2822 and other standards are a good start that handle most cases.
Yet email continues to evolve and standards continue to lag.
To be widely adopted, lagging standards must support virtually all preexisting practices…an impossible goal without compromises that are open to interpretation.
Different email client vendors interpret the standards differently.
And there are the inevitable mismatches between interpretations (and inevitable bugs).
Variety is the Spice of Email
Dozens of common email systems and 100s of others
We have encountered mail from Eudora (multiple versions), Simeon for MacPPC, Outlook/Exchange (multiple versions), AppleMail, Lotus Notes, Groupwise, Mozilla/Firefox, Pegasus Mail, and various Internet mail services such as gmail, Hotmail, YahooMail, Juno, and AOL.
Each has its peculiarities.
Some use non-standard date formats
European and Asian mail may contain non-ASCII (actually, non UTF-8) characters
Older email may have HTML in inappropriate places
Forwarded and other “child” messages may be included in nonstandard forms
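Non-standard date formats are one of the peculiarities a parser has to absorb. One common approach, sketched here in Python, is to try the standard parser first and fall back to a list of known variants. This is not how the CERP parser is written (it is a Squeak Smalltalk prototype); the fallback formats below are hypothetical examples, and a real list would be built from the oddities actually found in a collection.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Hypothetical non-standard layouts; extend as new variants turn up.
FALLBACK_FORMATS = ("%m/%d/%Y %H:%M", "%Y-%m-%d %H:%M:%S")


def parse_mail_date(value: str):
    """Parse a Date header, returning a datetime or None if unreadable."""
    try:
        # Handles RFC 2822 dates such as "Mon, 17 Sep 2001 16:44:44 -0400"
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        pass  # older Pythons raise TypeError, newer ones ValueError
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None  # leave the message for human inspection
```

Returning `None` rather than guessing keeps unreadable dates visible, so they can be routed to the same manual review as other problem messages.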
Commercial vs. Open Source
Why Not Use Commercial Solutions?
Most commercial solutions aim at the earliest possible legal destruction of email rather than long-term storage.
The storage formats are determined by the vendor, usually with an eye to advantaging their own business
Proprietary software suppliers may not even be in business 20 years hence.
Benefits of Open Source
The software can be maintained by the archivist community at large
Storage formats can be optimized for archival needs.
The Storage Format - XML
Why not just use Native email format?
Which one?
How well is it documented?
How long will software exist to read it?
Which companies (if any) have a real commitment to stability and longevity?
Why eXtensible Markup Language (XML)?
XML is open, human readable and “self describing”
A good descriptive schema supports validity checking
There are many open source tools to create, manipulate and read XML
The Importance of a Common Schema
A Schema defines how the XML tags for the various parts of an email relate to each other.
It will be made public, so you don’t have to reinvent the wheel
Email Conversion Results
We have converted and validated 70 thousand messages in three test sets to the XML Mail-Account schema
Smithsonian - 5,537 messages in 232 MB of recent Outlook mail: 99.97% successfully parsed (4 could not be parsed)
Smithsonian - 20,000 messages in a 1.5 GB Outlook account: 99.975% successfully parsed (5 could not be parsed)
Rockefeller Archives - 43,778 messages in 378 MB of older eclectic mail: 99.85% successfully parsed (74 unparsed, but improvement is clearly possible)
Parse speed: about a quarter gigabyte per hour on a Thinkpad T40
Lessons Learned
100% success is an unrealistic goal
Some emails are just too broken to parse without manual intervention
We can achieve at least 99.9% success (and save the few unparsed emails for human inspection)
This error rate is not unlike physical archives
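The idea of parsing what can be parsed and setting the rest aside for a person to inspect can be sketched with Python's standard `mailbox` module. This is a stand-in for illustration, not the CERP parser. The acceptance test used here (a message must have From and Date headers) is an assumption chosen to keep the example short, and the function name and reject-file naming are hypothetical.

```python
import mailbox
from pathlib import Path


def triage_mbox(mbox_path: Path, reject_dir: Path) -> tuple:
    """Return (parsed, rejected) counts for one mbox file.

    Raw bytes of each rejected message are written to reject_dir
    so nothing is lost and a person can inspect them later.
    """
    reject_dir.mkdir(parents=True, exist_ok=True)
    parsed = rejected = 0
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for key in box.keys():
            raw = box.get_bytes(key)
            try:
                msg = box.get_message(key)
                # Stand-in acceptance rule (assumption): require From and Date
                if not msg["From"] or not msg["Date"]:
                    raise ValueError("missing From or Date header")
                parsed += 1
            except Exception:
                (reject_dir / f"message-{key}.eml").write_bytes(raw)
                rejected += 1
    finally:
        box.close()
    return parsed, rejected
```

The two counts give the success rate for the account, and the reject folder plays the role of the "bad messages" component kept in the AIP.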
Development of the CERP Email Parser
First and foremost, it is a prototype
It was developed (co-evolved) in open source tools along with our changing understanding of requirements and an evolving XML schema.
It was built in an Open Source development system: Squeak Smalltalk v3.9
A portable development environment that runs on Windows, Linux, and Macintosh (www.squeak.org)
Squeak was chosen because it is a very powerful prototyping system.
We can debate the relative merits of other prototyping languages (Java, Ruby, whatever, …) off-line.
The Web Application Interface
The parser can be run from within Squeak, but most users will prefer to run it from a Web browser
The Web interface is built with a popular Squeak Web Application development framework called Seaside (www.seaside.st)
Seaside uses a web server (Comanche) that is embedded in Squeak.
Although Comanche can be used as a general-purpose web server, its usage here is confined to supporting the parser and the Seaside application interface.
Parser Demonstration
Running the CERP Email Parser
Place the root directory of all mail accounts to be parsed in a common Windows directory (default: C:\digpres\mail).
Start the Squeak parser image and ensure that Seaside is running.
If necessary, start Seaside by executing “WAKom startOn: 9092”.
Then open a browser window and direct it to http://localhost:9092/seaside/EmailParsing
The Parser Web Page
Parsing Results Status
Parser Subject-Sender Log
Parser Subject-Sender Log (cont.)
Parsed E-mail Body Excerpt
Parsed E-Mail Attachment Reference
Validation in Oxygen
Validation Message
External Attachment File
Exercise 5
Convert mbox via parser
Characteristics of METS
Metadata Encoding and Transmission Standard
Sections address descriptive, administrative, file groups, file hierarchies, and behavior
Proven non-proprietary standard
Supporting maintenance body
Adopted by a range of similar systems
Integrates use of extension schemas
CERP Use of METS
Archival package definition
Descriptive, file groups, structural map sections
Import into DSpace
Support of Dublin Core schema as a descriptive section extension
Required development of a METS Import module for DSpace
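To make the three METS sections named above concrete, the sketch below builds a bare skeleton with a descriptive section (Dublin Core title), a file group, and a structural map, using Python's standard `xml.etree.ElementTree`. It illustrates the general shape of a METS document only. It is not the CERP METS profile or the form DSpace's import module expects; the IDs, the `USE="CONTENT"` value, and the file names in the usage note are assumptions.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_mets(title: str, files: dict) -> ET.Element:
    """Build a skeleton METS tree; files maps a file ID to a relative path."""
    root = ET.Element(f"{{{METS}}}mets")

    # Descriptive section: Dublin Core wrapped inside METS
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(data, f"{{{DC}}}title").text = title

    # File section and structural map, linked through file IDs
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="CONTENT")
    struct = ET.SubElement(root, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div", DMDID="dmd1")
    for file_id, href in files.items():
        entry = ET.SubElement(group, f"{{{METS}}}file", ID=file_id)
        ET.SubElement(entry, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": href})
        ET.SubElement(div, f"{{{METS}}}fptr", FILEID=file_id)
    return root
```

For example, `ET.tostring(build_mets("Demo accession", {"f1": "EAD.zip"}))` yields a small document in which the structural map points at the file entry by its ID, which is the linkage the StructMap and File Group slides describe.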
METS & the AIP
Dublin Core Choice
Fields selected
StructMap structure, content, use
File Group structure, content, use
Archival Information Package
Exercise 6
Complete AIP Assembly
DSpace
Using METS with DSpace
Demonstration
Loading AIP into DSpace
Exercise 7
Parse attendee’s messages
CERP http://siarchives.si.edu/cerp
Rockefeller Archive Center http://archive.rockefeller.edu
Smithsonian Institution Archives http://siarchives.si.edu
APPENDIX A
CERP FORMS AND GUIDANCE
Available on CERP website: http://siarchives.si.edu/cerp/progress.htm
Depositor Survey – Electronic Record Status
E-Mail Guidelines for Managers and Employees
Responsible Recordkeeping: E-mail Records
Email Guidance
Transfer Guidelines
Record Retention and Disposition Guidelines
Metadata for E-Mail Form
EAD sample
APPENDIX B
Collaborative Electronic Records Project (CERP)
Application to Use Materials Produced by CERP, including those on the CERP website
Although CERP products are not copyrighted, and our intent is for other archives and non-profit organizations to freely use the materials, for-profit entities should obtain our written permission and should not charge users for material obtained from CERP. Any organization, regardless of profit-making status, should appropriately credit CERP for materials used in whole or in part.
Images are owned by the Rockefeller Archive Center (RAC) or Smithsonian Institution Archives (SIA) or are copyright-free Microsoft Clip Art. Written permission to use images owned by RAC or SIA must be obtained from the RAC or SIA, as appropriate for the image. Credit line: “Courtesy of the Rockefeller Archive Center” or “Courtesy of Smithsonian Institution Archives” respectively.
Name of Organization: ______
Address: ______
Telephone: ______
Fax: ______
E-mail address: ______
Website URL: ______
Name and title of person requesting permission: ______
Name of CERP products you wish to use: ______
Intended use of materials: ______
Please note that CERP publications are not intended to provide, nor to constitute, legal or accounting advice and, therefore, any entity adopting or adapting them should consult with the appropriate professionals to ensure that the entity’s particular legal and accounting needs are met.
APPENDIX C
CERP model
SIP *
SIP to AIP
• Archivist converts the collection to the .mbox format (a generic email format), if not already in this format.
• Archivist runs the parser to convert the .mbox file/s to an XML preservation file with encoded attachments.
• Archivist creates a package of all components (metadata, source, outputs, finding aids) in the zip format and submits it to a digital repository.
AIP *
AIP to DIP
The researcher queries the digital repository (DSpace) to find and retrieve the email collection results.
DIP *
* The SIP is the submission information package. It contains the email collection (variety of formats possible) received from the depositor and metadata narrative (both information supplied by the depositor and updated by the archivist).
* The AIP is the archival information package. It contains the source email from the depositor, metadata (manually created METS, narrative, and other), finding aid (manually created), .mbox files, parsed XML file, parsed attachments, bad messages from parser, and parser subject-sender log.
* The DIP is the dissemination information package. Package could include the entire package for viewing/downloading or a specific email message/s for viewing. The AIP remains in its original form.
APPENDIX D
METADATA NARRATIVE TEMPLATE
June 10, 2008 Smithsonian Institution Archives
This directory was created as part of the Collaborative Electronic Records Project (CERP).
Depositor: Smithsonian Institution Archives
Title: Smithsonian Institution Archives, Office of the Director, Email Records, 2004-2006
Transfer method: SIA server
Received by: Ginger Yowell
Account holder: Thomas F. Soapes
Received on: 3/29/2007
Email application: Outlook
Folder name: F:\DSpace SIA\AIP\Soapes.pst
Date range of rec’d: NA
Date range of Sent Items: 12/08/2004-11/14/2006
Accession: 07-109
Abstract: This accession consists of records created by Thomas F. Soapes during his tenure as Acting Director of the Smithsonian Institution Archives (2005-2007). Sent email also includes correspondence while he was chair of the Archives Division at the National Air and Space Museum. Some correspondence is transitory and/or sensitive in nature.
SOURCE
Data
Soapes.pst (TS_cerp.pst in AIP)
Analyses
Virus checking results. None. See below.
Source data format validation of attachments. (JHOVE/DROID) None. See below.
ARCHIVAL MASTER
Data
TS_cerp.xml (created from mbox file/s of pst account)
Attachments (encoded)
METADATA
METS file
Finding Aid – EAD in XML and HTML
Sender_Subject logs - Files containing senders and subject lines from parser.
PRESERVATION DESCRIPTION INFORMATION (PDI)
Description of each step taken in the course of preparing the AIP. If a step is documented elsewhere in the archives as a standard practice, a reference to such is acceptable. All information necessary to document the authenticity of the record traceable (auditable) to the time it was ingested.
1. Initial Processing: Virus check, backup
2. Preservation Pre-processing: PST migrated into mbox format
3. Preservation: Mbox files transformed into XML file containing all emails in the account, accompanying attachments encoded.
1. No viruses detected. On Demand E-mail Scan conducted on .pst file. Working copy on external drive regularly backed up. Original .pst saved on SIA websites server.
Folders
Sent Items - Folder 1
Sent Items - Folder 2
Sent Items - Folder 3
Sent Items - Folder 4
2. .pst migrated into mbox format using MessageSave. Commercial off-the-shelf (COTS) product works as add-in within Outlook. Mbox file created for each mail account folder. Name changed from folder name to messages.mbox so parser in Step 3 could work. Messages.mbox placed into corresponding folders. Attachments extracted in native format using EZDetach. COTS product works as add-in within Outlook. No viruses detected. VirusScan On Demand used. Attachment formats are pdf and doc. Logs also created during parsing and are in the AIP. Finding aid created in EAD using Oxygen. HTML and XML output. Used DROID and JHOVE on attachments for format identification and validity. JHOVE and DROID reports are present in AIP. Used SiteOverview as well for sitemap.
3. Mbox file transformed into XML file containing all emails in the account, accompanying attachments encoded. This is the email account preservation file using the Email Preservation XML schema, EPres. This account was parsed June 10, 2008, on an SIA workstation. METS file composed manually. AIP components zipped on June 10, 2008.
RIGHTS MANAGEMENT
Access to this directory is constrained in the following manner:
- Administrators – full control
STORAGE LOCATIONS
SIA server
LSF external drive
MIGRATION
No immediate preservation issues. Review account in 5 years.
UPDATE
Lynda Schmitz Fuhrig, SIA, June 10, 2008
APPENDIX E
METS SAMPLE
APPENDIX F
EAD SAMPLE
APPENDIX G
RESOURCES AND RELATED PROJECTS AND INITIATIVES
CERP (See website for list of resources) http://siarchives.si.edu/cerp/progress.htm
Chronopolis www.digitalpreservation.gov/news/2008/20080403news_article_chronopolis.html California-based datagrid preservation community with audit reporting and authenticity checking capability.
Fedora http://www.fedora-commons.org/ Fedora Commons is a non-profit organization providing digital content management and preservation technologies.
National Archives & Records Administration (U.S.) http://www.archives.gov/era/about/index.html Electronic Records Archives system being developed to preserve electronic records retaining dynamic features without requiring proprietary software.
Presidential Electronic Records Pilot System http://perpos.gtri.gatech.edu/ Georgia Tech Research Institute partnership with Georgia Institute of Technology
Washington State Digital Archives http://www.digitalarchives.wa.gov/Content.aspx?txt=background Custom-built Web interface and database to preserve and provide access to State and Local electronic records.
APPENDIX H
SOFTWARE DOWNLOAD LINKS
Aid4Mail http://www.aid4mail.com/downloads.php
Amber Outlook http://www.processtext.com/abcoutlk.html
Amber PDF http://www.processtext.com/abcpdf.html
Emailchemy http://weirdkid.com/products/emailchemy/
Fentun http://www.fentun.com/
File Merlin http://www.file-convert.com/fmn.htm
JHOVE http://hul.harvard.edu/jhove/download.html
MessageSave http://www.techhit.com/messagesave/
oXygen XML editor http://www.oxygenxml.com/download.html
Xena http://xena.sourceforge.net/
NOTES