DIGITAL DILEMMAS: ARCHIVING E-MAIL

PRE-CONFERENCE WORKSHOP

ASSOCIATION OF CANADIAN ARCHIVISTS

JUNE 10, 2008

Presented By: COLLABORATIVE ELECTRONIC RECORDS PROJECT TEAM Nancy Adgent, Steve Burbeck, Ricc Ferrante, Lynda Schmitz Fuhrig, Darwin Stapleton

DIGITAL DILEMMAS: ARCHIVING E-MAIL WORKSHOP June 10, 2008

9:00 – 10:30
CERP Inception, funding, goals (Darwin Stapleton)
Need for e-mail preservation, why an issue (Ricc Ferrante)
Identifying the issues, developing guidelines (Nancy Adgent)
Results of testing (Nancy Adgent, Lynda Schmitz Fuhrig)
Review workflow and tools (Lynda Schmitz Fuhrig)

Questions

10:30 – 10:45 Break

10:45 - 12:00
Exercise 1: Complete accession and processing forms (Nancy Adgent)
Exercise 2: Convert msg to mbox via Aid4Mail (Nancy Adgent)
Exercise 3: Convert pst to mbox via MessageSave (Lynda Schmitz Fuhrig)
Exercise 4: Start AIP (Lynda Schmitz Fuhrig)

12:00 – 12:15 Questions

12:15 – 1:30 Lunch on your own (McConnell Hall cafeteria is a short walk from the class location)

1:30 – 3:00
Overview of technical issues (Ricc Ferrante)
AIP post parsing (Ricc Ferrante)
Why XML (Ricc Ferrante)
Overview of parser (Steve Burbeck)
How testbed message oddities contributed to development
Collaboration with NC
Demonstration of parser (Steve Burbeck)

Questions

3:00 - 3:15 Break

3:15 - 4:15
Exercise 5: Convert mbox via parser (Steve Burbeck)
Exercise 6: Complete AIP (Lynda Schmitz Fuhrig)
DSpace Introduction (Ricc Ferrante)
Exercise 7: Parse attendee’s messages
Summary (Ricc Ferrante)

4:15 - 4:30 Questions

TABLE OF CONTENTS

Exercise 1: Complete accession and processing forms
Exercise 2: Convert msg to mbox via Aid4Mail
Exercise 3: Convert pst to mbox via MessageSave
Exercise 4: Start AIP
Exercise 5: Convert mbox via parser
Exercise 6: Complete AIP
Exercise 7: Parse attendee’s messages
Appendix A: Forms and Guidelines on CERP website
Appendix B: Application to Use Materials Produced by CERP
Appendix C: CERP Processing Workflow Model
Appendix D: Metadata Narrative Template
Appendix E: METS Sample
Appendix F: EAD Sample
Appendix G: Resources and Related Projects
Appendix H: Software Download Links

This documentation is released by the Collaborative Electronic Records Project under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License, 2008 and 2009. This license can be viewed at http://creativecommons.org/licenses/by-nc-sa/3.0/us/

Citation CERP. (2008). “Digital Dilemmas: Archiving E-Mail.” Sleepy Hollow, NY and Washington DC: The Collaborative Electronic Records Project.

Digital Dilemmas: Archiving E-Mail Collaborative Electronic Records Project

Association of Canadian Archivists June 10, 2008

Dr. Darwin H. Stapleton

Executive Director Rockefeller Archive Center

15 Dayton Avenue Sleepy Hollow, NY 10591 914-631-4505 [email protected] [email protected]

Overview

Funding Purpose Collaboration/Management

Accomplishments "We're good, but not perfect" Serendipity


Riccardo Ferrante

IT Archivist and Electronic Records Program Director

Smithsonian Institution Archives

Capital Gallery Building 600 Maryland Avenue, SW, Suite 3000 Washington, D.C. 20024-2520 202-633-5906 [email protected]

Nancy Adgent

Project Archivist Rockefeller Archive Center

15 Dayton Avenue Sleepy Hollow, NY 10591 914-366-6355 [email protected]

ROCKEFELLER ARCHIVE CENTER

RAC DEPOSITOR CHART

[Chart. Main depositor groups: R. Foundation, R. Family, R. University, R. Brothers Fund, Individuals. Chart labels: Rockefeller Related, Other, NAR founded, JDR Jr. founded, Own, Some On Deposit. Organizations named: General Education Board, Commonwealth Fund, Markle Foundation, Russell Sage Foundation, Foundation Center, Population Council, China Medical Board, Foundation for Child Development.]

Key Survey Findings

No records management policy

No naming standards

No procedures for organizing or saving

Some have no on-site IT staff

Inbox Folder Organization


Inbox – Non Standard File Names

Suggested Subject Names: “Staff Meeting.Minutes.2006.08.02”
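The suggested name above appears to follow a topic.type.YYYY.MM.DD pattern. A minimal Python sketch of that pattern follows; the function name and the three-part split are assumptions made for illustration, not part of the CERP guidelines.

```python
from datetime import date


def subject_name(topic: str, doc_type: str, when: date) -> str:
    """Build a subject line shaped like 'Staff Meeting.Minutes.2006.08.02'.

    The topic / type / date split is an assumption inferred from the
    single example on the slide.
    """
    return f"{topic}.{doc_type}.{when:%Y.%m.%d}"


# Example: reproduces the name suggested on the slide
example = subject_name("Staff Meeting", "Minutes", date(2006, 8, 2))
```

Putting the date last in year.month.day order means names for the same topic sort chronologically in a folder listing.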

ISSUES

Unknown formats
Deteriorating media
Data on portable devices
Native format vs. converting
Upgraded hardware/old media
Obsolete or unsupported software
Duplicates, personal, junk mingled
Information quantity & rate increase
Traditional archival concepts/new era

Best Practices Guidance

E-MAIL GUIDELINES


TRANSFER GUIDELINES

Prepared by the Collaborative Electronic Records Project Rockefeller Archive Center January 2007 This document may be freely used and modified by any non-profit organization.

Retention Guidelines

Records Disposition Schedule


Forms Accession Administrative & Descriptive Metadata

Transfer

Verification

Migration/Refresh

METS AIP Metadata

Accession Administrative Metadata

Descriptive Metadata


Completed METS AIP Form

Transfer Documentation Form

Verification Documentation


Migration / Refresh Schedule

From CD to server
From Word to PDF
From preservation copy CD to new CD

Testbed Findings

WYSIWYG?

Internet Header Metadata


Header Metadata

Return-Path: METADATA
Received: from (localhost [999.999.9.9]) by mailserver1 with LMTP for ; Fri, 19 May 2006 14:41:09 -0400
Received: from mailserver1edu ([999.999.9.9]) by mailserver1EDU (3.0.2/sieved-3-0-build-942) for ; Fri, 19 May 2006 14:41:09 -0400
Received: from Stapleton-pc.Rockefeller.edu (localhost [999.999.9.9]) by mailserver1 with ESMTP id k4JIf82s018784 for ; Fri, 19 May 2006 14:41:08 -0400 (EDT)
Message-Id: <7.0.1.0.2.20060519144014.0359d620@mailserver1edu>
X-Mailer: QUALCOMM Windows Eudora Version 7.0.1.0
Date: Fri, 19 May 2006 14:41:04 -0400
To:
From:
Subject: mitelman
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
X-MASF: 0.00%
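Header fields like these can be read programmatically. The sketch below uses Python's standard `email` package on a shortened sample modeled on the slide; the addresses and the simplified Received lines are placeholders invented for the example, and the code is an illustration rather than part of the CERP tools.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Shortened sample modeled on the slide; addresses and hosts are placeholders.
RAW = (
    "Return-Path: <sender@example.org>\n"
    "Received: from mailserver1 by mailserver1 with LMTP; Fri, 19 May 2006 14:41:09 -0400\n"
    "Received: from Stapleton-pc by mailserver1 with ESMTP; Fri, 19 May 2006 14:41:08 -0400\n"
    "Message-Id: <7.0.1.0.2.20060519144014.0359d620@mailserver1edu>\n"
    "X-Mailer: QUALCOMM Windows Eudora Version 7.0.1.0\n"
    "Date: Fri, 19 May 2006 14:41:04 -0400\n"
    "To: recipient@example.org\n"
    "From: sender@example.org\n"
    "Subject: mitelman\n"
    "Mime-Version: 1.0\n"
    'Content-Type: text/plain; charset="us-ascii"; format=flowed\n'
    "\n"
    "Body text.\n"
)


def header_summary(raw_message: str) -> dict:
    """Collect a few header fields that matter for provenance."""
    msg = message_from_string(raw_message)
    return {
        "message_id": msg["Message-Id"],
        "mailer": msg["X-Mailer"],
        "sent": parsedate_to_datetime(msg["Date"]),  # timezone-aware datetime
        "hops": len(msg.get_all("Received", [])),    # one Received line per relay
        "content_type": msg.get_content_type(),
        "charset": msg.get_content_charset(),
    }


summary = header_summary(RAW)
```

The Date header carries the sender's original timestamp, which is why it remains useful when a converted copy shows a different date at the top of the message.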

Change of Author

Change of Creation Date


Original Capture mbox

Web Browser Display --======_-1211362437==_======_E2mXatt Content-Disposition: attachment; filename="XXXXX.doc“ Content-Type: application/octet-stream; name="XXXXX.doc"Content-Transfer- Encoding: 0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAJQ AAAAAEAAAJwAAAAEAAAD+////AAAAACQAAAD//////////////////////////////////// 14 pages of character strings were in this space. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA-- ======_-1211362437==_======-- From ???@??? Mon Sep 17 16:44:44 2001 Return-Path: Received: from [123.45.67.890] (XXXXXX.rockefeller.edu [123.45.67.890]) by mail.rockefeller.edu (6.23.5/7.89.0) with ESMTP id f8HK8Js01021 for ; Mon, 17 Sep 2001 16:08:19 -0400 (EDT)Message-Id: Date: Mon, 17 Sep 2001 16:07:46 -0500To: XXXXXXXXFrom: Jane Doe Subject: Edited version of letterX- UIDL: ?e%!!Z"H!!V=^!!+[~!!Mime-Version: 1.0Content-Type: multipart/mixed; boundary="======_-1211361629==_======"This is a multi-part message in MIME format.--======_- 1211361629==_======Content-Type: text/plain; charset="iso-8859-1"-- ======_-1211361629==_======Content-Type: text/plain;"XXXXXXX.doc 1" (missing attachment)--======_- 1211361629==_======--

Aid4Mail Conversion


Missing Attachment

Attachment Conversion

Lynda Schmitz Fuhrig

Project Archivist Smithsonian Institution Archives

Capital Gallery Building 600 Maryland Avenue, SW, Suite 3000 Washington, D.C. 20024-2520

202-633-5917 [email protected]


SIA testbed relationship

SIA
• Collects, preserves, and makes available the official records of the Smithsonian Institution
• Carries out a records management program for Smithsonian offices, advising them on the disposition of records and pertinent documentary materials in analog and digital form.

Three SI units agreed to participate

Administrative
• Deposits records regularly with SIA.
• Archives formalized records disposition schedule for office

Financial
• Deposits made irregularly.
• Records disposition schedules being created and/or updated

Public-Research
• Has very active relationship with SIA.
• One of our largest depositors.

Bad transfer

Transferred

36,000+ messages


Archival Information Package

CERP model

* The SIP is the submission information package. It contains the email collection (variety of formats possible) received from the depositor and metadata narrative (both information supplied by the depositor and updated by the archivist).

* The AIP is the archival information package. It contains the source email from the depositor, metadata (manually created METS, narrative, and other), finding aid (manually created), .mbox files, parsed XML file, parsed attachments, bad messages from parser, and parser subject-sender log.

* The DIP is the dissemination information package. Package could include the entire package for viewing/downloading or a specific email message/s for viewing. The AIP remains in its original form.

CERP model continued

SIP *

SIP to AIP
• Archivist converts the collection to the .mbox (generic email format), if not already in this format.
• Archivist runs the parser to convert the .mbox file/s to an XML preservation file with encoded attachments.
• Archivist creates a package of all components (metadata, source, outputs, finding aids) in the zip format and submits to a digital repository.

AIP *

AIP to DIP
The researcher queries the digital repository (DSpace) to find and retrieve the email collection results.

DIP *
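The packaging step of the SIP-to-AIP stage (collecting all components into one zip file) can be sketched in Python with the standard `zipfile` module. The function name and directory layout are hypothetical; the workshop itself assembles the package by hand.

```python
import zipfile
from pathlib import Path


def build_package(component_dir: Path, package_path: Path) -> list:
    """Zip every file under component_dir and return the stored names.

    Relative paths are kept so the folder structure survives inside the
    archive. package_path should sit outside component_dir, otherwise the
    archive would try to include itself.
    """
    names = []
    with zipfile.ZipFile(package_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(component_dir.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(component_dir).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

A call such as `build_package(Path("aip_components"), Path("aip.zip"))` (both names are placeholders) returns the list of stored files, which can be checked against the container list in the finding aid.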


Proprietary format

Workflow at SIA

Retrieve or receive email account.

Methods include ExMerge, FTP, email attachment, server transfer

Conduct virus scan
Burn original to CD or other media
Make use copy
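Making the use copy can be paired with a checksum comparison so the copy is known to match the original. The checksum step is an assumption added for illustration (the workflow above does not mention one), and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_use_copy(original: Path, use_dir: Path) -> Path:
    """Copy the original into use_dir and confirm the bytes are identical."""
    use_dir.mkdir(parents=True, exist_ok=True)
    copy_path = use_dir / original.name
    shutil.copy2(original, copy_path)  # copy2 also keeps file timestamps
    if sha256_of(original) != sha256_of(copy_path):
        raise IOError(f"Use copy does not match original: {copy_path}")
    return copy_path
```

Recording the digest alongside the accession forms would also give a later refresh or migration something to verify against.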


Review potential accession

Metadata narrative

Metadata narrative cont’d


Attachment extraction


Attachments in native format

Finding aid - EAD


MBOX files to parser

METS

Archival Information Package


Archival Information Package

METS.xml – doorway into DSpace containing Dublin Core tags
.pst – source email as is
Metadata narrative.zip – narrative about processing applied to collection and other pertinent information, format reports
.xml – email collection from parser
.xslt – stylesheet for rendering email within DSpace
EAD.zip – html and xml versions of finding aid
Parser directory tree.zip – .mbox files, bad messages, encoded attachments greater than 25K
Subject sender log.zip – output from parser

EAD zip

Metadata narrative zip

DROID, JHOVE reports, metadata narrative


Parser directory tree zip

Encoded attachments, bad messages, MBOX files within folder structure

Subject-sender log zip

Output from parser

DSpace


Exercise 1

Complete Processing Forms

Accession Process

Transfer Verify


EXERCISE 1

Using the information below, excerpted from the Narrative Finding Aid, complete the METS AIP Metadata Form and the Accession Administrative and Descriptive Metadata Forms.

Accession #2008-CERP 1

Scope & Content Note
This accession totals 7.15 megabytes in native (.msg) format. Information in this accession spans nine months overlapping the first and second years of CERP and Adgent’s employment with Rockefeller Archive Center. It reflects the typical content described in the Series Description. Messages consist of selected correspondence related to CERP and were chosen to illustrate a range of e-mail preservation challenges including incoming, outgoing, and forwarded with Word, Excel, WordPerfect, PowerPoint, jpg, and Publisher attachments. Senders and recipients are external and inter-office.

Container List / Submission Information Packet (SIP)
Folders:
Demo set.msg 2006-03-08 to 2006-12-12
Demo set.Aid4Mail.mbox 2006-03-08 to 2006-12-12

Processing Note
Messages were selected and copied from the e-mail account holder’s desktop folder, “Nancy’s CERP Mail”, in .msg format and were converted via Aid4Mail to .mbox format for parsing and preservation as .xml. While Archive Center staff has attempted to remove personal, confidential, and sensitive messages from the e-mail used for demonstration purposes, it is possible that some such content remains and the Archive Center cannot assume responsibility for any that remains. Archivists discovering this type of content should exercise discretion in using it.

Provenance
E-mail messages were copied from the e-mail account holder’s active imap.rockefeller.edu/Records/E Rec Project folder onto her desktop by the account holder in April 2008. Fourteen messages were selected for CERP demonstration use and were copied to new folders as shown in Container List above.

Related Materials
Other records in the Rockefeller Archive Center Collection include material from various departments and are arranged in the following Record Groups (Sub-Communities): Administration and Operations Exhibits Library Memorabilia Photographs Press and Publications Projects Website

Access Notes
RESTRICTIONS: Closed. Accessible only by CERP team.

ACCESSION ADMINISTRATIVE METADATA for E-MAIL

Accessioning Archivist: Today’s Date:

Depositing organization : Contact name: Address: Telephone: E-mail:

Type of Accession :

Type of Material :

Size :

Terms/Restrictions :

Access :

Retain until :

Copyright: (If yes, describe)

Sensitive Data:

Encrypted: (If yes, provide key)

From Server/Type Media :

Location, physical environment, security of records since creation:

If information being deposited has been altered since creation, indicate changes:

Altered by:

Date(s):

If migrated/refreshed/virus scanned before accessioned, provide information on an attached Migration/Refresh Schedule Form.

*Attach Completed Electronic Records Transfer, Verification, and Migration Forms*


ACCESSION ADDITIONAL DESCRIPTIVE METADATA for E-MAIL

Creator’s E-mail Address:

Computer ID#

Creator’s Title:

Types Information (e.g. lab notes, staff meeting minutes, etc.):

E-mail application(s):

Attachment formats: Text:

Graphics & Images:

Database:

Web/HTML:

Architectural/Engineering:

Scientific:

Proprietary:

Other:

Removable Media ID Name / # Title: Byte size # Folders

Folder Name Byte Size # Files Date Range

File Name Byte Size # Messages # Attachments

*Attach Narrative Finding Aid *


AIP METS FOR DSPACE FORM

Archival Term/DSpace Term/

Collection/Community Name :

Record Group/Sub-Community :

Series/Collection :

E-Mail Account Holder’s Name :

Department :

Accession Number :

Accession/Bundle Name :

Date Range, inclusive :

E-Mail Collection Folder Name(s) :

Subject Headings :


ACCESSION ADMINISTRATIVE METADATA for E-MAIL

Accessioning Archivist: Nancy Adgent Today’s Date: April 23, 2008

Depositing organization : Rockefeller Archive Center Contact name: Nancy Adgent Address: 15 Dayton Avenue, Sleepy Hollow, NY 10591 Telephone: 914-366-6355 E-mail: [email protected]

Type of Accession : Deposit

Type of Material : e-mail

Size : 7.15 MB

Terms/Restrictions : not open for research

Access : To be viewed only by CERP team except specific demonstration files

Retain until : 2008.12.31

Copyright: (If yes, describe) No

Sensitive Data: Yes

Encrypted: (If yes, provide key) No

From Server/Type Media : desktop

Location, physical environment, security of records since creation: Outlook Inbox folder, then copied to desktop

If information being deposited has been altered since creation, indicate changes:

Altered by:

Date(s):

If migrated/refreshed/virus scanned before accessioned, provide information on an attached Migration/Refresh Schedule Form.

*Attach Completed Electronic Records Transfer, Verification, and Migration Forms*


ACCESSION ADDITIONAL DESCRIPTIVE METADATA for E-MAIL

Creator’s E-mail Address: [email protected]

Computer ID#

Creator’s Title: Project Archivist

Types Information (e.g. lab notes, staff meeting minutes, etc.): Correspondence

E-mail application(s): Microsoft Outlook

Attachment formats: Text: MS Word; pdf

Graphics & Images: jpg; PowerPoint; Publisher

Database:

Web/HTML:

Architectural/Engineering:

Scientific:

Proprietary:

Other: Excel

Removable Media ID Name / # N/A Title: Byte size # Folders

Folder Name Byte Size # Files Date Range

File Name Byte Size # Messages # Attachments

*Attach Narrative Finding Aid *


AIP METS FOR DSPACE FORM

Archival Term/DSpace Term/

Collection/Community Name : Rockefeller Archive Center

Record Group/Sub-Community : CERP

Series/Collection : CERP Demo Mail

E-Mail Account Holder’s Name : Nancy Adgent

Department : Collaborative Electronic Records Project

Accession Number : 2008-CERP 1

Accession/Bundle Name : Nancy’s Demo Set

Date Range, inclusive : 2006-03-08 to 2006-12-12

E-Mail Collection Folder Name(s) : Nancy’s In E Rec Project

Subject Headings : Electronic records Email Digital records Projects Smithsonian Institution Archives Rockefeller Archive Center Society of American Archivists Professional organizations CERP


Electronic Records Transfer Form
Collection:

Columns: Record Group/Series | Creator(s)/Inbox Owner | Position(s) | Content Date Range | Date Created | Content Type | Format & Version | Media | # | Source | Retain To This Date

NOTES:

Transferring Archivist: Date:

Electronic Records Verification Form
Collection:

Columns: Record Group/Series | # Folders | Folder Name | # Files | File Name | Size in Bytes | Copyright | Encrypted | Virus Scan | NOTES (e.g. Information kept/deleted & reasons)

Verified by: Title: Date:


Electronic Records Media Refresh/Migrate/Destroy Schedule

Columns: YEAR | DATE | COLLECTION | RG | SERIES | FROM MEDIA | TO MEDIA | FROM FORMAT | TO FORMAT | Virus Scan | Size After | Archivist


Electronic Records Transfer Form
Collection: Rockefeller Archive Center

Columns: Record Group/Series | Creator(s)/Inbox Owner | Position(s) | Content Date Range | Accession Date | Content Type | Format & Version | Media | # | Source | Retain To This Date
CERP | Nancy Adgent | Project Archivist | 3/8/06-12/12/06 | 04/11/08 | Email & att | Outlook 2003 | Desktop | n/a | Inbox* | 12/31/08

NOTES: *Copy/paste from Inbox to desktop folder by e-mail account owner

Transferring Archivist: Nancy Adgent Date: 4/23/08

Electronic Records Verification Form
Collection: Rockefeller Archive Center

Columns: Record Group/Series | # Folders | Folder Name | # Files | File Name | Size in Bytes | Copyright | Encrypted | Virus Scan | NOTES (e.g. Information kept/deleted & reasons)
CERP/Demo Mail | 1 | In E Rec Project | 15 | Demo Set.msg | 7.15 MB | N | N | Y | No virus; deleted 1 sensitive; today’s date at top of every message; original date in headers
CERP/Demo Mail | 1 | In E Rec Project | 1 | Demo set.Aid4Mail.mbox | 9.08 MB | N | N | Y | No virus found; today’s date at top of every message; original date in headers

Verified by: Nancy Adgent Title: Project Archivist Date: 4/23/08


Electronic Records Media Refresh/Migrate/Destroy Schedule

Columns: YEAR | DATE | COLLECTION | RG | SERIES | FROM MEDIA | TO MEDIA | FROM FORMAT | TO FORMAT | Virus Scan | Size After | Archivist
2008 | 12.31 | RAC | CERP | Demo Mail | Server | Destroy | All | destroy | n/a | 0 | NA
2010 | 09.01 | RAC | CERP | All | Server | TBD | All | TBD | | | NA


ACCESSION METADATA MATRIX


Exercise 2

Aid4Mail Conversion

Copy/paste files from Inbox to Desktop

Convert from msg to mbox via Aid4Mail

Rename file to messages.mbox

Select & Copy From Inbox

Copy Inbox Files to Desktop

Pasted Files in Desktop Folder

Aid4Mail Conversion – Step 1

Aid4Mail Conversion – Step 2


Aid4Mail Conversion – Step 3

Aid4Mail Conversion – Step 4

Aid4Mail Conversion – Step 5


Aid4Mail Conversion – Step 6

Aid4Mail Done

Aid4Mail Conversion – Rename File


Exercise 3

MessageSave Conversion

Bad processing

MessageSave in Outlook

Create a Mail folder on your desktop. Launch Outlook. Open the .pst.


Go to the MessageSave tool in the upper right-hand corner. Click on Options to expand the box. Be sure you have the .pst opened in Outlook.

Select Include Subfolders. Select MBOX for format. Browse to Desktop to select the Mail folder. Select Save Now.


Processing

Folder should open with MBOX files
File name should be main folder-subfolder

Create folders with the file names


Rename files “messages.mbox” and place into corresponding folders. This is needed in order for the parser to process the email collection correctly.
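For accounts with many folders, the folder-creation and renaming steps above can be scripted. This Python sketch assumes the exported files carry a `.mbox` extension and sit directly in the Mail folder; both are assumptions about the MessageSave output, and the function name is hypothetical.

```python
from pathlib import Path


def arrange_for_parser(mail_dir: Path) -> list:
    """Move each exported '<name>.mbox' to '<name>/messages.mbox'.

    Assumes the exported files end in .mbox and sit directly in mail_dir.
    Returns the new locations.
    """
    moved = []
    for mbox in sorted(mail_dir.glob("*.mbox")):
        folder = mail_dir / mbox.stem          # folder named after the file
        folder.mkdir(exist_ok=True)
        target = folder / "messages.mbox"      # fixed name the parser expects
        if target.exists():
            raise FileExistsError(f"Refusing to overwrite {target}")
        mbox.rename(target)
        moved.append(target)
    return moved
```

Running it on a use copy rather than the original export keeps the transferred files untouched if something goes wrong.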

Exercise 4

Begin AIP


Archival Information Package

METS

Finding aid


DROID output in XML

Partial AIP

LUNCH

12:15 – 1:30

Technical Review

Riccardo Ferrante IT Archivist Smithsonian Institution Archives [email protected]

General digital preservation concerns

Obsolescence is a 3-fold risk
- Physical
- Format
- Technical

The heart of the problem

The computing industry

Rapid advancements Proprietary formats


1954

Rand Home Computer Photo courtesy of Smithsonian Institution

Email Issues

Structure
- Relationships, at multiple levels
- Attachments
- Standards for email formats trail behind new email features
Volume
Access

Email Challenges

Owner’s organization of email


Email Challenges Email with attachments

Email Challenges

Embedded and attachments

Standards v. Features

Timeline, 1960 to 2010
1961: Electronic messaging between mainframe accounts
1972: First email between two networked computers using @ symbol
1982: RFC 822, Standard for the Format of ARPA Internet Text Messages
1995: HTML and other MIME types supported by email clients
1999: Y2K event propels retooling to support 3- and 4-digit year notations
2001: RFC 2822, Message Format


Volume

Access

Acquisition delays
- Years after the email account was active
Technical environment
- Original email system and version obsolete
Long term usability
- File format considerations
- Viability of having email servers as staging platforms for accessioned email accounts

Addressing the challenges

An email account model:

- preserves the full body of correspondence

- maintains the user-created relationships

- eliminates redundant metadata


XML as a preservation format

Good prospects for format longevity
- Self-describing, flexible
- Non-proprietary standard
Can determine if results are well-formed
Base is ASCII
Large range of tools available
- Strong open-source
- Commercial support as well
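The well-formedness check mentioned above needs nothing beyond a standard XML parser. A minimal Python sketch follows; the element names in the usage lines are made up for the example and are not taken from the CERP schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical element names, used only to exercise the check
good = is_well_formed("<Account><Folder/></Account>")
bad = is_well_formed("<Account><Folder></Account>")  # Folder is never closed
```

Well-formedness only confirms that tags nest and close correctly. Checking a file against a schema is a separate validation step that requires a schema-aware tool.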

The Email Account Schema

Enables validation of successful preservation Supports of email accounts by different systems

- CERP Parser – multiple formats, no original systems
- EMCAP parser – single format, active original systems

Final schema produced by NC staff fully addresses a complete email account at all levels.

Email Account Schema – Summary Level


Anatomy of an email message

Anatomy of an email message

Anatomy of an email message


Steve Burbeck

IT Consultant

(910) 458-2056 [email protected]

http://www.runningempty.org/Steve/

Parsing E-Mail: Lessons Learned

ACA Workshop June 10, 2008

Overview of Topics

Loose Email standards
Issues with native email formats
The impact of the wide variety of email clients in use
Commercial tools vs. open source
The CERP Parser, how it is constructed and how to use it.


Loose Email “Standards”

RFC2822 and other standards are a good start that handle most cases.
Yet email continues to evolve and standards continue to lag.
To be widely adopted, lagging standards must support virtually all preexisting practices…an impossible goal without compromises that are open to interpretation.
Different email client vendors interpret the standards differently.
And there are the inevitable mismatches between interpretations (and inevitable bugs).

Variety is the Spice of Email

Dozens of common email systems and 100s of others
- We have encountered mail from Eudora (multiple versions), Simeon for MacPPC, Outlook/Exchange (multiple versions), AppleMail, Lotus Notes, Groupwise, Mozilla/Firefox, and various Internet mail services such as gmail, Hotmail, YahooMail, Juno, and AOL
Each has its peculiarities.
- Some use non-standard date formats
- European and Asian mail may contain non-ASCII (actually, non-UTF-8) characters
- Older email may have HTML in inappropriate places
- Forwarded and other “child” messages may be included in nonstandard forms
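Non-standard date formats are one peculiarity that lends itself to a short illustration. The Python sketch below tries the standard RFC 2822 form first and then a small list of fallback patterns; the fallback list is purely illustrative and is not the set of formats the CERP parser handles.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Illustrative fallbacks only; real collections will need their own list.
FALLBACK_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%d.%m.%Y %H:%M",
    "%m/%d/%Y %I:%M %p",
)


def parse_mail_date(value: str):
    """Parse a Date header, returning a datetime or None if nothing fits."""
    try:
        return parsedate_to_datetime(value)  # standard RFC 2822 dates
    except (TypeError, ValueError):          # exception type varies by Python version
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None  # leave for human inspection
```

Dates recovered through a fallback pattern carry no time zone, so they are best flagged rather than silently mixed with standard ones.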

Commercial vs. Open Source

Why Not Use Commercial Solutions?
- Most commercial solutions aim at the earliest possible legal destruction of email rather than long-term storage.
- The storage formats are determined by the vendor, usually with an eye to advantaging their own business
- Proprietary software suppliers may not even be in business 20 years hence.
Benefits of Open Source
- The software can be maintained by the archivist community at large,
- Storage formats can be optimized for archival needs.


The Storage Format - XML

Why not just use Native email format?
- Which one?
- How well is it documented?
- How long will software exist to read it?
- Which companies (if any) have a real commitment to stability and longevity?
Why eXtensible Markup Language (XML)?
- XML is open, human readable and “self describing”
- A good descriptive schema supports validity checking
- There are many open source tools to create, manipulate and read XML

The Importance of a Common Schema

A Schema defines how the XML tags for the various parts of an email relate to each other.

„ , , ,

, , , etc. It is the Rosetta stone that guides how raw email is converted to XML …and it defines the structure for subsequent search, display, provenance, preservation, etc. The ‘Mail-Account’ XML schema serves the purposes of both CERP and EMCAP (thanks to David Minor of the NC State Archives)

It will be made public, so you don’t have to reinvent the wheel

Email Conversion Results

We have converted and validated 70 thousand messages in three test sets to the XML Mail-Account schema
- Smithsonian - 5,537 messages in 232 MB of recent Outlook mail: 99.97% successfully parsed (4 could not be parsed)
- Smithsonian - 20,000 messages in a 1.5 GB Outlook account: 99.975% successfully parsed (5 could not be parsed)
- Rockefeller Archives - 43,778 messages in 378 MB of older eclectic mail: 99.85% successfully parsed (74 unparsed, but improvement is clearly possible)
Parse speed: about a quarter gigabyte per hour on a Thinkpad T40
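The success rates can be recomputed from the message counts as a quick check. In the Python sketch below, the counts come from the slide and the helper name is hypothetical. Recomputing from the stated counts gives roughly 99.93% for the first set and 99.83% for the third, slightly below the slide's 99.97% and 99.85%; the slide figures may reflect rounding or counts from a different run.

```python
def success_rate(total: int, unparsed: int) -> float:
    """Percentage of messages parsed successfully."""
    return 100.0 * (total - unparsed) / total


# (description, total messages, messages that could not be parsed)
TEST_SETS = [
    ("Smithsonian, recent Outlook mail", 5537, 4),
    ("Smithsonian, Outlook account", 20000, 5),
    ("Rockefeller Archives, older eclectic mail", 43778, 74),
]

rates = {name: success_rate(total, bad) for name, total, bad in TEST_SETS}
total_messages = sum(total for _, total, _ in TEST_SETS)   # 69,315, about 70 thousand
total_unparsed = sum(bad for _, _, bad in TEST_SETS)       # 83
```

Across all three sets, 83 unparsed messages out of 69,315 corresponds to an overall rate of about 99.88%.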


Lessons Learned

100% success is an unrealistic goal
- Some emails are just too broken to parse without manual intervention
We can achieve at least 99.9% success (and save the few unparsed emails for human inspection)
This error rate is not unlike physical archives

Development of the CERP Email Parser

First and foremost, it is a prototype
- It was developed (co-evolved) in open source tools along with our changing understanding of requirements and an evolving XML schema.
It was built in an Open Source development system: Squeak Smalltalk v3.9
- A portable development environment that runs on Windows, Linux, and Macintosh (www.squeak.org)
- Squeak was chosen because it is a very powerful prototyping system.
- We can debate the relative merits of other prototyping languages (Java, Ruby, whatever, …) off-line.

The Web Application Interface

The parser can be run from within Squeak, but most users will prefer to run it from a Web browser
- The Web interface is built with a popular Squeak Web Application development framework called Seaside (www.seaside.st)
- Seaside uses a web server (Comanche) that is embedded in Squeak.
- Although Comanche can be used as a general-purpose web server, its usage here is confined to supporting the parser and the Seaside application interface.


Parser Demonstration

Running the CERP Email Parser

Place the root directory of all mail accounts to be parsed in a common Windows directory, default C:\digpres\mail
Start the Squeak parser image and ensure that Seaside is running
- If necessary, start Seaside by executing “WAKom startOn: 9092”
Then open a browser window and direct it to http://localhost:9092/seaside/EmailParsing

The Parser Web Page


Parsing Results Status

Parser Subject-Sender Log

Parser Subject-Sender Log (cont.)


Parsed E-mail Body Excerpt

Parsed E-Mail Attachment Reference

Validation in Oxygen


Validation Message

External Attachment File

Exercise 5

Convert mbox via parser


Characteristics of METS

Metadata Encoding and Transmission Syntax
- Sections address descriptive, administrative, file groups, file hierarchies, and behavior
Proven non-proprietary standard
- Supporting maintenance body
Adopted by a range of similar systems
Integrates use of extension schemas

CERP Use of METS

Archival package definition
Descriptive, file groups, structural map sections
Import into DSpace
- Support of Dublin Core schema as a descriptive section extension
Required development of a METS Import module for DSpace
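The three METS sections named above (descriptive, file groups, structural map) can be seen in a bare skeleton. This Python sketch builds one with the standard library. It is an illustration only: the IDs, the `USE` and `TYPE` values, and the file names are hypothetical, the output is not validated against the METS schema, and it does not reproduce the CERP METS profile or the sample in Appendix E.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("dc", DC), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_mets(title: str, files: list) -> str:
    """Return a minimal METS document with one Dublin Core title."""
    root = ET.Element(f"{{{METS}}}mets")

    # Descriptive section: Dublin Core wrapped inside mdWrap/xmlData
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    # File section: one group listing every package component
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="archive")  # hypothetical USE value

    # Structural map: a single division pointing at each file
    struct = ET.SubElement(root, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div", TYPE="package", DMDID="dmd1")

    for number, name in enumerate(files, start=1):
        file_id = f"file{number}"
        entry = ET.SubElement(group, f"{{{METS}}}file", ID=file_id)
        ET.SubElement(entry, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": name})
        ET.SubElement(div, f"{{{METS}}}fptr", FILEID=file_id)

    return ET.tostring(root, encoding="unicode")


# Placeholder title and file names
sample = build_mets("CERP Demo Mail", ["account.xml", "EAD.zip"])
```

Each `fptr` in the structural map refers back to a `file` ID in the file section, which is how the package ties its description to its components.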


METS & the AIP

Dublin Core Choice

Fields selected

StructMap structure, content, use

File Group structure, content, use

Archival Information Package


Exercise 6

Complete AIP Assembly

DSpace

Using METS with DSpace


Demonstration

Loading AIP into DSpace

Exercise 7

Parse attendee’s messages

CERP http://siarchives.si.edu/cerp

Rockefeller Archive Center http://archive.rockefeller.edu

Smithsonian Institution Archives http://siarchives.si.edu


APPENDIX A

CERP FORMS AND GUIDANCE

Available on CERP website: http://siarchives.si.edu/cerp/progress.htm
• Depositor Survey – Electronic Record Status
• E-Mail Guidelines for Managers and Employees
• Responsible Recordkeeping: E-mail Records
• Email Guidance
• Transfer Guidelines
• Record Retention and Disposition Guidelines
• Metadata for E-Mail Form
• EAD sample

APPENDIX B

Collaborative Electronic Records Project (CERP)

Application to Use Materials Produced by CERP, including those on the CERP website

Although CERP products are not copyrighted, and our intent is for other archives and non-profit organizations to freely use the materials, for-profit entities should obtain our written permission and should not charge users for material obtained from CERP. Any organization, regardless of profit-making status, should appropriately credit CERP for materials used in whole or in part.

Images are owned by the Rockefeller Archive Center (RAC) or Smithsonian Institution Archives (SIA) or are copyright-free Microsoft Clip Art. Written permission to use images owned by RAC or SIA must be obtained from the RAC or SIA, as appropriate for the image. Credit line: “Courtesy of the Rockefeller Archive Center” or “Courtesy of Smithsonian Institution Archives” respectively.

Name of Organization: ______
Address: ______
Telephone: ______
Fax: ______
E-mail address: ______
Website URL: ______
Name and title of person requesting permission: ______
Name of CERP products you wish to use: ______
Intended use of materials: ______

Please note that CERP publications are not intended to provide, nor to constitute, legal or accounting advice and, therefore, any entity adopting or adapting them should consult with the appropriate professionals to ensure that the entity’s particular legal and accounting needs are met.

APPENDIX C

CERP model

SIP *

SIP to AIP •Archivist converts the collection to the .mbox (generic email format), if not already in this format. •Archivist runs the parser to convert the .mbox file/s to an XML preservation file with encoded attachments. •Archivist creates a package of all components (metadata, source, outputs, finding aids) in the zip format and submits to a digital repository.

AIP *

AIP to DIP The researcher queries the digital repository (DSpace) to find and retrieve the email collection results.

DIP *

* The SIP is the submission information package. It contains the email collection (variety of formats possible) received from the depositor and metadata narrative (both information supplied by the depositor and updated by the archivist).

* The AIP is the archival information package. It contains the source email from the depositor, metadata (manually created METS, narrative, and other), finding aid (manually created), .mbox files, parsed XML file, parsed attachments, bad messages from parser, and parser subject-sender log.

* The DIP is the dissemination information package. Package could include the entire package for viewing/downloading or a specific email message/s for viewing. The AIP remains in its original form.


APPENDIX D

METADATA NARRATIVE TEMPLATE

June 10, 2008 Smithsonian Institution Archives

This directory was created as part of the Collaborative Electronic Records Project (CERP).

Depositor: Smithsonian Institution Archives
Title: Smithsonian Institution Archives, Office of the Director, Email Records, 2004-2006
Transfer method: SIA server
Received by: Ginger Yowell
Account holder: Thomas F. Soapes
Received on: 3/29/2007
Email application: Outlook
Folder name: F:\DSpace SIA\AIP\Soapes.pst
Date range of rec’d: NA
Date range of Sent Items: 12/08/2004-11/14/2006
Accession: 07-109

Abstract: This accession consists of records created by Thomas F. Soapes during his tenure as Acting Director of the Smithsonian Institution Archives (2005-2007). Sent email also includes correspondence while he was chair of the Archives Division at the National Air and Space Museum. Some correspondence is transitory and/or sensitive in nature.

SOURCE
Data: Soapes.pst (TS_cerp.pst in AIP)
Analyses:
Virus checking results. None. See below.
Source data format validation of attachments. (JHOVE/DROID) None. See below.

ARCHIVAL MASTER
Data: TS_cerp.xml (created from mbox file/s of pst account)
Attachments (encoded)

METADATA
METS file
Finding Aid – EAD in XML and HTML
Sender_Subject logs - Files containing senders and subject lines from parser.

PRESERVATION DESCRIPTION INFORMATION (PDI)
Description of each step taken in the course of preparing the AIP. If a step is documented elsewhere in the archives as a standard practice, a reference to such is acceptable. All information necessary to document the authenticity of the record traceable (auditable) to the time it was ingested.

1. Initial Processing: Virus check, backup
2. Preservation Pre-processing: PST migrated into mbox format
3. Preservation: Mbox files transformed into XML file containing all emails in the account, accompanying attachments encoded.

1. No viruses detected. On Demand E-mail Scan conducted on .pst file. Working copy on external drive regularly backed up. Original .pst saved on SIA websites server.
Folders:
Sent Items - Folder 1
Sent Items - Folder 2
Sent Items - Folder 3
Sent Items - Folder 4

2. .pst migrated into mbox format using MessageSave. Commercial off-the-shelf (COTS) product works as add-in within Outlook. Mbox file created for each mail account folder. Name changed from folder name to messages.mbox so parser in Step 3 could work. Messages.mbox placed into corresponding folders.
Attachments extracted in native format using EZDetach. COTS product works as add-in within Outlook. No viruses detected. VirusScan On Demand used. Attachment formats are pdf and doc.
Logs were also created during parsing and are in the AIP.
Finding aid created in EAD using Oxygen. HTML and XML output.
Used DROID and JHOVE on attachments for format identification and validity. JHOVE and DROID reports are present in AIP. Used SiteOverview as well for sitemap.

3. Mbox file transformed into XML file containing all emails in the account, accompanying attachments encoded. This is the email account preservation file using the Email Preservation XML schema, EPres. This account was parsed June 10, 2008, on an SIA workstation.
METS file composed manually. AIP components zipped on June 10, 2008.

RIGHTS MANAGEMENT
Access to this directory is constrained in the following manner:
-Administrators – full control

STORAGE LOCATIONS
SIA server
LSF external drive

MIGRATION
No immediate preservation issues. Review account in 5 years.

UPDATE

Lynda Schmitz Fuhrig, SIA, June 10, 2008
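The Sender_Subject logs listed in the template are produced by the CERP parser, whose log format is not reproduced in this handbook. As a rough illustration only, a comparable listing could be generated from an .mbox file as sketched here; the function name, the tab-separated layout, and the column headings are all assumptions.

```python
import csv
import mailbox


def write_sender_subject_log(mbox_path, log_path):
    """Write one tab-separated sender/subject row per message; return row count."""
    box = mailbox.mbox(mbox_path, create=False)
    rows = 0
    try:
        with open(log_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
            writer.writerow(["sender", "subject"])  # assumed column headings
            for msg in box:
                writer.writerow([str(msg.get("From", "")),
                                 str(msg.get("Subject", ""))])
                rows += 1
    finally:
        box.close()
    return rows
```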


APPENDIX E

METS SAMPLE

- - - Adgent, Nancy RAC CERP Demo mail - RAC CERP Object owned by RAC - - - - - Nancy Adgent - Collaborative Electronic Records Project - Nancy's Demo E mail Rockefeller Archive Center 2008-04-11 2008-04-23 - 2006-03-08 - 2006-12-12 - CERP - Digital records - Electronic records - Email - Professional organizations - Projects - Rockefeller Archive Center - Smithsonian Institution

- Society of American Archivists 2008-CERP 1 English Correspondence regarding activities and publications of the Collaborative Electronic Records Project; correspondents include CERP team, professional organization representatives, internal RAC staff Nancy's.In E Rec Project Deposit 2008-12-31 copied from desktop of account owner by account owner This accession is part of the Collaborative Electronic Records Project (CERP) and is closed to researchers. It was gathered for demonstration purposes only. Rockefeller Archive Center CERP team only Electronic mail Mixed material 7.15 MB msg mbox xml Nancy Adgent CERP Record Group CERP Demo Mail Series - - - - - -


APPENDIX F

EAD SAMPLE


APPENDIX G

RESOURCES AND RELATED PROJECTS AND INITIATIVES

CERP (See website for list of resources)
http://siarchives.si.edu/cerp/progress.htm

Chronopolis
www.digitalpreservation.gov/news/2008/20080403news_article_chronopolis.html
California-based datagrid preservation community with audit reporting and authenticity checking capability.

Fedora
http://www.fedora-commons.org/
Fedora Commons is a non-profit organization providing digital content management and preservation technologies.

National Archives & Records Administration (U.S.)
http://www.archives.gov/era/about/index.html
Electronic Records Archives system being developed to preserve electronic records retaining dynamic features without requiring proprietary software.

Presidential Electronic Records Pilot System
http://perpos.gtri.gatech.edu/
Georgia Tech Research Institute partnership with Georgia Institute of Technology

Washington State Digital Archives
http://www.digitalarchives.wa.gov/Content.aspx?txt=background
Custom-built Web interface and database to preserve and provide access to State and Local electronic records.


APPENDIX H

SOFTWARE DOWNLOAD LINKS

Aid4Mail http://www.aid4mail.com/downloads.php

Amber Outlook http://www.processtext.com/abcoutlk.html

Amber PDF http://www.processtext.com/abcpdf.html

Emailchemy http://weirdkid.com/products/emailchemy/

Fentun http://www.fentun.com/

File Merlin http://www.file-convert.com/fmn.htm

JHOVE http://hul.harvard.edu/jhove/download.html

MessageSave http://www.techhit.com/messagesave/

oXygen XML editor http://www.oxygenxml.com/download.html

Xena http://xena.sourceforge.net/


NOTES
