Grainne Reilly revision: June 6, 2019 original: Feb 12, 2016 Proposal for Electronic Archiving System (EAS) as Free Open Source Software

EAS at Harvard

EAS is a system that enables the ingest, management and basic preservation of email, and paves the way for access to email. It provides features for identifying policy and curatorial issues, e.g. rights management and events tracking.

EAS does not address the capture of email nor does it address discovery or email delivery for end users. It focuses on the curation of email in preparation for long term preservation.

The project was developed in conjunction with 3 core partners at Harvard University (Schlesinger Library, HU Archives and Countway Library) with 2 additional participants (Harvard Art Museums and GSD Loeb Library).

EAS was built to fulfill the needs of the Harvard University partners and is integrated with several other Harvard University systems – AMS, Policy, Wordshack and DRS.


Archiving Email

The lifecycle

1. Collection development
   a. Pre-acquisition appraisal ✗
2. Accessioning
   a. Capture ✗
   b. Normalization ✔
3. Archival Processing
   a. Item-level processing ✔
   b. Bulk processing ✔
   c. Intellectual arrangement ✔
   d. Search capability ✔
   e. Personal/Sensitive information processing ✔
4. Preservation
   a. Packaging ✔
   b. Repository ☑
5. Online Discovery ✗
6. Access ✗

✔ - EAS supports
☑ - EAS supports via DRS2
✗ - EAS does not support

Not every institution will want to follow the entire lifecycle.

The community and the Tools

In June 2015 the Library of Congress hosted an Archiving Email Symposium with over 150 attendees, including people from the Smithsonian Institution, NARA, Emory University and Stanford University, amongst others. There was interest in tools to help in preserving email. It was also apparent that no one tool covers the entire life cycle. A combination of tools may help institutions in their efforts to archive email for long term preservation. In fact, many institutions used Aid4Mail and/or Emailchemy to convert email to the standard mbox or eml format before using that output as input to the next tool.

Open Sourcing EAS

By open sourcing EAS it is more likely that other institutions will collaborate in making their tools interoperable with EAS. This would be advantageous to Harvard University, where some EAS users have expressed an interest in using ePADD as a donor appraisal tool whose output might then be imported into EAS. It would also be advantageous to the community, which lacks a tool like EAS that could be used standalone.

EAS Technology

EAS is written in Java. EASi, the user interface to EAS, is a Java Struts 2 web application that runs in Tomcat. Most of the software used in EAS is open source. The notable exceptions are the internal LTS software libraries, Oracle as the database, and Emailchemy, which converts email from closed, proprietary file formats to the standard EML format. EAS does not provide an API for use by others.

• Tomcat 8
• Java 8
• Struts 2
• Ant
• Gradle
• Maven
• Oracle 12 (commercial)
• Hibernate 5.3.7
• Emailchemy embedded version (commercial)
• Mime4j 0.6
• Solr 4.10.1
• Solrj
• jQuery 1.8.2
• jQuery UI ThemeRoller
• ajax-solr
• flexigrid
• YUI Grids
• JSP
• CSS
• LTS utilities (proprietary)
• FITS 0.8
• Spring Batch 4.0.1

Since EAS may contain sensitive information, including HRCI, a security architecture was created to protect this data. This security architecture is mainly a matter of infrastructure, for example the use of secure networks and SSH mounting of file systems.


EAS Integration with other Harvard Systems

EAS integrates with several Harvard Library systems as shown in the diagram above.

AMS (Access Management System)

AMS is an LTS system that provides authentication and basic authorization services for library systems. AMS in turn interacts with HarvardKey and LDAP. It is a web application that makes use of cookies and browser redirects. EAS redirects users’ browsers to AMS and inspects encrypted cookies that AMS creates. EAS makes use of an Access client in order to manage this.

Policy

Policy is used for authorization to library systems. EAS makes use of a Policy client jar that is used to perform direct database queries.

Wordshack

Wordshack is the authority control / vocabulary manager for EAS and for DRS2. Wordshack manages admin category, admin flag, email address, person, organization, software and topic terms. These terms are used throughout EAS. Interaction with Wordshack is via a RESTful API; however, for performance reasons terms are stored locally in the EAS database. EAS makes use of the client jar files provided by Wordshack for interacting with Wordshack.


DRS (Digital Repository Service)

EAS uses the DRS version 2 (DRS2) as the long term preservation repository for emails. EAS writes DRS2 specific batches to the file system when pushing items to DRS2 for long term preservation. EAS also interacts with DRS2 via a RESTful API, making use of two client jar files provided by DRS2. This RESTful API is used for several interactions with DRS2, including:
• Collections are created in DRS2 via EASi
• Accounts are retrieved from DRS2 for use in EAS
• Billing codes are retrieved from DRS2 for use in EAS

One code base to serve us all

EAS is to continue to first and foremost serve the Harvard University community. This requires its continued use of LTS specific systems. To make EAS open source and usable by others outside of Harvard University, it is necessary to disentangle EAS from other LTS systems and from commercial or proprietary software. For the initial release of EAS as OSS we are aiming for a minimum viable product: it will contain core features which will permit it to be deployed and be usable with limited functionality.

It is proposed to manage this through a dependency management build tool and configuration management. There will be one code base from which one of two build versions is produced: the LTS build version or the Open Source build version. The build file for the LTS version will be excluded from the open source GitHub source control repository.

Internal LTS jar files should only be used in the Harvard University version of the built system and excluded from the open source dependencies. The open source version should only require open source dependencies and should result in a standalone built system. A later phase will address integration/interoperability with other tools.

EAS makes use of Emailchemy for conversion of emails to the standard EML format. This is a commercial product. It would be beneficial to refactor EAS to permit its use without this commercial product. This would facilitate packaging EAS in a Docker container, since it would be a breach of license to include Emailchemy in a publicly available Docker container.

EAS also makes use of the commercial Oracle database. EAS does not make use of Oracle specific features and could be configured to also work with the open source PostgreSQL database. This would lower the barrier for adoption of EAS and also permit packaging a prepopulated demo of EAS in a Docker container.


EAS initial refactoring with Anti-corruption layers between Bounded Contexts

An Anti-corruption layer is a concept from Domain Driven Design. In the case of EAS there are several bounded contexts (authentication, access control, controlled vocabulary, collections management etc.) that could benefit from this layer, permitting future implementations to be plugged in more easily. One way of organizing the design of the Anti-corruption layer is as a combination of Facades, Adapters and Translators, along with the communication and transport mechanisms usually needed to talk between systems. Using dependency resolution and configuration management, either the LTS specific or the default OSS specific implementations will be available. Configuration will be used to manage feature toggles and feature gates.
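The Facade, Adapter and Translator roles described above can be sketched for the authentication bounded context as follows. This is a minimal illustration under stated assumptions, not the EAS design: every type name here (EasUser, AuthenticationFacade, ExternalAccountRecord, AccountTranslator, InMemoryAuthenticationAdapter) is hypothetical, and the in-memory adapter merely stands in for whatever an OSS build might read from a file.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Domain-side user type; the rest of the application would depend only on this. (Hypothetical.) */
final class EasUser {
    final String id;
    final String email;
    EasUser(String id, String email) { this.id = id; this.email = email; }
}

/** Facade: the only authentication entry point visible to the application. (Hypothetical.) */
interface AuthenticationFacade {
    Optional<EasUser> authenticate(String token);
}

/** Shape of a record as an external system might return it. (Hypothetical.) */
final class ExternalAccountRecord {
    final String uid;
    final String mail;
    ExternalAccountRecord(String uid, String mail) { this.uid = uid; this.mail = mail; }
}

/** Translator: converts the external representation into the domain type. */
final class AccountTranslator {
    static EasUser toEasUser(ExternalAccountRecord record) {
        return new EasUser(record.uid, record.mail.toLowerCase(Locale.ROOT));
    }
}

/** Adapter: one pluggable implementation of the facade, backed by an in-memory map. */
final class InMemoryAuthenticationAdapter implements AuthenticationFacade {
    private final Map<String, ExternalAccountRecord> recordsByToken = new HashMap<>();

    void register(String token, ExternalAccountRecord record) {
        recordsByToken.put(token, record);
    }

    @Override
    public Optional<EasUser> authenticate(String token) {
        // Unknown tokens yield an empty result rather than an exception.
        return Optional.ofNullable(recordsByToken.get(token)).map(AccountTranslator::toEasUser);
    }
}

final class AntiCorruptionLayerDemo {
    public static void main(String[] args) {
        InMemoryAuthenticationAdapter adapter = new InMemoryAuthenticationAdapter();
        adapter.register("token-1", new ExternalAccountRecord("u1", "Curator@Example.org"));
        AuthenticationFacade facade = adapter;
        System.out.println(facade.authenticate("token-1").map(u -> u.email).orElse("not authenticated"));
    }
}
```

Because callers hold only an AuthenticationFacade, a different adapter (for example one that talks to an institutional service) could be substituted without touching the calling code.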


Example code for use of a feature toggle and feature gate:

if (gateClient.isUngated("my.feature.name")) {
    doFeatureCode();
}
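One way the gate client in the example above might be backed by configuration is sketched below. It assumes, purely for illustration, that gates are stored as boolean entries in a java.util.Properties object and that a missing entry means the feature stays gated. The GateClient class shown here, the "drs.push" key and the label strings are hypothetical, not part of EAS.

```java
import java.util.Properties;

/** Hypothetical gate client: a feature is ungated only when its key is set to "true". */
final class GateClient {
    private final Properties config;

    GateClient(Properties config) {
        this.config = config;
    }

    boolean isUngated(String featureName) {
        // Missing keys default to "false", so unknown features stay gated.
        return Boolean.parseBoolean(config.getProperty(featureName, "false"));
    }
}

final class FeatureGateDemo {
    /** Chooses a user interface label depending on whether the (hypothetical) gate is open. */
    static String pushLabel(GateClient gateClient) {
        if (gateClient.isUngated("drs.push")) {
            return "Push to DRS";
        }
        return "Create package only";
    }

    public static void main(String[] args) {
        Properties ltsConfig = new Properties();
        ltsConfig.setProperty("drs.push", "true");
        System.out.println(pushLabel(new GateClient(ltsConfig)));
        System.out.println(pushLabel(new GateClient(new Properties())));
    }
}
```

Defaulting to gated means a build with an incomplete configuration exposes fewer features rather than more.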

Future Possibilities

Integration with ePADD

ePADD consists of 4 modules; in order of workflow usage they are appraisal, processing, discovery and delivery. Mbox files are fed into the appraisal module. To proceed to the next module it is necessary to export to an archive, an internal, non-standard ePADD artifact. Conceptually an archive is a collection of indexed messages along with a blob store. This archive then needs to be imported into the next module, and the process repeats for each module. The delivery module does provide the ability to export emails to mbox format, but the export may not be lossless. The April 2016 release of ePADD is planned to permit the export of emails to mbox format from the appraisal module; again, it may not be lossless.

The ePADD appraisal module is intended for standalone use on a donor’s workstation. At Harvard University, it is desirable for donors to be able to use the appraisal module of ePADD and for curators to use the result of that processing in EAS. Some possible approaches for achieving that follow.

With the April 2016 release the donor will be able to export the result of their processing to mbox format, which could then be imported into EAS. EAS could split these mbox files into eml files itself, without the use of Emailchemy (it is easy to identify the start of each new message by the presence of the “From_” line). A mechanism for controlling this splitting would be needed. Alternatively, the mbox could be run through some software to produce eml files which could then be imported into EAS. LTS would require that this approach be recorded via events and a client agent.

As an alternative, ePADD could provide a client jar file for extracting emails from an archive into mbox or even eml format. EAS could use this to process an ePADD archive. The disadvantage of this approach is that it would only work with Java applications. The ePADD archive contains serialized objects which can only be reliably reconstituted by using the Java language, which limits how portable these archives are.

Potential Roles

• LTS project manager for managing OSS infrastructure and framework, and for moving EAS to OSS.
• HL project manager for liaising with the HL community, the external community and HL leadership.
• LTS developers: EAS, DRS, Discovery, Access/Delivery, Wordshack


Open Source Software Checklist

| Basic Factors | Explanation | Remarks | R | A | C | I |
|---|---|---|---|---|---|---|
| Usefulness | Software should be useful more or less “as is”. | OSS version should not include LTS specific jars. OSS version should not require commercial products. | LTS | ? | ? | ? |
| Interoperability | If the software interoperates with other software tools, the open source project should have well documented, preferably standards based, interfaces to external code - web services, class interfaces, or other points. | EAS needs to be refactored to provide abstraction/anti-corruption layers where alternate implementations may be plugged in. | LTS | ? | ? | ? |
| License | The software should be released with a license statement e.g. Apache 2, GPL, LGPL, MIT, BSD, AGPL v3. | Choice is limited by dependency on software with restrictive licenses. If a given dependency is optional how does that affect the license requirement? | LTS | ? | OGC & HUIT CTO | HL |
| Contributor License Agreement | Many open source projects require this. | | LTS | ? | OGC & HUIT CTO | HL |
| Copyright | At top of each Class | | LTS | LTS | OGC & Provost Office & HUIT CTO | HL |
| Patent | Some software includes a patent in addition to the software. | An example is the Facebook ReactJS library. Investigate HU policies governing this. | LTS | LTS | OGC & Provost Office & HUIT CTO | HL |
| User Documentation | For use by users | | ? | ? | ? | ? |
| Developer Documentation | For use by developers | | LTS | ? | ? | ? |
| Code Documentation | Class level at a minimum | | LTS | ? | ? | ? |
| Source control | GitHub | | LTS | ? | ? | ? |
| Issue Tracking | GitHub | LTS uses Jira. How do we synchronize GitHub issues with internal LTS Jira issues? | LTS | ? | ? | ? |
| Deployment packaging | Should we provide a ready to run implementation? This would enable easier adoption by others. | ePADD provides a packaged version ready for deployment on Windows or Mac. EAS uses some Linux specific functionality. We could provide a Dockerized version configured for quick setup. | LTS | ? | ? | ? |
| Demo | Should we provide a self-contained demo? | Provide a lot of examples and concentrate on having some really shiny ones to impress users/developers enough to take a closer look. | | | | |
| Contributions | Who should decide what contributions to accept? | Contributions include ideas. | LTS? | HL? | ? | ? |
| Committers | Initially LTS only | | LTS | ? | ? | ? |
| Tests | Do contributors need to provide tests for contributed code: unit tests, integration tests, functional tests? | Initially will only accept ideas and not code. | LTS | ? | ? | ? |
| Documentation | What level of documentation would we require for contributed code? | Initially will not be accepting code. | LTS | ? | ? | ? |
| Support | Need a forum for discussing features, technical issues etc. What forum? | Requires an email list/Google group etc. | LTS | ? | ? | ? |
| Outreach and communications | What forums do we want to post on? What events do we want to present at? | | HL | ? | ? | ? |

R: Responsible – who is assigned to do the work
A: Accountable – who makes the final decision and has ultimate ownership
C: Consulted – who must be consulted before a decision or action is taken
I: Informed – who must be informed that a decision has been taken

HL: Harvard Library
LTS: Library Technology Services
OGC: Office of the General Counsel


Proposed work for Open Sourcing EAS

Miscellaneous: Fields that are mandatory in EAS due to its integration with DRS2 will remain mandatory and will be populated with default values in the OSS version.

Future: these mandatory fields may be made configurable.

1 Create a new build process with dependency resolution

Description: The current build process for EAS uses Ant with no dependency resolution. Alternatives are Ivy, Maven or Gradle. First choice is Gradle, second choice is Ivy. Maven is stubbornly “opinionated” and would not accommodate many of our existing LTS projects.

Update: Move to Maven and Docker, and possibly Ansible (ongoing). Need to update the version of Docker/Docker Compose. The LTS change control process has changed and is in flux due to the introduction of Docker and Ansible.

Implementation detail: the Java ServiceLoader may be used to facilitate switching implementations of services. Maven can then be used to pull the correct implementation jar into the build.
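A minimal sketch of the ServiceLoader approach follows. The VocabularyService interface, the LocalVocabularyService default and the ServiceResolver helper are hypothetical names chosen for illustration. In a real build, an implementation jar would register its provider in a META-INF/services file named after the fully qualified interface name; here no provider is registered, so the resolver falls back to the built-in default.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

/** Hypothetical service interface that LTS and OSS builds would each implement. */
interface VocabularyService {
    String name();
}

/** Hypothetical default used when no provider is found on the classpath. */
final class LocalVocabularyService implements VocabularyService {
    @Override
    public String name() {
        return "local";
    }
}

final class ServiceResolver {
    /**
     * Returns the first provider registered under META-INF/services for
     * VocabularyService, or the local default when none is present.
     */
    static VocabularyService resolve() {
        Iterator<VocabularyService> providers = ServiceLoader.load(VocabularyService.class).iterator();
        return providers.hasNext() ? providers.next() : new LocalVocabularyService();
    }

    public static void main(String[] args) {
        System.out.println("Using vocabulary service: " + resolve().name());
    }
}
```

With this arrangement the choice of implementation is made by which jar the build tool places on the classpath, not by code changes.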

Comments: This enables a customized build for the LTS version versus the OSS version. This must work with the LTS change control process. LTS proprietary jars should be excluded from the OSS build dependencies. Jars from dependent projects (e.g. Hibernate) should be pulled in using dependency management during the build.

Question: Fits includes ots.jar which is a proprietary LTS jar. What is the implication of this?

EAS currently uses 93 jar files in addition to those used by Fits and Solr.

See Grouper Build/Dependency Management for some reasoning on the choice of a build tool.

This is an absolute prerequisite for this project. It is not possible to exclude jar files without this. The build system needs to be set up in order to continue development.

Future:
Dependencies: The LTS Artifactory instance should be populated with the required jars.
Feedback: RS – this is a technical debt project and therefore does not belong in this project but rather in a “technical debt” project.
Proposed phase: Phase 1

2 Abstract out authentication

Description: Abstract out authentication so that it can be configured to:
1. Use AMS for authentication
2. Use authentication information from an XML file
3. Facilitate plugging in of a new authentication mechanism in future

Update: Switch from AMS to CAS. Do include XML/JSON; use JSON Schema for validation of data.

Comments: Authentication is currently closely tied to the user’s HUID, which is used throughout the system. The user’s email address, which AMS returns via an LDAP lookup, is also used. For security reasons, the LTS version must only work with AMS and a valid HUID. The OSS version should not be configurable to use AMS and should not include the access.jar file. It should fail gracefully if misconfigured.

Future: Internal database, OAuth, Shibboleth, CAS, LDAP, Active Directory, Open Connect
Dependencies: (1)
Feedback:
AM – implement LDAP for phase 3.
GR – depending on feedback from the community, decide which implementation to use for phase 3.
Proposed phase: Phase 1

3 Abstract out authorization

Description: Abstract out authorization so that it can be configured to:
1. Use Policy for authorization
2. Use authorization information from an XML file
3. Facilitate plugging in of a new authorization mechanism in future

Update:
• Talk with IAM about making direct API calls to Grouper.
• Need to set up Grouper groups for EAS; run them by LTS Support.
• If IAM won’t permit direct API calls, stay with using Policy.
• Still need to use Policy for DRS depositor lookup.
• For OSS, use authorization information from an XML/JSON file (use JSON Schema for JSON validation).

Comments: Currently the user HUID is used to look up policy information. For security reasons, the LTS version must only work with Policy and a valid HUID. The OSS version should not be configurable to use Policy. It should fail gracefully if misconfigured.

Future: Internal database, Grouper, LDAP
Dependencies: (1)(2)
Feedback:
AM – implement LDAP for phase 3.
GR – depending on feedback from the community, decide which implementation to use for phase 3.
Proposed phase: Phase 1

4 Enable configuration to use PostgreSQL instead of Oracle

Description: EAS currently is configured to use Oracle. It makes no use of Oracle specific features and could work with PostgreSQL via minor configuration changes, since EAS uses Hibernate ORM.

Update: Switch all versions to use PostgreSQL (RDS). This will involve work from Sharon’s group. It will also involve input from AM/Sharon to ensure it remains at the right security level for HRCI.

Comments: Use of PostgreSQL removes a dependency on a commercial database. This eliminates constraints concerning license restrictions. Use of PostgreSQL:
• Lowers the barrier to adopt EAS (no license to pay)
• Permits the creation of a self-contained, pre-populated EAS demo in a Docker container (it is a breach of the Oracle license to deploy the database in a Docker container)

The LTS version of EAS should continue to use Oracle for performance and operational reasons. The OSS version of EAS should be configurable to use either Oracle or PostgreSQL.

Future:
Dependencies: (1)
Feedback:
Proposed phase: Phase 1
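The “minor configuration changes” mentioned for item 4 might amount to selecting a different Hibernate dialect and JDBC driver per build. The sketch below only builds the property set and does not start Hibernate. The DatabaseSettings class is hypothetical, and the dialect and driver class names are the ones commonly used with Hibernate 5.x; they should be checked against the versions actually in use.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical helper that maps a configured database name to Hibernate settings. */
final class DatabaseSettings {

    static Properties hibernateProperties(String database) {
        Properties properties = new Properties();
        switch (database.toLowerCase(Locale.ROOT)) {
            case "postgresql":
                properties.setProperty("hibernate.dialect", "org.hibernate.dialect.PostgreSQL95Dialect");
                properties.setProperty("hibernate.connection.driver_class", "org.postgresql.Driver");
                break;
            case "oracle":
                properties.setProperty("hibernate.dialect", "org.hibernate.dialect.Oracle12cDialect");
                properties.setProperty("hibernate.connection.driver_class", "oracle.jdbc.OracleDriver");
                break;
            default:
                // Fail early on a misconfigured database name.
                throw new IllegalArgumentException("Unsupported database: " + database);
        }
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(hibernateProperties("postgresql"));
    }
}
```

Connection URL, credentials and schema settings would still need to be supplied separately for each environment.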

5 Abstract out Accounts (DRS owner codes) and Billing Codes

Description: Owner codes (Accounts) are stored in DRS2 and a local copy is created in the EAS database when enabled for use in EAS. Billing codes are stored in DRS2 and retrieved for use in EAS. Abstract out Accounts and Billing Codes so that EAS can be configured to:
• Use Accounts and Billing Codes from DRS2
• Use Accounts and Billing Codes from an XML file
• Facilitate plugging in of other means of retrieving Accounts and Billing Codes in future

Update: OSS to use XML or JSON configuration; use JSON Schema for validation.

Comments: The LTS version should only work with Accounts and Billing Codes from DRS2. The OSS version should not be configurable to use Accounts and Billing Codes from DRS2.

Future: AM – make it configurable to make accounts and billing codes optional.
Dependencies: (1)(2)(3)
Feedback:
RS – should work out the impact on the time it might take to implement bullet 3 above.
GR – if we are making it configurable to read this information from a file, then we need to create an abstraction layer anyhow.
GR – regarding the future option of making this information optional: this information is mandatory in the database and Solr index because it is very important for LTS. Using dummy values from a default XML file will not inconvenience users who do not need this information. The system has not been architected for configuring optionality of database tables/fields, and doing so would require significant work.
Proposed phase: Phase 1


6 Abstract out DRS Collections

Description: Collections are created in DRS2 via the EAS user interface. Minimal collection information is stored in the EAS database and the EAS Solr index. Need the ability to configure EAS to create Collections:
• In DRS2 with minimal information in EAS
• With minimal information only in EAS

Comments: The LTS version should only work with Collections in DRS2. The OSS version should not be configurable to create Collections in DRS2.

Future: A separate project could manage collections. Library Cloud has a separate project for managing its collections, which could be used as a model for a future EAS collections management project. This would require providing an API in EAS for updating the core collection information in the Solr index and the local database.
Dependencies: (1)(2)(3)
Feedback:
WG – need to be able to associate items with collections.
Agreed: make EAS configurable to only require a title for a collection and not collect any other information on Collections. Then in future abstract out creation of collections in other systems.
Reduced estimate based upon the above agreement.
Proposed phase: Phase 1

7 Abstract out Wordshack Terms

Description: Enable configuration of EAS to create and use controlled vocabulary terms:
• In Wordshack
• In an XML file
• Facilitate plugging in of other means of managing a controlled vocabulary

Update: Use an updatable XML/JSON file or store (since email addresses are created during import), OR in the OSS version terms could just be created directly in the database. Question: what UI would be used for creating terms directly in the database?

Comments: Wordshack terms are intricately tied into the system:
• on the server
• in the user interface (it uses a Wordshack widget) in conjunction with a proxy servlet filter
• in the database
• in the Solr index

The XML/JSON file should be kept simple for the initial release. The LTS version should only work with Wordshack terms. The phase 1 OSS version should not be configurable to work with Wordshack and should not include the Wordshack client jar. The supported email clients are recorded in Wordshack as software terms.

Future: Possibly expand to support other controlled vocabularies.
Dependencies: (1)(2)(3)
Feedback:
RS – if Wordshack were available as open source it could be used, and so we might not need to do this work.
GR – we do not want to force OSS users to use our implementation of a controlled vocabulary. Also, we do not want to build in a dependency on another project being open sourced in order to open source EAS.
Proposed phase: Phase 1

8 Remove FITS from OSS Version

Description: Enable configuration of EAS to remove FITS.

Comments: FITS is used during import and push to DRS2. During ingest it is only required in order to get file format information. For the OSS version, EAS can get the file format information by issuing the “file” command under Linux (EAS already needs to run under Linux, so this adds further non-portable code to a system that is already Linux specific). This should still be configurable, and EAS should fail gracefully if configured to use FITS in the OSS version.

Future: See 9
Dependencies: (1)
Feedback:
Proposed phase: Phase 1

9 Replace EAS FITS Servlet with OSS FITS Servlet

Description: When EAS was implemented, FITS was included in the project and implemented as a web application (similar to Solr). Since then an open source version of the FITS Servlet has been developed and is almost ready for use. Once the open source version of the FITS Servlet has been released, it should be used by EAS instead of its own implementation.

Update: This has already been implemented using AM’s FITS Docker image. However, the FITS Docker image should be made available on Docker Hub.

Comments: Use of the OSS version of the FITS Servlet will make it easier to keep up to date with the latest version of FITS. This may be managed by dependency resolution. The fits.jar file will still be required by the EAS web application in order to process the output from FITS during import (used in order to populate the file format information for attachments).

Future:
Dependencies: OSS version of the FITS Servlet must have been released.
Feedback: Need to align licenses.
Proposed phase: Phase 2


10 Configure Disabling of Push to DRS

Description: Provide the ability to configure EAS to:
• Push to DRS
• Disable push to DRS

Update: For the OSS version, disable push to DRS but enable creation. The OSS version must still create a package.

Could create a batch identical to the DRS batch, and OSS users could manipulate it themselves to produce what they want. Change descriptors to be less DRS specific in the OSS version. The LTS version uses OTS, which contains a lot of LTS specific constants, LTS specific validation etc.

TODO: Tricia and Steve need to establish what is acceptable in the descriptors. Issues with descriptors:
• Contain Wordshack URIs
• Contain URNs
• Contain drsAdmin data (schema)
• Contain hulEventExtension (schema)

For OSS perhaps simple descriptors should be created using JAXB and not using OTS.

Comments: The RESTful API in item 20 will let other projects pull the information required in order to create a package for preservation. Item 21 would provide the ability to actually create a bag for archiving, making use of the API provided by item 20. Item 11 provides for export of emails and attachments, but not metadata, as an mbox.

Future:
Dependencies: (1)
Feedback:
RS – need the ability to create a very simple bag.
WG – need output, so do need to include the ability to create a preservation package. This could be an appealing deliverable for an IMLS grant.
Reduced estimate based on discussions.
On further discussion with RS and WG, will not create a bag until we know what would be useful in the bag. Need to discuss this with the community during the workshop.
Proposed phase: Phase 1

11 Add ability to configure EAS to not use Emailchemy

Description: This will remove a dependency on a commercial tool. By removing this dependency it will be possible to package EAS in a Docker container. It also reduces the barrier to use of EASi by removing the necessity to pay for software. EAS should fail early and gracefully if it has been configured to not use Emailchemy and a user submits a packet type which can only be processed by Emailchemy.

Comments: Most of the work will be around project build configuration. We do not want this to result in a more onerous deployment in LTS, so it needs to be as automated as possible.

Future:
Dependencies: (1)
Feedback:
Proposed phase: Phase 1

12 Add handling for eml files

Description: By permitting the submission of eml files in a packet, users will have the option of using whatever tool they like to convert their emails to eml prior to using EAS.

Comments: Should the creator agent be a combination of eml and the tool used to convert to eml? If so, it should be recorded in the controlled vocabulary as a software term.

Future:
Dependencies: (1)
Feedback:
Proposed phase: Phase 1

13 Add handling for mbox files by EAS itself

Description: This would permit handling of mbox files without requiring the use of Emailchemy. Many mailboxes can be saved from email servers etc. in mbox format. It appears to be relatively simple to split an mbox file into individual eml files: the start of each new message is identified by the “From_” line (use a regex on /^From / lines).

Comments: EAS should be recorded as the agent in the normalization event.

Future:
Dependencies: (1)
Feedback:
Proposed phase: Phase 1
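The mbox splitting described in item 13 can be sketched as below. This is an illustration under simplifying assumptions, not the EAS implementation: the MboxSplitter class is hypothetical, the whole mailbox is held in memory as a string, a line beginning with "From " is treated as a separator only at the start of the file or after a blank line, and quoted ">From " lines in message bodies are left untouched. Mailboxes written by tools that do not quote body lines could therefore be split incorrectly, and large files would need streaming.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits mbox text into individual message strings. */
final class MboxSplitter {

    /** Returns one string per message; the "From " separator lines are dropped. */
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        boolean previousLineBlank = true; // the start of the file counts as a boundary
        for (String line : mbox.split("\r?\n")) {
            if (previousLineBlank && line.startsWith("From ")) {
                // A separator line closes the previous message and opens a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // eml files conventionally use CRLF line endings.
                current.append(line).append("\r\n");
            }
            previousLineBlank = line.isEmpty();
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org Mon Jan  4 10:00:00 2016\n"
                + "Subject: one\n\nFirst body\n\n"
                + "From b@example.org Mon Jan  4 10:05:00 2016\n"
                + "Subject: two\n\nSecond body\n";
        System.out.println(split(mbox).size() + " messages found");
    }
}
```

Each returned string could then be written to its own eml file, with EAS recorded as the agent of the normalization event as noted in the comments for this item.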

14 Do thorough review of libraries used in EAS OSS version

Description: This is required in order to ensure that we are in compliance with all licenses of libraries used in EAS OSS. Part of this will be to list all the libraries used in the:
• OSS version
• LTS version

This is also a required step in order to set up dependency resolution correctly.

Comments: Can dependency management also handle licenses? We may need to manually include licenses etc.

Future:
Dependencies:
Feedback:
Proposed phase: Phase 1

15 Do thorough cleanup of tests

Description: EAS has numerous unit and integration tests which are currently badly organized. These need to be cleaned up. With the refactoring it may make sense to introduce the use of mocks.

Comments:
Future:
Dependencies:
Feedback:
Proposed phase: Phase 2

16 Make User interface changes

Description: Use feature toggles to enable/disable LTS specific language.

Comments:
Future:
Dependencies:
Feedback:
Proposed phase: Phase 1


17 Review for use of public/private/protected/package level methods

Description: The access modifiers on classes within EAS were not carefully managed. Leaving methods public which should in fact be private can lead to misuse of those methods.

Comments:
Future:
Dependencies:
Feedback:
Proposed phase: Phase 2

18 Handle configuration of other jobs

Description: There are several jobs used in EAS. These would need to be configurable (using feature toggles) for the LTS version or the OSS version.

Comments: The LTS version should permit running of these jobs: Loader, Importer, DRS prearchiver, DRS postarchiver, DRS packet events archiver, account monthly statistics. The OSS version should not permit running of the DRS prearchiver, DRS postarchiver or DRS packet events archiver.

Future:
Dependencies:
Feedback:
Proposed phase: Phase 1

19 Remove LTS proprietary jars
Description: The LTS proprietary util.jar provides functionality that is now mostly available in core Java or in open source libraries. Where possible the code should be refactored to use these implementations in order to remove reliance on LTS proprietary code. The LTS proprietary ldap.jar is not used. Both these jar files should be removed if possible.
Comments: Users of OSS projects need access to the source, so any jar files used in the project should also be open source.
Future Dependencies: (1)
Feedback:
Proposed phase: Phase 1

20 Implement RESTful API
Description: To make EAS more open for use it would be beneficial to create a RESTful API.
Comments: This RESTful API could be used by another application to create a bag (see 21) or to create and manage collections (see 6). The API must be implemented so that it may be used by external clients via REST and by EAS itself in process.
Future Dependencies: (1)
Feedback:
Proposed phase: Phase 3
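The requirement that one API serve both external REST clients and EAS itself in process could be met by putting a thin REST layer in front of a plain Java interface. The sketch below is only an illustration under that assumption: CollectionApi, InMemoryCollectionApi and CollectionRestAdapter are invented names, the paths are made up, and no particular HTTP framework is implied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical contract shared by in-process callers and the REST layer.
interface CollectionApi {
    String createCollection(String name);   // returns the new collection id
    String getCollectionName(String id);    // returns null when the id is unknown
}

// In-process implementation; EAS code would call this directly.
class InMemoryCollectionApi implements CollectionApi {
    private final Map<String, String> namesById = new LinkedHashMap<>();

    @Override
    public String createCollection(String name) {
        String id = "c" + (namesById.size() + 1);
        namesById.put(id, name);
        return id;
    }

    @Override
    public String getCollectionName(String id) {
        return namesById.get(id);
    }
}

// Thin adapter: translates REST-style requests into calls on the same contract.
// A real deployment would sit behind an HTTP framework; none is assumed here.
class CollectionRestAdapter {
    private static final String PREFIX = "/collections/";
    private final CollectionApi api;

    CollectionRestAdapter(CollectionApi api) {
        this.api = api;
    }

    // Returns "<status> <body>" for a method, path and optional request body.
    String handle(String method, String path, String body) {
        if ("POST".equals(method) && "/collections".equals(path)) {
            return "201 " + api.createCollection(body);
        }
        if ("GET".equals(method) && path.startsWith(PREFIX)) {
            String name = api.getCollectionName(path.substring(PREFIX.length()));
            return name == null ? "404 not found" : "200 " + name;
        }
        return "400 bad request";
    }
}
```

Because the adapter holds no logic of its own, the bag creation in item 21 could call CollectionApi directly while external applications go through the REST layer and get the same behaviour.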

21 Implement LOC Bag creation
Description: Implement creation of a bag which makes use of the in-process API from 20 above. This process should be triggered via the user interface.
Comments: Use https://github.com/LibraryOfCongress/bagit-java to help in bag creation.

Question: what should be in the descriptor files? METS seems not to be popular. Need feedback from the community on this.

When items are successfully archived to DRS they are deleted from EAS (without generating any delete events). What should happen when a bag is created? Creation of a bag does not mean that the items have been successfully archived.
Future Dependencies: (18)
Feedback:
Proposed phase: Phase 3
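To make the target of item 21 concrete, the sketch below writes the smallest layout the BagIt format calls for: a bagit.txt declaration, a data/ payload directory and a SHA-256 payload manifest. It deliberately uses only core Java and not bagit-java, whose API is not shown here. MinimalBagWriter is a hypothetical helper that assumes flat file names in the payload, and it leaves open both the descriptor files and what should happen to the items in EAS afterwards.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: writes bagit.txt, data/ and manifest-sha256.txt.
final class MinimalBagWriter {

    // payload maps a flat file name (placed under data/) to its bytes.
    static void write(Path bagRoot, Map<String, byte[]> payload)
            throws IOException, NoSuchAlgorithmException {
        Path dataDir = Files.createDirectories(bagRoot.resolve("data"));

        // Bag declaration required by the BagIt specification.
        Files.write(bagRoot.resolve("bagit.txt"),
                "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
                        .getBytes(StandardCharsets.UTF_8));

        // One manifest line per payload file: "<checksum>  data/<name>".
        // TreeMap gives the manifest a stable, sorted order.
        StringBuilder manifest = new StringBuilder();
        for (Map.Entry<String, byte[]> entry : new TreeMap<>(payload).entrySet()) {
            Files.write(dataDir.resolve(entry.getKey()), entry.getValue());
            manifest.append(sha256Hex(entry.getValue()))
                    .append("  data/").append(entry.getKey()).append('\n');
        }
        Files.write(bagRoot.resolve("manifest-sha256.txt"),
                manifest.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Lower-case hex SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] bytes) throws NoSuchAlgorithmException {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance("SHA-256").digest(bytes)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Writing the bag does not verify that the payload has reached any repository, which is why the question above about deleting items from EAS after bag creation still needs a separate answer.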

22 Package for deployment
Description: To reduce the barrier to adoption it is desirable to provide a deployable version of EAS.
Comments: EAS uses some “Unix-like” OS-specific commands and so will not run on Windows (one reason was a bug in the Java File class, which does not handle certain special characters in file names). EAS could be packaged for Mac using Oracle AppBundler with hdiutil (ePADD does this).

It may be best to provide it in a Docker container.
Future Dependencies:
Feedback:
Proposed phase: Phase 1


Proposed Roadmap

Prerequisites/Phase 1
Item 1 Create a new build process with dependency resolution

Phase 1
Item 2 Abstract out authentication
Item 3 Abstract out authorization
Item 5 Abstract out Accounts (DRS owner codes) and Billing Codes
Item 6 Abstract out DRS Collections
Item 7 Abstract out Wordshack Terms
Item 8 Remove FITS from OSS version
Item 10 Configure Disabling of Push to DRS
Item 11 Add ability to configure EAS to not use Emailchemy
Item 12 Add handling for eml files
Item 13 Add handling for mbox files by EAS itself
Item 16 Make User interface changes
Item 18 Handle configuration of other jobs
Item 19 Remove LTS proprietary jars
Item 4 Enable configuration to use PostgreSQL instead of Oracle
Item 14 Do thorough review of libraries used in EAS OSS version
Item 22 Package for deployment

Phase 2
Item 15 Do thorough cleanup of tests
Item 17 Review for use of public/private/protected/package level methods
Item 9 Replace EAS Fits Servlet with OSS Fits Servlet

Phase 3
Item 20 Implement RESTful API
Item 21 Implement LOC Bag creation

Phase 4
Details are to be decided by the community. Interoperability is to be as loosely coupled as possible, e.g. via file interchange, RESTful APIs and the like.
Make EAS interoperable with ePADD
Make EAS interoperable with Bitcurator (redaction)

Resources

Wordshack
https://wiki.harvard.edu/confluence/display/LibraryTechServices/SysDev+-+WordShack

Access Management System
https://wiki.harvard.edu/confluence/display/LibraryTechServices/SysDev+-+Access

Policy Server
https://wiki.harvard.edu/confluence/display/LibraryTechServices/SysDev+-+Policy+Server

DRS2
https://wiki.harvard.edu/confluence/display/LibraryTechServices/SysDev+-+DRS2

Emailchemy
http://www.weirdkid.com/products/emailchemy/

DArcMail
http://www.digitalpreservation.gov/meetings/documents/aes15/1_LC_AES_SIA_EmailandCERP_DarcMail_20150602.pdf
http://siarchives.si.edu/blog/yes-we%E2%80%99re-still-talking-about-email
http://www.history.ncdcr.gov/SHRAB/ar/emailpreservation/mail-account/mail-account_docs.html

Bitcurator
http://www.bitcurator.net/

ePADD
http://library.stanford.edu/projects/epadd
https://github.com/ePADD/epadd
https://github.com/ePADD/muse

Lifecycle Tools for Archival Email Stewardship (in progress)
https://docs.google.com/spreadsheets/d/1V1N22xnr5e0EbDlZWx58bjYO6rkrMrYH9wGX9-CK8c4/edit?pli=1#gid=986222267


Archiving Email Symposium 2015
http://www.digitalpreservation.gov/meetings/archivingemailsymposium.html

Email related RFCs
https://tools.ietf.org/html/rfc5322

Email formats
http://www.digitalpreservation.gov/formats/fdd/fdd000388.shtml
http://www.digitalpreservation.gov/formats/fdd/fdd000383.shtml

FITS
http://projects.iq.harvard.edu/fits
https://github.com/harvard-lts/fits

Open Source
https://wiki.harvard.edu/confluence/display/LibraryTechServices/LTS+Open+Source+Projects
Introducing the OpenSource Maturity Model
Making an Open Source Project Bloom

Licenses
http://choosealicense.com/licenses/
https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses
http://opensource.org/licenses/


Jar files used by EAS
access.jar (LTS proprietary)
activation.jar
antlr-2.7.6.jar
aopalliance-1.0.jar
apache-mime4j-0.6.jar
aspectjrt-1.6.8.jar
aspectjweaver-1.6.8.jar
c3p0-0.9.1.jar
cglib-nodep-2.2.jar
com.ibm.jbatch-tck-spi-1.0.jar
commons-cli-1.1.jar
commons-codec-1.6.jar
commons-collections-3.1.jar
commons-configuration-1.5.jar
commons-fileupload-1.2.1.jar
commons-httpclient-3.1.jar
commons-io-2.3.jar
commons-lang-2.4.jar
commons-lang3-3.1.jar
commons-logging-1.1.3.jar
commons-pool2-2.2.jar
dom4j-1.6.1.jar
drs2_services-dto.jar (LTS proprietary)
drs2_services-util.jar (LTS proprietary)
easi.jar
ehcache-1.5.0.jar
fits.jar
fluent-hc-4.3.5.jar
freemarker-2.3.15.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
guava-15.0.jar
hibernate-jpa-2.0-api-1.0.0.Final.jar
hibernate-testing.jar
hibernate-tools.jar
hibernate3.jar
httpclient-4.3.5.jar
httpclient-cache-4.3.5.jar
httpcore-4.3.2.jar
httpmime-4.3.5.jar
javassist-3.9.0.GA.jar
jaxen-core.jar
jaxen-jdom.jar
jcl-over-slf4j-1.6.1.jar
jdom.jar
jettison-1.1.jar
jstl.jar
jta-1.1.jar
ldap.jar (LTS proprietary)
log4j-1.2.17.jar
mail.jar
mina-core-1.1.7.jar
noggit-0.5.jar
ognl-2.7.3.jar
ojdbc14.jar
oscache-2.1.jar
ots.jar (LTS proprietary)
saxpath.jar
servlet-api.jar
slf4j-api-1.7.7.jar
slf4j-log4j12-1.7.7.jar
solr-solrj-4.10.1.jar
spring-aop-3.2.3.RELEASE.jar
spring-batch-core-2.2.2.RELEASE.jar
spring-batch-infrastructure-2.2.2.RELEASE.jar
spring-batch-test-2.2.2.RELEASE.jar
spring-beans-3.2.3.RELEASE.jar
spring-context-3.2.3.RELEASE.jar
spring-context-support-3.2.3.RELEASE.jar
spring-core-3.2.3.RELEASE.jar
spring-expression-3.2.3.RELEASE.jar
spring-jdbc-3.2.0.RELEASE.jar
spring-orm-3.0.5.RELEASE.jar
spring-retry-1.0.2.RELEASE.jar
spring-test-3.2.3.RELEASE.jar
spring-tx-3.2.3.RELEASE.jar
standard.jar
stax2-api-3.0.1.jar
staxmate-2.0.0.jar
struts2-core-2.1.8.1.jar
struts2-json-plugin-2.1.8.1.jar
swarmcache-1.0RC2.jar
util.jar (LTS proprietary)
velocity-1.4.jar
velocity-tools-generic-1.1.jar
woodstox-core-lgpl-4.0.7.jar
wordshack-client.jar (LTS proprietary)
wstx-asl-3.2.7.jar
xercesImpl.jar
xml.jar
xpp3_min-1.1.4c.jar
xstream-1.3.jar
xwork-core-2.1.6.jar
zookeeper-3.4.6.jar
