Proposal for Electronic Archiving System (EAS) As Free Open Source Software
Grainne Reilly revision: June 6, 2019 original: Feb 12, 2016

EAS at Harvard

EAS is a system that enables ingest, management and basic preservation of email and also paves the way for access to email. It provides features to identify policy and curatorial issues, e.g. rights management and events tracking. EAS does not address the capture of email, nor does it address discovery or email delivery for end users. It focuses on the curation of email in preparation for long term preservation.

The project was developed in conjunction with 3 core partners at Harvard University (Schlesinger Library, HU Archives and Countway Library) with 2 additional participants (Harvard Art Museums and GSD Loeb Library). EAS was built to fulfill the needs of the Harvard University partners and is integrated with several other Harvard University systems: AMS, Policy, Wordshack and DRS.

Archiving Email

The lifecycle

1. Collection development
   a. Pre-acquisition appraisal ✗
2. Accessioning
   a. Capture ✗
   b. Normalization ✔
3. Archival Processing
   a. Item-level processing ✔
   b. Bulk processing ✔
   c. Intellectual arrangement ✔
   d. Search capability ✔
   e. Personal/Sensitive information processing ✔
4. Preservation
   a. Packaging ✔
   b. Repository ☑
5. Online Discovery ✗
6. Access ✗

✔ - EAS supports
☑ - EAS supports via DRS2
✗ - EAS does not support

Not every institution will want to follow the entire lifecycle.

The community and the Tools

In June 2015 there was an Archiving Email Symposium hosted by the Library of Congress with over 150 attendees. Attendees included people from the Smithsonian Institution, NARA, Emory University and Stanford University, among others. There was interest in tools to help in preserving email.
It was also apparent that there is no one tool that covers the entire lifecycle. A combination of tools may help institutions in their efforts to archive email for long term preservation. In fact, many institutions used Aid4Mail and/or Emailchemy to convert email to standard mbox or EML format before using that output as input to the next tool.

Open Sourcing EAS

By open sourcing EAS it is more likely that other institutions will collaborate in making their tools interoperable with EAS. This would be advantageous to Harvard University, where some EAS users have expressed an interest in using ePADD as a donor appraisal tool whose output might then be imported into EAS. It would also be advantageous to the community, which lacks a tool like EAS that could be used standalone.

EAS Technology

EAS is written in Java. EASi, the user interface to EAS, is a Java Struts 2 web application that runs in Tomcat. Most of the software used in EAS is open source. The notable exceptions are the internal LTS software libraries, Oracle as the database, and Emailchemy, which converts email from closed, proprietary file formats to the standard EML format. EAS does not provide an API for use by others.

• Tomcat 8
• Java 8
• Struts 2
• Ant
• Gradle
• Maven
• Oracle 12 (commercial)
• Hibernate 5.3.7
• Emailchemy embedded version (commercial)
• Mime4j 0.6
• Solr 4.10.1
• Solrj
• jQuery 1.8.2
• jQuery UI ThemeRoller
• ajax-solr
• flexigrid
• YUI Grids
• JSP
• CSS
• LTS utilities (proprietary)
• FITS 0.8
• Spring Batch 4.0.1

Since EAS may contain sensitive information, including HRCI, a security architecture was created to protect this data. This security architecture is mainly infrastructure based, for example through the use of secure networks and SSH mounting of file systems.
EAS Integration with other Harvard Systems

EAS integrates with several Harvard Library systems as shown in the diagram above.

AMS (Access Management System)

AMS is an LTS system that provides authentication and basic authorization services for library systems. AMS in turn interacts with HarvardKey and LDAP. It is a web application that makes use of cookies and browser redirects. EAS redirects users’ browsers to AMS and inspects the encrypted cookies that AMS creates. EAS uses an Access client jar to manage this.

Policy

Policy is used for authorization to library systems. EAS uses a Policy client jar that performs direct database queries.

Wordshack

Wordshack is the authority control / vocabulary manager for EAS and for DRS2. Wordshack manages admin category, admin flag, email address, person, organization, software and topic terms. These terms are used throughout EAS. Interaction with Wordshack is via a RESTful API; however, for performance reasons terms are also stored locally in the EAS database. EAS uses the client jar files provided by Wordshack to interact with it.

DRS (Digital Repository Service)

EAS uses DRS version 2 (DRS2) as the long term preservation repository for emails. EAS writes DRS2-specific batches to the file system when pushing items to DRS2 for long term preservation. EAS also interacts with DRS2 via a RESTful API, making use of two client jar files provided by DRS2.
This RESTful API is used for several interactions with DRS2, including:

• Collections are created in DRS2 via EASi
• Accounts are retrieved from DRS2 for use in EAS
• Billing codes are retrieved from DRS2 for use in EAS

One code base to serve us all

EAS is to continue to serve the Harvard University community first and foremost. This requires its continued use of LTS-specific systems. To make EAS open source and usable by others outside Harvard University, it is necessary to disentangle EAS from other LTS systems and from commercial or proprietary software. For the initial release of EAS as OSS we are aiming for a minimum viable product: it will contain the core features that permit it to be deployed and used with limited functionality.

It is proposed to manage this through a dependency management build tool and configuration management. There will be one code base from which one of two build versions is produced: the LTS build version or the Open Source build version. The build file for the LTS version will be excluded from the open source GitHub source control repository. Internal LTS jar files should only be used in the Harvard University version of the built system and should be excluded from the open source dependencies. The open source version should only require open source dependencies and should result in a standalone built system. A later phase will address integration/interoperability with other tools.

EAS makes use of Emailchemy for conversion of emails to the standard EML format. This is a commercial product. It would be beneficial to refactor EAS to permit its use without this commercial product. This would facilitate packaging EAS in a Docker container, since it would be a breach of license to include Emailchemy in a publicly available Docker container.

EAS also makes use of the commercial Oracle database. EAS does not make use of Oracle-specific features and could be configured to also work with the open source PostgreSQL database.
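As a rough illustration of the database switch described above, the sketch below maps a single configuration value to the Hibernate settings each build would need. The configuration key eas.db.vendor and the class DatabaseSettingsSketch are hypothetical and do not come from EAS. The Hibernate property keys, dialect classes and JDBC driver class names are the standard ones for Hibernate 5.x, Oracle and PostgreSQL. The values are plain strings, so the sketch runs without any of those libraries on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DatabaseSettingsSketch {

    // Hypothetical configuration key; the proposal does not name EAS's real settings.
    static final String VENDOR_KEY = "eas.db.vendor";

    // Returns the Hibernate settings that differ between the two database vendors.
    static Map<String, String> hibernateSettings(String vendor) {
        Map<String, String> settings = new LinkedHashMap<>();
        switch (vendor.toLowerCase(Locale.ROOT)) {
            case "oracle":
                settings.put("hibernate.connection.driver_class", "oracle.jdbc.OracleDriver");
                settings.put("hibernate.dialect", "org.hibernate.dialect.Oracle12cDialect");
                break;
            case "postgresql":
                settings.put("hibernate.connection.driver_class", "org.postgresql.Driver");
                settings.put("hibernate.dialect", "org.hibernate.dialect.PostgreSQL95Dialect");
                break;
            default:
                throw new IllegalArgumentException("Unsupported database vendor: " + vendor);
        }
        return settings;
    }

    public static void main(String[] args) {
        // Assumption: the open source build defaults to PostgreSQL when no vendor is set.
        String vendor = System.getProperty(VENDOR_KEY, "postgresql");
        hibernateSettings(vendor).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keeping the vendor-specific values in one place like this is only one option; the same effect could be achieved with separate properties files per build.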
Supporting PostgreSQL would lower the barrier for adoption of EAS and also permit packaging a prepopulated demo of EAS in a Docker container.

EAS initial refactoring with Anti-corruption layers between Bounded Contexts

An Anti-corruption layer is a concept from Domain Driven Design. In the case of EAS there are several bounded contexts (authentication, access control, controlled vocabulary, collections management etc.) that could benefit from this layer, permitting future implementations to be plugged in more easily. One way of organizing the design of the Anti-corruption layer is as a combination of Facades, Adapters and Translators, along with the communication and transport mechanisms usually needed to talk between systems.

Using dependency resolution and configuration management, either the LTS-specific or the default OSS-specific implementations will be available. Configuration will be used to manage feature toggles and feature gates.

Example code for use of feature toggle and feature gate:

    if (gateClient.isUngated("my.feature.name")) {
        doFeatureCode();
    }

Future Possibilities

Integration with ePADD

ePADD consists of 4 modules. In order of workflow usage they are: appraisal, processing, discovery and delivery. Mbox files are fed into the appraisal module, and to proceed to the next module it is necessary to export to an archive, an internal non-standard ePADD artifact. Conceptually an archive is a collection of indexed messages along with a blob store. This archive then needs to be imported into the next module, and the process repeats for each module. The delivery module does provide the ability to export emails to mbox format, but the export may not be lossless.
The April 2016 release of ePADD is planned to permit the export of emails to mbox format from the appraisal module; again, the export may not be lossless. The ePADD appraisal module is intended for standalone use on a donor’s workstation. At Harvard University, it is desirable for donors to be able to use the appraisal module of ePADD and for curators to use the result of that processing in EAS.
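One way to picture how such a hand-off could sit behind the anti-corruption layer and feature gate described earlier is the minimal sketch below. Everything in it is an assumption made for illustration: the interface and class names, the feature name epadd.mbox.import and the map-backed gate are hypothetical, and only the isUngated call mirrors the proposal's own example. The adapter splits mbox text on lines beginning with "From ", which assumes the exporting tool has escaped any such lines inside message bodies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class GatedImportSketch {

    // Hypothetical feature name, not taken from EAS or ePADD.
    static final String FEATURE = "epadd.mbox.import";

    // Hypothetical gate interface, shaped after the isUngated call in the proposal.
    interface GateClient {
        boolean isUngated(String featureName);
    }

    // Hypothetical facade: the only view of an external appraisal tool that core code sees.
    interface AppraisalImportFacade {
        List<String> importMessages(String mboxText);
    }

    // Gate backed by a simple map; unknown features stay gated.
    static class MapGateClient implements GateClient {
        private final Map<String, Boolean> gates;

        MapGateClient(Map<String, Boolean> gates) {
            this.gates = gates;
        }

        @Override
        public boolean isUngated(String featureName) {
            return gates.getOrDefault(featureName, Boolean.FALSE);
        }
    }

    // Adapter that translates mbox text into a list of raw messages.
    static class MboxImportAdapter implements AppraisalImportFacade {
        @Override
        public List<String> importMessages(String mboxText) {
            List<String> messages = new ArrayList<>();
            StringBuilder current = null;
            for (String line : mboxText.split("\n")) {
                if (line.startsWith("From ")) {
                    // A separator line closes the previous message and opens a new one.
                    if (current != null) {
                        messages.add(current.toString());
                    }
                    current = new StringBuilder();
                } else if (current != null) {
                    current.append(line).append('\n');
                }
            }
            if (current != null) {
                messages.add(current.toString());
            }
            return messages;
        }
    }

    // Core code calls the facade only when the feature is ungated.
    static List<String> importIfUngated(GateClient gate, AppraisalImportFacade facade, String mboxText) {
        if (gate.isUngated(FEATURE)) {
            return facade.importMessages(mboxText);
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org Mon Jan 1 00:00:00 2018\nSubject: one\n\nfirst body\n"
                + "From b@example.org Tue Jan 2 00:00:00 2018\nSubject: two\n\nsecond body\n";
        GateClient gate = new MapGateClient(Collections.singletonMap(FEATURE, true));
        List<String> messages = importIfUngated(gate, new MboxImportAdapter(), mbox);
        System.out.println(messages.size() + " message(s) imported");
    }
}
```

Because core code depends only on the facade and the gate, an LTS build and an open source build could supply different adapters, or leave the feature gated, without changes elsewhere.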