Technical Stream: Archivematica

Houston, TX | 14 November 2018

Introductions

Ross Spencer @ Artefactual Systems, Inc.
[email protected]

Schedule - Day Two

9:30 - 11:00 Checking in; technical architecture, deployment, and working open source.

11:00 - 13:00 Break, community profiles, and lunch

13:00 - 15:00 Monitoring, logs, the API, and automation tools.

Schedule - Day Three

09:30 - 10:30 Accessing Archivematica’s data, other GitHub highlights, features showcase.

10:30 - 14:00 Community profiles, lunch, and AM camp wrap-up

Checking in:

● Day one:
  ○ Introductions
  ○ Technical architecture ✓
  ○ Deployment ✓
  ○ Working open source ✓
    ■ Working with clients ✓
    ■ Peering into the future ✓
● Extent:
  ○ Supporting your users ✓
    ■ Logs and monitoring ✓
    ■ Automation ✓
    ■ GitHub Highlights ✓
  ○ Information in Archivematica ✓
  ○ Features showcase ✓
● Before we begin:
  ○ Gaps given what was presented?
  ○ Hopes and aspirations?

It’s complicated…

Technical Architecture

● MCP Server
● Gearman
● MCP Client
● Dashboard
● Storage Service

MCP: Master Control Program

MCP Server

● Is at the heart of Archivematica.
● Monitors watched directories.
● Kicks off various job chains based on the state implied by those watched directories.
● Reads the information about the microservice chains (job chains) associated with a workflow (transfer or ingest) from the database.
● Sends information to the Gearman server about what tasks to perform.
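For intuition, a minimal sketch of the watched-directory pattern, not Archivematica's actual implementation; the path shown is the default shared directory, and the callback is hypothetical:

# Minimal sketch of the watched-directory pattern; illustrative only.
import os
import time

# Default Archivematica shared directory (may differ per install).
WATCHED = "/var/archivematica/sharedDirectory/watchedDirectories/activeTransfers"

def poll(start_job_chain, interval=1):
    """Poll WATCHED and fire the (hypothetical) callback on new entries."""
    seen = set()
    while True:
        for entry in os.listdir(WATCHED):
            if entry not in seen:
                seen.add(entry)
                start_job_chain(os.path.join(WATCHED, entry))
        time.sleep(interval)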

https://github.com/artefactual/archivematica/tree/qa/1.x/src/MCPServer

Gearman Server

● Runs Gearman workers based on the information the MCP Server reads from the MCP database.
● Runs the tasks, collects the exit code, stdout, and stderr, and sends a callback to the MCP Server once the job is complete.
● The Gearman server is running MCP Client tasks, AKA microservices.
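For a feel of the mechanics, a hedged sketch of a worker registering with Gearman via the python-gearman library; the task name here is illustrative, not a real Archivematica microservice:

import gearman

def run_task(worker, job):
    # job.data carries the task payload; the return value is sent
    # back to the submitter (the MCP Server) as the result.
    return "exit_code=0"

worker = gearman.GearmanWorker(["127.0.0.1:4730"])  # Gearman's default port
worker.register_task("example-microservice", run_task)
worker.work()  # block, executing jobs as they are submitted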

MCP Client

● A series of Python modules.
● Encapsulates what we’ve come to know as microservices.
● Jobs are run per file, or per unit, e.g. per SIP.
● Everything from assigning file UUIDs, to checksumming, antivirus scanning, or extracting files from packages.
● Scripts commonly wrap other tools or Python libraries.
● Data is written to the database, or to log files.
● Information written to the database is then written out to various files such as the AIP METS.
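A hedged sketch of the shape of such a client script: a hypothetical checksum microservice wrapping a Python library, taking a file path and returning an exit code:

# Illustrative only; not a shipped MCP Client script.
import hashlib
import sys

def call(file_path):
    """Hash one file and print the digest; the exit code reports success."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(chunk)
    print(sha256.hexdigest())
    return 0

if __name__ == "__main__":
    sys.exit(call(sys.argv[1]))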

https://github.com/artefactual/archivematica/tree/qa/1.x/src/MCPClient

Dashboard

● One of the two ways that users can control what’s happening across all of those components.
● Most Dashboard functions have a near equivalent API call.
● The GUI part of what we know of as Archivematica.

https://github.com/artefactual/archivematica/tree/qa/1.x/src/dashboard

Storage Service

● A storage-space abstraction layer.
● Can connect to a local filesystem or cloud-based services such as Amazon S3.
● Multiple pipelines (Dashboard, Client, and Server) can connect to one Storage Service so all storage is controlled centrally.
● Commonly you’ll see transfer source locations, processing staging areas, and package storage locations (for SIPs/DIPs/AIPs).
● Provides segregation of content.
● Also has an API.
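For example, listing the locations the Storage Service knows about over its API; a hedged example against a local development install, where host, port, and credentials are the test defaults used later in this deck:

http --pretty=format \
  GET "http://127.0.0.1:62081/api/v2/location/" \
  Authorization:"ApiKey test:test"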

https://github.com/artefactual/archivematica-storage-service/tree/qa/0.x

In this section we will look at:

● Production deployment methods
● Development deployment methods
● Optional packages for Archivematica

Deployment

Packages: Ubuntu, CentOS, Red Hat:

● https://github.com/artefactual-labs/am-packbuild

Ansible: Playbooks describe setup procedures that are executed automatically:

● https://github.com/artefactual/deploy-pub
● https://github.com/artefactual-labs/ansible-archivematica-src

Deployment

Docker: The most recent development in AM deployment. Not in production yet, but great for development work:

● https://github.com/artefactual-labs/am

The Compose scripts are:

● https://github.com/artefactual-labs/am/tree/master/compose
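A hedged sketch of getting a development instance up with the Compose scripts; the make targets shown follow the repository's README at the time and may have changed:

git clone https://github.com/artefactual-labs/am
cd am/compose
make create-volumes           # create the external data volumes
docker-compose up -d --build  # build images and start the services
make bootstrap                # initialise the databases and search indices
make restart-am-services      # restart Archivematica services post-bootstrap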

Optional Packages

Automation tools: github.com/artefactual/automation-tools

Fixity: github.com/artefactual/fixity

Development tools: github.com/artefactual/archivematica-devtools

METS reader-writer: github.com/artefactual-labs/mets-reader-writer

Acceptance tests: github.com/artefactual-labs/archivematica-acceptance-tests

Working open source

● GitHub
● ‘The’ wiki…
● Google Groups
● archivematica/issues
● CONTRIBUTING.md
● Working with clients

It’s all out there…

● It’s all about GitHub… *
  ○ Three organizations: archivematica, artefactual, and artefactual-labs.
● Everything we’ve spoken about so far is there for all to access.
● Plus documentation, and installation instructions.
● Issues are temporarily spread across repositories.
  ○ Aiming to consolidate: new issues are all being placed in github.com/archivematica/issues.
● Working on a roadmap and a predictable release schedule.

Artefactual and -labs

● 22 and 50 repositories, respectively.
● github.com/artefactual is still our primary organization:
  ○ Archivematica
  ○ Storage Service
  ○ Automation tools
● github.com/artefactual-labs incubates future mainline code (ideally), e.g.:
  ○ Acceptance tests
  ○ METS reader-writer
  ○ Packbuild (package-building scripts)

WikiWikiWeb*

The Archivematica wiki still exists:

● https://wiki.archivematica.org

but we’re trying to bring it together:

● Docs:
  ○ https://github.com/artefactual/archivematica-docs
  ○ https://github.com/artefactual/archivematica-storage-service-docs
● Issues repo:
  ○ E.g. improvements become issues, or epics, depending on scope.

* Ward Cunningham (1995); from the Hawaiian ‘wiki’, meaning ...fast

WikiWikiWiki...

● It is still there to be owned by the community.
● We’re suggesting ideas, but it needs some attention:
  ○ Case studies:
    ■ https://wiki.archivematica.org/Community/CaseStudies
  ○ Regional community user-groups:
    ■ https://wiki.archivematica.org/Community/Regional_User_Groups
● Some pages could still be translated into GitHub issues:
  ○ Improvements:
    ■ https://wiki.archivematica.org/Improvements
● And it provides the only reference documentation for the Archivematica and Storage Service APIs.

We have a Google Group

● Now just the one group for general questions; we no longer allow posts to archivematica-tech.

https://groups.google.com/forum/#!forum/archivematica

● A great way for users to engage with others in the community.
● Post updates, blog-posts, seek discussion.
● It may not necessarily be about a problem or issue with Archivematica:
  ○ “I’m trying to do X, has anyone tried something similar?”

And we have archivematica/issues as well

● Bringing what we can into GitHub.
● Trying to be more responsive to community needs, and transparent in process.
● Information on the Google Group and issues wiki:
  ○ Google Group announcement.
  ○ Issues wiki: https://github.com/archivematica/Issues/wiki

Utilising GitHub Features

● Templates for various issue types to help guide folk.

Waffle

● Members of the archivematica/issues organization/community are invited to add labels.
● And then register an interest in issues via labels.
● Which translates into filters in Waffle, providing an overview of what folks are interested in, and of what we’re working on.

Waffle

Waffle:Archivematica

But GitHub is for code!

Gratis: “It costs very little to open a ticket, and very little to close. Don’t be afraid to ask the question, and raise a potential issue...”

GitHub for admins and devs and management

● Encourage and support your team members who want to engage.
● Don’t hold back in suggesting issues either.
● Or indeed, contributing if you can… (see next slide)

GitHub for everyone...

● Go for it!

Good first issue...

Waffle:Archivematica

CONTRIBUTING.md

● For bigger issues, the best way to begin is by engaging with the team at Artefactual, and the community:
  ○ Seek involvement early through GitHub and Google Groups.
  ○ Worst-case scenario: it’s already possible, or we’re already working on it - but we might need your help!
● Two main contributing resources:
  ○ We try to describe as much as we can about our processes, and what is needed as part of a pull-request, in CONTRIBUTING.md.
  ○ Guidelines are applied equally internally and for public submissions.
  ○ I’d highlight tools like Flake8 and Pylint as ways you can improve the quality of your code.
● And for the docs:
  ○ Contributing- and style-guidelines.

CONTRIBUTING.md

● But the message, really, is that there are numerous ways to contribute:
  ○ Labelling…
  ○ Logging issues...
  ○ Promoting the issues that you want to see addressed…
  ○ Participating in the issues that you see appear!

Working with clients: A developer’s perspective

● Support
● Analysis
● Acceptance tests and feature files
● Development

Support

● Developers will support #devops and #analysts when clients have issues with installation, configuration, and the day-to-day running of Archivematica.
● There is always the possibility that a piece of code isn’t performing as expected in such a large code-base.
● There is also always the possibility that a piece of code is assumed to do one thing when it was written to do another.
● This can result in code changes, but it can also be a documentation issue.

Analysis

● Working with a client often means listening to requirements and finding the right solution.
● A feature may need refining to be suitable for a larger number of users.
● Analysts and developers work closely with the client to do this.
● This can often result in code changes, but can also result in a new set of processes for the client where the capability is already technically in Archivematica.

Acceptance tests (AMAUAT)

● Behaviour-driven development (BDD): “the what, not the how”.
● Written in ‘Gherkin’ and implemented in Behave.
  ○ A Python library, vs. Ruby’s Cucumber.
● Living documentation once complete:
  ○ Describes the original vision for a feature.
  ○ Ensures that it works into the future.
● Found in Artefactual Labs:
  ○ https://github.com/artefactual-labs/archivematica-acceptance-tests
● Developer documentation:
  ○ archivematica-acceptance-tests/developer-documentation.rst

Example:

Scenario 1: A transfer is set up with items in an objects folder

Given a folder in an Archivematica transfer source location
And the folder contains an 'objects' directory
And the 'objects' directory contains all the digital files or folders associated with the transfer
When a transfer is started using that folder
Then the SIP that is created will preserve the structure of the files and folders in the 'objects' directory
And the Transfer METS will contain a structMap describing the 'objects' directory structure

Persistent Identifier (PID Binding) feature and implementation

● Feature file (Gherkin syntax):
  ○ https://github.com/artefactual-labs/archivematica-acceptance-tests/blob/master/features/core/pid-binding.feature
● Implementation (steps) file:
  ○ https://github.com/artefactual-labs/archivematica-acceptance-tests/blob/master/features/steps/pid_binding_steps.py

Development

● Development right now is done using branches and merges.
  ○ Branches:
    ■ Two primary branches: stable/1.x*, qa/1.x.
    ■ Plus development branches, usually forked off qa/1.x, labelled according to an issue in GitHub, e.g.:
      ● $ git checkout -b dev/issue-1-improve-archivematica-docs
● Pull-requests:
  ○ We submit PRs against all of our work.
  ○ Those are code-reviewed by one other developer on average, often two depending on scope.
● Development is increasingly guided by a release schedule, which helps to set client- and community-expectations:
  ○ Aiming for three releases per year.

Peering into the future

● Labels
● Branches
● Pull-requests

How can you see what’s going on?

● Labels show status:
  ○ Refining, in-progress, review, verified, done.
● Milestones too:
  ○ 1.8, 1.9, etc. - the result of release scoping early on in the process.
  ○ Increasingly guiding our commitment to a stable release schedule.
● Branches:
  ○ Our current methods mean that anything in qa/1.x or qa/0.x is pretty much guaranteed to be part of the next point release:
    ■ Archivematica qa/1.x
    ■ Storage Service qa/0.x
● And pull-requests (PRs):
  ○ Archivematica PRs
  ○ Storage Service PRs

RFCs (Requests-for-comment)

● Another way to look into the future:
  ○ https://github.com/artefactual-labs/archivematica-rfcs-test
● Making larger, architectural decisions more visible to all.
● The first RFC discusses the work required to replace METS handling across Archivematica with our standardised mets-reader-writer library.
  ○ It would require a substantial amount of refactoring.
  ○ And a significant number of tests to be set up, and testing once implemented.

Supporting your users; supporting your workflows...

● Monitoring
● The APIs
● Automation Tools
● Other GitHub Highlights

Logs and Monitoring

● Log files
● Zabbix
● Grafana

Log files

● Shows the output of the Archivematica services: MCP Client, Server, Dashboard, and the Storage Service.
● Commonly you are looking for a Python exception where a script may have errored:

ERROR 2018-11-08 14:03:51 django.request:base:handle_uncaught_exception:256: Internal Server Error: /status/
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 108, in get_response
    response = middleware_method(request)
  File "/usr/local/lib/python2.7/site-packages/django/middleware/locale.py", line 32, in process_request
    request, check_path=check_path)
  File "/usr/local/lib/python2.7/site-packages/django/utils/translation/__init__.py", line 198, in get_language_from_request
    return _trans.get_language_from_request(request, check_path)
  ......
  File "/usr/local/lib/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 204, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2005, "Unknown MySQL server host 'mysql' (110)")

Reading logs

less /var/log/archivematica/dashboard/dashboard.log
less /var/log/archivematica/dashboard/dashboard.debug.log
less /var/log/archivematica/MCPClient/MCPClient.log
less /var/log/archivematica/MCPClient/MCPClient.debug.log
less /var/log/archivematica/MCPServer/MCPServer.log
less /var/log/archivematica/MCPServer/MCPServer.debug.log
less /var/log/archivematica/storage-service/storage-service.log
less /var/log/archivematica/storage-service/storage-service.debug.log

Zabbix

● A monitoring tool for diverse IT components.
● Web-based; can be run locally.
● Provides:
  ○ Custom monitoring.
  ○ Alerts.
  ○ Granularity of output.
  ○ Templating.

Zabbix

Grafana

● A general-purpose dashboard and charting tool.
● Web-based; can be run locally.
● SQLite backend.
● Provides:
  ○ Time-series charts.
  ○ Heatmaps.
  ○ Custom tables.

Grafana and Zabbix

Archivematica example using the Zabbix plugin

Grafana and Zabbix

AtoM example using the Zabbix plugin

Grafana

Localhost:Example

More information...

Diagnostics Guide:SlideShare.net

The APIs

● Archivematica API
● Storage Service API
● $ python amclient.py

Archivematica APIs

Archivematica:

● https://wiki.archivematica.org/Archivematica_API

Storage Service:

● https://wiki.archivematica.org/Storage_Service_API

API tutorial for University of Denver:

● API and AMClient Scripts (September 2018)

API Highlights

List unapproved transfers:

http --pretty=format \
  -f \
  GET "http://127.0.0.1:62080/api/transfer/unapproved/" \
  Authorization:"ApiKey test:test"

API Highlights

List unapproved transfers (Response):

{ "message": "Fetched unapproved transfers successfully.", "results": [ { "directory": "amcamp-1", "type": "standard", "uuid": "b385d6b9-baed-4f41-a798-b3eb22752797" } ] } API Highlights

Approve a transfer:

http --pretty=format \
  -f \
  POST "http://127.0.0.1:62080/api/transfer/approve" \
  Authorization:"ApiKey test:test" \
  type="standard" \
  directory="amcamp-1"

API Highlights

Approve a transfer (Response):

{ "message": "Approval successful.", "uuid": "b385d6b9-baed-4f41-a798-b3eb22752797" } API Highlights

Monitor its progress (Transfer):

http --pretty=format \
  -f \
  GET "http://127.0.0.1:62080/api/transfer/status/b385d6b9-baed-4f41-a798-b3eb22752797/" \
  Authorization:"ApiKey test:test"

API Highlights

Monitor its progress (Transfer) (Response):

{ "directory": "amcamp-1-b385d6b9-baed-4f41-a798-b3eb22752797", "message": "Fetched status for b385d6b9-baed-4f41-a798-b3eb22752797 successfully.", "microservice": "Check transfer directory for objects", "name": "amcamp-1", "path": "/var/archivematica/.../completedTransfers/amcamp-1-b385d6b9-baed-4f41-a798-b3eb22752797/", "sip_uuid": "6bf7f7ea-27f9-4002-b114-fa7500cd8540", "status": "COMPLETE", "type": "transfer", "uuid": "b385d6b9-baed-4f41-a798-b3eb22752797" } API Highlights

Monitor its progress (Ingest):

http --pretty=format \
  -f \
  GET "http://127.0.0.1:62080/api/ingest/status/6bf7f7ea-27f9-4002-b114-fa7500cd8540/" \
  Authorization:"ApiKey test:test"

API Highlights

Monitor its progress (Ingest) (Response):

{ "directory": "amcamp-1-6bf7f7ea-27f9-4002-b114-fa7500cd8540", "message": "Fetched status for 6bf7f7ea-27f9-4002-b114-fa7500cd8540 successfully.", "microservice": "Remove the processing directory", "name": "amcamp-1", "path": "/var/archivematica/.../amcamp-1-6bf7f7ea-27f9-4002-b114-fa7500cd8540/", "status": "COMPLETE", "type": "SIP", "uuid": "6bf7f7ea-27f9-4002-b114-fa7500cd8540" } API Highlights

Check its storage status:

http --pretty=format \
  GET "http://127.0.0.1:62081/api/v2/file/6bf7f7ea-27f9-4002-b114-fa7500cd8540/" \
  Authorization:"ApiKey test:test"

API Highlights

Check its storage status (Response):

{ "current_full_path": "/var/archivematica/.../amcamp-1-6bf7f7ea-27f9-4002-b114-fa7500cd8540.", "current_location": "/api/v2/location/02925a38-0a1f-492c-94f3-9d7fad383dd9/", "current_path": "6bf7/.../amcamp-1-6bf7f7ea-27f9-4002-b114-fa7500cd8540.7z", "encrypted": false, "misc_attributes": {}, "origin_pipeline": "/api/v2/pipeline/b5055f71-d1d3-4be6-a1eb-7dc0c0d9ec48/", "package_type": "AIP", "related_packages": [], "replicas": [], "replicated_package": null, "resource_uri": "/api/v2/file/6bf7f7ea-27f9-4002-b114-fa7500cd8540/", "size": 10776, "status": "UPLOADED", "uuid": "6bf7f7ea-27f9-4002-b114-fa7500cd8540" } $ python amclient.py

● A Python command-line client, and module.
● Created to support work in the automated acceptance tests and automation tools.
● Wraps many of the commands you need from the API.
● Available in the automation-tools repository:
  ○ https://github.com/artefactual/automation-tools#archivematica-client

$ python amclient.py
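It can also be used as a module; a hedged sketch, assuming the AMClient class exposes a method per CLI subcommand (credentials and URL are the test defaults used above):

from transfers import amclient

am = amclient.AMClient(
    am_api_key="test",
    am_user_name="test",
    am_url="http://127.0.0.1:62080",
)
print(am.unapproved_transfers())  # same data as the CLI call below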

List unapproved transfers:

python -m transfers.amclient unapproved-transfers test \
  --am-user-name test \
  --am-url http://127.0.0.1:62080

Approve transfer:

python -m transfers.amclient approve-transfer amcamp_1 test \
  --am-user-name test \
  --am-url http://127.0.0.1:62080

Automation Tools

● Automate transfers
● Perform pre-ingest tasks

Architecture

● Designed to be run using cron to periodically start new transfers and monitor progress.
● Checks for pre-transfer, or transfer, scripts (hooks) to be run against content, and runs those in alphabetical order.
● Status is written to an SQLite database.
● Once a transfer completes, its status is updated, and the next one is written to the automation-tools database to be monitored.
● Ideally, a processing configuration (processingMCP.xml) is configured for the automation tools to be as automated as possible, i.e. no decision points such as ‘Create SIP, or send to Backlog’.
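A hedged sketch of the cron side: a wrapper script (like the example below) invoked periodically; the schedule, user, and script path are illustrative:

# m h dom mon dow user          command
*/10 * * * *      archivematica /usr/local/bin/transfer-script.sh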

Example

#!/bin/bash
set -ux

AT_LOC_UUID=3d5bfc42-395d-4197-bba7-da38b02abd0d
AT_LOC="/home/ross-spencer/git/mainline-archivematica/automation-tools"

cd $AT_LOC
echo `pwd`

run_at () {
    python -m transfers.transfer --am-url "http://127.0.0.1:62080" \
        --ss-url "http://127.0.0.1:62081" --user "test" --api-key "test" \
        --ss-user "test" --ss-api-key "test" --transfer-source "$1"
}

run_at "$AT_LOC_UUID"
AT_RET=$?
echo $AT_RET

Example

Status is monitored and logged:

id  uuid                                  path      unit_type  status      microservice   current
--  ------------------------------------  --------  ---------  ----------  -------------  -------
1   31f2ae86-0bb3-4703-97e4-6d03ae2b2d21  amcamp-1  ingest     COMPLETE    Create SIP(s)  0
2   3ff6547a-a955-4608-b892-323216da8b8a  amcamp-2  transfer   USER_INPUT  Create SIP(s)  1

@bitarchivist’s slide-deck

● Timothy Walsh, Concordia, and formerly the Canadian Centre for Architecture.
● Practical Experience with Automation Tools [SlideShare.net] (Archivematica Camp Baltimore, 2018)

● Fixity
● Devtools

Fixity

● Fixity is a client that makes use of the Storage Service’s check fixity API endpoint:

http -v --pretty=format \
  GET "http://127.0.0.1:62081/api/v2/file/6bf7f7ea-27f9-4002-b114-fa7500cd8540/check_fixity/" \
  Authorization:"ApiKey test:test"

● Like the automation-tools, the recommendation is to run it using cron.
● The Storage Service updates itself with status information following the check, e.g.:

+--------------------------------------+----------------------------+---------+---------------+
| AIP UUID                             | Fixity Last Checked        | success | error_details |
+--------------------------------------+----------------------------+---------+---------------+
| 5e21dd0d-190e-4ffb-b752-76d860bea898 | 2018-11-06 12:10:22.981609 | 0       | invalid bag   |
+--------------------------------------+----------------------------+---------+---------------+
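For the cron job itself, a hedged usage sketch based on the subcommands described in the fixity repository; check one AIP, or everything the Storage Service knows about:

fixity scan 6bf7f7ea-27f9-4002-b114-fa7500cd8540
fixity scanall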

Devtools

● Previously collected a number of scripts that could be used to support Archivematica. Some have been moved to the Dashboard (below). The tool I’d like to highlight is the RPC Client, which can be used to push along transfers which may have been accidentally removed from the dashboard. [Demo: https://asciinema.org/a/195109]

Graphlinks

● Graphlinks can now be called from the Dashboard, e.g. in Docker: $ sudo make manage-dashboard ARG="graph_links". A useful demonstration of what Archivematica is doing under the hood. [Demo: Graphlinks]

Customising the FPR

● Python scripts
● Bash scripts

Customizing the FPR

From Stream 1:

Archivematica comes with an extensive set of default commands and rules, based on commonly agreed-upon practices (e.g. using TIFF as the common preservation derivative for images).

However, institutional policies, legal requirements, and other external needs mean that it’s very likely that an institution will need to modify commands or rules. Customizing the FPR

For the administrators and developers in the room, these commands are either Bash or Python scripts. The more familiar you are with these, the easier it is to customise the FPR.

Exiftool Characterization (Bash):
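The slide showed the default command; as a hedged, illustrative sketch (not the shipped FPR command), a Bash characterization command simply runs the tool against the FPR's file-path substitution variable and writes its output, here XML, to stdout:

# Illustrative FPR-style characterization command (Bash).
# %fileFullName% is substituted with the path of the file being
# characterized; stdout is captured by Archivematica as the tool output.
exiftool -X "%fileFullName%"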

Customizing the FPR

JHOVE validation uses Python to extract specific details (Python):
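Again a hedged sketch rather than the shipped command: run JHOVE, parse its XML report, and translate the status into a pass/fail outcome. The JSON keys mirror PREMIS event fields, but treat the details here as assumptions:

# Illustrative FPR-style validation command (Python); assumes JHOVE is on PATH.
from __future__ import print_function

import json
import subprocess
import sys

from lxml import etree

JHOVE_NS = "{http://hul.harvard.edu/ois/xml/ns/jhove}"

def jhove_validate(path):
    # Run JHOVE with its XML output handler and parse the report.
    report = subprocess.check_output(["jhove", "-h", "XML", path])
    doc = etree.fromstring(report)
    status = doc.findtext(".//{}status".format(JHOVE_NS))
    # JHOVE reports "Well-Formed and valid" when a file passes.
    success = status == "Well-Formed and valid"
    print(json.dumps({
        "eventOutcomeInformation": "pass" if success else "fail",
        "eventOutcomeDetailNote": status,
    }))
    return 0 if success else 1

if __name__ == "__main__":
    sys.exit(jhove_validate(sys.argv[1]))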

Where is Archivematica putting information?

● The database
● Elasticsearch indices
● Information packages

Sources of information inside Archivematica

Information Packages (SIP/AIP/DIP)

METS

Logs

Elasticsearch Index

MCP Server/Storage Service Database

Archivematica API

...the order here is not indicative of increasing or decreasing levels of stability...

Database: Performance

One of the first places we’ll look to see how long a task is taking is the database. We have a recent issue looking at the processing times for transfers of large numbers of files: https://github.com/archivematica/Issues/issues/315.

To see the client-script process that has taken the longest time across all transfers:

select CONCAT(TasksConfigs.description, ': ', timediff(Tasks.endTime, Tasks.startTime))
from Tasks
inner join TasksConfigs
inner join StandardTasksConfigs
where TasksConfigs.taskTypePKReference = StandardTasksConfigs.pk
  and StandardTasksConfigs.execute = Tasks.exec
order by timediff(Tasks.endTime, Tasks.startTime) desc
limit 1;

Database: Performance

Extending that idea, Justin previously did some analysis for Columbia University Libraries (Transfer details):

select
  min(Job.createdTime) "Start Time",
  max(Task.endTime) "End Time",
  timediff(max(Task.endTime), min(Task.createdTime)) Duration,
  count(*) Tasks
from Jobs Job
join Tasks Task on Job.jobUUID = Task.jobuuid
where Job.SIPUUID = '{SIPUUID}' or Job.SIPUUID = '{SIPUUID}';

Transfer Details:
  Start Time: 2017-11-02 18:14:20
  End Time: 2017-11-09 04:39:26
  Processing Duration: 154:25:06
  Total CPU Time: 690:24:59
  Files: 20,359
  Total Size: 457.31 GB
  Tasks Performed: 305,489

Database: Performance

(File size distribution):

select bucket as "File Size MB", count, bar from (
  SELECT round(ROUND(fileSize, -7)/1024/1024) AS bucket,
         COUNT(*) AS count,
         RPAD('', LN(COUNT(*)), '*') AS bar
  FROM Files
  where transferUUId = '{transfer-uuid}'
  GROUP BY bucket) bk;

File Size Distribution
+--------------+-------+------------+
| File Size MB | count | bar        |
+--------------+-------+------------+
|            0 | 13513 | ********** |
|           10 |    61 | ****       |
|           19 |     6 | **         |
|           29 |    78 | ****       |
|           38 |    14 | ***        |
|           48 |   139 | *****      |
|           57 |   302 | ******     |
|           67 |  6107 | *********  |
|           76 |    53 | ****       |
|           86 |     8 | **         |
|           95 |    10 | **         |
|          105 |     5 | **         |
|          114 |     2 | *          |
|          124 |    22 | ***        |
|          134 |    38 | ****       |
|          162 |     1 |            |
+--------------+-------+------------+

Database: Performance

(Format distribution):

select fmt.pronom_id, fmt.description, count(*)
from Files files
left join FilesIdentifiedIDs ids on ids.fileUUID = files.fileUUID
join fpr_formatversion fmt on fmt.uuid = ids.fileID
where files.sipUUId = '0bf2e9cc-e3cc-44ab-a2a6-129b0166bf8e'
  and files.fileGrpUse = 'original'
group by ids.fileID;

Distribution
+-----------+-------------------------+----------+
| pronom_id | description             | count(*) |
+-----------+-------------------------+----------+
|           | XML                     |     7240 |
| fmt/353   | TIFF                    |     6332 |
|           | Generic PDF             |      454 |
| x-fmt/392 | JP2 (JPEG 2000 part 1)  |     6332 |
| fmt/214   | Excel for Windows 2007+ |        1 |
+-----------+-------------------------+----------+

Database: Performance

(Microservice duration):

select substring(job.unitType,5) Phase,
       job.microserviceGroup,
       SEC_TO_TIME(SUM(TIME_TO_SEC(timediff(task.endTime, task.startTime)))) "CPU Time",
       count(task.taskUUID) Tasks,
       timediff(max(task.endTime), min(task.startTime)) duration
from Jobs job
join Tasks task on job.jobUUID = task.jobuuid
where job.SIPUUID = '0bf2e9cc-e3cc-44ab-a2a6-129b0166bf8e'
   or job.SIPUUID = '133cc0f4-6b41-4c45-bf10-f758e3c74447'
group by job.unitType, job.microserviceGroup
order by task.createdTime;

Database: Performance

(Microservice duration):

MicroService Duration
+----------+------------------------------------------+----------+-------+----------+
| Phase    | microserviceGroup                        | CPU Time | Tasks | duration |
+----------+------------------------------------------+----------+-------+----------+
| Transfer | Verify transfer compliance               | 02:53:20 | 20366 | 00:19:37 |
| Transfer | Rename with transfer UUID                | 00:00:01 |     1 | 00:00:01 |
| Transfer | Include default Transfer processingMCP.  | 00:00:00 |     1 | 00:00:00 |
| Transfer | Assign file UUIDs and checksums          | 20:04:28 | 40719 | 02:02:07 |
| Transfer | Generate METS.xml document               | 07:27:32 |     1 | 07:27:32 |
| Transfer | Reformat metadata files                  | 00:00:00 |     1 | 00:00:00 |
| Transfer | Verify transfer checksums                | 00:00:00 |     1 | 00:00:00 |
| Transfer | Quarantine                               | 00:00:01 |     1 | 00:00:01 |
| Transfer | Scan for viruses                         | 03:56:38 | 20362 | 00:25:27 |
| Transfer | Generate transfer structure report       | 00:00:46 |     3 | 00:00:50 |
| Transfer | Clean up names                           | 00:00:05 |     2 | 00:00:06 |
| Transfer | Identify file format                     | 06:47:30 | 20360 | 00:39:08 |
+----------+------------------------------------------+----------+-------+----------+

Database: Performance

(Microservice duration):

MicroService Duration
+----------+------------------------------------------+----------+-------+-----------+
| Phase    | microserviceGroup                        | CPU Time | Tasks | duration  |
+----------+------------------------------------------+----------+-------+-----------+
| Transfer | Extract packages                         | 00:03:30 |     3 | 12:26:11  |
| Transfer | Update METS.xml document                 | 15:02:43 |     2 | 27:27:07  |
| Transfer | Characterize and extract metadata        | 19:24:37 | 20360 | 01:56:31  |
| Transfer | Validation                               | 09:03:20 | 20359 | 00:54:18  |
| Transfer | Examine contents                         | 00:00:01 |     1 | 00:00:01  |
| Transfer | Complete transfer                        | 00:00:06 |     3 | 00:00:06  |
| Transfer | Create SIP from Transfer                 | 00:12:41 |     7 | 113:07:47 |
| SIP      | Verify SIP compliance                    | 00:12:47 |     3 | 00:12:47  |
| SIP      | Rename SIP directory with SIP UUID       | 00:00:01 |     2 | 00:00:01  |
| SIP      | Verify transfer compliance               | 00:00:00 |     1 | 00:00:00  |
| SIP      | Include default SIP processingMCP.xml    | 00:00:01 |     1 | 00:00:01  |
| SIP      | Remove cache files                       | 02:51:27 | 20359 | 00:17:07  |
+----------+------------------------------------------+----------+-------+-----------+

Database: Performance

(Microservice duration):

MicroService Duration
+----------+------------------------------------------+----------+-------+----------+
| Phase    | microserviceGroup                        | CPU Time | Tasks | duration |
+----------+------------------------------------------+----------+-------+----------+
| SIP      | Clean up names                           | 00:03:22 |     3 | 00:03:23 |
| SIP      | Normalize                                | 50:20:20 | 81446 | 57:38:29 |
| SIP      | Add final metadata                       | 00:02:21 |     2 | 00:07:08 |
| SIP      | Process manually normalized files        | 00:00:00 |     1 | 00:00:00 |
| SIP      | Transcribe SIP contents                  | 02:54:19 | 20359 | 00:17:25 |
| SIP      | Process submission documentation         | 02:27:07 | 20369 | 00:21:44 |
| SIP      | Process metadata directory               | 00:00:14 |    10 | 00:01:18 |
| SIP      | Verify checksums                         | 17:17:10 | 20361 | 01:43:37 |
| SIP      | Generate AIP METS                        | 12:29:47 |     1 | 12:29:47 |
| SIP      | Prepare AIP                              | 05:40:23 |    10 | 05:40:27 |
| SIP      | Store AIP                                | 06:58:58 |     8 | 06:58:59 |
+----------+------------------------------------------+----------+-------+----------+

Database: Format Analyzer

● There is potential too in the database to support research such as Nick Krabbenhoft’s, e.g.:
  ○ Bayesian Modeling of File Format Obsolescence (Nick Krabbenhoft, 2018)

● Format Analyzer work here:
  ○ https://github.com/NYPL/formatanalyzer
● PR submitted to accept Archivematica as a datasource:
  ○ https://github.com/NYPL/formatanalyzer/pull/3

Database: Format Analyzer

Gist:Create Dated Files

Elasticsearch: Cross-AIP Data

● Elasticsearch is not a supported API, but it is used inside Archivematica.
● If you have access to the endpoints, you might want to experiment as well.
● Plenty of scope for improvement:
  ○ archivematica/issues#273 ‘Epic: Searching is difficult’
● Sample script created as a potential solution for a client seeking to find duplicates cross-AIP:
  ○ Cross-AIP duplicate detection using Elasticsearch
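To give a flavour, a hedged sketch of querying the 'aips' index directly with elasticsearch-py; the port matches the Docker development install, but the field name used here is an assumption, not a documented schema:

from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:62002"])  # ES port in the Docker dev install
result = es.search(index="aips", body={
    "query": {"term": {"filePath": "objects/one_file.txt"}},  # assumed field
})
print(result["hits"]["total"])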

Elasticsearch: Cross-AIP Data

● Example demonstrating the difficulty of accessing Elasticsearch structures:

try:
    object_dict = tech_md["ns0:techMD_dict_list"][0][
        "ns0:mdWrap_dict_list"
    ][0]["ns0:xmlData_dict_list"][0]["ns3:object_dict_list"]
except KeyError:
    pass

● I.e. tech_md["dict"][index]["dict"][index]["dict"][index]["dict"]
● Just to get to the objects dictionary inside the AIP METS…
● Documentation:
  ○ https://wiki.archivematica.org/Elasticsearch_Development

AIP: Structure

$ tree one-file-aip-a0bb6ccc-a08a-4ca6-9bbe-b1360640ef3c
├── bag-info.txt
├── bagit.txt
├── data
│   ├── logs
│   │   ├── fileFormatIdentification.log
│   │   ├── filenameCleanup.log
│   │   └── transfers
│   │       └── one-file-aip-6f3159fd-67e4-4ae6-8508-90c05570dcac
│   │           └── logs
│   │               ├── fileFormatIdentification.log
│   │               └── filenameCleanup.log
│   ├── METS.a0bb6ccc-a08a-4ca6-9bbe-b1360640ef3c.xml
│   ├── objects
│   │   ├── metadata
│   │   │   └── transfers
│   │   │       └── one-file-aip-6f3159fd-67e4-4ae6-8508-90c05570dcac
│   │   │           └── directory_tree.txt
│   │   ├── one_file.txt
│   │   └── submissionDocumentation
│   │       └── transfer-one-file-aip-6f3159fd-67e4-4ae6-8508-90c05570dcac
│   │           └── METS.xml
│   └── README.html
├── manifest-sha256.txt
└── tagmanifest-md5.txt

11 directories, 13 files

API Call: Download a file from an AIP

● We can also just rely on downloading the METS:

http -v --pretty=format \
  GET "http://127.0.0.1:62081/api/v2/file/{aip-uuid}/extract_file/?relative_path_to_file={aip-name}-{aip-uuid}.xml" \
  Authorization:"ApiKey test:test" | less

● Authoritative source of information about the material that you’re looking after. ● Can access all of the contents of your AIP once you’ve read the METS.xml as an index…*

*Although performance-wise it might not be the most efficient...

METS: Reading METS

● METS Reader Writer (mets-rw or metsrw) is a Python library for reading, writing, and constructing METS files.
  ○ github.com/artefactual-labs/mets-reader-writer
● Archivematica uses METS-RW for some, but not all, of its METS viewing and manipulation.
● In other cases it uses a lower-level Python library (lxml) to manipulate the METS XML directly; there is still lots of lxml in Archivematica’s METS reading.
● The RFC we looked at earlier is concerned with replacing other XML libraries in Archivematica with mets-rw:
  ○ github.com/artefactual-labs/archivematica-rfcs-test/adr-001

METS: Reading METS

● Minimal effort is required to access METS via mets-rw:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

import logging
import sys

import lxml.etree

import metsrw

def load_mets(filename):
    try:
        mets = metsrw.METSDocument.fromfile(filename)
        return mets
    except lxml.etree.XMLSyntaxError as e:
        logging.error("METS %s", e)
        sys.exit(1)
    except IOError as e:
        logging.error("File does not exist %s", e)
        sys.exit(1)

mets = load_mets("METS.xml")
for entry in mets.all_files():
    print(entry.label)

METS: Reading METS

● METS-RW can validate a METS file according to the METS XMLSchema and other referenced schema, e.g., PREMIS. ● METS-RW can also validate a METS file according to Archivematica’s own, more restrictive Schematron:

>>> is_valid, report = metsrw.validate(
...     mets_doc.serialize())

… or against a supplied Schematron file:

>>> is_valid, report = metsrw.validate(
...     mets_doc.serialize(), '/path/to/schmtrn/file.xml')

METS: Reading METS

● Example reading/writing METS/PREMIS:
  ○ https://gist.github.com/ross-spencer/be123a4448da1d94124d4477a1affbc5
  ○ [ASCIINEMA](https://asciinema.org/a/CoAhcBhZjBX7I5LYvRtUmCqPr)

METS: METSFlask

TimWalsh:METSFlask

Features Showcase

Mother, mother: What’s going on?*

* Marvin Gaye (1971)

Performance Improvements

GitHub:PR#938

Performance Improvements

● Those changes are in 1.8.
● In future, PyFlame could still be used to discover new performance enhancements, e.g. by identifying easy improvements in the most computationally intensive client scripts and optimizing those in descending order.
● The focus now is on avoiding running unnecessary tasks, and on having single workers run quick (non-IO-bound) tasks in batches:
  ○ PyFlame:dev-branch

GitHub:PR#938

Dataverse Integration

● New in 1.8.
● The Dataverse PoC translated to a new transfer type inside Archivematica.
● Developed for Scholars Portal: OCUL (Ontario Council of University Libraries).
● Connects to the Dataverse API and downloads datasets via an Archivematica Transfer Source…
● Demo transfers available without an API connection:
  ○ Archivematica-sample-data:Dataverse

Dataverse Integration

● And we’d love it if you wanted to give it a whirl!

Additional reading

Other Archivematica resources...

Links and resources

● Archivematica SlideShare:
  ○ https://slideshare.net/Archivematica
● AtoM SlideShare:
  ○ https://slideshare.net/accesstomemory

Thank you!

Get in touch

[email protected]