Resolving Dilemmas Arising During Design and Implementation of Digital Repository of Heterogenic Scientific Resources
Applied Sciences
Article

Tomasz Kubik 1,* and Agnieszka Kwiecień 2

1 Faculty of Electronics, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
2 Wrocław Centre for Networking and Supercomputing, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland; [email protected]
* Correspondence: [email protected]

Abstract: The creation of digital repositories for archiving and disseminating scientific resources faces many challenges. These challenges relate not only to modelling the processes of preparing, depositing, sharing, maintaining, and curating resources. They also concern the feasibility of the adopted assumptions and of the final implementation. Such issues become particularly important when the processed resources contain multimedia. The critical factor then becomes a properly designed architecture that supports efficient data processing and universal data presentation. This article aims to answer questions that may arise when approaching various design and implementation dilemmas, such as how to handle processes in a digital repository, how to use cloud solutions in its construction, how to work with user interfaces, and how to process collected multimedia. The presented study explores their practical context based on the experience gained during the implementation of the AZON platform. This platform stores tens of thousands of scientific resources: books, articles, magazines, teaching materials, presentations, photos, 3D scans, audio and video files, databases, and many more. It serves as a running example for all presented proposals.

Keywords: digital repository; multimedia processing; cloud architecture; scientific resources

Citation: Kubik, T.; Kwiecień, A. Resolving Dilemmas Arising during Design and Implementation of Digital Repository of Heterogenic Scientific Resources. Appl. Sci. 2021, 11, 215. https://doi.org/10.3390/app11010215

Received: 25 November 2020
Accepted: 23 December 2020
Published: 28 December 2020

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Digital repositories (DRs), in brief, are IT systems offering tools for publishing, storing, retrieving, and sharing data and information by collecting data records along with related metadata. They are exciting subjects for a variety of research. This research spans database management, including inference over incomplete or inconsistent datasets, knowledge modeling and discovery, querying, and the handling of security issues, as described in [1]. In particular, it might cover models used to index, search, and recommend multimedia data, as presented in [2,3].

From the users' point of view, digital repositories should offer: (i) a user-friendly graphical interface enabling easy search for records, classification and sorting of search results, content preview, and retrieval of items; (ii) the ability to process data at different stages of the supported processes, for example, to automatically generate metadata when publishing records; (iii) multi-level access control of individual metadata fields to meet the needs of different audiences; (iv) the ability to cooperate with other repositories; (v) handling of, and long-term support for, different types of data files; etc.

For the designers, the essential parameters of a repository are: infrastructure and hardware requirements; modularity, extensibility, and scalability of the architecture; the stack of technologies used; the security mechanisms implemented; the supported protocols and offered APIs (Application Programming Interfaces) that allow new functions to be added and other systems to be integrated; etc.

Examining various software solutions in terms of their potential use was only a part of the analyses carried out during the construction of the AZON (Atlas of Open Scientific Resources) platform. This platform enables the collection, processing, and sharing of open science resources in digital form. It was built within the project Aktywna Platforma Informacyjna e-scienceplus.pl, POPC.02.03.01-00-0010/16, aimed at increasing the accessibility, improving the quality, and facilitating the reuse of the scientific resources of such institutions as Wrocław University of Science and Technology, University School of Physical Education in Wrocław, Wrocław University of Environmental and Life Sciences, Wrocław Medical University, and Systems Research Institute Polish Academy of Sciences. Thus, the AZON platform can be compared, in some sense, to data management systems like Dryad (https://datadryad.org/stash), Open Science Framework (https://osf.io), CKAN (https://ckan.org), and Dataverse (https://dataverse.org), which are addressed to a community of users and enable collaboration, resource sharing, and discovery. Digital archives, like the National Archives Catalog (https://catalog.archives.gov), Narodowe Archiwum Cyfrowe (https://www.nac.gov.pl), or Europeana (https://www.europeana.eu/en), are also comparable.
In the beginning, it was necessary to design an appropriate architecture for the AZON platform, especially since it was expected to support the publication of 32,792 documents (assuming, for simplicity, that one document = one data record), which would require the dissemination of 124 TB of data. Moreover, according to the assumptions, the process of increasing the availability of scientific resources should be automated, with batch processing of data performed in the computing centre. The platform was supposed to support data standardization, records annotation, voice transcription, search, linking, keywords definition, and text analysis. The designed domain metadata models, dictionaries, and thesauri should support information dissemination and search.

The plan was to design separate subsystems for handling such tasks as data collecting, processing, long-term preservation, presentation, and sharing, since these concerns could be separated.

Many of these tasks could be addressed by software providing digital repository functionality. However, the existing solutions did not meet the expectations. For example, the problem of effective multimedia processing, including automatic conversion, generation of transcripts, and preparation for streaming, remained untouched, and there was no support for consistent presentation of heterogeneous resources. Therefore, the design of a new platform started from scratch, and this was a challenge.

During the work, many dilemmas were raised and resolved, both in modeling and in implementation. This article presents some of them, emphasizing those related to the processes running within the AZON platform, its architecture, and the problems associated with designing interfaces that ensure adequate deposit and presentation of scientific resources, including resources with multimedia attached.

2. Literature Review

The logical architecture of many digital repositories consists of several parts.
Usually, there is a part responsible for the business logic (platform or engine). Another one is responsible for data preservation (file systems or databases). The last one is used to interact with users within the web browser window (access application). These parts are loosely coupled and often run as independent web services. The communication between them goes through endpoints with well-defined application programming interfaces.

Table 1 lists some of the existing implementations of digital repositories, including platform-only and fully featured repositories. All are based on open-source software. A more detailed comparison of selected repositories is given in [4] (a bit outdated, but still valid), while custom implementations are the focus of [5,6].

Table 1. Selected digital repositories delivered as software packages.

Project name and homepage | Supported metadata standards | Software type and search features
Fedora Commons, https://duraspace.org/fedora/ | MODS, Dublin Core, QDC out of the box, arbitrary metadata sets | platform only, with API, with native Linked Data support
Islandora, https://islandora.ca | MODS, Dublin Core, PREMIS, MARCXML | repository, built on Fedora Commons, uses Apache Solr
Samvera (previously Hydra), https://samvera.org | MODS, DC, DC11, EDM, FOAF, custom metadata set | bespoke solution, built on Fedora Commons, uses Apache Solr
DSpace, http://duraspace.org/dspace/ | Dublin Core, arbitrary metadata sets (but flat) | repository, uses Apache Solr
Archivematica, https://www.archivematica.org/en/ | METS, PREMIS, Dublin Core, Library of Congress BagIt | repository, based on ISO-OAIS functional model, uses Elasticsearch
AtoM, https://www.accesstomemory.org/en/ | EAD, Dublin Core XML, MODS XML, EAC, SKOS | repository, uses Elasticsearch
Invenio, https://invenio-software.org | MARC | repository, uses Elasticsearch
VIVO, https://duraspace.org/vivo/ | VIVO ontology, arbitrary metadata sets (custom ontologies) | repository, uses Apache Solr
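
The three-part logical architecture described at the start of this section (business logic, data preservation, access application) can be illustrated with a minimal Python sketch. It is not taken from AZON or from any repository in Table 1: the names Record, PreservationStore, InMemoryStore, RepositoryEngine, and AccessApplication are hypothetical, the single required "title" field is an assumed validation rule, and plain in-process method calls stand in for the web service endpoints through which such parts would normally communicate.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Record:
    """One data record together with its descriptive metadata (hypothetical)."""
    identifier: str
    metadata: dict[str, str]  # illustrative flat fields, e.g. "title"
    payload: bytes = b""


class PreservationStore(Protocol):
    """Data preservation part (a file system or a database)."""

    def save(self, record: Record) -> None: ...
    def list_all(self) -> list[Record]: ...


class InMemoryStore:
    """Toy store standing in for a real file system or database."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def save(self, record: Record) -> None:
        self._records[record.identifier] = record

    def list_all(self) -> list[Record]:
        return list(self._records.values())


class RepositoryEngine:
    """Business logic part: validates deposits and answers searches."""

    def __init__(self, store: PreservationStore) -> None:
        self._store = store

    def deposit(self, record: Record) -> str:
        # Assumed rule for this sketch: every record needs a title.
        if not record.metadata.get("title"):
            raise ValueError("a record needs a non-empty title")
        self._store.save(record)
        return record.identifier

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match over all metadata values.
        needle = term.lower()
        return [
            r.identifier
            for r in self._store.list_all()
            if any(needle in value.lower() for value in r.metadata.values())
        ]


class AccessApplication:
    """Access part: turns engine results into text shown to the user."""

    def __init__(self, engine: RepositoryEngine) -> None:
        self._engine = engine

    def render_results(self, term: str) -> str:
        identifiers = self._engine.search(term)
        return "\n".join(identifiers) if identifiers else "no results"


if __name__ == "__main__":
    engine = RepositoryEngine(InMemoryStore())
    engine.deposit(Record("rec-1", {"title": "3D scan of a fossil"}))
    engine.deposit(Record("rec-2", {"title": "Lecture video"}))
    print(AccessApplication(engine).render_results("fossil"))
```

Because RepositoryEngine depends only on the PreservationStore protocol, the in-memory store could be swapped for a database-backed one without touching the engine or the access application, which is the kind of loose coupling the section describes.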