Referencing Source Code Artifacts: a Separate Concern in Software Citation Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli

Referencing Source Code Artifacts: a Separate Concern in Software Citation Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli To cite this version: Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Referencing Source Code Artifacts: a Separate Concern in Software Citation. Computing in Science and Engineering, Institute of Electrical and Electronics Engineers, In press, pp.1-9. 10.1109/MCSE.2019.2963148. hal-02446202 HAL Id: hal-02446202 https://hal.archives-ouvertes.fr/hal-02446202 Submitted on 22 Jan 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 1 Referencing Source Code Artifacts: a Separate Concern in Software Citation Roberto Di Cosmo, Inria and University Paris Diderot, France [email protected] Morane Gruenpeter, University of L’Aquila and Inria, France [email protected] Stefano Zacchiroli, University Paris Diderot and Inria, France [email protected] Abstract—Among the entities involved in software citation, II. REFERENCING SOURCE CODE FOR REPRODUCIBILITY software source code requires special attention, due to the role Software and software-based methods are now widely used it plays in ensuring scientific reproducibility. To reference source code we need identifiers that are not only unique and persistent, in research fields other than just computer science and en- but also support integrity checking intrinsically. Suitable iden- gineering [4]. Recognition of the role that software plays in tifiers must guarantee that denoted objects will always stay the reproducing scientific results, and of the insufficient availabil- same, without relying on external third parties and administrative ity of its source code is growing [5], [6]. processes. In the journey towards open science, and reproducible We analyze the role of identifiers for digital objects (IDOs), whose properties are different from, and complementary to, those scientific results, open and unfettered access is needed to three of the various digital identifiers of objects (DIOs) that are today main kinds of research outputs [7]: popular building blocks of software and data citation toolchains. 1) the scientific articles; We argue that both kinds of identifiers are needed and 2) the data used or produced in the research; detail the syntax, semantics, and practical implementation of the persistent identifiers (PIDs) adopted by the Software Heritage 3) the source code of the software embodying the exper- project to reference billions of software source code artifacts iment logic. such as source code files, directories, and commits. Preserving software source code is as essential as preserv- Index Terms—software citation, digital preservation, repro- ing articles and datasets to promote both open science and ducibility, open science, digital object identifier reproducibility. Unfettered access to an executable version of the software is very valuable, but source code must be I. INTRODUCTION available too: it embodies the scientific knowledge underlying the computational calculation. This article builds upon key findings on identifiers for software source code artifacts from our previous work on digital archives [1]. We focus on its relevance for scientific A. Software reference versus citation reproducibility of experiments that rely on software, where one A nice presentation of the many reasons why the research needs to identify source code at various granularities, retrieve community needs to take software into account can be found byte-identical copies of source code artifacts described in a in the Software Citation Principles [8]. paper, and verify their integrity as a prerequisite for attempting When mentioning software in publications, though, we need to reproduce a given result. to distinguish the software project, which refers to an endeavor We recall the schema of software source code identifiers that to develop and maintain software artifacts, from the software has been developed and deployed in the context of Software artifacts (source code, executable binaries, etc.) that are the Heritage [2], a long term initiative aiming to collect, preserve, byproducts of that endeavor. The project is not a digital object; and share the entire body of software source code and its the resulting artifacts are. development history. We show that these identifiers are well- Following [9], we distinguish software artifact reference suited for addressing the need of scientific reproducibility from software project citation, that serve different purposes. when source code is involved. The main intention of a citation is to give credit to the The requirements that emerge from this particular setting, authors of a given software (as a project), whereas the function where one needs to handle tens of billions of different digital of a reference is to precisely identify software artifacts, usually objects, cannot be fully satisfied by state-of-the-art identifier for reuse purposes. The two activities are intertwined, and schemas broadly adopted for digital publications. Fortunately, sometimes citing the project without also referencing artifacts we can leverage recursive data structures based on crypto- is not satisfactory. But for the specific needs of reproducibility, graphic hashes, such as Merkle trees [3], to build identifiers referencing software artifacts is often sufficient and may of digital objects that satisfy the overlapping requirements be easily done without, e.g., having to track down credit of long-term software source code preservation and scientific attribution, a vastly more difficult task that is worth a research reproducibility. article on its own [9]. 2 In this article we focus on references and do not address TABLE I: Mechanism implementation in common systems of citations further. identifiers Mech. / System Handle DOI Ark PURL VDOI B. References for reuse and reproducibility Generation Yes Yes Yes Yes Yes Assignment Yes Yes Yes Yes Yes For reproducibility, software artifact references need to be Verification N.A. N.A. N.A. N.A. Yes very fine-grained, down to specific versions. And they should Retrieval Yes Yes Yes Yes Yes Reverse Lookup N.A. N.A. N.A. N.A. N.A. point to a persistent location, available over the long term. Description Yes Yes Yes N.A. Yes As noticed in [8], the usual ways of referencing software artifacts by just pointing to the project website or the current development repository are largely unsatisfactory, as these mechanism for referencing source code artifacts in research locations are ephemeral. articles. For reproducibility we also need a system of identifiers with specific properties that current systems do not pro- III. IDENTIFIER SYSTEMS AND THEIR PROPERTIES vide. Let’s consider ACM’s Artifact Review and Badging This whole section recalls the key findings on identifier policy, described at https://www.acm.org/publications/policies/ systems that the authors published to the attention of the digital artifact-review-badging. It defines the following properties, in preservation community in [1]. order of increasing desirability: • Repeatability: the ability to re-run an experiment by the same team using the same experimental setup— A. Identifier systems including all involved software artifacts. Results that are A system of identifier is composed of a set of labels that not repeatable are rarely suitable for publication. can be used as references for objects and a set of mechanisms • Replicability: the ability to re-run an experiment by a performing some or all of the following operations: different team, reusing the described experimental setup, • Generation: create a new label software artifacts included. • Assignment: associate a label to an object • Reproducibility: the ability to re-run an experiment by a • Verification: given a label and an object, verify that they different team, without relying on the experimental setup correspond and software artifacts developed by the original team. • Retrieval: given a label, provide a means of getting a Furthermore, the following “badges” can be associated to copy of the corresponding object software artifacts to capture their overall quality w.r.t. repro- • Reverse lookup: given a object, find the label that has ducibility and reuse: been assigned to it, if any • Available: software artifacts that have been made avail- • Description: given a label, provide a means of getting able via publicly-accessible long-term archives. metadata describing the corresponding object • Functional: software artifacts that meets the specific While these mechanisms can in principle be implemented needs of an experiment and are documented, consistent, by totally independent entities, the most common systems of complete, and exercisable [by third parties]. identifiers conflate all these conceptually distinct mechanisms • Reusable: functional (as above) software artifacts that into a single logical component usually called resolver. are more generally useful than addressing the needs of a Despite the fact that the verification mechanism is of specific experiment. paramount importance in all the identification systems used The focus of this

Referencing Source Code Artifacts: a Separate Concern in Software Citation Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli

Persistent Identifiers (Pids): Recommendations for Institutions Persistent Identifiers (Pids): Recommendations for Institutions

Digital Object Identifier (DOI) System for Research Data

Identifiers and Use Case in Scientific Research

Persistent Identifiers: Consolidated Assertions Status of November, 2017

ARK Identifier Scheme

1 Distributed Information Management William M. Pottenger1 Lehigh

Decentralized Persistent Identifier Design

Architectural Components of Digital Library: a Practical Example Using Dspace

Digital Object Identifier (DOI) Under the Context of Research Data Librarianship

Choice of Docencocodingtype and Encoding Level for SPO Publications

University of Southampton Research Repository Eprints Soton

Handle System Evolution and Governance