Software and Dependencies in Research Citation Graphs
COMPUTING IN SCIENCE AND ENGINEERING
arXiv:1906.06141v3 [cs.DL] 19 Dec 2019

Stephan Druskat

• Stephan Druskat is with the German Aerospace Center (DLR), Berlin, Germany, the Computer Science Department at Humboldt-Universität zu Berlin, Berlin, Germany, and the Department of English Studies at Friedrich Schiller University, Jena, Germany.

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
DOI: https://doi.org/10.1109/MCSE.2019.2952840

Abstract: Following the widespread digitalization of scholarship, software has become essential for research, but the current sociotechnical system of citation does not reflect this sufficiently. Citation provides context for research, but the current model for the respective research citation graphs does not integrate software. In this paper, I develop a directed graph model to alleviate this, describe challenges for its instantiation, and give an outlook of useful applications of research citation graphs, including transitive credit.

Index Terms: software citation, citation graphs, transitive credit

1 INTRODUCTION

The digitalization of research changes research methods across disciplines, and produces new forms of research and knowledge [1]. In the process, research software has clearly become an integral part of digital research methodologies [2]. It embeds research knowledge, implements algorithms and models, and is a central component of digital scholarly integration and application [3]. It thus presents a significant, and increasingly vital, intellectual contribution to academic research.

Research software should therefore also be considered a legitimate research product [3], [4], [5]. Research products have the most value – and their outcomes can be understood most fully – when they are considered in their context [1, p. 10]. This context is standardly provided through citation. Therefore, one aspect of research software gaining the status of a research product is that it must be integrated into the scholarly citation system.

The citation system in current digital scholarship is a sociotechnical system based on technical infrastructure, and involves different stakeholders. Stakeholders include domain-specific communities of researchers who use software, research software engineers (RSEs) and other software developers, research institutions, publishers, repository providers, index providers, and funders (cf. [6]). The technical infrastructure upon which the citation system is built most prominently includes publication repositories; citation indices and aggregators; publishing services including websites; services provided by libraries; resolvers for digital identifiers; metadata formats; reference management, text processing, and other software.

Citation and the sociotechnical citation system broadly provide the following functions:

• Context function: The provision of context for research products by establishing a graph of research products with links between the identified citing and cited products, to enable traceability of outcomes, both over the past to understand how present knowledge was established, and into the future to understand how present knowledge is being used (cf. [1], [7], [8], [9])
• Social functions: The establishment of trust and authority [9], [10], [11]; the recognition of the value of a research product while providing credit to its authors [4], [7]; the potential for evaluation of individual researchers [7, ch. 6], individual research products [12], journals [13], and research groups, institutions, and countries [7, ch. 6]
• Compliance function: The assertion of compliance with good scholarly practice [7], [14]
• Discursive function: The organization and shaping of discourses of scholarly credibility, authority, and relevance through epistemic change via “dynamically rewriting the past” [7, p. xvi], [11], [15], [16]
• Reproducibility function: The enablement of research reproducibility through correct and complete citation [17], [18], [19], [20]

Software as a research product can be subject to all of the described functions – including the discursive function, albeit to a limited degree, see below – only if it is fully integrated in the citation system. While this is not currently the case [3], [18], [21], [22], [23], [24], progress is being made, driven by different stakeholders:

• Research software community initiatives such as the FORCE11 Software Citation Implementation Working Group (www.force11.org/group/software-citation-implementation-working-group) build on established community standards [4] and bring together stakeholders to shape technical infrastructure and policy, and develop guidance [6]; their activities concern the discursive and social functions directly, and the remaining functions indirectly.
• Domain, infrastructure and software communities develop software solutions for providing citation metadata [25], [26], create repositories, information services, and indices (e.g., [27], [28], [29], [30]), and develop metadata formats [31], [32]; their activities concern the context and social functions directly, and the remaining functions indirectly.
• Research policy researchers develop procedures for evaluating research software [33]; their activities concern the social and compliance functions directly, and the remaining functions indirectly.
• Funding agencies update funding policies (cf. [5]) and guidelines for scholarly practice [34] to incorporate citation of research software; their activities concern the compliance function directly, and the remaining functions indirectly.
• Publishers establish editorial policies that require the citation of software, sometimes as a subset of data [24], [35], or plan to do so [18]; their activities concern the context, social and reproducibility functions directly, and the remaining functions indirectly.

In this paper, I aim to contribute to the understanding of the requirements for the implementation of research software citation. To this end, I will investigate the output of the context function of citation, research citation graphs (RCGs), with the objective of answering the following research questions:

• RQ1: What are the necessary changes in the model of research citation graphs to allow for the integration of research software, and the adoption of the citation functions?
• RQ2: What are the requirements for the implementation of software citation based on an updated model of research citation graphs?
• RQ3: What are current challenges for the instantiation of research citation graphs?
• RQ4: What applications do research citation graphs enable?

2 RESEARCH CITATION GRAPHS

Research products and the references between them can be modeled as a directed graph $G_1 = (V, E)$ where $V$ is a set of vertices (or “nodes”), and $E$ is a set of ordered pairs of nodes (i.e., “directed edges”). The nodes in $V$ represent research products, the edges represent reference relations (i.e., citation) between source nodes (the citing research product) and target nodes (the cited product).

This most basic model of a research citation graph (RCG) enables the context function of citation: It helps understand what other research products a specific product relied on [...] research products rather than establishing references to their authors.

Fig. 1. Partial $G_2$-type RCG for a research product which references two other research products. The graph includes the respective $G_1$ graph (in bold print). Nodes: $p_n$ = research product, $a_n$ = author, $i_n$ = evaluable affiliation, $c_n$ = evaluable publishing container; Edges: cite = “citation” relation, auth = “authored by” relation, affil = “affiliated with” relation, pub-in = “published in” relation.

In order to fully exploit the social citation function, RCGs must model additional properties of research products:

• The provision of academic credit requires the inclusion of author nodes, and authorship relations between them and research products, as does the establishment of trust and authority if it is based on individuals.
• Evaluation requires the inclusion of two classes of entity nodes: affiliations (research groups, institutions, countries, etc.), and respective affiliation relations between them and authors; “product containers” (journals, edited volumes, repositories, archives, etc.), and respective published-in part-of relations between them and research products.

The model graph changes accordingly: Let $P$ be the set of all vertices $\{p_1, \ldots, p_n\}$ which represent research products, $A$ the set of all vertices $\{a_1, \ldots, a_n\}$ which represent authors, $I$ the set of all vertices $\{i_1, \ldots, i_n\}$ which represent evaluable author affiliations, and $C$ the set of all vertices $\{c_1, \ldots, c_n\}$ which represent entities which contain research products. Let $\mathcal{V}$ be the set of disjoint sets $\{P, A, I, C\}$ of vertices in the RCG $G_2 = (V, E)$. Define $L : V \to \mathcal{V}$ to set

• $L(v) = P$ when $v \in P \in \mathcal{V}$,
• $L(v) = A$ when $v \in A \in \mathcal{V}$,
• $L(v) = I$ when $v \in I \in \mathcal{V}$,
• $L(v) = C$ when $v \in C \in \mathcal{V}$.

See Figure 1 for an exemplary visualization.