Towards a Sustainable Funding Model for the Uniprot Use Case[Version 2

F1000Research 2018, 6(ELIXIR):2051 Last updated: 25 OCT 2018 RESEARCH ARTICLE Funding knowledgebases: Towards a sustainable funding model for the UniProt use case [version 2; referees: 3 approved] Chiara Gabella , Christine Durinx , Ron Appel ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland First published: 27 Nov 2017, 6(ELIXIR):2051 (doi: Open Peer Review v2 10.12688/f1000research.12989.1) Latest published: 22 Mar 2018, 6(ELIXIR):2051 (doi: 10.12688/f1000research.12989.2) Referee Status: Abstract Invited Referees Millions of life scientists across the world rely on bioinformatics data resources 1 2 3 for their research projects. Data resources can be very expensive, especially those with a high added value as the expert-curated knowledgebases. Despite the increasing need for such highly accurate and reliable sources of scientific version 2 information, most of them do not have secured funding over the near future and published often depend on short-term grants that are much shorter than their planning 22 Mar 2018 horizon. Additionally, they are often evaluated as research projects rather than as research infrastructure components. version 1 In this work, twelve funding models for data resources are described and published report report report applied on the case study of the Universal Protein Resource (UniProt), a key 27 Nov 2017 resource for protein sequences and functional information knowledge. We show that most of the models present inconsistencies with open access or Helen Berman, Rutgers, The State equity policies, and that while some models do not allow to cover the total 1 costs, they could potentially be used as a complementary income source. University of New Jersey , USA We propose the Infrastructure Model as a sustainable and equitable model for John Westbrook, The State University of all core data resources in the life sciences. With this model, funding agencies New Jersey, USA would set aside a fixed percentage of their research grant volumes, which would subsequently be redistributed to core data resources according to 2 Eva Huala, Phoenix Bioinformatics, USA well-defined selection criteria. This model, compatible with the principles of Arabidopsis Information Resource, USA open science, is in agreement with several international initiatives such as the Tanya Z. Berardini , Phoenix Human Frontiers Science Program Organisation (HFSPO) and the OECD Bioinformatics, USA Global Science Forum (GSF) project. Here, we have estimated that less than 1% of the total amount dedicated to research grants in the life sciences would The Arabidopsis Information Resource, be sufficient to cover the costs of the core data resources worldwide, including USA both knowledgebases and deposition databases. 3 Judith A. Blake , The Jackson Keywords Laboratory, USA Bioinformatics, Data Resources, Knowledgebases, Funding, Long-Term Sustainability, Open Science Discuss this article Comments (0) This article is included in the ELIXIR gateway. Page 1 of 25 F1000Research 2018, 6(ELIXIR):2051 Last updated: 25 OCT 2018 Corresponding author: Chiara Gabella ([email protected]) Author roles: Gabella C: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Validation, Writing – Original Draft Preparation, Writing – Review & Editing; Durinx C: Conceptualization, Funding Acquisition, Supervision, Validation, Writing – Review & Editing; Appel R: Conceptualization, Funding Acquisition, Supervision, Writing – Review & Editing Competing interests: UniProt is partially funded by the SIB, the Swiss Node of ELIXIR. Grant information: This work was done in the context of an ELIXIR Implementation Study linked to the ELIXIR Data platform and is funded by the ELIXIR Hub. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Copyright: © 2018 Gabella C et al. This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. How to cite this article: Gabella C, Durinx C and Appel R. Funding knowledgebases: Towards a sustainable funding model for the UniProt use case [version 2; referees: 3 approved] F1000Research 2018, 6(ELIXIR):2051 (doi: 10.12688/f1000research.12989.2) First published: 27 Nov 2017, 6(ELIXIR):2051 (doi: 10.12688/f1000research.12989.1) Page 2 of 25 F1000Research 2018, 6(ELIXIR):2051 Last updated: 25 OCT 2018 Despite the clear and increasing necessity for such high REVISED Amendments from Version 1 quality knowledgebases, the question of their sustainability in We have implemented the reviewers’ comments concerning the the long term is frequently raised, due to the current lack of an reorganization of the text and some technical details. We have appropriate funding model. Sustainability is a major problem added references to all the resources mentioned and enhanced for all data resources: while many international initiatives the difference between repositories and knowledgebases. Also, are opened to discuss the sustainability of digital infrastruc- we have highlighted the fact that sustainability is obviously a tures, curated databases are often left aside (see Section 5 problem for all the resources, not only for knowledgebases. We have corrected the introduction, in order to remain as general for a wider discussion on the existing initiatives and studies). as possible, and introduced UniProt in the appropriate section. We have added some details in the descriptions of some of the Manual curation and open access models, to better illustrate them. Moreover, we have corrected the In life sciences, manual expert curation plays a fundamental role TAIR description and classification and added it as an example of mixed model. We also have made more specific references to the in the creation of high quality knowledgebases. Manual curation previous comparable studies that are mentioned in the discussion is acknowledged to be highly accurate1,2, but criticism is often section. raised about the necessity for such a time- (and cost-) consuming See referee reports activity as opposed to the use of programs for automated or semi-automated information extraction (Information-Extraction programs—IE programs). In reality, current IE programs are not able to extract the large amount of information or 1 Introduction compare data with the same accuracy as professional curators Knowledgebases, why? do, but they can be extremely useful for identifying mentions of single entities in the scientific publications, using for instance Knowledgebases are organized and dynamic collections of infor- name-entity recognition tools1. Consequently, manual curation mation about a particular subject where data from multiple cannot be fully replaced by the existing Artificial Intelligence sources are not only archived, but also reviewed, distilled and (AI) technology. Text-mining is, however, often used as a manually annotated by experts. These digital infrastructures are first-line method for data extraction and identification of relevant essential to the effective functioning of scientific research and literature. for the whole life science community: they serve as encyclo- paedias, concentrating high quality knowledge collected from many different sources. In life sciences, knowledgebases are in The cost of professional curation is surprisingly low compared general manually curated by experts, i.e. highly qualified sci- to the cost of open access journals’ publication charges, or to entists —called biocurators —who manually select, review and the cost of performing the related research. The Swiss National 1 annotate the information on a particular subject. As a result, Science Foundation (SNSF) allows to claim CHF 3000 (€2790) knowledgebases are collections of continuously updated data, for costs of Open Access (OA) publication from agreed research providing a highly reliable source of scientific knowledge, with funding. The Open Access Co-ordination Group in the UK 2 the data being validated and enhanced. There is a substantial estimates average fees at £1586 (€1863). Per year, the cura- difference between a repository and a knowledgebase. Both rep- tors of the UniProt knowledgebase, a key resource for protein 3 resent the computationally tractable accumulation of (pieces of) sequences and functional information , read and/or evaluate information and knowledge processed in such a way that the between 50,000 and 70,000 papers, of which they fully curate data is easily readable, understandable and exported. However, approximately 8,000 publications. This means that in one year repositories rely partially or completely on data deposition by they read, evaluate and capture the output of research associ- the users, while in knowledgebases, the information in gen- ated with OA publication costs of €100 to €200 million, signifi- eral requires to be carefully selected and processed by experts. cantly more than the budget of UniProt as a whole (∼ €15 million Both types of data resources are crucial for allowing research per year). Similarly, each publication that is read and/or to be faster and more efficient as they: evaluated, is the result of a research project grant with a typi- cal value of ∼ $ 450,0003 (∼ €400,000). The cost of integrating • promote knowledge transfer to different sectors (e.g. the output (of the 8,000 publications) in UniProtKB, corre-

Towards a Sustainable Funding Model for the Uniprot Use Case[Version 2

1 Supporting Information for a Microrna Network Regulates

Supporting Information

Cytoplasmic Localized Ubiquitin Ligase Cullin 7 Binds to P53 and Promotes Cell Growth by Antagonizing P53 Function

Visual Data Mining : Background, Techniques, and Drug Discovery

Biological Data Integration Using Semantic Web Technologies

Uman Enome News

Trapping of the Transport-Segment DNA by the Atpase Domains of a Type II Topoisomerase

Mutation Discovery in Mice by Whole Exome Sequencing

Qt7xg6b543.Pdf

Mutations in the Gyrb, Parc, and Pare Genes of Quinolone-Resistant Isolates and Mutants of Edwardsiella Tarda

Functional Analysis of Gene Datasets Based on Gene Ontology. David Martin, C

An Overview of in Vivo and in Vitro Models for Autosomal Dominant Polycystic Kidney Disease: a Journey from 3D-Cysts to Mini-Pigs