Article Reference


Article

Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines

BABIC, Zeljana, et al.

Abstract

The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database. To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines.

Reference

BABIC, Zeljana, et al. Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines. eLife, 2019, vol. 8, p. e41676

DOI: 10.7554/eLife.41676
PMID: 30693867

Available at: http://archive-ouverte.unige.ch/unige:119832

Disclaimer: layout of this document may differ from the published version.

FEATURE ARTICLE

META-RESEARCH

DOI: https://doi.org/10.7554/eLife.41676.001

ZELJANA BABIC, AMANDA CAPES-DAVIS, MARYANN E MARTONE, AMOS BAIROCH, I BURAK OZYURT, THOMAS H GILLESPIE AND ANITA E BANDROWSKI*

*For correspondence: [email protected]
Competing interest: See page 13
Funding: See page 13
Reviewing editor: Helga Groll, eLife, United Kingdom
Copyright Babic et al. This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Babic et al. eLife 2019;8:e41676. DOI: https://doi.org/10.7554/eLife.41676

Introduction

Cell lines are widely used in the biological sciences, partly because they are able to multiply indefinitely. This property means that scientists can, in theory, exactly replicate previous studies and then build on these results (American Type Culture Collection Standards Development Organization Workgroup ASN-0002, 2010). However, mislabeling or mishandling can result in misidentification, contamination or distribution of problematic cell lines which, in turn, can affect the validity of research data. Despite this, testing for contamination and misidentification is commonly not performed and, therefore, a laboratory could obtain contaminated cell lines and spend significant resources pursuing research based on faulty premises (Drexler et al., 2017). A particular problem is the contamination of a cell line by a cancer cell line such as HeLa (Capes-Davis et al., 2010) because cancer cells tend to grow more rapidly than normal cells.

In 2012, a group of scientists established an organization called the International Cell Line Authentication Committee (ICLAC) to create a register of contaminated and misidentified cell lines, increase awareness of the problem, and propose approaches to decrease the use of such cell lines (Masters, 2012). ICLAC and other expert groups (such as the American Type Culture Collection: ATCC) propose that one of the most practical ways to detect contamination is to make use of a DNA-based technique called short-tandem-repeat profiling to authenticate human cell lines (American Type Culture Collection Standards Development Organization Workgroup ASN-0002, 2010). Additionally, the Cellosaurus database, housed at the Swiss Institute of Bioinformatics, contains extensive information on more than 100,000 cell lines (Bairoch, 2018), including information about cell line misidentification and other problems (such as incorrect tissue origin, which is not detected by short-tandem-repeat profiling).

Despite the availability of the ICLAC register, the overall rate of misidentified cell line use has not fallen across the literature (Horbach and Halffman, 2017). It has been shown that publishers checking the manuscript during submission for misidentified cell lines can eliminate their use, although such a process requires significant time and cost (Fusenig et al., 2017). We believe that the continued use of misidentified cell lines in published studies is due, in part, to researchers not checking the lists of misidentified cell lines consistently.

Research Resource Identifiers (RRIDs) are unique identifiers that can be included in the methods section of a research paper to define the cell line, antibody, transgenic organism or software used (Bandrowski et al., 2016; Bandrowski and Martone, 2016). These identifiers were introduced largely because antibodies are a known source of variation in experiments, yet researchers did not include enough metadata to identify which antibody they used (Vasilevsky et al., 2013). RRIDs are slowly becoming more common in the literature, especially in some disciplines such as neuroscience. They are issued by the naming authority for a particular type of resource (for example, stock centers and model-organism databases issue RRIDs for organisms; Cellosaurus issues RRIDs for cell lines; the antibody registry issues RRIDs for antibodies; and the SciCrunch registry issues RRIDs for software and other digital resources).

RRIDs for cell lines were first incorporated into the RRID portal in 2016, and journals often require researchers to look up each antibody, animal or cell line on this website (which is synchronized with the Cellosaurus database, and therefore contains information about cell line contamination or misidentification). This requirement creates a natural experiment to address the question: if researchers are alerted about the misidentification of a cell line before they publish, will they still report data from such a cell line?

Results

In this study, we used text mining to identify papers that included RRIDs and papers that listed cell lines, and then compared the prevalence of misidentified cell lines in these two samples. It should be stressed that the use of cell lines on the problematic list does not automatically mean that a given cell line is being employed improperly. For example, the problematic list includes cases where a cell line is ‘partially contaminated’, which does not affect cells purchased from, say, a stock center. The list also includes cell lines that have been labeled with the wrong type of cancer, but these may still be safely used if the researchers know the true identity of the line. We must, therefore, exercise caution when interpreting these results. More information about the problematic list is given in the discussion section and Supplementary file 2.

Text mining corpus

To derive a dataset of reported cell lines from the general literature, we used text mining of the PubMed Central open access subset. This task requires the use of natural language processing, as cell-line names are not unique strings. For example, looking for a set of characters, such as ‘H2’, may reveal papers where H2 denotes a cell line, gene, protein, antibody or a figure. We used the SciScore tool, a ‘Named Entity Recognition-based algorithm’ specialized for scientific resources, which can in principle recognize the H2 cell line and can ignore its mentions referring to something other than a cell line.

The algorithm was taught to consider features around the recognized term that are associated with cell lines. SciScore was trained using 1,457 annotations made by a curator as well as a complete list of cell lines and synonyms from Cellosaurus as a set of seed data. We split the data at random into 90% training and 10% testing sets ten times, and obtained an average precision of 87.3% (± sd 4.65%) and recall of 61.9% (± sd 7.01%), with an overall F1 mean of 72.2% (± sd 5.39%).

This algorithm was then deployed on the ~2 million articles in the open access subset of PubMed Central. It recognized 305,161 unique cell-line names from a total of 150,459 unique articles (Figure 1), which we treated as our population of cell lines that was considered for further analysis. These data were then provided to Cellosaurus curators, allowing them to create entries for 22 cell lines that had been so far overlooked, and to add 18 synonyms and eight misspellings for existing entries, making Cellosaurus a more complete resource.

As cell-line names are not standardized in most papers, we used three different approaches to estimate the percentage of problematic cell lines in the general literature. First, to obtain a lower boundary, we determined whether the cell line name, as stated by the