FACT SHEET Genetic Sequence Data and Databases

3 April 2018 Version 1 FACT SHEET Genetic sequence data and databases Background GSD databases Genetic sequence data (GSD) GSD databases are databases that receive, host and Organisms are built, and their functions are provide access to GSD that has been submitted to determined, by their genetic code. This code is them. Most databases also provide important contained in DNA molecules, which are found in additional, contextual information related to the human, animal and plant cells, as well as in sequences, to enrich the sequence data with microorganisms like bacteria and viruses. DNA has information about the patient/animal/other source four components, or building blocks, called C from which the sample was extracted. This additional (cytosine), G (guanine), A (adenine), or T (thymine). information is known as “metadata”. The nucleotides line up in an order particular to each organism, comprising the genetic code. The code can Principal databases that host influenza GSD be “read” by the cell to produce RNA – which is also a There are more than 1700 molecular biology type of genetic information, encoded by the databases2, at least six of which contain influenza GSD. nucleotides A (adenine), C (cytosine), G (guanine) and These are: U (uracil) – and in turn, proteins that are responsible • GISAID EpifluTM: https://www.gisaid.org/ for all of the structure and function of a living • Influenza Research Database: organism. https://www.fludb.org/brc/home.spg?decorat or=influenza The order of the nucleotides in DNA and RNA – that is, • Openflu database: http://openflu.vital- the sequence – is critical because genetic sequences it.ch/browse.php contain information about the structure and specific • The three databases are part of the properties of an organism. In the case of viruses, this International Nucleotide Sequence Database includes characteristics such as pathogenicity, Collaboration (INSDC) 3: transmissibility and antiviral susceptibility. o GenBank: Laboratories can determine the genetic sequence of a https://www.ncbi.nlm.nih.gov/genban particular organism, using sequencing technologies. k/ The data generated through this process is called o European Nucleotide Archive: genetic sequence data (GSD), which is represented by https://www.ebi.ac.uk/ena listing the nucleotides in order (e.g., an RNA sequence o DNA Data Bank of Japan: might look like AGAAAUGAAAUGGCUCCUGUCAA). http://www.ddbj.nig.ac.jp/ Providers and users of influenza GSD GSD and databases in the PIP Framework The main providers of GSD from influenza viruses are Although GSD is not included in the definition of PIP GISRS laboratories. Other providers include national Biological Materials, genetic sequences are defined in public health laboratories and public and private section 4.2: “‘Genetic sequences’ means the order of laboratories that may sequence influenza viruses. GSD nucleotides found in a molecule of DNA or RNA. They can be uploaded to one or more databases by the contain the genetic information that determines the sequencing laboratory.1 biological characteristics of an organism or a virus”. Additionally, the PIP Framework has several GISRS laboratories, academia and industry are the references to GSD and databases: main influenza GSD users. They generally use the data to conduct risk assessment and monitor the • 5.2.1 Genetic sequence data, and analyses emergence and evolution of influenza viruses; develop arising from that data, relating to H5N1 and diagnostics; generate candidate vaccine viruses; and other influenza viruses with human pandemic produce new types of vaccines such as recombinant potential should be shared in a rapid, timely vaccine and vaccines using synthetic candidate and systematic manner with the originating vaccine viruses. laboratory and among WHO GISRS laboratories. 1 • 5.2.2 Recognizing that greater transparency overview of the main approaches taken by existing and access concerning influenza virus genetic databases. sequence data is important to public health and there is a movement towards the use of public- domain or public-access databases such as Genbank and GISAID respectively; and • 5.2.3 Recognizing that in some instances the publication of genetic sequence data has been considered sensitive by the country providing the virus; • 5.2.4 Member States request the Director- General to consult the Advisory Group on the best process for further discussion and resolution of issues relating to the handling of genetic sequence data from H5N1 and other influenza viruses with pandemic potential as part of the Pandemic Influenza Preparedness Framework.4 The PIP Framework also sets out certain expectations regarding GSD: • Annex 5: WHO Collaborating Center Terms of Reference, section B.5: WHO CC “upload available haemagglutinin, neuraminidase and other gene sequences of A(H5) and other influenza viruses with pandemic potential to a publicly accessible database in a timely manner but no later than three months after sequencing is completed, unless otherwise instructed by the laboratory or country providing the clinical specimens and/or viruses 5 (Guiding Principle 9)”. • Annex 4, Guiding Principle 9: “WHO GISRS laboratories will submit genetic sequences data to GISAID and Genbank or similar databases in a timely manner consistent with the Standard Material Transfer Agreement”.6 Main features of GSD databases As elaborated in the table below, GSD databases vary according to several factors, including • Scope of data coverage • Level of data curation and analysis • Database access • Terms and conditions of access to, and use of, data • Intellectual property (IP) and other legal rights over the data Databases address these factors, depending on their purpose and the community of data providers and data users they serve. The table below provides an 2 3 Glossary This glossary provides definitions of terms frequently used in the context of data access and databases. It does not purport to be comprehensive as new terms come into use and definitions evolve. The definitions provided in the Glossary are commonly used but not necessarily universally agreed. Therefore, some terms may be used differently in different contexts. Data access A Data access agreement is a contract between a database user (this includes data providers agreement (also and data users) on the one hand and the database (custodian of the data) on the other. It referred to as ‘database generally contains terms and conditions for accessing and using data hosted by the database. access agreement’ or These terms and conditions can include: conditions of access, use and re-use of data; further ‘data use agreement’) distribution of data; attribution and collaboration with the data provider; intellectual property rights, commercialization and licensing rights over data; suspension and termination of access. Open access23 The concept of ‘Open access’ arose in response to the growing demand for unrestricted, free access to scholarly research.24 According to the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities and the Bethesda Statement on Open Access Publishing, open access contributions must satisfy two conditions. The Berlin Declaration states: 1. The author(s) and right holder(s) of such contributions grant(s) to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship (community standards, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now), as well as the right to make small numbers of printed copies for their personal use. 2. A complete version of the work and all supplemental materials, including a copy of the permission as stated above, in an appropriate standard electronic format is deposited (and thus published) in at least one online repository using suitable technical standards (such as the Open Archive definitions) that is supported and maintained by an academic institution, scholarly society, government agency, or other well-established organization that seeks to enable open access, unrestricted distribution, interoperability, and long-term archiving.25 Open data ‘Open data’ is a relatively recent term. The concept behind “open data” however predates the invention of the Internet and has arisen from the belief that knowledge is a common good26 and “that the enormous amount of information routinely collected by government entities should be available to all citizens.”27 According to Open Knowledge International, ‘open data’ is data that can be “freely used, modified, and shared by anyone for any purpose”28. For example, the European Data Portal, which gives access to “metadata of Public Sector Information available on public data portals across European countries”29, states that: “[open data] should have no limitations that prevent it from being used in any particular way”, “[it] must be free to use, but this does not mean that it must be free to access”, and “once the user has the data, they are free to use, reuse and redistribute it – even commercially.”30 The World Bank states that open data has two dimension: “1. The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions. 2. The data must be technically open, which means they must be published in electronic formats that are machine readable and preferably non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions.”31 Open source The concept of open source arose from the field of computer programming.32 Open source licenses allow a community of users to access source materials, tools and platforms under terms that are meant to encourage free flow of information, collaboration and innovation. Open source licenses usually require that no intellectual property rights be asserted on the materials, tools or platforms.

FACT SHEET Genetic Sequence Data and Databases

Bioinformatics Study of Lectins: New Classification and Prediction In

A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

Six-Fold Speed-Up of Smith-Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors

Bioinformatics Courses Lecture 3: (Local) Alignment and Homology

A Survey on Data Compression Methods for Biological Sequences

Biohackathon Series in 2013 and 2014

Database and Bioinformatic Analysis of BCL-2 Family Proteins and BH3-Only Proteins Abdel Aouacheria, Vincent Navratil, Christophe Combet

Protein Database

Sequence Database Searching

Accessing Molecular Biology Information * Genome Browsers

Introduction to Bioinformatics

An Open-Access Platform for Glycoinformatics