The UniProt Knowledgebase
Bringing bioinformatics into the classroom
The UniProt Knowledgebase (UniProtKB)
Using Bioinformatics to hunt SARS-CoV-2, its variants & its origins
A PRACTICAL GUIDE
Version: 19 August 2021

A Practical Guide to SARS-CoV-2, its variants & its origins
Hunting SARS-CoV-2, its variants & its origins

Overview

This Practical Guide outlines basic bioinformatics approaches for exploring the SARS-CoV-2 genome and its corresponding proteins, focusing on the protein exposed on the viral particle surface: the spike protein. It explores how bioinformatics can be harnessed to study a new virus, its genome, its proteins, its origins and its evolution.

Teaching Goals & Learning Outcomes

This Guide introduces a range of bioinformatics tools for comparing and analysing nucleotide and protein sequences. On reading the Guide and completing the exercises, you will be able to:
• discover SARS-CoV-2 genome(s) available in a public nucleotide repository;
• compare SARS-CoV-2 genome sequences, look for their differences (mutations) and identify the variants;
• translate the spike gene into its encoded protein sequence;
• discover the 3D structure of the spike protein;
• understand the impact of mutations on infectivity and immune responses; and
• infer the origin of SARS-CoV-2 by comparing coronavirus spike protein sequences from different animal origins.

1 Introduction

Viruses are, by far, the most abundant microbes on the planet [1]. The living world could not exist without them! They encompass much of the biological diversity on the planet, catalyse nutrient cycling, affect the microbial make-up of communities through selective mortality, and play a key role in the regulation of carbon dioxide production by the oceans; last but not least, they are important actors in the evolution of species. As an example, ~8 % of the human genome is believed to have originated from viral genome integrations: e.g., placenta development originates from the integration of a virus into a primate genome more than 40 million years ago [2].

The number of virus particles on Earth is frequently reported as being of the order of 10^31 [3]. There are typically 10 million viruses per millilitre in coastal seawater [4]. Each day, viruses fall from the sky: in each square metre, tens of millions of bacteria and billions of viruses are deposited [5].

Biologists estimate that 380 trillion viruses are living on and inside our body right now, 10 times the number of bacteria [6]. If they cause disease, viruses are considered pathogenic. By 2021, ~160 viruses were known to be pathogenic to humans, such as Ebola, measles, Human Immunodeficiency Virus (HIV), dengue, papillomavirus, hepatitis and certain coronaviruses [7,8].

Coronaviruses took on a new and rather frightening significance towards the end of 2019 and in the early months of 2020, when a new and deadly coronavirus took the world by storm, leaving a trail of infection, illness and death in its wake. Caught off guard, communities around the world were galvanised into action, racing to sequence its genome, to trace its origins, and to develop life-saving treatments and vaccines. This Guide illustrates part of this story: it shows how, with the help of bioinformatics approaches, various aspects of viruses can be studied today, highlighting some key facts about how viruses are monitored once we have access to their genome sequences.

2 About this Guide

This Guide outlines basic bioinformatics approaches for exploring SARS-CoV-2 genomes and the proteins they encode. We focus on the spike protein, located at the surface of the virion, which is responsible for virion entry into human cells. Exercises are provided to show how to study the spike gene, its protein sequence and 3D structure, focusing in particular on the impact of a mutation found in the Alpha, Beta, Gamma and Delta virus variants. We also show how we can generate hypotheses on the animal origins of SARS-CoV-2 (pangolin or bat). Exercises are adapted from a freely accessible online workshop [9]. Throughout the text, key terms – rendered in bold type – are defined in green boxes. Additional information is provided in various other supplementary boxes throughout the text.

KEY TERMS
Amino acid: one of 20 common, naturally occurring building-blocks of proteins
Bacteria: unicellular microorganisms that can live in a variety of environments (air, soil, water, other organisms); bacteria constitute one of the three primary kingdoms of life
Catalyse: to accelerate a chemical reaction
Genome: the entirety of an organism’s genetic information, encoded as either DNA or RNA (in some viruses)
Mutation: a change in a genome sequence, such as a change in a nucleotide base, or the deletion or addition of a base
Protein: an organic compound containing one or more linear polymers of amino acids; existing in globular, fibrous or membrane-bound forms, proteins participate in virtually all cellular processes, including the construction of viruses
Vaccine: a substance or agent designed to stimulate the production of antibodies & hence provide immunity against a particular pathogen
Virion: an entire virus particle, comprising an outer protein envelope & an inner core of nucleic acid

3 Coronaviruses & SARS-CoV-2

A virus is a parasitic agent transmitted via a microscopic particle made of strands of RNA or DNA (its genome) inside a protein coat (capsid) and/or envelope (Figure 1).

Figure 1 Basic anatomy of a virus particle (virion). All particles house a genome, encoded as DNA or RNA. SARS-CoV-2 has an RNA genome packed inside a protein coat (capsid), with an outer envelope containing the well-known ‘spike’ protein.

A virus can only replicate by entering a cell and using the cellular machinery of its host. Some viruses infect animals; others infect plants or bacteria. When infected, a host cell is directed to rapidly produce hundreds of copies of the original virus. When not inside an infected cell, viruses exist as independent particles, which are formally referred to as virions.

Coronaviruses constitute a large family of viruses that includes more than 40 species, most of which are harmless to humans. Seven coronaviruses are human pathogens: four (OC43, 229E, NL63, HKU1) are endemic and known to cause colds; three are zoonotic and can cause severe lung infections: Severe Acute Respiratory Syndrome-related Coronavirus (SARS-CoV); Middle East Respiratory Syndrome-related Coronavirus (MERS-CoV); and Severe Acute Respiratory Syndrome-related Coronavirus 2 (SARS-CoV-2) – the latter successfully transitioned from zoonotic to endemic in 2020.

SARS-CoV-2 is responsible for Coronavirus Disease 2019, or COVID-19. The virus was first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern in January 2020, and a pandemic in March 2020. By 19 August 2021, more than 209 million cases had been confirmed worldwide, with more than 4 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history (ninth out of the top ten) [10].

[...] SARS-CoV-2 genomes, which means that all the sequences collected worldwide are compared against it [12].

Public nucleotide sequence databases
GenBank is the nucleotide sequence database maintained by the National Center for Biotechnology Information (NCBI). It is a member of the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing, foundational initiative that was devised in order to bring together the three major nucleotide sequence repositories (DDBJ, from Japan; EMBL-Bank, from Europe; GenBank, from the USA) & harmonise their annotations. These databases, which contain all the publicly available DNA & RNA sequences (& their annotations) submitted by the scientific community, exchange their data daily.

3.2 Setting up a test for the presence of SARS-CoV-2

Sequencing the SARS-CoV-2 genome made it possible to rapidly set up a test, based on a method known as the Polymerase Chain Reaction (PCR), to detect the presence of the virus in nasopharyngeal (nose) or oropharyngeal (throat) swabs. The method is so sensitive that swabs can test ‘positive’ with just 100 viruses present! Because the SARS-CoV-2 genome is RNA-based, it’s necessary to use a Reverse Transcription-PCR (RT-PCR) approach, whereby the RNA is first converted into DNA, then selectively amplified (or ‘photocopied’) to create millions to billions of fragment copies. In the amplification step, DNA replication is initiated by two small DNA sequences (of ~20 nucleotides) called primers. These are designed to bind specifically to either side of the section of DNA to be copied.

KEY TERMS
Annotation: notes included within database entries to make them both informative & re-usable
Cell: the fundamental structural & functional unit, or building block, of living organisms; eukaryotic cells typically contain cytoplasm & a nucleus bounded within a membrane
DNA: deoxyribonucleic acid, a molecule comprising two nucleotide chains that coil together, forming a double-helix structure in which A always binds to T, & G to C, rather like a twisted ladder
Endemic: a disease, condition or infection that is constantly maintained at a base level in a given area or population
Nucleotide: a chemical base (one of 4 building-blocks of DNA & RNA) linked to a molecule of sugar & a molecule of phosphoric acid. In DNA, the nucleotide bases are adenine (A), cytosine (C), guanine (G) & thymine (T), whose