Bringing bioinformatics into the classroom

The UniProt Knowledgebase

UniProtKB

Using Bioinformatics to hunt SARS-CoV-2, its variants & its origins

A PRACTICAL GUIDE


Version: 19 August 2021 A Practical Guide to SARS-CoV-2, its variants & its origins

Hunting SARS-CoV-2, its variants & its origins

Overview
This Practical Guide outlines basic bioinformatics approaches for exploring the SARS-CoV-2 genome and its corresponding proteins, focusing on the protein exposed on the viral particle surface: the spike protein. It explores how bioinformatics can be harnessed to study a new virus, its genome, its proteins, its origins and its evolution.

Teaching Goals & Learning Outcomes
This Guide introduces a range of bioinformatics tools for comparing and analysing nucleotide and protein sequences. On reading the Guide and completing the exercises, you will be able to:
• discover SARS-CoV-2 genome(s) available in a public nucleotide repository;
• compare SARS-CoV-2 genome sequences, look for their differences (mutations) and identify the variants;
• translate the spike gene into its encoded protein sequence;
• discover the 3D structure of the spike protein;
• understand the impact of mutations on infectivity and immune responses; and
• infer the origin of SARS-CoV-2 by comparing coronavirus spike protein sequences from different animal origins.

1 Introduction

Viruses are, by far, the most abundant microbes on the planet [1]. The living world could not exist without them! They encompass much of the biological diversity on the planet, catalyse nutrient cycling, affect the microbial make-up of communities through selective mortality, and play a key role in the regulation of carbon dioxide production by the oceans; last but not least, they are important actors in the evolution of species. As an example, ~8% of the human genome is believed to have originated from viral genome integrations: e.g., placenta development originates from the integration of a virus into a primate genome more than 40 million years ago [2].

The number of virus particles on Earth is frequently reported as being of the order of 10^31 [3]. There are typically 10 million viruses per millilitre in coastal seawater [4]. Each day, viruses fall from the sky: in each square metre, tens of millions of bacteria and billions of viruses are deposited [5].

Biologists estimate that 380 trillion viruses are living on and inside our body right now, 10 times the number of bacteria [6]. If they cause disease, viruses are considered pathogenic. By 2021, ~160 viruses were known to be pathogenic to humans, such as Ebola, measles, Human Immunodeficiency Virus (HIV), dengue, papillomavirus, hepatitis and certain coronaviruses [7,8].

Coronaviruses took on a new and rather frightening significance towards the end of 2019 and in the early months of 2020, when a new and deadly coronavirus took the world by storm, leaving a trail of infection, illness and death in its wake. Caught off guard, communities around the world were galvanised into action, racing to sequence its genome, to trace its origins, and to develop life-saving treatments and vaccines. This Guide illustrates part of this story: it shows how, with the help of bioinformatics approaches, various aspects of viruses can be studied today, highlighting some key facts about how viruses are monitored once we have access to their genome sequences.

2 About this Guide

This Guide outlines basic bioinformatics approaches for exploring SARS-CoV-2 genomes and the proteins they encode. We focus on the spike protein, located at the surface of the virion, which is responsible for virion entry into human cells. Exercises are provided to show how to study the spike gene, its protein sequence and 3D structure, focusing in particular on the impact of a mutation found in the Alpha, Beta, Gamma and Delta virus variants. We also show how we can generate hypotheses on the animal origins of SARS-CoV-2 (pangolin or bat). Exercises are adapted from a freely accessible online workshop [9]. Throughout the text, key terms (rendered in bold type) are defined in green boxes. Additional information is provided in various other supplementary boxes throughout the text.

KEY TERMS
Amino acid: one of 20 common, naturally occurring building-blocks of proteins
Bacteria: unicellular microorganisms that can live in a variety of environments (air, soil, water, other organisms); bacteria constitute one of the three primary kingdoms of life
Catalyse: to accelerate a chemical reaction
Genome: the entirety of an organism’s genetic information, encoded as either DNA or RNA (in some viruses)
Mutation: a change in a genome sequence, such as a change in a nucleotide base, or the deletion or addition of a base
Protein: an organic compound containing one or more linear polymers of amino acids; existing in globular, fibrous or membrane-bound forms, proteins participate in virtually all cellular processes, including the construction of viruses
Vaccine: a substance or agent designed to stimulate the production of antibodies & hence provide immunity against a particular pathogen
Virion: an entire virus particle, comprising an outer protein envelope & an inner core of nucleic acid


3 Coronaviruses & SARS-CoV-2

A virus is a parasitic agent transmitted via a microscopic particle made of strands of RNA or DNA (its genome) inside a protein coat (capsid) and/or envelope (Figure 1).

Figure 1 Basic anatomy of a virus particle (virion). All particles house a genome, encoded as DNA or RNA. SARS-CoV-2 has an RNA genome packed inside a protein coat (capsid), with an outer envelope containing the well-known ‘spike’ protein.

A virus can only replicate by entering a cell and using the cellular machinery of its host. Some viruses infect animals; others infect plants or bacteria. When infected, a host cell is directed to rapidly produce hundreds of copies of the original virus. When not inside an infected cell, viruses exist as independent particles, which are formally referred to as virions.

Coronaviruses constitute a large family of viruses that includes more than 40 species, most of which are harmless to humans. Seven coronaviruses are human pathogens: four (OC43, 229E, NL63, HKU1) are endemic and known to cause colds; three are zoonotic and can cause severe lung infections: Severe Acute Respiratory Syndrome-related Coronavirus (SARS-CoV); Middle East Respiratory Syndrome-related Coronavirus (MERS); and Severe Acute Respiratory Syndrome-related Coronavirus 2 (SARS-CoV-2). The last of these successfully transitioned from zoonotic to endemic in 2020.

SARS-CoV-2 is responsible for Coronavirus Disease-19, or COVID-19. The virus was first identified in December 2019 in Wuhan, China. The World Health Organisation declared the outbreak a Public Health Emergency of International Concern in January 2020, and a pandemic in March 2020. By 19 August 2021, more than 209 million cases had been confirmed worldwide, with more than 4 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history (ninth out of the top ten) [10].

The COVID-19 pandemic marked the beginning of an unprecedented race against time to better understand the biology and evolution of the virus, and to find treatments and vaccines, based on years of fundamental research and cutting-edge technologies!

3.1 The first SARS-CoV-2 genome sequence

On 10 January 2020, researchers in China submitted the first genome sequence of SARS-CoV-2 to the public nucleotide sequence repository, GenBank [11] (see box). They sequenced the genetic material found in a sample of lung lavage fluid of a patient (a worker at Wuhan market) who’d been admitted to hospital on 26 December 2019 with a severe respiratory syndrome. The researchers identified a new strain of an RNA virus from the coronavirus family, and made their data freely available to the scientific community via GenBank. The RNA sequence comprises 29,903 nucleotides (A, U, C, G). In nucleotide sequence databases, the U nucleotide is replaced by T. This sequence is now considered the ‘reference’ sequence for SARS-CoV-2 genomes, which means that all the sequences collected worldwide are compared against it [12].

Public nucleotide sequence databases
GenBank is the nucleotide sequence database maintained by the National Center for Biotechnology Information (NCBI). It is a member of the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing, foundational initiative that was devised in order to bring together the three major nucleotide sequence repositories (DDBJ, from Japan; EMBL-Bank, from Europe; GenBank, from the USA) & harmonise their annotations. These databases, which contain all the publicly available DNA & RNA sequences (& their annotations) submitted by the scientific community, exchange their data daily.

KEY TERMS
Annotation: notes included within database entries to make them both informative & re-usable
Cell: the fundamental structural & functional unit, or building block, of living organisms; eukaryotic cells typically contain cytoplasm & a nucleus bounded within a membrane
DNA: deoxyribonucleic acid, a molecule comprising two nucleotide chains that coil together, forming a double-helix structure in which A always binds to T, & G to C, rather like a twisted ladder
Endemic: a disease, condition or infection that is constantly maintained at a base level in a given area or population
Nucleotide: a chemical base (one of 4 building-blocks of DNA & RNA) linked to a molecule of sugar & a molecule of phosphoric acid. In DNA, the nucleotide bases are adenine (A), cytosine (C), guanine (G) & thymine (T), whose precise order (sequence) conveys the genetic information; in RNA, thymine (T) is replaced by uracil (U)
Polymerase Chain Reaction (PCR): a lab technique that makes it possible to selectively amplify a DNA fragment: i.e., to produce millions or billions of copies of this fragment in order, for example, to search for its presence in a biological sample
Reverse Transcription-PCR (RT-PCR): a lab technique that combines reverse transcription of RNA into DNA with amplification of specific DNA fragments using PCR
RNA: ribonucleic acid, a molecule comprising a long, unbranched chain of ribonucleotides (i.e., nucleotides whose sugar component is ribose). In cells, the well-known single-stranded RNA (ssRNA), called mRNA, acts as a 'messenger' & is essential in the synthesis of proteins. The genetic material of RNA viruses can be single-stranded (ssRNA) or double-stranded (dsRNA)
Zoonotic disease: disease caused by an infectious agent, such as a bacterium or virus, that has jumped from another animal (usually a vertebrate) to a human

3.2 Setting up a test for the presence of SARS-CoV-2

Sequencing the SARS-CoV-2 genome made it possible to rapidly set up a test, based on a method known as the Polymerase Chain Reaction (PCR), to detect the presence of the virus in nasopharyngeal (nose) or oropharyngeal (throat) swabs. The method is so sensitive that swabs can test 'positive' with just 100 viruses present! Because the SARS-CoV-2 genome is RNA-based, it’s necessary to use a Reverse Transcription-PCR (RT-PCR) approach, whereby the RNA is first converted into DNA, then selectively amplified (or ‘photocopied’) to create millions to billions of fragment copies. In the amplification step, DNA replication is initiated by two small DNA sequences (of ~20 nucleotides) called primers. These are designed to bind specifically to either side of the section of DNA to be copied.
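The ‘photocopying’ step described in section 3.2 can be made concrete with a little arithmetic. The sketch below assumes an idealised PCR in which every cycle doubles the number of DNA copies; real reactions are less efficient, and the `efficiency` parameter and both function names are illustrative assumptions rather than part of any lab protocol. It shows how few cycles are needed, in principle, to go from ~100 templates to millions or billions of copies.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after `cycles` rounds of PCR.

    efficiency=1.0 means perfect doubling in every cycle (an idealisation).
    """
    return initial_copies * (1.0 + efficiency) ** cycles


def cycles_needed(initial_copies, target_copies, efficiency=1.0):
    """Smallest whole number of cycles that reaches at least `target_copies`."""
    if initial_copies <= 0 or efficiency <= 0:
        raise ValueError("initial_copies and efficiency must be positive")
    cycles = 0
    copies = float(initial_copies)
    while copies < target_copies:
        copies *= 1.0 + efficiency  # one more amplification cycle
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Starting from ~100 templates, as in the sensitivity figure quoted above.
    for n in (10, 20, 30):
        print(n, "cycles:", f"{pcr_copies(100, n):.3g}", "copies")
    print("cycles to reach a billion copies:", cycles_needed(100, 1e9))
```

Lowering `efficiency` (say, to 0.9) gives a feel for how imperfect amplification increases the number of cycles required.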


If the virus’ genetic material is present in the tested sample, the amplified DNA fragments will be visible, for example, on an agarose gel (see Figure 2). Quantitative PCR (qPCR, or real-time PCR) is much used in diagnostics: it involves collecting PCR data using fluorescently labelled primers. This test can detect the presence and amount of a specific viral genome. To validate the test, several fragments, from different regions of the genome, need to be amplified, with different primer pairs.

Figure 2 Overview of the PCR technique. Two ~20 nucleotide primers are selected to match 100% with the single-strand DNA to be amplified in order to hybridise with it. A polymerase is added, plus the four nucleotides, to copy the DNA fragment in several polymerisation cycles (amplification). The fragments can then be analysed, say, on an agarose gel.

The primers have been carefully designed to be selective for the SARS-CoV-2 genome sequence: they match neither the human DNA (which could contaminate the sample), nor the genomes of other types of virus (which could also be present in the sample). Let’s take a closer look at the specificity of the RT-PCR test.

EXERCISES
1 Use ctrl C, or cmd C for Macs, to copy one of the two DNA fragments below. These correspond to the primers used in a PCR test:
Primer 1: CTCGAACTGCACCTCATGG
Primer 2: GGCATACACTCGCTATGTC
2 To look for the primer sequence in the SARS-CoV-2 genome, visit www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta. Use ctrl F, or cmd F for Macs, to find your primer in the genome ‘text’.
3 Again using ctrl F, or cmd F for Macs, type a random sequence (~20 letters) using the 4-letter alphabet (A, T, G, C). Can you find your query sequence in the genome? Try another random sequence...
4 To look for the primer sequence in the SARS-CoV genome, the coronavirus responsible for the 2003 epidemic, visit www.ncbi.nlm.nih.gov/nuccore/AY274119?report=fasta. Use ctrl F, or cmd F for Macs, to find your primer in the genome ‘text’. Does your primer match the SARS-CoV genome?

Now let’s perform the same tasks using bioinformatics methods.

EXERCISES
1 Copy one of the two primers used in a PCR test:
Primer 1: CTCGAACTGCACCTCATGG
Primer 2: GGCATACACTCGCTATGTC
2 Paste the sequence into the ‘BLAT’ genome-search tool input box: genome.ucsc.edu/cgi-bin/hgBlat. Using the ‘Genome’ option above the search box, select ‘SARS-CoV-2’ (species names are more or less in alphabetical order); then click the 'Submit' button.
3 Does your primer match the SARS-CoV-2 genome? At which position in the SARS-CoV-2 genome do you find your primer?
4 Return to the 'BLAT' input box. Using the same primer sequence, select the ‘Human’ genome using the ‘Genome’ option above the search box; click ‘Submit’. Are you able to find your query sequence in the human genome?
5 Return to the 'BLAT' input box. Type a random sequence (~30 letters); click ‘Submit’. Can you find your query sequence in the human genome? Try another random sequence, or select another genome…

3.3 The SARS-CoV-2 genomes

Several hundred thousand SARS-CoV-2 genomes have been sequenced in different countries, and have been submitted to open-access repositories like GenBank (see previous box) or GISAID [13] (see box below). Several research centres collaborate on the analysis of these genomes: this is essential, given the enormity of the task!

GISAID & open data
The GISAID initiative promotes rapid sharing of data from all influenza viruses & the coronavirus causing COVID-19. It includes genetic sequences & related clinical & epidemiological data associated with human viruses, plus both geographical & species-specific data associated with avian & other animal viruses, to help researchers understand how viruses evolve & spread during epidemics & pandemics.

Examples of data portals created by such research centres include the COVID-19 Data Portal (Europe) [14] and the NCBI SARS-CoV-2 Resources (USA) [15]. These give access to the latest SARS-CoV-2 genomes that have been sequenced and submitted to the public sequence databases. Each submitted sequence has its own unique identifying code, called an accession number (AC): e.g., the AC of the reference genome, submitted on 5 January 2020 from Wuhan (China), is NC_045512.2; for the Alpha variant, submitted on 22 December 2020 (UK), it is LR991698.2; and so on. So, let’s take a look at some of the virus genomes sequenced in different countries.

EXERCISES
1 Go to www.covid19dataportal.org: click on ‘Viral sequences’. Alternatively, go to the NCBI SARS-CoV-2 Resources www.ncbi.nlm.nih.gov/sars-cov-2: select ‘Nucleotide records’.
2 Each genome sequence has its own AC: e.g., NC_045512.2. This URL www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta accesses the GenBank entry of the reference genome, in Fasta format. To access the GenBank entry, with additional annotation, use this URL: www.ncbi.nlm.nih.gov/nuccore/NC_045512.2. To access the GenBank entry of your genome sequence of interest, replace the AC number within this URL: e.g., the GenBank entry for LR991698.2 (Alpha variant) is accessed via the URL www.ncbi.nlm.nih.gov/nuccore/LR991698.2.
3 Use the ACs below to explore SARS-CoV-2 genomes from different countries, sequenced & submitted at different times: MT612198.1; MT911538.1; MW079825.1; MW592707.1; MZ026889.1.

KEY TERMS
Accession number (AC): a unique computer-readable code given to identify a particular entry in a particular database
Agarose: a polysaccharide matrix often used in molecular biology to separate large molecules, such as DNA
Fasta format: a text-based file format for amino acid or nucleotide sequences; the file’s first line contains a ‘>’ symbol, followed by the accession number (& sometimes the identifier (ID)) & sequence title; the rest of the file contains the sequence in single-letter notation
Polymerase: an enzyme that synthesises chains of nucleic acids: DNA- & RNA-polymerases assemble DNA & RNA chains respectively
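The ctrl F search from the first set of primer exercises can also be sketched in a few lines of Python. This is a minimal illustration, not the method used by BLAT: it looks only for exact matches, on both strands, in a Fasta-format text. The toy record below is made up for illustration (it simply embeds Primer 1 and the reverse complement of Primer 2 in filler letters) and says nothing about where, or whether, the primers occur in a real genome; the function names are likewise assumptions.

```python
PRIMER_1 = "CTCGAACTGCACCTCATGG"
PRIMER_2 = "GGCATACACTCGCTATGTC"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def read_fasta(text):
    """Return (header, sequence) for a single-record Fasta string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a header line starting with '>'")
    return lines[0][1:], "".join(lines[1:]).upper()


def find_primer(primer, genome):
    """Return (1-based start, strand) for every exact match on either strand."""
    hits = []
    for strand, query in (("+", primer.upper()), ("-", reverse_complement(primer))):
        start = genome.find(query)
        while start != -1:
            hits.append((start + 1, strand))
            start = genome.find(query, start + 1)
    return hits


# Made-up toy record: filler letters around the two primer sites.
TOY_FASTA = (
    ">toy_record made-up sequence, for illustration only\n"
    + "ACGTACGTACGG" + PRIMER_1 + "\n"
    + "TTGACA" + reverse_complement(PRIMER_2) + "AAAC\n"
)

if __name__ == "__main__":
    header, genome = read_fasta(TOY_FASTA)
    print(header, "-", len(genome), "nucleotides")
    for name, primer in (("Primer 1", PRIMER_1), ("Primer 2", PRIMER_2)):
        print(name, find_primer(primer, genome))
```

To try it on a real genome, the text saved from one of the `?report=fasta` URLs above could be passed to `read_fasta` in place of `TOY_FASTA`. Searching both strands matters because the two primers of a pair bind to opposite strands of the DNA.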


3.4 From mutations to virus variants

Each time a virus replicates in a human cell, it naturally accumulates small errors, called mutations. This is an inevitable consequence of error-prone genome replication.

The COVID-19 pandemic resulted in a huge increase in the number of virus particles worldwide; this, in turn, led to the rapid appearance of random mutations in the virus genome (see Figure 3), allowing us to observe its ongoing evolution within a very short time-frame.

Figure 3 Alignment of the SARS-CoV-2 reference genome (NC_045512.2) & genome of the Alpha variant (LR991698.2). Only part of the alignment is shown, from nucleotide 23,041 to 23,100 of the reference sequence. A mutation (A->T) at position 23,063 in the reference sequence is highlighted. This mutation intrigued & quickly worried scientists. Does it alter viral function, like its contagiousness or ability to evade vaccines? Should this variant be designated a Variant Of Concern (see section 5.2)?

SARS-CoV-2 has a system of proof-reading that ‘corrects’ newly generated errors: this coronavirus does not accumulate as many mutations as do other RNA viruses. However, it still makes some ‘typos’ each time it replicates. The particular combination of mutations found in a given genome allows identification of the virus variant (note that each variant, also termed a lineage, can have different names: e.g., the Alpha variant has been referred to as the UK variant, B.1.1.7, 20I, or 501Y.V1). Let’s now compare some genomes and look for mutations.

EXERCISES
1 Go to www.ebi.ac.uk/Tools/msa/clustalo; toggle PROTEIN to DNA.
2 From the previous exercise, select the reference genome sequence (NC_045512.2) & another sequence: copy & paste each sequence in Fasta format (keep the line starting with the > sign) into the input box, one beneath the other; click on the ‘Submit’ button; be patient: aligning 2 x 29,900 nucleotides isn’t a piece of cake!
3 Can you count the differences (i.e., the mutations)?
4 Repeat the exercise with other genome sequences; always include the reference sequence.
5 Look, in particular, for the presence of a mutation in position 23,063 in the reference sequence: which sequence (from which country) has the A->T mutation? (We will return to this mutation later.)

How do we identify a viral genome and the variant to which it corresponds?

EXERCISES
1 Copy & paste one of your previous sequences (in Fasta format) into a text file. Rename the file extension ‘.txt’ to ‘.fasta’.
2 Go to the PANGOLIN COVID-19 Lineage Assigner of PANGO lineages [16]: pangolin.cog-uk.io
3 Import your ‘.fasta’ file by clicking on ‘Select fasta file to upload’.
4 Click on the ‘Start analysis’ button on the top left.
5 The name of your lineage/variant will appear, together with additional information. Click on the ‘i’ icon on the right. Scroll down the page to see the actual distribution of your variant (if it’s known).
6 Explore the genomes of different variants: MZ344997.1 (Alpha); MW598419.1 (Beta); MZ169911.1 (Gamma); MZ359841.1 (Delta).

The impacts of such mutations on the biology of the virus need to be studied: for this, you need to know whether the mutations are located in a gene, and if so, whether they modify the corresponding viral protein, and how.

To learn more about our A->T mutation located at position 23,063 of the reference genome, let’s first explore the proteins produced by the virus, focusing on a surface protein called ‘Spike’.

4 SARS-CoV-2 & its proteins

Knowing the genome sequence of the virus has made it possible to predict the different genes present, and to deduce the amino acid sequences of the corresponding proteins. At the time of writing, approximately 17 genes have been identified that give rise to ~29 different proteins (these are listed in UniProtKB [17] and ViralZone [18,19]; see box below). Knowing the sequence and biological functions of these proteins is a key step in the study of the impacts of mutations, and in the search for treatments (drugs or vaccines).

UniProt & ViralZone
UniProt (www.uniprot.org) [17] is the pre-eminent protein sequence database maintained by the European Bioinformatics Institute (UK), the SIB Swiss Institute of Bioinformatics (CH) & the Protein Information Resource (US). Its central component is the UniProt Knowledgebase, generally referred to simply as UniProtKB. UniProtKB is a reference resource of protein sequences & biological knowledge, covering proteins from all branches of the tree of life. The resource comprises two sections: the first, UniProtKB/Swiss-Prot, contains a relatively small set of entries that have been annotated by expert biocurators using literature sources & computed features; the second, UniProtKB/TrEMBL, is a very much larger set of entries that have been annotated automatically. The UniProt website also provides access to bioinformatics tools, such as ‘Align’, which allows users to align & compare sequences.
ViralZone (viralzone.expasy.org) [18], a knowledge resource maintained by the SIB Swiss Institute of Bioinformatics, gathers comprehensive information on viruses. In so doing, it helps to understand virus diversity & provides a gateway to the corresponding UniProtKB/Swiss-Prot viral protein entries. ViralZone gathers, in particular, data on SARS-CoV-2, with illustrations of its structure, its life cycle, its genome, its proteins, the different stages of COVID-19 & the targets of anti-viral drugs.

Amongst the SARS-CoV-2 proteins are RdRp (RNA-dependent RNA polymerase, also known as RNA replicase), an enzyme that’s involved in the transcription and replication of viral RNAs, and is part of the proof-reading system associated with RNA replication (UniProtKB: P0DTD1); a nucleoprotein (from gene N), which packages the viral genome RNA (UniProtKB: P0DTC9); and membrane proteins (from genes M and E) that are important for viral assembly (UniProtKB: P0DTC5 and UniProtKB: P0DTC4, respectively).

KEY TERMS
Gene: a molecular unit of heredity, broadly corresponding to a piece of DNA (or RNA) that encodes a protein (or a functional RNA)
PANGOLIN: Phylogenetic Assignment of Named Global Outbreak LINeages is a software tool that allows users to submit a full SARS-CoV-2 genome sequence & to determine its most likely lineage by comparison with other genome sequences
Variant Of Concern (VOC): the term used for variants of SARS-CoV-2 when particular mutations lead to rapid spread in human populations, more severe disease or reduced efficacy of vaccine treatments
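Counting the differences in a pairwise alignment, as asked in the Clustal Omega exercise in section 3.4, is easy to automate once the two aligned rows are available as text. The sketch below is a minimal illustration under stated assumptions: the two rows have already been aligned (gaps shown as ‘-’), the function name is hypothetical, and the example fragment is a toy in which only the letters AAT GGT GTT / TAT GGT GTT come from the Guide; the four flanking letters are made up so that the A of AAT sits at reference position 23,063.

```python
def list_differences(ref_aligned, other_aligned, ref_start=1):
    """List (reference position, reference base, other base) for each mismatch.

    Both rows must come from the same alignment, so they have equal length.
    Positions are counted along the reference only: gap columns in the
    reference ('-') do not advance the position, so an insertion is reported
    at the position of the last reference base before it.
    """
    if len(ref_aligned) != len(other_aligned):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    ref_pos = ref_start - 1  # position of the last reference base seen so far
    for ref_base, other_base in zip(ref_aligned.upper(), other_aligned.upper()):
        if ref_base != "-":
            ref_pos += 1
        if ref_base != other_base:
            differences.append((ref_pos, ref_base, other_base))
    return differences


if __name__ == "__main__":
    # Toy fragment starting at reference position 23,059 (flanking 'ACGT' is made up).
    reference = "ACGTAATGGTGTT"
    variant = "ACGTTATGGTGTT"
    for position, ref_base, var_base in list_differences(reference, variant, ref_start=23059):
        print(f"{ref_base}->{var_base} at reference position {position}")
```

Reporting positions along the reference, rather than along the alignment columns, keeps the numbering comparable between different pairwise comparisons, which is why the exercises always include the reference sequence.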


The mutation previously identified in the Alpha variant, at position 23,063 of the SARS-CoV-2 reference genome, is located in gene S. This encodes the spike protein (UniProtKB: P0DTC2), a protein located at the surface of the virion, giving it its ‘corona’ appearance and hence its name (see Figure 4).

Figure 4 Representation of SARS-CoV-2 & some of its proteins. The nucleoprotein (N) packages the RNA genome. Several proteins are located in the membrane at the surface of the virion (M, E & S). The spike protein (S), which forms a trimer, is shown in yellow.

4.1 The spike protein sequence

Like all proteins, the spike protein comprises a sequence of amino acid residues, the order of which is determined by the RNA sequence of the corresponding viral gene. Three nucleotide bases of RNA (a codon) correspond to one amino acid. There are 20 different amino acids, each of which can be symbolised by a single letter, as illustrated on the right-hand side of Figure 5 (e.g., G=glycine, A=alanine, V=valine, etc.; the full code is shown in Figure 8).

Figure 5 Excerpt from the reference genome sequence of SARS-CoV-2 in gene S’s GenBank entry. Nucleotides 21,563–25,384 (highlighted brown) correspond to the sequence of gene S (towards the centre of the figure, the ellipsis, …, denotes that, for convenience, a block of nucleotides has been deleted). Gene S encodes the spike protein, whose amino acid sequence is shown on the right; these nucleotides form part of the RNA sequence found in RNA vaccines (i.e., BioNTech/Pfizer & Moderna).

The ‘reference’ SARS-CoV-2 spike protein sequence, which has been available in UniProtKB since April 2020, comprises 1,273 amino acids: www.uniprot.org/uniprot/P0DTC2.fasta. Let’s find gene S, and its corresponding spike protein, in the SARS-CoV-2 genome.

EXERCISES
1 Go to the GenBank entry of the reference SARS-CoV-2 genome: www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=genbank
2 Look for ‘spike’ in the entry: use ctrl F, or cmd F for Macs.
3 Click on the CDS link in the left-hand column. Highlighted in brown is gene S, encoding the spike protein: this is the RNA sequence found in RNA vaccines.
4 The right-hand box contains the amino acid sequence of the spike protein (see Figure 5). This is the ‘reference’ sequence, so all spike sequences collected worldwide are compared against it.

4.2 The spike protein function

The spike protein allows the virus to enter human cells by interacting with a protein called ACE2, which is present on the surface of several cell types, including those of small and large arteries, and the lung and small intestine, as illustrated in Figure 6.

Figure 6 How SARS-CoV-2 enters human cells (adapted from [20]). The virion spike protein interacts with the ACE2 receptor, allowing the virion to enter in two ways: if protease TMPRSS2 is present at the cell surface, the virion membrane can fuse directly & release the viral genome into the cell; or, the virion can be taken up into an endosome, where cathepsin L activates the spike, allowing fusion & release of the viral RNA.

KEY TERMS
ACE2: Angiotensin-Converting Enzyme 2, a cell-surface enzyme responsible for cleaving a larger protein (angiotensinogen) into smaller peptides that have critical roles in regulating cell functions, including the regulation of blood pressure
Antigen: generally, a sugar, a protein, or part of a protein belonging to a foreign body (virus, bacteria, etc.). Antigens are like flags warning the body of intruders; they are recognised by antibodies & are at the heart of the immune response
Cathepsin L: an enzyme that plays a vital role in normal cellular functions, such as general protein turnover & antigen processing; within endosomes, it also cleaves the spike proteins of human coronaviruses SARS-CoV & SARS-CoV-2, facilitating their entry into host cells
Endosome: an intracellular organelle typically involved in sorting & trafficking lipid vesicles to & from plasma membranes; endosomes may be converted into lysosomes (cellular waste-disposal systems), which contain enzymes that digest many types of biomolecule
TMPRSS2: Transmembrane Serine Protease 2, a membrane-anchored enzyme that cleaves & activates proteins involved in various physiological functions; it also cleaves & activates both the ACE2 receptor, promoting uptake of human coronaviruses SARS-CoV & SARS-CoV-2, & coronavirus spike glycoproteins, activating them for host cell entry


That ACE2 should be the way into human cells for SARS-CoV-2 (and also for other coronaviruses, such as SARS-CoV and HCoV- NL6321) is by complete chance! The ACE2 protein is involved in the regulation of blood pressure in humans20,22. Spike is the SARS-CoV-2 protein that’s most accessible to the hu- man immune system, by virtue of being exposed on the virion sur- face – it is therefore accessible to antibodies. Only a small subset of the many antibodies that bind a virus are able to block viral infec- tion. These are called neutralising antibodies. Neutralisation pre- vents the viral particle from entering the target cells by either block- ing binding to the entry receptor, disrupting the entry process (fu- sion, injection, endocytosis, etc.), or aggregating the viral particles into a complex that can no longer enter cells. 4.3 The spike protein 3D structure Ke et al., (2020)23 used cryo-electron microscopy to image intact SARS-CoV-2 virions and determine the high-resolution structure of the spike protein on the virion surface. They discovered that Spike forms a trimer, as shown in Figure 7a. Knowledge of the 3-dimensional (3D) structure of the protein al- lows us to study the domain involved in the interaction of Spike with the human ACE2 receptor. Also, very importantly, this knowledge allows us to study the major part of the spike protein that’s recog- nised by neutralising antibodies – see Figure 7b. 6VXX Structure of the SARS-CoV-2 spike glycoprotein (closed state)

Figure 7a 3D structure of the spike trimer (PDB ID: 6VXX). Each 1,273-amino acid chain is shown in a different colour: green, blue, red. Pale blue cubes show the location of sugar molecules. The virion surface is depicted in brown. The spike protein amino acid sequence is shown at the top of the image (amino acids highlighted light grey in the sequence are those that weren’t clearly located in the 3D structure).

Figure 7b Major structural domains of the spike protein. The ACE2 receptor-binding & major neutralising antibody-binding domains are coloured red (adapted from [25]).

The Protein Data Bank

Maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) in the USA, the RCSB Protein Data Bank (PDB) is the central repository of macromolecular 3D structure data, including the structures of proteins, nucleic acids & complex assemblies. It is a founding member of the worldwide Protein Data Bank (wwPDB) partnership, which was established to create a single, global, public archive of macromolecular structure data.

Since 2020, the resource has been updated with newly released structures of SARS-CoV-2 & host-cell proteins, which will help researchers to better understand & tackle the COVID-19 global pandemic.

KEY TERMS

Antibodies: proteins that are produced by cells known as B lymphocytes, which are part of the immune system. Antibodies are found in different fluids (blood, saliva, etc.); their role is to detect the presence of foreign bodies – i.e., other proteins that belong, for example, to a virus. When this happens, a defence response – an immune response – is triggered & the intruder is eliminated
Cryo-electron microscopy (cryo-EM): an electron-microscopy (EM) technique applied to samples cooled to cryogenic (ultra-low) temperatures & embedded in an environment of vitreous water; cryo-EM is used to image biological samples
Neutralising antibodies: some antibodies can bind to a virion without preventing its infectivity; neutralising antibodies are those that prevent the virus from infecting a target cell & are the most important in terms of acquired immunity & vaccination

Let’s now visualise the 3D structure of the SARS-CoV-2 spike protein by retrieving its structural data from the publicly accessible Protein Data Bank (PDB) [24].

EXERCISES

1 Go to PDB entry www.rcsb.org/3d-view/6vxx
2 Mouse over the 3D structure: the position & name of each amino acid appear in a grey box on the bottom right. Rotate the structure.
3 Try to locate the amino acids that form part of the domain that interacts with the human ACE2 receptor: amino acids 319-541.
4 Try to find amino acid 501: i.e., asparagine 501, Asn 501, N501.
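Residue 501 in the structure can also be traced back to the genome. The sketch below is illustrative only: it assumes that the spike coding sequence begins at nucleotide 21,563 of the reference genome NC_045512.2 (a coordinate not given in this section), and the names `codon_genome_range` and `translate` are hypothetical helpers rather than part of any tool used in this Guide. It computes which genome positions encode a given spike residue and translates short DNA fragments with the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Assumption (not stated in this section): 1-based genome position of the
# first base of the spike coding sequence in NC_045512.2.
SPIKE_CDS_START = 21563


def codon_genome_range(residue_number, cds_start=SPIKE_CDS_START):
    """Return the (first, last) genome positions of the codon for a residue."""
    first = cds_start + 3 * (residue_number - 1)
    return first, first + 2


def translate(dna):
    """Translate a DNA fragment (spaces allowed) codon by codon."""
    dna = dna.replace(" ", "").upper()
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


first, last = codon_genome_range(501)
print(f"Spike residue 501 is encoded by genome positions {first}-{last}")
print("AAT GGT GTT ->", translate("AAT GGT GTT"))
```

Under that assumption, the codon for residue 501 starts at genome position 23,063, and the reference fragment AAT GGT GTT begins with the asparagine (N) found at that residue.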


5 Mutations & the spike protein

5.1 Impact of genome mutations on their proteins

Many mutations found in ‘new’ SARS-CoV-2 genomes don’t modify the encoded protein sequences, owing to the redundancy of the genetic code (e.g., codons AAT and AAC encode the same amino acid: asparagine, N). However, some mutations do modify the corresponding protein sequences, and, depending on their locations in the 3D structure of a protein, such ‘mutated’ amino acids may have an effect on the protein’s biological function. Scientists are thus carefully studying mutations found in the SARS-CoV-2 spike protein to determine whether they modify how the protein interacts with cell receptor(s), such as ACE2, or fuses with cell membranes, or whether they modulate the antigenicity of the protein, thereby creating ‘vaccine escape mutants’.

Recall from Figure 3 that the sequence AAT GGT GTT is part of the SARS-CoV-2 reference genome (NC_045512.2), while TAT GGT GTT is a piece of the Alpha variant viral genome, with the mutation A -> T located at position 23,063:

↓
AAT GGT GTT
TAT GGT GTT

We want to know whether A -> T or AAT -> TAT mutations change Spike’s amino acid sequence. We can answer this by manually looking up the codons in the genetic code (which, for reference, is illustrated in Figure 8), or by using bioinformatics tools.

Figure 8 The genetic code. The 1st, 2nd & 3rd codon bases are shown in the inner, 2nd & 3rd concentric circles; the respective translations sit at the circumference (e.g., reading from inside-out, codon AAG points to the amino acid lysine (K) in the bottom-left quadrant of the diagram).

Inspection of Figure 8 shows that our mutation changes the corresponding amino acid from asparagine, N (codon AAT) – bottom-left quadrant – to tyrosine, Y (codon TAT) – top-right quadrant. This mutation was first discovered in December 2020 in the UK virus genome, B.1.1.7, now known as the Alpha variant. Located at position 501 in the spike protein sequence, the mutation is called N501Y (also referred to as Nelly). N501Y is located in a region involved in the interaction of Spike with the human ACE2 receptor: between amino acids 319 and 541. It has been hypothesised that this mutation could modify the affinity of the spike protein for ACE2, and hence modulate the infectivity of the virus [25,26].

To get a better understanding of the effects of mutations on the amino acids they encode, and hence on the 3D structures of their parent proteins, let’s now take a closer look, and translate some DNA fragments containing different mutations.

KEY TERMS

Antigenicity: the capacity of an antigen to trigger an immune response
Genetic code: the set of rules biologists use to translate information sequestered in genetic sequences (i.e., within codons) into proteins

EXERCISES

1 Use the genetic code illustrated in Figure 8 to write down the amino acid sequences encoded in the DNA fragments below:
AAT GGT GTT
TAT GGT GTT
AAC GGT GTT
2 To translate these DNA fragments automatically, paste them into the input box of the ‘Sequence Manipulation Suite (SMS) Translate’ tool: www.bioinformatics.org/sms2/translate.html. The form requires sequences to be given a formatted header line, consistent with Fasta format; for clarity, we suggest adding the headers ‘>Reference’, ‘>VariantAlpha’ & ‘>VariantXX’, as shown below:
>Reference genome sequence (NC_045512.2)
AAT GGT GTT
>VariantAlpha (LR991698.2)
TAT GGT GTT
>VariantXX (Another genome sequence)
AAC GGT GTT
3 When you’re ready, click the ‘Submit’ button.
4 In the result page, examine the translations returned. Does the A -> T mutation change Spike’s amino acid sequence? If so, how?
5 Does the T -> C mutation change Spike’s amino acid sequence? If so, how?

5.2 Tracking virus variants

Mutations can be deleterious, neutral or beneficial. Deleterious mutations are quickly lost. Neutral and beneficial mutations circulate in viral populations, and can be monitored by genetic surveillance. Beneficial mutations are usually fixed by natural selection. These may lead to an increase in viral infectivity, or they may promote escape from immunity previously acquired by infection or vaccination [27].

Each virus variant (e.g., Alpha (B.1.1.7), Beta (B.1.351), Gamma (B.1.1.28.1) or Delta (B.1.617.2)) has its own combination of mutations, both at genome and protein levels, compared with the reference Wuhan genome. In one year, more than 20 SARS-CoV-2 variants have emerged, with a range of fixed amino acid substitutions, between 10 and 30 per lineage [25,28]. For the spike protein, between 2 and 12 mutations can be present (see Figures 9a and 9b). This combination of mutations allows the different virus variants to be identified, rather like a bar code, even though some mutations, such as spike mutations D614G and N501Y, are present in several variants.

Some variants are designated ‘Variants Of Concern’ (VOCs): this happens if there’s evidence that a particular variant i) has increased transmissibility, ii) leads to more severe disease (e.g., resulting in more hospitalisations or deaths), iii) shows significant resistance to neutralisation by antibodies produced by previous infection or vaccination or iv) is resistant to treatments or vaccines, or where there have been diagnostic detection failures. Identifying circulating variants allows public health authorities to respond quickly and decisively to prevent the spread of VOCs. To do this, the viral genomes are sequenced, and the corresponding proteins are studied. However, this requires access to state-of-the-art sequencing technologies;

certain regions of the world are therefore better represented in the sequencing data than others.

Figure 9a Spike mutations found in different well-known virus variants. The specific combination of mutations acts as a kind of bar code, allowing identification of different variants [25]. As can be seen in the image, the N501Y mutation is found in several variants. It’s believed that this mutation may help the virus spread more easily, while the E484K mutation, also found in several variants, may affect the antibody response.

Figure 9b Excerpt from a table listing mutations found in spike protein sequences from a range of well-known virus variants [25]. The biologically important regions of the spike protein (the neutralising antibody-binding & ACE2 receptor-binding domains) are indicated in the purple (NTD) & blue boxes (RBD) respectively.

6 Inferring the origins of SARS-CoV-2

Coronaviruses, including those related to SARS-CoV-2, are clearly present in many wild mammals. The origins of the virus are still unclear; however, the first genomic analysis suggests that SARS-CoV-2 is most closely related to viruses previously identified in bats. Nevertheless, it’s plausible that there were other intermediate animal transmissions before the introduction into humans.

It’s possible to get clues to, and to formulate hypotheses on, the relationships between the different coronaviruses that infect different species (human, bat, pangolin, civet, etc.) by comparing their protein sequences using Multiple Sequence Alignment (MSA) techniques. The MSA methods used in many software programs, such as UniProt’s Align tool (which uses the Clustal Omega algorithm [29]), first build a guide tree, and then align the sequences progressively according to the tree topology (Figure 10). The guide tree is not a phylogenetic tree; however, it rapidly gives clues to the relationship between the sequences. For example, by using a set of spike protein sequences, we can construct an MSA and obtain the guide tree shown in Figure 10.

Figure 10 Guide tree provided by UniProt’s Align tool. This guide tree offers a first hint at the relationships between different viruses, & allows us to consider the following hypotheses: i) the virus responsible for human SARS in 2003 probably originates from a civet; ii) the virus responsible for COVID-19 probably originates from a bat; however, as it’s also closely related to a coronavirus infecting pangolins, this could be an intermediate in the process of transmission. As of August 2021, the origin of the virus is not yet firmly established.

Researchers examining coronaviruses and bats are used to working with large numbers. A study of 12,333 bats from Latin America, Africa and Asia found that almost 9% carried at least one of 91 distinct coronaviruses. The authors estimated that there are at least 3,200 coronaviruses that infect bats. Moreover, there are more than 1,400 species of bat. Figuring out which ones are susceptible to which coronaviruses is no small task. The plethora of bat coronaviruses, coupled with the uncertainty about the role of an intermediary animal, also makes it tricky to know how to go about preventing a future viral spillover [30].

In the final exercises, we’ll compare spike protein sequences derived from viruses with different animal origins.

KEY TERMS

Guide tree: an initial step performed by multiple sequence alignment tools in which a quick set of comparisons is performed to create a hierarchical clustering of the sequences – the guide tree. The tree is then used to determine the order in which to add sequences to the alignment. The guide tree is generally just an approximation & isn't really suitable for use as a phylogenetic tree
Phylogenetic tree: a branching diagram, commonly referred to as a tree, that shows the evolutionary relationships among various species, or other entities, based on similarities & differences in their physical or genetic characteristics. In the field of bioinformatics, phylogenetic trees are generally inferred from multiple alignments of DNA or protein sequences using sophisticated computer software & statistical tools
Spillover: an event whereby a natural reservoir population with a high pathogen prevalence comes into contact with a new potential host population into which the pathogen may be transmitted
Tree topology: the branching structure of a guide tree or phylogenetic tree; tree topologies indicate patterns of relatedness among species or other biological entities being compared
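The quick pairwise comparisons that underlie a guide tree can be illustrated with a very small sketch. It is not the Clustal Omega algorithm: it simply counts identical positions between partial spike sequences that are already aligned and of equal length (the four 19-residue fragments used in the manual exercise of this Guide), and ranks them against human SARS-CoV-2. The function names are hypothetical.

```python
# Partial spike sequences from the manual exercise in this Guide
# (already aligned, 19 residues each).
FRAGMENTS = {
    "Human SARS-CoV-2 (2020)": "IRGDEVRQIAPGQTGKIAD",
    "Pangolin coronavirus (2020)": "VRGDEVRQIAPGQTGRIAD",
    "Human SARS-CoV (2003)": "VKGDDVRQIAPGQTGVIAD",
    "Bat coronavirus (2020)": "ITGDEVRQIAPGQTGKIAD",
}


def count_identities(seq_a, seq_b):
    """Count positions at which two pre-aligned sequences are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b))


def rank_by_identity(query_name, sequences):
    """Rank all other sequences by number of identities to the query."""
    query = sequences[query_name]
    scores = {
        name: count_identities(query, seq)
        for name, seq in sequences.items()
        if name != query_name
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_identity("Human SARS-CoV-2 (2020)", FRAGMENTS)
for name, identical in ranking:
    length = len(FRAGMENTS[name])
    print(f"{name}: {identical}/{length} identical positions")
```

A ranking over such a short fragment is only a hint about relatedness; like the guide tree itself, it should not be read as a phylogenetic result.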


In this first exercise, we’ll examine a set of partial spike protein sequences manually.

EXERCISES

1 Examine the alignment of partial spike protein sequences below:
Human SARS-CoV-2 (2020)      IRGDEVRQIAPGQTGKIAD
Pangolin coronavirus (2020)  VRGDEVRQIAPGQTGRIAD
Human SARS-CoV (2003)        VKGDDVRQIAPGQTGVIAD
Bat coronavirus (2020)       ITGDEVRQIAPGQTGKIAD
2 Look for similarities & differences between the aligned amino acids. Which spike sequence is most similar to human SARS-CoV-2?

Let’s now try to perform a similar exercise using a basic bioinformatics sequence alignment tool.

EXERCISES

1 Go to UniProt Align: www.uniprot.org/align
2 Copy the partial spike sequences below, which originate from different coronaviruses (CoV) infecting different mammals:
>Human_SARS_CoV2_2020
FSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
>Civet_CoV_2003
FSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIAD
>Pangolin_CoV_2020
FSTFKCYGVSPTKLNDLCFTNVYADSFVVRGDEVRQIAPGQTGRIAD
>Human_SARS_CoV_2003
FSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIAD
>Bat_CoV_2020
FSTFKCYGVSPTKLNDLCFTNVYADSFVITGDEVRQIAPGQTGKIAD
>Human_MERS_2012
VNDFTCSQISPAAIASNCYSSLILDYFSYPLSMKSDLSVSSAGPISQ
>Bat_CoV_2007
VDEFSCNGISPDSIARGCYSTLTVDYFAYPLSMKSYIRPGSAGNIPL
3 Paste the sequences into the Align query box. Click on ‘Run Align’.
4 From the resulting alignment, how many amino acids are completely conserved (i.e., are identical)? What are they? Now, looking at the tree, discuss the different hypotheses & possible transmission modes (human-human, bat-human, bat-pangolin-bat-human, etc.).
5 Which method do you think made it easier to compare the similarity of the sequences: the manual or the bioinformatics approach?

In this Guide, we’ve seen how bioinformatics databases and software tools have played critical roles in tackling the global public-health crisis that faced us at the beginning of 2020. Key to this endeavour was the ability to share data worldwide: i.e., genomic and other crucial data resources are publicly accessible – the data are ‘open’. This allowed researchers to draw on years of basic research and state-of-the-art technology to quickly piece together the many pieces of the complex biological and epidemiological puzzles necessary to understand how a new virus works.

TAKE HOMES

Having completed this Practical Guide, you now have a practical sense of how to:
1 Test the specificity of primers used in the coronavirus RT-PCR test;
2 Investigate the locations of short DNA sequences, like PCR primers, in SARS-CoV-2 & human genomes using the BLAT search tool;
3 Compare nucleotide or protein sequences using Clustal Omega or UniProt’s Align tool, & identify their mutations;
4 Identify a viral genome & the variant to which it corresponds using the PANGOLIN COVID-19 lineage assigner;
5 Search public nucleotide sequence databases, such as GenBank, to find named genes & their corresponding proteins;
6 Use the RCSB PDB macromolecular structure database to visualise & explore the 3D structure of the spike protein;
7 Translate DNA sequences into their protein products using the Sequence Manipulation Suite (SMS), & explore the impact of nucleotide base mutations on the translated amino acid sequences;
8 Investigate the impact of mutations on infectivity & immune responses;
9 Use UniProt’s Align tool to compare spike proteins from different animals, identify differences between them & infer the origins of SARS-CoV-2; &
10 Discover how evolutionary relatedness can be inferred at the molecular level, using UniProt’s alignment tool to visualise the evolutionary relationships between different species based on their protein sequences.

7 References & further reading

1 What is the most abundant organism on earth? aimed.net.au/2015/09/11/what-is-the-most-abundant-organism-on-earth-not-what-youd-think
2 Soygur B & Sati L. (2016) The role of syncytins in human reproduction and reproductive organ cancers. Reproduction, 152(5), R167-R178.
3 Mushegian R. (2020) Are There 10^31 Virus Particles on Earth, or More, or Fewer? J. Bacteriol., 202(9), e00052-20.
4 Suttle CA. (2007) Marine viruses – major players in the global ecosystem. Nat. Rev. Microbiol., 5(10), 801-812.
5 Reche I et al. (2018) Deposition rates of viruses and bacteria above the atmospheric boundary layer. ISME J., 12, 1154–1162.
6 Pride D. (2020) Viruses can help us as well as harm us. Sci. Am., 323(6), 46-53; www.scientificamerican.com/article/viruses-can-help-us-as-well-as-harm-us/
7 Human viruses and associated pathologies: viralzone.expasy.org/678
8 Human virus relative size: viralzone.expasy.org/5216
9 Hunting SARS-CoV-2, its variants and its origin: education.expasy.org/bioinformatique/Coronavirus_proteines_EN.html
10 COVID-19 pandemic: en.wikipedia.org/wiki/COVID-19_pandemic
11 Sayers EW et al. (2021) GenBank. Nucleic Acids Res., 49(D1), D92-D96; www.ncbi.nlm.nih.gov/genbank
12 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome: www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta
13 Shu Y & McCauley J. (2017) GISAID: Global Initiative on Sharing All Influenza Data – from vision to reality. Euro. Surveill., 22(13), 30494; www.gisaid.org
14 COVID-19 Data Portal – accelerating research through data sharing: www.covid19dataportal.org
15 NCBI SARS-CoV-2 Resources: www.ncbi.nlm.nih.gov/sars-cov-2
16 Phylogenetic Assignment of Named Global Outbreak Lineages (PANGOLIN): cov-lineages.org/pangolin.html
17 The UniProt Consortium. (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res., 49(D1), D480-D489; www.uniprot.org


18 ViralZone. SARS-CoV-2, COVID-19 Coronavirus Resource: viralzone.expasy.org
19 SARS coronavirus 2 (SARS-CoV-2) proteome: viralzone.expasy.org/8996
20 Coronavirus life-cycle: viralzone.expasy.org/9096
21 Human coronavirus NL63: en.wikipedia.org/wiki/Human_coronavirus_NL63
22 Baillie Gerritsen V & Lolo A. (2020) A way in – Protein Spotlight comics: www.proteinspotlight.org/back_issues/223/comic/
23 Ke Z et al. (2020) Structures and distributions of SARS-CoV-2 spike proteins on intact virions. Nature, 588, 498–502. doi: 10.1038/s41586-020-2665-2
24 Burley S et al. (2021) RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res., 49(D1), D437-D451; www.rcsb.org
25 SARS-CoV-2 circulating variants: viralzone.expasy.org/9556
26 Shang et al. (2020) Structural basis of receptor recognition by SARS-CoV-2. Nature, 581, 221-224. doi: 10.1038/s41586-020-2179-y
27 Kupferschmidt K. (2020) UK variant puts spotlight on immunocompromised patients’ role in the COVID-19 pandemic. Science: www.sciencemag.org/news/2020/12/uk-variant-puts-spotlight-immunocompromised-patients-role-covid-19-pandemic
28 Petrova V et al. (2018) The evolution of seasonal influenza viruses. Nat. Rev. Microbiol., 16, 47–60. doi: 10.1038/nrmicro.2017.118
29 Sievers F & Higgins DG. (2017) Clustal Omega for making accurate alignments of many protein sequences. Protein Sci., 27(1), 135-145.
30 Burki T. (2020) The origin of SARS-CoV-2. Lancet Infect. Dis., 20(9), 1018-1019. doi: 10.1016/S1473-3099(20)30641-1

8 Acknowledgements & funding

GOBLET Practical Guides build on GOBLET’s Critical Guide concept, using layout ideas from the Higher Apprenticeship specification for college-level students in England. The contents herein expand on materials made freely available by the SIB Swiss Institute of Bioinformatics (Hunting SARS-CoV-2, its variants & its origins: education.expasy.org/bioinformatique/Coronavirus_proteines_EN.html) [9].

The Swiss-Prot group is part of the SIB Swiss Institute of Bioinformatics and of the UniProt Consortium. Swiss-Prot group activities are supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and by the National Eye Institute (NEI), National Human Genome Research Institute (NHGRI), National Heart, Lung and Blood Institute (NHLBI), National Institute on Aging (NIA), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of General Medical Sciences (NIGMS), National Institute of Mental Health (NIMH), and National Cancer Institute (NCI) of the National Institutes of Health (NIH) [U24HG007822].

9 Licensing & availability

This Guide is freely accessible under creative commons licence CC-BY-SA 2.5. The contents may be re-used and adapted for education and training purposes.

The Guide is freely available for download via the GOBLET portal (www.mygoblet.org), EMBnet website (www.embnet.org) and the F1000Research Bioinformatics Education and Training Collection (f1000research.com/collections/bioinformaticsedu?selectedDomain=documents).

10 Disclaimer

Every effort has been made to ensure the accuracy of this Guide; GOBLET cannot be held responsible for any errors/omissions it may contain and cannot accept liability arising from reliance placed on the information herein.


About the organisations

GOBLET

GOBLET (Global Organisation for Bioinformatics Learning, Education & Training; www.mygoblet.org) was established in 2012 to unite, inspire and equip bioinformatics trainers worldwide; its mission, to cultivate the global bioinformatics trainer community, set standards and provide high-quality resources to support learning, education and training.

GOBLET’s ethos embraces:
• inclusivity: welcoming all relevant organisations & people
• sharing: expertise, best practices, materials, resources
• openness: using Creative Commons Licences
• innovation: welcoming imaginative ideas & approaches
• tolerance: transcending national, political, cultural, social & disciplinary boundaries

Further information can be found in the following references:
• Attwood et al. (2015) GOBLET: the Global Organisation for Bioinformatics Learning, Education & Training. PLoS Comput. Biol., 11(5), e1004281.
• Corpas et al. (2014) The GOBLET training portal: a global repository of bioinformatics training materials, courses & trainers. Bioinformatics, 31(1), 140-142.

GOBLET is a not-for-profit foundation, legally registered in the Netherlands. For general enquiries, contact [email protected].

SIB Swiss Institute of Bioinformatics

The SIB Swiss Institute of Bioinformatics is a non-profit academic organisation, created in 1998, whose mission is to lead and coordinate the field of bioinformatics in Switzerland.

SIB provides bioinformatics services and resources for scientists and clinicians from academia and industry in Switzerland and worldwide. Its data-science experts work to advance biological and medical research, and enhance health: www.sib.swiss. The SIB was one of the founders of GOBLET.

About the authors

Marie-Claude Blatter (orcid.org/0000-0002-7474-1499)§

Marie-Claude Blatter leads the outreach activities of the SIB Swiss Institute of Bioinformatics. She coordinates and participates in in-house and external teaching, including high-school teacher and bioinformatics education outreach programmes, activities for the public (more details at www.sib.swiss/about-sib/what-is-bioinformatics), and Geneva University’s bioinformatics BSc and MSc teaching programmes. She is also involved in documentation and user support for the UniProt Knowledgebase.
• Blatter MC et al. (2016) The Metagenomic Pizza: a simple recipe to introduce bioinformatics to the layman. EMBnet J., 22, e864.
• Daina A et al. (2017) Drug Design Workshop: a Web-based educational tool to introduce computer-aided drug design to the general public. J. Chem. Educ., 94(3), 335–344.
• Blatter MC et al. (2019) Using Bioinformatics to Understand Genetic Diseases. F1000Research, 8, 272 (document).
• Attwood TK et al. (2020) Introducing computer-aided drug design – a practical guide. F1000Research, 9, 1412 (document).

Philippe Le Mercier (orcid.org/0000-0001-8528-090X)‡

Philippe Le Mercier is a resource manager at the SIB Swiss Institute of Bioinformatics. He specialises in molecular virology and coordinates the ViralZone Web resource (viralzone.expasy.org). He participates in the annotation of viral proteins and cellular representation (www.swissbiopics.org) in the UniProt Knowledgebase.
• Hufsky F et al. (2020) The International Virus Bioinformatics Meeting 2020. Viruses, 12(12), 1398.
• Sigrist CJ et al. (2020) A potential role for integrins in host cell entry by SARS-CoV-2. Antiviral Res., 177, 104759.
• Hulo C et al. (2017) The ins and outs of eukaryotic viruses: Knowledge base and ontology of a viral infection. PLoS One, 16, 12.

Teresa K Attwood (orcid.org/0000-0003-2409-4235)¥

Teresa (Terri) Attwood is Professor emerita of Bioinformatics, with more than 25 years’ experience teaching introductory bioinformatics in BSc, MSc and PhD programmes, and in ad hoc courses, workshops and summer schools, in the UK and abroad. She wrote the first introductory bioinformatics textbook in 1999; her third book was published in 2016:
• Attwood TK & Parry-Smith DJ. (1999) Introduction to Bioinformatics. Prentice Hall.
• Higgs P & Attwood TK. (2005) Bioinformatics & Molecular Evolution. Wiley-Blackwell.
• Attwood TK, Pettifer SR & Thorne D. (2016) Bioinformatics challenges at the interface of biology and computer science: Mind the Gap. Wiley-Blackwell.

Affiliations

§,‡ SIB Swiss Institute of Bioinformatics, Swiss-Prot Group, Geneva (CH); ¥ Department of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL (UK).

The GOBLET Foundation, CMBI Radboud University, Nijmegen Medical Centre, Geert Grooteplein 26-28, 6581 GB Nijmegen (NL)