EVOLUTIONARY TRENDS IN VIRAL PATHOGENS WITHIN AND BETWEEN OUTBREAKS

A dissertation submitted

to Kent State University in partial

fulfillment of the requirements for the

degree of Doctor of Philosophy

by

Mary E. Saha

December 2017

© Copyright

All rights reserved

Except for previously published materials

Dissertation written by

Mary E. Saha

B.S., University of Akron, 2008

Ph.D., Kent State University, 2017

Approved by

Dr. Helen Piontkivska, Chair, Doctoral Dissertation Committee

Dr. Gary Koski, Members, Doctoral Dissertation Committee

Dr. Christopher Woolverton

Dr. Tara Smith

Dr. Walter Hoeh

Dr. Gail Fraizer

Accepted by

Dr. Laura Leff, Chair, Department of Biology

Dr. James Blank, Dean, College of Arts and Sciences

Table of Contents

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

CHAPTER 1: INTRODUCTION
1.1 RNA viruses
1.2 Influenza A
1.3 Challenges in Influenza Sampling
1.4 Ebolavirus
1.5 Immune Response against Ebolavirus
1.6 Ebolavirus Outbreaks and Public Health
1.7 Ebolavirus Evolutionary Questions
1.8 Research Goals
1.9 Overview of Subsequent Chapters
1.10 References

CHAPTER 2: SAMPLING ISSUES IN INFLUENZA A ANALYSIS: AN APPROACH TO DEALING WITH OVERSAMPLING
2.1 Introduction
2.2 Hypotheses
2.3 Methods
2.4 Results
2.5 Discussion and Conclusions
2.6 References

CHAPTER 3: GENOME-WIDE MOLECULAR SUBSTITUTION PATTERNS IN EBOLAVIRUS
3.1 Introduction
3.2 Hypothesis
3.3 Methods
3.4 Results
3.5 Discussion and Conclusions
3.6 References

CHAPTER 4: DISTRIBUTION OF POSITIVELY AND NEGATIVELY SELECTED SITES IN THE EBOLAVIRUS GENOME
4.1 Introduction
4.2 Hypotheses
4.3 Methods
4.4 Results
4.5 Discussion and Conclusions
4.6 References

CHAPTER 5: SUMMARY AND FUTURE DIRECTIONS
References

APPENDICES
Chapter 1
Chapter 2
Chapter 3
Chapter 4

List of Figures

Figure 1.1: Synonymous substitution rates in RNA viruses
Figure 1.2: Percentages of HA Influenza A sequences available for the top ten countries (or territories) in the NCBI Influenza Database (2013)
Figure 1.3: The number of publications in PubMed for Influenza A and Ebolavirus by date
Figure 1.4: Infection Progression of Ebolavirus
Figure 1.5: Ebolavirus life cycle and immune system effects
Figure 1.6: Map of 2014 outbreak with the number of cases
Figure 1.7: Map of the 2017 outbreak in the Democratic Republic of the Congo (DRC, formerly Zaire)
Figure 2.1: Workflow of analysis steps
Figure 2.2: dN Values by Country and Year
Figure 2.3: dS Values by Country and Year
Figure 2.4: Nucleotide Diversity Values
Figure 2.5: dN Values
Figure 2.6: dS Values
Figure 2.7: dN/dS values by number of sequences used and color coded by the source article
Figure 3.1: Phylogenetic Tree of Ebolavirus Sequences
Figure 3.2: Phylogenetic pairs mean dN
Figure 3.3: Phylogenetic pairs mean dS
Figure 3.4: GP Between Pairs dN
Figure 3.5: GP Within Pairs dN
Figure 3.6: Phylogenetic pairs within group values
Figure 3.7: 2014 outbreak dN-dS values for within phylogenetic pair comparisons
Figure 3.8: Other outbreak dN-dS values for within phylogenetic pair comparisons
Figure 3.9: GP 50% epitope dN-dS
Figure 4.1: Epitope Regions and Polymorphic Sites in Three Ebolavirus Genes
Figure 4.2: Epitope Regions and Polymorphic Sites in Three Ebolavirus Genes
Figure 4.3: Epitope Regions and Polymorphic Sites in Three Ebolavirus Genes
Figure 4.4: Structures and Polymorphic Sites in Six Ebolavirus Genes
Figure 4.5: Relative Rate in Five Ebolavirus Genes
Figure 4.6: Antiviral Treatments in HIV
Figure 4.7: Ebolavirus life cycle and immune system effects
Figure 4.8: GP structure
Supplementary Figure 3.1: Phylogenetic pairs within group values
Supplementary Figure 4.1: Epitope Regions and Point Mutations in Three Ebolavirus Genes 50% Epitope Threshold
Supplementary Figure 4.2: Epitope Regions and Point Mutations in Three Ebolavirus Genes 90% Threshold
Supplementary Figure 4.3: Protein Structures and Point Mutations in Three Ebolavirus Genes

List of Tables

Table 1.1: History of Ebolavirus outbreaks
Table 1.2: Overview of Ebolavirus epidemic challenges
Table 2.1: Sequences Used in the Analysis
Table 3.1: Sequences used and countries and outbreaks of origin
Table 3.2: Values for Different Outbreaks
Table 3.3: Whole gene values compared to subset means
Table 3.4: Epitope and Non-epitope values
Table 3.5: Epitope and Non-epitope dN and dS values
Table 3.6: Nonzero Within Pair Values
Table 4.1: Sequences used and countries and outbreaks of origin
Table 4.2: Within Pair Differences
Supplementary Table 1.1: Influenza A Sequences
Supplementary Table 2.1: Diversity values
Supplementary Table 2.2: Average dN values
Supplementary Table 2.3: Average dS values
Supplementary Table 3.1: Sequence Information
Supplementary Table 3.2: Epitope Information

ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Dr. Helen Piontkivska, for all of her guidance and support. Without her I would never have found the path that led me to this research and to the field of study I am now passionate about. I would also like to thank my advisory committee, Dr. Gary Koski, Dr. Gail Fraizer, Dr. Walter Hoeh, and Dr. Tara Smith, for all of their encouragement and helpful feedback throughout my graduate student career. I would also like to thank Dr. Chris Woolverton for agreeing to be my Graduate Representative and for his clear and constructive criticism of this dissertation. I would also like to acknowledge the Kent State Biology Department, which has been a wonderful place to work and to learn.

In addition, I want to thank my lab mates, Reeba Paul, Madara Hetti-Archchilage, and Noel-Marie Plonski, for all of their help and support. I would also like to thank Chanelle Waligura, an undergraduate assistant, for all her hard work and creativity.

Lastly, I would like to thank my family for their support. This includes my parents, John and Karen Halpin, my husband, Mukul Saha, my brothers, Mike, Tim, and Larry, and my extended family.

Dedicated to my father and mother, John and Karen, who fostered my curiosity and desire to learn, and to my husband, Mukul, who is my most ardent supporter.

CHAPTER 1: Introduction

This chapter is divided into two parts. The first part provides an overview of the RNA viruses involved in this project (Influenza A and Ebolavirus (Zaire)), their biology, and their public health impact (Sections 1.1 through 1.7). The second part highlights the specific aims of this project and delineates the overall structure of the dissertation (Sections 1.8 and 1.9).

1.1 RNA viruses

In the modern world, a local outbreak of an infectious disease can very easily become an international problem [e.g., SARS (2002-2003 (1)), MERS-CoV (2012-present (2)), or influenza (e.g., 2009 (3))]. As seen with the recent Ebolavirus epidemic (that began in late 2013 (4)), neglecting a disease because it is not currently a problem in one's own country is dangerous. As shown in Figure 1.1, RNA viruses have a high rate of change. This is due to their high mutation rate and lack of error correction (5). RNA viruses can therefore evolve rapidly. This is more dangerous when paired with their ability to disperse, spreading worldwide quickly by taking advantage of the same means of travel that their human and animal hosts do. Thus, it is important to be able to follow and predict likely evolutionary changes of these viruses, particularly changes in the direction of increased pathogenicity and/or communicability. This can allow quick movement of resources to areas where conditions facilitate such changes (e.g., areas with high interaction with reservoirs).

Figure 1.1: Synonymous substitution rates in RNA viruses (From 6). For clarity the “–viridae” at the end of family names has been omitted. This figure contains a wide range of RNA viruses: positive single-stranded (Astroviridae through Arteriviridae), negative single-stranded (Arenaviridae through Orthomyxoviridae), double-stranded (Reoviridae), and the reverse-transcribing viruses (Retroviridae). The y axis is the log synonymous substitution rate. The variation in rates can potentially be attributed to other factors, including transmission mode and type of infection.

Emerging and recurring pathogens are among the biggest challenges to public health. Whether it is an emerging disease like Nipah virus, Zika virus, or Chikungunya virus [which, although discovered in 1952, has just begun to spread worldwide (7)], or a recurring disease such as cholera, yellow fever, or influenza, surveillance and quick response times are integral. Outbreaks start small but can spread quickly, depending on the mode of transmission. Zika, Chikungunya, and yellow fever viruses are transmitted via mosquito vectors. For other diseases, like Ebolavirus, influenza, and Nipah viruses, the carriers are humans and animals, including bats and livestock. Monitoring populations of all carriers is vital to catching outbreaks before they become widespread. To this end, sequencing of viral samples from both human and animal hosts across all known locations for a virus is essential to staying ahead of possible mutations and host jumps, and to developing and updating effective treatment and prevention options.

1.2 Influenza A

1.2.1 Influenza Biology

Influenza A is an RNA virus of the family Orthomyxoviridae. Influenza A has a single-stranded RNA genome containing eight segments. There are two surface glycoproteins, encoded by segments four and six: hemagglutinin (HA) and neuraminidase (NA). These are important vaccine targets. Hemagglutinin binds a surface receptor and mediates entry into host cells. It is also a key target for the immune system (8-10). Neuraminidase is involved in release of viral buds from infected cells. There are 16 known subtypes of HA and 9 subtypes of NA. These account for the naming conventions of flu strains (e.g., H1N1, H3N2). Currently, only a limited number of combinations infect humans. These are H1N1, H2N2, H3N2, H5N1, H7N7, and H9N2 (11) and, more recently, H7N9 in China (11). Other proteins, including one encoded by a frameshift (PB1-F2), have also been implicated in influenza virulence (13, 14).

1.2.2 Impact of Influenza

Seasonal strains of influenza A infect around 500 million people a year, resulting in 250,000-500,000 deaths, mainly in the elderly and young children (15). Some of these deaths can be directly attributed to the flu, but the majority are likely caused by pneumonia from secondary infections, which take advantage of the already weakened immune system (11). The economic costs of seasonal influenza are difficult to estimate for nonfatal infections, but they include lost work hours and productivity. An estimate in the US based on the 2003 population puts the costs of influenza per year (hospitalization, doctors' office visits, medications, etc.) at approximately $4.6 billion. In addition, up to 111 million workdays are lost, costing an estimated $7 billion/year in sick days and lost productivity (16).

1.2.3 Vaccine Development

Hemagglutinin (HA) and neuraminidase (NA) are the primary focus of vaccine development and influenza treatment (17). Both proteins have receptor regions which are necessary for protein function, but also variable enough to avoid immune system detection (18). Vaccines that target the more conserved stalk region of HA are being developed after it was shown that the antibodies for this region can be cross-protective; however, strains with a mutated short stalk have been discovered (19). NA is the target of drugs like oseltamivir (Tamiflu), which prevent the budding of new viruses from the host cell. Unfortunately, it has been shown that some strains are already developing resistance to oseltamivir and other NA-targeted drugs (20, 21).

Further, new strains of the flu are developing constantly because of antigenic drift, due to point mutations during RNA replication (roughly one mutation per replication cycle (22)), and antigenic shift, which is the reassortment of flu segments that occurs within cells infected by more than one strain (18). The latter facilitates host jumps and is mediated by the segmented genome structure. The former can lead to mutations that allow the virus to escape immune system detection (22, 23).

1.2.4 Limitations and Challenges for the Influenza Vaccine

Vaccines against influenza A have been available for over 60 years (15). However, there is a lag period of up to seven months between identification of a pandemic or seasonal strain and completion of vaccine development and distribution in the US alone. The 2009 pandemic strain had outbreaks early in the year (April 2009 in the US) and had reached 74 countries by June of that year (24). Vaccine was made available in the US in November 2009 (12). On multiple occasions, seasonal vaccines had diminished effectiveness because they targeted strains that turned out not to be prevalent during that particular flu season (as in the 2008-2009 flu season) (25). Thus, it is critical to be able to identify strains that are the most likely to pose a public health threat because they are highly communicable, highly virulent, or both.

The approach used for strain selection relies on the WHO’s Global Influenza Surveillance Network of 83 countries to identify the three most virulent strains for use in the vaccine each year (15). These countries may have different public health priorities, so sampling may be biased toward one country or region. In this case, the strains showing up most often in the most highly sampled countries are generally selected, while more virulent strains can be missed. This can lead to serious issues, most notably when the prominent strain type changes (e.g., 1968-1969 flu season, 1993-1996 seasons) (25). The current influenza vaccine contains one H3N2 strain and one H1N1 strain along with an influenza B strain (12). This is due to the co-circulation of strain types caused by past pandemics. H1N1 was the dominant type through the 1918 pandemic, and then in 1968-1969 H3N2 caused a pandemic. Since then both types have become seasonal and are included in vaccines (15).

1.3 Challenges in Influenza Sampling

There are multiple reasons why sampling bias exists, such as countries lacking resources for detailed surveillance, or influenza not being a health priority in countries with widespread HIV and malaria (26). Likewise, some areas (e.g., areas where tourism is essential to the economy) tend to get better surveillance than others and have developed in-country vaccine production programs (27). Due to these factors, the available sequence data in databases are unevenly distributed among geographic areas. For example (Supplementary Table 1.1, Figure 1.2), as of October 28, 2013, the most represented countries (US, China, Singapore) accounted for 48.9% of the sequences from 165 countries (28). This pattern of over- and under-sampling continues, as shown in Figure 1.2 in red. Out of 33,593 HA sequences, 18,184 (54%) are from the USA, 1,462 (4%) from Singapore, 1,205 (3.5%) from the United Kingdom, and 1,102 (3%) from China.
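The shares quoted above follow directly from the sequence counts. As a minimal sketch, the Python snippet below recomputes them from the counts given in the text; the variable and function names are illustrative only and are not part of the analysis pipeline used in this dissertation. Small differences from the quoted percentages reflect rounding.

```python
# HA sequence counts quoted in the text (updated figures, August 2017).
TOTAL_HA = 33_593
counts = {
    "USA": 18_184,
    "Singapore": 1_462,
    "United Kingdom": 1_205,
    "China": 1_102,
}


def share(count, total):
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


# Percentage of all HA sequences contributed by each country.
shares = {country: share(n, TOTAL_HA) for country, n in counts.items()}

for country, pct in shares.items():
    print(f"{country}: {pct:.1f}%")

# Combined share of the four countries listed above.
combined = share(sum(counts.values()), TOTAL_HA)
print(f"Combined: {combined:.1f}%")
```

The combined share shows how strongly a handful of countries dominate the database, which is the imbalance the following sections address.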

Flu surveillance is essential for pandemic prevention. Identifying the conditions that lead to species jumps allows for clarification and focus of prevention strategies. However, the limited number and geographic range of sequences available from developing countries and countries with other public health priorities impair that identification. Species jumps happen more often in countries where people are more likely to interact with live animals on a day-to-day basis on farms or in marketplaces (24). Highly developed countries have monetary and infrastructural advantages in implementing and maintaining a sampling network for influenza, whereas less developed nations lack the means to easily take a representative number of samples from throughout their country.

Figure 1.2: Percentages of HA Influenza A sequences available for the top ten countries (or territories) in the NCBI Influenza Database (2013). Slices are proportionate to percentage of available HA sequences. This includes all influenza A strain types, but whole gene sequences only. Updated percentages shown in red (August 2017)

Thus, the important issue is how to overcome limitations of existing databases that are biased in terms of available genetic sequences (strain types, areas sampled, etc.). One approach often used in studies is to limit the geographic area of analysis (e.g., 29, 30), which may miss global trends. Another approach is to limit the strain type or time frame included in the analysis (e.g., 23, 31), which may underestimate the global complexity of viral evolution patterns, including evolutionary interaction of surface proteins (32, 17). Therefore, for studies that aim for an unbiased global representation of strains, one approach would be to use random sampling to achieve a more representative dataset from those areas that are oversampled. The presence of multiple circulating strain types and the presence of less common strains (e.g., H5N1) are important to the overall picture of influenza distribution and evolution, but many exist in such small numbers that analyzing them separately may not be a feasible option. Knowing how such outlier strains affect overall statistics of the group is essential, especially when averages are considered.
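To make the idea of random sampling from oversampled areas concrete, the following Python sketch caps the number of sequences retained per country. It is a generic illustration under stated assumptions, not the re-sampling procedure developed in Chapter 2: the function name `downsample_by_country`, the `cap` parameter, and the toy sequence identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def downsample_by_country(records, cap, seed=0):
    """Randomly keep at most `cap` sequences per country.

    records: iterable of (sequence_id, country) tuples.
    Returns a dict mapping each country to its retained sequence IDs.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group sequence IDs by country of origin.
    by_country = defaultdict(list)
    for seq_id, country in records:
        by_country[country].append(seq_id)

    sampled = {}
    for country, ids in by_country.items():
        if len(ids) > cap:
            # Oversampled country: draw `cap` IDs without replacement.
            sampled[country] = sorted(rng.sample(ids, cap))
        else:
            # Undersampled country: keep every available sequence.
            sampled[country] = list(ids)
    return sampled


# Hypothetical toy dataset: one heavily sampled country and two sparse ones.
records = (
    [(f"US_{i}", "USA") for i in range(50)]
    + [(f"SG_{i}", "Singapore") for i in range(5)]
    + [("KE_0", "Kenya")]
)
subset = downsample_by_country(records, cap=10, seed=42)
for country, ids in subset.items():
    print(country, len(ids))
```

Repeating the draw with different seeds would give a sense of how sensitive a downstream statistic is to which sequences from the oversampled country happen to be retained.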

Influenza A is an example of a virus which, when spreading on a local scale, can be considered more of a public health nuisance than an emergency. Yet, when the virus mutates into a virulent form and begins to spread quickly, it can rapidly become a public health emergency (e.g., 2009 H1N1 pandemic). Despite a consistent effort spent on studying influenza and vaccine development (see Figure 1.3), ongoing mutations in the influenza genome make it an ever-present public health challenge. Further, the large-scale sequencing efforts, while much larger than for any other virus (footnote 1), are limited in their geographical range. Specifically, the majority of influenza A sequences are derived from developed countries, thus providing a biased sample for molecular evolutionary analysis. The broad availability of influenza A sequences is also limited to recent years, during and after the 2009 pandemic, necessitating development of sampling approaches that would take this into account.

Footnote 1: About 500,000 sequences in the Flu database at NCBI http://www.ncbi.nlm.nih.gov/genomes/FLU/aboutdatabase.html vs. the next runner up, HIV, with over 95,000 sequences http://www.oxfordjournals.org/our_journals/nar/database/summary/76 (as of Aug 2017).

This is also true for many other viruses. Ebolavirus is heavily sampled in the 2014 outbreak, which was the most widespread and reached urban areas. The period following the beginning of the 2014 outbreak is also when the number of publications begins to increase rapidly (Figure 1.3). Previous outbreaks have few or no sequences (Table 1.1).

Adequate surveillance and sampling of emerging and recurring pathogens is necessary to prevent pandemics. This is true not only for humans, but for livestock as well. The Nipah virus caused over one million pigs to be put down in Malaysia in 1999, which was a huge blow to the country's economy (33). In Chapter 2 I will describe a re-sampling approach I developed to capture geographic sequence variation patterns in overrepresented datasets.

Figure 1.3: The number of publications in PubMed for Influenza A and Ebolavirus by date. The blue circles represent the number of influenza A publications per year. The red squares represent the number of Ebolavirus publications per year.

1.4 Ebolavirus

1.4.1 Ebolavirus Biology and Proteins

Ebolavirus (EV) is an enveloped, negative-strand RNA virus of the family Filoviridae. Its ~19,000 nucleotide-long genome is non-segmented and contains seven genes. The five types of EV are distinguished by PCR or ELISA assays (34), namely: Zaire (EBOV), responsible for the majority of human outbreaks, including the current one; Sudan; Reston; Tai Forest (also known as Côte d’Ivoire); and Bundibugyo (35). All but the Reston strain can result in severe hemorrhagic fever in humans. This project will focus on the Zaire strains due to the availability of these strains from the current (2013-present) outbreak in West Africa, and their high fatality rate (34).

The Ebolavirus genome encodes 8 proteins. The nucleoprotein (NP), the viral proteins (VP) VP35 and VP30, and the L protein (an RNA-dependent RNA polymerase) form the nucleocapsid of the virus and are bound to the viral genome to mediate replication. VP40 is a matrix protein similar to influenza’s NA protein, stimulating viral budding and release. VP24 is another matrix protein that helps in nucleocapsid formation and assembly. The last protein is the surface glycoprotein (GP), which is involved in attachment and entry into cells (like flu HA). Unlike the other genes, GP undergoes mRNA editing to create multiple proteins, including GP1, GP2, sGP, ssGP, and the Δ peptide. GP1 and GP2 form the complex for the surface protein. A shorter version of GP1, sGP, has been shown to have effects on Ebolavirus pathogenicity (36).

As shown in Figure 1.5, once the virus attaches to the cell using its surface glycoprotein (GP), the Ebolavirus enters the cell through macropinocytosis. GP then allows for the fusion of the viral and endosomal membranes after acidification of the endosome, which releases the viral ribonucleocapsid into the host cell. There, the negative-strand RNA genome goes through transcription and replication. For RNA synthesis NP, VP35, and L are required, with VP30 being essential for initiation of transcription, but not for replication. Once the virus has replicated, new viruses are packaged at the cell membrane, where VP40 is the matrix protein, aiding in viral budding. VP24 has a role in regulating virus production at the cell membrane (37).

Figure 1.4: Infection Progression of Ebolavirus. Ebolavirus spreads from the initial infection site to the lymph nodes, liver and spleen. Soluble factors from these cells also lead to vascular leaking (from 38) (IL=interleukin. MCP-1=monocyte chemoattractant protein-1. MIPs=macrophage inflammatory proteins. NO=nitric oxide. TNFα=tumour necrosis factor α.) (Reproduced with permission of LANCET PUBLISHING GROUP in the format Thesis/Dissertation via Copyright Clearance Center)

1.4.2 Ebolavirus Transmission and Disease Progression

Ebolavirus is transmitted through person-to-person and animal-to-person contact involving multiple types of bodily fluids. The virus enters through mucosal membranes and through cuts and abrasions in the skin, and may also be transmitted sexually (39). The virus spreads from the initial infection site to regional lymph nodes, liver and spleen with the help of infected monocytes, macrophages, and dendritic cells. Unlike many viruses, Ebolavirus infects a wide range of cell types, including monocytes, macrophages, dendritic cells, endothelial cells, fibroblasts, hepatocytes, and several types of epithelial cells (38, Figure 1.4). Monocytes, macrophages, and dendritic cells seem to be favored for replication and assist in disease spread. The liver and adrenal gland are early Ebolavirus targets, which may explain the hemorrhagic tendencies in some cases (the liver produces proteins that aid in coagulation and the adrenal gland helps in controlling blood pressure) (38, Figure 1.4).

In the first few days of illness, weakness, fatigue, and fever are the most prominent symptoms. This makes diagnosis difficult due to the overlap with many other co-circulating diseases, such as malaria, yellow fever, or influenza. Within the first week of active infection, nausea, vomiting and diarrhea are often reported. Because the patient is highly infectious at this time, these symptoms can be extremely dangerous to others if the Ebolavirus diagnosis has not yet been made. Week two is often the decisive period: either the adaptive immune response takes effect and the patient begins to recover, or it does not and the patient worsens. Organ failure, including renal and respiratory failure, can occur, as well as meningoencephalitis (inflammation of the brain and surrounding tissues). Lung failure can also occur and may be due to infected alveolar macrophages inducing inflammation within the lung. Other symptoms include eye pain and blurred vision, rashes, cardiac issues including severe arrhythmias, and abdominal pain (40).

Ebolavirus (Zaire) can have a fatality rate of up to 88% (35). Recovery can be slow and the virus can remain in the system for months afterwards. It has been found in semen more than 12 months after infection (41), and has also been found in breast milk months after infection (42). Other long-term effects that can result from Ebolavirus infection include musculoskeletal pain, eye pain, blurry vision, hearing loss, abdominal pain, headache, and memory issues (43). In addition to physical health concerns, there are also mental health issues. These include grief, depression, and drug use. These issues can be due to loss of loved ones and alienation from society due to recovered status. To avoid this, community outreach and education are vital (43).

1.5 Immune Response against Ebolavirus

Figure 1.5: Ebolavirus life cycle and immune system effects. Shown above is the life cycle of the Ebolavirus and the points at which its proteins interact with the immune system. Adapted from 44 & 45.

Although Ebolavirus does not infect lymphocytes, their rapid loss by apoptosis is a prominent feature of the disease, impairing both T and B cell responses. The direct interaction of lymphocytes with viral proteins cannot be discounted as having a role in their destruction, but the substantial loss of lymphocytes probably results from a combination of factors, including infection-mediated impairment of dendritic cells and release of soluble factors from monocytes and macrophages. Soluble factors released from target cells also contribute to the impairment of dendritic cells and to the release of other soluble factors from monocytes and macrophages (46).

Antibodies against GP1 can be neutralizing, enhancing, or neither. The enhancement occurs due to the complement protein C1q. C1q can interact with GP1, the bound antibodies, and C1q ligands on the cell surface to facilitate virus binding to the cell (47). This raises further concerns about relying solely on GP in the vaccine, as the antibodies produced can be neutral toward, or even aid in, infection instead of being neutralizing, which is essential for vaccine efficacy.

Four of the seven genes of the Ebolavirus code for proteins that affect the immune system and its response: VP35, VP40, VP24, and GP. VP35 not only helps in RNA synthesis but also impairs the maturation of dendritic cells by blocking RLR (RIG-I-like receptor) signaling (48). In addition, VP35 blocks the cytoplasmic sensors RIG-I (retinoic acid-inducible gene I) and MDA5 (melanoma differentiation-associated gene 5), which leads to the inhibition of interferon regulatory factors 3 and 7, thus blocking the type I interferon response. This protein also inhibits the antiviral PKR kinase (44) and has been shown to inhibit RNA interference (RNAi) (49). VP40 has demonstrated the ability to be cytotoxic and cause cell rounding in vitro (37). The VP24 protein also blocks interferon signaling by interfering with STAT1 nuclear translocation through binding to importins α5, α6, and α7 (49, Figure 1.5).

The GP gene, however, has the largest effect on the immune system of these four, encoding multiple proteins that impair the immune response. Secreted GP (sGP) is the most abundant product of this gene, making up 70% of the GP produced during an infection (50). It binds to antibodies outside the cell, acts as an anti-inflammatory, impairs neutrophil movement and activation, and increases endothelial cell barrier function (37). GP1,2 is a two-part surface protein which can cause cytokine dysregulation and endothelial cell dysfunction when cleaved from the cell surface (47). GP1 (the surface portion) contains the mucin domain, which is cytotoxic. It is also the portion that binds to the Niemann-Pick C1 cholesterol transporter to facilitate cell entry by endosomal membrane fusion. GP2 (the transmembrane portion) counteracts tetherin to aid in viral budding and release. The Δ peptide encoded by the GP gene inhibits further Ebolavirus entry, preventing superinfection (47).

1.6 Ebolavirus Outbreaks and Public Health

The Ebolavirus first became known in 1976 with outbreaks in southern Sudan and northern Zaire (now the Democratic Republic of the Congo (DRC)). These outbreaks were caused by the Sudan and Zaire types of Ebolavirus, respectively. There were roughly 300 cases each, with a fatality rate of 88% for the Zaire Ebolavirus and 53% for the Sudan Ebolavirus (35). Additional outbreaks have been reported since, but none on the scale and severity of the 2014 outbreak (Table 1.1).
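The fatality rates cited here are simple ratios of reported deaths to reported cases. As a minimal sketch, the Python snippet below recomputes them for three outbreaks using the case and death counts listed in Table 1.1; the function name and the outbreak labels are illustrative choices, not terminology from the cited sources.

```python
def case_fatality_rate(deaths, cases):
    """Return reported deaths as a percentage of reported cases."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return 100.0 * deaths / cases


# (reported cases, reported deaths) taken from Table 1.1.
outbreaks = {
    "1976 Zaire (DRC)": (318, 280),
    "1976 Sudan": (284, 151),
    "2014-2016 West Africa": (28_616, 11_310),
}

# Fatality rate for each outbreak, in percent.
cfr = {
    name: case_fatality_rate(deaths, cases)
    for name, (cases, deaths) in outbreaks.items()
}

for name, rate in cfr.items():
    print(f"{name}: {rate:.1f}%")
```

Note that the 2014-2016 outbreak has by far the most deaths but a lower reported fatality rate than the 1976 outbreaks, which is why both absolute counts and rates are listed in the table.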

Table 1.1: History of Ebolavirus outbreaks. The number of sequences in the all Ebola data set is listed, as well as details about each outbreak (51). The outbreaks are listed in chronological order, from the oldest to the most recent.

Number of sequences | Year(s) | Country | Ebola subtype | Reported number of human cases | Reported number (%) of deaths
0 | 1976 | England | Sudan virus | 1 | 0
0 | 1976 | Sudan (South Sudan) | Sudan virus | 284 | 151 (53%)
4 | 1976 | Democratic Republic of the Congo (DRC) | Ebolavirus | 318 | 280 (88%)
1 | 1977 | Zaire | Ebolavirus | 1 | 1 (100%)
1 | 1979 | Sudan (South Sudan) | Sudan virus | 34 | 22 (65%)
0 | 1989 | USA | Reston virus | 0 | 0
0 | 1989-1990 | Philippines | Reston virus | 3 (asymptomatic) | 0
0 | 1990 | USA | Reston virus | 4 (asymptomatic) | 0
0 | 1992 | Italy | Reston virus | 0 | 0
2 | 1994 | Côte d'Ivoire (Ivory Coast) | Taï Forest virus | 1 | 0
1 | 1994 | Gabon | Ebolavirus | 52 | 31 (60%)
4 | 1995 | DRC | Ebolavirus | 315 | 250 (81%)
0 | 1996 | Russia | Ebolavirus | 1 | 1 (100%)
0 | 1996 | Philippines | Reston virus | 0 | 0
0 | 1996 | USA | Reston virus | 0 | 0
0 | 1996 | South Africa | Ebolavirus | 2 | 1 (50%)
5 | 1996 | Gabon | Ebolavirus | 37 | 21 (57%)
0 | 1996-1997 | Gabon | Ebolavirus | 60 | 45 (74%)
1 | 2000-2001 | Uganda | Sudan virus | 425 | 224 (53%)
1 | 2001-2002 | Gabon | Ebolavirus | 65 | 53 (82%)
0 | 2001-2002 | Republic of the Congo | Ebolavirus | 57 | 43 (75%)
0 | 2002-2003 | Republic of the Congo | Ebolavirus | 143 | 128 (89%)
1 | 2003 | Republic of the Congo | Ebolavirus | 35 | 29 (83%)
0 | 2004 | Russia | Ebolavirus | 1 | 1 (100%)
1 | 2004 | Sudan (South Sudan) | Sudan virus | 17 | 7 (41%)
7 | 2007 | DRC | Ebolavirus | 264 | 187 (71%)
3 | 2007-2008 | Uganda | Bundibugyo virus | 149 | 37 (25%)
0 | 2008 | Philippines | Reston virus | 6 (asymptomatic) | 0
0 | 2008-2009 | DRC | Ebolavirus | 32 | 15 (47%)
1 | 2011 | Uganda | Sudan virus | 1 | 1 (100%)
4 | 2012 | DRC | Bundibugyo virus | 36* | 13* (36.1%)
5 | 2012 | Uganda | Sudan virus | 11* | 4* (36.4%)
0 | 2012-2013 | Uganda | Sudan virus | 6* | 3* (50%)
180 | 2014-2016 | Sierra Leone, Guinea, Liberia, Mali, | Ebolavirus | 28,616 | 11,310 (39.5%)
2 | 2014 | DRC | Ebolavirus | 66 | 35 (53%)
  | 2017 | DRC | Ebolavirus | 8 | 4 (50%)

1.6.1 The 2014 Outbreak

The recent outbreak of Ebolavirus began in December of 2013 in southern Guinea (4). This outbreak was by far the largest, both in geographic area and cases/fatalities (shown in Figure 1.6). The WHO declared the outbreak a Public Health Emergency of International Concern in August 2014.

Bringing this epidemic under control posed many challenges. Many stemmed from conditions in the region, including a lack of infrastructure, supplies, and health care professionals (4). The biggest impediments, however, were local beliefs and a lack of faith in modern medicine, coupled with a distrust of the West (4). For example, a riot erupted outside a hospital in Guinea that was treating Ebolavirus victims when a rumor spread that MSF (Doctors Without Borders) had brought in the disease (4).

Funeral practices that include washing and touching of the body by attendees, and even cleaning out of the bowels by preparers, have been linked to roughly 60% of cases in Guinea and 80% in Sierra Leone (4), where a single infection may spur hundreds of new ones. For example, the death of a single faith healer in Sierra Leone was linked to 365 Ebolavirus-related deaths and led to the outbreak spreading to Liberia (4). Further, the challenges of controlling this outbreak were not only sociological: the limited study of Ebolavirus and the small number of available strains had restricted the knowledge base about Ebolavirus in humans and about the best measures for containment and treatment (Table 1.2). Outbreak totals stand at 28,616 cases and 11,310 deaths (as of June 10, 2016, after the outbreak was declared over; see Figure 1.6).
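The percentages reported next to the death counts in Table 1.1 are case fatality rates, that is, deaths divided by reported cases. A minimal Python sketch of that arithmetic, using totals quoted in this section (the function name is illustrative and not part of this work):

```python
def case_fatality_rate(cases: int, deaths: int) -> float:
    """Return deaths as a percentage of reported cases."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return 100.0 * deaths / cases


# Totals quoted in the text and in Table 1.1.
west_africa = case_fatality_rate(28616, 11310)  # 2014-2016 outbreak
drc_1976 = case_fatality_rate(318, 280)         # 1976 outbreak in the DRC

print(f"2014-2016 West Africa: {west_africa:.1f}%")
print(f"1976 DRC: {drc_1976:.1f}%")
```

Because the denominator is the number of reported cases, the rate is sensitive to under-reporting of mild or asymptomatic infections, which is one reason the sampling questions raised later in this chapter matter.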


Figure 1.6: Map of 2014 outbreak with the number of cases

The density of confirmed cases is shown by the shades of blue from light (low) to dark (high). (52) http://apps.who.int/ebola/sites/default/files/thumbnails/image/sitrep_casecount_40.png?ua=1


1.6.2 2017 Outbreak in the Democratic Republic of Congo

In late April of 2017, cases of a febrile illness were reported in the Likati health zone of the DRC. As of July 2, 2017, the outbreak has been declared over. A total of five confirmed, three probable, and one suspected Ebolavirus cases have been reported (Figure 1.7). There have been four survivors and four deaths (the count of eight includes the probable cases). The last confirmed case was on May 11, 2017, and no further cases are expected (53). Contact tracing and ring vaccination with the new vaccine have been used as containment strategies.

However, another strategy that may help is the establishment of free health care clinics that treat the most common diseases in the area and provide maternity services and general surgery (53). This outreach to the community is a good way to strengthen ties and build the trust that is essential to handling outbreaks quickly and thoroughly. If people know to expect care instead of fearing the NGOs involved, they will be more likely to turn to them for help instead of hiding and remaining at home, which was a major issue in the 2014 outbreak (54). The 2017 outbreak also occurred in the country that had the first outbreaks of Ebolavirus; its communities are therefore more familiar with the severity of the disease and perhaps more trusting than those in which Ebolavirus is still a novel occurrence.


Figure 1.7: Map of the 2017 outbreak in the Democratic Republic of the Congo (DRC, formerly Zaire)

Confirmed (confirmé) cases and probable cases are marked as in the legend. A total of 583 contacts were traced in this outbreak. (53) http://apps.who.int/iris/bitstream/10665/255630/1/EbolaDRC-06062017.pdf

1.6.3 The Vaccine Development Efforts

The vaccine currently in use (rVSV-ZEBOV) was developed by the Public Health Agency of Canada. This vaccine uses the vesicular stomatitis virus (VSV), a virus which can cause mild flu-like symptoms in humans, in which one gene is replaced with the gene for the Ebolavirus glycoprotein (GP) (55). The vaccine contains only a single Ebolavirus protein and no live Ebolavirus. Trials were first conducted in animals (55); human trials were then pursued at a small scale (56), leading up to the large ring vaccination trial in Guinea (57). This trial had 11,841 participants, 5,837 of whom received the vaccine. There were no Ebolavirus cases ten or more days after vaccination among those who received the vaccine, but there were 23 cases among those who did not. The trial included three groups of both direct and indirect contacts, 2,041 of whom were given the vaccine 21 days after randomization (57). Given that the vaccine still requires a cold chain, ring vaccination will probably be the most frequent approach, as Ebolavirus tends to strike first in less developed areas with limited access and no electricity.

Table 1.2. Overview of Ebolavirus epidemic challenges

Challenge                                                                     References
Small number of available sequences from earlier outbreaks                    (see Table 3.1 for details)
Urgency driven by public health emergencies, coupled with deficient           34
  infrastructure, including lack of major research efforts prior to current
  outbreak
Reservoir still unknown                                                       58
Recent influx of Ebolavirus sequences (2013-) from a limited geographic       59
  range
Even if a vaccine is made available soon, there will be great challenges      60
  in distribution (cold chain)
Post-exposure treatment is needed                                             4
Because of lacking public health infrastructure, ideally, a cross-protective  4
  vaccine is needed, one that does not need a booster

Human outbreaks have been traced to direct human exposure to infected fruit bats or intermediate hosts, such as non-human primates. The virus is not transmitted through air or water (61). Human epidemics subsequently take off through direct human-to-human contact with bodily fluids or indirect contact with contaminated surfaces. Social events, like funerals where the body of the deceased is washed and touched by family and friends, have contributed to increases in outbreak size. Past outbreaks have been caused by different virus strains (mainly of the Zaire or Sudan type), and the genomes are poorly sampled in these smaller outbreaks, making it difficult to ascertain whether an observed epidemiological characteristic is unique to the causative strain. The high fatality rate, combined with the absence of treatment and vaccination options, makes Ebolavirus an important public health threat and a potential bioweapon. Since Ebolavirus is found in multiple bodily fluids, a contagious individual could infect many others using a shrapnel bomb. Given that suicide bombing is a common terrorist tactic, this is an especially horrific thought.

1.7 Ebolavirus Evolutionary Questions

1.7.1 Asymptomatic cases

A lack of comprehensive health care and surveillance mechanisms in the affected countries in Africa leaves open the possibility that some EBOV infections exist as chronic infections, without the severity of the acute symptoms. A study by Leroy et al. (62) of the 1996 outbreak in Gabon showed that although severe symptoms manifested in the majority of patients, some individuals remained asymptomatic. Notably, the viral load was relatively low in such patients, although they exhibited a strong inflammatory response (62). Likewise, mild cases have been reported in other outbreaks (e.g., 63). It remains unknown how long the virus may stay in the body, particularly in mild and/or asymptomatic cases, although evidence from the current outbreak suggests that Ebolavirus RNA can remain in the body for at least nine months (64; 41). This is vitally important because a lower viral load does not mean that the individual is incapable of transmitting the virus through bodily fluids like blood.

1.7.2 Chronic or Acute?

While a lot of effort is focused on elucidating the molecular biology of the infection process and controlling the current outbreak, very little is understood about the long-term evolution of EBOV, including what happens to the virus between outbreaks. Bats have been hypothesized to serve as EBOV reservoirs, although live virus has yet to be isolated from bats (44). Likewise, it is unclear whether Ebolavirus causes a chronic infection in its reservoir. Due to the severity of the symptoms, all diagnosed human EBOV cases are currently reported as acute Ebolavirus infections. Thus, it is important to distinguish between two possibilities: whether EBOV exists only as an acute infection in humans, or can also be present as a chronic infection in some individuals, thereby creating the possibility of infection through latent carriers. The latter possibility is particularly important for infection control. Further, although no significant differences were observed between EBOV sequences isolated from acute and asymptomatic cases (62), this finding was based on a comparison of partial sequences derived through PCR amplification, and hence signs of sequence divergence could have been missed.

1.7.2.1 Chronic vs. Acute Infection Expectations

The evolutionary scenarios can be expected to differ between chronic and acute infections. In acute infections, the virus enters a host, replicates quickly, and exits to infect another host. As shown by Hanada et al. (6), replication frequencies of RNA viruses capable only of acute infection are higher than those of RNA viruses capable only of persistent infection. This is also reflected in the higher synonymous substitution rates of acute viruses relative to persistent viruses shown in the same study. Further, producing a large viral load increases the likelihood of infecting new hosts. Thus, the high error rate during viral replication, paired with a higher replication rate, leads to a high mutation rate, and those mutations, likely random, should be spread somewhat evenly throughout the genome. Chronic infections, on the other hand, depend on long-term persistence in the host. Their transmission rates are generally lower (at least in their reservoirs) than those of acute infections, so the ability of the host to survive longer is important. Thus, due to persistent selective pressure from the immune system in chronic infections, the epitopes can be expected to accumulate amino acid substitutions indicating viral escape from immune pressure, e.g., escape from CD8+ T-lymphocytes, as has been shown in HIV (65, 66) and SIV (67, 68). Thus, elevated levels of amino acid sequence diversification at viral epitopes can point to a long-term relationship between the virus and the host (i.e., chronic infection).

Why is it important to know whether Ebolavirus is a chronic or an acute infection? Because it causes human outbreaks only occasionally. If its evolutionary behavior in the reservoir host between those outbreaks can be ascertained, it will inform the production of vaccines and treatments that will be effective across multiple outbreaks. It might also lead to knowledge about what Ebolavirus may do in the future. If it is a chronic disease in its reservoir, it may adapt to the human system and become less fatal over time. There is already evidence that Ebolavirus can cause asymptomatic infections in humans. This suggests that the human immune system can suppress the virus and keep it at levels low enough that no symptoms appear.

This project aims to distinguish between sequence divergence patterns that are attributable to acute and chronic infections. We used publicly available EBOV genomic sequences, including those from deep-sequencing studies, to analyze substitution patterns and to infer whether the virus likely experiences a chronic infection stage and how Ebolavirus evolves in its reservoir between human outbreaks. Our hypothesis is that if EBOV exists only as an acute infection, the high error rate during viral replication will lead to accumulation of mutations uniformly along the genome. On the other hand, if EBOV also exists in a chronic infection form, it will be subjected to selective pressure from the host immune system, specifically in epitope regions such as those recognized by cytotoxic T lymphocytes (CTL), leading to higher levels of mutations in epitopes compared to non-epitope regions, as predicted by an HIV study (5; 65).
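The contrast drawn by this hypothesis, uniform accumulation versus enrichment in epitopes, can be illustrated by comparing the density of substituted sites inside and outside epitope regions. The sketch below is a toy illustration in Python: the site coordinates are hypothetical, the function names are not from this work, and the two-proportion z statistic is one simple choice of comparison that treats sites as independent, not necessarily the test used in later chapters.

```python
import math


def region_counts(substituted_sites, epitope_sites, genome_length):
    """Return (epitope hits, epitope sites, non-epitope hits, non-epitope sites)."""
    epitope = set(epitope_sites)
    subs = set(substituted_sites)
    hits_epi = len(subs & epitope)
    return hits_epi, len(epitope), len(subs) - hits_epi, genome_length - len(epitope)


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the null that both regions share one substitution probability."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (hits_a / n_a - hits_b / n_b) / se


# Hypothetical toy genome: 1000 sites, of which sites 100-199 are epitope sites.
epitope_sites = range(100, 200)
substituted = [105, 120, 150, 180, 190, 300, 450, 600, 750, 900]

hits_epi, n_epi, hits_non, n_non = region_counts(substituted, epitope_sites, 1000)
z = two_proportion_z(hits_epi, n_epi, hits_non, n_non)
print(f"epitope density {hits_epi / n_epi:.4f}, "
      f"non-epitope density {hits_non / n_non:.4f}, z = {z:.2f}")
```

Under the acute-only scenario the two densities should be similar and z close to zero; a large positive z would be in line with the chronic-infection scenario.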

1.8 Research Goals

1.8.1 Aim 1: Influenza A: Sampling Issues in Influenza A: An Approach to Dealing with Oversampling

Sampling bias is a fundamental challenge in statistical analyses. In public health, choices made when sampling a population affect the outcomes of a study and, more broadly, the approaches to dealing with the disease (vaccines, treatments). This makes it vitally important to sample properly and to use the best available approaches to eliminate bias whenever possible. With sequence databases, the samples have already been taken, so the bias is built into the dataset. Examining the effects of subsampling on the countries with the largest available samples shows how having only a small fraction of the available sequences could affect the conclusions drawn by studies. This provides insight into potential methods for combating bias in existing datasets.

Hypothesis: When sampled properly, subsets can obtain the same trends as whole sets and allow the building of data sets that are more representative with less sampling bias.
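One way to check whether a subset reproduces the trend of the whole set is to recompute a summary statistic on repeated random subsamples and compare it with the full-set value. The sketch below does this in Python with mean pairwise Hamming distance on toy aligned sequences; the statistic, the sequences, and the function names are illustrative assumptions rather than the measures used in Chapter 2.

```python
import random
from itertools import combinations
from statistics import mean


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_distance(seqs):
    """Average Hamming distance over all unordered sequence pairs."""
    return mean(hamming(a, b) for a, b in combinations(seqs, 2))


def subsample_statistics(seqs, k, replicates, seed=0):
    """Recompute the statistic on `replicates` random subsets of size k."""
    rng = random.Random(seed)
    return [mean_pairwise_distance(rng.sample(seqs, k)) for _ in range(replicates)]


# Hypothetical aligned toy sequences standing in for one country's sample.
full = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACCTACGT", "ACGTACGT", "TCGTACGT"]
full_value = mean_pairwise_distance(full)
sub_values = subsample_statistics(full, k=4, replicates=200, seed=1)

print(f"full set: {full_value:.3f}")
print(f"subsample mean: {mean(sub_values):.3f}, "
      f"range: {min(sub_values):.3f} to {max(sub_values):.3f}")
```

The spread of the subsample values gives a rough sense of how much a conclusion could shift when only part of the available data is used.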

1.8.2 Aim 2: Ebolavirus: Estimates of evolutionary rates in Ebolavirus: Characterization of molecular substitution patterns (genome-wide)

Ebolavirus was discovered in 1976, but received very little public attention until the 2014 outbreak. Thus, we have very few samples predating the 2014 outbreak, in part due to poor surveillance, lack of infrastructure, and the high-risk nature of the virus (34). This recent outbreak, however, has brought to light the importance of this disease and, with it, the resources to add more sequences and data. The challenge is to determine the long-term evolutionary trends when the available samples are heavily biased towards the most recent outbreak, with many samples coming from Sierra Leone between September 2014 and now and few sequences available from Liberia and Guinea, despite the effort being made to add sequences from these countries (59, 69, 70).

For this study, 217 recently sequenced complete Ebolavirus genomes were combined with the other available sequences from human patients to take an in-depth look at this virus. In this aim, the whole-genome alignments were used to generate phylogenetic trees and various descriptive statistics to show how the virus is changing and whether the evolutionary patterns differ between this massive and ongoing outbreak and the smaller ones that came before.

Hypothesis 1: Due to the spread and length of the 2014 outbreak (the largest and the longest so far (4)), mutations will have occurred at a higher rate than seen before.

Hypothesis 2: Due to extended contact with multiple human hosts there will be a higher number of changes to epitope regions in response to selective pressure from the human immune system than to non-epitope regions, compared to previous epidemics.

1.8.3 Aim 3: Ebolavirus: Informing vaccine development efforts: Characterization of evolution of treatment targets (for drugs and vaccines), genes and regions within genes

The goal when discussing any infectious disease is control, and ideally prevention. Both a vaccine and drug treatments for use after (potential) exposure are needed for Ebolavirus because it thrives in rural regions that are difficult to access (58). Even if a vaccine were developed and made affordable for public use, impediments to establishing herd immunity, such as fear of modern medicine, transport and distribution in rural areas (especially if a cold chain is required), and the funding required for immunization drives, would limit its effectiveness. A vaccine therefore needs to be paired with viable treatment options. On both fronts, the challenge is deciding where to target the virus. Thus, delineating patterns of evolution in the genes and epitope regions that are potential treatment targets is essential.

Hypothesis 3: Due to functional and structural constraints acting on a virus, there will be highly variable and highly conserved regions that can serve as prospective targets for vaccine and drug design.

In this aim, we will identify such highly variable and highly conserved regions using several approaches: (a) identifying and mapping polymorphic changes, (b) mapping and predicting secondary structural elements, and (c) using known functional annotations (such as surface protein regions exposed to the immune system outside the cell).

Hypothesis 4: Distribution of non-synonymous (amino acid altering) SNPs will be uneven along the genome, with the majority of SNPs leading to radical amino acid changes located in epitope regions, due to immune pressure from the host.

Hypothesis 5: Functionally important residues can be identified using SNP distributions within and between epidemics, where the most important residues are those that can harbor SNPs only during an individual epidemic, but not between epidemics. In other words, residues under strong purifying selection pressure can harbor polymorphisms only during the time of epidemics due to broader host distribution.
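The distinction in Hypothesis 5 between sites that vary only within an epidemic and sites that differ between epidemics can be sketched as a simple classification over aligned sequences grouped by outbreak. In the Python sketch below, the outbreak labels, the toy sequences, the function name, and the use of the majority allele to represent each outbreak are all illustrative assumptions, not the procedure used in Chapter 4.

```python
from collections import Counter


def classify_sites(outbreaks):
    """Split variable sites into within-outbreak-only and between-outbreak sites.

    `outbreaks` maps an outbreak label to a list of aligned sequences.
    A site counts as 'between' when the majority allele differs among outbreaks,
    and as 'within only' when it is polymorphic inside at least one outbreak
    while every outbreak shares the same majority allele.
    """
    lengths = {len(s) for seqs in outbreaks.values() for s in seqs}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    (length,) = lengths

    within_only, between = [], []
    for site in range(length):
        polymorphic = False
        majority_alleles = set()
        for seqs in outbreaks.values():
            counts = Counter(s[site] for s in seqs)
            polymorphic = polymorphic or len(counts) > 1
            majority_alleles.add(counts.most_common(1)[0][0])
        if len(majority_alleles) > 1:
            between.append(site)
        elif polymorphic:
            within_only.append(site)
    return within_only, between


# Hypothetical toy alignment with two outbreaks (0-based site indices).
toy = {
    "outbreak_A": ["ACGTA", "ACGTA", "ACTTA"],  # site 2 varies within A
    "outbreak_B": ["ACGTC", "ACGTC", "ACGTC"],  # site 4 differs from A
}
within_only, between = classify_sites(toy)
print("within-outbreak only:", within_only)
print("between outbreaks:", between)
```

Under the hypothesis, residues at sites of the first kind would be candidates for strong purifying selection, since their variants do not persist from one epidemic to the next.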

1.9 Overview of Subsequent Chapters

Chapter 2 discusses an approach to handling sampling issues in influenza A. The introduction is a brief review of why these issues exist. China and the United States are the two countries with enough available sequences for this project. China’s sequence set was significantly smaller, enabling a comparison of oversampling and undersampling issues.

Chapter 3 is the first part of the Ebolavirus project. Synonymous and non-synonymous rate values were calculated for both epitope and non-epitope regions throughout the seven genes at varying percentages of positive assays. This was done to test the hypothesis that Ebolavirus is evolving in response to the immune response. The second section of Chapter 3 uses phylogenetic pairs to take a more detailed view of the evolution of Ebolavirus and its patterns of change. These patterns could have a great impact on the long-term effectiveness of the current vaccine and offer insight into potential vaccine targets other than the currently used GP.
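The distinction between synonymous and non-synonymous changes that underlies these rate values can be illustrated with a short Python sketch that classifies codon differences between two aligned coding sequences using the standard genetic code. This is only a raw count of differing codons, not a rate estimate (rate methods also normalize by the number of synonymous and non-synonymous sites); the function name and the example sequences are hypothetical, and sequences are written in the DNA alphabet as they appear in sequence databases.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(seq_a, seq_b):
    """Count differing codons as (synonymous, non-synonymous)."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    synonymous = nonsynonymous = 0
    for pos in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[pos:pos + 3], seq_b[pos:pos + 3]
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical example: GCT -> GCC keeps alanine; AAA -> AGA changes lysine to arginine.
print(classify_codon_changes("ATGGCTAAAGGT", "ATGGCCAGAGGT"))
```

Running the same count separately on epitope and non-epitope codons would give the kind of contrast described above, although the chapters themselves report rates rather than raw counts.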

Chapter 4 continues work on Ebolavirus. Point mutations and polymorphisms (from singletons to more prevalent changes) are mapped to current epitope and protein structure information. The location and density of non-synonymous changes, and especially the locations where changes are not tolerated, are vital to the development of treatments and drugs specifically for Ebolavirus. Relative rate of evolution estimates are also used to pinpoint highly variable sites and regions.

Chapter 5 is a summary of the work presented and conclusions drawn, with focus on potential uses of this information and future analyses that can be performed.

1.10 References

1. CDC factsheet SARS https://www.cdc.gov/sars/about/fs-sars.html Updated

July 2, 2012

2. CDC MERS https://www.cdc.gov/coronavirus/mers/index.html Updated July

13, 2016

3. Furuse, Yuki, Shimabukuro, Kozue, Odagiri, Takashi, et al. Comparison of

selection pressures on the HA gene of pandemic (2009) and seasonal human

and swine influenza A H1 subtype viruses. Virology. 2010. 405: 314-321

4. WHO. (2015). One Year into the Ebola Epidemic: a Deadly, Tenacious and

Unforgiving Virus. http://www.who.int/csr/disease/ebola/one-year-

report/ebola-report-1-year.pdf?ua=1&ua=1

5. Holmes, Edward. (2009) The Evolution and Emergence of RNA Viruses.

Oxford University Press

6. Hanada, Kousuke, Yoshiyuki Suzuki, and Takashi Gojobori. "A large

variation in the rates of synonymous substitution for RNA viruses and its

relationship to a diversity of viral infection and transmission

modes." Molecular biology and evolution 21.6 (2004): 1074-1080.

7. WHO (2014) Chikungunya

http://www.paho.org/hq/index.php?option=com_content&view=article&id=97

24&Itemid=1926&lang=en

8. Wiley, Don C., & Skehel, John J., (1987). The Structure and Function of the

Hemagglutinin Membrane Glycoprotein of Influenza Virus. Ann. Rev.

Biochem. 1987. 56: 365-94.

9. Wilson, Ian A., & Cox, Nancy J. (1990). Structural Basis of Immune

Recognition of Influenza Virus Hemagglutinin J Annu. Rev. Immunol. 1990,

8:737-71

10. Sandbulte, Matthew R., Westgeest, Kim B., Gao, Jin, et al. Discordant

antigenic drift of neuraminidase and hemagglutinin in H1N1 and H3N2

influenza viruses. PNAS 2011 108:20748-20753

11. Cheung, Timothy K. W., & Poon, Leo, L. M. (2007). Biology of Influenza A

Virus. Annals of the New York Academy of Sciences, 1102, 1-25.

12. CDC (2014) Influenza vaccine

http://www.cdc.gov/flu/protect/vaccine/index.htm

13. Le Goffic, Ronan, Leymarie, Olivier, Chevalier, Christophe, et al.,

Transcriptomic Analysis of Host Immune and Cell Death Responses

Associated with the PB1-F2 Protein. PLoS Pathogens,

August 2011, 7:8 e1002202

14. Coleman, J Robert. (2007). The PB1-F2 Protein of the Influenza A Virus:

Increasing Pathogenicity by Disrupting Alveolar Macrophages. Virology

Journal, 2007 4:9.

15. WHO Influenza Factsheet 2014

http://www.who.int/mediacentre/factsheets/fs211/en/

16. Molinari NA, Ortega-Sanchez IR, Messonnier ML, Thompson WW, Wortley

PM, Weintraub E, Bridges CB. The annual impact of seasonal influenza in the

US: measuring disease burden and costs. Vaccine. 2007 Jun 28;25(27):5086-

96. Epub 2007 Apr 20.

17. Ward, Melissa J., Lycett, Samantha J., Avila, Dorita, Bollback, Jonathan P., &

Leigh Brown, Andrew J. (2013). Evolutionary interactions between

haemagglutinin and neuraminidase in avian influenza. BMC Evolutionary

Biology 13:222

18. Nicholls, John, Chan, Renee, Russell, Rupert, et al., Evolving Complexities

of the Influenza Virus and its Receptors. Cell Trends in Microbiology 2008

16:4

19. Thomson, C.A., Wang, Y., Jackson, L.M., et al., Pandemic H1N1 Influenza

Infection and Vaccination in Humans Induces Cross-protective Antibodies

That Target the Hemaglutinin Stem. Frontiers in Immunology, May 2012 3:87

20. Stoner, Terri D., Krauss, Scott, DuBois, Rebecca, Negovetich, et al. Antiviral

Susceptibility of Avian and Swine Influenza Virus of the N1 Neuraminidase

Subtype. Journal of Virology. Oct. 2010 p. 9800-9809.

21. Hurt, Aeron C., The epidemiology and spread of drug resistant human

influenza viruses. Current Opinion in Virology, 2014 8:22-29

22. Dehner, George. (2012). Influenza. Pittsburgh, PA: University of Pittsburgh

Press.

23. Strengell, Mari, Ikonen, Niina, Ziegler, Thedi, Julkunen, Ilkka. Minor

Changes in the Hemagglutinin of Influenza A(H1N1) 2009 Virus Alter its

Antigenic Properties. Plos ONE October 2011 6:10 e25848

24. WHO Pandemic (H1N1) 2009: frequently asked questions

http://www.who.int/csr/disease/swineflu/frequently_asked_questions/en/

25. Fiore, Anthony E., Bridges, Carolyn B., Katz, Jacqueline M., & Cox, Nancy J.

(2013). Inactivated influenza vaccines. In Stanley A. Plotkin, Walter A.

Orenstein & Paul A. Offit (Eds.), Vaccines 6th ed.(257-293).Elsevier

Saunders.

26. WHO Country Cooperation Strategy 2016-2020 South Africa

http://apps.who.int/iris/bitstream/10665/255007/1/ccs_zaf_2016_2020.pdf

27. WHO Country Cooperation Strategy 2012-2016 Thailand

http://www.who.int/countryfocus/cooperation_strategy/ccs_tha_en.pdf?ua=1

28. Bao Y., P. Bolotov, D. Dernovoy, B. Kiryutin, L. Zaslavsky, T. Tatusova, J.

Ostell, and D. Lipman. The Influenza Virus Resource at the National Center

for Biotechnology Information. J. Virol. 2008 Jan;82(2):596-601.

29. Lin, J.-H., Chiu, S.-C., Cheng, H.-W., et al. Molecular Epidemiology and

Antigenic Analyses of Influenza A Viruses H3N2 in Taiwan. Clin Microbiol

Infect, 17: 214-222

30. Bragstad, Karoline, Nielsen, Lars, & Fomsgaard, Anders. (2008). The

evolution of human influenza A viruses from 1999 to 2006: A complete

genome study. Virology Journal 5:40

31. Suzuki, Yoshiyuki. (2006). Natural Selection on the Influenza Virus Genome.

Mol. Bio. Evol. 23(10):1902-1911.

32. Kryazhimskiy, Sergey, Dushoff, Jonathan, Bazykin, Georgii, A. & Plotkin,

Joshua B. Prevalence of Epistasis in the Evolution of Influenza A Surface

Proteins. PLoS Genetics 2011 7:2

33. CDC (2017) Nipah virus https://www.cdc.gov/vhf/nipah/index.html

34. Martines, R. B., Ng, D. L., Greer, P. W., Rollin, P. E., & Zaki, S. R. (2015).

Tissue and cellular tropism, pathology and pathogenesis of Ebola and

Marburg viruses. The Journal of pathology, 235(2), 153-174.

35. Martina, Byron E.E. & Osterhaus Albert D.M.E. ““Filoviruses”: a Real

Pandemic Threat?” EMBO Molecular Medicine (2009): DOI

emmm.200900005

36. Barrette, Roger W., et al. "Current perspectives on the phylogeny of

Filoviridae." Infection, Genetics and Evolution 11.7 (2011): 1514-1519.

37. Takada, Ayato, and Yoshihiro Kawaoka. "The pathogenesis of Ebola

hemorrhagic fever." Trends in microbiology 9.10 (2001): 506-511.

38. Feldmann, Heinz, and Thomas W. Geisbert. "Ebola haemorrhagic fever." The

Lancet 377.9768 (2011): 849-862.

39. WHO (2017) Ebola FAQ http://www.who.int/csr/disease/ebola/faq-ebola/en/

40. Baseler, Laura, et al. "The pathogenesis of Ebola virus disease." Annual

Review of Pathology: Mechanisms of Disease 12 (2017): 387-418.

41. Deen GF, Knust B, Broutet N, Sesay FR, Formenty P, Ross C, et al. Ebola

virus RNA Persistence in Semen of Ebola Virus Disease Survivors -

Preliminary Report. The New England journal of medicine. 2015.

42. Bausch, Daniel G., et al. "Assessment of the risk of Ebola virus transmission

from bodily fluids and fomites." The Journal of infectious diseases

196.Supplement_2 (2007): S142-S147.

43. WHO (2016) Clinical care for survivors of Ebolavirus disease

http://apps.who.int/iris/bitstream/10665/204235/1/WHO_EVD_OHE_PED_1

6.1_eng.pdf?ua=1

44. Messaoudi, I., Amarasinghe, G. K., & Basler, C. F. (2015). Filovirus

pathogenesis and immune evasion: insights from Ebola virus and Marburg

virus. Nature Reviews Microbiology, 13(11), 663-676.

45. Viralzone 2014 Swiss Institute of Bioinformatics.

http://viralzone.expasy.org/all_by_species/5016.html

46. Groseth, Allison, Heinz Feldmann, and James E. Strong. "The ecology of

Ebola virus." Trends in microbiology 15.9 (2007): 408-416.

47. Lai, Kang Yiu, Wing Yiu George Ng, and Fan Fanny Cheng. "Human Ebola

virus infection in West Africa: a review of available therapeutic agents that

target different steps of the life cycle of Ebola virus." Infectious diseases of

poverty 3.1 (2014): 43.

48. Liu, Wen Bin, et al. "Ebola virus disease: from epidemiology to

prophylaxis." Military Medical Research 2.1 (2015): 7.

49. Kühl, Annika, and Stefan Pöhlmann. "How Ebola virus counters the interferon

system." Zoonoses and public health 59.s2 (2012): 116-131.

50. Mehedi, Masfique, et al. "A new Ebola virus nonstructural glycoprotein

expressed through RNA editing." Journal of virology 85.11 (2011): 5406-5414.

51. CDC Ebola Outbreaks Chronology

https://www.cdc.gov/vhf/ebola/outbreaks/history/chronology.html Updated

July 2017

52. WHO Situation report 3/30/16 Ebola

http://apps.who.int/ebola/sites/default/files/thumbnails/image/sitrep_casecount

_40.png?ua=1

53. WHO Situation report 6/6/17 Ebola in DRC

http://apps.who.int/iris/bitstream/10665/255630/1/EbolaDRC-06062017.pdf

54. WHO (2014) “Working with communities in Gueckedou for better

understanding of Ebola” http://www.who.int/features/2014/communities-

gueckedou/en/

55. Matassov, Demetrius, et al. "Vaccination with a highly attenuated

recombinant vesicular stomatitis virus vector protects against challenge with a

lethal dose of Ebola virus." The Journal of infectious diseases 212.suppl_2

(2015): S443-S451.

56. Regules, Jason A., et al. "A recombinant vesicular stomatitis virus Ebola

vaccine." New England Journal of Medicine 376.4 (2017): 330-341.

57. Henao-Restrepo, Ana Maria, et al. "Efficacy and effectiveness of an rVSV-

vectored vaccine expressing Ebola virus surface glycoprotein: interim results

from the Guinea ring vaccination cluster-randomised trial." The

Lancet 386.9996 (2015): 857-866.

58. Quammen, David. (2014). Ebola: The Natural History of a Deadly Virus. New

York:W. W. Norton

59. Park, Daniel J., et al. "Ebola Virus Epidemiology, Transmission, and

Evolution during Seven Months in Sierra Leone." Cell 161.7 (2015): 1516-

1526.

60. Cooper, C. L., & Bavari, S. (2014). A race for an Ebola vaccine: promises and

obstacles. Trends in Microbiology, 20, 1-2.

61. CDC Ebola transmission

https://www.cdc.gov/vhf/ebola/transmission/index.html Updated July 2015

62. Leroy, Eric M., et al. "Human asymptomatic Ebola infection and strong

inflammatory response." The Lancet 355.9222 (2000): 2210-2215.

63. Rowe, Alexander K., et al. "Clinical, virologic, and immunologic follow-up of

convalescent Ebola hemorrhagic fever patients and their household contacts,

Kikwit, Democratic Republic of the Congo." The Journal of infectious

diseases 179.Supplement_1 (1999): S28-S35.

64. McGillis Hall, Linda, and Jordana Kashin. "Public understanding of the role

of nurses during Ebola." Journal of Nursing Scholarship 48.1 (2016): 91-97.

65. Piontkivska, Helen, and Austin L. Hughes. "Patterns of sequence evolution at

epitopes for host antibodies and cytotoxic T-lymphocytes in human

immunodeficiency virus type 1." Virus research 116.1 (2006): 98-105.

66. Paul, Sinu, and Helen Piontkivska. "Frequent associations between CTL and

T-Helper epitopes in HIV-1 genomes and implications for multi-epitope

vaccine designs." BMC microbiology 10.1 (2010): 212.

67. O'Connor, David H., et al. "A dominant role for CD8+-T-lymphocyte

selection in simian immunodeficiency virus sequence variation." Journal of

virology 78.24 (2004): 14012-14022.

68. Maness, Nicholas J., et al. "Comprehensive immunological evaluation reveals

surprisingly few differences between elite controller and progressor Mamu-B*

17-positive Simian immunodeficiency virus-infected rhesus

macaques." Journal of virology 82.11 (2008): 5245-5254.

69. Tong, Y. G., Shi, W. F., Liu, D., Qian, J., Liang, L., Bo, X. C., ... & Jiang, J.

F. (2015). Genetic diversity and evolutionary dynamics of Ebola virus in

Sierra Leone. Nature.

70. Carroll, Serena A., et al. "Molecular evolution of viruses of the family

Filoviridae based on 97 whole-genome sequences." Journal of virology 87.5

(2013): 2608-2616.


CHAPTER 2: Sampling Issues in Influenza A Analysis: An Approach to Dealing with Oversampling

2.1 Introduction

New strains of the flu are developing constantly because of antigenic drift, due to point mutations during RNA replication (roughly one mutation per replication cycle (1) or, in the case of the NS gene, 1.5 × 10^-5 mutations per nucleotide per infectious cycle (2)), and antigenic shift, which is the reassortment (recombination of flu segments) that occurs within cells infected by more than one strain (3). The latter may facilitate flu strains jumping host species and is assisted by the segmented genome structure. The former can lead to mutations that allow the virus to escape immune system detection (1, 4).

Hemagglutinin (HA) and neuraminidase (NA) are the focus of both vaccine development and influenza treatment (5). Both proteins have receptor regions that are necessary for protein function but also variable enough to avoid immune system detection (3). Vaccines that target the more conserved stalk region of HA are being developed, since antibodies against this region have been shown to be cross-protective (6). However, strains with a mutated short stalk have been discovered (6). NA is the target of drugs like oseltamivir (Tamiflu), which prevent the budding of new viruses from the host cell. Tests have confirmed that some strains are already developing resistance to oseltamivir and other NA-targeted drugs (7, 8).

The approach used for strain selection for the influenza vaccine relies on the WHO’s Global Influenza Surveillance Network of 83 countries to identify the three most virulent strains for use in the vaccine each year (9). This surveillance data may be biased or imprecise due to the availability of samples. In other words, the most frequent strains are selected more often, while strains with higher pathogenicity can be overlooked. This is especially true when there is a shift in prominent strain type (1968-1969 flu season, 1993-1996 seasons) (10). This can be due to sampling bias. Developed countries have the means and motivation to maintain a sampling network for influenza. In contrast, less developed nations do not have the means to take an adequate number of samples from throughout their country. Furthermore, crossover events tend to occur in developing countries or more rural areas of developed countries, where surveillance might not be adequate and daily interactions with birds and pigs are common, such as on farms or in bird markets (11).

With this knowledge, the challenge is how to overcome the biases (both in geography and strain type) ingrained within the existing sequence databases. Studies utilizing localized data can be useful for within-country or within-urban-area trends, but may miss the global trends (e.g., 12, 13). Limiting strain type or time frame (e.g., 4, 14) has similar drawbacks in terms of complex, interacting global trends, including species jumps, which can occur when a cell is infected by multiple strains. It also fails to consider evolutionary interactions between the surface proteins HA and NA (5, 15).

An approach to achieve a better, more representative dataset is random sampling. Selecting a randomly sampled subset from oversampled countries allows under-sampled areas to still be taken into account for statistics and trends. In addition, other strain types (e.g. H5N1) can be analyzed within a larger dataset, which can provide insight that small sets of that strain alone cannot. Knowledge about how outlier strains can affect overall statistics of influenza is vital, especially when averages are considered. In this work, we examined the effect of sampling on descriptive statistics of molecular substitution patterns and diversity, with the goal that representative samples can be taken without losing valuable information.

2.2 Hypotheses

The choices made when taking a sample of sequences impact the outcomes of the study and, more broadly, the approaches to dealing with the disease (vaccines, treatments). When working with sequence databases, the samples have already been taken. Analyzing subsamples from the countries with the largest available samples gives insight into how having only a small portion of the available sequences could impact the conclusions drawn by studies.

Hypothesis: When sampled properly, subsets can reveal the same trends as whole sets and thus allow the building of data sets that are more representative, with less sampling bias.

2.3 Methods

2.3.1 Sequences collected

Full-length influenza A coding sequences of hemagglutinin genes from infected humans were collected from the Influenza Virus Resource at NCBI (16) and aligned as per the respective amino acid alignment (April - May 2013 for the 2010 and 2011 sets). The dataset included the United States, which had 105 full-length HA sequences available from 2004-2006, 788 HA sequences for 2009, 295 HA sequences for 2010, and 355 HA sequences for 2011. China had 30 full-length HA sequences available from 2004-2006, 249 HA sequences for 2009, 69 HA sequences for 2010, and 54 HA sequences for 2011. These countries were chosen because, when 19 countries were selected for an international analysis, the US had an overwhelming number of sequences (Figure 1.2, Supplementary Table 1.1). To have a comparison available for the subset method, China was chosen as the country with the second largest number of sequences available. Figure 2.1 shows the breakdown of the number of sequences and subset sizes for each time period.

Table 2.1: Number of Sequences Used in the Analysis. For the years used in this study, the numbers of full-length HA sequences available for the US and China are given. Subset size is 2/3 of the number of sequences available for China. 100 random subsets were taken for each year.

Period       US     China   Subset Size
2004-2006    105    30      20
2009         788    249     166
2010         295    69      46
2011         355    54      36


Figure 2.1: Workflow of analysis steps. Once the sequences were collected by year, the number of available sequences is shown in the blue boxes. *Subsets were sized to 2/3 the number of available sequences from China.

2.3.2 Estimation of nucleotide substitution pattern

Pairwise dN and dS values and nucleotide diversity values were estimated for these subsets and for the whole dataset for each period. The subset values were plotted, and means and medians were taken to discern how the values were distributed in relation to the whole-set values, completing the steps shown in Figure 2.1. Average dN and dS (number of nonsynonymous substitutions per nonsynonymous site and synonymous substitutions per synonymous site, respectively) and nucleotide diversity values (average diversity of the population) (17, 18) were estimated using the re-sampled subsets of sequences. We used the Nei-Gojobori method with Jukes-Cantor correction as implemented in MEGA-CC with 100 bootstrap replications for each subset for the dN and dS (19). Jukes-Cantor correction was also used in obtaining the average diversity within each subset.
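As a rough illustration of the Jukes-Cantor correction mentioned above, the following Python sketch converts a proportion of differing sites p into a corrected distance d = -3/4 ln(1 - 4p/3) and averages the corrected pairwise distances within a set of aligned sequences. It is not the MEGA implementation: the function names and toy sequences are hypothetical, and gaps and ambiguous bases are not handled.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def mean_pairwise_diversity(seqs):
    """Average Jukes-Cantor distance over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(jukes_cantor(p_distance(a, b)) for a, b in pairs) / len(pairs)


# Toy aligned sequences (hypothetical, not taken from the study).
toy = ["ATGCATGCAT", "ATGCATGCAA", "ATGAATGCAT"]
diversity = mean_pairwise_diversity(toy)
```

In the Nei-Gojobori approach the same correction is applied separately to the proportions of nonsynonymous and synonymous differences, which requires counting synonymous and nonsynonymous sites per codon and is omitted here for brevity.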

2.3.3 Sampling Method

For each country and time period, 100 random subsets of the sequences were taken using a random number generator in R (Version 3.0.0) (20). The number of sequences in each subset was 2/3 of the available sequences for China for that time period (20 sequences for 2004-2006, 166 sequences for 2009, 46 for 2010, and 36 for 2011). This size was chosen to allow variety in the China subsets without making the subsets too small relative to the available US sequences (Table 2.1, Figure 2.1).
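The sampling scheme can be sketched as follows. The original analysis used R; this Python version is only an illustration. It assumes that each subset is drawn without replacement, uses an arbitrary seed for repeatability, and uses placeholder identifiers in place of real accession numbers. The sequence counts are those given in Table 2.1.

```python
import random

# Full-length HA sequence counts from Table 2.1.
AVAILABLE = {
    "2004-2006": {"US": 105, "China": 30},
    "2009": {"US": 788, "China": 249},
    "2010": {"US": 295, "China": 69},
    "2011": {"US": 355, "China": 54},
}
N_SUBSETS = 100


def subset_size(china_count):
    """Two thirds of the sequences available for China, as a whole number."""
    return (2 * china_count) // 3


def draw_subsets(sequence_ids, size, n_subsets, rng):
    """Draw n_subsets random subsets, each sampled without replacement."""
    return [rng.sample(sequence_ids, size) for _ in range(n_subsets)]


rng = random.Random(2013)  # arbitrary fixed seed so the draws are repeatable
subsets = {}
for period, counts in AVAILABLE.items():
    size = subset_size(counts["China"])
    for country, n_available in counts.items():
        # Placeholder identifiers standing in for real accession numbers.
        ids = [f"{country}_{period}_{i}" for i in range(n_available)]
        subsets[(country, period)] = draw_subsets(ids, size, N_SUBSETS, rng)
```

Each entry of `subsets` then holds 100 lists of identifiers that could be passed to whatever tool estimates dN, dS, and diversity.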

2.4 Results

First, the number of sequences available from the two countries with the largest number of sequences (US and China) was ascertained and broken down by year. The periods were chosen to represent the following: 2004-2006 for the pre-pandemic period, 2009 for the pandemic, 2010 for the tail end of the pandemic (recovery), and 2011 for the post-pandemic period. The number of sequences available from the US far outnumbered the number available from China; therefore, sample sizes for each year were set at 2/3 the number of sequences available for China. One hundred random subsets were taken for each country in each period using a random number generator in RStudio (21), and estimates of nucleotide substitution patterns were then considered.

2.4.1 Nonsynonymous and Synonymous Substitution Patterns by Time Period

Variance was found in the subsets that can be attributed not only to time period, but also to strain composition. The 2004-2006 subsets showed a spread that displays the temporal aspect of influenza evolution. The inclusion of multiple years leads to a wide spread in subset values. In the 2009 subsets, however, the values cluster tightly due to sampling closely related pandemic strains. The 2010 subsets are further spread apart, showing post-pandemic divergence of strains. In 2011, the subset values had an interesting split pattern due to differences in strain content.


Figure 2.2: dN Values by Country and Year: The boxplots above illustrate the relationship between the subset dN values (non-synonymous substitution rates) by country (pink color for China and light blue for the US, respectively) and by time interval. The dots represent outliers. The whiskers represent the maximum (above) and minimum (below) values. The line within the box shows the median value for the dataset


Figure 2.3: dS Values by Country and Year: The boxplots above illustrate the relationship between the subset dS values (synonymous substitution rates) by country (pink color for China and light blue for the US, respectively) and by time interval. The dots represent outliers. The whiskers represent the maximum (above) and minimum (below) values. The line within the box shows the median value for the dataset

2.4.1.1 Pre-pandemic (2004-2006)

For the 2004-2006 dataset, a greater scatter is expected due to the small size of the subsets (20 sequences) and the multiple years present. This is also the pre-pandemic time period, so there is a wider variety of strains. In addition, the number of sequences for each country is small in comparison with pandemic and post-pandemic surveillance. In the China subsets the dN values do cluster together (Figure 2.2), and the wider spread of the dS values (Figure 2.3) is due to the presence of multiple strains in only some of the subsets. This span of time would require either multiple subsets combined into one larger subset (but still not as large as the whole set) or a further breakdown by year. Although the small number of sequences in the database for that period would make this difficult for the China set, a breakdown by year might work for the larger US set.

2.4.1.2 Pandemic (2009)

For the 2009 dataset, a close cluster of values is expected due to the 2009 H1N1 pandemic. Though the viruses involved have formed separate groups and clusters with different antigenic properties, they are still closely related (4). Figures 2.2 and 2.3 support this, although the values around which the subsets cluster differ between countries. The higher values in the US subsets may be due to a higher number of H3N2 strains present as compared to the China subsets. In all graphs, the data points cluster closely around one value. The China subsets contained H5N1 sequences, yet the values are still lower than the US values. This shows the difference in impact between H3N2 strains and H5N1 strains. Given this knowledge, adding in H3N2 strains may skew H1N1 trends, and H5N1 trends may not be seen when combined with H1N1. It is also interesting to note that the N1 subtype of the neuraminidase may be affecting the patterns seen in the hemagglutinin, since the H5N1 and H1N1 values cluster so well. This coevolution of the two proteins has been shown in previous studies (5).

2.4.1.3 Recovery (2010)

For the 2010 subsets, for both China and the US, H1N1 made up the majority of strains, with H3N2 strains present in every subset as well. The majority of China subsets also contained one H5N1 strain. The smallest China dN value (subset 37) comes from a set containing mostly H1N1 with four H3N2 strains and one H5N1 strain. The subset dS values for both the US and China cluster closely around 0.019 (Figure 2.3), but it is unclear why the dN values for China are lower (Figure 2.2). The larger spread here is interesting given how tightly the previous year's values clustered. This suggests that both inter- and intra-strain competition reignited quickly after the pandemic, although the dS values for the US are very small in comparison to 2009. This may be due to pressure to change resulting from much wider use of the vaccine during the pandemic.

2.4.1.4 Post Pandemic (2011)

For the 2011 datasets, there are two distinct groups in three of the graphs. This depends on H3N2 content in the China sets. The higher points in the US dS plot are subsets containing the single H1N2 strain in the database for that year (Figure 2.3). This demonstrates the effect of strain differences on the overall statistics. Here, combining subsets would be critical in obtaining the overall trends. This also demonstrates how subset techniques can aid in locating patterns. The values for China are also markedly smaller than the year before (Figures 2.2 and 2.3). This could point to the establishment of a cluster of strains that are more evolutionarily fit for the region. That the lower clusters are entirely H1N1 and the upper cluster subsets contain at least one H3N2 supports the idea that clustering or “swarming” is occurring. This swarming pattern is shown in a previous study of influenza A hemagglutinin (22).


2.4.2 Sampling Provides Similar Insights into Substitution Patterns to Those from the Entire Sequence Set

[Plot for Figure 2.4: diversity values (y-axis, 0.05 to 0.40) for the whole set and the subset mean, by group (China and US for 2004-2006, 2009, 2010, and 2011)]

Figure 2.4: Nucleotide Diversity Values. The figure above shows the whole set values (blue diamonds) and the subset means (red squares) for each group (country/year(s)). Error bars shown are using standard error for each whole set measurement


As shown in Figure 2.4, the diversity values for the subsets are close to the full set values. However, these are the means and medians for the subsets. Individual values for each subset showed differing levels of variation by year, but mainly had a range of about 0.2, with some years far less (less than 0.1). Several outliers (on the low end) in the US subsets are due in part to the small subset size in comparison to the overall number of strains, but it is interesting to note that they are all underestimates of the diversity. This suggests that the diversity values for the whole set can be obtained with a subset or a combination of several subsets.


[Plot for Figure 2.5: dN values (y-axis, 0.00 to 0.30) for the whole set and the subset mean, by group (China and US for 2004-2006, 2009, 2010, and 2011)]

Figure 2.5: dN Values. The figure above shows the whole set values (blue diamonds) and the subset means (red squares) for each group (country/year(s)). Error bars shown are using standard error for each whole set measurement

As shown in Figure 2.5, the overall non-synonymous rate (dN) is captured quite well in the subsets using the mean and median of all of the subsets. The range of the values is about 0.1 from smallest to largest, with the 2004-2006 groups as the exception. Their larger range (closer to 0.15 for both countries, with outliers approaching 0.3) is due to the inclusion of multiple years, which was done because of the number of strains available.

[Plot for Figure 2.6: dS values (y-axis, 0.00 to 1.00) for the whole set and the subset mean, by group (China and US for 2004-2006, 2009, 2010, and 2011)]

Figure 2.6: dS Values: The figure above shows the whole set values (blue diamonds) and the subset means (red squares) for each group (country/year(s)). Error bars shown are using standard error for each whole set measurement


As shown in Figure 2.6, the dS values for the overall set are close to the mean/median values for the combined subsets. However, the 2004-2006 subsets had a large range (around 1) due to the sampling of multiple years and multiple strain types. The 2009 sets had a narrower range (0.08 for China and 0.25 for the US). The 2010 values had the narrowest range for both countries (0.008 for China and 0.0125 for the US) and were also very small. The 2011 sets were again unusual, with both countries showing very small dS values, but a bottom-heavy distribution for China and a large cluster of high outliers for the US (with a value around 0.25).

As shown in the previous figures (Figures 2.4-2.6), the values for the whole set are close to the mean values for the subsets. Whether the mean or the median is closer varies by year and statistic (see Supplementary Tables 2.1-2.3); however, this indicates that taking a subset of the oversampled countries could give the same characteristics and trends as the whole set. It also provides hope that, for the many countries with only a small number of strains available, the dataset can still provide a window into what the virus is doing even though it is far from ideal.
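The comparison between a whole-set estimate and the distribution of subset estimates can be summarized with a small helper like the one below. This is a hedged sketch: the function name and the numbers are made up for illustration and are not results from this study.

```python
from statistics import mean, median


def compare_to_whole(whole_value, subset_values):
    """Summarize how subset estimates sit relative to the whole-set estimate."""
    sub_mean = mean(subset_values)
    sub_median = median(subset_values)
    mean_gap = abs(sub_mean - whole_value)
    median_gap = abs(sub_median - whole_value)
    return {
        "whole": whole_value,
        "subset_mean": sub_mean,
        "subset_median": sub_median,
        "range": max(subset_values) - min(subset_values),
        # Which summary statistic lies closer to the whole-set value.
        "closer": "mean" if mean_gap <= median_gap else "median",
    }


# Made-up numbers for illustration only; they are not results from this study.
summary = compare_to_whole(0.020, [0.018, 0.019, 0.021, 0.022, 0.030])
```

In this toy case a single high subset value pulls the mean upward, so the median lies closer to the whole-set value, which is the kind of year-to-year difference described above.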

2.5 Discussion and Conclusions

Establishing the patterns of influenza evolution worldwide is essential to preventing or containing a developing strain that has the potential to become a pandemic strain. As the 2009 pandemic demonstrated, the current system is ineffective at containing or predicting these strains in time for a vaccine to be made and distributed (1). The 2009 strain was highly communicable but luckily not highly virulent. This is likely due to conflicting evolutionary pressures on HA to either keep the receptor binding area flexible enough for binding to receptors from multiple species (more communicable) or to focus specifically on binding exclusively to the human receptor (which could increase virulence by quickly increasing the number of cells infected). There are, however, highly virulent avian strains developing (H5N1, H7N9, H9N2) which could potentially gain communicability through antigenic shift and become pandemic (9, 23). Because HA can shift between strain types and drift can make these HA proteins bind more efficiently to human receptors, making a vaccine for each strain would be a waste of time and money. Ideally, trends could be pinpointed early to identify the most likely route that these strains would take to become more communicable in humans, and what precautions would be most effective.

The goal of our study was to look at trends in flu sampling and determine whether subsampling influences the nucleotide pattern estimates. While utilizing the entire set of sequences is ideal, certain techniques such as taking multiple subsets and comparing them can help ascertain what is a true trend and what may be biased by sampling issues. Consistency between the mean statistical values for the whole set and those of the subsets indicates that the diversity and the rates of synonymous and non-synonymous changes are still captured using the subsets. This suggests that, when working with the large sets of sequences from highly sampled countries, taking a subset or combining several subsets should produce the same effect as using the whole set, which may keep those countries from overwhelming the information from the smaller numbers of strains available for developing countries. It also provides hope that even when a limited number of sequences are available from the under-sampled countries, the data can still be useful in determining the trends of flu evolution within those countries.
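One way to picture the combined dataset described above is to cap the contribution of highly sampled countries with a random subset while keeping every sequence from under-sampled countries. The sketch below is illustrative only: the cap value, the country labeled "CountryX", its count, and all identifiers are hypothetical, and how the cap should be chosen is left open.

```python
import random


def build_balanced_dataset(seqs_by_country, cap, rng):
    """Keep every sequence from small countries; downsample large ones to cap."""
    combined = []
    for country, seqs in sorted(seqs_by_country.items()):
        if len(seqs) > cap:
            chosen = rng.sample(seqs, cap)  # random subset, no replacement
        else:
            chosen = list(seqs)  # under-sampled country: keep everything
        combined.extend((country, s) for s in chosen)
    return combined


# Hypothetical identifiers and a hypothetical third country, for illustration.
rng = random.Random(0)
example = {
    "US": [f"US_{i}" for i in range(788)],
    "China": [f"CN_{i}" for i in range(249)],
    "CountryX": [f"X_{i}" for i in range(12)],
}
balanced = build_balanced_dataset(example, cap=166, rng=rng)
```

Repeating the draw with different seeds and comparing the resulting statistics would mirror the multiple-subset comparison used in this chapter.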

Sampling issues are something every researcher deals with, and as the amount of sequencing data grows, they become even more prominent. Both oversampling and undersampling can lead to issues with consistency and repeatability of results. Currently, the surveillance available for influenza is far from ideal. Developed nations have implemented networks to track the flu, but limited attention has been paid to the fact that influenza is a global disease. The 2009 pandemic flu spread worldwide in less than three months. In order to track and prevent pandemic strains, surveillance must be implemented in all areas known to be hotspots for species jumping events, and in areas where contact with live pigs and fowl is common. This means that even in the US, more effort must be put into monitoring industrial farming and live animal markets. Catching a potential pandemic strain, or even just a highly virulent strain, as soon as it becomes able to infect humans is integral to developing a vaccine for that strain.

There are three main approaches for dealing with sampling issues. The first approach is to collect every available sequence. This approach was used by Furuse et al. (24) in their study comparing the selection pressures on the HA protein of pandemic influenza with those on seasonal strains. This study used all available H1 subtype HA genes in the NCBI flu database. Additionally, swine sequences of North American origin were added (origin was ascertained by phylogenetic tree). From this study they concluded that the 2009 pandemic strain was of swine origin, but also that its H1 gene was evolutionarily different from the seasonal and swine strains. In Karthikeyan et al. (25), all HA nucleotide sequences in the NCBI nucleotide database from 1918-2011 from all mammal hosts were included (this included 20 subtypes) to study the evolution of the HA protein using phylogenetic and molecular modeling. The results of this study show that the 2009 Mexico City pandemic strain is closely related to a 2007 Mallard H1N1 strain from Norway and that “residue level mutation may have a vital role during evolution towards the development of resistance (sic) species.”

The conclusions drawn from these studies have certain limitations. By taking everything available the datasets may be biased to one country or area. This may mean biased results that may miss patterns essential to prediction of future virulent strains or pandemics. It also means that the trends seen may be more localized in nature instead of the global trends that the studies are looking for. Also, if the strains are mainly from a few areas, the dataset may be filled with mostly similar, co-circulating strains and lack or downplay developing outlier strains which may be the strains that are developing dangerous communicability or virulence mutations.

The second approach is to focus on one strain type. H3N2 and H1N1 are the most common seasonally circulating strains. Currently there are also studies concentrating on emerging strains like H5N1. In Suzuki (14), H3N2 strains were analyzed to study natural selection (284 HA proteins). Positions associated with epitopes (found from referenced studies) were tested for dN/dS and p values. The results show both negatively and positively selected sites (especially strong positive selection in PB1-F2). The statistics provided show which sites are likely to develop escape mutations and should be monitored when developing treatments such as vaccines. In Ward et al. (5), avian H7 strains were examined for the evolutionary interaction between HA and NA proteins. It was concluded that HA evolution patterns depend on the type of NA with which the HA is paired. This shows that gene interaction should be taken into account when trying to predict influenza evolution.

Limitations of this approach are due to the nature of influenza A. Antigenic shift and genetic recombination can occur in any cell infected by more than one type of the flu. This means that the study of strain evolution must consider interactions with other strains, as well as interactions with other species’ strains due to host jumping. This has been shown to happen with the 2009 pandemic and the 1918 pandemic, which were both shown to have animal origins (26). Swine act as the perfect mixing vessel because their respiratory cells possess receptors for both human and avian HA (3). Live bird markets and industrial farming practices provide ideal environments for species jumping. For example, the 2009 H1N1 pandemic was traced back to a pig farm in Mexico (27).

The third approach is to take a small sample, either from a small geographic area or a small time frame. Lin et al. (12) did epidemiological and antigenic analyses on 44 influenza A H3N2 viruses that circulated in Taiwan from 2004-2008. The results of their study show that the HA protein underwent significant changes to escape the immune system, and that co-circulation and intra-subtype reassortment events were happening. Bragstad et al. (13) took 234 Danish nasal swab samples from 1999-2006 for their study, which showed patterns of change in the proteins made to evade the immune system and high dN/dS ratios. The study also stated and demonstrated that influenza evolution is complex and that more sequences are needed from throughout Europe to provide a better picture of the overall situation with which to make predictions about the future of the flu in Europe.

The limitations of such a study are the limitations of any small sample size. The results may not scale up to the big picture; they may only apply locally. Given that influenza is a global disease, global trends are important, and small-scale trends may not predict the direction influenza is actually taking. This may mean missing key trends that result in pandemics or antiviral-resistant strains. However, small samples are beneficial in providing guidance to local public health officials.

There are methods of sampling and resampling that seem to provide consistent results. Several studies have used a combination of random sampling and subsetting to pull out significant selection trends in the HA protein (28-30). Such studies show that when dealing with data that are biased (over- or under-sampled, differing strain composition), repeating the sampling and recombining subsets are vital to ensuring that the patterns viewed in the statistics are actually present in the population.

Another issue in many molecular evolution studies is the use of the dN/dS ratio (28). A ratio will give an idea of the difference, but can be deceptive about the magnitude or significance of that difference. A dN/dS of 0.001/0.002 and a dN/dS of 1/2 are both reported as 0.5, although the changes actually involved are very different. Although it is not uncommon for studies to report only the ratio, the individual values for dN and dS should also be given. An extreme example of this problem is shown in Furuse et al. (24), where highly variable sites have no synonymous changes, so the dN/dS involves division by zero. As shown in Figure 2.7, different methods of assessing dN/dS and differing numbers of sequences used can affect the magnitude of the dN/dS that is seen. This can lead to misleading interpretations. The dN/dS ratio should be given with clarifying information to ensure that the interpretation of that statistic is accurate.
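The point about reporting dN and dS alongside their ratio can be made concrete with a short sketch. The helper below is hypothetical and simply returns all three quantities, treating the case of no synonymous changes explicitly instead of raising a division-by-zero error.

```python
import math


def report_dn_ds(dn, ds):
    """Return dN, dS, and their ratio, handling dS == 0 explicitly."""
    if ds == 0:
        # dS == 0: report infinity when dN > 0, otherwise NaN (0/0).
        ratio = math.inf if dn > 0 else math.nan
    else:
        ratio = dn / ds
    return {"dN": dn, "dS": ds, "dN/dS": ratio}


small = report_dn_ds(0.001, 0.002)  # very few substitutions
large = report_dn_ds(1, 2)          # many substitutions, same ratio
no_syn = report_dn_ds(0.05, 0.0)    # illustrative site with no synonymous changes
```

Both `small` and `large` give a ratio of 0.5, so only the accompanying dN and dS values reveal how different the two cases are.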


Figure 2.7: dN/dS values by number of sequences used and color coded by the source article. The colored circles represent dN/dS values given from the studies shown in the legend (14, 31, 13, 5, 24, 32, 12, & 15). The x-axis is the dN/dS value and the y-axis is the number of sequences used in each study. Infinite values and a value of 9.5 from Furuse et al. (24) not shown


Influenza A remains a major public health challenge, on par with tuberculosis, HIV, and malaria. The modern world is so interconnected that no populations are isolated enough to avoid such a communicable disease. The specter of past pandemics, especially the 1918 pandemic, which killed so many, and the 2009 pandemic, which was thankfully not as virulent, shows how dangerous this virus can be and how little can be done to stop its spread once it has begun. A combination of better tracking of virus evolution, better surveillance of hot spots for species crossover, and the development of a much faster method for making and distributing vaccine must occur in order to prevent the high death toll a virulent pandemic strain could produce.

Our results showed that, while overabundance of sequences from certain geographical areas does influence nucleotide pattern estimates, our approach of taking multiple subsets to infer substitution patterns and then comparing them can help ascertain a true trend and whether it is biased by sampling issues. Our results revealed consistency between the mean statistical values for the whole set and those of the subsets, indicating that the diversity and the rates of synonymous and non-synonymous changes are still captured using the subsets. Thus, these results enable us to use subsets of sequences from highly sampled countries and to combine them with smaller sequence sets from under-sampled countries to provide insights into the overall substitution patterns.

2.6 References

1. Dehner, George. (2012). Influenza. Pittsburgh, PA: University of Pittsburgh Press.

2. Parvin, Jeffrey D., et al. "Measurement of the mutation rates of animal viruses: influenza A virus and poliovirus type 1." Journal of Virology 59.2 (1986): 377-383.

3. Nicholls, John, Chan, Renee, Russell, Rupert, et al. Evolving Complexities of the Influenza Virus and its Receptors. Cell Trends in Microbiology 2008 16:4

4. Strengell, Mari, Ikonen, Niina, Ziegler, Thedi, Julkunen, Ilkka. Minor Changes in the Hemagglutinin of Influenza A(H1N1) 2009 Virus Alter its Antigenic Properties. PLoS ONE October 2011 6:10 e25848

5. Ward, Melissa J., Lycett, Samantha J., Avila, Dorita, Bollback, Jonathan P., & Leigh Brown, Andrew J. (2013). Evolutionary interactions between haemagglutinin and neuraminidase in avian influenza. BMC Evolutionary Biology 13:222

6. Thomson, C.A., Wang, Y., Jackson, L.M., et al. Pandemic H1N1 Influenza Infection and Vaccination in Humans Induces Cross-protective Antibodies That Target the Hemagglutinin Stem. Frontiers in Immunology, May 2012 3:87

7. Stoner, Terri D., Krauss, Scott, DuBois, Rebecca, Negovetich, et al. Antiviral Susceptibility of Avian and Swine Influenza Virus of the N1 Neuraminidase Subtype. Journal of Virology. Oct. 2010 p. 9800-9809.

8. Hurt, Aeron C. The epidemiology and spread of drug resistant human influenza viruses. Current Opinion in Virology, 2014 8:22-29

9. WHO Influenza Factsheet 2014 http://www.who.int/mediacentre/factsheets/fs211/en/

10. Fiore, Anthony E., Bridges, Carolyn B., Katz, Jacqueline M., & Cox, Nancy J. (2013). Inactivated influenza vaccines. In Stanley A. Plotkin, Walter A. Orenstein & Paul A. Offit (Eds.), Vaccines 6th ed. (257-293). Elsevier Saunders.

11. WHO Pandemic (H1N1) 2009: frequently asked questions http://www.who.int/csr/disease/swineflu/frequently_asked_questions/en/

12. Lin, J.-H., Chiu, S.-C., Cheng, H.-W., et al. Molecular Epidemiology and Antigenic Analyses of Influenza A Viruses H3N2 in Taiwan. Clin Microbiol Infect, 17: 214-222

13. Bragstad, Karoline, Nielsen, Lars, & Fomsgaard, Anders. (2008). The evolution of human influenza A viruses from 1999 to 2006: A complete genome study. Virology Journal 5:40

14. Suzuki, Yoshiyuki. (2006). Natural Selection on the Influenza Virus Genome. Mol. Biol. Evol. 23(10):1902-1911.

15. Kryazhimskiy, Sergey, Dushoff, Jonathan, Bazykin, Georgii A., & Plotkin, Joshua B. Prevalence of Epistasis in the Evolution of Influenza A Surface Proteins. PLoS Genetics 2011 7:2

16. Bao Y., P. Bolotov, D. Dernovoy, B. Kiryutin, L. Zaslavsky, T. Tatusova, J. Ostell, and D. Lipman. The Influenza Virus Resource at the National Center for Biotechnology Information. J. Virol. 2008 Jan;82(2):596-601.

17. Jukes T.H. and Cantor C.R. (1969). Evolution of protein molecules. In Munro HN, editor, Mammalian Protein Metabolism, pp. 21-132, Academic Press, New York.

18. Nei M. and Kumar S. (2000). Molecular Evolution and Phylogenetics. Oxford University Press, New York.

19. Tamura K., Peterson D., Peterson N., Stecher G., Nei M., and Kumar S. (2011). MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution 28: 2731-2739.

20. R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

21. RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.

22. Plotkin, Joshua B., Dushoff, Jonathan, & Levin, Simon A. (2002). Hemagglutinin sequence clusters and the antigenic evolution of the influenza A virus. PNAS 99:9:6263-6268

23. Gutiérrez, Ramona Alikiiteaga, et al. "A (H5N1) virus evolution in South East Asia." Viruses 1.3 (2009): 335-361.

24. Furuse, Yuki, Shimabukuro, Kozue, Odagiri, Takashi, et al. Comparison of selection pressures on the HA gene of pandemic (2009) and seasonal human and swine influenza A H1 subtype viruses. Virology. 2010. 405: 314-321

73 25. Karthikeyan, Muthusamy, Kirubakaran, Palani, Singh, Kh. Dhanachandra, et al.

Understanding the evolutionary relationship of hemagglutinin protein from

influenza viruses using phylogenetic and molecular modeling studies. Journal of

Biomolecular Structure and Dynamics. 2013.

26. Payungporn, S., Panjaworayan, N., Makkoch, J., and Poovorawan Y. Molecular

Characteristics of Human Pandemic Influenza A Virus (H1N1) Acta Virologica

2010 54: 155-163

27. López-Cervantes, Malaquías, et al. "On the spread of the novel influenza A

(H1N1) virus in Mexico." The Journal of Infection in Developing Countries 3.05

(2009): 327-330.

28. Chen, Jiming & Sun, Yingxue. (2011). Variation in the Analysis of Positively

Selected Sites Using Nonsynonymous/Synonymous Rate Ratios: an Example

Using Influenza Virus. PLoS ONE 6(5): e19996. Doi:

10.1371/journal.pone.0019996

29. Plotkin, J.B., Dushoff, J., Codon Bias and Frequency-dependent Selection on the

Hemagglutinin Epitopes of Influenza A Virus. Proc Natl Acad Sci USA 2003

100:7152-7157

30. Shih, A.C., Hsiao, T.C., Ho, M.S., Li W.H., Simultaneous Amino Acid

Substitutions at Antigenic Sites Drive Influenza A Hemagglutinin Evolution. Proc

Natl Acad Sci USA 2007 104:6283-6288

31. Wolf, Y. I., Viboud, C., Holmes, E. C., Koonin, E. V., & Lipman, D. J. (2006).

Long intervals of stasis punctuated by bursts of positive selection in the seasonal

evolution of influenza A virus. Biol Direct, 1(34), 357-360.

32. Yang, Z. (2000). Maximum likelihood estimation on large phylogenies and

analysis of adaptive evolution in human influenza virus A. Journal of Molecular

Evolution, 51(5), 423-432.

CHAPTER 3: GENOME-WIDE MOLECULAR SUBSTITUTION PATTERNS IN EBOLAVIRUS

3.1 Introduction

Although the first recorded outbreak of Ebolavirus was in 1976, it was not until the 2014 outbreak that the virus gained significant international press. Ebolavirus was considered a neglected tropical disease, with few samples and poor surveillance. Inadequate infrastructure, the danger and infectivity of the virus, and the small number of reported cases hampered study of Ebolavirus and sample collection (1). In 2014 the Ebolavirus outbreak reached large urban populations and spread internationally to Western countries, making research on the virus and its treatment a priority for resources. This led to more sequences and data. The remaining difficulty is ascertaining the long-term evolutionary trends in Ebolavirus given the scarcity of pre-2014 sequences and the concentration of 2014 outbreak sequences from Sierra Leone between September 2014 and 2016. Limited sequences from Liberia and Guinea are available, though efforts are being made to add sequences from these countries (2-4).

In this study, 217 complete Ebolavirus genome sequences were acquired, mainly from the 2014 outbreak but also from the smaller earlier outbreaks. These genomes, obtained from human patients, allow an in-depth look at this virus. For the first aim, whole genome alignments and phylogenetic analyses were used to generate phylogenetic trees and various descriptive statistics to show how the virus is changing and whether the evolutionary patterns differ between the massive 2014 outbreak and the smaller ones that came before.

Two techniques were used: subsets and phylogenetically independent pairs. These allowed both a broad picture of viral evolution (using subsets with many strains) and a more detailed view of short-term evolution (using phylogenetic “sister pairs”). Both epitope and non-epitope regions were examined, and non-synonymous (dN) and synonymous (dS) rates were estimated.

3.2 Hypothesis

Hypothesis 1: Due to the spread and length of the 2014 outbreak (the largest and longest so far (5)), mutations will have occurred at a higher rate than seen before.

Hypothesis 2: Due to extended contact with multiple human hosts there will be a higher number of changes to epitope regions in response to selective pressure from the human immune system than to non-epitope regions, compared to previous epidemics.

3.3 Methods

3.3.1 Sequences Collected

Sequences were obtained from the Virus Pathogen Database and Analysis Resource (VIPR) (6) Ebolavirus database. All available whole genomes as of February 2016 were used. (See Supplementary Table 3.1)

Table 3.1: Sequences used and countries and outbreaks of origin

Outbreak     Country                        Number of sequences
2014         Sierra Leone                   147
             Liberia                        11
             Guinea                         9
             Mali                           4
             United Kingdom                 3
             Democratic Republic of Congo   2
             USA                            2
             Italy                          1
             Switzerland                    1
2007         Democratic Republic of Congo   7
2002-2003    Democratic Republic of Congo   1
             Gabon                          1
1994-1996    Democratic Republic of Congo   16
             Gabon                          6
1976         Democratic Republic of Congo   6

3.3.2 Sampling Subsets

Due to the large number of 2014-2015 outbreak sequences and the small number of sequences for prior outbreaks, subsets were used. The results for 50 randomly chosen 2014-2015 outbreak sequences (drawn using R and RStudio (7,8)) were combined with the results for the sequences from the other outbreaks. These were pulled by accession number from the pairwise non-synonymous (dN) and synonymous (dS) runs for all sequences. This was repeated 1000 times. The resulting dN and dS averages and ranges are reported below.
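The resampling scheme described above can be sketched as follows. This is a minimal illustration in Python rather than the R code actually used, and it assumes that the pairwise dN or dS values have already been computed for all sequences and stored by accession pair. The function name `subset_means` and the dictionary layout are hypothetical.

```python
import random
from statistics import mean


def subset_means(pairwise, ids_2014, ids_other, n_draw=50, n_rep=1000, seed=0):
    """Mean pairwise value over repeated random subsets (illustrative sketch).

    pairwise:  dict mapping frozenset({id_a, id_b}) -> precomputed dN or dS
    ids_2014:  accession numbers from the oversampled outbreak
    ids_other: accession numbers from all other outbreaks (always kept)
    Returns (mean of subset means, smallest subset mean, largest subset mean).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_rep):
        # Draw n_draw sequences without replacement, then add every other outbreak.
        chosen = rng.sample(list(ids_2014), n_draw) + list(ids_other)
        # Look up the precomputed value for every pair inside this subset.
        values = [pairwise[frozenset((a, b))]
                  for i, a in enumerate(chosen) for b in chosen[i + 1:]]
        means.append(mean(values))
    return mean(means), min(means), max(means)
```

Because the values are only looked up, not re-estimated, each replicate is cheap, which is what makes 1000 repetitions practical.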

3.3.3 Estimation of synonymous and non-synonymous substitution patterns

Average dN and dS (number of nonsynonymous substitutions per nonsynonymous site and synonymous substitutions per synonymous site, respectively) and nucleotide diversity values (average diversity of the population) (9, 10) were estimated using the Nei-Gojobori method with Jukes-Cantor correction as implemented in MEGACC, with 500 bootstrap replications for each set of sequences for the dN and dS (11). Whether the dN and dS distributions were significantly different from each other was ascertained using paired t-tests in RStudio (8) with the function t.test().
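Two of the calculations named above are simple enough to illustrate directly. The sketch below is not the MEGACC or R implementation. It shows only the Jukes-Cantor correction applied to a proportion of differing sites (as in the final step of the Nei-Gojobori method) and the paired t statistic that t.test() reports for paired data. The function names are hypothetical, and no p value is computed.

```python
import math
from statistics import mean, stdev


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    # d = -(3/4) * ln(1 - (4/3) * p)
    return -0.75 * math.log(1 - 4 * p / 3)


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two equal-length samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

For the small proportions typical of these data the correction changes the estimate very little, since the corrected distance approaches p as p approaches zero.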

3.3.4 Phylogenetically Independent Pairs

To further investigate trends of mutation within and between outbreaks, phylogenetic pairs were obtained. Phylogenetic pairs were chosen from a tree constructed using the neighbor-joining method (16) with maximum composite likelihood (MCL) distance (11). The tree was created in MEGA 7 (12). The pairs selected had an internal branch length above 0. The majority have >70% bootstrap support: 20 of the 36 pairs from the 2014 outbreak and 7 of the 9 pairs from other outbreaks are supported. These pairs are shown in Figure 3.1. The tree topologies were essentially the same across different approaches to tree construction. The supported pairs were largely still supported at >60% bootstrap, with 19 out of 27 pairs being supported in three different models (neighbor-joining, maximum-likelihood, and maximum parsimony) and an additional four pairs being supported in two out of three models. Using these “sister pairs” enables detailed analysis of mutation patterns temporally and spatially. This is because all substitutions between the two strains have occurred since their last common ancestor and are therefore independent of those in other pairs (13).
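The idea of a sister pair can be made concrete with a small sketch. It assumes the tree is already available as nested tuples with leaf names as strings, which is a hypothetical representation chosen for illustration. The branch-length and bootstrap filters described above are not applied, and the function name `sister_pairs` is likewise an assumption.

```python
def sister_pairs(tree):
    """Return every pair of leaves that share an immediate parent node.

    tree: nested tuples; a leaf is a string (for example an accession number).
    """
    pairs = []

    def walk(node):
        if isinstance(node, str):
            return  # a single leaf cannot form a pair on its own
        if len(node) == 2 and all(isinstance(child, str) for child in node):
            pairs.append(tuple(node))  # two leaves joined at one node
            return
        for child in node:
            walk(child)

    walk(tree)
    return pairs
```

For a tree such as ((("A", "B"), "C"), (("D", "E"), ("F", "G"))), the pairs are A-B, D-E and F-G, while C has no sister leaf and is left out.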

3.3.5 Epitope Definition and Mapping

Epitopes were obtained from the Immune Epitope Database (http://www.iedb.org/) (14). All Ebolavirus Zaire epitopes available in February 2016 were obtained, sorted, and mapped to the alignment (Supplementary Table 3.2).

3.4 Results

3.4.1 Comparison of Substitution Rates Between 2014 Outbreak and Prior Outbreaks

Table 3.2: Values for Different Outbreaks. Whole gene, epitope, and non-epitope regions’ dN (non-synonymous rate) and dS (synonymous rate) for each gene of the Ebolavirus and each outbreak are shown.

dS, whole gene
Gene   1976      1994-1996  2002-2003  2007      2014
NP     0.00068   0.01445    0.00213    0.00058   0.00327
VP35   0         0.01108    0          0         0.00264
VP40   0.00136   0.01106    0.00409    0.00194   0.00237
GP     0         0.01207    0.01495    0.00059   0.00330
VP30   0         0.00709    0.00494    0.00284   0.00302
VP24   0.00193   0.00583    0.00573    0         0.00356
L      0.00044   0.0117     0.01193    0.00037   0.00413

dS, epitope
Gene   1976      1994-1996  2002-2003  2007      2014
NP     0.00082   0.01053    0.00254    0.0007    0.00282
VP35   0         0.00645    0          0         0.00098
VP40   0         0          0          0         0.00183
GP     0         0.00979    0.00486    0.00069   0.00354
VP30   0         0          0          0         0
VP24   0         0          0          0         0.00634
L      0         0.01313    0.01843    0         0.00399

dS, non-epitope
Gene   1976      1994-1996  2002-2003  2007      2014
NP     0         0.03356    0          0         0.00590
VP35   0         0.01405    0          0         0.00381
VP40   0.00152   0.01234    0.00455    0.00217   0.00244
GP     0         0.02568    0.08965    0         0.00184
VP30   0         0.00782    0.00545    0.00313   0.00334
VP24   0.00221   0.0067     0.00661    0         0.00315
L      0.00047   0.01159    0.01143    0.0004    0.00414

dN, whole gene
Gene   1976      1994-1996  2002-2003  2007      2014
NP     0         0.00157    0.00061    0         0.00049
VP35   0         0.00053    0.00134    0.00037   0.00036
VP40   0         0.00036    0          0         0.00017
GP     0.00039   0.00187    0.00466    0         0.00168
VP30   0         0.00163    0.00305    0         0.00004
VP24   0.00058   0.00072    0.00521    0         0.00010
L      0         0.00105    0.00118    0.00026   0.00029

dN, epitope
Gene   1976      1994-1996  2002-2003  2007      2014
NP     0         0.00133    0.00073    0         0.00035
VP35   0         0          0.00327    0.00094   0.00022
VP40   0         0          0          0         0.00016
GP     0         0.00079    0.00463    0         0.00193
VP30   0         0          0          0         0.00019
VP24   0.00378   0          0          0         0.00038
L      0         0          0          0         0

dN, non-epitope
Gene   1976      1994-1996  2002-2003  2007      2014
NP     0         0.00261    0          0         0.00116
VP35   0         0.00088    0          0         0.00047
VP40   0         0.0004     0          0         0.00018
GP     0.00246   0.00764    0.00411    0         0.00033
VP30   0         0.0018     0.00336    0         0.00002
VP24   0         0.00085    0.00615    0         0.00005
L      0         0.00113    0.00126    0.00028   0.00031

As seen in Table 3.2, the 2014 outbreak does not have the highest dN or dS values for whole genes or for non-epitope regions; depending on the gene, the 1994-1996 or the 2002-2003 outbreak does. The 2014 outbreak does, however, have the highest epitope dS and dN values for several genes (VP40, VP24), and also the highest epitope dN value in VP30. The 1976 and 2007 strains show the least variation, which may be owing to the small sample sizes for those two outbreaks.

3.4.2 Subsets per Prior examples

Initially, dN and dS rates were obtained for the whole sequence set. This gives an overall idea of which genes are the most variable. Subsets were then taken from the 2014 set of sequences to balance the oversampling of the latest outbreak against the undersampling of the previous outbreaks. The whole gene values are given in Table 3.3.

Table 3.3: Whole gene values compared to subset means. The synonymous (dS) and non-synonymous (dN) rate values are shown by gene. The subset dN and dS values are the means of the whole gene values for each subset. Subset dN and subset dS are significantly different from each other using a paired t-test (p value <.0001).

As shown in Table 3.3, the rates are small, as is expected in Ebolavirus given its slower evolutionary rates (15). The dN rates for GP and NP are higher, which may reflect immune system interaction. Although the magnitude of the rates is larger when random 2014 outbreak subsets are used, the patterns still hold when proteins are ranked in order.

Epitopes are regions where the virus interacts with the immune system (specifically antibodies, B cells, or T cells). For testing epitope and non-epitope regions, the percentage of positive assays was used to provide thresholds. Above the 10% threshold this eliminated three of the seven proteins (VP40, VP24, and VP30). NP and GP had the most tested epitope regions, with 2121/2217 nucleotides identified as epitope at the 10% threshold in NP and 1914/2028 in GP. Raising the threshold yields epitopes that are more likely to be true epitopes. Comparing the non-synonymous and synonymous rates for epitope and non-epitope regions as the thresholds increase allows insight into potential evolutionary patterns.
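The thresholding step can be sketched as a filter followed by a mapping from amino-acid coordinates to codon positions. This is only an illustration under stated assumptions: each epitope record is taken to be a tuple of 1-based inclusive amino-acid start and end plus counts of positive and total assays, an epitope is kept when its positive fraction is at or above the threshold, and the function name `epitope_sites` is hypothetical. The actual IEDB export format and the handling of alignment gaps are not modeled.

```python
def epitope_sites(epitopes, threshold):
    """Nucleotide positions (0-based) covered by epitopes passing a threshold.

    epitopes:  list of (aa_start, aa_end, positive, tested) tuples, with
               1-based inclusive amino-acid coordinates and assay counts
    threshold: minimum fraction of positive assays, e.g. 0.5 for 50%
    """
    sites = set()
    for aa_start, aa_end, positive, tested in epitopes:
        if tested and positive / tested >= threshold:
            for aa in range(aa_start, aa_end + 1):
                first = 3 * (aa - 1)  # first nucleotide of this codon
                sites.update((first, first + 1, first + 2))
    return sites
```

Overlapping epitopes are counted once because positions are collected in a set, so the number of epitope sites for a gene is simply the size of the returned set, and the non-epitope sites are the remaining positions of the gene.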

Table 3.4: Epitope and Non-epitope values. Shown are all the dN and dS values (epitope and non-epitope regions) for each protein for both subsets and the entire 217 sequences (all). The threshold levels are the percentage of positive assay results for each gene’s epitopes. The p values are for a paired t-test between epitope and non-epitope dN and dS. NAs are for genes that had no epitopes above that threshold level. NE = non-epitope, Ep = epitope.

                 Subset means                    All 217 sequences               Epitope Gene    P value P value
Threshold Gene   NE dN   Ep dN   NE dS   Ep dS   NE dN   Ep dN   NE dS   Ep dS   sites   length  (dN)    (dS)
10%       NP     0.0045  0.0037  0.1139  0.0599  0.0020  0.0023  0.0628  0.0326  2121    2217    <.0001  <.0001
          VP35   0.0026  0.0032  0.0497  0.0165  0.0016  0.0017  0.0265  0.0093  393     1020    <.0001  <.0001
          VP40   0.0021  0.0001  0.0468  0.0057  0.0012  0.0001  0.0248  0.0038  99      978     <.0001  <.0001
          GP     0.0003  0.0085  0.0348  0.0548  0.0002  0.0053  0.0193  0.0305  1914    2028    <.0001  <.0001
          VP30   0.0008  0.0009  0.0420  0.0000  0.0003  0.0005  0.0242  0.0000  81      864     <.0001  <.0001
          VP24   0.0015  0.0005  0.0563  0.0897  0.0007  0.0004  0.0319  0.0527  111     753     <.0001  <.0001
          L      0.0033  0.0000  0.0576  0.0675  0.0019  0.0000  0.0325  0.0383  440     6636    <.0001  <.0001
50%       NP     0.0078  0.0029  0.1062  0.0532  0.0049  0.0017  0.0589  0.0289  1809    2217    <.0001  <.0001
          VP35   0.0025  0.0038  0.0437  0.0203  0.0015  0.0020  0.0234  0.0114  333     1020    <.0001  <.0001
          VP40   NA      NA      NA      NA      NA      NA      NA      NA      0       978     NA      NA
          GP     0.0045  0.0087  0.0309  0.0577  0.0022  0.0055  0.0165  0.0322  1713    2028    <.0001  <.0001
          VP30   NA      NA      NA      NA      NA      NA      NA      NA      0       864     NA      NA
          VP24   NA      NA      NA      NA      NA      NA      NA      NA      0       753     NA      NA
          L      0.0032  0.0000  0.0593  0.0148  0.0018  0.0000  0.0330  0.0074  102     6636    <.0001  <.0001
75%       NP     0.0042  0.0022  0.0611  0.0607  0.0026  0.0011  0.0334  0.0328  429     2217    <.0001  <.0001
          VP35   0.0025  0.0038  0.0437  0.0203  0.0015  0.0020  0.0234  0.0114  333     1020    <.0001  <.0001
          VP40   NA      NA      NA      NA      NA      NA      NA      NA      0       978     NA      NA
          GP     0.0070  0.0103  0.0589  0.0419  0.0044  0.0062  0.0331  0.0225  597     2028    <.0001  <.0001
          VP30   NA      NA      NA      NA      NA      NA      NA      NA      0       864     NA      NA
          VP24   NA      NA      NA      NA      NA      NA      NA      NA      0       753     NA      NA
          L      0.0032  0.0000  0.0590  0.0205  0.0018  0.0000  0.0332  0.0099  75      6636    <.0001  <.0001
90%       NP     0.0040  0.0024  0.0616  0.0578  0.0024  0.0013  0.0337  0.0303  294     2217    <.0001  <.0001
          VP35   0.0022  0.0058  0.0371  0.0290  0.0013  0.0032  0.0198  0.0170  174     1020    <.0001  <.0001
          VP40   NA      NA      NA      NA      NA      NA      NA      NA      0       978     NA      NA
          GP     0.0064  0.0159  0.0572  0.0362  0.0041  0.0092  0.0320  0.0196  396     2028    <.0001  <.0001
          VP30   NA      NA      NA      NA      NA      NA      NA      NA      0       864     NA      NA
          VP24   NA      NA      NA      NA      NA      NA      NA      NA      0       753     NA      NA
          L      0.0032  0.0000  0.0590  0.0205  0.0018  0.0000  0.0332  0.0099  75      6636    <.0001  <.0001

As shown in Table 3.4, across thresholds there is relatively little sequence divergence. Epitope non-synonymous rates are consistently higher than non-epitope rates in GP at all thresholds, from 10%, where 1914/2028 nucleotides are in epitope regions, to 90%, where only 396/2028 nucleotides are. For the synonymous rates this holds only through the 50% threshold; from the 75% threshold onward the non-epitope dS is larger. For NP the non-epitope dN is consistently larger, as is the non-epitope dS, although the difference becomes smaller at higher thresholds. For the L gene, non-epitope dN is always higher because epitope dN is zero. Epitope dS is higher only at the 10% threshold, where 440/6636 sites are epitope; this drops to only 75 sites at the 75% threshold. For VP35, epitope dN is consistently higher, whereas epitope dS is lower than non-epitope dS.

Table 3.5: Epitope and Non-epitope dN and dS values. Epitope threshold levels are shown for four Ebolavirus proteins. The cells marked <.0001 (green in the original table) indicate the trends, with all of these having paired t-test p values of <.0001. Also shown are the number of epitope sites and the number of total sites per gene.

                 Epitope dN >    Non-epitope dN >  Epitope dS >    Non-epitope dS >  Epitope  Gene
Threshold Gene   Non-epitope dN  Epitope dN        Non-epitope dS  Epitope dS        sites    length
10%       NP     0               <.0001            0               <.0001            2121     2217
          VP35   <.0001          0                 0               <.0001            393      1020
          GP     <.0001          0                 <.0001          0                 1914     2028
          L      0               <.0001            <.0001          0                 440      6636
50%       NP     0               <.0001            0               <.0001            1809     2217
          VP35   <.0001          0                 0               <.0001            333      1020
          GP     <.0001          0                 <.0001          0                 1608     2028
          L      0               <.0001            0               <.0001            102      6636
75%       NP     0               <.0001            0               <.0001            429      2217
          VP35   <.0001          0                 0               <.0001            333      1020
          GP     <.0001          0                 0               <.0001            597      2028
          L      0               <.0001            0               <.0001            75       6636
90%       NP     0               <.0001            0               <.0001            294      2217
          VP35   <.0001          0                 0               <.0001            174      1020
          GP     <.0001          0                 0               <.0001            396      2028
          L      0               <.0001            0               <.0001            75       6636

As shown in Table 3.5, four of the seven proteins had epitopes with a positive assay percentage above 10%. The patterns shown are the means of the dN and dS values taken after subsets of 2014 strains were combined with all other outbreak sequences. The epitopes were narrowed down using positive assay percentage thresholds.

Table 3.5 also shows that the epitope dS measurements are lower than the non-epitope dS for all four proteins at the 75% threshold and above. However, the epitope dN is higher than the non-epitope dN in VP35 and GP at all levels. This may indicate that there are still highly variable sites or regions in these two proteins despite the mutation rate being low overall. As both of these proteins are antigenically important and GP is the basis for the current vaccines, we analyze this phenomenon in greater detail below.

3.4.3 Analysis of Patterns of Evolution Using Phylogenetic Pairs

To further investigate trends of mutation within and between outbreaks, phylogenetic pairs were obtained. Phylogenetically independent pairs were chosen from a neighbor-joining (NJ) tree built with MCL distance. The pairs selected had a branch length greater than zero. The majority of pairs, 20 of the 36 pairs from the 2014 outbreak and seven of the nine pairs from other outbreaks, have >70% bootstrap support. These pairs are shown in the tree below. Using these pairs enables detailed analysis of mutation patterns temporally and spatially.

Figure 3.1: Phylogenetic Tree of Ebolavirus Sequences. This tree was created using the neighbor-joining method (16) with maximum composite likelihood (MCL) distance. It was used to identify the phylogenetically independent pairs.

Colored boxes: yellow are the 2014 outbreak pairs, orange is the 2007 outbreak pair, purple is the 2002-2003 outbreak pair, green are the 1976 outbreak pairs, red are the 1994-1996 outbreak pairs and blue is the DRC 2014 outbreak pair

Figure 3.2: Phylogenetic pairs mean dN. The mean values of within (x-axis) and between (y-axis) group dN (non-synonymous rate) measurements for phylogenetic pairs are shown for each gene by epitope (green), non-epitope (blue), and whole gene (purple). Baseline epitope levels were used for the epitope regions (10% for L, VP24, VP30, VP35, VP40, and 50% for NP and GP).

As shown in Figure 3.2, both the within and between pair mean dN values are small, though the between pair dN is several orders of magnitude larger. The GP values for all three categories are large compared to the other genes, while NP has very similar within pair dN values with a higher non-epitope between pair mean dN. The VP gene values tend to cluster, although the within pair mean dN for the epitope regions is slightly higher.

Figure 3.3: Phylogenetic pairs mean dS. The mean values of within (x-axis) and between (y-axis) group dS (synonymous rate) measurements for phylogenetic pairs are shown for each gene by epitope (green), non-epitope (blue), and whole gene (purple). Baseline epitope levels were used for the epitope regions (10% for L, VP24, VP30, VP35, VP40, and 50% for NP and GP).

For the dS measurements (Figure 3.3), most values cluster tightly. The exceptions are the lower epitope region values in three of the four VP proteins (VP24 actually has rather high values) and the non-epitope NP and GP values. This may be due to structural constraints in these regions, which make synonymous changes far more likely to be preserved.

Figure 3.4: GP Between Pairs dN. Shown above are the epitope and non-epitope 50% threshold level values for each 2014 between pair comparison. The size of each point reflects the number of comparisons with this value. The red line indicates where epitope dN is equal to non-epitope dN

For the GP non-synonymous rates between 2014 pairs, though there are many points with zero or nearly zero values, the remaining points display a pattern of epitope dN being much larger than non-epitope dN (Figure 3.4). This pattern of higher rates of non-synonymous change in epitope regions could be indicative of a trend of divergence driven by immune system pressure within individual hosts across geographical regions and times.

Figure 3.5: GP Within Pairs dN. Shown above are the epitope and non-epitope 50% threshold level values for each 2014 within pair comparison. The size of each point reflects the number of comparisons with this value. The red line indicates where epitope dN is equal to non-epitope dN

As shown in Figure 3.5, the majority of within pair GP comparison values are 0. The few non-zero values are small but show a slight trend towards purifying selection in these closely related sequences.

Figure 3.6: Phylogenetic pairs within group values. Shown above are the within pair values of dN and dS (y-axis) by pair number (x-axis). Values of zero have been omitted for clarity. Whole gene values are red, epitope region only values are green, and non-epitope region only values are blue. Baseline epitope levels were used (10% for L, VP24, VP30, VP35, VP40, and 50% for NP and GP).

As shown in Figure 3.6, the mid-tree 2014 pairs and the other outbreak pairs (starting with pair 37) have the greatest number of non-zero dN and dS values. As expected, dS measurements are greater than dN for most pairs (see Supplementary Figure 3.1). Several pairs, especially the mid-tree 2014 pairs, have higher epitope values than non-epitope or whole gene values. All of the genes have some non-zero measurements, with GP having the majority. All of the values are relatively small, as expected when comparing closely related sequences. Among the between pair values, the largest are those between outbreaks, as expected.

Figure 3.7: dN-dS values for within phylogenetic pair comparisons from the 2014 outbreak. Shown are the pairs with non-zero GP, L, and/or VP35 values for the whole gene dN-dS in 2014 phylogenetic pairs. Pair number from left to right: 16, 36, 22, 34, 18, 23, 12, 13, 6, 7, 4, 1, 24, 11 and 25. The red line represents neutrality, where dN=dS

As shown in Figure 3.7, GP non-zero values for the 2014 pairs show a slight trend towards positive selection (diversifying selection, or selection for change). This is indicated by dN-dS values above the zero line, which show that changes that alter the amino acid occur at a higher rate than those that do not. L is the gene with the most non-zero values and displays signs of the opposite, purifying selection (negative selection, or selection to keep the protein as it is). As L is the polymerase gene, this is expected. VP35 has few non-zero within pair values, all of which point towards purifying selection.
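The reading of the dN-dS plots used here can be stated as a small rule. The sketch below is only an illustration of that sign convention, with a hypothetical function name and an optional tolerance that is an assumption rather than part of the analysis. It says nothing about statistical significance.

```python
def selection_signal(dn, ds, tol=0.0):
    """Classify a single comparison by the sign of dN - dS."""
    diff = dn - ds
    if diff > tol:
        return "positive"   # above the neutrality line: diversifying selection
    if diff < -tol:
        return "purifying"  # below the neutrality line: negative selection
    return "neutral"        # on the line, where dN equals dS
```

Pairs in which both dN and dS are zero fall on the neutrality line and carry no signal, which is consistent with their omission from the figures.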

Figure 3.8: dN-dS values for within phylogenetic pair comparisons from the other, non-2014, outbreaks. Shown are the pairs with non-zero GP, L, and/or VP35 values for the whole gene dN-dS in all other outbreaks’ phylogenetic pairs. Pair number from left to right: 43 (DRC 2014), 45 (1994-96), 39 (1976), 37 (2007), 41 (1994-96), and 38 (2002-03). The red line represents neutrality, where dN=dS.

GP non-zero values for the other outbreak pairs show a slight trend towards positive selection, similar to the 2014 pairs, for half of the non-zero pairs. The L gene had only a few non-zero points but still displays signs of purifying selection, as is expected for a polymerase gene. The two non-zero VP35 points show a slight trend towards positive selection, which is different from the 2014 outbreak pairs.

Figure 3.9: GP 50% epitope dN-dS. Shown above are the non-zero dN-dS values for the 50% epitope threshold for GP for all pairs from all outbreaks. The 2014 outbreak pairs are blue diamonds and the other outbreak pairs are squares. From left to right pair number: 16, 22, 34, 43 (DRC 2014), 23, 38 (2002-03), 41 (1994-96), 37 (2007), and 12. All pairs had values of zero for L. The red line represents neutrality, where dN=dS.

As shown in Figure 3.9, when just the epitope regions are considered, the 2014 pairs show a stronger shift toward diversifying (positive) selection and the pairs from earlier outbreaks show a purifying (negative) selection trend. This suggests that selection promoting change in GP epitope regions may be increasing.

3.5 Discussion and Conclusions

As shown in Table 3.2, with the exception of the epitope regions of three of the VP proteins (VP24, VP40, and VP30), the mutation rate for the 2014 outbreak is smaller than for previous outbreaks. This could be due to the much larger number of strains sequenced for this outbreak compared to previous ones. To investigate this further, 1000 subsets were taken, each including the previous outbreak strains (37 of them) and a smaller number of 2014 strains (50 of them, randomly sampled), in order to look at overall trends in the evolution of the whole gene and of epitope and non-epitope regions. Phylogenetically independent pairs were also taken from all outbreaks and compared.

3.5.1 Location and Definition of Epitopes- Challenges and Current Availability

The surface glycoprotein (GP), which is used for cell attachment and entry, is the most studied Ebolavirus protein. GP has the most impact on the immune system because it encodes multiple proteins that impair the immune response. This multi-tasking, and its importance in defending the virus against the immune system, is why GP has the largest dN rates in epitope regions (and the second largest in non-epitope regions). The high dS rate in epitope regions may indicate changes in RNA that help avoid the immune system or that make translation easier by changing RNA secondary structure and therefore how that structure interacts with other structures. The current vaccine utilizes this gene. It is therefore not surprising that the majority of epitopes in the database are GP epitopes.

The other gene with a multitude of mapped epitopes is the nucleoprotein, NP. Because NP is necessary for both virus replication and assembly, it is a common protein for degradation and presentation, and has a multitude of confirmed MHC epitopes (17). The non-epitope regions are short and tucked in between structurally important regions. This is why the mutation rates are higher in the non-epitope regions, as most of the epitope regions are constrained. Though advantageous for the human immune response, this poses a challenge for the virus. The NP protein does not have any known pathogenic effects.

VP35 had enough mapped epitopes to use the threshold approach. It has multiple roles, including facilitating RNA synthesis and impairing the maturation of dendritic cells by blocking RIG-I-like receptor (RLR) signaling, which also inhibits the type I interferon response (18). Inhibition of RNA interference (RNAi) has also been demonstrated (19). In VP35 the epitope regions tend to be outside of the alpha helix regions. This may be because less structured regions are easier to break into fragments that remain intact for MHC processing. As these regions are less structurally constrained, the dN is higher and the dS is lower than in the non-epitope regions.

The L gene, which was also used for the thresholds, is the highly conserved polymerase gene. This is why its epitope regions are so small and so few and far between.

The other VP genes play important roles in the viral life cycle and in interacting with the immune system but are not as well studied. They are also much shorter and therefore have fewer epitope sites. They are, however, mapped and will be discussed further in the next chapter.

3.5.2 Outcome of Epitope Thresholds and Potential Target Areas for Treatment

Initially the GP and NP epitopes were defined as those that had positive results in at least fifty percent of tests per IEDB. This resulted in the vast majority of these regions being labeled epitope. Thus we used stricter 75 and 90 percent cut-off thresholds. The rates of substitution in epitope regions change by less than ~0.007 even when the threshold is moved. Although the dS in NP seems to be more evenly distributed across thresholds, there is still a difference in dN. For GP, non-epitope dN is the only measurement to remain similar. The dS rates flip from epitope being larger to epitope being smaller, but they become closer in magnitude. The epitope dN rate almost doubles at the 90% threshold, which may be an indication of selection instead of neutral evolution. The other proteins did not have enough epitopes to go beyond a 10% threshold for testing.

The Becquart et al. 2014 (20) study is the source for epitopes that fall above both the 50% and 90% thresholds. The study was focused on identifying B cell epitopes in humans for Ebolavirus. For the 90% group in GP, all of the epitopes in this study fell in the mucin domain (amino acids 313-464 (21)). They form two groups of peptides that overlap with each other:

    TVVSNGAKNISGQSP, NISGQSPARTSSDPG, DPGTNTTTEDHKIMA, NTTTEDHKIMASENS, ENSSAMVQVHSQGRE

    KPGPDNSTHNTPVYK, HNTPVYKLDISEATQ, VYKLDISEATQVEQH

These were found in both survivor and asymptomatic cases. The GP epitope from Becquart (20) that fell in the 50% threshold (IRSEELSFTVVSNGA) overlapped with the first few amino acids of the mucin domain, but the majority of the epitope was in the glycan cap region (21). The NP epitopes in the 90% group from this study (LFDLDEDDEDTKPVP, RSTKGGQQKNSQKGQ, and the overlapping pair TSGHYDDDDDIPFPG and DDIPFPGPINDDDNP) were all found in both survivors and asymptomatic cases. However, the 50% epitope (HLGLDDQEKKILMNF) was only found in survivors.

The majority of epitopes overall were deposited into the database (IEDB) by a group from Duke in 2009 and had no published papers linked to them. The paper found to be associated with both the 50% and 75% NP epitopes was Sundar et al. 2007 (22). These epitopes were tested for class I MHC binding and IFN-γ activation. FLSFASLFL was found to be a high binder, and YQGDYKLFL a non-binder that was further tested in other studies. These were in the 75% threshold group. The 50% group contained KLTEAITAA and RLMRTNFLI, with the second best able to activate IFN-γ.

For VP35 the pattern is an increase in mutation rate in epitope regions as thresholds increase. The 50 and 90 percent threshold epitopes are mainly B cell in origin. This points to potential importance of VP35 in antibody studies.

3.5.3 Insights from Phylogenetic Pairs

Table 3.6: Nonzero Within Pair Values. Shown are the phylogenetic pairs ordered by number of non-zero measurements (smallest to largest). Non-zeroes are the number of non-zero within pair measurements per pair. Bootstrap support indicates whether a pair has a bootstrap value over 70%. Outbreak is the outbreak year(s). Sequence 1 corresponds to Country 1 and Date 1. Sequence 2 corresponds to Country 2 and Date 2.

Pair  Nonzeroes  Sequence 1  Sequence 2  Outbreak   Bootstrap  Country 1     Country 2     Date 1  Date 2
4     3          KT357826    KT357827    2014       Yes        Sierra Leone  Sierra Leone  01 15   01 15
6     3          KT357850    KT357837    2014       No         Sierra Leone  Sierra Leone  02 15   01 15
7     3          KM233093    KM233116    2014       No         Sierra Leone  Sierra Leone  06 14   06 14
11    3          KP240931    KT589389    2014       Yes        Sierra Leone  Sierra Leone  09 14   09 14
14    3          KT357858    KT357859    2014       Yes        Sierra Leone  Sierra Leone  07 15   07 15
25    3          KP240934    KP240933    2014       Yes        USA           Liberia       10 14   10 14
33    3          KM233084    KM233075    2014       No         Sierra Leone  Sierra Leone  06 14   06 14
34    3          KM034561    KM034563    2014       Yes        Sierra Leone  Sierra Leone  05 14   05 14
35    3          KM034550    KM034559    2014       Yes        Sierra Leone  Sierra Leone  05 14   05 14
40    3          KC242801    KC242791    1976 1977  No         DRC           DRC           1976    1977
45    3          KR063672    KU182903    1994 1996  Yes        DRC           DRC           04 95   05 95
43    4          KP271020    KP271018    2014       Yes        DRC           DRC           08 14   08 14
17    5          KT357816    KT357813    2014       No         Sierra Leone  Sierra Leone  02 15   02 15
39    5          KM655246    KR063671    1976 1977  No         DRC           DRC           1976    10 76
13    6          KT357823    KP701371    2014       Yes        Sierra Leone  Italy         03 15   11 14
18    6          KR025228    KT357821    2014       No         UK            Sierra Leone  03 15   03 15
22    6          KR074998    KR074997    2014       No         Liberia       Liberia       2015    2015
36    6          KJ660348    KP096420    2014       Yes        Guinea        Guinea        2014    03 14
1     8          KT357844    KT357838    2014       Yes        Sierra Leone  Sierra Leone  01 15   01 15
20    8          KM233109    KM233035    2014       No         Sierra Leone  Sierra Leone  06 14   06 14
12    9          KP728283    KM233115    2014       No         Switzerland   Sierra Leone  11 14   06 14
24    11         KP260802    KP260799    2014       Yes        Mali          Mali          11 14   10 14
16    12         KP184503    KT345616    2014       Yes        UK            Sierra Leone  08 14   02 15
37    12         KC242788    KC242789    2007       Yes        DRC           DRC           2007    2007
23    17         KR075000    KR075001    2014       Yes        Liberia       Liberia       2015    2014
41    17         KC242795    KC242792    1994 1996  Yes        Gabon         Gabon         96      94
38    29         KC242800    KF113528    2002 2003  Yes        Gabon         DRC           2002    2003

The greatest diversity within the phylogenetic pairs was found in the pairs from the other outbreaks (Table 3.6). This is due in part to some of these pairs being derived from different countries, as is the case for pair 38, which has 29 non-zero dN and dS measurements. These sequences are from the 2002-2003 outbreaks that happened in Gabon and the Democratic Republic of Congo (DRC): one is from Gabon in 2002 and the other is from the DRC in 2003.

The second most diverse pair, pair 41, resulted from the outbreaks that occurred in Gabon from 1994 to 1996. One is from 1994 (KC242792) and one is from 1996 (KC242795). The sequences were both submitted by the Carroll group from the 2013 paper (3) through the CDC. This lab also submitted the 2007 pair (37, KC242788, KC242789), which has the fourth most non-zero measurements at 12.

Other than pairs from different years and countries, there are still a few pairs with a high number of non-zero dN and dS values, including pair 23 (KR075000 and KR075001) from Liberia. These two were sampled in 2015 and 2014 respectively and sequenced by Hoenen, T. and Feldmann, H. through NIH as part of a study of nanopore sequencing (23). Pair 1 from Sierra Leone (KT357844 and KT357838), like pair 23, also has a high number of non-zeroes. These sequences are both from January 2015. It is a supported pair with 8 non-zero measurements. Both were submitted by the same group (24). In the NCBI database they are listed as “Hypermutated Ebolavirus circulating in Magazine Wharf area, Freetown, Sierra Leone”. They are at the top of the tree and most closely related to pair 2, which has no non-zero measurements.

Overall, there is still a trend of little variation between closely related sequences and small mutation rates. The patterns of polymorphisms will be looked at in the next chapter to elucidate in more detail the possible importance of what variation there is.

3.5.4 Implications for the Current Vaccine and Potential New Targets

Although there is a lack of well-studied epitopes for Ebolavirus, and many in the database are either non-human or predicted only, for two of the proteins (VP35 and GP) there is a clear pattern of sequence changes when positive assay thresholds are increased. This shows that while substitutions are occurring at a slower rate than expected in an RNA virus, they are still happening and may even be due to human immune system pressure. This is of vital importance because GP is the protein used in the approved vaccine, and only a few mutations may be needed to escape the immune protection that the available vaccine provides. It could also indicate that with further human outbreaks the virus could adapt to the human immune system as influenza has in the past. This is especially true of the strains that evolved from the highly virulent and fatal 1918 influenza pandemic strain, which, although they still caused outbreaks, resulted in far fewer fatalities (25). The difficulty is that mutations can cut both ways: depending on where they take place and the change they cause, mutations can make the virus more or less of a threat to human life, and so they must be monitored closely.

GP is the most studied Ebolavirus protein, and its epitopes are also the most studied. Raising the positive threshold for GP epitopes has revealed an evolutionary signal with high dN in epitope regions (Table 3.4). This suggests that Ebolavirus may be interacting with, and changing in response to, human immune system pressure. This could potentially lead to less virulence and a lower fatality rate for human infections. VP35 is less studied and, given the evolutionary patterns seen in Table 3.4, should be looked at further for future vaccines and treatments. Distributions of polymorphisms are examined in the next chapter to narrow down patterns and elucidate potential areas of interest for treatments and observation.

3.6 References

1. Martines, Roosecelis Brasil, et al. "Tissue and cellular tropism, pathology and

pathogenesis of Ebolavirus and Marburg viruses." The Journal of pathology 235.2

(2015): 153-174.

2. Tong, Y. G., Shi, W. F., Liu, D., Qian, J., Liang, L., Bo, X. C., ... & Jiang, J. F. (2015).

Genetic diversity and evolutionary dynamics of Ebolavirus in Sierra Leone. Nature.

3. Carroll, Serena A., et al. "Molecular evolution of viruses of the family Filoviridae based

on 97 whole-genome sequences." Journal of virology 87.5 (2013): 2608-2616.

4. Park, D. J., Dudas, G., Wohl, S., Goba, A., Whitmer, S. L., Andersen, K. G., ... &

Gibbons, A. (2015). Ebolavirus Epidemiology, Transmission, and Evolution during

Seven Months in Sierra Leone. Cell, 161(7), 1516-1526.

5. WHO. (2015). One Year into the Ebola Epidemic: a Deadly, Tenacious and Unforgiving

Virus. http://www.who.int/csr/disease/ebola/one-year-report/introduction/en/

6. ViPR: an open bioinformatics database and analysis resource for virology research.

Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S,

Zaremba S, Gu Z, Zhou L, Larson C, Dietrich J, Klem EB, Scheuermann RH. Nucleic

Acids Res. 2011 Oct 17

7. R Development Core Team (2008). R: A language and environment for statistical

computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-

0, URL http://www.R-project.org.

8. RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston,

MA URL http://www.rstudio.com/.

9. Jukes T.H. and Cantor C.R. (1969). Evolution of protein molecules. In Munro HN,

editor, Mammalian Protein Metabolism, pp. 21-132, Academic Press, New York

10. Nei M. and Kumar S. (2000). Molecular Evolution and Phylogenetics. Oxford University

Press, New York.

11. Tamura, Koichiro, Masatoshi Nei, and Sudhir Kumar. "Prospects for inferring very large

phylogenies by using the neighbor-joining method." Proceedings of the National

Academy of Sciences of the United States of America 101.30 (2004): 11030-11035.

12. Kumar, Sudhir, Glen Stecher, and Koichiro Tamura. "MEGA7: Molecular Evolutionary

Genetics Analysis version 7.0 for bigger datasets." Molecular biology and evolution 33.7

(2016): 1870-1874.

13. Felsenstein, Joseph. "Phylogenies and the comparative method." The American Naturalist

125.1 (1985): 1-15.

14. Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, Wheeler DK,

Gabbard JL, Hix D, Sette A, Peters B. The immune epitope database (IEDB) 3.0. Nucleic

Acids Res. 2014 Oct 9. pii: gku938. [Epub ahead of print] PubMed PMID: 25300482.

15. Hanada, Kousuke, Yoshiyuki Suzuki, and Takashi Gojobori. "A large variation in the

rates of synonymous substitution for RNA viruses and its relationship to a diversity of

viral infection and transmission modes." Molecular biology and evolution 21.6 (2004):

1074-1080.

16. Saitou, Naruya, and Masatoshi Nei. "The neighbor-joining method: a new method for

reconstructing phylogenetic trees." Molecular biology and evolution 4.4 (1987): 406-425.

17. Baseler, Laura, et al. "The pathogenesis of Ebola virus disease." Annual Review of

Pathology: Mechanisms of Disease 12 (2017): 387-418.

18. Liu, Wen Bin, et al. "Ebolavirus disease: from epidemiology to prophylaxis." Military

Medical Research 2.1 (2015): 7.

19. Kühl, Annika, and Stefan Pöhlmann. "How Ebola counters the interferon

system." Zoonoses and public health 59.s2 (2012): 116-131.

20. Becquart, Pierre, et al. "Identification of continuous human B-cell epitopes in the VP35,

VP40, nucleoprotein and glycoprotein of Ebolavirus." PloS one 9.6 (2014): e96360.

21. Gregory, Sonia M., et al. "Structure and function of the complete internal fusion loop

from Ebolavirus glycoprotein 2." Proceedings of the National Academy of

Sciences 108.27 (2011): 11211-11216.

22. Sundar, Krishnan, Agnieszka Boesen, and Richard Coico. "Computational prediction and

identification of HLA-A2. 1-specific Ebolavirus CTL epitopes." Virology 360.2 (2007):

257-263.

23. Hoenen, Thomas, et al. "Nanopore sequencing as a rapidly deployable Ebola outbreak

tool." Emerging infectious diseases 22.2 (2016): 331.

24. Smits, Saskia L., et al. "Genotypic anomaly in Ebola virus strain circulating in Magazine

Wharf area, Freetown, Sierra Leone, 2015." Euro surveillance: bulletin Europeen sur les

maladies transmissibles= European communicable disease bulletin 20.40 (2015).

25. Nelson, Martha I., et al. "Multiple reassortment events in the evolutionary history of

H1N1 influenza A virus since 1918." PLoS Pathogens 4.2 (2008): e1000012.


CHAPTER 4: DISTRIBUTION OF POSITIVELY AND NEGATIVELY

SELECTED SITES IN THE EBOLAVIRUS GENOME

4.1 Introduction

When dealing with the public health aspects of infectious diseases, the objective is rapid control of outbreaks, and ideally, in the long term, prevention of infection. Both a vaccine and post-exposure drug treatments are needed for Ebolavirus because it thrives in rural regions that are difficult for NGOs and other health care workers to access (1). In Chapter 3 we identified potential vaccine targets outside of GP. In this chapter, the focus is on sequence polymorphisms and potential treatment targets.

Although the Ebolavirus vaccine was recently developed and successfully tested during the 2017 Democratic Republic of Congo outbreak (See section 1.6), there are still many challenges to conquer in preparation for a larger outbreak. Even if the current vaccine were made affordable for public use, impediments to establishing herd immunity such as fear of modern medicine, transport and distribution in rural areas (especially since it requires a cold chain), and funding required to provide the means for immunization drives would limit its effectiveness.

This has been shown time and again with the measles vaccine, which requires a cold chain and has run into opposition from certain groups (including anti-vaxxers) who believe the vaccine does more harm than good, though there is no evidence to support their claim and measles can be deadly (2, 3). So in addition to a vaccine, viable treatment options (like the effective antivirals and pre-exposure prophylaxis that are used against HIV infection) are needed for those unable or unwilling to vaccinate. On both these fronts, there will be challenges in deciding where to target the virus. Hence, looking into the evolution of genes and epitope regions that are potential treatment targets is essential.

Single Nucleotide Polymorphisms (SNPs) can play an important part in many public health challenges (4). They have been linked to numerous genetic diseases in humans and are important in the function of viruses as well. Identifying functionally relevant SNPs and discerning what impact they might have on function within the virus could aid attempts to control future outbreaks. For example, a specific SNP may lead to the virus becoming more virulent or more communicable, as with two mutations in the H5N1 influenza HA protein that enabled its adaptation from birds to humans (5). Other SNPs may indicate instead that the virus is responding to immune system pressure from human hosts, who are a non-reservoir host for this particular virus. Non-synonymous mutations can also have a great impact on treatment and vaccine efficacy. SNPs have been associated with the emergence of drug resistance in HIV, leading to resistance to not one but three types of treatments, each of which focuses on a specific part of the HIV life cycle (6).

4.2 Hypotheses

Hypothesis 3: Due to functional and structural constraints acting on the Ebolavirus, there will be highly variable and highly conserved regions that can serve as prospective targets for drug and vaccine design.

In this aim, we will identify such highly variable regions and conserved regions using several approaches: (a) identifying and mapping polymorphic changes, (b) mapping and prediction of secondary structural elements, (c) using known functional annotations (such as surface protein regions exposed to the immune system outside the cell).


Hypothesis 4: Distribution of non-synonymous (amino acid altering) SNPs will be uneven along the genome, with the majority of SNPs leading to radical amino acid changes located in epitope regions, due to immune pressure from the host.

This can lead to insights into functionally important and conserved regions for treatment and vaccine targets and identification of regions that are less conserved and hence may be of importance for virulence and communicability. These areas might also be monitored for ascertaining if Ebolavirus is adapting to the human immune system, and if this is true, whether it may become less fatal.

Hypothesis 5: Functionally important residues can be identified using SNP distributions within and between epidemics, where the most important residues are those that can harbor SNPs only during an individual epidemic, but not between epidemics. In other words, residues under strong purifying selection pressure can harbor polymorphisms only during the time of epidemics due to broader host distribution. This can lead to identification of potential drug targets.

4.3 Methods

4.3.1 Sequences collected for analysis

Table 4.1: Sequences used and countries and outbreaks of origin

Sequences were obtained from the Virus Pathogen Database and Analysis Resource (ViPR) Ebolavirus database (https://www.viprbrc.org/brc/home.spg?decorator=filo_Ebolavirus) (7). All available whole genomes as of February 2016 were used (Supplementary Table 3.1).

Outbreak    Country                        Number of sequences
2014        Sierra Leone                   147
            Liberia                        11
            Guinea                         9
            Mali                           4
            United Kingdom                 3
            Democratic Republic of Congo   2
            USA                            2
            Italy                          1
            Switzerland                    1
2007        Democratic Republic of Congo   7
2002-2003   Democratic Republic of Congo   1
            Gabon                          1
1994-1996   Democratic Republic of Congo   16
            Gabon                          6
1976        Democratic Republic of Congo   6

4.3.2 Epitope Definition and Mapping

Epitopes were obtained from the Immune Epitope Database (http://www.iedb.org/)(8).

All Ebolavirus Zaire epitopes available in February 2016 were obtained, sorted, and mapped to the alignment. (Supplementary Table 3.2)

4.3.3 Protein Structures

The protein structures are based on protein maps obtained from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) (https://www.rcsb.org/pdb/home/home.do) (9). For NP, the 4Z9P, 4QBO, and 4YPI maps were used; for GP, the Q05320 and 3CSY maps; and for VP35, the 3FKE and 4YPI maps. For the other VP proteins, maps 4LDD and 4LDB were used for VP40, maps 2I8B and 5DVW for VP30, and map 4M0Q for VP24. NP and VP35 mappings were incomplete, with gaps at nucleotides 1147-1935 (NP) and 127-660 (VP35) respectively.

4.3.4 Mapping of Polymorphisms

Nucleotide and amino acid polymorphisms were identified via multiple sequence alignment and organized by gene. Polymorphic sites were defined as substitutions occurring in at least two sequences.
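The definition above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work; it assumes an ungapped, in-frame coding alignment, takes the first argument as the reference sequence, and reads "at least two sequences" as the same variant base appearing in two or more sequences. The function names `polymorphic_sites` and `classify` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def polymorphic_sites(reference, sequences, min_count=2):
    """Return {1-based position: {variant base: count}} for variants that
    differ from the reference in at least min_count sequences."""
    sites = {}
    for i, ref_base in enumerate(reference):
        counts = Counter(
            s[i] for s in sequences if s[i] != ref_base and s[i] in BASES
        )
        kept = {base: n for base, n in counts.items() if n >= min_count}
        if kept:
            sites[i + 1] = kept
    return sites


def classify(reference, position, variant_base):
    """Label a single-nucleotide change as synonymous or nonsynonymous
    by comparing the reference codon with the altered codon."""
    start = (position - 1) // 3 * 3
    offset = (position - 1) % 3
    ref_codon = reference[start:start + 3]
    new_codon = ref_codon[:offset] + variant_base + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[new_codon]
    return "synonymous" if same else "nonsynonymous"
```

For instance, with the toy reference `ATGGCTAAA`, a T-to-C change at position 6 seen in two sequences would be kept and labelled synonymous, while a change seen in only one sequence would be dropped.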

4.3.5 Relative Rate Estimations

Mean relative evolutionary rate estimates were computed for each site using MEGA 7 (10). The relative rate estimates are scaled so that the average evolutionary rate across all included sites is one. Sites with a rate greater than one are evolving faster than the average site for that segment, and those with a rate lower than one are evolving more slowly than average. These relative rates were estimated under the Jukes-Cantor model (+Gamma correction) (11). For the 2014 unpaired sequences, random subsets of 70 sequences (the number of paired 2014 sequences) were run and the averages were taken. For the 2014 paired sequences, pair 16 was omitted due to a deletion in the GP gene. A minimum of three sequences is recommended for this method (12). Therefore the 2014 pair 43 (from the small DRC outbreak) and the 2002-2003 pair were omitted from this analysis. All other outbreaks were analyzed separately.
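The scaling and subset-averaging steps described above can be sketched in Python as follows. This is an illustration only: the per-site rates themselves were estimated in MEGA 7, so `estimate_rates` is a hypothetical callback standing in for that step, and `scale_to_mean_one` and `averaged_subset_rates` are names introduced here. The default of 1000 subsets follows the Figure 4.5 caption.

```python
import random
from statistics import mean


def scale_to_mean_one(raw_rates):
    """Rescale per-site rates so that their average is exactly one."""
    avg = mean(raw_rates)
    return [rate / avg for rate in raw_rates]


def averaged_subset_rates(sequences, estimate_rates,
                          subset_size=70, n_subsets=1000, seed=0):
    """Average scaled per-site rates over random subsets of sequences.

    estimate_rates is a hypothetical function that takes a list of
    aligned sequences and returns one raw (non-negative) rate per site.
    """
    rng = random.Random(seed)
    totals = None
    for _ in range(n_subsets):
        subset = rng.sample(sequences, subset_size)  # without replacement
        rates = scale_to_mean_one(estimate_rates(subset))
        if totals is None:
            totals = rates
        else:
            totals = [t + r for t, r in zip(totals, rates)]
    return [t / n_subsets for t in totals]
```

Because every subset is scaled to a mean of one before averaging, the averaged rates also have a mean of one, so a value above one can still be read as faster than the average site.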

4.4 Results

4.4.1 Distribution of Polymorphic Sites in Epitope Regions

10% Epitope Regions

Figure 4.1: Epitope Regions and Polymorphic Sites in Three Ebolavirus Genes. Here the Ebolavirus genes (A) VP24, (B) VP30, and (C) VP40 are shown with the 10% threshold epitope regions in blue, the nonsynonymous polymorphisms in red, and the synonymous polymorphisms in grey. The Y axis scale is for clarity in differentiating each measurement.

In Ebolavirus, epitopes are still being tested and regions narrowed down. Figure 4.1 shows epitope regions that have been tested and had a positive result at least 10% of the time, and Figure 4.2 shows those that had a positive result in 50% of tests. The grey bars represent synonymous polymorphisms, where a nucleotide (A, T, G, C) differs from the reference sequence (from the first outbreak in 1976) in at least two sequences but the amino acid sequence is unchanged. The red bars represent nucleotide changes that alter the amino acid sequence (nonsynonymous). As seen in the figures, some genes have higher numbers of polymorphisms and areas where those mutations occur more frequently. These do not always correlate with epitope regions and may have been driven by other factors, such as constraints on the structures and functions of the proteins encoded by these genes.

In the genes at the 10% level (Figure 4.1), the epitopes are few and far between. Only VP24 has polymorphic sites within epitopes. The number of synonymous polymorphic sites is much larger than non-synonymous for all three genes.

50% Epitope Regions

Figure 4.2: Epitope Regions and Polymorphic Sites in Three Ebolavirus Genes. Here the Ebolavirus genes (A) NP, (B) GP, and (C) VP35 are shown with the 50% threshold epitope regions in blue, the nonsynonymous polymorphisms in red, and the synonymous polymorphisms in grey. The Y axis scale is for clarity in differentiating each measurement.

At the 50% level, most of the GP and NP residues are defined as epitope (Figure 4.2). Therefore it is not surprising that most of the polymorphisms occur in epitope regions. The nonsynonymous mutations do seem to cluster together in both the NP and GP genes, while the synonymous mutations are distributed throughout the genes. In the VP35 gene the mutations are distributed throughout the gene, with small clusters around nucleotides 250-300 and 900-1000.

90% Epitope Regions

Figure 4.3: Epitope Regions and Polymorphic Sites in Three Ebolavirus Genes. Here the Ebolavirus genes (A) NP, (B) GP, and (C) VP35 are shown with the 90% threshold epitope regions in blue, the nonsynonymous polymorphisms in red, and the synonymous polymorphisms in grey. The Y axis scale is for clarity in differentiating each measurement.

At the 90% threshold range, epitope regions are much smaller (Figure 4.3). This is due to how few epitopes have 90% positive assay results. In the NP gene, at 50%, 1809 out of 2217 nucleotides are epitope; at the 90% threshold only 294 of 2217 nucleotides are epitope. For the GP gene the 50% threshold has 1608 out of 2028 nucleotides in epitope regions; at the 90% threshold this drops to 396 of 2028. The VP35 gene has 333 of 1020 nucleotides in epitope regions at the 50% threshold; only 174 nucleotides remain at the 90% threshold. For both NP and GP the majority of the remaining epitopes are in regions with a large number of mutations, while the VP35 epitopes do not seem to correspond with the highest mutation rates.
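To make the drop in epitope coverage easier to compare across genes, the counts quoted above can be converted to fractions of each gene. The short Python sketch below does only that arithmetic; the dictionary layout and the function name `coverage` are illustrative choices, and the numbers are taken directly from the text.

```python
# Epitope nucleotides per gene at each threshold, as quoted in the text.
EPITOPE_NT = {
    "NP":   {"length": 2217, "50%": 1809, "90%": 294},
    "GP":   {"length": 2028, "50%": 1608, "90%": 396},
    "VP35": {"length": 1020, "50%": 333,  "90%": 174},
}


def coverage(gene, threshold):
    """Fraction of the gene's nucleotides that fall in epitope regions."""
    entry = EPITOPE_NT[gene]
    return entry[threshold] / entry["length"]


for gene in EPITOPE_NT:
    print(f"{gene}: {coverage(gene, '50%'):.1%} at 50%, "
          f"{coverage(gene, '90%'):.1%} at 90%")
```

On these figures, NP and GP go from roughly four fifths of the gene at the 50% threshold to under one fifth at 90%, whereas VP35 starts at about one third and falls to about one sixth.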

4.4.2 Gene Maps of Polymorphic Sites across Functional Domains

Mapped Structural Areas

Figure 4.4: Protein Structures and Polymorphic Sites in Six Ebolavirus Genes. Here the Ebolavirus genes (A) NP, (B) GP, (C) VP35, (D) VP24, (E) VP30, and (F) VP40 are shown with the alpha helices in blue, the beta sheets in green, other structures in orange, the nonsynonymous (dN) polymorphisms in red, and the synonymous (dS) polymorphisms in grey. These structures are based on RCSB PDB protein maps (for NP, 4Z9P and 4YPI; for GP, Q05320 and 3CSY; and for VP35, 3FKE). NP and VP35 mappings were incomplete, with gaps at 1147-1935 (NP) and 127-660 (VP35) (shown in gray). The Y axis scale is for clarity in differentiating each measurement.

Trends of polymorphic mutation in relation to protein structures are shown in Figure 4.4. Unfortunately, for VP35 and NP the interesting areas for substitution patterns do not have protein structures mapped to them yet. For VP35, structures within the grey region are the consensus of two protein structure prediction methods, GOR4 (13) and PREDATOR (14). The non-synonymous polymorphic sites within the region are outside the predicted structures. For the NP grey area, the same two prediction methods showed very few structures. The area of GP with the greatest number of nonsynonymous polymorphic sites (~885-1500) does not have mapped structures but is the mucin domain, which is of antigenic interest. The areas with structures are those that make up GP1 (~97-1500) and GP2 (~1503-2028), which has a highly conserved transmembrane region (~1948-2016).

For the VP24 and VP40 proteins, the non-synonymous polymorphisms fall in beta sheets and unstructured regions. For VP30 the structural regions are complex. The first, fourth, and sixth orange regions are intrinsically disordered, with the shortened area in the first region being associated with RNA binding. These are where most of the non-synonymous polymorphic sites fall. The second block is a zinc finger and the third block is an area for oligomerization. The fifth block interacts with NP.

4.4.3 Pair Differences

Table 4.2: Within Pair Differences. The within pair differences are shown by gene. The location in the gene sequence is given by nucleotide number. Each listed location changed in one pair unless a count of pairs is given in parentheses. The second table shows a count of nucleotide (NT) changes by type (which nucleotide to which nucleotide).

GP (Total 22): 96, 137, 576, 665, 896, 1127, 1136, 1184, 1230, 1241, 1315, 1326, 1351, 1610, 1626, 1631 (3), 1790, 1791, 1830, 1847

L (Total 41): 37, 69, 93, 115, 180, 331, 415, 948, 990, 1305, 1341, 1587, 1704, 2026, 2060, 2079, 2439, 2499, 2649, 2950, 3048, 3186, 3468 (2), 3852, 3980, 4023, 4091, 4212, 4651, 4875, 5098, 5106, 5228, 5314, 5877, 5883, 5985, 6134, 6561, 6627

NP (Total 12): 103, 366, 813, 1116, 1446, 1489, 1734, 1749, 1767, 1962, 2028, 2185

VP24: 76, 81, 361, 399, 686

VP30: 234, 296, 422, 450

VP35: 48, 819 (2), 973

VP40: 231, 255, 281, 453, 574, 741 (2)

VP Gene Total: 20

Grand Total: 95

NT Change          Count
A->C               3
A->G               16
A->R (A or G)      2
A->T               1
C->G and A         1
C->T               33
C->Y (C or T)      1
G->A               14
G->C               1
G->R (A or G)      1
G->T               1
T->A               2
T->C               18
T->G               1

To further examine evolutionary patterns, within pair changes were examined by gene for the phylogenetic pairs identified in Chapter 3 (Table 4.2). Within pair changes that occur in more than one pair could be a sign of highly variable sites or of convergent evolution. There were only four sites in four genes that featured multiple within pair changes. One site had three instances of change, each in a different outbreak. That site was 1631 in GP, which is a non-synonymous change within a mapped beta sheet section of GP2. VP35 had two pairs change at nucleotide 819, which is a synonymous change within an alpha helix. VP40 had one site, nucleotide 741, which had two changes, one synonymous and one nonsynonymous. This nucleotide is in a region with no secondary structure or known epitopes. L had one site with multiple changes, 3468.
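As a cross-check on the counts in Table 4.2, the tally of recurrent sites can be reproduced with a short Python sketch. The positions and counts are transcribed from the table; the data layout and the function names (`pair_counts`, `recurrent_sites`, `gene_totals`) are illustrative assumptions rather than the procedure used to build the table.

```python
from collections import Counter

# Nucleotide positions with a within-pair change, transcribed from Table 4.2.
POSITIONS = {
    "GP": [96, 137, 576, 665, 896, 1127, 1136, 1184, 1230, 1241, 1315,
           1326, 1351, 1610, 1626, 1631, 1790, 1791, 1830, 1847],
    "L": [37, 69, 93, 115, 180, 331, 415, 948, 990, 1305, 1341, 1587,
          1704, 2026, 2060, 2079, 2439, 2499, 2649, 2950, 3048, 3186,
          3468, 3852, 3980, 4023, 4091, 4212, 4651, 4875, 5098, 5106,
          5228, 5314, 5877, 5883, 5985, 6134, 6561, 6627],
    "NP": [103, 366, 813, 1116, 1446, 1489, 1734, 1749, 1767, 1962,
           2028, 2185],
    "VP24": [76, 81, 361, 399, 686],
    "VP30": [234, 296, 422, 450],
    "VP35": [48, 819, 973],
    "VP40": [231, 255, 281, 453, 574, 741],
}
# Positions that changed in more than one pair: (gene, position) -> pairs.
MULTI = {("GP", 1631): 3, ("L", 3468): 2, ("VP35", 819): 2, ("VP40", 741): 2}


def pair_counts():
    """Build {gene: Counter(position -> number of pairs with a change)}."""
    counts = {g: Counter({p: 1 for p in pos}) for g, pos in POSITIONS.items()}
    for (gene, pos), n in MULTI.items():
        counts[gene][pos] = n
    return counts


def recurrent_sites(counts, min_pairs=2):
    """Sites that changed within at least min_pairs different pairs."""
    return sorted((g, p, n) for g, c in counts.items()
                  for p, n in c.items() if n >= min_pairs)


def gene_totals(counts):
    """Total number of within-pair changes per gene."""
    return {g: sum(c.values()) for g, c in counts.items()}


counts = pair_counts()
print(recurrent_sites(counts))
print(gene_totals(counts))
```

The per-gene totals from this tally agree with the totals reported in the table (22 for GP, 41 for L, 12 for NP, 20 across the VP genes, 95 overall).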

4.4.4 Relative Rate

Figure 4.5: Relative Rate in Five Ebolavirus Genes. Here the Ebolavirus genes (A) NP, (B) GP, (C) VP35, (D) VP30, and (E) VP40 are shown with the relative rates by outbreak. 2014 was split into paired sequences and unpaired sequences. Unpaired sequences were split into 1000 random 70-sequence subsets and the means were taken. The other outbreaks with more than three sequences were run separately. Colors correspond to the legend. The Y axis is the relative rate measurement.

As shown in Figure 4.5, most sites in these proteins have a low rate of change. For NP, the 2014 unpaired sequences had the highest number and broadest distribution of highly variable sites. The most variable sites, however, were single points in the 1976 and 2007 outbreaks. In GP, the majority of high relative rate sites are from the 2014 paired grouping, with the exception of one site each for 2007 and 1976. For VP35 the highly variable sites were in the 2014 outbreak strains (with the exception of a single 2007 point). VP24 had no high relative rate sites. VP30 has two sites from the 2007 sequences and VP40 has one site from 2007 and one from 1976.

4.5 Discussion and Conclusions

4.5.1 Model for Ebolavirus Treatment Development

Figure 4.6: Antiviral Treatments in HIV. Shown above, human immunodeficiency virus (HIV) has antiviral treatments that target different steps of the life cycle of the virus. These include NRTIs (nucleoside analogue inhibitors), NNRTIs (non-nucleoside reverse transcriptase inhibitors), InSTIs (integrase strand transfer inhibitors), and ALLINIs (allosteric integrase inhibitors) (15). (Reproduced with permission of NATURE PUBLISHING GROUP in the format Thesis/Dissertation via Copyright Clearance Center)

HIV is another virus with origins in Africa. Treatment for this disease has had to overcome the same challenges in beliefs and infrastructure. Although there is currently no vaccine for HIV, there are many available treatments that target different steps in the life cycle of the virus (Figure 4.6), as well as pre-exposure prophylaxis (PrEP), which when taken as directed before sexual activity with an HIV positive partner can prevent HIV infection. The difficulty with PrEP is that it must be taken exactly as prescribed to be effective and must be started beforehand (up to 20 days, depending on sexual activity) (16). Having a variety of treatments allows for the development of treatment strategies even for infections that have the additional challenge of developed resistance to some of them. This allows for a higher survival rate and quality of life for those infected.

4.5.2 Potential Treatment Targets for Ebolavirus

Figure 4.7: Ebolavirus life cycle and immune system effects. Shown above is the life cycle of the Ebolavirus and the points at which its proteins interact with the immune system. Adapted from (17, 18).

In Ebolavirus there are known points where the immune system interacts with viral proteins. There are also essential steps in the life cycle which can be targeted for treatments as they are in HIV (Figure 4.7).

NP, for example, is necessary for both viral replication and assembly. It is already targeted by the immune system and has MHC epitopes. Mapping the missing section should provide insight into the non-synonymous changes there. The non-epitope regions are small regions in between structurally important regions. This is why the mutation rates are higher in the non-epitope regions, as most of the epitope regions are constrained. This is good for the human immune system, but bad for the virus. The NP protein does not have any known pathogenic effects. Relative rate results show that there are many variable sites during an extended outbreak that may show adaptation. NP is essential to Ebolavirus survival and already targeted, with structural regions that are conserved (the majority of changes in known structural regions are synonymous). For these reasons it is an excellent target for drug development.

VP35 is another protein to consider for drug development, having very few non-synonymous changes or sites with a high relative rate of mutation. VP35 plays several roles in the Ebolavirus life cycle and in interactions with the immune system. These include RNA synthesis, impairing RNA interference (RNAi), and hindering the maturation of dendritic cells by blocking RLR (RIG-I-like receptor) signaling, which additionally inhibits the type I interferon response (19, 20). VP35 shows potential given the patterns in epitope regions that may signal adaptation (Chapter 3). The epitope regions tend to be outside of the alpha helix regions. This may be because less structured regions are easier to break up for MHC processing. Mapping the rest of VP35 would ease the development of treatments to target the more conserved regions of the protein.

VP40 is the matrix protein, which aids in viral budding. VP40 has been shown to be cytotoxic and to cause cell rounding in vitro (21). VP40 has only a few confirmed epitopes, which lie on or near beta sheets. This explains the higher levels of synonymous substitutions and almost zero dN, as these regions are structurally constrained. These conserved regions should also be considered as targets for drug treatment development.

Essential for viral transcription initiation, VP30 is more structurally constrained into alpha helices and beta sheets. It has very few known epitopes, all in non-structural areas. Many areas that are not secondary structures are RNA binding sites and zinc fingers and as such are also constrained. This is why the protein overall has a low mutation rate and the mutations seen are almost all synonymous. The few non-synonymous changes fall mostly in intrinsically disordered regions. The conserved regions and low dN rates indicate VP30 is a good candidate for development of treatments.

VP24 blocks interferon signaling by interfering with STAT1 nuclear translocation (19). It only has a few confirmed epitopes which are mostly in non-structural or beta-sheet regions. It is a short gene that codes mostly secondary structures. The dN rates are so low because of this conservation, but the dS rates are much higher. The non-synonymous changes are few and mostly in the non-structural areas.

L is Ebolavirus’s RNA polymerase and is likewise highly conserved. Its epitope regions are small and few. Again, however, it has higher dS rates. The levels are as expected given how few of the nucleotides are in regions defined as epitope regions.

4.5.3 Current Vaccine and Potential Challenges

Figure 4.8: GP structure. Shown above are the products of GP and the reading frames used to obtain them. sGP runs from nucleotide 97-972, GP1 runs from 97-1500 and contains the mucin domain (885-1500), and GP2 runs from 1501-2028 with a helical heptad-repeat region (HR) and a transmembrane region (TM, nucleotides 1948-2016). The function of ssGP is still unknown. Adapted from (22).

The surface glycoprotein (GP), which is used for cell attachment and entry, is the most studied Ebolavirus protein. GP has the most impact on the immune system due to its encoding of multiple proteins that impair immune system response. Secreted GP (sGP) plays multiple roles: binding antibodies, acting as an anti-inflammatory, interfering with neutrophil movement and activation, and augmenting endothelial cell barrier function (21). In the course of an infection, this protein makes up 70% of the product produced by the GP gene (23). GP1,2 is a surface protein with two parts which, when cleaved from the cell surface, cause cytokine dysregulation and endothelial cell dysfunction (24). This is due mainly to GP1, the surface portion of this protein where the cytotoxic mucin domain is located. The GP2 section of the protein, which contains the transmembrane portion, counteracts tetherin to assist in budding and release of new viruses. The Δ peptide, which is cleaved from the sGP before it dimerizes, is also encoded by the GP gene, and prevents entry of other Ebolaviruses, blocking possible superinfection (24).

GP is the gene used in vaccine development. Vaccination using GP DNA is more effective than sGP DNA. Smaller doses are needed, although a third booster prevents illness in mice regardless of whether it is GP or sGP (when the first two are GP) (24). Vaccination is less protective when amino acids 651-676 are removed (1/3 died at max dose in a study of macaques). This region is the transmembrane domain. Interestingly, an E71D mutant was more effective when not paired with NP in a vaccine and did not cause cell rounding (25).

Three main sites of GP have been shown to have antibodies specifically targeted to them: the glycan cap, the GP1/GP2 interface, and the stalk region (above the HR2 helices). The glycan cap antibodies are not neutralizing, but many of those for the other two regions are (26). ZMapp appears to be effective because it combines antibodies that attach to the glycan cap, preventing cell entry, with antibodies that attach to the GP1-GP2 interface, limiting the ability of the virus to fuse membranes once within the cell (27). Antibodies have been shown to bind better near the GP receptor binding region when the mucin domain is removed, which suggests that the mucin domain may mask some epitopes from antibodies (28). The mucin domain also plays a role in dendritic cell infection, assisting in the production of cytokines and in the processes that leave these cells unable to properly activate T-cells (29). This is in agreement with our findings, which show high levels of nonsynonymous mutations in the mucin domain region. If this region is changing routinely, the virus can better evade the immune system. Given the level of nonsynonymous mutations in the mucin region and the importance of this region for the vaccine to work, it should be monitored closely.

Having the current vaccine rely solely on a gene with highly variable areas may prove to be an unwise move in the long term. The vaccine may need to be updated and re-administered as the virus changes and new outbreaks arise. This challenge, added to the hurdles already faced (cold chain issues, infrastructure challenges, distrust of Western medicine), means that other treatments and vaccines need to be developed. This is not a situation where the existence of a vaccine can be allowed to slow research and surveillance. Although Ebolavirus is evolving far more slowly than influenza, it is still evolving, and measures need to be taken to track this evolution, to examine more thoroughly the impact of point mutations on vaccine effectiveness, and to ascertain which regions are indeed epitopes as opposed to noise in the signal.

4.6 References

1. Quammen, David. (2014). Ebola: The Natural History of a Deadly Virus. New York: W. W. Norton

2. CDC (2017) Measles cases and Outbreaks. https://www.cdc.gov/measles/cases-outbreaks.html

3. Zdechlik, Mark. “Unfounded Autism Fears are Fueling Minnesota’s Measles Outbreak” NPR. May 3, 2017 http://www.npr.org/sections/health-shots/2017/05/03/526723028/autism-fears-fueling-minnesotas-measles-outbreak

4. Shastry, Barkur S. "SNPs in disease gene mapping, medicinal drug development and evolution." Journal of human genetics 52.11 (2007): 871-880.

5. Wilker, Peter R., et al. "Selection on haemagglutinin imposes a bottleneck during

mammalian transmission of reassortant H5N1 influenza viruses." Nature

communications 4 (2013): 2636.

6. Rhee Soo-Yon, Matthew J. Gonzales, Rami Kantor, Bradley J. Betts, Jaideep Ravela,

and Robert W. Shafer (2003) Human immunodeficiency virus reverse transcriptase

and protease sequence database. Nucleic Acids Research, 31(1), 298-303.

7. ViPR: an open bioinformatics database and analysis resource for virology research.

Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S,

Zaremba S, Gu Z, Zhou L, Larson C, Dietrich J, Klem EB, Scheuermann RH. Nucleic

Acids Res. 2011 Oct 17

8. Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, Wheeler DK, Gabbard JL, Hix D, Sette A, Peters B. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2014 Oct 9. pii: gku938. [Epub ahead of print] PubMed PMID: 25300482

9. Berman H.M., J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N.

Shindyalov, P.E. Bourne. (2000) The Protein Data Bank Nucleic Acids Research, 28:

235-242.

10. Kumar, Sudhir, Glen Stecher, and Koichiro Tamura. "MEGA7: Molecular

Evolutionary Genetics Analysis version 7.0 for bigger datasets." Molecular biology

and evolution 33.7 (2016): 1870-1874.

11. Jukes T.H. and Cantor C.R. (1969). Evolution of protein molecules. In Munro HN,

editor, Mammalian Protein Metabolism, pp. 21-132, Academic Press, New York

12. Robinson, Marc, et al. "Sensitivity of the relative-rate test to taxonomic sampling."

Molecular biology and evolution 15.9 (1998): 1091-1098.

13. Garnier, Jean, Jean-François Gibrat, and Barry Robson. "[32] GOR method for

predicting protein secondary structure from amino acid sequence." Methods in

enzymology 266 (1996): 540-553.

14. Frishman, Dmitrij, and Patrick Argos. "Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence." Protein Engineering, Design and Selection 9.2 (1996): 133-142.

15. Laskey, Sarah B., and Robert F. Siliciano. "A mechanistic theory to explain the

efficacy of antiretroviral therapy." Nature reviews. Microbiology 12.11 (2014): 772.

16. CDC (2017) HIV Basics: PrEP. https://www.cdc.gov/hiv/basics/prep.html

17. Messaoudi, Ilhem, Gaya K. Amarasinghe, and Christopher F. Basler. "Filovirus pathogenesis and immune evasion: insights from Ebolavirus and Marburg virus." Nature Reviews Microbiology 13.11 (2015): 663-676.

18. Viralzone 2014 Swiss Institute of Bioinformatics.

http://viralzone.expasy.org/all_by_species/5016.html

19. Kühl, Annika, and Stefan Pöhlmann. "How Ebola counters the interferon

system." Zoonoses and public health 59.s2 (2012): 116-131.

20. Liu, Wen Bin, et al. "Ebolavirus disease: from epidemiology to prophylaxis." Military

Medical Research 2.1 (2015): 7.

21. Takada, Ayato, and Yoshihiro Kawaoka. "The pathogenesis of Ebolavirus

hemorrhagic fever." Trends in microbiology 9.10 (2001): 506-511.

22. Cook, Jonathan D., and Jeffrey E. Lee. "The secret life of viral entry glycoproteins: moonlighting in immune evasion." PLoS pathogens 9.5 (2013): e1003258.

23. Mehedi, Masfique, et al. "A new Ebolavirus nonstructural glycoprotein expressed through RNA editing." Journal of virology 85.11 (2011): 5406-5414.

24. Lai, Kang Yiu, Wing Yiu George Ng, and Fan Fanny Cheng. "Human Ebolavirus

infection in West Africa: a review of available therapeutic agents that target different

steps of the life cycle of Ebolavirus." Infectious diseases of poverty 3.1 (2014): 43.

25. Sullivan, Nancy J., et al. "Development of a preventive vaccine for Ebolavirus

infection in primates." Nature 408.6812 (2000): 605.

26. Bornholdt, Zachary A., et al. "Isolation of potent neutralizing antibodies from a

survivor of the 2014 Ebolavirus outbreak." Science 351.6277 (2016): 1078-1083.

27. Murin, Charles D., et al. "Structures of protective antibodies reveal sites of vulnerability on Ebolavirus." Proceedings of the National Academy of Sciences 111.48 (2014): 17182-17187.

28. Vu, Hong, et al. "Quantitative serology assays for determination of antibody

responses to Ebolavirus glycoprotein and matrix protein in nonhuman primates and

humans." Antiviral research 126 (2016): 55-61.

29. Martinez, Osvaldo, Charalampos Valmas, and Christopher F. Basler. "Ebola virus-

like particle-induced activation of NF-κB and Erk signaling in human dendritic cells

requires the glycoprotein mucin domain." Virology 364.2 (2007): 342-354.


CHAPTER 5: SUMMARY AND FUTURE DIRECTIONS

The overall goal of my research is to use bioinformatics and phylogenetic approaches to better understand the evolutionary patterns that occur in viral genomes during epidemics, namely in influenza A and Ebolaviruses, so that better vaccine and/or drug targets can be identified. The availability of large-scale genomic sequence data that are representative of trends in broader viral populations therefore plays a very important role in such analyses. For example, timely molecular surveillance of influenza strains across multiple geographic regions, including developing countries and high-risk groups (e.g., 1), together with evaluation of their nucleotide and amino acid divergence, is important in vaccine development (e.g., 2). However, significant bias exists in how strains are sampled and represented in the public databases: many countries, particularly in the developing world, have only marginal representation (Figure 1.2), while others, such as China and the US, deposit a significant majority of the available sequences. To overcome this limitation, I developed a subset approach to evaluate the nucleotide divergence of a larger sample (discussed in Chapter 2). My findings show that when the larger dataset is resampled multiple times, with the number of sequences from each geographic region limited to a set threshold in order to avoid oversampling, the mean estimates for the whole set and for the subsets are comparable, both for nucleotide diversity and for rates of synonymous and non-synonymous changes. This suggests that the approach captures the overall trends.

In other words, taking a subset, or combining several subsets, of the sequences from the highly sampled countries should produce the same outcome as using the whole set from those countries, which may prevent them from overwhelming the input from the smaller numbers of sequences available from developing countries or countries with different public health priorities. It also provides hope that even when only a limited number of sequences is available from the under-sampled countries, the data can still be useful in determining the trends of influenza (or other virus) evolution within those countries or geographical regions.
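The subset approach summarized above can be illustrated with a minimal Python sketch. It is not the pipeline used in Chapter 2: the function names, the per-region cap, and the toy sequences are all hypothetical, and a simple proportion of differing sites (p-distance) stands in for the diversity statistics actually estimated. The sketch only shows the resampling logic, i.e., capping the number of sequences drawn from each region, repeating the draw, and comparing the subset means with the whole-set value.

```python
import random
from collections import defaultdict
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites that differ (a stand-in statistic)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_distance(seqs):
    """Mean p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def capped_subsample(records, cap, rng):
    """Draw at most `cap` sequences per region, chosen at random.

    `records` is a list of (region, aligned_sequence) tuples.
    """
    by_region = defaultdict(list)
    for region, seq in records:
        by_region[region].append(seq)
    subset = []
    for seqs in by_region.values():
        subset.extend(rng.sample(seqs, min(cap, len(seqs))))
    return subset


def resampled_means(records, cap, n_repeats, seed=0):
    """Repeat the capped draw and return one mean distance per repeat."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [
        mean_pairwise_distance(capped_subsample(records, cap, rng))
        for _ in range(n_repeats)
    ]


if __name__ == "__main__":
    # Toy data: region "A" is heavily sampled, region "B" is not.
    toy = [("A", "AAAA"), ("A", "AAAT"), ("A", "AATT"),
           ("A", "ATTT"), ("A", "TTTT"), ("B", "AACC")]
    whole = mean_pairwise_distance([seq for _, seq in toy])
    subsets = resampled_means(toy, cap=2, n_repeats=100)
    print("whole set:", whole)
    print("mean of subset means:", sum(subsets) / len(subsets))
```

If the subset means cluster around the whole-set value across repeats, the capped draw is capturing the same overall trend without letting the heavily sampled region dominate.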

In Chapter 2 I further discuss the three main approaches currently used for dealing with sampling issues: collecting every available sequence (e.g., 3), focusing on one strain type (e.g., 4), or taking a small sample, either from a small geographic area or a small time frame (e.g., 5). However, all of these approaches have limitations in scope. Sampling from one geographical region or a set of closely related strains is often not representative of the big picture, especially due to antigenic shift (e.g., 6, 7), and identifying overall worldwide trends is difficult when many countries are over- or under-sampled relative to the global scale on which influenza evolves. In contrast, approaches that involve sampling and resampling, like that used in my study, appear to offer consistent results. Combining random sampling and subsetting has been used to ascertain significant selection trends in the HA protein (8-10). My results are in agreement with these studies, showing that when dealing with a dataset that has biases (over- or under-sampling, differing strain composition), repetition and resampling are necessary to establish that the patterns revealed in the subsets are likely the ones present in the population.

Overall, adequate surveillance is essential in preventing global pandemics. What counts as adequate, and how it should be implemented, is however debatable. Given the global nature of many diseases, countries need to put more effort into aiding developing countries in infectious disease monitoring. Surveillance must be implemented in all areas known to be hotspots for species jumping events, and in areas where contact with live pigs and fowl is common. This applies both to developing countries with live animal markets and to developed countries like the US, where more effort must be put into monitoring industrial farming. This problem became more apparent after the 2009 pandemic, when the first outbreak of human cases was tracked to Mexico, but locating where the pigs were and how they became infected was far more difficult due to movement of livestock not only between countries in North America but between continents (11).

Catching a potential pandemic strain, or even just a highly virulent strain, as soon as it becomes able to infect humans is integral to developing a vaccine for that strain. Having even a few sequences from a diverse collection of areas can enable far greater insight into long-term evolutionary trends and the short-term effects that cause them. Sampling has improved in recent years with WHO FluNet and FluID being implemented in more countries. However, there is still a large gap in data gathering: smaller countries like Thailand process fewer than 250 specimens per week and Viet Nam fewer than 100, while larger countries like China process around 20,000 and the US around 60,000 (12).

I also explored the evolutionary patterns in the genomes of Ebolavirus, a virus that until recently received relatively little public attention (13). The recent Ebolavirus outbreaks (the largest outbreak, of 2014-2016, and the small recent one of 2017) have brought renewed attention from the public and researchers to this dangerous virus. This led to sequencing of numerous recent isolates, enabling me to consider evolutionary trends spread over several outbreaks (Chapters 3 and 4).

In Chapter 3 I explored the hypothesis that, due to the larger geographical spread and duration of the 2014 outbreak (the largest and the longest so far (14)), mutations would have occurred at a higher rate than seen before; that hypothesis did not appear to be supported by the available data. However, as with influenza surveillance, the sequences available from prior outbreaks are fewer, and not uniformly sampled from different times and countries (Table 1.1 and Table 3.1). The 2014 outbreak had many sequences from essentially the same geographic regions and/or time periods (Figure 1.6), for example, when isolates from multiple patients at the same hospital were sequenced (15). Interestingly, my results based on analysis of individual sequence pairs (i.e., phylogenetically independent comparisons), instead of larger groups of sequences (likely heavily oversampled and closely related), showed that there is a relatively large number of non-zero measurements for the prior outbreak pairs. This suggests that Ebolavirus did not experience a burst of recent mutational events due to the recent outbreaks. This is consistent with my findings from several pairs from the 2014 outbreak, which showed non-zero mutation rates in multiple genes and regions. However, to better understand the underlying genomic sequence diversity and how it changes during small versus large outbreaks, more genomic sequences from prior outbreaks must become available, akin to the sequencing of 1918 influenza isolates that enabled insights into the molecular determinants of 1918 pandemic pathogenicity (16).

These results support the evolutionary rate estimates reported by others. As discussed in the Holmes review (17), an initial study of the 2014 Ebolavirus outbreak by Gire et al. (15) showed a higher rate in 2014 (~1.9 × 10⁻³ nucleotide substitutions (subs) per site per year) than the average for previous outbreaks (0.9 × 10⁻³ (0.81, 1.18 × 10⁻³) subs per site per year). However, the sequences used in that study (15) were collected from 78 patients in Sierra Leone in the first stages of the outbreak, so they likely came from patients with the most severe symptoms and represented numerous remote locations throughout the larger hospital catchment area.

Subsequent analyses done by other groups further along in the outbreak (including 17-20), which incorporated denser representation of available samples, brought the average value down to 1.2 × 10⁻³, which is much closer to the previous outbreaks' average estimate of 0.9 × 10⁻³ (0.81, 1.18 × 10⁻³) subs per site per year. This underscores again the importance of using representative datasets and taking the big picture view of viral evolution, as I have done in this work by using sequences from all outbreaks.

In the future, a more comprehensive effort needs to be made to gather and sequence samples from throughout the outbreak region and throughout the outbreak period. Training of biotechnical staff within African countries, and the availability of cheaper and lighter sequencing technology, will aid in this. This training needs to be part of an overall surveillance program targeting any possible cases of Ebolavirus, as well as other emerging infectious diseases (21, 22).

Moreover, recent advances in large-scale genomic sequencing using ultra-long-read technology such as Oxford Nanopore (23, 24) should be implemented more broadly, because they allow initial sequencing to be done using small, mobile, cell-phone-powered sequencers. This will enable broader (and cheaper) surveillance, which is also imperative for containment. The 2017 outbreak in the Democratic Republic of Congo was limited to a small number of cases (<10) because it was identified early and the government was prepared.

The second hypothesis I examined concerns the mode of Ebolavirus evolution, namely that due to extended contact with multiple human hosts there will be a higher number of changes to epitope regions, in response to selective pressure from the human immune system, than to non-epitope regions, compared to previous epidemics. This hypothesis is partially supported by results from the analysis of GP and VP35 genes, which show higher mutation rate patterns in epitope regions, especially when positive assay thresholds are increased (Table 3.4, Table 3.5).

This diversifying selection is also seen for GP in the between-pairs dN (Figure 3.4) and within-pairs dN-dS measurements in epitope regions (Figure 3.9). The other genes may not have enough known or well-tested epitopes to thoroughly test this hypothesis, though, and can be analyzed in the future once we better understand the distribution of epitope and non-epitope regions across the Ebolavirus genome.

Both GP and VP35 have been shown to interact in multiple ways with the immune system, so my findings are in agreement with the literature. The GP gene as a whole has been shown to display most of the phylogenetic signal (25). Our findings agree with this, as GP had the strongest trends for epitope regions.

More rigorous and broader (experimental) testing of the epitopes in databases such as IEDB (26) is needed to ascertain whether there are actually immunogenic epitopes overlapping these highly variable regions, or whether the variability is merely an artifact of protein structure. Further experimental epitope studies will also help to clarify whether there is an evolutionary response to human immune system pressure, such as immune escape, and the extent of that response. A careful consideration of the current vaccine's reliance solely on the GP gene (27) is also needed, in order to avoid potential escape mutations.

To further investigate mutation patterns in Ebolavirus genomes, distributions of single nucleotide polymorphisms (SNPs) were examined in Chapter 4. The first hypothesis involving these distributions, and the third Ebolavirus hypothesis, was that due to functional and structural constraints acting on the Ebolavirus, there will be highly variable and highly conserved regions that can serve as prospective targets for drug and vaccine design. As shown in the structural features and epitope distribution graphs in Chapter 4, many of the highly variable regions indeed lie outside of secondary structures like alpha helices. The VP proteins are relatively highly conserved overall, in part due to their importance in the viral life cycle (for example, the critical role of VP30 in transcription initiation (28)), which makes them good candidates for the development of targeted drug (29) and/or vaccine treatments (30). Moreover, because the identified regions with high non-synonymous rates in GP and NP proteins also appeared to be regions with little secondary structure (mapped for GP, predicted for NP), such unstructured regions can potentially be useful for future design of multi-epitope vaccines (e.g., 31) and/or vaccine adjuvants (32, 33). Nonetheless, further studies are needed into the protein structures, their dynamic presentations during the viral life cycle, and the associated immune responses. In cancer vaccine development, for example, unstructured peptides appeared to be ineffective compared with the humoral immune response elicited by peptides with rigid secondary structures (34).

The mucin domain in GP has shown great importance in relation to the immune response (e.g., 35, 36). The high rate of sequence changes within this region shown in my results is in agreement with the known high level of interaction this region has with the immune system. For example, Vu et al. (37) showed that removal of the mucin domain allows for easier antibody binding and potential neutralization of GP. Additionally, the mucin domain is essential to effective dendritic cell infection, facilitating the production of cytokines and contributing to these cells not maturing properly, so that they cannot activate T-cells (38). The region itself, similar to sGP, binds antibodies to keep them from binding areas of high functional importance like the receptor binding site (39). The multiple proteins encoded within this region all interact with the immune system, especially sGP, which is the bulk (70%) of the protein made (40). sGP alone does not make a viable vaccine because antibodies are then primed to attack sGP preferentially (41). However, it has been proposed that sGP coding helps shield GP1 from antibodies and immune pressure, and that GP2 is more structurally conserved (25). A treatment that attacks sGP, perhaps by down-regulating it, would then be effective in allowing the immune system to focus on more relevant targets for containing infection. Also, developing antibodies that prefer targets in GP1 or GP2 but not in sGP would be a good treatment option.

NP is the major viral target of CD8 T cells, and a target for presentation, due in part to it being the most expressed protein within the cell (42). The variance found in the area with few structures may be due to the fact that this less conserved, less structured region is a prime candidate for antigen processing, and is responding to immune system pressure.

Sequence changes in NP and GP may both serve as indicators of whether the Ebolavirus is adapting to human hosts and potentially losing virulence as a consequence of associated functional and/or structural changes. One study has shown that GP mutations are adapting GP protein toward human cell entry and away from the fruit bats, which are likely its reservoir (43).

Thus, studies of each new outbreak strain should measure both B and T cell responses, to provide insight into the interplay between the actions of the immune system and the Ebolavirus response. If the same antibodies are consistently effective, this could provide treatment options and reinforce confidence in the vaccine. However, if the reverse is true, the vaccine will need to be continuously updated, much like what is being done with influenza A vaccine development (e.g., 44, 45), and other relevant treatment targets sought. The potential for escape has already been shown in a study of a single survivor, which found that escape variants resulted from treatment with the MB-003 antibody cocktail (46).

The hypothesis that distribution of non-synonymous (amino acid altering) SNPs will be uneven along the genome, with the majority of SNPs leading to radical amino acid changes located in epitope regions, due to immune pressure from the host was also examined (Chapter 4).

However, in part due to our lack of thorough understanding of Ebolavirus biology and life cycle, we understand relatively little about epitope distribution along the genome, particularly that of experimentally confirmed epitopes. For example, the VP genes, with the exception of VP35, have very few mapped epitopes (47). This is in part due to a combination of structural constraints acting on these proteins, lack of comprehensive immunological studies, and the relatively short length of these genes. My results showed that in the genes with many mapped epitopes, epitopes indeed appear to harbor a substantial number of non-synonymous changes, which can be attributed to their interactions with the host immune system, including the potential for episodic immune escape, similar to what has been documented in HIV-1 (48). For example, in VP35, six out of ten non-synonymous changes fell near or within the 50% threshold epitopes. On the other hand, for NP and GP the majority of the gene codons are mapped as epitope at both the 50% and 90% thresholds, with both epitope and non-epitope regions containing multiple non-synonymous changes.

Certain SNPs have already shown their functional importance in genes like GP. For example, mutations in amino acids 82 and 544 increased Ebolavirus infectivity, with the former aiding in cell entry and the latter playing a key role in membrane fusion (49). Amino acid 544 is the location of nucleotide 1631, which showed three within-pair changes. Amino acid 82 changed early in the outbreak and quickly became fixed, showing that the virus is adapting to better infect humans (49). Further study of other genes and detected amino acid variants may show similar effects of other SNPs. Likewise, fixation of specific SNPs in future strains should also be closely monitored, as these may indicate further adaptation to human hosts and/or shifts in viral fitness, virulence and infectivity. Moreover, we should not discard synonymous nucleotide changes and their potential influence on RNA structure and function of the virus at different stages of its life cycle. For example, if the RNA folds improperly, it may not be efficiently packaged into new viruses, or it may interact improperly with the cell's transcriptional and translational machinery (50). Additional studies, both computational and experimental, are needed to better understand the interplay between different parts of the viral genome, including whether immune pressure experienced by epitopes harbored by such variable regions is driving the observed sequence changes, or whether the changes should be attributed to features and limitations of the underlying protein structures.
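The correspondence noted above between nucleotide 1631 and amino acid 544 follows from simple codon arithmetic. A minimal sketch, assuming 1-based positions counted from the first nucleotide of the GP coding sequence (an assumption about the numbering convention, not something stated explicitly here); the function names are hypothetical:

```python
def codon_of_nucleotide(nt_pos):
    """Return the 1-based codon (amino acid) index containing a 1-based
    nucleotide position within a coding sequence."""
    if nt_pos < 1:
        raise ValueError("positions are 1-based")
    return (nt_pos - 1) // 3 + 1


def position_in_codon(nt_pos):
    """Return 1, 2 or 3: where the nucleotide sits within its codon."""
    if nt_pos < 1:
        raise ValueError("positions are 1-based")
    return (nt_pos - 1) % 3 + 1


def nucleotides_of_codon(codon):
    """Return the (first, last) 1-based nucleotide positions of a codon."""
    if codon < 1:
        raise ValueError("codons are 1-based")
    first = 3 * (codon - 1) + 1
    return first, first + 2


print(codon_of_nucleotide(1631))   # 544
print(position_in_codon(1631))     # 2 (second codon position)
print(nucleotides_of_codon(82))    # (244, 246)
```

Under this numbering, codon 544 spans nucleotides 1630-1632, so nucleotide 1631 is its second position.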

The last hypothesis I have explored is whether functionally important residues can be identified using SNP distributions within and between epidemics, where the most important residues are likely to be those that can harbor SNPs only during an individual epidemic, but not between epidemics. This was examined using two approaches – estimates of relative rates and analysis of within sister-pair changes. The results of within sister-pair changes focused on amino acid changes that occur within a short period of time between closely related sequences.

However, the fact that so few occur at the same sites, and that these changes rarely become fixed, indicates that they are more likely to be random neutral or slightly deleterious mutations that can be expected to be eliminated by purifying selection in subsequent generations. Moreover, the relative rate results show that the VP genes are less likely to have highly variable sites, whereas NP and GP have sites throughout their genomic regions that show a high relative rate during an outbreak, indicating that there is substantial heterogeneity in how substitutions are distributed along the genome.

One factor at play here is that VP genes are shorter and more structured than NP or GP, which may impact their ability to harbor non-synonymous mutations, particularly those with stronger functional and/or structural consequences. The likely functional consequences of identified substitutions should be further explored, for example by examining the nature of observed amino acid changes and whether each residue change can be classified as radical based on differences in polarity, size, or charge (51). Moreover, the function of these proteins should also be considered, as VP24 and VP40 are necessary for virus budding and also have immune system effects (52, 53). Likewise, VP30 is essential for transcription initiation (52).

These properties likely make these genes a good target for long term treatments, such as being drug targets, as they are necessary for viral survival and reproduction, but less likely to develop escape and/or drug resistance mutations.
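The radical versus conservative classification suggested above can be sketched in a few lines of Python. This is an illustration only: it uses charge alone, with one commonly used grouping of the 20 standard amino acids (positive R, H, K; negative D, E; all others neutral), which is an assumption here and may differ from the scheme in (51). The function names are hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
POSITIVE = set("RHK")   # assumed grouping: positively charged residues
NEGATIVE = set("DE")    # assumed grouping: negatively charged residues


def charge_class(aa):
    """Return the charge class of a one-letter amino acid code."""
    if aa not in AMINO_ACIDS:
        raise ValueError(f"not a standard amino acid: {aa!r}")
    if aa in POSITIVE:
        return "positive"
    if aa in NEGATIVE:
        return "negative"
    return "neutral"


def is_radical_by_charge(ref, alt):
    """A change is radical (by charge) if it moves between charge classes."""
    return charge_class(ref) != charge_class(alt)


print(is_radical_by_charge("E", "D"))  # False: both negatively charged
print(is_radical_by_charge("K", "E"))  # True: positive to negative
print(is_radical_by_charge("A", "V"))  # False: both neutral
```

Under this grouping, an E-to-D change such as the E71D mutant mentioned in Chapter 4 would count as conservative. The same pattern extends to polarity or size by swapping in a different grouping.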

These results suggest that VP genes may serve as better – more conserved, less likely to change – treatment targets than currently used GP. For example, in the future VPs can be incorporated into vaccines, preventing potential escape mutations in GP from making the vaccine entirely ineffective. Subunit vaccines can be less effective, although safer, than whole virus vaccines due to the fact that not all antigenically important viral regions may be represented (54).

Subunit vaccines may also require multiple doses to be effective and have a shorter immunogenic memory response (54). Therefore, developing a vaccine with as much of the Ebolavirus as possible is the ideal. A vaccine that consists of the Ebolavirus genome without the necessary VP30 gene has already been tested and found effective in non-human primates, though concerns exist over whether another protein could be used instead (55).

Overall, as discussed in Chapter 1 (1.7), the big overarching question is whether (and how) the evolutionary patterns observed in Ebolavirus indicate that it can exist as a chronic infection, or only as an acute one. The expectations differ between these two life cycle strategies. Specifically, if it is an acute infection, the high error rate during viral replication should lead to the appearance of random mutations, spread more or less uniformly throughout the genome.

Otherwise, if it exists as a chronic infection, at least in some human hosts, there should be detectable signs of selective pressure from the immune system, and we can therefore expect to observe higher levels of mutations, particularly nonsynonymous (amino acid changing) ones, in epitope regions, as those are the regions that interact with the host immune system.

My results indeed show a pattern of Ebolavirus genomes harboring a larger number of non-synonymous mutations in immunologically important genes, like NP and GP, and show that these changes are often clustered in areas that may be of great importance to the virus as a means of evading the immune response. Other genes appear to be more highly conserved, even in non-structured protein regions, showing that non-synonymous mutations do not seem to be randomly/uniformly distributed. Further testing and narrowing down of specific epitope regions will be needed to clarify whether this pattern can also be attributed to immune system pressure and may therefore serve as a sign of potential chronic Ebolavirus infection, either undetected now or possible in the future. Thus, this study provides a unique perspective on the distribution of observed sequence changes in Ebolavirus genomes, and raises the important possibility that asymptomatic Ebolavirus infections exist and may persist in a chronic stage, although further examination of this broad hypothesis is needed. This is also important because if Ebolavirus is chronic in humans, it may change enough to become a separate human virus, as was seen in the transition of SIV (simian immunodeficiency virus) to HIV (human immunodeficiency virus), which could make it an even larger issue than it is now, when Ebolavirus outbreaks are traced back to interaction with tropical ecosystems (56). My findings have important implications both for our understanding of Ebolavirus life cycle and biology and for the design of future vaccines and/or treatments, including treatments that can prevent Ebolavirus spread in the community. Moreover, further studies, including experimental confirmation and broader geographic sampling, are needed to verify the occurrence of interactions between epitopes and the human immune system, as well as to delineate the functional significance of observed changes. I would like to further emphasize the need to sequence samples that may have been collected during Ebolavirus outbreaks prior to 2014 (as well as additional samples from recent outbreaks), for which there is a particular shortage of available viral sequences, limiting the scope of computational insights into the evolution of this important pathogen.

References

1. Paules, Catharine I., et al. "The Pathway to a Universal Influenza Vaccine." Immunity 47.4 (2017): 599-603.

2. Glatman-Freedman, Aharona, et al. "Genetic divergence of Influenza A (H3N2) amino acid substitutions mark the beginning of the 2016–2017 winter season in Israel." Journal of Clinical Virology (2017).

3. Furuse, Yuki, Shimabukuro, Kozue, Odagiri, Takashi, et al. Comparison of selection pressures on the HA gene of pandemic (2009) and seasonal human and swine influenza A H1 subtype viruses. Virology. 2010. 405: 314-321

4. Ward, Melissa J., Lycett, Samantha J., Avila, Dorita, Bollback, Jonathan P., & Leigh Brown, Andrew J. (2013). Evolutionary interactions between haemagglutinin and neuraminidase in avian influenza. BMC Evolutionary Biology 13:222

5. Lin, J.-H., Chiu, S.-C., Cheng, H.-W., et al. Molecular Epidemiology and Antigenic Analyses of Influenza A Viruses H3N2 in Taiwan. Clin Microbiol Infect, 17: 214-222

6. Bedford, T., M. A. Suchard, P. Lemey, G. Dudas, V. Gregory, A. J. Hay, J. W. McCauley, C. A. Russell, D. J. Smith and A. Rambaut (2014). "Integrating influenza antigenic dynamics with molecular evolution." Elife 3: e01914.

7. Klein, E. Y., A. W. Serohijos, J. M. Choi, E. I. Shakhnovich and A. Pekosz (2014). "Influenza A H1N1 pandemic strain evolution--divergence and the potential for antigenic drift variants." PLoS One 9(4): e93632.

8. Chen, Jiming & Sun, Yingxue. (2011). Variation in the Analysis of Positively Selected Sites Using Nonsynonymous/Synonymous Rate Ratios: an Example Using Influenza Virus. PLoS ONE 6(5): e19996. doi: 10.1371/journal.pone.0019996

9. Plotkin, J.B., Dushoff, J., Codon Bias and Frequency-dependent Selection on the Hemagglutinin Epitopes of Influenza A Virus. Proc Natl Acad Sci USA 2003 100:7152-7157

10. Shih, A.C., Hsiao, T.C., Ho, M.S., Li W.H., Simultaneous Amino Acid Substitutions at Antigenic Sites Drive Influenza A Hemagglutinin Evolution. Proc Natl Acad Sci USA 2007 104:6283-6288

11. López-Cervantes, Malaquías, et al. "On the spread of the novel influenza A (H1N1) virus in Mexico." The Journal of Infection in Developing Countries 3.05 (2009): 327-330.

12. WHO (2017) FluNet reporting Map https://extranet.who.int/sree/Reports?op=vs&path=/WHO_HQ_Reports/G5/PROD/EXT/Influenza%20Reporting+Global+Map accessed October 25, 2017

13. Ohimain, E. I. (2016). "Recent advances in the development of vaccines for Ebolavirus disease." Virus Res 211: 174-185.

14. WHO. (2015). One Year into the Ebola Epidemic: a Deadly, Tenacious and Unforgiving Virus. http://www.who.int/csr/disease/ebola/one-year-report/introduction/en/

15. Gire, S. K., A. Goba, K. G. et al. (2014). "Genomic surveillance elucidates Ebolavirus origin and transmission during the 2014 outbreak." Science 345(6202): 1369-1372.

16. Worobey, M., G. Z. Han and A. Rambaut (2014). "Genesis and pathogenesis of the 1918 pandemic H1N1 influenza A virus." Proc Natl Acad Sci U S A 111(22): 8107-8112.

17. Holmes, Edward C., et al. "The evolution of Ebolavirus: Insights from the 2013–2016 epidemic." Nature 538.7624 (2016): 193-200.

18. Carroll, Serena A., et al. "Molecular evolution of viruses of the family Filoviridae based on 97 whole-genome sequences." Journal of virology 87.5 (2013): 2608-2616.

19. Park, D. J., Dudas, G., Wohl, S., Goba, A., Whitmer, S. L., Andersen, K. G., ... & Gibbons, A. (2015). Ebolavirus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell, 161(7), 1516-1526.

20. Tong, Y. G., Shi, W. F., Liu, D., Qian, J., Liang, L., Bo, X. C., ... & Jiang, J. F. (2015). Genetic diversity and evolutionary dynamics of Ebola in Sierra Leone. Nature.

21. Hotez, Peter J., et al. "Eliminating the neglected tropical diseases: translational science and new technologies." PLoS neglected tropical diseases 10.3 (2016): e0003895.

22. Hotez, P. (2016). "2017 Global Infectious Diseases Threats to the United States, http://blogs.plos.org/speakingofmedicine/2016/12/22/2017-global-infectious-diseases-threats-to-the-united-states/." (blog).

23. Goodwin, S., R. Wappel and W. R. McCombie (2017). "1D Genome Sequencing on the Oxford Nanopore MinION." Curr Protoc Hum Genet 94: 18 11 11-18 11 14.

24. Jain, M., H. E. Olsen, B. Paten and M. Akeson (2016). "The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community." Genome Biol 17(1): 239.

25. Brown, Celeste J., et al. "New Perspectives on Ebolavirus Evolution." PloS one 11.8 (2016): e0160410.

26. Vita, R., J. A. et al. (2015). "The immune epitope database (IEDB) 3.0." Nucleic Acids Res 43(Database issue): D405-412.

27. Agnandji, S. T., et al. (2017). "Safety and immunogenicity of rVSVDeltaG-ZEBOV-GP Ebolavirus vaccine in adults and children in Lambarene, Gabon: A phase I randomised trial." PLoS Med 14(10): e1002402.

28. Xu, W., et al. (2017). "Ebolavirus VP30 and nucleoprotein interactions modulate viral RNA synthesis." Nat Commun 8: 15576

29. Raj, U. and P. K. Varadwaj (2016). "Flavonoids as Multi-target Inhibitors for Proteins Associated with Ebolavirus: In Silico Discovery Using Virtual Screening and Molecular Docking Studies." Interdisciplinary Sciences: Computational Life Sciences 8(2): 132-141.

30. Yasmin, T. and A. H. Nabi (2016). "B and T Cell Epitope-Based Peptides Predicted from Evolutionarily Conserved and Whole Protein Sequences of Ebolavirus as Vaccine Targets." Scand J Immunol 83(5): 321-337.

31. Lee, H. B., D. C. Piao, J. Y. Lee, J. Y. Choi, J. D. Bok, C. S. Cho, S. K. Kang and Y. J. Choi (2017). "Artificially designed recombinant protein composed of multiple epitopes of foot-and-mouth disease virus as a vaccine candidate." Microb Cell Fact 16(1): 33.

32. Bengtsson, K. L., H. Song, L. Stertman, Y. Liu, D. C. Flyer, M. J. Massare, R. H. Xu, B. Zhou, H. Lu, S. A. Kwilas, T. J. Hahn, E. Kpamegan, J. Hooper, R. Carrion, Jr., G. Glenn and G. Smith (2016). "Matrix-M adjuvant enhances antibody, cellular and protective immune responses of a Zaire Ebolavirus/Makona virus glycoprotein (GP) nanoparticle vaccine in mice." Vaccine 34(16): 1927-1935.

33. Warfield, K. L., D. L. Swenson, G. G. Olinger, W. V. Kalina, M. J. Aman and S. Bavari (2007). "Ebolavirus-like particle-based vaccine protects nonhuman primates against lethal Ebolavirus challenge." J Infect Dis 196 Suppl 2: S430-437

34. Wentink, M. Q., T. M. Hackeng, S. P. Tabruyn, W. C. Puijk, K. Schwamborn, D. Altschuh, R. H. Meloen, T. Schuurman, A. W. Griffioen and P. Timmerman (2016). "Targeted vaccination against the bevacizumab binding site on VEGF using 3D-structured peptides elicits efficient antitumor activity." Proc Natl Acad Sci U S A 113(44): 12532-12537.

35. Martinez, O., L. Tantral, N. Mulherkar, K. Chandran and C. F. Basler (2011). "Impact of Ebolavirus mucin-like domain on antiglycoprotein antibody responses induced by Ebolavirus-like particles." J Infect Dis 204 Suppl 3: S825-832

36. Mohan, G. S., W. Li, L. Ye, R. W. Compans and C. Yang (2012). "Antigenic subversion: a novel mechanism of host immune evasion by Ebolavirus." PLoS Pathog 8(12): e1003065.

37. Vu, Hong, et al. "Quantitative serology assays for determination of antibody responses to Ebolavirus glycoprotein and matrix protein in nonhuman primates and humans." Antiviral research 126 (2016): 55-61.

38. Martinez, Osvaldo, Charalampos Valmas, and Christopher F. Basler. "Ebola virus-like particle-induced activation of NF-κB and Erk signaling in human dendritic cells requires the glycoprotein mucin domain." Virology 364.2 (2007): 342-354.

39. Cook, Jonathan D., and Jeffrey E. Lee. "The secret life of viral entry glycoproteins: moonlighting in immune evasion." PLoS pathogens 9.5 (2013): e1003258

40. Mehedi, Masfique, et al. "A new Ebolavirus nonstructural glycoprotein expressed through RNA editing." Journal of virology 85.11 (2011): 5406-5414

41. Li, Wenfang, et al. "Characterization of immune responses induced by Ebolavirus glycoprotein (GP) and truncated GP isoform DNA vaccines and protection against lethal Ebolavirus challenge in mice." The Journal of infectious diseases 212.suppl_2 (2015): S398-S403.

42. McElroy, Anita K., et al. "Human Ebolavirus infection results in substantial immune activation." Proceedings of the National Academy of Sciences 112.15 (2015): 4719-4724.

43. Urbanowicz, Richard A., et al. "Human adaptation of Ebolavirus during the West African outbreak." Cell 167.4 (2016): 1079-1087.

44. Angeletti, D. and J. W. Yewdell (2017). "Is It Possible to Develop a "Universal" Influenza Virus Vaccine? Outflanking Antibody Immunodominance on the Road to Universal Influenza Vaccination." Cold Spring Harb Perspect Biol.

45. Krammer, F., A. Garcia-Sastre and P. Palese (2017). "Is It Possible to Develop a "Universal" Influenza Virus Vaccine? Toward a Universal Influenza Virus Vaccine: Potential Target Antigens and Critical Aspects for Vaccine Development." Cold Spring Harb Perspect Biol

46. Bornholdt, Zachary A., et al. "Isolation of potent neutralizing antibodies from a survivor of the 2014 Ebola virus outbreak." Science 351.6277 (2016): 1078-1083.

47. Anderson, G. P., J. L. Liu, D. Zabetakis, P. M. Legler and E. R. Goldman (2017). "Label free checkerboard assay to determine overlapping epitopes of Ebolavirus VP-40 antibodies using surface plasmon resonance." J Immunol Methods 442: 42-48.

48. Yusim, K., C. Kesmir, B. Gaschen, M. M. Addo, M. Altfeld, S. Brunak, A. Chigaev, V. Detours and B. T. Korber (2002). "Clustering patterns of cytotoxic T-lymphocyte epitopes in human immunodeficiency virus type 1 (HIV-1) proteins reveal imprints of immune evasion on HIV-1 global variation." J Virol 76(17): 8757-8768.

49. Ueda, Mahoko Takahashi, et al. "Functional mutations in spike glycoprotein of Zaire Ebolavirus associated with an increase in infection efficiency." Genes to Cells 22.2 (2017): 148-159.

50. Nguyen, Q. K. K., Y. K. Gomez, M. Bakhom, A. Radcliffe, P. La, D. Rochelle, J. W. Lee and E. J. Sorin (2017). "Ensemble simulations: folding, unfolding and misfolding of a high-efficiency frameshifting RNA pseudoknot." Nucleic Acids Res 45(8): 4893-4904.

51. Woolley, S., J. Johnson, M. J. Smith, K. A. Crandall and D. A. McClellan (2003). "TreeSAAP: selection on amino acid properties using phylogenetic trees." Bioinformatics 19(5): 671-672.

52. Takada, Ayato, and Yoshihiro Kawaoka. "The pathogenesis of Ebola hemorrhagic fever." Trends in microbiology 9.10 (2001): 506-511.

53. Kühl, Annika, and Stefan Pöhlmann. "How Ebola counters the interferon system." Zoonoses and public health 59.s2 (2012): 116-131.

54. Baxter, David. "Active and passive immunity, vaccine types, excipients and licensing." Occupational Medicine 57.8 (2007): 552-556.

55. Marzi, Andrea, et al. "An Ebolavirus whole-virus vaccine is protective in nonhuman primates." Science 348.6233 (2015): 439-442.

56. Sharp, Paul M., and Beatrice H. Hahn. "Origins of HIV and the AIDS pandemic." Cold Spring Harbor perspectives in medicine 1.1 (2011): a006841.

Appendices

Chapter 1

Supplementary Table 1.1: Influenza A Sequences. Number of sequences available by country and type in the Influenza Resources Database at NCBI (as of April 2013).

Group: 1 = very high risk, 2 = high risk, 3 = moderate risk, 4 = low risk.

Group  Rank  Country                   Human-HA  Human-Pre-pandemic (2009)  Human-Post-pandemic  Avian-HA  Swine-HA
3      110   United States of America  3252      1215                       2037                 2328      1431
3      124   China                     471       96                         375                  1297      178
3      130   Singapore                 429       27                         402                  4         1
3      132   United Kingdom            366       57                         309                  39        66
4      150   Japan                     317       61                         256                  212       36
3      146   Hong Kong                 287       233                        54                   216       163
4      154   Australia                 246       158                        88                   70        5
4      154   New Zealand               227       227                        0                    11        0
2      66    Russia                    219       29                         190                  85        7
3      139   Taiwan                    200       61                         139                  75        15
3      139   Vietnam                   194       168                        26                   194       11
2      64    Malaysia                  158       124                        34                   6         0
4      164   Denmark                   155       123                        32                   5         1
4      161   Finland                   144       7                          137                  1         4
3      105   Brazil                    140       2                          138                  1         1
3      146   Netherlands               139       106                        33                   82        4
4      163   Canada                    137       4                          133                  574       75
2      39    Thailand                  136       52                         84                   89        67
2      79    Mexico                    121       25                         96                   17        6
2      75    Iran                      94        21                         73                   53        0
3      106   Egypt                     88        12                         76                   294       0
2      89    Nicaragua                 76        37                         39                   0         0
4      150   Germany                   74        29                         45                   99        40
3      135   India                     65        1                          64                   46        0
2      71    Greece                    57        0                          57                   0         0
1      4     Cambodia                  50        8                          42                   19        0
1      19    Kenya                     45        0                          45                   0         0
3      117   South Korea               43        10                         33                   173       47
3      110   France                    41        22                         19                   21        17
4      165   Norway                    38        18                         20                   11        1
2      52    Indonesia                 35        31                         4                    113       7
3      121   Italy                     35        16                         19                   109       84
3      104   Spain                     30        9                          21                   6         14
2      39    Peru                      27        11                         16                   0         0
3      117   Chile                     27        3                          24                   7         0
2      55    Turkey                    26        0                          26                   4         0
2      33    Argentina                 21        0                          21                   7         6
3      136   Poland                    17        0                          17                   1         2
4      153   Czech Republic            16        2                          14                   26        1
4      161   Sweden                    15        11                         4                    101       2
3      99    Kuwait                    14        1                          13                   4         0
1      16    Dominican Republic        11        0                          11                   0         0
2      31    Sri Lanka                 11        3                          8                    0         10
3      146   Belgium                   11        0                          11                   3         8
4      154   Austria                   11        4                          7                    2         0
2      39    South Africa              10        10                         0                    8         0
2      56    Estonia                   10        0                          10                   0         0
4      160   Switzerland               10        9                          1                    22        0

Chapter 2

Supplementary Table 2.1: Diversity values. For each country and period, the table gives the diversity value for the whole set, followed by the median and mean values for the 100 subsets taken.

Diversity        Whole set  Standard Error  Median of subsets  Mean of subsets  Standard Error  Confidence Interval
China 2004-2006  0.35368    0.01127         0.35386            0.34648          0.00505         (0.3366, 0.3560)
US 2004-2006     0.28793    0.01124         0.24763            0.25386          0.00691         (0.2405, 0.2678)
China 2009       0.17226    0.00760         0.16108            0.15988          0.00154         (0.1570, 0.1630)
US 2009          0.31738    0.01218         0.29563            0.29547          0.00126         (0.2929, 0.2979)
China 2010       0.18913    0.00728         0.17834            0.17814          0.00329         (0.1719, 0.1844)
US 2010          0.37720    0.01600         0.35638            0.35329          0.02470         (0.3486, 0.3580)
China 2011       0.05367    0.00224         0.05422            0.05197          0.00207         (0.0480, 0.0561)
US 2011          0.39957    0.01717         0.37248            0.36605          0.00248         (0.3617, 0.3712)
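The subset summaries reported in Supplementary Table 2.1 (median, mean, standard error, and confidence interval over the subsets) can be illustrated with a minimal sketch. It assumes a normal-approximation 95% interval (mean ± 1.96 × standard error), which this appendix does not state explicitly; the function name `summarize_subsets` and the input values are hypothetical and are not taken from the dissertation's data.

```python
import math
import statistics


def summarize_subsets(values, z=1.96):
    """Summarize per-subset estimates (e.g. one diversity value per subset).

    Assumption: the interval is a normal approximation, mean +/- z * SE,
    where SE = sample standard deviation / sqrt(number of subsets).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two subset values")
    mean = statistics.mean(values)
    median = statistics.median(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return {
        "median": median,
        "mean": mean,
        "std_err": std_err,
        "ci": (mean - z * std_err, mean + z * std_err),
    }


# Hypothetical diversity values for five subsets (illustration only).
example = summarize_subsets([0.34, 0.36, 0.35, 0.33, 0.37])
print(example)
```

With 100 subsets per country and period, the same function would be called once per row of the table, passing the 100 per-subset values for that row.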

Supplementary Table 2.2: Average dN values. For each country and period, the table gives the average dN value for the whole set, followed by the median and mean values for the 100 subsets taken.

dN (average)     Whole set  Standard Error  Median of subsets  Mean of subsets  Standard Error  Confidence Interval (95%)
China 2004-2006  0.24691    0.01298         0.26763            0.25711          0.003550        (0.2504, 0.2640)
US 2004-2006     0.21970    0.01285         0.21625            0.22200          0.006128        (0.2099, 0.2342)
China 2009       0.12145    0.00725         0.12315            0.12330          0.00133         (0.1206, 0.1258)
US 2009          0.20418    0.01377         0.20819            0.20738          0.00110         (0.2052, 0.2095)
China 2010       0.13813    0.00784         0.13883            0.13985          0.00262         (0.1347, 0.1452)
US 2010          0.26664    0.01768         0.28715            0.28285          0.00199         (0.2789, 0.2867)
China 2011       0.03484    0.00213         0.04240            0.03545          0.00161         (0.0323, 0.0386)
US 2011          0.29097    0.01814         0.30572            0.29930          0.00199         (0.2954, 0.3032)

Supplementary Table 2.3: Average dS values. For each country and period, the table gives the average dS value for the whole set, followed by the median and mean values for the 100 subsets taken.

dS (average)     Whole set  Standard Error  Median of subsets  Mean of subsets  Standard Error  Confidence Interval
China 2004-2006  0.93724    0.29927         1.06595            0.99686          0.02169         (0.9523, 1.0384)
US 2004-2006     0.75952    0.31312         0.84710            0.82891          0.023972        (0.7807, 0.8774)
China 2009       0.05669    0.09373         0.05283            0.05552          0.00173         (0.0522, 0.0589)
US 2009          0.89902    0.24009         0.86154            0.86151          0.00454         (0.8525, 0.8702)
China 2010       0.02025    0.14672         0.01962            0.02034          0.00023         (0.0199, 0.0208)
US 2010          0.01543    0.47727         0.01864            0.01841          0.00025         (0.0179, 0.0189)
China 2011       0.09281    0.04695         0.13133            0.10725          0.00591         (0.0965, 0.1190)
US 2011          0.08038    0.44621         0.04513            0.06773          0.00707         (0.0524, 0.0806)

Chapter 3

Supplementary Table 3.1 Sequence Information

Country                           Strain type       Accession  Date        Host
Democratic Republic of the Congo  Zaire Ebolavirus  KC242801   1976        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KM655246   1976        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KF827427   1976        Human
Democratic Republic of the Congo  Zaire Ebolavirus  NC_002549  1976        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242791   1977        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242796   1995        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242799   1995        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KF113528   2003        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242785   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242786   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242787   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242789   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242788   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242790   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KC242784   2007        Human
Democratic Republic of the Congo  Zaire Ebolavirus  KR063671   1976_10_01  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KR063672   1995_04_01  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182905   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182909   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182898   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182902   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182903   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182904   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182907   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182906   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182901   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182900   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182899   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KU182908   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KR867676   1995_05_04  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KP271018   2014_08_20  Human
Democratic Republic of the Congo  Zaire Ebolavirus  KP271020   2014_08_20  Human
Gabon                             Zaire Ebolavirus  KC242792   1994        Human
Gabon                             Zaire Ebolavirus  KC242793   1996        Human
Gabon                             Zaire Ebolavirus  KC242798   1996        Human
Gabon                             Zaire Ebolavirus  KC242795   1996        Human
Gabon                             Zaire Ebolavirus  KC242797   1996        Human
Gabon                             Zaire Ebolavirus  KC242794   1996        Human
Gabon                             Zaire Ebolavirus  KC242800   2002        Human
Guinea                            Zaire Ebolavirus  KT765130   2014        Human
Guinea                            Zaire Ebolavirus  KT765131   2014        Human
Guinea                            Zaire Ebolavirus  KJ660348   2014        Human
Guinea                            Zaire Ebolavirus  KJ660347   2014        Human
Guinea                            Zaire Ebolavirus  KJ660346   2014        Human
Guinea                            Zaire Ebolavirus  KP096420   2014_03     Human
Guinea                            Zaire Ebolavirus  KP096421   2014_03     Human
Guinea                            Zaire Ebolavirus  KP096422   2014_03     Human
Guinea                            Zaire Ebolavirus  KP342330   2014_10     Human
Italy                             Zaire Ebolavirus  KP701371   2014_11_25  Human
Liberia                           Zaire Ebolavirus  KR074996   2014        Human
Liberia                           Zaire Ebolavirus  KR075001   2014        Human
Liberia                           Zaire Ebolavirus  KR075002   2014        Human
Liberia                           Zaire Ebolavirus  KR075003   2014        Human
Liberia                           Zaire Ebolavirus  KR074997   2015        Human
Liberia                           Zaire Ebolavirus  KR074998   2015        Human
Liberia                           Zaire Ebolavirus  KR074999   2015        Human
Liberia                           Zaire Ebolavirus  KR075000   2015        Human
Liberia                           Zaire Ebolavirus  KP178538   2014_08_03  Human
Liberia                           Zaire Ebolavirus  KP240932   2014_09_26  Human
Liberia                           Zaire Ebolavirus  KP240933   2014_10_06  Human
Mali                              Zaire Ebolavirus  KP260799   2014_10_23  Human
Mali                              Zaire Ebolavirus  KP260800   2014_11_12  Human
Mali                              Zaire Ebolavirus  KP260802   2014_11_12  Human
Mali                              Zaire Ebolavirus  KP260801   2014_11_21  Human
Sierra Leone                      Zaire Ebolavirus  KR013754   2014        Human
Sierra Leone                      Zaire Ebolavirus  KM034550   2014_05_25  Human
Sierra Leone                      Zaire Ebolavirus  KM034549   2014_05_25  Human
Sierra Leone                      Zaire Ebolavirus  KM034551   2014_05_26  Human
Sierra Leone                      Zaire Ebolavirus  KM034552   2014_05_26  Human
Sierra Leone                      Zaire Ebolavirus  KM034556   2014_05_26  Human
Sierra Leone                      Zaire Ebolavirus  KM034553   2014_05_27  Human
Sierra Leone                      Zaire Ebolavirus  KM034554   2014_05_27  Human
Sierra Leone                      Zaire Ebolavirus  KM034557   2014_05_27  Human
Sierra Leone                      Zaire Ebolavirus  KM034558   2014_05_28  Human
Sierra Leone                      Zaire Ebolavirus  KM034559   2014_05_28  Human
Sierra Leone                      Zaire Ebolavirus  KM034560   2014_05_28  Human
Sierra Leone                      Zaire Ebolavirus  KM034561   2014_05_28  Human
Sierra Leone                      Zaire Ebolavirus  KM034562   2014_05_28  Human
Sierra Leone                      Zaire Ebolavirus  KM034563   2014_05_28  Human
Sierra Leone                      Zaire Ebolavirus  KM233049   2014_05_31  Human
Sierra Leone                      Zaire Ebolavirus  KM233035   2014_06_02  Human
Sierra Leone                      Zaire Ebolavirus  KM233036   2014_06_02  Human
Sierra Leone                      Zaire Ebolavirus  KM233037   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233038   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233039   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233040   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233041   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233042   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233043   2014_06_03  Human
Sierra Leone                      Zaire Ebolavirus  KM233044   2014_06_04  Human
Sierra Leone                      Zaire Ebolavirus  KM233045   2014_06_04  Human
Sierra Leone                      Zaire Ebolavirus  KM233116   2014_06_04  Human
Sierra Leone                      Zaire Ebolavirus  KM233053   2014_06_05  Human
Sierra Leone                      Zaire Ebolavirus  KM233046   2014_06_06  Human
Sierra Leone                      Zaire Ebolavirus  KM034555   2014_06_06  Human
Sierra Leone                      Zaire Ebolavirus  KM233054   2014_06_07  Human
Sierra Leone                      Zaire Ebolavirus  KM233055   2014_06_07  Human
Sierra Leone                      Zaire Ebolavirus  KM233056   2014_06_07  Human
Sierra Leone                      Zaire Ebolavirus  KM233047   2014_06_08  Human
Sierra Leone                      Zaire Ebolavirus  KM233048   2014_06_09  Human
Sierra Leone                      Zaire Ebolavirus  KM233050   2014_06_09  Human
Sierra Leone                      Zaire Ebolavirus  KM233057   2014_06_09  Human
Sierra Leone                      Zaire Ebolavirus  KM233117   2014_06_09  Human
Sierra Leone                      Zaire Ebolavirus  KM233058   2014_06_10  Human
Sierra Leone                      Zaire Ebolavirus  KM233061   2014_06_10  Human
Sierra Leone                      Zaire Ebolavirus  KM233051   2014_06_11  Human
Sierra Leone                      Zaire Ebolavirus  KM233062   2014_06_11  Human
Sierra Leone                      Zaire Ebolavirus  KM233059   2014_06_12  Human
Sierra Leone                      Zaire Ebolavirus  KM233063   2014_06_12  Human
Sierra Leone                      Zaire Ebolavirus  KM233065   2014_06_12  Human
Sierra Leone                      Zaire Ebolavirus  KM233069   2014_06_12  Human
Sierra Leone                      Zaire Ebolavirus  KM233071   2014_06_12  Human
Sierra Leone                      Zaire Ebolavirus  KM233118   2014_06_12  Human
Sierra Leone                      Zaire Ebolavirus  KM233052   2014_06_13  Human
Sierra Leone                      Zaire Ebolavirus  KM233066   2014_06_13  Human
Sierra Leone                      Zaire Ebolavirus  KM233060   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233064   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233070   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233072   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233073   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233074   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233075   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233076   2014_06_14  Human
Sierra Leone                      Zaire Ebolavirus  KM233067   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233077   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233078   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233079   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233080   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233081   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233082   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233084   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233085   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233086   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233087   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233089   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233090   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233091   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233092   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233093   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233094   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233095   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233096   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233097   2014_06_15  Human
Sierra Leone                      Zaire Ebolavirus  KM233068   2014_06_16  Human
Sierra Leone                      Zaire Ebolavirus  KM233098   2014_06_16  Human
Sierra Leone                      Zaire Ebolavirus  KM233100   2014_06_16  Human
Sierra Leone                      Zaire Ebolavirus  KM233101   2014_06_16  Human
Sierra Leone                      Zaire Ebolavirus  KM233102   2014_06_16  Human
Sierra Leone                      Zaire Ebolavirus  KM233103   2014_06_16  Human
Sierra Leone                      Zaire Ebolavirus  KM233088   2014_06_17  Human
Sierra Leone                      Zaire Ebolavirus  KM233099   2014_06_17  Human
Sierra Leone                      Zaire Ebolavirus  KM233104   2014_06_17  Human
Sierra Leone                      Zaire Ebolavirus  KM233105   2014_06_17  Human
Sierra Leone                      Zaire Ebolavirus  KM233106   2014_06_17  Human
Sierra Leone                      Zaire Ebolavirus  KM233107   2014_06_17  Human
Sierra Leone                      Zaire Ebolavirus  KM233108   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233109   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233110   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233111   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233112   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233113   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233115   2014_06_18  Human
Sierra Leone                      Zaire Ebolavirus  KM233083   2014_06_20  Human
Sierra Leone                      Zaire Ebolavirus  KM233114   2014_06_20  Human
Sierra Leone                      Zaire Ebolavirus  KT589389   2014_09_07  Human
Sierra Leone                      Zaire Ebolavirus  KP240931   2014_09_13  Human
Sierra Leone                      Zaire Ebolavirus  KT589390   2014_09_17  Human
Sierra Leone                      Zaire Ebolavirus  KT357826   2015_01_13  Human
Sierra Leone                      Zaire Ebolavirus  KT357827   2015_01_14  Human
Sierra Leone                      Zaire Ebolavirus  KT357828   2015_01_14  Human
Sierra Leone                      Zaire Ebolavirus  KT357829   2015_01_17  Human
Sierra Leone                      Zaire Ebolavirus  KT357830   2015_01_17  Human
Sierra Leone                      Zaire Ebolavirus  KT357831   2015_01_18  Human
Sierra Leone                      Zaire Ebolavirus  KT357832   2015_01_19  Human
Sierra Leone                      Zaire Ebolavirus  KT357833   2015_01_20  Human
Sierra Leone                      Zaire Ebolavirus  KT357834   2015_01_20  Human
Sierra Leone                      Zaire Ebolavirus  KT357835   2015_01_20  Human
Sierra Leone                      Zaire Ebolavirus  KT357836   2015_01_20  Human
Sierra Leone                      Zaire Ebolavirus  KT357837   2015_01_20  Human
Sierra Leone                      Zaire Ebolavirus  KT357838   2015_01_20  Human
Sierra Leone                      Zaire Ebolavirus  KT357839   2015_01_21  Human
Sierra Leone                      Zaire Ebolavirus  KT357841   2015_01_25  Human
Sierra Leone                      Zaire Ebolavirus  KT357842   2015_01_25  Human
Sierra Leone                      Zaire Ebolavirus  KT357843   2015_01_26  Human
Sierra Leone                      Zaire Ebolavirus  KT357844   2015_01_28  Human
Sierra Leone                      Zaire Ebolavirus  KT357845   2015_01_28  Human
Sierra Leone                      Zaire Ebolavirus  KT357846   2015_01_29  Human
Sierra Leone                      Zaire Ebolavirus  KT357847   2015_01_30  Human
Sierra Leone                      Zaire Ebolavirus  KT357848   2015_02_03  Human
Sierra Leone                      Zaire Ebolavirus  KT357849   2015_02_04  Human
Sierra Leone                      Zaire Ebolavirus  KT357850   2015_02_06  Human
Sierra Leone                      Zaire Ebolavirus  KT357851   2015_02_06  Human
Sierra Leone                      Zaire Ebolavirus  KT357852   2015_02_18  Human
Sierra Leone                      Zaire Ebolavirus  KT345616   2015_02_19  Human
Sierra Leone                      Zaire Ebolavirus  KT357813   2015_02_19  Human
Sierra Leone                      Zaire Ebolavirus  KT357853   2015_02_19  Human
Sierra Leone                      Zaire Ebolavirus  KT357814   2015_02_21  Human
Sierra Leone                      Zaire Ebolavirus  KT357855   2015_02_23  Human
Sierra Leone                      Zaire Ebolavirus  KT357815   2015_02_26  Human
Sierra Leone                      Zaire Ebolavirus  KT357816   2015_02_26  Human
Sierra Leone                      Zaire Ebolavirus  KT357817   2015_02_27  Human
Sierra Leone                      Zaire Ebolavirus  KT357819   2015_02_28  Human
Sierra Leone                      Zaire Ebolavirus  KT357818   2015_03_04  Human
Sierra Leone                      Zaire Ebolavirus  KT357856   2015_03_06  Human
Sierra Leone                      Zaire Ebolavirus  KT357820   2015_03_07  Human
Sierra Leone                      Zaire Ebolavirus  KT357821   2015_03_09  Human
Sierra Leone                      Zaire Ebolavirus  KT357822   2015_03_10  Human
Sierra Leone                      Zaire Ebolavirus  KT357823   2015_03_28  Human
Sierra Leone                      Zaire Ebolavirus  KT357860   2015_06_30  Human
Sierra Leone                      Zaire Ebolavirus  KT357858   2015_07_03  Human
Sierra Leone                      Zaire Ebolavirus  KT357859   2015_07_11  Human
Switzerland                       Zaire Ebolavirus  KP728283   2014_11_21  Human
United Kingdom                    Zaire Ebolavirus  KP184503   2014_08_25  Human
United Kingdom                    Zaire Ebolavirus  KP120616   2014_08_25  Human
United Kingdom                    Zaire Ebolavirus  KR025228   2015_03_12  Human
USA                               Zaire Ebolavirus  KP240934   2014_10_11  Human
USA                               Zaire Ebolavirus  KP240935   2014_10_13  Human

181 Supplementary Table 3.2 Epitope Information

Human IEDB ID Epitope Sequence Protein Names Tests MHC Binding Assays T Cell Assays B Cell Assays Negative Results Positive Results Percent positive 227064 CRDKLSSTNQLRSVG ssGP, 2 0 0 2 1 1 0.5 227097 DNLTYVQLESRFTPQ ssGP, 2 0 0 2 1 1 0.5 227170 FGTNETEYLFEVDNL ssGP, 2 0 0 2 1 1 0.5 227269 IKKPDGSECLPAAPD ssGP, 2 0 0 2 1 1 0.5 187300 DNSTHNTPVYKLDIS virion spike glycoprotein 2 0 12 2 7 7 0.5 91086 AAPDGIRGF GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91096 AFFLYDRLA GP1 2 glycoprotein 96 96 0 0 48 48 0.500 91130 AQPKCNPNL GP1 2 glycoprotein 72 72 0 0 36 36 0.500 91139 ASSGKLGLI GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91147 ATISTSPQS GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91148 ATQVEQHHR GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91159 AVSHLTTLA GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91161 AWIPYFGPA GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91169 CLPAAPDGI GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91270 ETKKNLTRK GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91285 FAFHKEGAF GP1 2 glycoprotein 72 72 0 0 36 36 0.500 91294 FFLWVIILF GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91300 FHKEGAFFL GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91400 GPCAGDFAF GP1 2 glycoprotein 36 36 12 0 24 24 0.500 91452 HILGPDCCI GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91489 IGLAWIPYF GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91497 IILFQRTFS GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91512 IMASENSSA GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91551 KAENTNTSK GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91559 KEGAFFLYD GP1 2 glycoprotein 36 36 0 0 18 18 0.500

182 91569 KIDQIIHDF GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91688 LQVSDVDKL GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91734 MVQVHSQGR GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91774 NTIAGVAGL GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91829 QGREAAVSH GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91881 REPVNATED GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91937 RYVHKVSGT GP1 2 glycoprotein 36 36 0 0 18 18 0.500 91992 SSGKLGLIT GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92047 TIRYQATGF GP1 2 glycoprotein 72 72 0 0 36 36 0.500 92054 TLATISTSP GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92067 TPVYKLDIS GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92071 TQVEQHHRR GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92074 TSFFLWVII GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92081 TTELRTFSI GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92084 TTGKLIWKV GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92091 TTQALQLFL GP1 2 glycoprotein 96 96 0 0 48 48 0.500 92138 VNPEIDTTI GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92143 VQLESRFTP GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92149 VSHLTTLAT GP1 2 glycoprotein 36 36 0 0 18 18 0.500 92164 VVSNGAKNI GP1 2 glycoprotein 36 36 0 0 18 18 0.500 15046 EYLFEVDNL GP1 2 glycoprotein 36 36 13 0 24 25 0.510 91712 LYDRLASTV GP1 2 glycoprotein 96 96 18 0 54 60 0.526 91347 GAAIGLAWI GP1 2 glycoprotein 90 90 0 0 42 48 0.533 91151 ATTELRTFS GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91371 GIRGFPRCR GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91388 GLMHNQDGL GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91421 GTNETEYLF GP1 2 glycoprotein 54 54 0 0 24 30 0.556

183 91505 ILFQRTFSI GP1 2 glycoprotein 108 108 0 0 48 60 0.556 91583 KLSSTNQLR GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91664 LLQLNETIY GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91666 LLQRWGGTC GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91686 LQRWGGTCH GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91717 MGVTGILQL GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91921 RTFSIPLGV GP1 2 glycoprotein 108 108 0 0 48 60 0.556 91986 SRFTPQFLL GP1 2 glycoprotein 54 54 0 0 24 30 0.556 91993 SSGYYSTTI GP1 2 glycoprotein 54 54 0 0 24 30 0.556 92041 TIAGVAGLI GP1 2 glycoprotein 54 54 0 0 24 30 0.556 16985 FLYDRLAST GP1 2 glycoprotein 176 176 0 0 78 98 0.557 91089 AAVSHLTTL GP1 2 glycoprotein 72 72 0 0 30 42 0.583 91144 ATDVPSATK GP1 2 glycoprotein 72 72 0 0 30 42 0.583 91553 KAIDFLLQR GP1 2 glycoprotein 72 72 0 0 30 42 0.583 92006 STHNTPVYK GP1 2 glycoprotein 72 72 0 0 30 42 0.583 91886 RGFPRCRYV GP1 2 glycoprotein 216 216 0 0 88 128 0.593 91383 GLICGLRQL GP1 2 glycoprotein 108 108 0 0 42 66 0.611 28398 IRSEELSFTVVSNGA GP1 2 glycoprotein 12 0 6 12 6 12 0.667 91106 AGPPKAENT GP1 2 glycoprotein 18 18 0 0 6 12 0.667 91125 AMVQVHSQG GP1 2 glycoprotein 18 18 0 0 6 12 0.667 91128 APDGIRGFP GP1 2 glycoprotein 18 18 0 0 6 12 0.667 91150 ATTAAGPPK GP1 2 glycoprotein 54 54 0 0 18 36 0.667 91152 ATTTSPQNH GP1 2 glycoprotein 36 36 0 0 12 24 0.667 91164 CAGDFAFHK GP1 2 glycoprotein 36 36 0 0 12 24 0.667 91180 DFLLQRWGG GP1 2 glycoprotein 18 18 0 0 6 12 0.667 91194 DKLSSTNQL GP1 2 glycoprotein 18 18 0 0 6 12 0.667 91216 DTTIGEWAF GP1 2 glycoprotein 18 18 0 0 6 12 0.667

91217 DVDKLVCRD GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91223 EAIVNAQPK GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91267 ESRFTPQFL GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91281 EWAFWETKK GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91292 FEVDNLTYV GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91298 FGTNETEYL GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91317 FLLQRWGGT GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91321 FLRATTELR GP1 2 glycoprotein 168 168 0 0 56 112 0.667
91328 FQRTFSIPL GP1 2 glycoprotein 90 90 0 0 30 60 0.667
91360 GFPRCRYVH GP1 2 glycoprotein 168 168 0 0 56 112 0.667
91380 GLAWIPYFG GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91385 GLITNTIAG GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91390 GLNLEGNGV GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91393 GLRQLANET GP1 2 glycoprotein 48 48 0 0 16 32 0.667
91428 GVATDVPSA GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91439 GVTGILQLP GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91463 HLTTLATIS GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91475 HYWTTQDEG GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91479 ICGLRQLAN GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91506 ILGPDCCIE GP1 2 glycoprotein 48 48 0 0 16 32 0.667
91509 ILPQAKKDF GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91540 IVNAQPKCN GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91545 IYRGTTFAE GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91548 IYTSGKRSN GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91572 KIMASENSS GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91581 KLGLITNTI GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91602 KSTDFLDPA GP1 2 glycoprotein 18 18 0 0 6 12 0.667

91642 LGLITNTIA GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91668 LMHNQDGLI GP1 2 glycoprotein 54 54 0 0 18 36 0.667
91681 LPRDRFKRT GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91683 LQLNETIYT GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91685 LQLPRDRFK GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91706 LTYVQLESR GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91752 NITDKIDQI GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91766 NQDGLICGL GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91778 NTTGKLIWK GP1 2 glycoprotein 36 36 0 0 12 24 0.667
91782 NYEAGEWAE GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91796 PLGVIHNST GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91811 PYFGPAAEG GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91823 QFLLQLNET GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91835 QLANETTQA GP1 2 glycoprotein 48 48 0 0 16 32 0.667
91840 QLNETIYTS GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91846 QLRSVGLNL GP1 2 glycoprotein 90 90 0 0 30 60 0.667
91858 QRWGGTCHI GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91867 QVEQHHRRT GP1 2 glycoprotein 54 54 0 0 18 36 0.667
91870 QVSDVDKLV GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91885 RFTPQFLLQ GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91887 RGTTFAEGV GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91914 RSEELSFTV GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91933 RWGFRSGVP GP1 2 glycoprotein 96 96 0 0 32 64 0.667
91948 SFFLWVIIL GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91977 SPQNHSETA GP1 2 glycoprotein 18 18 0 0 6 12 0.667
91982 SQGREAAVS GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92000 SSTNQLRSV GP1 2 glycoprotein 18 18 0 0 6 12 0.667

92023 TEDPSSGYY GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92038 TGPCAGDFA GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92050 TIYTSGKRS GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92079 TTAAGPPKA GP1 2 glycoprotein 36 36 0 0 12 24 0.667
92088 TTLATISTS GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92096 TYVQLESRF GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92110 VHSQGREAA GP1 2 glycoprotein 36 36 0 0 12 24 0.667
92116 VIILFQRTF GP1 2 glycoprotein 54 54 0 0 18 36 0.667
92175 WIPYFGPAA GP1 2 glycoprotein 18 18 0 0 6 12 0.667
92218 YYSTTIRYQ GP1 2 glycoprotein 18 18 0 0 6 12 0.667
17380 FPRCRYVHK GP1 2 glycoprotein 322 322 0 0 106 216 0.671
91098 AFHKEGAFF GP1 2 glycoprotein 36 36 0 0 6 30 0.833
91316 FLLQLNETI GP1 2 glycoprotein 36 36 0 0 6 30 0.833
91339 FTPQFLLQL GP1 2 glycoprotein 36 36 0 0 6 30 0.833
91657 LIWKVNPEI GP1 2 glycoprotein 36 36 0 0 6 30 0.833
91817 QDGLICGLR GP1 2 glycoprotein 36 36 0 0 6 30 0.833
91877 RATTELRTF GP1 2 glycoprotein 36 36 0 0 6 30 0.833
91958 SHLTTLATI GP1 2 glycoprotein 36 36 0 0 6 30 0.833
92028 TFAEGVVAF GP1 2 glycoprotein 36 36 0 0 6 30 0.833
92154 VTGILQLPR GP1 2 glycoprotein 36 36 0 0 6 30 0.833
227099 DPGTNTTTEDHKIMA virion spike glycoprotein 2 0 0 2 0 2 1
227141 ENSSAMVQVHSQGRE virion spike glycoprotein 2 0 0 2 0 2 1
227248 HNTPVYKLDISEATQ virion spike glycoprotein 2 0 0 2 0 2 1
227307 KPGPDNSTHNTPVYK virion spike glycoprotein 2 0 0 2 0 2 1
227409 NISGQSPARTSSDPG virion spike glycoprotein 2 0 0 2 0 2 1
227422 NTTTEDHKIMASENS virion spike glycoprotein 2 0 0 2 0 2 1
227687 VYKLDISEATQVEQH virion spike glycoprotein 2 0 0 2 0 2 1

227646 TVVSNGAKNISGQSP GP1 2 glycoprotein 12 0 0 12 0 12 1.000
91650 LICGLRQLA GP1 2 glycoprotein 18 18 0 0 0 18 1.000
91934 RWGGTCHIL GP1 2 glycoprotein 18 18 0 0 0 18 1.000
92034 TFSIPLGVI GP1 2 glycoprotein 18 18 0 0 0 18 1.000
92097 VAFLILPQA GP1 2 glycoprotein 18 18 0 0 0 18 1.000
92197 YLFEVDNLT GP1 2 glycoprotein 18 18 0 0 0 18 1.000
91927 RTSFFLWVI GP1 2 glycoprotein 36 36 0 0 0 36 1.000
234002 Q508 GP1 2 glycoprotein 12 0 0 24 0 24 1.000
227244 HLGLDDQEKKILMNF NP 2 0 0 2 1 1 0.500
7664 DAVLYYHMM NP 9 9 3 0 6 6 0.500
91162 AWQSVGHMM NP 18 18 0 0 9 9 0.500
91237 EGHGFRFEV NP 18 18 0 0 9 9 0.500
91253 EMYRHILRS NP 18 18 0 0 9 9 0.500
91287 FASLFLPKL NP 18 18 0 0 9 9 0.500
91315 FLLMLCLHH NP 18 18 0 0 9 9 0.500
91324 FPQLSAIAL NP 36 36 0 0 18 18 0.500
91329 FRFEVKKRD NP 18 18 0 0 9 9 0.500
91364 GGQQKNSQK NP 18 18 0 0 9 9 0.500
91434 GVNNLEHGL NP 18 18 0 0 9 9 0.500
91437 GVRLHPLAR NP 18 18 0 0 9 9 0.500
91449 HHAYQGDYK NP 18 18 0 0 9 9 0.500
91453 HILQKTERG NP 18 18 0 0 9 9 0.500
91465 HMMKDEPVV NP 18 18 0 0 9 9 0.500
91469 HSFEEMYRH NP 18 18 0 0 9 9 0.500
91510 ILQKTERGV NP 18 18 0 0 9 9 0.500
91511 ILTAGLSVQ NP 18 18 0 0 9 9 0.500
91517 IPVYQVNNL NP 18 18 0 0 9 9 0.500

91527 ISFQQTNAM NP 36 36 0 0 18 18 0.500
91593 KQLQQYAES NP 18 18 0 0 9 9 0.500
91613 KYLEGHGFR NP 18 18 0 0 9 9 0.500
91652 LIHQGMHMV NP 18 18 0 0 9 9 0.500
91728 MNHKNKFMA NP 18 18 0 0 9 9 0.500
91740 NAMVTLRKE NP 18 18 0 0 9 9 0.500
91746 NFHQKKNEI NP 18 18 0 0 9 9 0.500
91779 NVGEQYQQL NP 18 18 0 0 9 9 0.500
91787 PFARLLNLS NP 18 18 0 0 9 9 0.500
91795 PLARTAKVK NP 18 18 0 0 9 9 0.500
91804 PTAWQSVGH NP 18 18 0 0 9 9 0.500
91810 PWLTEKEAM NP 18 18 0 0 9 9 0.500
91834 QKIWMAPSL NP 18 18 0 0 9 9 0.500
91844 QLREAATEA NP 36 36 0 0 18 18 0.500
91847 QLSAIALGV NP 18 18 0 0 9 9 0.500
91856 QQYAESREL NP 36 36 0 0 18 18 0.500
91889 RGVRLHPLA NP 18 18 0 0 9 9 0.500
91899 RLHPLARTA NP 36 36 0 0 18 18 0.500
91905 RPIQNVPGP NP 18 18 0 0 9 9 0.500
91920 RTAKVKNEV NP 36 36 0 0 18 18 0.500
91922 RTIHHASAP NP 18 18 0 0 9 9 0.500
91923 RTLAAMPEE NP 18 18 0 0 9 9 0.500
91961 SLAKHGEYA NP 18 18 0 0 9 9 0.500
91995 SSLAKHGEY NP 36 36 0 0 18 18 0.500
92044 TIHHASAPL NP 54 54 0 0 27 27 0.500
92058 TLRKERLAK NP 36 36 0 0 18 18 0.500
92108 VGHMMVIFR NP 54 54 0 0 27 27 0.500

92126 VLDHILQKT NP 18 18 0 0 9 9 0.500
92139 VPGPHRTIH NP 18 18 0 0 9 9 0.500
92145 VQRQIQVHA NP 18 18 0 0 9 9 0.500
92155 VTLDGQQFY NP 54 54 0 0 27 27 0.500
92178 WMAPSLTES NP 18 18 0 0 9 9 0.500
92185 YAPFARLLN NP 18 18 0 0 9 9 0.500
92189 YHMMKDEPV NP 18 18 0 0 9 9 0.500
92202 YQQLREAAT NP 18 18 0 0 9 9 0.500
91387 GLLIVKTVL NP 27 27 0 0 12 15 0.556
91408 GQQFYWPVM NP 27 27 0 0 12 15 0.556
91454 HILRSQGPF NP 27 27 0 0 12 15 0.556
91522 IQYPTAWQS NP 27 27 0 0 12 15 0.556
91539 IVKTVLDHI NP 27 27 0 0 12 15 0.556
91563 KFMAILQHH NP 27 27 0 0 12 15 0.556
91634 LFLESGAVK NP 27 27 0 0 12 15 0.556
91807 PVMNHKNKF NP 27 27 0 0 12 15 0.556
91873 QYPTAWQSV NP 27 27 0 0 12 15 0.556
92014 SVQQGIVRQ NP 27 27 0 0 12 15 0.556
92161 VVFSTSDGK NP 27 27 0 0 12 15 0.556
21863 GQFLSFASL NP 42 45 3 0 21 27 0.563
91115 AITAASLPK NP 36 36 0 0 15 21 0.583
91486 IFRLMRTNF NP 36 36 0 0 15 21 0.583
91520 IQNVPGPHR NP 36 36 0 0 15 21 0.583
91570 KILMNFHQK NP 36 36 0 0 15 21 0.583
91732 MVIFRLMRT NP 48 48 0 0 20 28 0.583
91735 MVTLRKERL NP 36 36 0 0 15 21 0.583
92059 TNAMVTLRK NP 36 36 0 0 15 21 0.583

75566 YQVNNLEEI NP 27 27 5 0 12 20 0.625
32188 KLTEAITAA NP 21 21 6 0 9 18 0.667
91108 AGVNVGEQY NP 9 9 0 0 3 6 0.667
91110 AHGSTLAGV NP 9 9 0 0 3 6 0.667
91111 AIALGVATA NP 18 18 0 0 6 12 0.667
91118 AKLTEAITA NP 9 9 0 0 3 6 0.667
91122 ALSSLAKHG NP 9 9 0 0 3 6 0.667
91137 ASLFLPKLV NP 9 9 0 0 3 6 0.667
91157 AVLYYHMMK NP 27 27 0 0 9 18 0.667
91163 AYQGDYKLF NP 9 9 0 0 3 6 0.667
91167 CLEKVQRQI NP 9 9 0 0 3 6 0.667
91209 DSDNTQSEH NP 9 9 0 0 3 6 0.667
91220 DYKLFLESG NP 9 9 0 0 3 6 0.667
91286 FARLLNLSG NP 9 9 0 0 3 6 0.667
91289 FDAVLYYHM NP 18 18 0 0 6 12 0.667
91291 FEEMYRHIL NP 18 18 0 0 6 12 0.667
91309 FLESGAVKY NP 27 27 0 0 9 18 0.667
91311 FLIKFLLIH NP 9 9 0 0 3 6 0.667
91319 FLPKLVVGE NP 18 18 0 0 6 12 0.667
91323 FMAILQHHQ NP 9 9 0 0 3 6 0.667
91325 FQESADSFL NP 9 9 0 0 3 6 0.667
91336 FSTSDGKEY NP 18 18 0 0 6 12 0.667
91343 FVTLDGQQF NP 27 27 0 0 9 18 0.667
91367 GHGFRFEVK NP 18 18 0 0 6 12 0.667
91381 GLDDQEKKI NP 9 9 0 0 3 6 0.667
91395 GLSVQQGIV NP 9 9 0 0 3 6 0.667
91397 GMNAPDDLV NP 9 9 0 0 3 6 0.667

91403 GPHRTIHHA NP 9 9 0 0 3 6 0.667
91446 HGEYAPFAR NP 9 9 0 0 3 6 0.667
91448 HGLFPQLSA NP 9 9 0 0 3 6 0.667
91466 HMMVIFRLM NP 36 36 0 0 12 24 0.667
91498 IIQAFEAGV NP 9 9 0 0 3 6 0.667
91508 ILMNFHQKK NP 45 45 0 0 15 30 0.667
91521 IQVHAEQGL NP 18 18 0 0 6 12 0.667
91531 ITAASLPKT NP 27 27 0 0 9 18 0.667
91549 KAALSSLAK NP 27 27 0 0 9 18 0.667
91565 KHGEYAPFA NP 9 9 0 0 3 6 0.667
91578 KKILMNFHQ NP 9 9 0 0 3 6 0.667
91626 LEHGLFPQL NP 18 18 0 0 6 12 0.667
91635 LFLPKLVVG NP 18 18 0 0 6 12 0.667
91641 LGLDDQEKK NP 9 9 0 0 3 6 0.667
91670 LMLCLHHAY NP 18 18 0 0 6 12 0.667
91703 LTESDMDYH NP 9 9 0 0 3 6 0.667
91715 MDYHKILTA NP 9 9 0 0 3 6 0.667
91724 MMKDEPVVF NP 36 36 0 0 12 24 0.667
91727 MNEENRFVT NP 9 9 0 0 3 6 0.667
91741 NAPDDLVLF NP 9 9 0 0 3 6 0.667
91749 NHKNKFMAI NP 18 18 0 0 6 12 0.667
91788 PFDAVLYYH NP 9 9 0 0 3 6 0.667
91798 PNRSTKGGQ NP 9 9 0 0 3 6 0.667
91809 PVYRDHSEK NP 18 18 0 0 6 12 0.667
91828 QGLIQYPTA NP 18 18 0 0 6 12 0.667
91864 QTNAMVTLR NP 18 18 0 0 6 12 0.667
91898 RLEELLPAV NP 9 9 0 0 3 6 0.667

91906 RPQKIWMAP NP 9 9 0 0 3 6 0.667
91908 RQIQVHAEQ NP 9 9 0 0 3 6 0.667
91909 RQRVIPVYQ NP 9 9 0 0 3 6 0.667
91917 RSQGPFDAV NP 9 9 0 0 3 6 0.667
91919 RSTKGGQQK NP 18 18 0 0 6 12 0.667
91939 SAIALGVAT NP 9 9 0 0 3 6 0.667
91947 SFEEMYRHI NP 9 9 0 0 3 6 0.667
91954 SGLLIVKTV NP 9 9 0 0 3 6 0.667
91979 SPRMLTPIN NP 9 9 0 0 3 6 0.667
91980 SQDTTIPDV NP 9 9 0 0 3 6 0.667
91981 SQGPFDAVL NP 36 36 0 0 12 24 0.667
92052 TLAAMPEEE NP 9 9 0 0 3 6 0.667
92055 TLDGQQFYW NP 27 27 0 0 9 18 0.667
92070 TQSRPIQNV NP 9 9 0 0 3 6 0.667
92073 TSDGKEYTY NP 9 9 0 0 3 6 0.667
92119 VISNSVAQA NP 9 9 0 0 3 6 0.667
92128 VLFDLDEDD NP 9 9 0 0 3 6 0.667
92133 VLYYHMMKD NP 27 27 0 0 9 18 0.667
92135 VMNHKNKFM NP 36 36 0 0 12 24 0.667
92171 VYRDHSEKK NP 27 27 0 0 9 18 0.667
92180 WQSVGHMMV NP 18 18 0 0 6 12 0.667
92195 YLEGHGFRF NP 36 36 0 0 12 24 0.667
92203 YQSYSENGM NP 18 18 0 0 6 12 0.667
92215 YWPVMNHKN NP 9 9 0 0 3 6 0.667
54673 RLMRTNFLI NP 72 72 6 0 25 53 0.679
17527 FQQTNAMVT NP 18 18 5 0 6 17 0.739
16888 FLSFASLFL NP 41 41 6 0 11 36 0.766

75491 YQGDYKLFL NP 39 39 0 0 9 30 0.769
91382 GLFPQLSAI NP 27 27 0 0 6 21 0.778
91946 SFASLFLPK NP 27 27 0 0 6 21 0.778
91447 HGFRFEVKK NP 18 18 0 0 3 15 0.833
91837 QLIIQAFEA NP 18 18 0 0 3 15 0.833
91725 MMVIFRLMR NP 27 27 0 0 3 24 0.889
227076 DDIPFPGPINDDDNP NP 2 0 0 2 0 2 1.000
227330 LFDLDEDDEDTKPVP NP 2 0 0 2 0 2 1.000
227532 RSTKGGQQKNSQKGQ NP 2 0 0 2 0 2 1.000
227631 TSGHYDDDDDIPFPG NP 2 0 0 2 0 2 1.000
91127 ANAGQFLSF NP 9 9 0 0 0 9 1.000
91480 ICQLIIQAF NP 9 9 0 0 0 9 1.000
91543 IWMAPSLTE NP 9 9 0 0 0 9 1.000
91661 LLMLCLHHA NP 9 9 0 0 0 9 1.000
91723 MLTPINEEA NP 9 9 0 0 0 9 1.000
64728 TLASIGTAF L 96 96 0 0 85 11 0.115
22207 GRTFGKLPY L 87 87 0 0 77 10 0.115
24743 HSGFIYFGK L 110 110 0 0 97 13 0.118
65802 TPVMSRFAA L 126 126 0 0 110 16 0.127
66098 TRSFTTHFL L 43 43 0 0 37 6 0.140
4190 ARLSSPIVL L 74 74 0 0 63 11 0.149
23396 GYLEGTRTL L 97 97 0 0 82 15 0.155
75746 YSGNIVHRY L 71 71 0 0 59 12 0.169
35360 LEARVNLSV L 111 111 0 0 90 21 0.189
60820 SRTPSGKRL L 76 76 0 0 61 15 0.197
75643 YRNFSFSLK L 100 100 0 0 80 20 0.200
17655 FRYEFTAPF L 72 72 0 0 52 20 0.278

16337 FIYFGKKQY L 107 107 0 0 69 38 0.355
69872 VLYHRYNLV L 91 91 0 0 55 36 0.396
18186 FVHSGFIYF L 122 122 0 0 44 78 0.639
26670 IISDLSIFI L 18 18 0 0 0 18 1.000
37081 LLADGLAKA L 18 18 0 0 0 18 1.000
29799 KAFPSNMMV L 10 10 0 0 0 10 1.000
39313 LSDLCNFLV VP24 110 110 0 0 98 12 0.109
46665 NYNGLLSSI VP24 95 95 3 0 87 11 0.112
69756 VLSDLCNFL VP24 71 71 0 0 63 8 0.113
33782 KTNDFAPAW VP24 70 70 0 0 62 8 0.114
34279 KVYWAGIEF VP24 69 69 0 0 53 16 0.232
59501 SLTDRELLL VP30 78 78 0 0 70 8 0.103
93203 LANPTADDF VP30 106 106 0 0 88 18 0.170
56377 RVPTVFHKK VP30 100 100 0 0 80 20 0.200
24266 HLPGFGTAF VP35 48 48 0 0 42 6 0.125
93005 IMYDHLPGF VP35 58 58 0 0 43 15 0.259
4862 ATAAATEAY VP35 61 79 0 0 52 27 0.342
227054 ATTQNDRMPGPELSG VP35 2 0 0 2 1 1 0.500
227128 EHGQPPPGPSLYEES VP35 2 0 0 2 1 1 0.500
227350 LMTGRIPVSDIFCDI VP35 2 0 0 2 1 1 0.500
227416 NNPGLCYASQMQQTK VP35 2 0 0 2 1 1 0.500
227651 VCVFQLQDGKTLGLK VP35 2 0 0 2 1 1 0.500
227058 CDIENNPGLCYASQM VP35 2 0 0 2 0 2 1.000
227084 DETVPQSVREAFNNL VP35 2 0 0 2 0 2 1.000
227263 IESRDETVPQSVREA VP35 2 0 0 2 0 2 1.000
227458 PQSVREAFNNLNSTT VP35 2 0 0 2 0 2 1.000
227512 RIPVSDIFCDIENNP VP35 2 0 0 2 0 2 1.000

227547 SDIFCDIENNPGLCY VP35 2 0 0 2 0 2 1.000
227589 STTSLTEENFGKPDI VP35 2 0 0 2 0 2 1.000
38663 LPQYFTFDL VP40 176 176 0 0 154 22 0.125
62789 TAAIMLASY VP40 114 114 0 0 99 15 0.132
70603 VQLPQYFTF VP40 121 121 0 0 105 16 0.132
17945 FTFDLTALK VP40 162 162 0 0 119 43 0.265
33002 KQIPIWLPL VP40 78 79 0 0 56 23 0.291
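The column headings for this table fall outside the pages shown here, so the following is only a sketch of how the rows might be read programmatically. It assumes each row is space-separated, with an identifier, a peptide (or residue label), a protein name of one or more words, six integer counts, and a final ratio. It also assumes, based solely on inspecting the rows above, that the final ratio equals the last count divided by the sum of the last two counts. The function names `parse_row`, `recomputed_ratio`, and `is_consistent` are hypothetical and not part of the original analysis.

```python
def parse_row(line):
    """Split one space-separated table row into its fields.

    Assumed layout (inferred, not stated in the source):
    id, peptide, protein name (1+ words), six integer counts, ratio.
    """
    tokens = line.split()
    return {
        "epitope_id": tokens[0],
        "peptide": tokens[1],
        "protein": " ".join(tokens[2:-7]),
        "counts": [int(t) for t in tokens[-7:-1]],
        "reported": float(tokens[-1]),
    }


def recomputed_ratio(row):
    """Last count divided by the sum of the last two counts (assumed relation)."""
    second_last, last = row["counts"][-2:]
    total = second_last + last
    return last / total if total else None


def is_consistent(row, tol=0.0006):
    """True if the reported ratio matches the recomputed one to 3 decimals."""
    value = recomputed_ratio(row)
    return value is not None and abs(value - row["reported"]) <= tol


# Two rows copied from the table above.
sample_rows = [
    "21863 GQFLSFASL NP 42 45 3 0 21 27 0.563",
    "227099 DPGTNTTTEDHKIMA virion spike glycoprotein 2 0 0 2 0 2 1",
]

for text in sample_rows:
    parsed = parse_row(text)
    print(parsed["epitope_id"], parsed["protein"], is_consistent(parsed))
```

Counting fields from the right keeps multi-word protein names such as "virion spike glycoprotein" intact. The tolerance is set slightly above half of the last printed decimal place so that rounding in the reported value is not flagged as a mismatch.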

Appendix 3.1 Within group values alternate colors


Supplementary Figure 3.1: Phylogenetic pairs within group values. Shown above are the within-pair values of dN and dS under two different color schemes, one by dN or dS and one by gene. Values of zero have been omitted for clarity. Baseline epitope levels were used (10% for L, VP24, VP30, VP35, and VP40; 50% for NP and GP).

Chapter 4

Appendix 4.1 Point Mutation Graphs

These include single sequence mutations

Supplementary Figure 4.1: Epitope Regions and Point Mutations in Three Ebolavirus Genes, 50% Epitope Threshold. Here the Ebolavirus genes (A) NP, (B) GP, and (C) VP35 are shown with the 50% threshold epitope regions in blue, the nonsynonymous point mutations in red, and the synonymous point mutations in orange.


Supplementary Figure 4.2: Epitope Regions and Point Mutations in Three Ebolavirus Genes, 90% Epitope Threshold. Here the Ebolavirus genes (A) NP, (B) GP, and (C) VP35 are shown with the 90% threshold epitope regions in blue, the nonsynonymous point mutations in red, and the synonymous point mutations in orange.


Supplementary Figure 4.3: Protein Structures and Point Mutations in Three Ebolavirus Genes. Here the Ebolavirus genes (A) NP, (B) GP, and (C) VP35 are shown with the alpha helices in blue, the beta sheets in green, the nonsynonymous (dN) point mutations in red, and the synonymous (dS) point mutations in orange. These structures are based on RCSB PDB protein maps (NP: 4Z9P and 4YPI; GP: Q05320 and 3CSY; VP35: 3FKE). The NP and VP35 mappings were incomplete, with gaps at 1147-1935 (NP) and 127-660 (VP35).

