ABSTRACT

MODELING AND PHYLODYNAMIC SIMULATIONS OF AVIAN INFLUENZA

by Liam Mosley

Avian Influenza Viruses (AIVs) are highly adaptive and mutate continuously throughout their life cycle. Subtype H5N1, also known as Highly Pathogenic Asian Avian Influenza, is of particular interest due to its rapid spread from Asia to other countries. Constant mutations in the protein sequences of AIVs cause antigenic drift, which leads to the spread of epidemics to livestock and causes billions of dollars in socio-economic losses each year. Consequently, containment of AIV epidemics is of vital importance. Computational approaches to epidemic forecasting, specifically phylodynamic simulations, enhance in vivo analysis by enabling analysis of ecological parameters and evolutionary traits and by predicting antigenic shifts to assist vaccine design. This work introduces an improvement on existing phylodynamic simulation models, called the HASEQ model, which uses actual Hemagglutinin (HA) protein sequences, simulates mutations through amino acid substitution models, and implements an amino-acid level antigenic analysis algorithm to model selection pressure. In contrast to prior approaches that rely on abstract representations of strains and mutations, HASEQ manipulates and yields actual HA strains to allow for robust validation and direct application of results to inform epidemic containment efforts. The validity of the HASEQ model is assessed via comparisons to the WHO nomenclature, refined to represent strains present in three high-risk countries. The model is calibrated and validated using thousands of simulations with wide-ranging parameter settings, requiring over 2,500 hours of computation time. Results show that the model improvements yield results with the expected evolutionary characteristics at the cost of a 10-fold increase in computational run time.

A Thesis

Submitted to the Faculty of Miami University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Department of Computer Science and Software Engineering

by Liam Mosley

Miami University
Oxford, Ohio
2019

Advisor: Dr. Dhananjai Rao

Reader: Dr. Eric Rapos

Reader: Dr. Eric Bachmann

Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions

2 Background
  2.1 Epidemiological Model
  2.2 Phylodynamic Simulations
  2.3 PhySim
    2.3.1 Simulation Methods
    2.3.2 Euclidean Models of Avian Influenza
  2.4 Related Work

3 Phylodynamic Model Improvements and Implementation
  3.1 HASEQ Model Introduced
  3.2 Virus Strain Representation
  3.3 P-Epitope
  3.4 Modeling Mutations on Influenza Strains

4 Validation and Analysis
  4.1 P-Epitope Validation
  4.2 PhySim Calibration and Comparative Analysis
    4.2.1 Turkey
    4.2.2 Nigeria
    4.2.3 Vietnam
  4.3 Generalized Sensitivity Analysis: Turkey
  4.4 BLAST Analysis
  4.5 Run-time and Memory Analysis
  4.6 Discussion

5 Conclusion
  5.1 Limitations
  5.2 Summary
  5.3 Future Work

References

List of Figures

2.1 Ecological model of the influenza life cycle
2.2 A phylogenetic tree showing the different virus lineages for Vietnam
2.3 SEIR Compartmental Model
2.4 Combined Ecological and SIS models with example parameters from PhySim
2.5 Antigenic drift caused via mutations in the euclidean model

3.1 A diagram of the PhySim program (Green = main, Yellow = abstract classes). Arrows represent inheritance, straight lines represent interaction without direct relationships.
3.2 Generalized PhySim diagram with changes outlined in red. Arrows represent inheritance, straight lines represent interaction without direct relationships.
3.3 Expected number of substitutions for particular residues in an average HA Sequence
3.4 Using matrix exponentiation figure (a) is transformed into figure (b), this example is done using a large value for t to show the differences in substitution rates

4.1 A comparison of the risk function with typical antigenic distances for both the geometric (orange) and HASEQ (blue) scales
4.2 A comparison of P-Epitope value distributions for inter-clade distances (orange) and intra-clade (blue) distances, sample was done on 100 different sequences from 10 different clades in the H5N1 2012 nomenclature
4.3 Phylograms produced from HASEQ and Geometric simulations compared to the reference phylogram for Turkey
4.4 Calibration exploration for Turkey. Results from 10 runs, brightly colored nodes represent successful runs.
4.5 Graph of average HASEQ and Geometric simulation infective populations over the course of 10 runs with a 95% Confidence Interval on the Geometric population for Turkey
4.6 Infective population for the last 5 simulation years with a 95% Confidence Interval on the Geometric population for Turkey
4.7 Phylograms produced from HASEQ and Geometric simulations compared to the reference phylogram for Turkey
4.8 Calibration exploration for Nigeria. Results from 10 runs, brightly colored nodes represent successful runs.
4.9 Graph of average HASEQ and Geometric simulation infective populations over the course of 5 runs with a 95% Confidence Interval on the Geometric population for Nigeria
4.10 Infective population for the last 5 simulation years with a 95% Confidence Interval on the Geometric population for Nigeria
4.11 Phylograms produced from HASEQ and Geometric simulations compared to the reference phylogram for Vietnam
4.12 Calibration exploration for Vietnam. Results from 10 runs, brightly colored nodes represent successful runs.
4.13 Graph of average HASEQ and Geometric simulation infective populations over the course of 10 runs with a 95% Confidence Interval on the Geometric population for Turkey
4.14 Infective population for the last 5 simulation years with a 95% Confidence Interval on the Geometric population for Vietnam
4.15 Generalized Sensitivity Analysis (GSA) results for Turkey, x-axis values in each sub-chart show the range of values for each parameter explored. The y-axis shows d_{m,n} for the values explored.
4.16 Summary of d_{m,n} values for Figure 4.15.
4.17 Correlation between parameter variables for PhySim runs using the HASEQ model

ACKNOWLEDGEMENTS

First I would like to thank my advisor DJ Rao for all of the advice and guidance he has given me throughout my time in the Computer Science Department. Second I would like to thank those on my thesis committee for taking the time out of their schedules to go over my work. Last I would like to show my appreciation to my family and friends who have supported me and helped to push me along the way while I finished my degree.

Chapter 1 Introduction

1.1 Motivation

Avian Influenza Viruses (AIVs) cause billions of dollars of socio-economic losses every year. Between 2014 and 2015 alone, the spread of AIVs led to the culling of 45 million turkeys and chickens in order to contain a single epidemic. There are a variety of approaches to containing AIV epidemics, such as vaccination, culling of populations, and livestock isolation. Vaccination efforts are of pivotal importance because they can prevent epidemics before they have the chance to spread [1]. Approaches to deciding how best to contain an epidemic fall into two typical groups. The first is in vivo, or live, analysis. In vivo approaches to strain selection revolve around sampling AIV host populations and using the density of different strains in those populations to determine which strains are most commonly present. Unfortunately, this process can take months to complete, whereas epidemic containment requires constant monitoring and quick action. Because of this time constraint, it is much harder to inform regions how to contain epidemics as they are happening. The second group consists of in silico, or computational, approaches. Phylodynamic simulations are one such computational approach. Focused on modeling the spread of epidemics, phylodynamic simulations allow researchers to analyze and predict when future epidemics will arise and how to tackle their containment. Because in silico approaches focus on preventative measures, they are of increasing importance to containment efforts.

1.2 Contributions

Current phylodynamic models are limited in their representation of virus strains. Viruses are modeled as Euclidean coordinate points, where each dimension represents a multitude of characteristics for a particular strain. These geometric, or Euclidean, models can be scaled up or down to represent different abstract dimensions of influenza strains; as the number of dimensions increases, so do the computational requirements of the simulation. Previous work was motivated to find scaled-down representations of virus strains by modeling antigenic drift, or evolutionary distance, as changes in 2-D vector values [2]. These representations are limited to an antiquated equation for antigenic distance, whereas newer, more proven measures such as P-Epitope have since been introduced in the literature [3]. Along with improvements in the measure of antigenic distance, the advent of machine learning has brought more accurate models of amino acid substitution. Current phylodynamic simulations model the change of virus strains over time as uniform random mutations in the nucleotide structure of protein sequences [4]. Models such as FLU have been shown to be effective at capturing individual amino acid substitution rates [5]. Each amino acid is encoded by multiple nucleotides, and amino acids are responsible for defining the shape of proteins [6]. This thesis implements these substitution models to obtain a more accurate representation of how viruses mutate.

In summary, this work introduces an improvement on existing phylodynamic simulation models by implementing current measures of antigenic distance and adapting amino acid substitution models to represent changes in virus strains at the protein level. This is done by modeling actual protein sequences for each virus in a simulation environment. The research introduced in this thesis addresses the previously mentioned shortcomings of the Euclidean model and delivers the following:

1. Drawbacks of the current modeling techniques can be summarized as:
   1.1. Current simulations do not model actual viral strains, which prevents more robust analysis of virus sequences in target simulation regions
   1.2. Mutation rates in the simulation are not representative of actual amino acid changes, and are assumed to be uniform and random
   1.3. Antigenic characteristics are approximated and are not as closely related to vaccine efficacy as newer measures
   1.4. Forecasting of epidemics to inform vaccine design is not straightforward because phylograms are constructed without using actual viral sequences
2. The new model introduced as HASEQ includes:
   2.1. Simulated viruses modeled using actual HA sequence(s) instead of the abstract Euclidean model, starting with the root HA sequence (A/turkey/England/5092/1991) corresponding to the root of the WHO H5N1 nomenclature [1].
   2.2. Realistic mutations simulated using mutation rates observed in nature, as reported by Dang et al. [5]. However, the mutation rates are further calibrated to characterize phylogenetic diversity in a given region.
   2.3. Antigenic diversity measured using an amino-acid level comparison algorithm called P-Epitope, proposed by Gupta et al. [3].
3. Other deliverables:
   3.1. An enhanced version of PhySim with the proposed antigenic model implemented
   3.2. Parameter settings to reconstruct phylogenetic trees similar to those already produced using PhySim
   3.3. A tool to generate an amino acid substitution rate from a nucleotide substitution rate

The model introduced in this research is named HASEQ and is compared against current phylodynamic modeling techniques.

Chapter 2 Background

2.1 Epidemiological Model

AIVs are of particular interest for modeling environments due to rapid mutations in H5N1 and the disease being endemic in waterfowl. Moreover, the seasonal migration patterns of waterfowl give rise to a complex evolutionary model.

Figure 2.1: Ecological model of the influenza life cycle. Diagram annotations: infected hosts shed viruses with genetic changes (until the host gains immunity to the current strain); viruses with sufficient genetic changes become antigenically different and evade the host's immunity; antigenically different viruses cause new infections in hosts. The figure also includes an example phylogram with 3 clades (lineages) created from in vivo sampling and sequencing.

Figure 2.1 represents an abstract view of the ecological process that phylodynamic simulations recreate [7]. Waterfowl hosts are seeded with an initial viral strain and the virus then begins to mutate. For up to 8 days the virus is shed from infected individuals, with the potential to infect not only other waterfowl but also the environment the host has contact with [8]. Water sources are particularly vulnerable and can harbor infections for up to 20 days after contamination [9]. Host immunity prevents individuals in the model from acquiring new infections if a viral strain is antigenically similar to the strain of a previous infection. Mutations occur within individual hosts and accumulate over time, causing antigenic drift. Antigenic drift causes new viral strains to diverge from their ancestral lineage, enabling them to escape host immunity and cause new infections. The establishment of different viral lineages gives rise to new clades in a phylogenetic tree, as shown in Figure 2.1. Phylogenetic trees are one of the most important outputs of phylodynamic simulations and are used in both in vivo and in silico analysis. Figure 2.2 shows an example reference tree for Vietnam. The different colorings represent clades, or groupings, of antigenically similar virus strains.

Due to the rapid mutations AIVs undergo, new strains must be selected for vaccine design every 6 to 8 months [10]. This poses a big challenge to epidemic forecasting and analysis efforts, as in vivo analysis can take an increasingly long time to produce results and consequently lags behind emerging epidemics. In order to be proactive, researchers adopt in silico approaches to epidemic analysis through the use of phylodynamic simulations.

Figure 2.2: A phylogenetic tree showing the different virus lineages for Vietnam

The length of the branches that connect the viruses represents phylogenetic distance. The WHO's chosen measure of distance is P-Sequence [1]. P-Sequence is simply the pairwise number of nucleotides that differ between two aligned virus sequences. Virus sequence alignment allows two virus strains to be compared by matching the indices, or residues, of both sequences to each other. An average H5N1 virus sequence is made up of 1600 nucleotides; the cutoff value for clades is 24 nucleotide discrepancies, or 1.5% of 1600 [1]. Phylodynamic simulations are able to produce phylogenetic trees as an output, and the simulation model can be validated by calibrating simulations to produce trees that are similar in structure to reference phylograms. There are a multitude of ways to construct phylogenetic trees; the primary methods are based on phylogenetic distance and emergence times. Trees constructed using phylogenetic distance are produced by current models and will also be produced from the HASEQ model. As determined by the WHO nomenclature, the protein sequence used to represent each virus is derived from Hemagglutinin protein structures [1]. Hemagglutinin (HA) is the receptor-binding protein found on the surface of influenza viruses. Changes in the amino acid sequence of HA proteins change the shape of the surface of the virus and directly affect how the virus binds to potential host receptors. The protein sequence data of HA is used to determine the distances that define phylogenetic trees.
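The P-Sequence measure and the 1.5% clade cutoff described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the sequences are assumed to be already aligned to equal length, and comparing a single pair of sequences with a strict "less than" test is a simplification of how clades are actually assigned in the WHO nomenclature.

```python
def p_sequence(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def within_clade_cutoff(seq_a: str, seq_b: str,
                        cutoff_fraction: float = 0.015) -> bool:
    """Pairwise check against the 1.5% cutoff (24 of 1600 nucleotides).

    Assumption: the strict '<' comparison and the pairwise check are
    illustrative choices, not the WHO procedure.
    """
    cutoff = cutoff_fraction * len(seq_a)
    return p_sequence(seq_a, seq_b) < cutoff
```

For two 1600-nucleotide sequences, 10 differing positions fall under the cutoff of 24, while 30 differing positions exceed it.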

2.2 Phylodynamic Simulations

Phylodynamics is the study of epidemiological, environmental, and phylogenetic processes, and more specifically of how the interactions between these processes shape the way a particular disease spreads and evolves over time. Simulation tools, and the modeling involved in translating biological functions and environmental parameters into simulation environments, are the focus of this research. Gog and Grenfell proposed the current leading practices and laid the foundation for phylodynamic modeling [2].

Figure 2.3: SEIR Compartmental Model

The starting point for epidemiological studies is a set of four standard compartments for hosts within the modeling environment. Hosts can represent individuals or groups of any animals, such as humans, pigs, waterfowl, and livestock, that are at risk of spreading infections. The four compartments, as shown in Figure 2.3, are Susceptible (S), Exposed (E), Infected (I), and Recovered (R). Hosts that are susceptible have not been in contact with a specific viral strain but can potentially be infected. A viral strain can infect a host only if an antigenically similar strain is not present in the immune history of the host. Once infected, the host remains in the infective compartment until the virus is no longer replicating and has run its course. While the host is in the infective compartment, the virus can spread to other susceptible hosts that it comes into contact with. In addition, an infective host also sheds the virus, contaminating its environment. After the host acquires immunity against a viral strain, it transitions back to the susceptible state and the cycle repeats.

2.3 PhySim

PhySim is the computational implementation of the ecological model shown in Figure 2.3. PhySim is a simulation tool that is used to model the epidemic progression of avian influenza [11]. PhySim was derived from Antigen, a general simulation tool originally developed by Bedford et al. to work with the human influenza virus H3N2 [4]. PhySim's enhancements on top of Antigen include:

1. Simulation of avian influenza strains
2. Multiple host species modeled using different birth rates, death rates, and brooding seasons
3. Genetic and antigenic properties of viruses are independently modeled
4. Modeled cross immunity
5. Production of phylogenetic trees using genetic differences instead of emergence times
6. Seasonal fluctuation of infection rates to model changes in temperature

PhySim is written in Java and uses Gillespie’s Stochastic Simulation Algorithm (SSA) with Tau-Leap optimization to simulate epidemic progression with sufficient accuracy. The system uses a time step of 0.1, meaning actions are taken in 10 increments per simulation day. An Individual-Based Model (IBM) is used to model host interactions or contacts, and hosts are moved between the aforementioned Susceptible and Infective compartments based on contact rates and risk of infection. Figure 2.4 illustrates a broadened view of how the S-I-S and ecological model interact in the simulation environment.

An S-I-S model is used instead of S-E-I-R due to the endemic nature of HPAI. Infections take hold immediately after exposure, so hosts move from S to I without being kept in compartment E. Once the virus runs its course, hosts are immediately susceptible to new infections, so long as the potential infection escapes the host's immune history.

Figure 2.4: Combined Ecological and SIS models with example parameters from PhySim. Diagram labels connecting the S and I compartments: birth during brooding (µb), death (µd), infection from an antigenically different strain (β, seasonally modulated by Ω), and recovery from infection (ν).

PhySim allows the analysis of the impact of a variety of parameters, as summarized in Table 2.1. Hosts represent a group of waterfowl from a specific species, and multiple species can be present in each simulation. This is done via an IBM because groups of waterfowl flock and roost together. Different regions can be modeled by targeting specific countries and setting parameter values specific to the waterfowl found in that region. Nigeria, Turkey, and Vietnam were identified as high-risk countries using a combination of phylodynamic and phylogeographic analysis and will be the focus of this research [11]. A point of interest in the model is that seasonal temperature fluctuation is taken into consideration when determining the risk of potential infections. By using a sinusoidal curve as a modulation factor, the chance of infection can be increased in colder seasons and decreased during warmer seasons.

Symbol  Description
µb      Species-specific daily birth rate during brooding season
µd      Species-specific daily death rate derived from lifespan
β       Contact rate (direct) between hosts for a region
Ω       Sinusoidal seasonal modulation factor
ψ       Average daily phenotypic mutation rate
ν       Inverse of infectious period

Table 2.1: A sample of PhySim ecological parameters for a multi-species simulation model

These settings are passed into PhySim in YAML file format, initializing the simulation to run for a particular country. Simulation models can be validated by setting ecological parameters such as those found in Table 2.1 so that phylograms produced from the simulations mirror those constructed from in vivo analysis. By identifying which parameters consistently produce matching phylograms, the factors that have the largest impact on epidemic containment can be deduced, as exemplified in Section 4.3. Simulations are run with a burn-in period of 15 years to simulate the time leading up to the present day. By stepping past the burn-in period, simulations are then able to predict what the evolutionary landscape will look like in the coming years. Parameters such as contact rate can be abstractly tied to measures such as livestock isolation, and features such as mutation rate and infectious period can be mapped to vaccination efforts. By analyzing the parameters associated with different containment and vaccination regimes, the impact and effectiveness of each method can be determined.

2.3.1 Simulation Methods

Algorithm 1 moves the simulation forward on a day-to-day basis. The delta variable represents the previously mentioned time-step value of 0.1.

Algorithm 1 simulate(days, delta = 0.1)
    day = 0
    while day < days do
        grow(day % 365, delta)
        decline(delta)
        epidemic(day, delta)
        mutate(delta)
        day = day + delta
    end while
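The main loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than PhySim's Java implementation: the `on_step` callback is a hypothetical stand-in for the grow, decline, epidemic, and mutate calls, and passing an integer day of the year is an assumption. The sketch counts whole steps instead of repeatedly adding 0.1, because repeated floating-point addition of 0.1 does not land exactly on whole days.

```python
def simulate(days: int, steps_per_day: int = 10, on_step=None) -> int:
    """Advance the simulation in fixed steps; returns the number of steps taken."""
    delta = 1.0 / steps_per_day              # 0.1 for the default setting
    total_steps = days * steps_per_day
    for step in range(total_steps):
        day = step / steps_per_day           # fractional simulation day
        day_of_year = int(day) % 365         # input to the brooding-season check
        if on_step is not None:
            # Hypothetical hook standing in for grow/decline/epidemic/mutate.
            on_step(day_of_year, delta)
    return total_steps
```

With the default of 10 steps per day, a 15-year burn-in corresponds to 15 * 365 * 10 = 54,750 steps.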

Hosts are introduced into and removed from the simulation using a birth and death model. These rates differ for each species being modeled for a particular region. Variables can account for the abundance and lifespans of different high-risk waterfowl species such as A. Acuta (Northern Pintail), A. Crecca (Common Teal), A. Fuligula (Tufted Duck), A. Penelope (Eurasian Wigeon), L. Canus (Common Gull), L. Limosa (Black-tailed Godwit), P. Pugnax (Ruff), and V. Vanellus (Northern Lapwing).

Algorithms 2 and 3 show how the different characteristics of each waterfowl species are incorporated into the birth and death cycle. The brooding season determines when hosts of a particular species are introduced into the simulation. A sample of these values, pulled from the Global Register of Migratory Species (GROMS) database, is found in Table 2.2 [12]. Each species can belong to a multitude of regions. At initialization the simulation is seeded with a starting population of each waterfowl species such that it models the expected skew of species found in the GROMS database for a particular region.

Algorithm 2 grow(day, delta)
    for species ∈ model do
        if species.broodStart ≤ day ≤ species.broodEnd then
            net µb = (species.S + species.I) * species.µb * delta
            species.S = species.S + poisson(net µb)
        end if
    end for

Algorithm 3 decline(delta)
    for species ∈ model do
        for comp ∈ species do        // comp is S or I for a species
            net µd = comp * species.µd * delta
            comp = comp - poisson(net µd)
        end for
    end for
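The birth and death steps of Algorithms 2 and 3 can be sketched in Python as follows. The sketch makes several assumptions that are not taken from PhySim: the `Species` fields and all numeric values are hypothetical, the Poisson sampler is Knuth's method (suitable only for small means), deaths are capped so a compartment cannot go negative, and the wrap-around check for brooding seasons that span the new year (such as Dec-Feb) is an added convenience.

```python
import math
import random
from dataclasses import dataclass


def poisson(mean: float, rng: random.Random) -> int:
    """Knuth's Poisson sampler; adequate for the small per-step means here."""
    if mean <= 0:
        return 0
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


@dataclass
class Species:
    name: str
    S: int             # susceptible hosts
    I: int             # infective hosts
    mu_b: float        # daily birth rate during brooding (hypothetical value)
    mu_d: float        # daily death rate (hypothetical value)
    brood_start: int   # first brooding day of the year
    brood_end: int     # last brooding day of the year


def in_brooding(sp: Species, day_of_year: int) -> bool:
    """True inside the brooding season, including seasons that wrap the year."""
    if sp.brood_start <= sp.brood_end:
        return sp.brood_start <= day_of_year <= sp.brood_end
    return day_of_year >= sp.brood_start or day_of_year <= sp.brood_end


def grow(model, day_of_year: int, delta: float, rng: random.Random) -> None:
    """Add newborn susceptible hosts for every species that is brooding."""
    for sp in model:
        if in_brooding(sp, day_of_year):
            sp.S += poisson((sp.S + sp.I) * sp.mu_b * delta, rng)


def decline(model, delta: float, rng: random.Random) -> None:
    """Remove hosts from both compartments in proportion to their size."""
    for sp in model:
        sp.S -= min(sp.S, poisson(sp.S * sp.mu_d * delta, rng))
        sp.I -= min(sp.I, poisson(sp.I * sp.mu_d * delta, rng))
```

For realistic population sizes the per-step means can become large, in which case a library sampler would be a better choice than Knuth's method.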

Waterfowl Species   Life Span   Brooding Period
A. Penelope         2.02        Feb-April
A. Crecca           2.5         Dec-Feb
A. Acuta            3           Feb-July
A. Fuligula         3.5         Feb-April
V. Vanellus         3.5         April-July

Table 2.2: A sample of PhySim ecological parameters for a multi-species simulation model

Algorithm 4 is the core of the simulation and dictates simulated contacts and recoveries using a Poisson distribution. Infective hosts are randomly selected, and contacts are randomly distributed to susceptible hosts. If the infective host's virus escapes the immune history of the susceptible host, then the infection spreads. Each day a certain number of infective hosts recover from their virus and move back to the susceptible compartment.

Algorithm 4 epidemic(day, delta = 0.1)
    S = Σ species.S, for all species in the model
    I = Σ species.I, for all species in the model
    N = S + I
    I′ = poisson((I * S * β * modulation factor * delta) / N)
    R′ = poisson(I * ν * delta)
    while I′ > 0 do
        i = getInfected(uniform(I))
        s = getSusceptible(uniform(S))
        if canInfect(i.getVirus(), s) then
            removeS(s, S)
            s′ = s.setVirus(i.getVirus())
            addI(s′)
            I′ = I′ − 1
        end if
    end while
    while R′ > 0 do
        r = getInfected(uniform(I))
        r.addToImmuneHistory(r.getVirus())
        r.clearVirus()
        removeI(r)
        addS(r)
        R′ = R′ − 1
    end while
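The contact and recovery bookkeeping of Algorithm 4 can be sketched in Python as follows. This is an illustration under assumptions, not PhySim's implementation: hosts are plain dictionaries with hypothetical `virus` and `immune_history` keys, the Poisson draws are assumed to have been made by the caller from the means returned by `expected_events`, and each contact is attempted once rather than retried until it succeeds, which keeps the loop bounded.

```python
import random


def expected_events(S, I, beta, nu, modulation, delta):
    """Poisson means for new contacts and recoveries in one time step."""
    N = S + I
    contacts = I * S * beta * modulation * delta / N if N else 0.0
    recoveries = I * nu * delta
    return contacts, recoveries


def epidemic_step(susceptible, infected, new_infections, recoveries,
                  can_infect, rng):
    """Move hosts between the S and I lists for pre-drawn event counts."""
    for _ in range(new_infections):
        if not susceptible or not infected:
            break
        source = rng.choice(infected)
        idx = rng.randrange(len(susceptible))
        target = susceptible[idx]
        # The infection spreads only if the virus escapes the immune history.
        if can_infect(source["virus"], target):
            susceptible.pop(idx)
            target["virus"] = source["virus"]
            infected.append(target)
    for _ in range(min(recoveries, len(infected))):
        host = infected.pop(rng.randrange(len(infected)))
        host["immune_history"].append(host["virus"])
        host["virus"] = None
        susceptible.append(host)
```

Because hosts are only moved between the two lists, the total population is conserved by this step; births and deaths are handled separately by the grow and decline steps.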

2.3.2 Euclidean Models of Avian Influenza

The model currently used by PhySim represents the phylogenetic characteristics of AIV strains as 2-dimensional vectors of floating-point values; this is the geometric, or Euclidean, model [2]. The root sequence for the simulation starts at the Euclidean origin, and strain mutations are represented as changes in the vector values. The mutation rate for the strains is set as a parameter value. Figure 2.5 illustrates two example strains mutating over time. The Euclidean distance between two points represents phylogenetic distance [13]. In line with the 2-D model presented by Gog and Grenfell, this model allows the phylogenetic distance to be mapped to a measure of antigenic difference from which the chance of infection can be derived. For example, if a host had previously been infected with Virus 1 in Figure 2.5, then Virus 1 is present in the host's immune history. If the host was then exposed to Virus 2, the phylogenetic distance could be measured using the Euclidean distance. This value can then be passed through a Bayesian formula to find the related antigenic distance between the two strains and the ensuing risk of the new infection taking hold.

Figure 2.5: Antigenic drift caused via mutations in the euclidean model

Algorithm 5 shows how a potential source of infection is checked against the immune history of a susceptible host. The closest distance between the potential infection and the host's immune history is used to calculate the risk of infection.

Algorithm 5 canInfect(virus v, susceptibleHost s)
    minRisk = 1 − homologousImmunity
    maxRisk = homologousImmunity
    risk = 1.0
    for vi ∈ s.immuneHistory do
        distance = distance(v, vi)
        if distance < risk then
            risk = distance
        end if
    end for
    risk = min(maxRisk, risk)
    infectFlag = uniform(1) < risk
    return infectFlag

The antigenic distance used here is calculated using the following equation:

$$\mathrm{antigenicDistance} = 1 - e^{-\,\mathrm{2dDistance}\,/\,\mathrm{shapeCrossImmunity}}$$

Here 2dDistance is the Euclidean distance between two viral strains. This makes the relationship between phylogenetic distance and antigenic distance exponential [2]. Antigenic distance represents the risk of infection, where values closer to 0 represent antigenically similar strains and values closer to 1 represent antigenically distant strains. The mutations that occur in the geometric model are also controlled using the time-step value delta. At each time step a portion of the infective population is selected and their viruses are mutated; the direction of each mutation is controlled using sine and cosine curves to model dimensional mutations.
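The risk calculation described by Algorithm 5 and the antigenic distance equation can be sketched in Python as follows. Two assumptions should be noted: the exponent is read as the negative 2-D distance divided by a single `shape_cross_immunity` parameter, which is one interpretation of the equation above, and the risk starts at 1.0 so that the minimum over the immune history is meaningful. Function and parameter names are illustrative only.

```python
import math
import random


def antigenic_distance(v, w, shape_cross_immunity):
    """Map the Euclidean distance between two 2-D strains into [0, 1)."""
    d = math.hypot(v[0] - w[0], v[1] - w[1])
    # Assumed reading of the equation: 1 - exp(-distance / shape).
    return 1.0 - math.exp(-d / shape_cross_immunity)


def infection_risk(virus, immune_history, homologous_immunity,
                   shape_cross_immunity):
    """Risk from the closest strain in the immune history, capped at maxRisk."""
    max_risk = homologous_immunity
    risk = 1.0
    for past in immune_history:
        risk = min(risk, antigenic_distance(virus, past, shape_cross_immunity))
    return min(max_risk, risk)


def can_infect(virus, immune_history, homologous_immunity,
               shape_cross_immunity, rng):
    """Draw a uniform number and compare it with the computed risk."""
    risk = infection_risk(virus, immune_history, homologous_immunity,
                          shape_cross_immunity)
    return rng.random() < risk
```

A host whose immune history already contains the exact strain has a risk of 0 and is never infected, while a host with an empty immune history is infected with probability equal to the cap.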

Algorithm 6 2dMutate(delta)
    I = Σ species.I, for all species in model
    I′ = poisson(I * delta)
    while I′ > 0 do
        i = getInfected(uniform(I))
        v = i.getVirus()
        θ = uniform(2π)
        v.traitX = v.traitX + ψ * cos(θ)
        v.traitY = v.traitY + ψ * sin(θ)
        I′ = I′ − 1
    end while

The mutation rate variable ψ is one of the simulation parameters described in Table 2.1. Because the number of strains chosen to be mutated at each time step is controlled by delta, each infection mutates approximately once per simulation day on average.
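A single mutation step of the geometric model can be sketched in Python as follows. The function name is hypothetical, and the sketch covers only the per-virus update, not the Poisson selection of which infections mutate.

```python
import math
import random


def mutate_2d(trait_x: float, trait_y: float, psi: float,
              rng: random.Random):
    """Move a strain a fixed distance psi in a uniformly random direction."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return trait_x + psi * math.cos(theta), trait_y + psi * math.sin(theta)
```

Every mutation therefore moves a strain exactly ψ away from its previous position, with only the direction being random. With delta = 0.1 there are 10 steps per day, each selecting on average I × 0.1 infections, which gives I mutations per day in expectation, or roughly one per infection.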

2.4 Related Work

This research distinguishes itself from other state-of-the-art work in the field of phylodynamics in a number of ways:

1. Similar to the previous work of Rao, Sheffield, and Giridharan, simulations for this research focus on modeling avian influenza epidemics [11][13][14], whereas other literature focuses on modeling human influenza viruses [4][9][15]
2. Literature that reconstructs viral phylogenies does not simulate the spread and mutation of viruses but relies on approximation without selection pressure [5]
3. The HASEQ model is a crossover of Items 1 and 2: viral phylogenies of avian influenza strains are reconstructed, but by applying selection pressure in a simulation environment

Sheffield and Rao adapted Antigen to work with AIV strains in order to monitor epidemic progression in a new context [13]. Their model was further applied by Giridharan and Rao to target specific high-risk countries for epidemic forecasting [11]. The model introduced by this research distinguishes itself from their work by improving upon the models used to simulate host immune history and determine the risk of infection, and by modeling mutations in competing strains using amino acid substitution models, which better reflect specific amino acid substitution rates rather than assuming all nucleotide and amino acid substitutions are uniform and random.

Work done by Dang et al. focused on reconstructing the entire phylogeny of H5N1 HA sequences using Bayesian models [5]. This research is distinguished from theirs by producing phylogenies specific to high-risk regions, introducing selection pressure, and obtaining results using IBM simulation models.

By exploring the crossover between reconstructing viral phylogenies using actual HA protein sequences and applying selection pressure in a simulation environment, vaccine design and epidemic containment efforts can be informed with a more hands-on approach, while retaining the ability to control ecological parameters and explore a variety of containment and vaccination methods, similar to the work of Giridharan and Rao [11].

Chapter 3 Phylodynamic Model Improvements and Implementation

3.1 HASEQ Model Introduced

This section illustrates how the HASEQ model is incorporated into the PhySim environment. Figure 3.1 displays a generalized diagram of PhySim without the new introductions.

Figure 3.1: A diagram of the PhySim program (Green = main, Yellow = abstract classes). Arrows represent inheritance, straight lines represent interaction without direct relationships.

The Antigen class controls the simulation environment and reads the parameter settings for a simulation from a YAML file. These parameters are then passed along to the simulation classes. Figure 3.2 marks in red the changes introduced to enable HASEQ simulation runs.

Figure 3.2: Generalized PhySim diagram with changes outlined in red. Arrows represent inheritance, straight lines represent interaction without direct relationships.

A new specific phenotype class is introduced into the existing PhySim system that simulates virus sequences using the HASEQ model. This class is abstractly handled by the existing phenotype hierarchy. The supporting classes are handled through the parameter class. The changes marked in red in Figure 3.2 abstractly summarize the HASEQ model: introducing models for actual HA sequences, implementing a substitution model to use for mutations, and examining changes in specific epitope regions using P-Epitope. The Substitution Model class handles loading pre-defined substitution matrices to match expected substitution rates, the Epitope Regions class defines which residues belong to each epitope as well as which to analyze for a particular simulation run, and the Amino Map class translates matrix indices to amino acids for the mutation model.

3.2 Virus Strain Representation

The aforementioned geometric model of representation limits the information that is attached to each virus strain in the simulation. By modeling actual HA protein sequences, changes that directly affect the shape of the virus can be examined, and from this information a better value representing risk of infection can be calculated.

[Figure 3.3 plot: x-axis gives residue position (1 to 585); y-axis gives the number of polymorphic variants (0 to 6). Legend: epitope Sites A, B, C, D, and E; Receptors. Annotation: residues with just 1 polymorphic variant indicate highly conserved regions.]

Figure 3.3: Expected number of substitutions for particular residues in an average HA Sequence

A complete HA sequence is 400-600 amino acids in length and comprises 5 epitope regions. Figure 3.3 shows the layout of the epitope regions of HA sequences. Polymorphic variants are the expected number of amino acid substitutions over the course of a branch in a phylogenetic tree for a particular amino acid residue. A residue is another term for a specific index in an aligned amino acid protein sequence. The grey areas show regions outside of epitope regions, which are not directly responsible for the shape of HA proteins. There are a variety of reasons to model virus strains as actual HA protein sequences:

1. Substitution rates for individual amino acids can be controlled, and reflect the rates of change experienced in nature
2. Phylogenetic trees will still be reproduced, but using actual sequences that can be compared to actual virus strains in a targeted region
3. It allows comparisons between competing viral strains using P-Epitope, which means risk of infection will have a higher correlation to vaccine efficacy

Because the current models used for phylodynamic simulations assume a uniform distribution of mutations and perform an overall pair-wise analysis of each sequence, these methods are limited to estimations of antigenic distance.

3.3 P-Epitope

P-Epitope was proposed by Gupta et al., who showed that examining the changes that occur in specific epitope regions measures the antigenic distance between two competing viral strains more accurately with respect to vaccine efficacy [3]. P-Epitope was shown to be more closely related to vaccine efficacy than traditional measures of antigenic distance such as P-Sequence, which is one of the current measures used by the WHO [1]. Algorithm 7 shows how P-Epitope can be implemented computationally.

Algorithm 7 pEpitope(virus v1, virus v2)
    pEpitope = 0
    for epitope ∈ epitopeRegions do
        localDifference = 0
        for residue ∈ epitope do
            if v1[residue] ≠ v2[residue] then
                localDifference = localDifference + 1
            end if
        end for
        difference = localDifference / epitope.size
        if difference > pEpitope then
            pEpitope = difference
        end if
    end for
    return pEpitope ∗ pEpConv
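As a minimal sketch, Algorithm 7 can be written in Python as follows. The function name, the dictionary layout of the epitope regions, and the toy sequences are assumptions made for illustration; they are not the PhySim implementation or real H5N1 epitope site lists.

```python
def p_epitope(v1, v2, epitope_regions, p_ep_conv=1.0):
    """Return the scaled P-Epitope distance between two aligned sequences.

    epitope_regions maps an epitope name to the 0-based residue indices that
    belong to it (hypothetical layout chosen for this sketch).
    """
    worst = 0.0
    for residues in epitope_regions.values():
        if not residues:
            continue  # skip empty epitopes to avoid dividing by zero
        mismatches = sum(1 for i in residues if v1[i] != v2[i])
        # Keep the largest per-epitope fraction of changed residues.
        worst = max(worst, mismatches / len(residues))
    return worst * p_ep_conv


# Toy example: two 10-residue sequences and two made-up epitope regions.
regions = {"A": [0, 1, 2, 3], "B": [5, 6, 7, 8, 9]}
seq1 = "MEKIVLLLAI"
seq2 = "MEKTVLLFAV"
distance = p_epitope(seq1, seq2, regions)  # epitope B dominates: 2 of 5 differ
```

In this toy case epitope A differs at 1 of 4 residues and epitope B at 2 of 5, so the unscaled result is the larger fraction, 0.4.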

The variable pEpConv allows P-Epitope to be used in the Gaussian equation described previously to determine the risk of infection. It is a scalar value set in the parameter file for a simulation; scaling P-Epitope in this way allows its association with risk of infection to be exponential.

antigenicDistance = 1 − e^{−pEpitope / shape_{crossImmunity}}

There are five epitope regions in H5N1 HA protein sequences, A, B, C, D, and E. Epitopes A and B have been identified as the most dominant epitopes and are responsible for the majority of antigenic drift in AIVs [16]. Peng’s group also further expanded each epitope region using machine learning practices to survey which residues belong to each epitope. This survey allows the calculation of P-Epitope to be more precise and capture changes in all residues that belong to the varying epitopes.

3.4 Modeling Mutations on Influenza Strains

General time reversible (GTR) models of amino acid substitution are the de facto standard for predicting future phylogenies based on the biological makeup of protein sequences [6]. GTR models predict the chance of a particular amino acid being replaced by another amino acid at a particular residue over a given period of time. The derivation of these models can vary; of particular interest is the FLUModel, which has been shown to be very effective at producing accurate phylograms for H5N1 protein sequences using a Bayesian model [5]. Dang et al.'s work involved generating a rate of change matrix and a steady state vector of amino acids, which were found using a maximum-likelihood approach on a large sample of H5N1 HA protein sequences. Using Equations 3.1 and 3.2, one can derive the matrix Q, with entries q_{xy}, from the steady state vector π and the instantaneous rate of change matrix r.

q_{xy} = \pi_y r_{xy}, \quad x \neq y (3.1)

q_{xx} = -\sum_{y \neq x} q_{xy} (3.2)

P(t) = e^{tQ} (3.3)

The probability matrix P(t) is found through matrix exponentiation of the rate matrix Q with respect to the time parameter t. P(t) is a function of time, with respect to branch length. A value of t=1.0 is representative of the probability of an amino acid being substituted by another amino acid over the course of an entire branch in a phylogenetic tree for a particular residue. Probability matrices are known to be very accurate for small values of t, representing changes occurring over a small period of time [6]. Figure 3.4 shows an example of transitioning from a rate matrix to a substitution matrix and what the resulting probabilities would look like for a large value of t.

(a) Rate Matrix (b) Substitution Matrix

Figure 3.4: Using matrix exponentiation, figure (a) is transformed into figure (b). This example uses a large value for t to show the differences in substitution rates

From the probability matrix, an expected value for the number of amino acid substitutions that would occur over time t can be calculated.

A = 484 \cdot \pi (3.4)

E(A) = \sum_{x} A_x \, (1 - P(t)_{xx}) (3.5)

Equation 3.4 gives the average number of each amino acid in a typical HA sequence of length 484, which is the length of the root sequence used to seed HASEQ simulation runs. The average count A_x of each amino acid x is then multiplied by the probability that any substitution occurs for that amino acid, 1 − P(t)_{xx}. Summing these products in Equation 3.5 produces the expected number of substitutions for a particular amount of time.
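A small worked example of Equations 3.4 and 3.5, assuming a toy 3-state alphabet. The frequencies and the diagonal entries of P(t) below are hypothetical values chosen for easy arithmetic; only the sequence length of 484 comes from the text.

```python
def expected_substitutions(pi, p_diag, seq_len=484):
    """Expected number of substitutions over time t for one sequence."""
    counts = [freq * seq_len for freq in pi]    # average count of each state (Equation 3.4)
    # Weight each count by the probability that the state is substituted (Equation 3.5).
    return sum(a * (1.0 - p_xx) for a, p_xx in zip(counts, p_diag))


pi = [0.5, 0.3, 0.2]            # hypothetical steady-state frequencies
p_diag = [0.99, 0.98, 0.97]     # hypothetical diagonal of P(t)
expected = expected_substitutions(pi, p_diag)
```

Here the counts are 242, 145.2, and 96.8, which contribute 2.42, 2.904, and 2.904 expected substitutions, for a total of 8.228.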

Chapter 4 Validation and Analysis

The following regime is used to validate the HASEQ model:

1. A simulation run is considered successful if the cluster values are equivalent between the reference phylogram and the simulation phylogram
2. The infective populations for successful runs during peak antigenic diversity (specifically the last 3 simulation years for Turkey and Nigeria, and the last 4 for Vietnam) are within a 95% Confidence Interval on the geometric simulation infective populations
3. Infected populations fluctuate according to the sinusoidal modulation curve
4. HA protein sequence strains produce BLAST identity scores of at least 85% when compared with real HA protein sequences for the simulated years

These criteria are addressed via experimentation on case models for Turkey, Nigeria, and Vietnam, which were identified as high-risk countries for avian influenza epidemics [11]. The original calibration settings for PhySim were derived from nucleotide substitution rates for the surveyed countries [17] and are the starting points for the HASEQ calibration. The validation toolkit uses the ETE library for Python as well as BLAST. The ETE library allows phylogenetic trees to be compared programmatically [18]. BLAST allows comparisons between the sequences produced from the HASEQ model and those in the WHO flu database [19][20]. The validation program is as follows:

1. Identify a pEpConv scalar to ensure an exponential relationship between P-Epitope values and risk of infection
2. Illustrate the area of overlap that P-Epitope exploits
3. Calibrate PhySim to find successful simulation parameter settings to produce correct phylogenetic trees for Turkey, Nigeria, and Vietnam
4. Compare infective populations during peak antigenic diversity for Turkey, Nigeria, and Vietnam
5. Identify a Rand Index value for sequences produced from simulation runs for Turkey, Nigeria, and Vietnam

4.1 P-Epitope Validation

The geometric simulation model uses an abstract nucleotide distance to gauge antigenic similarity; this is done on an exponential scale, as displayed in Figure 4.1. Vaccine efficacy has been shown to be almost directly correlated with P-Epitope values [3], but the relationship with the risk factor will still be exponential [21]. From a preliminary General Sensitivity Analysis and calibration run, a pEpConv value of approximately 12 provides the most consistent success rates between the two scales of antigenic distance. With this value, the infected populations during peak antigenic diversity in the simulation are similar, the cluster output for the simulation matches the number of reference clades for each country, and the phylogenetic trees have similar structure. This value will be explored later in Section 4.6.

Figure 4.1: A comparison of the risk function with typical antigenic distances for both the geometric (orange) and HASEQ (blue) scales

Figure 4.1 shows the risk factor for infection as a function of P-Epitope in comparison to the risk function for the geometric model. The relationship is exponential. In a preliminary survey of the H5N1 HA database for the targeted simulation years, P-Epitope clusters around a mean of 0.06 for intra-cluster distance and a mean of 0.13 for inter-cluster distance. Figure 4.2 shows a distribution function for these values and identifies an ideal cutoff value for determining antigenic similarity. There is overlap between the distributions, but this is to be expected, as pinpointed by Gupta et al. They identified key case studies in which P-Epitope indicated high vaccine efficacy for strains that the WHO deemed antigenically distant. The reverse relationship also holds: in other cases the WHO deemed strains antigenically similar, yet they had low P-Epitope values, and using the strain as a vaccine candidate led to an unsuccessful vaccination regime [3].

Figure 4.2: A comparison of P-Epitope value distributions for inter-clade distances (orange) and intra-clade distances (blue). The sample was drawn from 100 different sequences from 10 different clades in the H5N1 2012 nomenclature

By using P-Epitope as a measure of antigenic distance, similarities between virus strains can be assessed more accurately. This allows for a more informed decision when comparing potential infections to the immune history of a host, as the biological processes used in vaccinations are the same as those that dictate how a potential infection invades a host.

4.2 PhySim Calibration and Comparative Analysis

Due to the discrete nature of the HASEQ amino acid model, one adjustment is made to ensure PhySim has consistent propagation (i.e., the simulation propagates past day 360 with a non-zero infective population). Simulations are seeded with the root sequence as well as a small sample of slightly mutated variants. These variants represent viruses within the same cluster as the root sequence, mutated for 100 simulation days at the same mutation rate as the simulation. This is done for two reasons:

1. The viral landscape at the start of the simulation will more accurately model that of each target country
2. Because mutations are made in discrete steps, unlucky random number generation can cause the simulation to fail early; guaranteeing a few small mutations enables more consistent propagation
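The seeding adjustment described above can be sketched as follows. This is a simplified illustration, not the PhySim code: the per-residue, per-day mutation probability, the short toy root sequence, and the uniform choice of replacement amino acid are all assumptions. In HASEQ the replacement would instead be drawn from the substitution model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_once(seq, rate, rng):
    """One simulated day: each residue changes with probability `rate`."""
    out = []
    for aa in seq:
        if rng.random() < rate:
            # Uniform replacement is a placeholder for a substitution-matrix draw.
            aa = rng.choice([c for c in AMINO_ACIDS if c != aa])
        out.append(aa)
    return "".join(out)


def seed_variants(root, n_variants, days, rate, rng):
    """Create slightly mutated copies of the root sequence for the initial infected hosts."""
    variants = []
    for _ in range(n_variants):
        seq = root
        for _ in range(days):
            seq = mutate_once(seq, rate, rng)
        variants.append(seq)
    return variants


rng = random.Random(42)                       # fixed seed for reproducibility
root = "MEKIVLLLAIVSLVKSDQICIGYHANNSTEQ"      # short toy sequence, not a full HA root
variants = seed_variants(root, n_variants=100, days=100, rate=0.0005, rng=rng)
```

Because every variant starts from the same root and accumulates only a few substitutions, the initial population stays within one cluster while still carrying some diversity.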

The initial infected population for the target countries is set to 100 infected hosts. At the beginning of each simulation, 100 slight variants of the HASEQ nomenclature root sequence are spawned and randomly assigned to the 100 initial infected. The variants are generated for each country separately based on the mutation rate that produces the most consistent successes for the region. This adjustment is validated by the results in the case studies by comparing the antigenic diversity at the end of the simulation to the root sequence the slight variants are derived from. So long as antigenic diversity, or the number of clusters/clades of sequences, remains the same at the end of the simulation, the model is valid. The adjustment is further validated by the low influence of the number of initial infective individuals in the simulations, as shown in Section 4.3. Because the initial infective parameter value's impact is minor compared to other parameter values, it can be safely assumed that varying the number of initial root sequences has little impact on the success rate of the model.

Two parameter settings are adjusted, and the results of the simulations for each combination are recorded for each country: the contact rate between individuals in the model and the average/expected mutation rate. Turkey's geometric calibration produces the clearest trends of the three countries, followed by Nigeria. Success rates for Vietnam are more varied and require more fine-tuning for the geometric model; this relationship holds for the HASEQ model calibration. Contact and mutation rate parameters are chosen based on previous General Sensitivity Analysis (GSA) results, which list them as highly influential variables for the geometric model. GSA is a proven method of identifying pivotal parameters for a simulation model, and a short section on GSA with example results from Turkey using the HASEQ model is included in Section 4.3 [22].

4.2.1 Turkey

Figure 4.3 shows a side-by-side comparison of three phylograms for Turkey. Subfigures 4.3a and 4.3b are outputs from PhySim runs that produce successful results for Turkey; subfigure 4.3c is the reference phylogram for the region.

(a) HASEQ (b) Geometric (c) Reference

Figure 4.3: Phylograms produced from HASEQ and Geometric simulations compared to the reference phylogram for Turkey

The calibration settings used to produce the HASEQ phylogram can be found in Figure 4.4. Successful simulation runs for Turkey result in 2 clusters, or clades, being produced. A total of 25 different parameter combinations are shown. The most successful combinations are indicated by the yellow squares in Figure 4.4. For each combination 10 different simulations were run; each simulation takes approximately 2910 seconds to complete.

Figure 4.4: Calibration exploration for Turkey. Results from 10 runs, brightly colored nodes represent successful runs.

The yellow highlighted regions in Figure 4.4 illustrate the calibration region that produces the most consistently successful runs. This results in a contact rate of 2.4 and an average amino acid substitution rate of 0.09-0.10. As expected from the geometric calibration there is a trend of increasing rate of success for contact rates between 2.0-2.4 and a mutation rate between 0.08-0.09. A mutation rate of 0.0096 is the value that produces the most consistently successful runs for the geometric model, paired with a contact rate of 2.1.

Figure 4.5: Graph of average HASEQ and Geometric simulation infective populations over the course of 10 runs with a 95% Confidence Interval on the Geometric population for Turkey

Figure 4.5 shows a comparison of infective population over the course of the entire simulation, which runs for approximately 6500 days. The sinusoidal modulation factor in the risk of infection is visible as the spike of infections that occurs once per year, which accounts for the higher chance of infection during colder months. This relationship holds for both the geometric and HASEQ models. The higher infective population during the first 3000 days of the simulation is attributed to the increase in initial antigenic diversity at the start of the simulation from the seeding of the infective hosts with variant root sequences. Figure 4.6 shows a closer view of the last 5 simulation years, with an average infective population within the 95% Confidence Interval of the geometric infective population. These are the years from which the phylograms in Figure 4.3 are produced and sequences are sampled.

Figure 4.6: Infective population for the last 5 simulation years with a 95% Confidence Interval on the Geometric population for Turkey

4.2.2 Nigeria

Phylograms for the HASEQ and Geometric simulations are presented in Figure 4.7. Calibration settings for PhySim are shown in a heatmap in Figure 4.8, with the most successful combinations highlighted in yellow.

(a) HASEQ (b) Geometric (c) Reference

Figure 4.7: Phylograms produced from HASEQ and Geometric simulations compared to the reference phylogram for Nigeria

Similar to the geometric calibration, there are fewer successful combinations for Nigeria than for Turkey, but a general trend is still present in the heatmap. There is a slight discrepancy between the contact rates that produced the most successful runs in the geometric model and the combination that produced the most successful runs in the HASEQ model. A slightly higher contact rate of 3.0 for the HASEQ model performed better than the expected 2.4 contact rate for the Geometric model. This is coupled with a higher mutation rate of 0.24-0.25 compared to the expected 0.14 of the geometric model.

Figure 4.8: Calibration exploration for Nigeria. Results from 10 runs, brightly colored nodes represent successful runs.

The infective population over the course of the entire simulation for the HASEQ simulations follows the same trends as the geometric simulation in Figure 4.9. The initial spike in infective population is much closer to the initial spike in the geometric model in comparison to Turkey’s initial spike in Figure 4.5.

Figure 4.9: Graph of average HASEQ and Geometric simulation infective populations over the course of 5 runs with a 95% Confidence Interval on the Geometric population for Nigeria

The phylograms for Nigeria are constructed using the sequences produced over the last 3 simulation years. Figure 4.10 shows a magnified view of these years, where the average infective population lies within the 95% Confidence Interval for the geometric simulation runs.

Figure 4.10: Infective population for the last 5 simulation years with a 95% Confidence Interval on the Geometric population for Nigeria

4.2.3 Vietnam

Successful calibration settings are sparser for Vietnam, and there is a much lower probability of success.

(a) HASEQ (b) Geometric (c) Reference

Figure 4.11: Phylograms produced from HASEQ and Geometric simulations compared to the reference phylogram for Vietnam

A mutation rate of 0.24 and a contact rate of 3.0 produce the most consistent results for Vietnam, with a 60% success rate. There are no apparent trends in the calibration settings, whereas there are slight trends for Nigeria and Turkey.

Figure 4.12: Calibration exploration for Vietnam. Results from 10 runs, brightly colored nodes represent successful runs.

The calibration for the HASEQ simulation model is inconsistent over a large range of parameter settings and requires smaller steps in the mutation rate and contact rate to find sufficient results. The inconsistent success of the simulation runs for the Vietnam studies is attributed to the narrow window of success, similar to the geometric model. The discrete nature of mutations in the HASEQ model introduces a layer of variability that also contributes to the inconsistency.

Figure 4.13: Graph of average HASEQ and Geometric simulation infective populations over the course of 10 runs with a 95% Confidence Interval on the Geometric population for Vietnam

The infective population for Vietnam is very consistent between the geometric and HASEQ simulation models. Figure 4.13 shows the trendline over the course of the entire simulation. The infective population for the first 3000 simulation days shows very little difference, with a slight gap in population becoming more apparent after day 4000.

Figure 4.14: Infective population for the last 5 simulation years with a 95% Confidence Interval on the Geometric population for Vietnam

4.3 Generalized Sensitivity Analysis: Turkey

Generalized Sensitivity Analysis (GSA) determines which parameter settings have the most influence on simulation runs using the HASEQ model for PhySim [22]. Median values for each variable parameter setting are provided, and the values vary within a 28% range of the median in steps of 10% of the range. For each setting combination 4 simulation runs are evaluated based on the number of successes.

[Figure 4.15 panels, one per parameter, each plotting success and failure cumulative distributions on a y-axis from 0 to 1. The orange lines show parameter settings explored by GSA. Panel, x-axis range, and d_{m,n} value: Contact (β), 1.75 to 2.5, 0.256; Mutation Rate (ψ), 0.008 to 0.012, 0.152; Initial N (N), 15000 to 25000, 0.153; Initial I (I), 80 to 120, 0.115; Species Skew (Skew), 49.5 to 50.5, 0.150; Recovery Rate (ν), 4 to 6, 0.386; p-epitope Scale (p-Ep), 10 to 15, 0.105.]

Figure 4.15: Generalized Sensitivity Analysis (GSA) results for Turkey. The x-axis values in each sub-chart show the range of values for each parameter explored. The y-axis shows d_{m,n} for the values explored.

Figure 4.15 shows the results of GSA for Turkey. The values in red in each subfigure show the d_{m,n} value for each parameter. This value ranges from 0 to 1, inclusive, and is equivalent to the maximum separation between cumulative probability distributions observed in a two-sample Kolmogorov-Smirnov test, here the difference between the success and failure cumulative probabilities. It reflects changes in central tendency and differences in the distribution functions of each parameter. Figure 4.16 summarizes the results of Figure 4.15; higher d_{m,n} values represent more influential parameters.
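The d_{m,n} statistic can be illustrated with a short Python sketch of the two-sample Kolmogorov-Smirnov distance. The function name and the sample contact-rate values are hypothetical and are not taken from the GSA runs; the sketch only shows how the maximum gap between the success and failure cumulative distributions is obtained.

```python
def ks_statistic(successes, failures):
    """Maximum gap between the two empirical CDFs (two-sample KS statistic)."""
    m, n = len(successes), len(failures)
    points = sorted(set(successes) | set(failures))
    d = 0.0
    for x in points:
        cdf_success = sum(1 for v in successes if v <= x) / m
        cdf_failure = sum(1 for v in failures if v <= x) / n
        d = max(d, abs(cdf_success - cdf_failure))
    return d


# Hypothetical contact-rate values from successful and failed runs.
success_rates = [2.0, 2.1, 2.25, 2.4]
failure_rates = [1.75, 1.8, 2.0, 2.5]
d_mn = ks_statistic(success_rates, failure_rates)
```

For these toy samples the largest gap is 0.5. A parameter whose successful and failed runs are drawn from the same values gives 0, and fully separated samples give 1.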

[Figure 4.16 bar chart: y-axis is labeled F-measure (and 95% CI), ranging from 0 to 0.6; one bar per parameter.]

Parameter   d_{m,n}   95% CI
β           0.256     ± 0.13
Mut.Rt.     0.152     ± 0.12
N           0.153     ± 0.12
Init.I      0.115     ± 0.10
Skew        0.150     ± 0.12
ν           0.386     ± 0.12
p-Ep        0.105     ± 0.10

Figure 4.16: Summary of d_{m,n} values for Figure 4.15

Infectious period ν and contact rate β are the most influential parameters, with d_{m,n} values of 0.386 and 0.256 respectively. This is consistent with the geometric model. These are the primary variables that can be targeted: contact rate via isolation, and infectious period via vaccination efforts.

Low influence of parameters such as total population N, Species Skew, and Initial Infective is important because it forms the basis for extrapolating results to actual vaccination and containment efforts. A low d_{m,n} value for N, for example, means a smaller number of birds can be modeled for each country and the results extrapolated to a larger population. Low influence of Initial Infective allows the assumption that, regardless of how many infective hosts the simulation starts with, the results will still be similar. This validates seeding the simulation run with slightly mutated variants of the WHO nomenclature root sequence: reducing or increasing the number of mutated variants does not significantly impact the results. As previously mentioned, pEpConv is the least influential parameter and does not have a significant impact on the success rate of the simulation.

Figure 4.17 shows a chart of the correlation between each range of parameter values explored during GSA. Some important relationships support the validity of the HASEQ model, such as an inverse relationship between contact rate and recovery rate. As observed in Figure 4.17, as contact rate decreases and recovery rate increases, the antigenic diversity will stay the same.

[Figure 4.17 scatter-plot matrix: pairwise scatter plots, frequency histograms, and correlation coefficients with p-values for the parameters Contact, Step, Initial.N, Initial.I, Skew, Recovery, and p.Ep.]

Figure 4.17: Correlation between parameter variables for PhySim runs using the HASEQ model

4.4 BLAST Analysis

Sequences produced from HASEQ simulation runs of PhySim are compared to virus strains in the Influenza Research Database (IRD). IRD contains a database of H5N1 HA protein sequences for each clade in the H5N1 nomenclature. Analysis of the strains produced from simulations for Turkey, Nigeria, and Vietnam is conducted using the following procedure:

1. PhySim parameter settings are calibrated to produce successful simulation runs
2. A phylogenetic tree is produced with sequence names and amino acid sequences annotated on leaf nodes
3. BLAST analysis is run for each sequence produced from the simulation against the set of sequences used to generate the reference phylogram
4. For each simulated sequence, a simulation clade is generated from a clustering algorithm, and an expected clade is assigned by taking the clade of the best-matching sequence in the BLAST analysis. Identity and positives scores for the most-similar sequence are also recorded.
5. A rand index is then evaluated on the calculated clades versus the expected clades for each sequence generated from the simulation
6. Results from 10 successful simulations are averaged together

This procedure produces a rand index between 0 and 1, where a value of 0 represents low correlation, or random assignments, between the calculated and expected clades and a value of 1 represents high correlation. The results of this procedure for a sample of successful runs are provided in Table 4.1. Phylograms for Vietnam are the most consistent in assigning clades to similar sequences, followed by Turkey and then Nigeria. None of the target countries clustered sequences with strong correlation to the reference phylogram clades. This highlights a few of the limitations previously discussed, such as those with respect to the substitution models being used by the HASEQ model.

Identity scores range from 85.280% to 86.542%; an identity score is the pair-wise percentage of matching residues between two protein sequences. The positives score gives a more accurate representation of the similarities between the sequences produced using PhySim and the sequences used to generate the reference phylograms for each country. Because FLUModel was derived with respect to chemical properties of amino acids in HA protein sequences, discrepancies in residue matches can be attributed to mutations between amino acids that are biologically and chemically similar.

A rand index value measures the similarity between two different ways of clustering the same data. Let S be the set of simulation sequences. The first clustering, X, consists of the clades assigned when generating the phylogram for each country; the second clustering, Y, consists of the clades assigned by finding the reference clade of the most-similar BLAST sequence result. The rand index is calculated as follows:

a = # of pairs of elements in S that are in the same clade in X and the same clade in Y
b = # of pairs of elements in S that are in different clades in X and different clades in Y
c = # of pairs of elements in S that are in the same clade in X and different clades in Y
d = # of pairs of elements in S that are in different clades in X and the same clade in Y

RandIndex = \frac{a + b}{a + b + c + d} (4.1)
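Equation 4.1 can be checked with a short Python sketch. The function name and the toy clade labels are illustrative assumptions; position i in each list is the clade assigned to sequence i under clustering X or Y.

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Fraction of sequence pairs on which two clusterings agree (Equation 4.1)."""
    agree = 0   # a + b: pairs grouped together in both, or apart in both
    total = 0   # a + b + c + d: all pairs
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x == same_y:
            agree += 1
        total += 1
    return agree / total


# Toy example with five sequences (requires at least two sequences).
clades_x = [0, 0, 1, 1, 1]   # clades from the simulation phylogram
clades_y = [0, 0, 0, 1, 1]   # clades of the best BLAST matches
score = rand_index(clades_x, clades_y)
```

Of the 10 possible pairs in this example, the two clusterings agree on 6, giving a rand index of 0.6.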

Country   Rand Index   Average Identity   Average Positives
Turkey    0.56978      86.542%            90.058%
Nigeria   0.50408      85.280%            88.581%
Vietnam   0.64922      85.746%            89.352%

Table 4.1: BLAST Results for sequences produced from successful HASEQ PhySim runs for Turkey, Nigeria, and Vietnam. Identity score is the percentage of amino acids that are similar between residues in an alignment, positives is an adjusted identity score which accounts for not only direct matches between amino acids but also for amino acids with similar chemical properties.
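The identity score in Table 4.1 can be illustrated with a simplified Python sketch. It assumes the two sequences are already aligned to the same length with no gaps, which is a simplification of what BLAST reports for its aligned region; the function name and the toy sequences are hypothetical. The positives score is not reproduced here because it depends on a substitution scoring matrix.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions where both sequences have the same amino acid."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# Toy example: 7 of 10 residues match.
identity = percent_identity("MEKIVLLLAI", "MEKTVLLFAV")
```

The toy pair differs at three of ten positions, so the identity is 70%.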

The following tables illustrate a break-down of which clades host sequences that produce the most mismatches using this procedure. Sequences from clades with high mismatch percentages represent clades that mutated away from the sequences used to generate the reference phylograms.

Clade #   # Sim Sequences   # Ref Sequences   Mismatches (%)   Average BLAST %
0         58                60                69.355           86.746
1         129               24                31.183           86.746

Table 4.2: Per-clade Mismatch Analysis for Turkey

Clade #   # Sim Sequences   # Ref Sequences   Mismatches (%)   Average BLAST %
0         28                10                49.397           85.520
1         25                83                47.216           85.828
2         2                 3                 48.649           86.643
3         94                66                50.302           85.034

Table 4.3: Per-clade Mismatch Analysis for Nigeria

Clade #   # Sim Sequences   # Ref Sequences   Mismatches (%)   Average BLAST %
0         39                17                29.884           85.104
1         57                3                 45.451           85.416
2         8                 1                 39.172           86.027
3         48                5                 44.038           85.787
4         22                2                 49.730           87.075
5         87                2                 42.718           86.310
6         99                3                 41.424           86.377
7         12                11                15.192           86.602
8         50                118               9.929            87.378
9         41                13                13.413           86.689
10        53                1                 46.386           86.922
11        56                1                 42.836           85.665
12        8                 25                51.241           87.632
13        23                14                9.709            87.384
14        24                1                 9.601            87.427
15        8                 1                 33.590           86.316
16        44                18                36.138           86.243
17        7                 1                 30.744           84.075
18        44                1                 26.753           83.620
19        28                2                 35.059           84.110
20        1                 1                 12.082           85.764
21        20                1                 30.782           83.646
22        28                28                40.268           84.553
23        49                38                26.214           82.927
24        72                2                 44.934           85.129
25        7                 1                 30.744           86.015

Table 4.4: Per-clade Mismatch Analysis for Vietnam

4.5 Run-time and Memory Analysis

There is a considerable cost in time that comes with manipulating and analyzing protein sequences. A comparison of the time and memory costs of PhySim for the two different models is presented in Table 4.5. The most serious increase in time and memory requirements was found for Vietnam. Configuring the Java run-time environment to allocate 8GB of RAM is sufficient to complete simulations for Turkey and Nigeria, and 10GB of RAM is sufficient for Vietnam. The geometric model requires 4GB of RAM for Turkey, Nigeria, and Vietnam.

Country   Geo Mem   HASEQ Mem   Geo Avg Time (secs)   HASEQ Avg Time (secs)
Turkey    4GB       8GB         712 ± 144             2910 ± 162
Nigeria   4GB       8GB         1033 ± 111            7142 ± 226
Vietnam   4GB       10GB        1567 ± 41             10242 ± 163

Table 4.5: Simulation timing and memory requirements for Turkey, Nigeria, and Vietnam. Averages and standard deviations are calculated from 10 successful runs.
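The relative run-time cost of the HASEQ model can be read directly from the averages in Table 4.5. The short sketch below computes the ratio of HASEQ to geometric average run time per country. The dictionary and function names are chosen for illustration, and the standard deviations are ignored.

```python
# Average run times in seconds, copied from Table 4.5: (geometric, HASEQ).
AVG_TIMES = {
    "Turkey": (712, 2910),
    "Nigeria": (1033, 7142),
    "Vietnam": (1567, 10242),
}


def slowdown(geo_secs, haseq_secs):
    """Ratio of HASEQ average run time to geometric average run time."""
    return haseq_secs / geo_secs


for country, (geo, haseq) in AVG_TIMES.items():
    print(f"{country}: {slowdown(geo, haseq):.1f}x")
```

The ratios come out at roughly 4.1x for Turkey, 6.9x for Nigeria, and 6.5x for Vietnam.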

4.6 Discussion

The ability to simulate the evolution of avian influenza viruses and track changes in their amino acid structure allows direct comparisons between the sequences produced by the simulation and reference sequences collected from in vivo analysis. Using a calibrated model that produces highly correlated sequences between the simulation and reference phylograms, the effect of epidemic containment regimes can be assessed at a lower level. For example, if the parameters are calibrated to represent a high-risk country where a large-scale vaccination effort has been prescribed, the evolution of strains could be examined to determine whether specific strains or clades should be targeted. Vaccination efforts can then be aimed at the clades and sequences that are mutating rapidly away from the strains already found in the country and that could give rise to future epidemics. This benefit comes at the cost of increased computing requirements: as shown in Table 4.5, the memory requirements at least double, and average run times are roughly 4 to 7 times higher for the HASEQ model. Some of the increased computational demand is offset by distributed computing techniques, such as submitting large batches of simulation runs to computing clusters so that simulations run in parallel and their results are pooled. This technique was used in Section 4.3 to complete thousands of simulation runs in a little less than 24 hours of user time. Efforts to reduce computation time are discussed further in Section 5.3.

Chapter 5 Conclusion

5.1 Limitations

While the drawbacks of the Euclidean model have been addressed, new limitations arise using the HASEQ model:

1. Running simulations with HA protein sequence data requires more RAM, as shown in Section 4.5.
2. While the substitution rates derived from FLUModel are more representative of in vivo mutations, they still do not fully model regions of conservation in HA sequences.
3. Amino acid substitution matrices are unable to model insertions and deletions in protein sequences.
4. With more data to process, the run-time performance of PhySim degrades.
5. There are no concrete relationships that can be used to derive expected amino acid substitution rates from nucleotide substitution rates, so the new model has to be calibrated.

5.2 Summary

Avian Influenza Viruses have the potential to cause devastating socioeconomic losses on a yearly basis, with widespread impact on food supply and livestock populations throughout the world. To contain spreading epidemics, the WHO uses in vivo and in silico analysis of AIVs to understand the viral landscape and organize vaccination efforts. Because AIVs mutate rapidly, there is motivation to support in silico analysis so that containment efforts can be informed in time to prevent the spread of infection, rather than relying on reactionary measures such as culling livestock populations. In silico analysis enhances existing in vivo methods: the biological and ecological information gathered in vivo feeds into in silico methods, where it is used to calibrate forecasting systems and to provide a base truth from which to propagate information for upcoming years. The contributions of the research conducted are summarized as:

1. Drawbacks of the current modeling techniques can be summarized as:
   1.1. Current simulations do not model actual viral strains, which prevents more robust analysis of virus sequences in target simulation regions
   1.2. Mutation rates in the simulation are not representative of actual amino acid changes, and are assumed to be uniform and random
   1.3. Antigenic characteristics are approximated and not as closely related to vaccine efficacy as newer measures
   1.4. Forecasting of epidemics to inform vaccine design is not straightforward because phylograms are constructed without using actual viral sequences
2. The new model introduced as HASEQ includes:
   2.1. Simulated viruses modeled using actual HA sequence(s) instead of the abstract Euclidean model, starting with the root HA sequence (A/turkey/England/5092/1991) corresponding to the root of the WHO H5N1 nomenclature [1].
   2.2. Realistic mutations simulated using observed mutation rates in nature as reported by Dang et al. [5]. However, the mutation rates are further calibrated to characterize phylogenetic diversity in a given region.
   2.3. Antigenic diversity measured using an amino-acid level comparison algorithm, called P-Epitope, proposed by Gupta et al. [3].
3. Other deliverables:
   3.1. An enhanced version of PhySim with the proposed antigenic model implemented
   3.2. Parameter settings to reconstruct phylogenetic trees similar to those already produced using PhySim
   3.3. A tool to generate an amino acid substitution model from a nucleotide substitution rate
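The P-Epitope measure named in item 2.3 can be sketched as follows. As defined by Gupta et al. [3], it is the fraction of amino acids that differ within the dominant epitope, where the dominant epitope is the one with the largest such fraction. The sketch below is a minimal illustration under stated assumptions, not the PhySim implementation: the epitope names, the site positions, and the sequence fragments are hypothetical placeholders, and real epitope site definitions for H5 HA would have to be supplied.

```python
def p_epitope(seq_a, seq_b, epitope_sites):
    """Return (dominant epitope name, fraction of its sites that differ).

    epitope_sites maps an epitope name to a list of 0-based positions.
    """
    fractions = {}
    for name, sites in epitope_sites.items():
        differing = sum(1 for i in sites if seq_a[i] != seq_b[i])
        fractions[name] = differing / len(sites)
    # The dominant epitope is the one with the largest differing fraction.
    dominant = max(fractions, key=fractions.get)
    return dominant, fractions[dominant]


# Hypothetical epitope site positions, for illustration only.
TOY_EPITOPES = {"A": [1, 2, 3], "B": [6, 7, 8, 9], "C": [10, 11]}

# Hypothetical 12-residue fragments, not real HA sequences.
name, distance = p_epitope("MEKIVLLLAIVS", "MERIVLMLAIVN", TOY_EPITOPES)
print(name, distance)
```

Here epitope A differs at 1 of 3 sites, B at 1 of 4, and C at 1 of 2, so C is dominant and the antigenic distance between the two fragments is 0.5.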

Calibration settings such as those explored in Chapter 4 provide parameter values that successfully produce phylogenetic trees representative of the reference phylogenetic trees extracted from the WHO 2012 Nomenclature [1]. By exploring parameter values for 3 high-risk countries (Turkey, Nigeria, and Vietnam), phylogenetic trees similar to those produced by the geometric simulation model, as well as to the reference phylograms for each region, can be produced. Alongside these trees, actual HA protein sequences extracted from the simulation case studies closely represent those collected through in vivo surveillance. These results were made possible by employing a measure of antigenic distance that has a higher correlation to vaccine efficacy than traditional measures. Alongside this new measure of antigenic distance, amino acid substitution matrices derived from HA protein sequence data were used to model the changes in HA protein structure over time more accurately. At the cost of greater computational complexity, the ability to explore the viral landscape at the level of individual HA sequences was gained. Using the HASEQ simulation model, vaccine design and containment efforts can be informed more effectively, and the evolutionary trends of avian influenza viruses can be examined.

5.3 Future Work

One major limitation of the HASEQ model is the computational complexity that processing biological data entails. Depending on the region, HASEQ run times ranged from roughly 4 to 7 times those of the geometric model, and future work could focus on employing multi-threaded or distributed computing approaches to cut down the processing time required for each simulation. Other areas of interest are GPU computing and possibly heterogeneous computing (GPU and parallel combined). While the model's birth and death cycle is calibrated using a delta time step value of 0.1, different approaches could be explored to limit the number of mutations that have to occur each day. Because of the nature of the amino acid substitution models, mutations can be scaled to occur on any calibrated time schedule. The mutations and the comparisons to immune history are the most computationally demanding aspects of the simulation, so targeting these areas to reduce computation time should be a strong focal point of future work.

Another area of interest is regions of conservation and the exploration of other amino acid substitution models. FLUModel was chosen due to the availability of its data as well as its performance in comparison to other available substitution models. The performance of PhySim could be tested using a variety of matrices to compare the different training methods involved in deriving each model.

Due to the invariability of the Vietnam calibration settings, more work could be done in exploring different parameter combinations as well as expanding this work into other regions or combinations of at-risk regions. Waterfowl have complex migration patterns, crossing into multiple regions. Targeting larger regions and modeling all the possible interactions available for those regions is another future possibility.
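The remark above that substitution-model mutations can be scaled to any time schedule follows from a standard property of continuous-time Markov models, which is the usual form of amino acid substitution models. If a site leaves its current state at rate q, the probability of at least one substitution event within an interval of length t is 1 - exp(-q t), so many small steps and one large step give the same probability. The sketch below illustrates this under stated assumptions: the rate value is a hypothetical placeholder, the function names are illustrative, and nothing here is taken from the PhySim code.

```python
import math


def prob_substitution(rate_per_day, dt_days):
    """Probability of at least one substitution event at a site within dt_days."""
    return 1.0 - math.exp(-rate_per_day * dt_days)


def prob_over_steps(rate_per_day, dt_days, n_steps):
    """Same probability, accumulated over n_steps consecutive steps of dt_days."""
    stay_one_step = 1.0 - prob_substitution(rate_per_day, dt_days)
    return 1.0 - stay_one_step ** n_steps


RATE = 0.002  # hypothetical per-site exit rate per day, for illustration only

fine = prob_over_steps(RATE, 0.1, 10)   # ten steps of 0.1 day
coarse = prob_substitution(RATE, 1.0)   # one step of a full day
print(math.isclose(fine, coarse))
```

Under this assumption, applying mutations once per day instead of at every 0.1 time step leaves the per-day substitution probability unchanged while reducing the number of mutation passes.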

References

[1] World Health Organization (WHO). Continued evolution of highly pathogenic avian influenza A (H5N1): updated nomenclature. Influenza and Other Respiratory Viruses, 6(1):1–5, 2012.

[2] Bryan T. Grenfell, Oliver G. Pybus, Julia R. Gog, James L. N. Wood, Janet M. Daly, Jenny A. Mumford, and Edward C. Holmes. Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 303(5656):327–332, 2004.

[3] Vishal Gupta, David J. Earl, and Michael W. Deem. Quantifying influenza vaccine efficacy and antigenic distance. Vaccine, 24(18):3881–3888, 2006. 3rd International Conference on Vaccines for Enteric Diseases.

[4] Trevor Bedford, Andrew Rambaut, and Mercedes Pascual. Canalization of the evolutionary trajectory of the human influenza virus. BMC Biology, 10(1):1–12, 2012.

[5] Cuong Cao Dang, Quang Si Le, Olivier Gascuel, and Vinh Sy Le. Flu, an amino acid substitution model for influenza proteins. BMC Evolutionary Biology, 10(1):99, Apr 2010.

[6] Olivier Gascuel and Si Quang Le. An Improved General Amino Acid Replacement Matrix. Molecular Biology and Evolution, 25(7):1307–1320, 03 2008.

[7] Erik M. Volz, Katia Koelle, and Trevor Bedford. Viral phylodynamics. PLoS Computational Biology, 9(3):e1002947, 2013.

[8] Hendra Wibawa, John Bingham, Harimurti Nuradji, Sue Lowther, Jean Payne, Jenni Harper, Akhmad Junaidi, Deborah Middleton, and Joanne Meers. Experimentally infected domestic ducks show efficient transmission of Indonesian H5N1 highly pathogenic avian influenza virus, but lack persistent viral shedding. PLoS ONE, 9:e383417, 01 2014.

[9] Benjamin Roche, John M. Drake, Justin Brown, David E. Stallknecht, Trevor Bedford, and Pejman Rohani. Adaptive evolution and environmental durability jointly structure phylodynamic patterns in avian influenza viruses. PLoS Biology, 12(8):e1001931, 08 2014.

[10] World Health Organization (WHO). Candidate vaccine viruses and potency testing reagents for Influenza A (H5N1), 2014.

[11] Neil Giridharan and Dhananjai M. Rao. Eliciting characteristics of H5N1 in high-risk regions using phylogeography and phylodynamic simulations. Computing in Science & Engineering, 18(4):11–24, July 2016.

[12] GROMS. Global Register of Migratory Species (GROMS): Summarising Knowledge about Migratory Species for Conservation, Jul 2013.

[13] Gianna Sheffield and Dhananjai M. Rao. Modeling and simulation of phylodynamics of avian influenza. In Proceedings of the 30th European Modelling and Simulation Conference (ESM 16), University of Las Palmas, Siani, Spain, October 2016. EuroSis.

[14] Dhananjai M. Rao. Enhancing epidemiological analysis of intercontinental dispersion of H5N1 viral strains by migratory waterfowl using phylogeography. BMC Proceedings, 8(Suppl 6):S1, 2014.

[15] Benjamin Roche, John M. Drake, and Pejman Rohani. An agent-based model to study the epidemiological and evolutionary dynamics of influenza viruses. BMC Bioinformatics, 12(1):1–10, 2011.

[16] Yousong Peng, Yuanqiang Zou, Honglei Li, Kenli Li, and Taijiao Jiang. Inferring the antigenic epitopes for highly pathogenic avian influenza H5N1 viruses. Vaccine, 32(6):671–676, 2014.

[17] Giovanni Cattoli, Alice Fusaro, Isabella Monne, Fethiye Coven, Tony Joannis, Hatem S. Abd El-Hamid, Aly Ahmed Hussein, Claire Cornelius, Nadim Mukhles Amarin, Marzia Mancin, Edward C. Holmes, and Ilaria Capua. Evidence for differing evolutionary dynamics of A/H5N1 viruses among countries applying or not applying avian influenza vaccination in poultry. Vaccine, 29:9368–9375, November 2011.

[18] Jaime Huerta-Cepas, Joaquin Dopazo, and Toni Gabaldón. ETE: a Python Environment for Tree Exploration. BMC Bioinformatics, 11(24), 2010.

[19] Altschul, Gish, Miller, Myers, and Lipman. Basic local alignment search tool. 1990.

[20] Zhang, Aever, Anderson, Burke, Dauphin, Gu, He, Kumar, Larsen, Lee, Li, Macken, Mahaffey, Pickett, Reardon, Smith, Stewart, Suloway, Sun, Tong, Vincent, Walters, Zaremba, Zhao, Zhou, Zmasek, Klem, and Scheuermann. Influenza research database: An integrated bioinformatics resource for influenza virus research. 2017.

[21] Julia R. Gog and Bryan T. Grenfell. Dynamics and selection of many-strain pathogens. Proceedings of the National Academy of Sciences, 99(26):17209–17214, 2002.

[22] Basak Guven and Alan Howard. Identifying the critical parameters of a cyanobacterial growth and movement model by using generalised sensitivity analysis. Ecological Modelling, 207(1):11–21, 2007.
