Molecular Evolution and Population Genetics of Infectious Diseases

By Eduardo Felipe Castro Nallar

B.S. in Biochemistry, December 2007, Universidad de Santiago de Chile

A Dissertation submitted to

The Faculty of

The Columbian College of Arts and Sciences of The George Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

January 31, 2015

Dissertation directed by

Keith Alan Crandall

Professor of Biological Sciences

The Columbian College of Arts and Sciences of The George Washington University certifies that Eduardo Felipe Castro Nallar has passed the Final Examination for the degree of Doctor of Philosophy as of September 16, 2014. This is the final and approved form of the dissertation.

Molecular Evolution and Population Genetics of Infectious Diseases

Eduardo Felipe Castro Nallar

Dissertation Research Committee:

Keith A. Crandall, Professor of Biological Sciences, Dissertation Director

Amy E. Zanne, Associate Professor of Biological Sciences, Committee Member

Guillermo Ortí, Louis Weintraub Professor of Biological Sciences, Committee Member

Dedication

To Romina, the love of my life

To my parents, Pablo and Isabel, who taught me to focus, think, and understand the world

To René and Paula, for being the most kind and generous siblings an older brother can have

To Andrea, for helping me get through my first challenges in research

To my friends, for making me company, making me part of something, and making me a better person

“Historia magistra vitae et testis temporum” Marcus Tullius Cicero

“América, […] vivo en la sombra que me determina, duermo y despierto en tu esencial aurora […]” Pablo Neruda, Canto General, XVIII

“You may have noticed that the less I know about a subject the more confidence I have, and the more new light I throw on it” Mark Twain

Acknowledgements

The author wishes to thank the many people who have been involved, at different levels, in the development of this work. First, I would like to thank my wife Romina for her unconditional support during my PhD program. She has been with me every step of the way, always encouraging me to work hard and enjoy what I do. Romi, I could not have done this without you, and I thank you not only for being here with me but also for being my friend.

Neither this work nor the person I am today could exist without the countless opportunities and lessons Keith has always generously provided. I do not recall a single time you have refused any of my whims, from buying fancy computers and books to encouraging me to go to workshops and supporting any research idea that has come to my mind. I still remember that it took me about a day to write an email to you back in 2008 asking you if I could join your lab, and it took only a few minutes to get an email back from you saying that you would be glad to receive me. For all that and for being a friend, I thank you.

Two institutions and two places shaped my experience as a graduate student. I would like to thank my friends in Utah, especially Arley and Melody, who helped me get settled and with whom I share so many good moments. I would like to thank Jack Sites, Fernanda, Rafael, César, Ana, and Patri for all the guidance and experiences we shared together. Special thanks go to my friends Justin and Flor (and baby Ben Bagley), who have always been there for me. Thanks to Andrés, Los Vega, Miguel, el primo, Gonzalo and the rest of the crew for giving me my social life back.

I would also like to thank the people of the CBI, especially Chris, Sarah, and Veronica. I enjoy all our conversations, from discussing the weather to the latest TV show to more serious stuff like baseball and basketball games. To Marcos, from whom I quickly learned to be cautious and to think more than twice when I analyze data: I will never be able to keep pace with you on our weekend hikes in the Appalachians, but do know that I enjoy them very much. I am also grateful to Guillermo, the Ortí Lab, Amy and the rest of my committee, who have helped me in the lab and advised me during the last two years of my program.

I would like to thank my undergraduate advisor Eugenio Spencer and my former boss Ana María Sandino for opening my mind to the world with new perspectives and stories, and for believing in me and encouraging me to pursue a PhD abroad. I am indebted to them.

Finally, I would like to thank my family and friends in Chile. There are too many to thank here but they know I love them and care for them deeply. Thanks to my mom, Isabel, for teaching me the love of reading and for always making me look for answers in the dictionary and our encyclopedia, for instilling in me a hunger to always learn more, and for always being with me unconditionally. Thanks to my dad, Pablo, who in his own way taught me so many essential lessons. I will always admire you. Thanks to my sister Paula, who recently graduated, beating me by a few months. You are the living example that hard work pays off. To René, my little brother who is now a man, thanks for bringing joy, good will, and exceptional stories. Sergio, I followed your advice: I kept my eyes open, sharpened my senses, and saved only what was valuable.

Abstract of Dissertation

Molecular Evolution and Population Genetics of Infectious Diseases

The rapid emergence and spread of infections and the rapid evolution of established pathogens affect our ability to monitor and control disease. We are continually reminded of the challenges of controlling disease, not only for pathogens affecting human health, but also for those indirectly affecting human activities such as food production. Phylogenetic coalescent methods are conventionally used to infer evolutionary relationships and processes from patterns of homologous characters, including genomic data. Viral and bacterial pathogens are especially well suited to such inferences, in contrast to the taxa typically studied in phylogenetic systematics, owing to their short generation times, large population sizes, and high substitution rates. These traits enable us to observe changes at the genomic level that are causally linked or correlated to ecological and evolutionary processes. Traditionally, phylogenetic methods, typically used by systematists, have been co-opted and applied to human-related pathogens of health importance; however, little has been done regarding pathogens that affect food production and/or public safety. This work aimed to review the breadth of phylogenetic applications to microorganisms, particularly viruses, and to apply these methods to questions about the molecular evolution and population genetics of non-model microorganisms.

The first section reviews phylogenetic approaches applied to the study of the model organism Human Immunodeficiency Virus (HIV). I review applications to viral origin, global dispersal, and population genetics of within- and among-host infections.

The second section deals with the relative performance of multilocus sequence typing (MLST) in molecular epidemiology, and with whether different molecular survey approaches, namely MLST, single nucleotide polymorphisms, and/or genomes, yield comparable inferences regarding the origin and dispersal of select agents.

Lastly, the third section shows two case studies: a disease outbreak investigation of Infectious Salmon Anemia Virus (ISAV) in Chile with negative consequences to salmon farming, and a Lactococcus phage investigation to understand transmission between affected cheese factories in Australia. In aggregate, this body of work demonstrates the challenges and benefits of using phylogenetic methods to study the evolution of infectious diseases.

Dissertation citations

The following work has been published in scientific journals and below I provide the citations to the fully formatted articles.

1. Castro-Nallar, Eduardo, Marcos Pérez-Losada, Gregory F. Burton, and Keith A. Crandall. "The evolution of HIV: inferences using phylogenetics." Molecular phylogenetics and evolution 62, no. 2 (2012): 777-792.

2. Castro-Nallar, Eduardo, Keith A. Crandall, and Marcos Pérez-Losada. "Genetic diversity and molecular epidemiology of HIV transmission." Future Virology 7, no. 3 (2012): 239-252.

3. Pérez-Losada, Marcos, Patricia Cabezas, Eduardo Castro-Nallar, and Keith A. Crandall. "Pathogen typing in the genomics era: MLST and the future of molecular epidemiology." Infection, Genetics and Evolution 16 (2013): 38-53.

4. Castro-Nallar, Eduardo, Nur Hasan, Richard Robison, Rita Colwell, W. Evan Johnson, and Keith A. Crandall. "Evaluation of genomic tools for molecular epidemiology." PeerJ (2014).

5. Castro-Nallar, Eduardo, Marcelo Cortez-San Martín, Carolina Mascayano, Cristian Molina, and Keith A. Crandall. "Molecular phylodynamics and protein modeling of infectious salmon anemia virus (ISAV)." BMC evolutionary biology 11, no. 1 (2011): 349.

6. Castro-Nallar, Eduardo, Honglei Chen, Simon Gladman, Sean C. Moore, Torsten Seemann, Ian B. Powell, Alan Hillier, Keith A. Crandall, and P. Scott Chandry. "Population genomics and phylogeography of an Australian dairy factory derived lytic bacteriophage." Genome biology and evolution 4, no. 3 (2012): 382-393.

Table of Contents

Dedication

Acknowledgements

Abstract of Dissertation

Dissertation citations

List of Figures

List of Tables

Section I: Literature overview

Chapter 1: The Evolution of HIV: Inferences using phylogenetics

Chapter 2: Genetic diversity and molecular epidemiology of HIV transmission

Section II: Molecular survey approaches and method comparison

Chapter 3: Pathogen typing in the genomics era: MLST and the future of molecular epidemiology

Chapter 4: A Survey Of Genomic Tools For Molecular Epidemiology

Section III: Case studies related to food production

Chapter 5: Molecular phylodynamics and protein modeling of Infectious Salmon Anemia Virus (ISAV)

Chapter 6: Population Genomics and Phylogeography of an Australian Dairy Factory Derived Lytic Bacteriophage

Summary

List of Figures

1.1.1 Phylogenetic tree representation of HIV-1 recombinants and discrete subtypes

1.1.2 Schematic representation of HIV-1 genome organization

1.1.3 HIV-1 intra-host Statistical Parsimony network

1.1.4 Phylogenetic tree showing HIV cross-species transmission

1.1.5 HIV-1 past population dynamics in North America and Thailand

1.2.1 HIV-1 recombinants and subtypes

1.2.2 Phylodynamic patterns

1.2.3 HIV-1 group M global distribution

2.1.1 Number of publications related to bacterial typing methods as a function of time

2.1.2 Schematic diagram showing direct sequencing approaches to obtain and discover genetic markers for MLST analysis

2.2.1 Geographic distribution of isolates used in this study

2.2.2 Substitution rates for all datasets as estimated from different data approaches

2.2.3 Median node ages in years

2.2.4 pseudomallei phylogenies by survey approach

3.1.1 Schematic representation mapping of positively selected sites and tertiary structure models for both ISAV surface proteins

3.1.2 Bayesian Phylogenetic Inference for ISAV f gene

3.1.3 Bayesian Skyline Plot reconstruction for the f and he genes

3.2.1 Schematic genome alignment of Australian cheese factory derived 936-like phage used for this study

3.2.2 Plot of recombination rate and genetic diversity for aligned phage genomes

3.2.3 Maximum clade credibility phylogeny of the twenty-eight 936-like phage

3.2.4 Plot of genetic diversity over time

3.2.5 Dispersion pattern of Australian 936-like phages

3.2.6 Molecular mapping of positive selected sites detected in RBP

List of Tables

1.2.1 Summary of software used for phylodynamic inferences

2.1.1 Comparison of most common bacterial typing techniques

2.1.2 Comparison of NGS-based methods used in gene mining and sequencing

2.1.3 List of population genetics programs listed in this review including their functionalities and online links

2.2.1 Summary of genomes sequenced and collected in this study

2.2.2 Genetic diversity and dataset length for different species and data approaches

2.2.3 Topology distances among phylogenies inferred using different data approaches

2.2.4 Trait-phylogeny association statistics

3.2.1 Bayes Factors Hypothesis Testing on Demographic Tree Priors for Australian 936 Phages

3.2.2 Positive Selected Sites (dN/dS > 1) on Australian 936 Phages Genes

Section I

Literature Overview

Chapter 1: The Evolution of HIV: Inferences Using Phylogenetics

Abstract

Molecular phylogenetics has revolutionized the study of not only evolution but also disparate fields such as genomics, bioinformatics, epidemiology, ecology, microbiology, molecular biology and biochemistry. Particularly significant are its achievements in population genetics as a result of the development of coalescent theory, which have contributed to more accurate model-based parameter estimation and explicit hypothesis testing. The study of the evolution of many microorganisms, and HIV in particular, has benefited from these new methodologies. HIV is well suited for such sophisticated population analyses because of its large population sizes, short generation times, high substitution rates and relatively small genome. All these factors make HIV an ideal and fascinating model to study molecular evolution in real time. Here we review the significant advances made in HIV evolution through the application of phylogenetic approaches. We first examine the relative roles of mutation and recombination in the molecular evolution of HIV and its adaptive response to drug therapy and tissue allocation. We then review some of the fundamental questions in HIV evolution in relation to its origin and diversification and describe some of the insights gained using phylogenies. Finally, we show how phylogenetic analysis has advanced our knowledge of HIV dynamics (i.e., phylodynamics).

Introduction

AIDS is one of the most serious modern diseases (Leeper and Reddi, 2010; UNAIDS, 2010) and the Human Immunodeficiency Virus (HIV) is the causative agent (Barre-Sinoussi et al., 1983; Gallo et al., 1983; Popovic et al., 1984; Sarngadharan et al., 1984). According to the World Health Organization (WHO), 33.4 million people (31.1 million–35.8 million) were living with HIV worldwide as of 2009. In the same year, two million infected people died and the disease grew at a rate of 7400 new infections per day, more than 97% of which occurred in low- and middle-income countries. To date, sub-Saharan Africa is the most affected region on earth, with 67% of the world’s HIV infections (UNAIDS, 2010). Host restriction factors in the form of proteins such as TRIM5/22, APOBEC, and Tetherin have been to some extent ineffective in blocking early HIV-1 infection (Neil et al., 2008; Sakuma et al., 2007; Stopak et al., 2003; Tissot and Mechti, 1995). In addition, Highly Active Antiretroviral Therapy (HAART) has been very effective at reducing viral loads within patients and thereby significantly prolonging life expectancy for HIV-infected individuals, particularly in those countries where HAART is accessible. However, even when HAART is available, effective control remains elusive due to the number of evolved mechanisms that HIV uses to evade the host immune system (Fischer et al., 2010; Wei et al., 2003), the evolution of drug resistance (Price et al., 2011; Shi et al., 2010), and the isolation of viral reservoirs from drug treatments (Chomont et al., 2009; Finzi et al., 1997). The extraordinary genetic diversity observed among circulating populations of HIV has hampered the development of a vaccine to provide immunity or control of AIDS. Despite much research, phase III trials of HIV vaccines have been few and have failed to provide full protection, which is probably due to the extensive variation of HIV isolates (McBurney and Ross, 2008). Although some partial protection was observed in the Thai RV144 Phase III HIV vaccine trial (Rerks-Ngarm et al., 2009), an effective vaccine against HIV remains elusive.

While our knowledge of HIV biology is still limited, we have gained significant insights through the application of phylogenetics to HIV diversity. For example, phylogenetic analyses have elucidated the origins of the HIV-1 and HIV-2 epidemics (Gao et al., 1999; Gao et al., 1994; Korber et al., 2000; Lemey et al., 2003; Salemi et al., 2001; Sharp et al., 2000), the relationships of HIV to other simian lentiviruses (Bailes et al., 2003; Essex, 1994; Gao et al., 1992; Wertheim and Worobey, 2007), and the classification of HIV diversity within HIV-1 (Kosakovsky Pond et al., 2009). Cross-species transmissions (Bailes et al., 2003; Beer et al., 1999; Gao et al., 1999; Hahn et al., 2000; Plantier et al., 2009; Takehisa et al., 2009; Wertheim and Worobey, 2007; Worobey et al., 2004) have been identified and characterized through the use of phylogenetic approaches. Such methods have been used to test hypotheses of transmission events of HIV between individuals (Hillis and Huelsenbeck, 1994; Leitner et al., 1996; Xin et al., 1995), including their use as evidence of transmission in legal settings (Bernard et al., 2007; Crandall, 1995; Metzker et al., 2002; Ou et al., 1992).

Phylogenetics has been key to the identification of drug-resistance mutational pathways (Buendia et al., 2009; Crandall et al., 1999) and the mechanisms of drug resistance (Carvajal-Rodríguez et al., 2008; Lemey et al., 2005a; Machado et al., 2009). Moreover, phylogenetic approaches have been used to assess within- and among-host HIV diversity and population dynamics (i.e., phylodynamics) (Grenfell et al., 2004), co-divergence (Beer et al., 1999; Bibollet-Ruche et al., 2004; Chen et al., 1996; Wertheim and Worobey, 2007), and the role of recombination in the diversification process (Carvajal-Rodríguez et al., 2007; Jobes et al., 2006; Schlub et al., 2010). This exceptional diversity has been examined to infer geographical distribution and dispersion patterns (Lemey et al., 2003; Robbins et al., 2003) and thereby test hypotheses associated with molecular epidemiology (Holmes et al., 1995; Salemi et al., 2008). Phylogenetics has been key to identifying patterns and mechanisms of natural selection (Lemey et al., 2007; Pond et al., 2008; Poon et al., 2010; Templeton et al., 2004), including the intra- and inter-host adaptive forces that shape the evolution of the virus (Carvajal-Rodríguez et al., 2008; Keele et al., 2008a; Salazar-González et al., 2008; Shankarappa et al., 1999), which is essential for effective vaccine development (Frahm et al., 2008).

The first applications of phylogenetics to the study of HIV date from the early 1990s and were aimed at inferring the origins of HIV-1 and the classification of HIV into different types (1 and 2), groups (M, N, O within HIV-1), and subtypes (A-D, F-H, and J and K within Group M of HIV-1) (Huet et al., 1990; Fig. 1). Today, phylogenetic analysis has become a common practice of many HIV/AIDS research programs, due mainly to the many insights these analyses can provide and the novel questions they can address over a variety of topics related to HIV biology. Over the past two decades, HIV data have accumulated rapidly in public and specialized databases thereby creating one of the richest datasets we have for a single entity in terms of sequence tallies and epidemiological information (e.g., sampling locality, drug resistance mutations, tissue allocation, etc.). For instance, the number of available sequences in the Los Alamos database (http://www.hiv.lanl.gov) has exploded to 339,306 sequences, a 45% increase over the preceding year, with 2576 complete genomes (Kuiken et al., 2010).

Figure 1: Phylogenetic tree representation of HIV-1 recombinants and discrete subtypes. Note the lack of genetic structure due to the presence of recombinant sequences. Also note that the existence of discrete subtypes is questionable due to the high evolutionary rates the virus exhibits. This tree should be regarded as a snapshot of part of the observable diversity. CRF = circulating recombinant form; cpx = complex recombinant pattern. A-D, F-H, J and K denote HIV-1 subtypes.

Here, we review the application of phylogenies to the study of HIV. Although, given the nature of the subject, no article-size review can be comprehensive, we hope to show how phylogenetic approaches have influenced our current understanding of the emergence, evolution of drug resistance, epidemiology and dynamics of HIV, and how they can assist with the problem of its eradication (Grenfell et al., 2004; Holmes and Grenfell, 2009; Stack et al., 2010).

Molecular evolution of HIV

The defining feature of HIV is its exceptional genetic diversity. This high diversity stems from at least four sources: high substitution rates, a rather small genome, short generation times, and high recombination frequency. HIV-1 substitution rates [~0.002 substitutions/site/year (Korber et al., 2000)] are related, among other factors, to high mutation rates, which in HIV-1 have been estimated to be 0.1–0.2 mutations/genome/generation (Mansky and Temin, 1995), that is, ~33 times higher than in Neurospora crassa, but around ten-fold lower than in Influenza A virus (Drake et al., 1998). The HIV genome, like that of other RNA viruses, is small, at ~9.8 kb in length (exceptions are coronaviruses and roniviruses, with > 25 kb; see Figure 2 for genome structure). Genome size in HIV, as in other RNA viruses, is apparently limited by the error-prone nature of its replication machinery: the longer the genome, the more mutations produced, most of them being deleterious (Holmes, 2009). In terms of HIV diversity, such a small genome size impacts on generation time (1.2 days for HIV-1), with ~10^10 virions produced daily in an infected individual (Rodrigo et al., 1999). In addition to the staggering numbers above, HIV-1 recombines at a frequency of 1-3 recombination events/genome/generation (Jetzt et al., 2000; Shriner et al., 2004). Altogether, this represents a tremendous amount of raw material for evolution. However, variation is distributed unevenly across the genome. The HIV-1 genome structure is composed of three main genes, gag, pol and env, plus accessory genes, tat, rev, vif, vpr, vpu and nef, flanked by Long-Terminal Repeats (LTRs). All genes are coded over the three forward reading frames, including frame shifts in the case of tat and rev (Fig. 2). Variation in substitution rates is known to occur genome-wide and within specific genes, which has been interpreted as evidence of functional and structural constraints acting upon nucleotide sequences (Ngandu et al., 2008; Pond and Muse, 2005). As described below, this variation leads to divergent patterns in HIV evolution, whether we look at population or within-host data, with consequences for disease progression, natural selection, and drug resistance.
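To give a sense of scale for the figures quoted above, the following is a minimal back-of-the-envelope sketch in Python. It uses only values stated in the text (the substitution rate, genome length, and generation time); the function and variable names are illustrative assumptions, and the calculation ignores rate variation across sites and genes.

```python
# Values quoted in the text; names are illustrative, not from any library.
SUBSTITUTION_RATE = 0.002     # substitutions/site/year (Korber et al., 2000)
GENOME_LENGTH = 9_800         # sites (~9.8 kb genome)
GENERATION_TIME_DAYS = 1.2    # days per generation for HIV-1


def substitutions_per_genome_per_year(rate_per_site, genome_length):
    """Expected substitutions accumulated across the whole genome in one year."""
    return rate_per_site * genome_length


def generations_per_year(generation_time_days, days_per_year=365.0):
    """Number of viral generations that fit in one year."""
    return days_per_year / generation_time_days


subs = substitutions_per_genome_per_year(SUBSTITUTION_RATE, GENOME_LENGTH)
gens = generations_per_year(GENERATION_TIME_DAYS)

print(f"~{subs:.1f} substitutions/genome/year")
print(f"~{gens:.0f} generations/year")
```

Under these simplifying assumptions, a single lineage accumulates on the order of twenty substitutions per genome per year over roughly three hundred generations, which is why change can be observed over human timescales.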

Substitution rates

Phylogenies allow researchers to determine patterns of the extensive genetic diversity of HIV to examine human-scale ecological and epidemiological processes. Divergent patterns of HIV evolution are observed when comparing intra- and inter-host phylogenies. Ladder-like intra-host phylogenies are evidence of continuous immune-driven selection (similar to inter-host influenza phylogenies; Buonagurio et al., 1986; Bush et al., 1999), in which genetic diversity is low at any given time point; rather, there are few lineages with sequential replacement of strains over time. In contrast, inter-host phylogenies do not exhibit this pattern; instead, multiple lineages coexist at any given time. This is probably the product of major bottlenecks at transmission (Brown, 1997). Whether drift, selection or both govern these bottlenecks is not clear (Edwards et al., 2006; Keele et al., 2008a; Salazar-González et al., 2008), but the impact of random genetic drift on the population dynamics, genetic diversity, and clinical outcome is well studied (Tazi et al., 2011). Thus, HIV-1 possesses intrinsic mutational properties that prompt it to exhibit different patterns of substitution depending upon whether we look at within-host or inter-host genetic data.

Disease progression

The relationship between disease progression and substitution rates (or genetic diversity) was recognized early in HIV studies. The hypothesis is that host processes that determine HIV pathogenesis also determine viral replication rates. Thus, by looking at substitution rates we can infer what is happening with HIV within patients. In a thorough study, Shankarappa et al. (1999) monitored nine patients over a 6–12 year period. They distinguished three phases of disease progression associated with diversity in ~600 bp of env. In the first phase, a linear increase of diversity was associated with initial features of infection and X5 HIV populations. During the second phase, diversity leveled off or even decreased, which was correlated with the appearance of X4 HIV populations. Finally, the third phase was related to a decline in CD4+ T cells and with the failure of T cell homeostasis and a reduction in diversity. Although these results might not be general for other patients, as these may have been infected with phenotypically different strains (different rates), and different genes may generate different patterns, other studies have somewhat supported this hypothesis. Using the previous dataset along with others, Lemey et al. (2007) found a positive association between disease progression and HIV-1 synonymous substitution rates. When analyzing HIV-2 sequences, they observed an overall low substitution rate that might reflect the reduced virulence that this viral type exhibits. On the other hand, non-synonymous rate changes have been associated with immune pressure. Consequently, a decrease in selective pressure would correlate with the breakdown of the immune system. However, with a different approach and on a different dataset, Carvajal-Rodríguez et al. (2008) reported no relationship between disease progression and substitution rates when these were analyzed separately into adaptive and neutral categories of variation. Opposite patterns of synonymous and non-synonymous substitutions were observed through time when dividing the dataset into rapid (RP) and non-rapid progressors (NRP), with RP showing a slow increase in non-synonymous substitutions and NRP showing a fast increase. These discrepancies could stem from the different methodologies used: Lemey et al. were comparing absolute rates of substitution while Carvajal-Rodríguez et al. were comparing relative rates. Moreover, the former study accounted for deleterious mutations but not for recombination, and the latter did the opposite. It remains unknown whether both datasets would converge on similar conclusions if the same methodologies were used. It is worth mentioning that animal models can provide an optimal environment to account for subject variability, sample size, and HIV variation (Berges et al., 2010), thus providing a well-defined and structured opportunity to test the hypothesis above.
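The notion of within-host diversity changing over the course of infection can be made concrete with a small sketch. The Python below computes the mean pairwise proportion of differing sites (an uncorrected p-distance) among aligned sequences sampled at each time point. This is an illustrative assumption about how diversity might be summarized, not the estimator used in the studies cited above; the sequences are invented toy data and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences,
    skipping any column where either sequence has a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a != b for a, b in compared) / len(compared)


def mean_pairwise_diversity(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignments keyed by years since infection (invented for illustration).
samples = {
    0: ["ACGTACGT", "ACGTACGT", "ACGTACGA"],
    3: ["ACGTACGT", "ACGAACGA", "TCGTACGA"],
}

for year, seqs in sorted(samples.items()):
    print(year, round(mean_pairwise_diversity(seqs), 4))
```

Tracking such a summary across serial samples is one simple way to see whether diversity is rising, levelling off, or declining; model-based distances and explicit phylogenetic methods would be needed for any real analysis.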

Natural selection

Substitution rates are logically tied to the identification of natural selection and site-based molecular adaptation. The development and refinement of methods have opened up a plethora of questions that biologists can now address (Mens et al., 2007; Pond and Muse, 2005). Several studies have addressed questions regarding natural selection in HIV at the molecular level. It is generally accepted that nucleotide changes that do not change the amino acid composition (synonymous; dS) are more likely to be neutral than changes that affect amino acid composition (non-synonymous; dN) (Sharp, 1997). Therefore, the rate ratio between both types of substitutions (dN/dS) could indicate whether purifying selection (dN/dS<1), positive selection (dN/dS>1) or neutral evolution (dN/dS=1) is acting at the gene and/or codon level. Under this paradigm, and using phylogenies to make sense of nucleotide changes, several studies have tried to identify molecular determinants of selection in HIV. For instance, these studies have been instrumental in demonstrating that the switch from X5 to X4 tropic populations is under strong positive selection, and so is the switch from non-syncytium forming HIV strains to the ones able to form syncytia (Templeton et al., 2004). Using these approaches, others have attempted to study selection dynamics within and among hosts (even between populations) to ascertain the extent to which HIV variation is maintained and passed between individuals (Pond et al., 2006; Poon et al., 2007). At the inter-population level, several within-host adaptations seem to be transient while persistent substitutions are subject to stronger selective pressures, which ultimately fix these variants in different populations (Pond et al., 2006). More ambitious studies have analyzed large quantities of data in order to identify novel drug-resistant and high-fitness mutations that exhibit signatures of positive selection (Chen et al., 2004). Recently, a method well suited for analyzing adaptive rates (adaptations/codon/year) in large datasets has been published (Bhatt et al., 2011); some of its virtues are computational tractability and robustness to biases introduced by synonymous mutations and RNA secondary structures. When using serially sampled data, a previous time point can be used as a homologous outgroup to the more contemporaneous large ingroup dataset. In doing so, the algorithm determines which sites are ancestral or derived. Sites are classified as silent or replacement and/or high-, medium- and low-frequency polymorphisms. These values are then combined so that the output reflects the proportion of fixed sites that have undergone adaptive change. For additional details on detecting selection, see McDonald and Kreitman (1991), Pérez-Losada et al. (2007c), and Smith and Eyre-Walker (2002).
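The dN/dS thresholds described above can be expressed as a short Python sketch. It assumes that dN and dS have already been estimated by some other method, and it introduces a hypothetical `tolerance` parameter so that ratios close to 1 are labelled as neutral; that cutoff is an arbitrary illustration, not a statistical test. In practice, selection is usually assessed with likelihood-based comparisons rather than a fixed threshold.

```python
import math


def classify_selection(dn, ds, tolerance=0.05):
    """Return (omega, label), where omega = dN/dS.

    label is 'purifying' for omega < 1, 'positive' for omega > 1, and
    'neutral' when omega is within `tolerance` of 1. The tolerance is a
    hypothetical convenience for this sketch, not a significance test.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the dN/dS ratio")
    omega = dn / ds
    if math.isclose(omega, 1.0, abs_tol=tolerance):
        label = "neutral"
    elif omega > 1.0:
        label = "positive"
    else:
        label = "purifying"
    return omega, label


# Invented rate estimates, purely for illustration.
for dn, ds in [(0.02, 0.10), (0.30, 0.10), (0.10, 0.10)]:
    omega, label = classify_selection(dn, ds)
    print(f"dN={dn:.2f} dS={ds:.2f} dN/dS={omega:.2f} -> {label}")
```

A gene-wide ratio can mask a handful of positively selected codons in an otherwise conserved gene, which is one reason the site-based methods discussed in the text are used.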

Drug resistance

Drug therapy was once envisioned as a potential cure for HIV-infected patients (Perelson et al., 1997; Wain-Hobson, 1997). With the emergence of the first drugs during the mid-1980s, antivirals seemed to control viral infection readily (Mitsuya et al., 1985). However, it took just 4 years following the introduction of Zidovudine (AZT) before the first mono-resistant HIV-1 strains were found and officially reported (Larder and Kemp, 1989). HIV drug resistance has risen considerably in resource-rich countries, perhaps due to widespread treatment access, although resistance is also present in low- and middle-income countries. Mathematical predictions failed to account for certain biological characteristics that impact the population dynamics of HIV, including i) extremely fast evolutionary rates and within-host population structure by specific cell types (Perelson et al., 1997); ii) high within-host population sizes, reaching 10^7–10^8 productively infected cells in lymphoid tissue; iii) high substitution rates due to an error-prone reverse transcriptase (RT) (Mansky and Temin, 1995); and iv) the once-neglected recombination process that seems to play a major role in HIV evolution and, consequently, in drug resistance (Carvajal-Rodríguez et al., 2006). In fact, recombination is of prime importance in HIV, accounting for much of the observed diversity, and at times exceeding the mutation rate by 5.5 times (Shriner et al., 2004). Furthermore, cells can harbor different proviruses (Keele et al., 2008a) and multiple infections can occur simultaneously (Jobes et al., 2006; Xin et al., 1995).

Several gene products have been targeted for drug treatment because they are involved in key stages of the viral replication cycle. These include RT, protease, the envelope glycoprotein complex (gp120-gp41) and, lately, the virion infectivity factor (vif) (see Coffin, 1999) (Fig. 2). Inhibitors include nucleoside and nucleotide analogs as well as non-nucleoside analogs that impair parts of reverse transcription (analog incorporation, analog removal), inhibitors of protease activity, and inhibitors of fusion with the plasma membrane (Clavel and Hance, 2004). There is an extensive database (http://hivdb.stanford.edu/) of mutations conferring drug resistance spread throughout the HIV genome. Some of them confer cross-resistance, i.e., resistance to drugs to which the patient has never been exposed, and others compensate for the fitness lost through the primary mutations (Rhee et al., 2003). Without doubt, antiviral treatments such as HAART have greatly improved the quality of life and life expectancy of those infected. However, they are far from being a cure, as evidenced in part by the appearance of drug resistance mutations in treated and untreated patients (Lataillade et al., 2010).


Figure 2: Schematic representation of HIV-1 genome organization. The three coding reading frames are depicted along with their open reading frames (rectangles). Genome position numbering is based on the HXB2 reference strain. The small number in the upper left corner of each rectangle indicates the gene start, while the number in the lower right indicates the last position of the stop codon. Trans-spliced rev and tat forms are represented by black connecting lines between the third and second, and second and first open reading frames, respectively.

Recombination

Recombination in retroviruses was described a few decades ago with mechanistic details (Coffin, 1979; Goodrich and Duesberg, 1990; Temin, 1991). However, through the mid-1990s, recombination in HIV-1 was regarded as almost non-existent, mainly because multiple infections within the same individual were thought to be rather unlikely. This led to the general view that recombination could not contribute to HIV-1 evolution. Retroviral recombination was demonstrated experimentally in feline and murine species, for which recombinant viruses possessed altered tropism, host ranges or virulence (Golovkina et al., 1994; Tumas et al., 1993), but again evidence for recombination in HIV was rare. However, in 1995, recombination was detected in HIV using a phylogenetic incongruence method, and it was suggested to be underestimated (Robertson et al., 1995a). Initially, recombination was detected readily in Africa, probably due to the high genetic divergence of the HIV-1 strains co-circulating on that continent. This divergence created opportunities for co-infection with different subtypes, which made recombination easier to detect. Soon more evidence of recombination in HIV-1 (Diaz et al., 1995; Robertson et al., 1995b; Zhu et al., 1995) and in HIV-2 was detected using phylogenies (Gao et al., 1994). Evidence of within-host, intra-subtype recombination was also detected through phylogenetic and substitution rate analyses (Jobes et al., 2006). There are now a variety of methods available for detecting recombination in HIV sequences and estimating recombination rates (Posada et al., 2002), and it is apparent that recombination plays a significant role in HIV evolution (Rambaut et al., 2004).

HIV recombination has some unique characteristics that resemble sexual reproduction in multicellular organisms (Temin, 1991). HIV is essentially diploid in that it possesses two full-length, replication-capable genome copies within the protein capsid; these can have different evolutionary histories, so the virion can be viewed as heterozygous. It is not diploid, though, in the sense that only one genome copy is replicated when the virus infects another cell, so just one allele is passed on to the progeny (Onafuwa-Nuga and Telesnitsky, 2009). However, different HIV genomes within the same cell can recombine and yield offspring carrying genetic material from both “parents.” The costs of sexual reproduction have been reviewed extensively (Fox et al., 2001), though it is widely accepted that sex can bring together “good mutations” and purge bad mutations from the gene pool. Moreover, when a superinfection occurs, i.e., an infection by a second strain in a patient already infected, the likelihood of recombination, or of detecting it, should increase. Thus, this process could accelerate the emergence of multidrug-resistant recombinant forms. However, this view has been challenged by research groups that have found negative correlations between the appearance of drug resistance mutations and superinfection under computational genetic models (Bretscher et al., 2004).

Regardless of these models, inter-subtype recombinants, i.e., circulating recombinant forms (CRFs), have been described since 1996 and now occur worldwide, encompassing at least 49 variants (Fig. 1 and http://www.hiv.lanl.gov/). These recombinant forms reflect successful recombination events that have been fixed in populations and may represent higher-fitness forms; nevertheless, this is debatable because of a lack of fitness measurements in vivo (Holmes, 2009). Evidence of recombination has now been found at every level, i.e., between and within subtypes, among HIV groups, and among primate lentiviruses. In fact, recombination is so pervasive and has such an impact on genetic structure that it is questionable whether HIV occurs as discrete subtypes (Holmes, 2009).

The effect of recombination on drug resistance evolution seems to depend on the intensity of selection pressure. Using simulated data, Carvajal-Rodríguez et al. (2007) painted a more detailed picture of this process. They found that, under high selection pressures, recombination would favor the appearance of drug-resistant HIV variants. This effect would depend on population size: the larger the population, the more likely drug-resistant recombinants are to appear and become fixed. We expect this prediction to be met in, for instance, patients under drug treatment and/or experiencing a strong adaptive immune response.

Given that recombination is such an important factor in the evolution of HIV, it is important to test for recombination in DNA sequence data prior to phylogenetic analyses (Posada and Crandall, 2002). Phylogenetic recombination detection methods, e.g., the bootscanning algorithm (Lole et al., 1999), are the most popular when the goal is to analyze considerable amounts of data, although it has been shown that they do not perform as well as others, e.g., the Runs test (Posada and Crandall, 2001). On the other hand, experimental detection of recombination relies on laborious and time-consuming assays based on single-round replication cycles. Typically, these use pairs of vectors that reconstitute a selectable marker when recombination occurs, or they use single vectors when the goal is to assess intra-strand recombination (Onafuwa-Nuga and Telesnitsky, 2009). Experimental studies can reveal average recombination frequencies or hotspots. However, estimating recombination rates and breakpoints and identifying parental sequences at the population level are not likely achievable in this framework.

On the other hand, statistical methods are well suited for inter-host, intra-host, and population-level studies of HIV. The literature contains numerous methodologies for detecting recombination breakpoints (reviewed in Posada et al., 2002). Based on the kind of evidence for recombination they rely on, they have been tentatively classified as distance methods, phylogenetic methods, compatibility methods and substitution distribution methods (Posada and Crandall, 2001). By far, phylogenetic methods are the most commonly used (again, not always the best choice). Despite the plethora of methodological alternatives at hand, recombination detection is not an easy task. One reason is that it depends on several factors, including the amount of divergence among sequences and where and how frequently the event occurs (Lewis-Rogers, 2004). In addition, modern recombination rate estimation involves the use of coalescent-based methods that account for evolutionary history and uncertainty in the estimates (Kuhner, 2006). Although efforts have been made to implement more complex and realistic models, caution should be exercised when using them because of their sensitivity to violations of their assumptions (e.g., neutrality and population stability, which are frequently violated in natural populations) (Carvajal-Rodríguez et al., 2006; Kuhner, 2009).
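To make the idea behind window-based detection concrete, the following minimal sketch slides a window along an alignment and records which of two putative parental sequences the query matches more closely. A switch in the closer parent along the alignment hints at a breakpoint. This is a crude illustration in the spirit of distance methods, not an implementation of bootscanning, the Runs test, or any published tool. The function name, window settings, and toy sequences are assumptions made for the example.

```python
def window_parent_calls(query, parent_a, parent_b, window=10, step=5):
    """Label each window by the parent that the query matches more closely.

    Sequences must already be aligned to equal length. Names and defaults
    are hypothetical; real tools also assess statistical support.
    """
    if not (len(query) == len(parent_a) == len(parent_b)):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        # Count identical sites between the query and each parent in this window.
        match_a = sum(q == a for q, a in zip(query[start:end], parent_a[start:end]))
        match_b = sum(q == b for q, b in zip(query[start:end], parent_b[start:end]))
        if match_a > match_b:
            calls.append((start, "A"))
        elif match_b > match_a:
            calls.append((start, "B"))
        else:
            calls.append((start, "tie"))
    return calls


# Toy alignment (invented): the query copies parent A up to site 20, then parent B.
parent_a = "A" * 40
parent_b = "G" * 40
query = "A" * 20 + "G" * 20

calls = window_parent_calls(query, parent_a, parent_b)
labels = [label for _, label in calls]
print(labels)
```

In this toy case the labels change from A to B around the window that straddles site 20. With real data, low divergence between parents produces many ties, which is one reason detection power depends on sequence divergence.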

In HIV research, recombination detection methods have been used to explore many different aspects of HIV biology. For instance, intra-host recombination has been studied in postmortem tissues exhibiting normal and abnormal histopathology in patients who received HAART. Tissues with abnormal histopathology show higher numbers of recombinant sequences. Likewise, these tissues display increased macrophage proliferation, and it is well known that this cell type is involved in hiding HIV from HAART. Thus, macrophages may contribute to elusive recombinant forms, as evidenced by extensive recombination in non-lymphoid populations (Lamers et al., 2009). Nora et al. (2007) provided additional evidence supporting the role of recombination in drug resistance evolution. Based on phylogenetic incongruence of samples taken before and after a change in patient treatment, they showed that the resistant HIV strains present after the change likely originated through recombination of strains carrying previously existing resistance mutations in a novel combination. It is worth noting that the observed patterns were apparently not consistent with convergent evolution because potential donors could be identified thanks to the extensive sampling performed. Moreover, the pattern of substitutions observed in the multidrug-resistant variants present after the treatment change suggests that this variation arose by recombination and not by the accumulation of mutations. However, it would be interesting to see whether the observed recombination patterns are congruent when applying more stringent statistical and phylogenetic methods of detection (Martin et al., 2005).

The viral reservoir

A viral reservoir refers to a specific cell type or anatomical compartment where HIV (i) is protected from antiviral drugs and the immune system, (ii) shows greater stability than virus in the actively replicating pool, (iii) possesses greater genetic diversity than non-reservoir virus due to the presence of archival strains, and (iv) remains replication-competent. HAART effectively reduces viral loads to <50 genome copies/mL, the detection limit of most approved assays. The importance of virus in reservoirs is underscored by the observation that once therapy is withdrawn, viral loads increase within a few weeks to the levels of drug-naïve individuals (e.g., Imamichi et al., 2001). Taking into account the average half-life of long-term, latently infected cells, Siliciano et al. (2003) estimated that it would take around 60 years to deplete the main viral reservoir. This estimate is based on current antiretroviral therapy and does not consider drug resistance, drug toxicity and tolerance, or treatment costs. Thus, although novel methods for inducing proliferation of latently infected cells and subsequently eliminating them have been proposed (Marsden and Zack, 2009), the problem of eradication persists and is not likely to be solved under existing therapies.
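The arithmetic behind a depletion estimate of this kind can be sketched with simple exponential decay: a pool of N latently infected cells with half-life t needs log2(N) halvings to fall to a single cell. The reservoir size and half-life used below are illustrative assumptions, not values taken from the text, and the function name is hypothetical.

```python
import math


def years_to_deplete(initial_cells, half_life_months):
    """Years for an exponentially decaying cell pool to fall to one cell.

    Assumes pure first-order decay with no replenishment of the pool.
    """
    if initial_cells < 1 or half_life_months <= 0:
        raise ValueError("need initial_cells >= 1 and half_life_months > 0")
    halvings = math.log2(initial_cells)  # halvings needed to go from N to 1
    return halvings * half_life_months / 12.0


# Assumed inputs for illustration: 10^5 latently infected cells, 44-month half-life.
years = years_to_deplete(initial_cells=1e5, half_life_months=44)
print(f"{years:.1f} years")
```

Under these assumed inputs the result is on the order of six decades. Because the estimate scales only with the logarithm of the pool size, it is far more sensitive to the half-life than to the initial number of cells.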

Phylogenetic methods are especially helpful in characterizing HIV reservoirs. By sampling a suspected reservoir over time and inferring evolutionary relationships, questions such as whether HIV replicates in a particular compartment can be answered by looking at the branch lengths (i.e., genetic changes over time) of specific phylogenies. Depending on the amount of change and the richness of the collected dataset, more specific inferences can be drawn, such as divergence times and changes in population size over time. A reservoir is then expected to have greater genetic diversity than other compartments (e.g., blood). Genotype networks are particularly suited to this aim since ancestral genotypes are located “center-wise” in the network, from which “founder viruses” branch off (Crandall and Templeton, 1993), migrating into other compartments (Fig. 3). This is also true if the aim is to test whether archival strains are present in the bloodstream. To a great extent, phylogenies are also instrumental in estimating parameters that describe diversity in within-host HIV populations. Modern estimates of genetic diversity, expressed as substitution rate-scaled effective population size (θ = 4Neµ, where Ne is the effective population size and µ is the substitution rate), rely on coalescent simulations for which genealogical reconstructions and phylogenetic models are essential.
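For orientation, θ can also be approximated from simple summary statistics of an alignment. The sketch below computes two standard estimators, nucleotide diversity (the mean number of pairwise differences per site) and Watterson's estimator (based on the number of segregating sites), both of which estimate θ under a neutral model with constant population size. These are much simpler than the coalescent-based approaches referred to above and are shown only to make the quantity tangible. The function names and the toy sequences are assumptions for the example.

```python
from itertools import combinations


def _check(seqs):
    if len(seqs) < 2 or len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least two aligned sequences of equal length")


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (pi)."""
    _check(seqs)
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs) / len(seqs[0])


def watterson_theta(seqs):
    """Watterson's estimator per site: S / a_n / L."""
    _check(seqs)
    n = len(seqs)
    # A site is segregating if more than one state is observed in its column.
    segregating = sum(len(set(column)) > 1 for column in zip(*seqs))
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n / len(seqs[0])


# Toy alignment (invented), four sequences of eight sites.
seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
pi = nucleotide_diversity(seqs)
theta_w = watterson_theta(seqs)
print(f"pi = {pi:.4f}, theta_W = {theta_w:.4f}")
```

Large discrepancies between the two estimators are themselves informative, since they can reflect departures from neutrality or from constant population size.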

Several cell types have been identified that play a role in hosting HIV at different time points during the course of drug therapy. The main cell type that plays host to HIV after initiation of HAART is activated CD4+ T cells (Chun et al., 1997; Delobel et al., 2005), which die rapidly, within 2-3 days. Then, dendritic cells (DC), partially activated CD4+ T cells, and macrophages are thought to contribute to persistence due to their susceptibility to HIV infection, lower vulnerability to cytopathic effects, and half-lives of up to several weeks (Dahl et al., 2010). However, long-lived memory CD4+ T cells bearing latent integrated provirus contribute the most to the problem of persistence. Populations of this cell type are maintained by the intrinsic long-term survival and homeostatic proliferation of infected cells (Chomont et al., 2009).

Whether or not HAART impairs viral replication, and thereby HIV evolution, within CD4+ cell reservoirs is not entirely clear, yet replication in reservoirs is thought to be low (Hermankova et al., 2001; Kieffer et al., 2004; Parera et al., 2004; Persaud et al., 2007; Tobin et al., 2005). For instance, some studies have looked at different CD4+ memory cells and found almost no drug resistance mutations in this reservoir and short genetic distances when inferring phylogenetic trees after 8.3 years of uninterrupted HAART (Nottet et al., 2009). Although the authors chose a less reliable method of phylogenetic analysis and substitution model (Sullivan and Joyce, 2005; Susko et al., 2004), their inferences seem robust since other studies have reached similar conclusions (Mens et al., 2007). Nottet et al. (2009) hypothesized that viral replication in reservoirs would be indicated by the appearance of drug resistance mutations, indicative of evolution within the reservoir. They concluded that, given the lack of drug resistance mutations and the short tree branch lengths observed in the HIV stored in reservoirs, replication and evolution have been halted in the blood compartment. Bailey et al. (2006) also inspected CD4+ T cells with a thorough sampling strategy and found limited evolution in the CD4+ reservoir. This conclusion was reached even though a greater diversity of pol genes was found in the reservoir than in plasma samples and some sequences isolated from plasma were also found in the reservoir. Of course, these results do not preclude the possibility that other reservoirs are the source of residual viremia. Although lack of evolution within an organism or entity is conceptually difficult to imagine, particularly since evolution can be defined in its simplest form as genetic change over time, irrespective of the observed magnitude, it seems that HIV hiding in reservoirs evolves at a slow rate and/or that the viruses released from the reservoir are subjected to strong purifying selection in the plasma.

Diversity between reservoirs and peripheral blood cells or within specific tissues has been studied in other reservoirs as well. For example, follicular dendritic cells (FDCs) also act as a viral reservoir, although less is known about this reservoir than about latent CD4+ cells and macrophages. In the mouse, FDC-trapped HIV has a half-life of about two months and remains replication-competent for at least nine months (Burton et al., 2002). In contrast to the other common reservoirs of HIV, the FDC is not infected but contains only trapped extracellular HIV. Because the studies cited above were performed in mice, it was unclear how FDC-trapped virus could contribute to HIV persistence (Smith et al., 2001). However, using experimental and phylogenetic approaches with human tissues and cells, Keele et al. (2008b) confirmed that HIV trapped on FDCs was replication-competent. More importantly, HIV on FDCs had greater genetic diversity than viruses in the other tissues and cells examined, including CD4+ T cells. In addition, drug resistance variants were found among the FDC-trapped viruses that were not identified at other sites (Fig. 3). Moreover, with an elegant network approach, they showed the existence of archival viral variants from various time points of infection that were trapped on FDCs. Altogether, these findings indicate that FDCs can act as a reservoir and hold viruses for years (Fig. 3).

Phylogenetic analysis has been used to explore the contribution of HIV to tumorigenesis in reservoir cell types. Recently, Salemi et al. (2009) explored the dynamics of HIV-infected macrophages using p24 staining and prediction of co-receptor usage in tumor and non-tumor postmortem tissues from patients who died of AIDS-related lymphoma (ARL). They observed a high degree of compartmentalization between HIV from macrophages found in tumor and non-tumor tissues and an intermixing of HIV strains obtained from auxiliary lymph nodes. Viral effective population size was 100-fold greater in tumor tissues than in non-tumor tissues and, strikingly, the onset of lymphoma correlated with viral expansion. Moreover, evidence of gene flow between lymph nodes and tumor tissues indicates that lymph nodes might facilitate the movement of metastatic cells to different parts of the body.


Figure 3: HIV-1 intra-host Statistical Parsimony network. Strains were isolated from different tissue types. Note how Follicular Dendritic Cell (FDC)-derived sequences cluster in the center (ancestral) portion of the network, whereas Peripheral Blood Mononuclear Cell (PBMC)-derived sequences surround FDC-derived HIV sequences. This suggests an active role for the FDCs as an HIV reservoir.

Poor penetration of HAART or properties such as immune privilege can lead anatomical compartments to act as “sanctuary sites”, places where HIV keeps replicating. Some suggested sanctuaries include the central nervous system (CNS), gut-associated lymphoid tissue (GALT) and the genitourinary tract (Dahl et al., 2010). Compartmentalization of HIV sequences from different tissue types has typically been inferred from monophyletic assemblages of those sequences in phylogenetic trees (Wang et al., 2001; Wong et al., 1997). Even further sub-compartmentalization has been found in GALT throughout the gastrointestinal tract (van Marle et al., 2007). When coupled with differential HIV gene expression, this can indicate that GALT has the capacity to host different replicating HIV strains. Hence, GALT should also be considered when screening for drug-resistant variants. While these conclusions are insightful for HIV genetic structure in GALT, overlooking nucleotide substitution model selection, not using an optimality criterion for tree inference that accounts for phylogenetic uncertainty, and lacking powerful statistical methods such as the coalescent that account for historical patterns of divergence (Wakeley, 2004), all of which are issues with these studies, can bias conclusions.

Origin and timing of HIV

Phylogenetic analyses can bring to light dimensions of HIV evolution, such as “where and when” and even “how” infections are spreading across the globe, that are impossible to assess with other approaches (Grenfell et al., 2004; Holmes, 2004, 2007, 2009; Holmes and Drummond, 2007; Moya et al., 2004; Welch et al., 2005). In the following sections, we discuss these dimensions and how phylogenetics has led to our current understanding of HIV origin and geographic spread.

Geographic and host origins, where and how?

The origins of HIV have been controversial since the beginning of the epidemic. HIV-1 and HIV-2, both species belonging to the genus Lentivirus (Retroviridae), are distinguished on the basis of their genome organization and phylogenetic relationships, clinical characteristics, virulence, infectivity and geographic distribution. During the beginning of the HIV-1 epidemic, serological evidence pointed to African green monkeys (Chlorocebus spp.; agm) as carriers of an HIV-1-like virus, simian T-cell lymphotropic virus 3 [STLV-3, now Simian Immunodeficiency Virus (SIV) agm; Kanki et al. (1987)]. Serum from HIV-1-infected patients cross-reacted with STLV-3 proteins (Hirsch et al., 1986; Kanki et al., 1985a; Kanki et al., 1985b), and STLV-3-infected African green monkeys had a geographic distribution overlapping that of the HIV-1 epidemic, which led researchers to believe that HIV-1 jumped to humans from African green monkeys (Kanki et al., 1987). In contrast to the serological evidence, phylogenetics argued for a chimpanzee origin; however, scientists treated the results with caution and concluded that the evidence was not enough to prove a chimpanzee cross-species infection (Huet et al., 1990). The justifications stated for setting aside the phylogenetic evidence were that vpu gene divergence between HIV-1 and SIVcpz was high, few lentiviruses had been isolated from simian hosts, and SIV had low prevalence in chimpanzees (Huet et al., 1990).

Evidence of a simian origin is now clear, as similar lentiviruses have been found in more than 40 species of African primates (Bibollet-Ruche et al., 2004; Hahn et al., 2000; Van Heuverswyn and Peeters, 2007) and geographical correlation exists between SIVs hosted in different primate species and HIV (Peeters et al., 2008). Phylogenetic evidence has shed light on this subject, showing that HIV-1 and HIV-2 are the products of several cross-species transmissions to humans of chimpanzee (Pan troglodytes troglodytes) SIV (SIVcpz) and sooty mangabey (Cercocebus atys) SIV (SIVsm), respectively (Gao et al., 1999; Gao et al., 1992; Hahn et al., 2000; Huet et al., 1990; Plantier et al., 2009; Van Heuverswyn and Peeters, 2007) (Fig. 4). Moreover, the geographic ranges of SIVcpz and SIVsm correlate well with the African regions where HIV-1 and HIV-2 show great endemicity, e.g., sooty mangabeys are most abundant in the regions of West Africa where HIV-2 is highly prevalent and diverse; thus, HIV-2 likely emerged there (Chen et al., 1997; Chen et al., 1996; Santiago et al., 2005). The simian origin of HIV-2 was rapidly established, since the only species naturally infected with a closely related virus is C. atys (Chen et al., 1996). Furthermore, SIVsm cross-species transmission has occurred on multiple occasions, as demonstrated by phylogenetic analyses (Chen et al., 1997; Gao et al., 1994; Gao et al., 1992; Lemey et al., 2003).

Indications of natural chimpanzee infections have been increasingly corroborated by phylogenetic analyses. Support for “the chimpanzee hypothesis” began to accumulate when SIV was found in natural populations in West Africa, initially at extremely low prevalence (Santiago et al., 2002). Subsequently, scientists sequenced the entire genome of SIVcpz from a fecal sample of a wild chimpanzee in Tanzania (Santiago et al., 2003) and confirmed previous phylogenetic inferences based on the gag, pol and env genes. Moreover, epidemiological evidence has shown that SIVcpz can reach prevalence rates of 29% to 35% in some African chimpanzee communities (Keele et al., 2006). Recently, also in Tanzania, Keele et al. (2009) followed populations of chimpanzees over nine years and found typical AIDS-like features, e.g., increased death hazard for animals carrying SIVcpz, CD4+ T-cell depletion with high viral replication, and histopathological findings consistent with end-stage AIDS. Altogether, epidemiological, physiological and clinical evidence supports the early phylogenetic predictions of chimpanzee cross-species transmission to humans as the origin of HIV-1 and of chimpanzees as a natural reservoir for the virus.

Timing, when?

The timescale of the evolution of HIV-1 and HIV-2 has been estimated, which has allowed us to understand the circumstances surrounding the emergence of HIV and to test hypotheses regarding natural or artificial means of cross-species transmission. Initially, several hypotheses of “when” HIV came into human populations could not be tested in a reductionist framework. Among these, the so-called Oral Polio Vaccine (OPV) hypothesis (Hooper, 2003) stated that HIV was introduced into human populations by the use of inadvertently infected monkeys (advocates claimed chimpanzees) as a means of polio vaccine production. In fact, around 1960 African green monkeys were used to produce an attenuated polio vaccine (Plotkin, 2001). Thus, if the molecular timing of the HIV “jump” into humans matched this date, the result would be consistent with this hypothesis. On the other hand, if the timing of the HIV “jump” significantly predated 1960, then the OPV hypothesis would seem less likely.

As increasing amounts of data and more powerful computational/statistical approaches became available, the time to the most recent common ancestor (TMRCA) of HIV-1, whether it was in a human or a chimpanzee host, was estimated with increasing confidence (Korber et al., 2000; Salemi et al., 2001; Sharp et al., 2000). These studies used different tactics (strict and relaxed molecular clock analyses), yet obtained similar estimates for the HIV-1 M group radiation. Applications of these methods have led scientists to estimate that the M group originated near 1930, with a range, depending on the study, of 10-20 years on either side. It is worth noting that, although most estimates are consistent, they can be biased by recombination and by the scarcity of historical samples. Recombination, apart from violating the single ancestry assumption, may increase the apparent variation in rates among nucleotide sites and also tends to decrease genetic distances between sequences (Posada, 2001). On the other hand, the scarcity of historical samples makes the calibration of such methods difficult, resulting in estimates with wide confidence intervals.
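The logic of dating with a strict molecular clock can be illustrated with a root-to-tip regression: genetic distance from the root is regressed on sampling date, the slope estimates the substitution rate, and the date at which the fitted line crosses zero distance estimates the TMRCA. This is a simple exploratory technique, not the method used in the studies cited above, which relied on likelihood and Bayesian clock models. The function name is hypothetical, and the toy values are invented so that the fitted line crosses zero at 1930 by construction.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, tmrca): the slope in substitutions/site/year and the
    date at which the fitted line reaches zero distance.
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two dates and matching distances")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("sampling dates must not all be identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in the data")
    intercept = mean_d - rate * mean_t
    return rate, -intercept / rate


# Invented toy data: three sampling years and their root-to-tip distances.
dates = [1960.0, 1980.0, 2000.0]
distances = [0.06, 0.10, 0.14]
rate, tmrca = root_to_tip_regression(dates, distances)
print(f"rate = {rate:.4f} subs/site/year, TMRCA = {tmrca:.0f}")
```

With few and recent samples, small errors in the distances shift the intercept substantially, which is one way to see why sparse historical sampling leads to wide confidence intervals.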

The only archival samples available, DRC60 and ZR59, suggest extensive genetic diversity of HIV-1 in West Africa by 1960. In turn, divergence time estimates date to the 1920s, depending upon the coalescent tree model chosen (Worobey et al., 2008). Since divergence time estimates represent the TMRCA of just the isolates included in the analysis, it is likely that new, diverse archival sequences will yield even older divergence time estimates for the HIV-1 M group radiation. Hence, the OPV hypothesis can be ruled out because the molecular data suggest that HIV-1 group M isolates originated 30 years prior to the use of primates in OPV preparation (Korber et al., 2000; Worobey et al., 2008). In addition, it has not been possible to detect chimpanzee DNA in archival stocks of OPV (Berry et al., 2001; Blancou et al., 2001). Most likely, cross-species transmission can be explained by invoking socio-cultural factors during the postcolonial period in Africa (Chitnis et al., 2000).

Figure 4: Phylogenetic tree showing HIV cross-species transmission. The tree was built using 93 pol gene amino acid sequences from the Los Alamos database and the Bayesian approach implemented in MrBayes 3.1.2. Taxon names represent accession number, host species name and isolate designation. Red circles highlight potential cross-species transmissions. Representative sequences from HIV-1 and HIV-2 are also shown. The scale bar denotes amino acid changes as substitutions per site.

By similar means, the origin of HIV-2 has been dated to 1940 ± 16 for subtype A and to 1945 ± 14 for subtype B (Lemey et al., 2003). Moreover, phylogenetic inference has dated the introduction of HIV-1 clade B in North America to 1968 (1966-1970) (Gilbert et al., 2007; Pérez-Losada et al., 2010), which is consistent with the earliest known retrospective studies (Robbins et al., 2003).

It is worth mentioning that time estimates for deep (old) viral divergences based on molecular clock analyses can be biased towards the present if no external calibrations are used. This bias could arise from extreme saturation and from constraints that are hard to account for with substitution models, as has been shown using an island biogeography approach to SIV dating (Worobey et al., 2010).

The origin of HIV is far from being a resolved issue; rather, phylogenetics is currently providing more insights as new isolates are analyzed, especially isolates from non-human primates. The study of HIV relatives shows that similar processes shape their natural history. For instance, SIVcpz itself is the product of recombination between SIVrcm (red-capped mangabeys, Cercocebus torquatus) and SIVgsn (greater spot-nosed monkeys, Cercopithecus nictitans). This was revealed by strong discordance between topologies, which suggested a hybrid origin for SIVcpz (Bailes et al., 2003). Likewise, western gorilla SIV (SIVgor, from Gorilla gorilla gorilla) also seems to be the product of cross-species transmission (Fig. 4). Despite the small number of samples available, phylogenetic analyses suggest that SIVgor is closely related to SIVcpz from Pan troglodytes troglodytes and is also sister to HIV-1 group O (Takehisa et al., 2009). However, due to the still poor sampling of SIVgor, whether chimpanzees infected gorillas and humans, humans were infected first and then infected gorillas, or gorillas infected humans is yet to be determined (Fig. 4). Interestingly, a new HIV-1 group, P, has been proposed based on one isolate found in a Cameroonian woman. The phylogenetic placement of group P as the sister taxon to all SIVgor, but distinct from HIV-1 group O, could play a key role in testing hypotheses of human-gorilla transmission (Fig. 4) (Plantier et al., 2009).

Phylogenetic analysis can be very informative, but the accuracy of phylogenetic conclusions is highly dependent on the method chosen and the sampling strategy. As more lentivirus sequences from different locations and archival sequences become available, the question of the origin of HIV should converge on more reliable conclusions. Before we explore different aspects of HIV dynamics and its applications, we should add a cautionary note on the choice of genetic marker to capture transmission and other desired signals. Extensive debate exists concerning the choice of gene(s) in HIV phylogenetics (Hué et al., 2005a). The env gene is sometimes preferred on the basis of its high genetic variability; however, indications of convergent evolution in this region would preclude its use, since convergence violates the unique evolutionary history assumption made by phylogenetic methods. The pol gene has been suggested as a candidate as well; however, some researchers have been reluctant to use it given the number of drug resistance mutations associated with this region. Lemey et al. (2005b) showed that phylogenetic trees based on pol sequences, after excluding codons associated with resistance, were congruent with independent data on epidemiology and with trees based on env sequences. Similarly, criminal cases of HIV transmission that rely solely on phylogenetic evidence are precarious. Besides the inherent issues of model selection and phylogenetic inference, data availability also plays a major role. Some of the concerns are related to the direction of transmission (who infected whom), the availability of all involved sexual contacts, and the interpretation of the phylogeny given that certain individuals could be infected with more than one strain. Finally, convergent evolution can erroneously link individuals in the absence of any other independent source of evidence (Pillay et al., 2007).

Phylodynamics and HIV

The term phylodynamics was coined in reference to “the melding of immunodynamics, epidemiology, and evolutionary biology […]”, in particular for pathogens such as viruses, and across the whole breadth from within-host variation and immunity through transmission events and bottlenecks to global epidemiological dynamics (Grenfell et al., 2004). This is one of the most exciting and insightful fields in which phylogenetics is contributing to our understanding of virus evolution, and of HIV in particular. Viral population dynamics can be explored using phylogenetic, coalescent and other statistical methods to make historical inferences about temporal and spatial distributions. This is possible because of certain attributes of these viruses, such as high mutation rates, large population sizes and short generation times, and because genetic changes occur so fast that ecological and epidemiological processes leave marks on their genomes. The basic idea behind this approach is that phylogenies are shaped by immune selection, viral population sizes and spatial dynamics; thus, together with experimental data, it should be possible to tease apart individual contributions and identify the forces dominating pathogen evolution and behavior.

Although mainly focused on RNA viruses (Amore et al., 2010; Bennett et al., 2010; Holmes and Grenfell, 2009; Kerr et al., 2009; Mondini et al., 2010; Pérez-Losada et al., 2011; Pérez-Losada et al., 2010; Rambaut et al., 2008; Siebenga et al., 2010), phylodynamic approaches have also been applied to DNA viruses (Zehender et al., 2010a) and bacteria (Conlan et al., 2007; Pérez-Losada et al., 2007a; Pérez-Losada et al., 2007b; Pérez-Losada et al., 2005; Tazi et al., 2010).

Most of the methods available [see: Kuhner et al. (2009); Pérez-Losada et al. (2007c)] take advantage of the coalescent theory developed by Kingman on the basis of previous work by Wright and others (Kingman, 2000). Although the idea of a coalescent had been used several times in population genetics [reviewed in Tavaré (1984)], its development is credited to Kingman (Kingman, 1982) and, independently, to Hudson (Hudson, 1983) and Tajima (Tajima, 1983) [see (Wakeley, 2008) for a thorough discussion of coalescent theory]. The basic idea behind coalescent theory, as opposed to summary statistics or classic population genetics, is to explain the present state of a population by looking into its past. It is a realization of the Wright-Fisher neutral model of evolution that records the genealogical relationships among a random sample of population genetic data. The model has been further generalized to account for varying population size, different time scales, structure, recombination and selection (Nordborg, 2004).
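The backward-in-time logic of the coalescent can be illustrated with a minimal Python sketch. It assumes the standard constant-size Kingman coalescent, in which the waiting time while k lineages remain is exponentially distributed with rate k(k-1)/2 when time is measured in coalescent units. The function names and the seed are illustrative choices, not part of any of the cited software.

```python
import random


def simulate_coalescent_times(n, rng):
    """Return the waiting times T_n, ..., T_2 (coalescent units) for n sampled lineages."""
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0  # number of lineage pairs that can coalesce
        times.append(rng.expovariate(rate))
    return times


def expected_tmrca(n):
    """Analytical expectation of the time to the most recent common ancestor."""
    return 2.0 * (1.0 - 1.0 / n)


def mean_simulated_tmrca(n, replicates, seed=1):
    """Average the simulated tree height over many independent genealogies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        total += sum(simulate_coalescent_times(n, rng))
    return total / replicates


if __name__ == "__main__":
    n = 10
    print("expected TMRCA: ", expected_tmrca(n))
    print("simulated TMRCA:", mean_simulated_tmrca(n, 20000))
```

With many replicates the simulated mean tree height should approach the analytical value 2(1 - 1/n). Generalizations such as varying population size rescale these waiting times rather than changing the underlying logic.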

Within a coalescent framework, statistical advances in Bayesian inference regarding the use of time-stamped data (Drummond et al., 2002), models of population dynamics and the relaxation of molecular clock assumptions (Drummond et al., 2006; Drummond and Rambaut, 2007; Drummond et al., 2005) have greatly improved our understanding of HIV patterns and processes on a temporal scale. Recent advances in sequence dating provide tools to estimate unknown sequence ages, as these can be jointly or individually estimated under a full probabilistic framework (Shapiro et al., 2011). In addition, this temporal framework can be enhanced by explicitly modeling spatial dispersion rates in a phylogeographic context (Lemey et al., 2009; Lemey et al., 2010).

Using phylogenetic diffusion models, one can infer ancestral locations for the sampled sequences in either discrete or continuous space. Moreover, the most parsimonious explanation for the diffusion process is obtained under this Bayesian framework by the implementation of Bayesian Stochastic Search Variable Selection (BSSVS). This methodology has several advantages over previous maximum likelihood and maximum parsimony methods, such as fitting a diffusion model simultaneously with a substitution model, incorporating branch lengths in ancestral state reconstruction, and accommodating uncertainty in both the phylogeny and the diffusion process. Applications of this method are rapidly increasing, as recent studies in dengue (Raghwani et al., 2011), influenza (Nelson et al., 2011), Staphylococcus aureus (Gray et al., 2011), and, of course, HIV-1 (Esbjörnsson et al., 2011; Skar et al., 2011) indicate. More recently, new methods have been adapted from systematics to estimate the basic reproductive number (R0) (Stadler, 2010; Stadler et al., 2011). The R0 parameter has traditionally been used in epidemiology to determine whether or not an infectious agent can spread in a population, i.e., if R0 > 1 the infectious agent will spread in the population and if R0 < 1 the infection will die out. Given the amount of viral sequence data and the ease of data acquisition, estimating R0 from genetic data can become a novel tool for molecular epidemiologists. The model uses a birth-death process (as in species phylogenetics) instead of a coalescent model, in which a birth event is equivalent to a new infection and a death event to various phenomena such as death, treatment, eradication, etc.
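The threshold behavior of R0 under a birth-death view can be sketched in a few lines of Python. This sketch assumes the simplest parameterization, in which R0 is the per-lineage rate of new infections divided by the rate at which an infected individual stops being infectious (through death, treatment, and so on). The parameter names birth_rate and removal_rate are hypothetical, and the code only illustrates the threshold; it is not the estimation procedure of the cited methods, which infer these rates from a phylogeny.

```python
def basic_reproductive_number(birth_rate, removal_rate):
    """R0 as expected secondary infections: infection rate times mean infectious period."""
    if birth_rate < 0 or removal_rate <= 0:
        raise ValueError("birth_rate must be >= 0 and removal_rate must be > 0")
    return birth_rate / removal_rate


def epidemic_outcome(r0):
    """Classify the expected outcome from the R0 threshold described in the text."""
    if r0 > 1:
        return "spreads"
    if r0 < 1:
        return "dies out"
    return "critical"


if __name__ == "__main__":
    # Hypothetical rates per unit time, chosen only for illustration.
    r0 = basic_reproductive_number(birth_rate=0.5, removal_rate=0.25)
    print(r0, epidemic_outcome(r0))
```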

In order to exploit most of these methods, sampling strategy is paramount, and arbitrary collection of sequences from GenBank or the Los Alamos database is probably not adequate (Stack et al., 2010). Next-generation sequencing approaches (e.g., Bybee et al., 2011) allow for more comprehensive and cost-efficient sampling in future studies of HIV diversity, in place of haphazard sampling of sequences made available by other studies in public databases. Clearly, for greatest utility in studying HIV, sequence data submitted to public databases should include geographic, clinical, and, especially, temporal information (see http://datadryad.org for storage options for such data).

Transmission dynamics

Transmission dynamics have been studied thoroughly, in particular through the reconstruction of transmission networks and the loss of diversity at the transmission event. Well-known examples of HIV transmission are cases involving legal issues, such as the Florida dentist case (Ou et al., 1992) and the Louisiana attempted murder trial (Metzker et al., 2002). In both of these cases, phylogenetic evidence was concordant with the hypothesis of transmission from the defendant to the victims. In fact, the Louisiana case was the first in the U.S. in which phylogenetic analyses were used in a criminal court case. In this case, different substitution models, genes (env and pol), and optimality criteria were used in linking the defendant’s and the victim’s HIV-1 variants. Additionally, genetic signatures of drug resistance were used as indications of transmission events. Other examples of transmission reconstruction include a Swedish rape case (Albert et al., 1994) and a healthcare-related case in Baltimore (Holmes et al., 1993). It is also interesting to highlight a recent report of highly divergent HIV variants transmitted by a single donor to two other individuals on the same evening (English et al., 2011).

Transmission dynamics have also been studied across transmission events, because the reduction in genetic diversity at transmission offers an opportunity for treatment (Edwards et al., 2006; Fischer et al., 2010). Most of the work has focused on monitoring discordant couples (i.e., couples in which one partner is HIV positive and the other is not) and on testing whether genetic diversity is reduced to a single virus at the transmission event. Using phylodynamic methods, Edwards et al. (2006) showed that the reduction in genetic diversity (<1%) is no different between horizontal (homo- or heterosexual) and vertical (mother-to-child) transmission. Understanding genetic diversity at transmission events has therapeutic implications, as small, less diverse populations are strongly influenced by genetic drift, which decreases the chance of transmission of high-fitness variants.
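To make the notion of a loss of genetic diversity at transmission concrete, the following Python sketch computes a simple summary, the mean pairwise proportion of differing sites among aligned sequences, for a donor and a recipient sample. This is only an illustrative summary statistic, not the coalescent-based estimation used in the studies cited above, and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, ignoring gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip alignment gaps
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def mean_pairwise_diversity(sequences):
    """Average p-distance over all pairs of sequences in the sample."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy alignments: a diverse donor sample and a homogeneous recipient sample.
    donor = ["ACGTACGT", "ACGTACGA", "ACCTACGT"]
    recipient = ["ACGTACGT", "ACGTACGT"]
    print("donor diversity:    ", mean_pairwise_diversity(donor))
    print("recipient diversity:", mean_pairwise_diversity(recipient))
```

In this toy case the recipient sample has zero diversity, the pattern expected if a single variant founded the new infection.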

Indications that single virus variants were transmitted horizontally came first from studies using Sanger sequencing and single-genome amplification coupled with phylogenetic and mathematical modeling (Keele et al., 2008a; Salazar-González et al., 2008), and were further supported by the enhanced capabilities of ultra-deep sequencing (UDS). UDS revealed that early HIV variants explored extensive sequence space within epitope regions. Interestingly, as the infection proceeded, reversion to the canonical subtype sequence occurred at positions under immune pressure, but not at positions that were not under pressure, even in the earliest samples, suggesting that immune pressure is present earlier than previously known (Fischer et al., 2010). Thus, phylogenetic analyses can be very insightful in practical situations such as court trials, and also in situations of medical importance such as characterizing genetic diversity for potential therapy development.

Population dynamics

While the population dynamics of HIV have been well characterized (Coffin, 1995), phylogenetic studies have added greatly to our understanding, especially of the dynamic nature of genetic diversity over the course of infection within a host individual and across transmission events. Much of the phylodynamic research has focused on associations between clinical/epidemiological aspects and genetic diversity, such as transmission/spread dynamics among men who have sex with men [MSM; Lewis et al. (2008)], which revealed episodic clusters of transmission, and among heterosexual patients, in which transmission dynamics appeared to be slower than in MSM (Hughes et al., 2009). Note that these studies used large sample sizes to draw their conclusions, a desirable feature for capturing the phylodynamic signature in the data.

Phylodynamic studies have focused on the temporal dynamics of transmission and its frequency. In Italy, a recent study reached conclusions similar to those of Hué et al. (2005b) in the U.K., in that currently circulating subtype B HIV was introduced multiple times into MSM populations. However, when comparing inter-node intervals between transmission clades in Lewis et al. (2008) and Zehender et al. (2010b), the time between transmission events differed, with medians of 14 months and 30 months for the UK and Italian studies, respectively. These results could reflect actual differences in transmission dynamics or could be an artifact of the small sample size and the restricted area sampled.

Other studies have explored dissemination patterns and the possible transmission of particular HIV strains between risk groups. Recently, Liao et al. (2009) looked at spatial and temporal patterns to explain the distribution of the CRF01_AE variant in Vietnam. Their results suggested that CRF01_AE came from Thailand and that, within the Vietnamese population, it has been transmitted from heterosexual patients to intravenous drug users (IDUs). Similar work has been done with other subtypes and recombinant forms in Asia (Tee et al., 2008), South America (Bello et al., 2010), Africa (Gray et al., 2009; Lemey et al., 2003) and North America (Gilbert et al., 2007), among others.

Patterns of diversity among populations, and how they compare to epidemiological data, have also been studied. Pérez-Losada et al. (2010) studied HIV-1 envelope gene sequence variation in cohorts of vaccinated and placebo-treated patients in North America. Their phylodynamic analysis showed that genetic diversity remained nearly constant from approximately the 1970s to date, suggesting that viral populations had already expanded around ten years before HIV was detected in the US (Fig. 5). In addition, despite a drop in the number of cases since the 1990s, genetic diversity has remained high through time. Previously, Robbins et al. (2003) showed similar results in a different cohort of HIV-1 positive US subjects. Although they used a smaller dataset and parametric and non-parametric methods that did not account for phylogenetic uncertainty, they reached similar demographic conclusions. In a similar study, phylodynamic analyses revealed that CRF01_AE was cryptically circulating in the Thai HIV-1 population for 3-10 years before it was detected in 1989 [Fig. 5; (Pérez-Losada et al., 2011)]. In both the North American and Thai studies, historical estimates of genetic diversity correlate well with known epidemiological data.
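The way a genealogy carries information about past genetic diversity can be illustrated with the classic skyline estimator, a simpler relative of the Bayesian Skyline Plot. In the classic skyline, the estimate for the interval during which k lineages exist is the interval duration multiplied by k(k-1)/2. The Python sketch below assumes a single fixed genealogy with contemporaneous samples and takes a hypothetical list of inter-coalescent interval durations as input. It does not group intervals or integrate over genealogies as the Bayesian Skyline Plot does.

```python
def classic_skyline(interval_durations):
    """Classic skyline estimates from inter-coalescent intervals.

    interval_durations[j] is the time during which (n - j) lineages exist,
    ordered from the present backward, with n = len(interval_durations) + 1.
    Returns a list of (lineages, duration, estimate) tuples, where the estimate
    is in the same time units as the durations.
    """
    n = len(interval_durations) + 1
    estimates = []
    for j, duration in enumerate(interval_durations):
        k = n - j                      # lineages present in this interval
        pairs = k * (k - 1) / 2.0      # pairs of lineages that could coalesce
        estimates.append((k, duration, duration * pairs))
    return estimates


if __name__ == "__main__":
    # Hypothetical interval durations for a genealogy of four sequences.
    for k, duration, estimate in classic_skyline([0.1, 0.5, 2.0]):
        print(f"{k} lineages, duration {duration}: estimate {estimate:.2f}")
```

Short intervals near the tips combined with long intervals near the root produce larger estimates deeper in the past, and vice versa; this is the signal that skyline-type methods read from a genealogy.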

Figure 5: HIV-1 past population dynamics in North America (green) and Thailand (gray). Plots were built using the env gene and the Bayesian Skyline Plot model. The analyses primarily revealed that genetic diversity (thick lines) has remained high through time despite the number of AIDS cases (thin lines) dropping considerably and that HIV was circulating years before the first AIDS cases were detected.

Within-host dynamics

Within-host evolution of HIV has proven important for understanding clinical processes associated with disease progression. Within a host, the genetic diversity of HIV plasma isolates is limited at any single time point but increases over the course of infection, a pattern similar to that observed in population phylogenies of influenza virus (Grenfell et al., 2004; Rambaut et al., 2008; Shankarappa et al., 1999). HIV evolution could also differ among tissues. For example, phylodynamic analyses of post mortem brain tissues have revealed that HIV evolves at different rates in different brain compartments. However, this is apparently not related to selective pressure, but rather to inherent drift associated with macrophage-tropic viral expansion after immune failure (Salemi et al., 2005).

Within-host variation, commonly misnamed as quasispecies (Holmes, 2009), has been addressed under a phylodynamic framework for co-receptor usage. For example, Salemi et al. (2007) explored co-receptor usage dynamics in tissue and peripheral blood mononuclear cells (PBMC) and found a temporal structure between CCR5-tropic (R5) virus and the appearance of CXCR4-tropic (X4) variants, i.e., the majority of X4 virus found in thymus tissue seemed to come from PBMC viruses. Conversely, substitution rates of R5 and X4 sequences were not significantly different, supporting the idea that X4 amplification could be due to the availability of target cells. As recognized by the authors, phylodynamic studies of HIV subpopulations can be greatly enlightening, but they have some limitations, as human tissue samples are not easy to obtain and could involve ethical issues. As new animal models (e.g., new humanized mice) for HIV infection become available (Denton and Garcia, 2009), viral subpopulations can be studied over the course of 1-2 years of infection (Berges et al., 2010). Nonetheless, despite the potential of these models, little has been done in experimental HIV evolution, owing to limited interaction between evolutionary and molecular biologists.

Although the field of phylodynamics is young, its statistical tools are key in linking epidemiological and evolutionary information (Tibayrenc, 2005). Surveillance programs would greatly benefit from the implementation of these approaches, as it would be possible to study the impact of vaccines and chemotherapeutic treatments on population genetic diversity. Likewise, such studies would allow for the identification of novel risk groups and indicate changes (or the lack thereof) in population dynamics as a result of intervention strategies. At the same time, the creation of specialized databases to collect phylodynamically informative data (randomly sampled HIV sequences across a broad target area of HIV infection, to monitor HIV diversity and associated changes) would greatly aid the implementation of these approaches.

Future prospects

Phylogenetics has a new vigor. The development of new robust statistical frameworks such as Bayesian inference (Huelsenbeck et al., 2001) has allowed the testing of complex hypotheses while accounting for the inherent uncertainties of historical and unrepeatable processes. Currently, the implementation of these methods in biologist-accessible software has empowered scientists to test biologically meaningful scenarios.

To a certain extent, the great phylogenetic questions in HIV evolution have been addressed, e.g., HIV origin, evolutionary driving forces, and within- and among-host variation. However, phylogenetics opens the door to new questions and new insights in HIV. For example, are recombination/substitution rates impacted by antiviral treatments? Do transmission routes influence the outcome of infection? How does compartmentalization of HIV strains evolve within an individual? These are all questions motivated by phylogenetic approaches. Similarly, large collaborative efforts such as the UK HIV Drug Resistance Collaboration provide opportunities for addressing key questions in phylodynamics and drug resistance through the study of longitudinal and/or retrospective data from large cohorts. In summary, phylogenetics is an ever-evolving field that promises to give more insights into pathogen evolution, mainly in pathogens such as HIV that form measurably evolving populations (Drummond et al., 2003).

References

Albert, J., Wahlberg, J., Leitner, T., Escanilla, D., Uhlen, M., 1994. Analysis of a rape case by direct sequencing of the human immunodeficiency virus type 1 pol and gag genes. J. Virol. 68, 5918-5924.

Amore, G., Bertolotti, L., Hamer, G.L., Kitron, U.D., Walker, E.D., Ruiz, M.O., Brawn, J.D., Goldberg, T.L., 2010. Multi-year evolutionary dynamics of West Nile virus in suburban Chicago, USA, 2005-2007. Philos T R Soc B 365, 1871-1878.

Bailes, E., Gao, F., Bibollet-Ruche, F., Courgnaud, V., Peeters, M., Marx, P.A., Hahn, B.H., Sharp, P.M., 2003. Hybrid Origin of SIV in Chimpanzees. Science 300, 1713-.

Bailey, J.R., Sedaghat, A.R., Kieffer, T., Brennan, T., Lee, P.K., Wind-Rotolo, M., Haggerty, C.M., Kamireddi, A.R., Liu, Y., Lee, J., Persaud, D., Gallant, J.E., Cofrancesco, J., Quinn, T.C., Wilke, C.O., Ray, S.C., Siliciano, J.D., Nettles, R.E., Siliciano, R.F., 2006. Residual human immunodeficiency virus type 1 viremia in some patients on Antiretroviral therapy is dominated by a small number of invariant clones rarely found in circulating CD4(+) T cells. J Virol 80, 6441-6457.

Barre-Sinoussi, F., Chermann, J., Rey, F., Nugeyre, M., Chamaret, S., Gruest, J., Dauguet, C., Axler-Blin, C., Vezinet-Brun, F., Rouzioux, C., Rozenbaum, W., Montagnier, L., 1983. Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS). Science 220, 868-871.

Beer, B., Bailes, E., Goeken, R., Dapolito, G., Coulibaly, C., 1999. Simian immunodeficiency virus (SIV) from sun-tailed monkeys (Cercopithecus solatus): evidence for host-dependent evolution of SIV within the C. lhoesti superspecies. J. Virol. 73, 7734.

Bello, G., Aulicino, P.C., Ruchansky, D., Guimaraes, M.L., Lopez-Galindez, C., Casado, C., Chiparelli, H., Rocco, C., Mangano, A., Sen, L., Morgado, M.G., 2010. Phylodynamics of HIV-1 Circulating Recombinant Forms 12_BF and 38_BF in Argentina and Uruguay. Retrovirology 7, -.

Bennett, S.N., Drummond, A.J., Kapan, D.D., Suchard, M.A., Munoz-Jordan, J.L., Pybus, O.G., Holmes, E.C., Gubler, D.J., 2010. Epidemic Dynamics Revealed in Dengue Evolution. Mol Biol Evol 27, 811-818.

Berges, B.K., Akkina, S.R., Remling, L., Akkina, R., 2010. Humanized Rag2-/- [gamma]c-/- (RAG-hu) mice can sustain long-term chronic HIV-1 infection lasting more than a year. Virology 397, 100-103.

Bernard, E., Azad, Y., Vandamme, A., Weait, M., Geretti, A., 2007. HIV forensics: pitfalls and acceptable standards in the use of phylogenetic analysis as evidence in criminal investigations of HIV transmission. HIV Medicine 8, 382-387.

Berry, N., Davis, C., Jenkins, A., Wood, D., Minor, P., Schild, G., Bottiger, M., Holmes, H., Almond, N., 2001. Vaccine safety: Analysis of oral polio vaccine CHAT stocks. Nature 410, 1046-1047.

Bhatt, S., Holmes, E.C., Pybus, O.G., 2011. The genomic rate of molecular adaptation of the human influenza A virus. Mol Biol Evol 28, 2443-2451.

Bibollet-Ruche, F., Bailes, E., Gao, F., Pourrut, X., Barlow, K., 2004. New simian immunodeficiency virus infecting De Brazza's monkeys (Cercopithecus neglectus): evidence for a Cercopithecus monkey virus clade. J. Virol. 78, 7748.

Blancou, P., Vartanian, J.-P., Christopherson, C., Chenciner, N., Basilico, C., Kwok, S., Wain-Hobson, S., 2001. Polio vaccine samples not linked to AIDS. Nature 410, 1045-1046.

Bretscher, M.T., Althaus, C.L., Muller, V., Bonhoeffer, S., 2004. Recombination in HIV and the evolution of drug resistance: for better or for worse? Bioessays 26, 180-188.

Brown, A.J.L., 1997. Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proceedings of the National Academy of Sciences of the United States of America 94, 1862-1865.

Buendia, P., Cadwallader, B., DeGruttola, V., 2009. A phylogenetic and Markov model approach for the reconstruction of mutational pathways of drug resistance. Bioinformatics 25, 2522-2529.

Buonagurio, D., Nakada, S., Parvin, J., Krystal, M., Palese, P., Fitch, W., 1986. Evolution of human influenza A viruses over 50 years: rapid, uniform rate of change in NS gene. Science 232, 980-982.

Burton, G.F., Keele, B.F., Estes, J.D., Thacker, T.C., Gartner, S., 2002. Follicular dendritic cell contributions to HIV pathogenesis. Semin Immunol 14, 275-284.

Bush, R.M., Bender, C.A., Subbarao, K., Cox, N.J., Fitch, W.M., 1999. Predicting the Evolution of Human Influenza A. Science 286, 1921-1925.

Bybee, S.M., Bracken-Grissom, H., Haynes, B.D., Hermansen, R.A., Byers, R.L., Clement, M.J., Udall, J.A., Wilcox, E.R., Crandall, K.A., 2011. Targeted amplicon sequencing (TAS): A scalable next-gen approach to multi-locus, multi-taxa phylogenetics. Genome Biology and Evolution.

Carvajal-Rodríguez, A., Crandall, K.A., Posada, D., 2006. Recombination estimation under complex evolutionary models with the coalescent composite-likelihood method. Mol Biol Evol 23, 817-827.

Carvajal-Rodríguez, A., Crandall, K.A., Posada, D., 2007. Recombination favors the evolution of drug resistance in HIV-1 during antiretroviral therapy. Infection, Genetics and Evolution 7, 476-483.

Carvajal-Rodríguez, A., Posada, D., Pérez-Losada, M., Keller, E., Abrams, E.J., Viscidi, R.P., Crandall, K.A., 2008. Disease progression and evolution of the HIV-1 env gene in 24 infected infants. Infect Genet Evol 8, 110-120.

Chen, L.M., Perlina, A., Lee, C.J., 2004. Positive selection detection in 40,000 human immunodeficiency virus (HIV) type 1 sequences automatically identifies drug resistance and positive fitness mutations in HIV protease and reverse transcriptase. J Virol 78, 3722-3732.

Chen, Z., Luckay, A., Sodora, D., Telfer, P., Reed, P., Gettie, A., Kanu, J., Sadek, R., Yee, J., Ho, D., Zhang, L., Marx, P., 1997. Human immunodeficiency virus type 2 (HIV-2) seroprevalence and characterization of a distinct HIV-2 genetic subtype from the natural range of simian immunodeficiency virus-infected sooty mangabeys. J. Virol. 71, 3953-3960.

Chen, Z., Telfier, P., Gettie, A., Reed, P., Zhang, L., Ho, D., Marx, P., 1996. Genetic characterization of new West African simian immunodeficiency virus SIVsm: geographic clustering of household-derived SIV strains with human immunodeficiency virus type 2 subtypes and genetically diverse viruses from a single feral sooty mangabey troop. J. Virol. 70, 3617-3627.

Chitnis, A., Rawls, D., Moore, J., 2000. Origin of HIV Type 1 in Colonial French Equatorial Africa? Aids Res Hum Retrov 16, 5-8.

Chomont, N., El-Far, M., Ancuta, P., Trautmann, L., Procopio, F.A., Yassine-Diab, B., Boucher, G., Boulassel, M.-R., Ghattas, G., Brenchley, J.M., Schacker, T.W., Hill, B.J., Douek, D.C., Routy, J.-P., Haddad, E.K., Sekaly, R.-P., 2009. HIV reservoir size and persistence are driven by T cell survival and homeostatic proliferation. Nat Med 15, 893-900.

Chun, T.-W., Carruth, L., Finzi, D., Shen, X., DiGiuseppe, J.A., Taylor, H., Hermankova, M., Chadwick, K., Margolick, J., Quinn, T.C., Kuo, Y.-H., Brookmeyer, R., Zeiger, M.A., Barditch-Crovo, P., Siliciano, R.F., 1997. Quantification of latent tissue reservoirs and total body viral load in HIV-1 infection. Nature 387, 183-188.

Clavel, F., Hance, A.J., 2004. HIV Drug Resistance. New England Journal of Medicine 350, 1023-1035.

Coffin, J.M., 1979. Structure, Replication, and Recombination of Retrovirus Genomes: Some Unifying Hypotheses. Journal of General Virology 42, 1-26.

Coffin, J.M., 1995. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267, 483-489.

Coffin, J.M., 1999. Molecular biology of HIV. In: Crandall, K.A. (Ed.), The Evolution of HIV. Johns Hopkins University Press, Baltimore, MD, pp. 3-40.

Conlan, A.J.K., Coward, C., Grant, A.J., Maskell, D.J., Gog, J.R., 2007. Campylobacter jejuni colonization and transmission in broiler chickens: a modelling perspective. J R Soc Interface 4, 819-829.

Crandall, K.A., 1995. Intraspecific Phylogenetics - Support for Dental Transmission of Human-Immunodeficiency-Virus. J Virol 69, 2351-2356.

Crandall, K.A., Kelsey, C.R., Imamichi, H., Lane, H.C., Salzman, N.P., 1999. Parallel evolution of drug resistance in HIV: Failure of nonsynonymous/synonymous substitution rate ratio to detect selection. Mol Biol Evol 16, 372-382.

Crandall, K.A., Templeton, A.R., 1993. Empirical tests of some predictions from coalescent theory with applications to intraspecific phylogeny reconstruction. pp. 959-969.

Dahl, V., Josefsson, L., Palmer, S., 2010. HIV reservoirs, latency, and reactivation: Prospects for eradication. Antivir Res 85, 286-294.

Delobel, P., Sandres-Saune, K., Cazabat, M., L'Faqihi, F.E., Aquilina, C., Obadia, M., Pasquier, C., Marchou, B., Massip, P., Izopet, J., 2005. Persistence of distinct HIV-1 populations in blood monocytes and naive and memory CD4 T cells during prolonged suppressive HAART. Aids 19, 1739-1750.

Denton, P., Garcia, J., 2009. Novel humanized murine models for HIV research. Current HIV/AIDS Reports 6, 13-19.

Diaz, R., Sabino, E., Mayer, A., Mosley, J., Busch, M., 1995. Dual human immunodeficiency virus type 1 infection and recombination in a dually exposed transfusion recipient. The Transfusion Safety Study Group. J. Virol. 69, 3273-3281.

Drake, J.W., Charlesworth, B., Charlesworth, D., Crow, J.F., 1998. Rates of Spontaneous Mutation. Genetics 148, 1667-1686.

Drummond, A.J., Ho, S.Y.W., Phillips, M.J., Rambaut, A., 2006. Relaxed phylogenetics and dating with confidence. Plos Biol 4, 699-710.

Drummond, A.J., Nicholls, G.K., Rodrigo, A.G., Solomon, W., 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161, 1307-1320.

Drummond, A.J., Pybus, O.G., Rambaut, A., Forsberg, R., Rodrigo, A.G., 2003. Measurably evolving populations. Trends in ecology & evolution (Personal edition) 18, 481-488.

Drummond, A.J., Rambaut, A., 2007. BEAST: Bayesian evolutionary analysis by sampling trees. Bmc Evol Biol 7, 214.

Drummond, A.J., Rambaut, A., Shapiro, B., Pybus, O.G., 2005. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol 22, 1185-1192.

Edwards, C.T.T., Holmes, E.C., Wilson, D.J., Viscidi, R.P., Abrams, E.J., Phillips, R.E., Drummond, A.J., 2006. Population genetic estimation of the loss of genetic diversity during horizontal transmission of HIV-1. Bmc Evol Biol 6, 28.

English, S., Katzourakis, A., Bonsall, D., Flanagan, P., Duda, A., Fidler, S., Weber, J., McClure, M., Investigators, S.T., Phillips, R., Frater, J., 2011. Phylogenetic analysis consistent with a clinical history of sexual transmission of HIV-1 from a single donor reveals transmission of highly distinct variants. Retrovirology 8, 54.

Esbjörnsson, J., Mild, M., Månsson, F., Norrgren, H., Medstrand, P., 2011. HIV-1 Molecular Epidemiology in Guinea-Bissau, West Africa: Origin, Demography and Migrations. PLoS ONE 6, e17025.

Essex, M., 1994. Simian Immunodeficiency Virus in People. N Engl J Med 330, 209-210.

Finzi, D., Hermankova, M., Pierson, T., Carruth, L.M., Buck, C., Chaisson, R.E., Quinn, T.C., Chadwick, K., Margolick, J., Brookmeyer, R., Gallant, J., Markowitz, M., Ho, D.D., Richman, D.D., Siliciano, R.F., 1997. Identification of a Reservoir for HIV-1 in Patients on Highly Active Antiretroviral Therapy. Science 278, 1295-1300.

Fischer, W., Ganusov, V.V., Giorgi, E.E., Hraber, P.T., Keele, B.F., Leitner, T., Han, C.S., Gleasner, C.D., Green, L., Lo, C.-C., Nag, A., Wallstrom, T.C., Wang, S., McMichael, A.J., Haynes, B.F., Hahn, B.H., Perelson, A.S., Borrow, P., Shaw, G.M., Bhattacharya, T., Korber, B.T., 2010. Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing. PLoS ONE 5, e12303.

Fox, C.W., Roff, D.A., Fairbairn, D.J., 2001. Evolutionary Ecology: Concepts and Case Studies. Oxford University Press, USA.

Frahm, N., Nickle, D.C., Linde, C.H., Cohen, D.E., Zuñiga, R., Lucchetti, A., Roach, T., Walker, B.D., Allen, T.M., Korber, B.T., Mullins, J.I., Brander, C., 2008. Increased detection of HIV-specific T cell responses by combination of central sequences with comparable immunogenicity. Aids 22, 447-456.

Gallo, R., Sarin, P., Gelmann, E., Robert-Guroff, M., Richardson, E., Kalyanaraman, V., Mann, D., Sidhu, G., Stahl, R., Zolla-Pazner, S., Leibowitch, J., Popovic, M., 1983. Isolation of human T-cell leukemia virus in acquired immune deficiency syndrome (AIDS). Science 220, 865-867.

Gao, F., Bailes, E., Robertson, D.L., Chen, Y., Rodenburg, C.M., Michael, S.F., Cummins, L.B., Arthur, L.O., Peeters, M., Shaw, G.M., Sharp, P.M., Hahn, B.H., 1999. Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397, 436-441.

Gao, F., Yue, L., Robertson, D.L., Hill, S.C., Hui, H., Biggar, R.J., Neequaye, A.E., Whelan, T.M., Ho, D.D., Shaw, G.M., 1994. Genetic diversity of human immunodeficiency virus type 2: evidence for distinct sequence subtypes with differences in virus biology. J. Virol. 68, 7433-7447.

Gao, F., Yue, L., White, A.T., Pappas, P.G., Barchue, J., Hanson, A.P., Greene, B.M., Sharp, P.M., Shaw, G.M., Hahn, B.H., 1992. Human infection by genetically diverse SIVSM-related HIV-2 in West Africa. Nature 358, 495-499.

Gilbert, M.T.P., Rambaut, A., Wlasiuk, G., Spira, T.J., Pitchenik, A.E., Worobey, M., 2007. The emergence of HIV/AIDS in the Americas and beyond. Proceedings of the National Academy of Sciences of the United States of America 104, 18566-18570.

Golovkina, T.V., Jaffe, A.B., Ross, S.R., 1994. Coexpression of exogenous and endogenous mouse mammary tumor virus RNA in vivo results in viral recombination and broadens the virus host range. J. Virol. 68, 5019-5026.

Goodrich, D.W., Duesberg, P.H., 1990. Retroviral recombination during reverse transcription. Proceedings of the National Academy of Sciences 87, 2052-2056.

Gray, R.R., Tatem, A.J., Johnson, J.A., Alekseyenko, A.V., Pybus, O.G., Suchard, M.A., Salemi, M., 2011. Testing Spatiotemporal Hypothesis of Bacterial Evolution Using Methicillin-Resistant Staphylococcus aureus ST239 Genome-wide Data within a Bayesian Framework. Mol Biol Evol 28, 1593-1603.

Gray, R.R., Tatem, A.J., Lamers, S., Hou, W., Laeyendecker, O., Serwadda, D., Sewankambo, N., Gray, R.H., Wawer, M., Quinn, T.C., Goodenow, M.M., Salemi, M., 2009. Spatial phylodynamics of HIV-1 epidemic emergence in east Africa. Aids 23, F9-F17.

Grenfell, B., Pybus, O., Gog, J., Wood, J., Daly, J., 2004. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327.

Hahn, B.H., Shaw, G.M., De Cock, K.M., Sharp, P.M., 2000. AIDS as a Zoonosis:

Scientific and Public Health Implications. Science 287, 607-614.

Hermankova, M., Ray, S.C., Ruff, C., Powell-Davis, M., Ingersoll, R., D'Aquila, R.T.,

Quinn, T.C., Siliciano, R.F., Persaud, D., 2001. HIV-1 drug resistance profiles in children and adults with viral load of < 50 copies/mL receiving combination therapy. Jama-J Am

Med Assoc 286, 196-207.

Hillis, D.M., Huelsenbeck, J.P., 1994. Support for dental HIV transmission. Nature 369,

24-25.

Hirsch, V., Riedel, N., Kornfeld, H., Kanki, P.J., Essex, M., Mullins, J.I., 1986. Cross-reactivity to human T-lymphotropic virus type III/lymphadenopathy-associated virus and molecular cloning of simian T-cell lymphotropic virus type III from African green monkeys. Proc Natl Acad Sci U S A. 83, 9754-9758.

Holmes, E.C., 2004. The phylogeography of human viruses. Mol. Ecol. 13, 745.

Holmes, E.C., 2007. Viral evolution in the genomic age. PLoS Biol. 5, e278.

Holmes, E.C., 2009. The Evolution and Emergence of RNA Viruses. Oxford University

Press, New York, NY, USA.

Holmes, E.C., Drummond, A.J., 2007. The evolutionary genetics of viral emergence.

Curr Top Microbiol 315, 51-66.

Holmes, E.C., Grenfell, B.T., 2009. Discovering the Phylodynamics of RNA Viruses.

Plos Comput Biol 5, -.

Holmes, E.C., Zhang, L.Q., Robertson, P., Cleland, A., Harvey, E., Simmonds, P.,

Brown, A.J.L., 1995. The Molecular Epidemiology of Human Immunodeficiency Virus

Type 1 in Edinburgh. The Journal of Infectious Diseases 171, 45-53.

Holmes, E.C., Zhang, L.Q., Simmonds, P., Rogers, A.S., Brown, A.J.L., 1993. Molecular

Investigation of Human-Immunodeficiency-Virus (Hiv) Infection in a Patient of an Hiv-

Infected Surgeon. J Infect Dis 167, 1411-1414.

Hooper, E., 2003. The River: A Journey Back to the Source of HIV and AIDS. Penguin,

London.

Hudson, R.R., 1983. Properties of a neutral allele model with intragenic recombination.

Theoretical Population Biology 23, 183-201.

Hué, S., Clewley, J.P., Cane, P.A., Pillay, D., 2005a. Investigation of HIV-1 transmission events by phylogenetic methods: requirement for scientific rigour. Aids 19, 449-450.

Hué, S., Pillay, D., Clewley, J.P., Pybus, O.G., 2005b. Genetic analysis reveals the complex structure of HIV-1 transmission within defined risk groups. Proceedings of the

National Academy of Sciences of the United States of America 102, 4425-4429.

Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P., 2001. Bayesian Inference of

Phylogeny and Its Impact on Evolutionary Biology. Science 294, 2310-2314.

Huet, T., Cheynier, R., Meyerhans, A., Roelants, G., Wain-Hobson, S., 1990. Genetic organization of a chimpanzee lentivirus related to HIV-1. Nature 345, 356-359.

Hughes, G.J., Fearnhill, E., Dunn, D., Lycett, S.J., Rambaut, A., Brown, A.J.L.,

UK HIV Drug Resistance Collaboration, 2009. Molecular Phylodynamics of the Heterosexual HIV

Epidemic in the United Kingdom. Plos Pathog 5, -.

Jetzt, A.E., Yu, H., Klarmann, G.J., Ron, Y., Preston, B.D., Dougherty, J.P., 2000. High

Rate of Recombination throughout the Human Immunodeficiency Virus Type 1 Genome.

J Virol 74, 1234-1240.

Jobes, D.V., Daoust, M., Nguyen, V.T., Padua, A., Sinangil, F., Pérez-Losada, M.,

Crandall, K.A., Oliphant, T., Posada, D., Rambaut, A., Fuchs, J., Berman, P.W., 2006.

Longitudinal population analysis of dual infection with recombination in two strains of

HIV type 1 subtype B in an individual from a phase 3 HIV vaccine efficacy trial. Aids

Res Hum Retrov 22, 968-978.

Kanki, P.J., Hopper, J.R., Essex, M., 1987. The Origins of HIV-1 and HTLV-4/HIV-2.

Annals of the New York Academy of Sciences 511, 370-375.

Kanki, P.J., Kurth, R., Becker, W., Dreesman, G., McLane, F., Essex, M., 1985a.

Antibodies to simian T-lymphotropic retrovirus type III in African green monkeys and recognition of STLV-III viral proteins by AIDS and related sera. Lancet 1, 1330-1332.

Kanki, P.J., McLane, M.F., King, N.W., Letvin, N.L., Hunt, R.D., Sehgal, P., Daniel,

M.D., Desrosiers, R.C., Essex, M., 1985b. Serologic Identification and Characterization

of a Macaque T-Lymphotropic Retrovirus Closely Related to HTLV-III. Science 228,

1199-1201.

Keele, B.F., Giorgi, E.E., Salazar-González, J.F., Decker, J.M., Pham, K.T., Salazar,

M.G., Sun, C., Grayson, T., Wang, S., Li, H., Wei, X., Jiang, C., Kirchherr, J.L., Gao, F.,

Anderson, J.A., Ping, L.-H., Swanstrom, R., Tomaras, G.D., Blattner, W.A., Goepfert,

P.A., Kilby, J.M., Saag, M.S., Delwart, E.L., Busch, M.P., Cohen, M.S., Montefiori,

D.C., Haynes, B.F., Gaschen, B., Athreya, G.S., Lee, H.Y., Wood, N., Seoighe, C.,

Perelson, A.S., Bhattacharya, T., Korber, B.T., Hahn, B.H., Shaw, G.M., 2008a.

Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proceedings of the National Academy of Sciences 105, 7552-

7557.

Keele, B.F., Jones, J.H., Terio, K.A., Estes, J.D., Rudicell, R.S., Wilson, M.L., Li, Y.,

Learn, G.H., Beasley, T.M., Schumacher-Stankey, J., Wroblewski, E., Mosser, A.,

Raphael, J., Kamenya, S., Lonsdorf, E.V., Travis, D.A., Mlengeya, T., Kinsel, M.J., Else,

J.G., Silvestri, G., Goodall, J., Sharp, P.M., Shaw, G.M., Pusey, A.E., Hahn, B.H., 2009.

Increased mortality and AIDS-like immunopathology in wild chimpanzees infected with

SIVcpz. Nature 460, 515-519.

Keele, B.F., Tazi, L., Gartner, S., Liu, Y.L., Burgon, T.B., Estes, J.D., Thacker, T.C.,

Crandall, K.A., McArthur, J.C., Burton, G.F., 2008b. Characterization of the follicular dendritic cell reservoir of human immunodeficiency virus type 1. J Virol 82, 5548-5561.

Keele, B.F., Van Heuverswyn, F., Li, Y., Bailes, E., Takehisa, J., Santiago, M.L.,

Bibollet-Ruche, F., Chen, Y., Wain, L.V., Liegeois, F., Loul, S., Ngole, E.M., Bienvenue,

Y., Delaporte, E., Brookfield, J.F.Y., Sharp, P.M., Shaw, G.M., Peeters, M., Hahn, B.H.,

2006. Chimpanzee Reservoirs of Pandemic and Nonpandemic HIV-1. Science 313, 523-

526.

Kerr, P.J., Kitchen, A., Holmes, E.C., 2009. Origin and Phylodynamics of Rabbit

Hemorrhagic Disease Virus. J Virol 83, 12129-12138.

Kieffer, T.L., Finucane, M.M., Nettles, R.E., Quinn, T.C., Broman, K.W., Ray, S.C.,

Persaud, D., Siliciano, R.F., 2004. Genotypic analysis of HIV-1 drug resistance at the limit of detection: Virus production without evolution in treated adults with undetectable

HIV loads. J Infect Dis 189, 1452-1465.

Kingman, J.F.C., 1982. On the Genealogy of Large Populations. Journal of Applied

Probability 19, 27-43.

Kingman, J.F.C., 2000. Origins of the coalescent: 1974-1982. Genetics 156, 1461-1463.

Korber, B., Muldoon, M., Theiler, J., Gao, F., Gupta, R., Lapedes, A., Hahn, B.H.,

Wolinsky, S., Bhattacharya, T., 2000. Timing the Ancestor of the HIV-1 Pandemic

Strains. Science 288, 1789-1796.

Kosakovsky Pond, S.L., Posada, D., Stawiski, E., Chappey, C., Poon, A.F.Y., Hughes,

G., Fearnhill, E., Gravenor, M.B., Leigh Brown, A.J., Frost, S.D.W., 2009. An

Evolutionary Model-Based Algorithm for Accurate Phylogenetic Breakpoint Mapping and Subtype Prediction in HIV-1. Plos Comput Biol 5, e1000581.

Kuhner, M.K., 2006. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics 22, 768-770.

Kuhner, M.K., 2009. Coalescent genealogy samplers: windows into population history.

Trends in Ecology & Evolution 24, 86-93.

Kuiken, C., Foley, B., Leitner, T., Apetrei, T., Hahn, B., Mizrachi, I., Mullins, J.,

Rambaut, A., Wolinsky, S., Korber, B.E., 2010. HIV Sequence Compendium 2010. Los

Alamos National Laboratory, Theoretical Biology and Biophysics, Los Alamos, New

Mexico.

Lamers, S.L., Salemi, M., Galligan, D.C., de Oliveira, T., Fogel, G.B., Granier, S.C.,

Zhao, L., Brown, J.N., Morris, A., Masliah, E., McGrath, M.S., 2009. Extensive HIV-1

Intra-Host Recombination Is Common in Tissues with Abnormal Histopathology. PLoS

ONE 4, e5065.

Larder, B.A., Kemp, S.D., 1989. Multiple Mutations in Hiv-1 Reverse-Transcriptase

Confer High-Level Resistance to Zidovudine (Azt). Science 246, 1155-1158.

Lataillade, M., Chiarella, J., Yang, R., Schnittman, S., Wirtz, V., Uy, J., Seekins, D.,

Krystal, M., Mancini, M., McGrath, D., Simen, B., Egholm, M., Kozal, M., 2010.

Prevalence and Clinical Significance of HIV Drug Resistance Mutations by Ultra-Deep

Sequencing in Antiretroviral-Naïve Subjects in the CASTLE Study. PLoS ONE 5, e10952.

Leeper, S.C., Reddi, A., 2010. United States global health policy: HIV/AIDS, maternal and child health, and The President's Emergency Plan for AIDS Relief (PEPFAR). Aids

24, 2145-2149.

Leitner, T., Escanilla, D., Franzén, C., Uhlén, M., Albert, J., 1996. Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis.

Proceedings of the National Academy of Sciences of the United States of America 93,

10864-10869.

Lemey, P., Derdelinckx, I., Rambaut, A., Van Laethem, K., Dumont, S., Vermeulen, S.,

Van Wijngaerden, E., Vandamme, A.-M., 2005a. Molecular Footprint of Drug-Selective

Pressure in a Human Immunodeficiency Virus Transmission Chain. J. Virol. 79, 11981-

11989.

Lemey, P., Pond, S.L.K., Drummond, A.J., Pybus, O.G., Shapiro, B., Barroso, H.,

Taveira, N., Rambaut, A., 2007. Synonymous substitution rates predict HIV disease progression as a result of underlying replication dynamics. Plos Comput Biol 3, 282-292.

Lemey, P., Pybus, O.G., Wang, B., Saksena, N.K., Salemi, M., Vandamme, A.-M., 2003.

Tracing the origin and history of the HIV-2 epidemic. Proceedings of the National

Academy of Sciences of the United States of America 100, 6588-6592.

Lemey, P., Rambaut, A., Drummond, A.J., Suchard, M.A., 2009. Bayesian

Phylogeography Finds Its Roots. Plos Comput Biol 5, -.

Lemey, P., Rambaut, A., Welch, J.J., Suchard, M.A., 2010. Phylogeography Takes a

Relaxed Random Walk in Continuous Space and Time. Mol Biol Evol 27, 1877-1885.

Lemey, P., Van Dooren, S., Van Laethem, K., Schrooten, Y., Derdelinckx, I., Goubau, P.,

Brun-Vézinet, F., Vaira, D., Vandamme, A.-M., 2005b. Molecular testing of multiple

HIV-1 transmissions in a criminal case. Aids 19, 1649-1658.

Lewis, F., Hughes, G.J., Rambaut, A., Pozniak, A., Brown, A.J.L., 2008. Episodic sexual transmission of HIV revealed by molecular phylodynamics. Plos Med 5, 392-402.

Lewis-Rogers, N., Crandall, K.A., Posada, D., 2004. Evolutionary analyses of genetic recombination. Dynamical Genetics, 408.

Liao, H.A., Tee, K.K., Hase, S., Uenishi, R., Li, X.J., Kusagawa, S., Pham, H.T.,

Nguyen, T.H., Pybus, O.G., Takebe, Y., 2009. Phylodynamic analysis of the dissemination of HIV-1 CRF01_AE in Vietnam. Virology 391, 51-56.

Lole, K.S., Bollinger, R.C., Paranjape, R.S., Gadkari, D., Kulkarni, S.S., Novak, N.G.,

Ingersoll, R., Sheppard, H.W., Ray, S.C., 1999. Full-Length Human Immunodeficiency

Virus Type 1 Genomes from Subtype C-Infected Seroconverters in India, with Evidence of Intersubtype Recombination. J. Virol. 73, 152-160.

Machado, E.S., Afonso, A.O., Nissley, D.V., Lemey, P., Cunha, S.M., Oliveira, R.H.,

Soares, M.A., 2009. Emergence of Primary NNRTI Resistance Mutations without

Antiretroviral Selective Pressure in a HAART-Treated Child. PLoS ONE 4, e4806.

Mansky, L.M., Temin, H.M., 1995. Lower in-Vivo Mutation-Rate of Human-

Immunodeficiency-Virus Type-1 Than That Predicted from the Fidelity of Purified

Reverse-Transcriptase. J Virol 69, 5087-5094.

Marsden, M.D., Zack, J.A., 2009. Eradication of HIV: current challenges and new directions. J. Antimicrob. Chemother. 63, 7-10.

Martin, D.P., Williamson, C., Posada, D., 2005. RDP2: recombination detection and analysis from sequence alignments. Bioinformatics 21, 260-262.

McBurney, S.P., Ross, T.M., 2008. Viral sequence diversity: challenges for AIDS vaccine designs. Expert Review of Vaccines 7, 1405-1417.

McDonald, J.H., Kreitman, M., 1991. Adaptive protein evolution at the Adh locus in

Drosophila. Nature 351, 652-654.

Mens, H., Pedersen, A.G., Jorgensen, L.B., Hue, S., Yang, Y.Z., Gerstoft, J., Katzenstein,

T.L., 2007. Investigating signs of recent evolution in the pool of proviral HIV type 1

DNA during years of successful HAART. Aids Res Hum Retrov 23, 107-115.

Metzker, M.L., Mindell, D.P., Liu, X.M., Ptak, R.G., Gibbs, R.A., Hillis, D.M., 2002.

Molecular evidence of HIV-1 transmission in a criminal case. Proceedings of the

National Academy of Sciences of the United States of America 99, 14292-14297.

Mitsuya, H., Weinhold, K.J., Furman, P.A., St Clair, M.H., Lehrman, S.N., Gallo, R.C.,

Bolognesi, D., Barry, D.W., Broder, S., 1985. 3'-Azido-3'-deoxythymidine (BW A509U): an antiviral agent that inhibits the infectivity and cytopathic effect of human T-lymphotropic virus type III/lymphadenopathy-associated virus in vitro. Proc Natl Acad

Sci U S A. 82, 7096–7100.

Mondini, A., Bronzoni, R.V.D., Nunes, S.H.P., Chiaravalloti-Neto, F., Massad, E.,

Alonso, W.J., Zanotto, P.M.D., Nogueira, M.L., 2010. Spatio-temporal tracking and phylodynamics of a DENV-3 outbreak in a city from Brazil. Cladistics 26, 218-218.

Moya, A., Holmes, E.C., González-Candelas, F., 2004. The population genetics and evolutionary epidemiology of RNA viruses. Nat. Rev. Microbiol. 2, 279.

Neil, S.J.D., Zang, T., Bieniasz, P.D., 2008. Tetherin inhibits retrovirus release and is antagonized by HIV-1 Vpu. Nature 451, 425-430.

Nelson, M.I., Lemey, P., Tan, Y., Vincent, A., Lam, T.T.-Y., Detmer, S., Viboud, C.,

Suchard, M.A., Rambaut, A., Holmes, E.C., Gramer, M., 2011. Spatial Dynamics of

Human-Origin H1 Influenza A Virus in North American Swine. Plos Pathog 7, e1002077.

Ngandu, N., Scheffler, K., Moore, P., Woodman, Z., Martin, D., Seoighe, C., 2008.

Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences.

Virology Journal 5, 160.

Nora, T., Charpentier, C., Tenaillon, O., Hoede, C., Clavel, F., Hance, A.J., 2007.

Contribution of Recombination to the Evolution of Human Immunodeficiency Viruses

Expressing Resistance to Antiretroviral Treatment. J. Virol. 81, 7620-7628.

Nordborg, M., 2004. Coalescent Theory. Handbook of Statistical Genetics. John Wiley &

Sons, Ltd.

Nottet, H.S.L.M., van Dijk, S.J., Fanoy, E.B., Goedegebuure, I.W., de Jong, D.,

Vrisekoop, N., van Baarle, D., Boltz, V., Palmer, S., Borleffs, J.C.C., Boucher, C.A.B.,

2009. HIV-1 Can Persist in Aged Memory CD4(+) T Lymphocytes With Minimal Signs of Evolution After 8.3 Years of Effective Highly Active Antiretroviral Therapy. Jaids-J

Acq Imm Def 50, 345-353.

Onafuwa-Nuga, A., Telesnitsky, A., 2009. The Remarkable Frequency of Human

Immunodeficiency Virus Type 1 Genetic Recombination. Microbiol. Mol. Biol. Rev. 73,

451-480.

Ou, C.-Y., Ciesielski, C.A., Myers, G., Bandea, C.I., Luo, C.-C., Korber, B.T.M.,

Mullins, J.I., Schochetman, G., Berkelman, R.L., Economou, A.N., Witte, J.J., Furman,

L.J., Satten, G.A., MacInnes, K.A., Curran, J.W., Jaffe, H.W., Laboratory Investigation

Group, Epidemiologic Investigation Group, 1992. Molecular Epidemiology of HIV

Transmission in a Dental Practice. Science 256, 1165-1171.

Parera, M., Ibanez, A., Clotet, B., Martinez, M.A., 2004. Lack of evidence for protease evolution in HIV-1-infected patients after 2 years of successful highly active antiretroviral therapy. J Infect Dis 189, 1444-1451.

Peeters, M., Chaix, M.L., Delaporte, E., 2008. Genetic diversity and phylogeographic distribution of SIV: how to understand the origin of HIV. M S-Med Sci 24, 621-628.

Perelson, A.S., Essunger, P., Cao, Y., Vesanen, M., Hurley, A., Saksela, K., Markowitz,

M., Ho, D.D., 1997. Decay characteristics of HIV-1-infected compartments during combination therapy. Nature 387, 188-191.

Pérez-Losada, M., Crandall, K.A., Bash, M.C., Dan, M., Zenilman, J., Viscidi, R.P.,

2007a. Distinguishing importation from diversification of quinolone-resistant Neisseria gonorrhoeae by molecular evolutionary analysis. Bmc Evol Biol 7, 84.

Pérez-Losada, M., Crandall, K.A., Zenilman, J., Viscidi, R.P., 2007b. Temporal trends in gonococcal population genetics in a high prevalence urban community. Infect Genet Evol

7, 271-278.

Pérez-Losada, M., Jobes, D.V., Sinangil, F., Crandall, K.A., Arenas, M., Posada, D.,

Berman, P.W., 2011. Phylodynamics of HIV-1 from a Phase III AIDS Vaccine Trial in

Bangkok, Thailand. PLoS ONE 6, e16902.

Pérez-Losada, M., Jobes, D.V., Sinangil, F., Crandall, K.A., Posada, D., Berman, P.W.,

2010. Phylodynamics of HIV-1 from a Phase-III AIDS Vaccine Trial in North America.

Mol Biol Evol 27, 417-425.

Pérez-Losada, M., Porter, M.L., Tazi, L., Crandall, K.A., 2007c. New methods for inferring population dynamics from microbial sequences. Infect Genet Evol 7, 24-43.

Pérez-Losada, M., Viscidi, R.P., Demma, J.C., Zenilman, J., Crandall, K.A., 2005.

Population genetics of Neisseria gonorrhoeae in a high-prevalence community using a hypervariable outer membrane porB and 13 slowly evolving housekeeping genes. Mol

Biol Evol 22, 1887-1902.

Persaud, D., Ray, S.C., Kajdas, J., Ahonkhai, A., Siberry, G.K., Ferguson, K., Ziemniak,

C., Quinn, T.C., Casazza, J.P., Zeichner, S., Gange, S.J., Watson, D.C., 2007. Slow human immunodeficiency virus type 1 evolution in viral reservoirs in infants treated with effective antiretroviral therapy. Aids Res Hum Retrov 23, 381-390.

Pillay, D., Rambaut, A., Geretti, A.M., Brown, A.J.L., 2007. HIV phylogenetics. BMJ

335, 460-461.

Plantier, J.-C., Leoz, M., Dickerson, J.E., De Oliveira, F., Cordonnier, F., Lemee, V.,

Damond, F., Robertson, D.L., Simon, F., 2009. A new human immunodeficiency virus derived from gorillas. Nat Med 15, 871-872.

Plotkin, S.A., 2001. Untruths and consequences: the false hypothesis linking CHAT type

1 polio vaccination to the origin of human immunodeficiency virus. Philosophical

Transactions of the Royal Society of London. Series B: Biological Sciences 356, 815-

823.

Pond, S.K., Muse, S.V., 2005. Site-to-site variation of synonymous substitution rates.

Mol Biol Evol 22, 2375-2385.

Pond, S.L.K., Frost, S.D.W., Grossman, Z., Gravenor, M.B., Richman, D.D., Brown,

A.J.L., 2006. Adaptation to different human populations by HIV-1 revealed by codon-based analyses. Plos Comput Biol 2, 530-538.

Pond, S.L.K., Poon, A.F.Y., Zarate, S., Smith, D.M., Little, S.J., Pillai, S.K., Ellis, R.J.,

Wong, J.K., Brown, A.J.L., Richman, D.D., Frost, S.D.W., 2008. Estimating selection pressures on HIV-1 using phylogenetic likelihood models. Stat Med 27, 4779-4789.

Poon, A.F.Y., Pond, S.L.K., Bennett, P., Richman, D.D., Brown, A.J.L., Frost, S.D.W.,

2007. Adaptation to human populations is revealed by within-host polymorphisms in

HIV-1 and hepatitis C virus. Plos Pathog 3, -.

Poon, A.F.Y., Swenson, L.C., Dong, W.W.Y., Deng, W.J., Pond, S.L.K., Brumme, Z.L.,

Mullins, J.I., Richman, D.D., Harrigan, P.R., Frost, S.D.W., 2010. Phylogenetic Analysis of Population-Based and Deep Sequencing Data to Identify Coevolving Sites in the nef

Gene of HIV-1. Mol Biol Evol 27, 819-832.

Popovic, M., Sarngadharan, M., Read, E., Gallo, R., 1984. Detection, isolation, and continuous production of cytopathic retroviruses (HTLV-III) from patients with AIDS and pre-AIDS. Science 224, 497-500.

Posada, D., 2001. Unveiling the molecular clock in the presence of recombination. Mol

Biol Evol 18, 1976-1978.

Posada, D., Crandall, K.A., 2001. Evaluation of methods for detecting recombination from DNA sequences: Computer simulations. Proceedings of the National Academy of

Sciences of the United States of America 98, 13757-13762.

Posada, D., Crandall, K.A., 2002. The effect of recombination on the accuracy of phylogeny estimation. Journal of Molecular Evolution 54, 396-402.

Posada, D., Crandall, K.A., Holmes, E.C., 2002. Recombination in evolutionary genomics. Annu Rev Genet 36, 75-97.

Price, M.A., Wallis, C.L., Lakhi, S., Karita, E., Kamali, A., Anzala, O., Sanders, E.J.,

Bekker, L.G., Twesigye, R., Hunter, E., Kaleebu, P., Kayitenkore, K., Allen, S.,

Ruzagira, E., Mwangome, M., Mutua, G., Amornkul, P.N., Stevens, G., Pond, S.L.K.,

Schaefer, M., Papathanasopoulos, M.A., Stevens, W., Gilmour, J., Study, I.E.I.C., 2011.

Transmitted HIV Type 1 Drug Resistance Among Individuals with Recent HIV Infection in East and Southern Africa. Aids Res Hum Retrov 27, 5-12.

Raghwani, J., Rambaut, A., Holmes, E.C., Hang, V.T., Hien, T.T., Farrar, J., Wills, B.,

Lennon, N.J., Birren, B.W., Henn, M.R., Simmons, C.P., 2011. Endemic Dengue

Associated with the Co-Circulation of Multiple Viral Lineages and Localized Density-

Dependent Transmission. Plos Pathog 7, e1002064.

Rambaut, A., Posada, D., Crandall, K.A., Holmes, E.C., 2004. The causes and consequences of HIV evolution. Nat Rev Genet 5, 52-61.

Rambaut, A., Pybus, O.G., Nelson, M.I., Viboud, C., Taubenberger, J.K., Holmes, E.C.,

2008. The genomic and epidemiological dynamics of human influenza A virus. Nature

453, 615-U612.

Rerks-Ngarm, S., Pitisuttithum, P., Nitayaphan, S., Kaewkungwal, J., Chiu, J., Paris, R.,

Premsri, N., Namwat, C., de Souza, M., Adams, E., Benenson, M., Gurunathan, S.,

Tartaglia, J., McNeil, J.G., Francis, D.P., Stablein, D., Birx, D.L., Chunsuttiwat, S.,

Khamboonruang, C., Thongcharoen, P., Robb, M.L., Michael, N.L., Kunasol, P., Kim,

J.H., 2009. Vaccination with ALVAC and AIDSVAX to Prevent HIV-1 Infection in

Thailand. New England Journal of Medicine 361, 2209-2220.

Rhee, S.-Y., Gonzales, M.J., Kantor, R., Betts, B.J., Ravela, J., Shafer, R.W., 2003.

Human immunodeficiency virus reverse transcriptase and protease sequence database.

Nucleic Acids Research 31, 298-303.

Robbins, K.E., Lemey, P., Pybus, O.G., Jaffe, H.W., Youngpairoj, A.S., Brown, T.M.,

Salemi, M., Vandamme, A.-M., Kalish, M.L., 2003. U.S. Human Immunodeficiency

Virus Type 1 Epidemic: Date of Origin, Population History, and Characterization of

Early Strains. J. Virol. 77, 6359-6366.

Robertson, D.L., Hahn, B.H., Sharp, P.M., 1995a. Recombination in AIDS viruses.

Journal of Molecular Evolution 40, 249-259.

Robertson, D.L., Sharp, P.M., McCutchan, F.E., Hahn, B.H., 1995b. Recombination in

HIV-1. Nature 374, 124-126.

Rodrigo, A.G., Shpaer, E.G., Delwart, E.L., Iversen, A.K.N., Gallo, M.V., Brojatsch, J.,

Hirsch, M.S., Walker, B.D., Mullins, J.I., 1999. Coalescent estimates of HIV-1 generation time in vivo. Proceedings of the National Academy of Sciences 96, 2187-

2191.

Sakuma, R., Noser, J.A., Ohmine, S., Ikeda, Y., 2007. Rhesus monkey TRIM5α restricts HIV-1 production through rapid degradation of viral Gag polyproteins. Nat Med

13, 631-635.

Salazar-González, J.F., Bailes, E., Pham, K.T., Salazar, M.G., Guffey, M.B., Keele, B.F.,

Derdeyn, C.A., Farmer, P., Hunter, E., Allen, S., Manigart, O., Mulenga, J., Anderson,

J.A., Swanstrom, R., Haynes, B.F., Athreya, G.S., Korber, B.T.M., Sharp, P.M., Shaw,

G.M., Hahn, B.H., 2008. Deciphering Human Immunodeficiency Virus Type 1

Transmission and Early Envelope Diversification by Single-Genome Amplification and

Sequencing. J. Virol. 82, 3952-3970.

Salemi, M., Burkhardt, B.R., Gray, R.R., Ghaffari, G., Sleasman, J.W., Goodenow,

M.M., 2007. Phylodynamics of HIV-1 in Lymphoid and Non-Lymphoid Tissues Reveals a Central Role for the Thymus in Emergence of CXCR4-Using Quasispecies. PLoS ONE

2, -.

Salemi, M., de Oliveira, T., Ciccozzi, M., Rezza, G., Goodenow, M.M., 2008. High-

Resolution Molecular Epidemiology and Evolutionary History of HIV-1 Subtypes in

Albania. PLoS ONE 3, e1390.

Salemi, M., Lamers, S.L., Huysentruyt, L.C., Galligan, D., Gray, R.R., Morris, A.,

McGrath, M.S., 2009. Distinct Patterns of HIV-1 Evolution within Metastatic Tissues in

Patients with Non-Hodgkins Lymphoma. PLoS ONE 4, -.

Salemi, M., Lamers, S.L., Yu, S., de Oliveira, T., Fitch, W.M., McGrath, M.S., 2005.

Phylodynamic analysis of human immunodeficiency virus type 1 in distinct brain compartments provides a model for the neuropathogenesis of AIDS. J Virol 79, 11343-

11352.

Salemi, M., Strimmer, K., Hall, W.W., Duffy, M., Delaporte, E., Mboup, S., Peeters, M.,

Vandamme, A.-M., 2001. Dating the common ancestor of SIVcpz and HIV-1 group M and the origin of HIV-1 subtypes using a new method to uncover clock-like molecular evolution. FASEB J. 15, 276-278.

Santiago, M.L., Bibollet-Ruche, F., Bailes, E., Kamenya, S., Muller, M.N., Lukasik, M.,

Pusey, A.E., Collins, D.A., Wrangham, R.W., Goodall, J., Shaw, G.M., Sharp, P.M.,

Hahn, B.H., 2003. Amplification of a Complete Simian Immunodeficiency Virus

Genome from Fecal RNA of a Wild Chimpanzee. J. Virol. 77, 2233-2242.

Santiago, M.L., Range, F., Keele, B.F., Li, Y., Bailes, E., Bibollet-Ruche, F., Fruteau, C.,

Noe, R., Peeters, M., Brookfield, J.F.Y., Shaw, G.M., Sharp, P.M., Hahn, B.H., 2005.

Simian Immunodeficiency Virus Infection in Free-Ranging Sooty Mangabeys

(Cercocebus atys atys) from the Tai Forest, Cote d'Ivoire: Implications for the Origin of

Epidemic Human Immunodeficiency Virus Type 2. J. Virol. 79, 12515-12527.

Santiago, M.L., Rodenburg, C.M., Kamenya, S., Bibollet-Ruche, F., Gao, F., Bailes, E.,

Meleth, S., Soong, S.-J., Kilby, J.M., Moldoveanu, Z., Fahey, B., Muller, M.N., Ayouba,

A., Nerrienet, E., McClure, H.M., Heeney, J.L., Pusey, A.E., Collins, D.A., Boesch, C.,

Wrangham, R.W., Goodall, J., Sharp, P.M., Shaw, G.M., Hahn, B.H., 2002. SIVcpz in

Wild Chimpanzees. Science 295, 465.

Sarngadharan, M., Popovic, M., Bruch, L., Schupbach, J., Gallo, R., 1984. Antibodies reactive with human T-lymphotropic retroviruses (HTLV-III) in the serum of patients with AIDS. Science 224, 506-508.

Schlub, T.E., Smyth, R.P., Grimm, A.J., Mak, J., Davenport, M.P., 2010. Accurately

Measuring Recombination between Closely Related HIV-1 Genomes. Plos Comput Biol

6, -.

Shankarappa, R., Margolick, J.B., Gange, S.J., Rodrigo, A.G., Upchurch, D., Farzadegan,

H., Gupta, P., Rinaldo, C.R., Learn, G.H., He, X., Huang, X.L., Mullins, J.I., 1999.

Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol 73, 10489-10502.

Shapiro, B., Ho, S.Y.W., Drummond, A.J., Suchard, M.A., Pybus, O.G., Rambaut, A.,

2011. A Bayesian Phylogenetic Method to Estimate Unknown Sequence Ages. Mol Biol

Evol 28, 879-887.

Sharp, P.M., 1997. In search of molecular darwinism. Nature 385, 111-112.

Sharp, P.M., Bailes, E., Gao, F., Beer, B.E., Hirsch, V.M., Hahn, B.H., 2000. Origins and evolution of AIDS viruses: estimating the time-scale. Biochem. Soc. Trans. 28, 275-282.

Shi, B., Kitchen, C., Weiser, B., Mayers, D., Foley, B., Kemal, K., Anastos, K., Suchard,

M., Parker, M., Brunner, C., Burger, H., 2010. Evolution and recombination of genes encoding HIV-1 drug resistance and tropism during antiretroviral therapy. Virology 404,

5-20.

Shriner, D., Rodrigo, A.G., Nickle, D.C., Mullins, J.I., 2004. Pervasive Genomic

Recombination of HIV-1 in Vivo. Genetics 167, 1573-1583.

Siebenga, J.J., Lemey, P., Pond, S.L.K., Rambaut, A., Vennema, H., Koopmans, M.,

2010. Phylodynamic Reconstruction Reveals Norovirus GII.4 Epidemic Expansions and their Molecular Determinants. Plos Pathog 6, -.

Siliciano, J.D., Kajdas, J., Finzi, D., Quinn, T.C., Chadwick, K., Margolick, J.B., Kovacs,

C., Gange, S.J., Siliciano, R.F., 2003. Long-term follow-up studies confirm the stability of the latent reservoir for HIV-1 in resting CD4(+) T cells. Nature Medicine 9, 727-728.

Skar, H., Axelsson, M., Berggren, I., Thalme, A., Gyllensten, K., Liitsola, K., Brummer-

Korvenkontio, H., Kivela, P., Spangberg, E., Leitner, T., Albert, J., 2011. Dynamics of

Two Separate but Linked HIV-1 CRF01_AE Outbreaks among Injection Drug Users in

Stockholm, Sweden, and Helsinki, Finland. J. Virol. 85, 510-518.

Smith, B.A., Gartner, S., Liu, Y., Perelson, A.S., Stilianakis, N.I., Keele, B.F., Kerkering,

T.M., Ferreira-Gonzalez, A., Szakal, A.K., Tew, J.G., Burton, G.F., 2001. Persistence of

Infectious HIV on Follicular Dendritic Cells. Journal of Immunology 166, 690-696.

Smith, N.G.C., Eyre-Walker, A., 2002. Adaptive protein evolution in Drosophila. Nature

415, 1022-1024.

Stack, J.C., Welch, J.D., Ferrari, M.J., Shapiro, B.U., Grenfell, B.T., 2010. Protocols for sampling viral sequences to study epidemic dynamics. J R Soc Interface 7, 1119-1127.

Stadler, T., 2010. Sampling-through-time in birth-death trees. Journal of Theoretical

Biology 267, 396-404.

Stadler, T., Kouyos, R., von Wyl, V., Yerly, S., Böni, J., Bürgisser, P., Klimkait, T.,

Joos, B., Rieder, P., Xie, D., Günthard, H.F., Drummond, A.J., Bonhoeffer, S., Swiss HIV Cohort Study, 2011. Estimating the basic reproductive number from viral sequence data. Mol

Biol Evol.

Stopak, K., de Noronha, C., Yonemoto, W., Greene, W.C., 2003. HIV-1 Vif Blocks the

Antiviral Activity of APOBEC3G by Impairing Both Its Translation and Intracellular

Stability. Molecular Cell 12, 591-601.

Sullivan, J., Joyce, P., 2005. Model Selection in Phylogenetics. Annual Review of

Ecology, Evolution, and Systematics 36, 445-466.

Susko, E., Inagaki, Y., Roger, A.J., 2004. On Inconsistency of the Neighbor-Joining,

Least Squares, and Minimum Evolution Estimation When Substitution Processes Are

Incorrectly Modeled. Mol Biol Evol 21, 1629-1642.

Tajima, F., 1983. Evolutionary Relationship of DNA Sequences in Finite Populations.

Genetics 105, 437-460.

Takehisa, J., Kraus, M.H., Ayouba, A., Bailes, E., Van Heuverswyn, F., Decker, J.M., Li,

Y., Rudicell, R.S., Learn, G.H., Neel, C., Ngole, E.M., Shaw, G.M., Peeters, M., Sharp,

P.M., Hahn, B.H., 2009. Origin and Biology of Simian Immunodeficiency Virus in Wild-

Living Western Gorillas. J. Virol. 83, 1635-1648.

Tavaré, S., 1984. Line-of-descent and genealogical processes, and their applications in population genetics models. Theoretical Population Biology 26, 119-164.

Tazi, L., Imamichi, H., Hirschfeld, S., Metcalf, J., Orsega, S., Pérez-Losada, M., Posada,

D., Lane, H.C., Crandall, K., 2011. HIV-1 infected monozygotic twins: a tale of two outcomes. Bmc Evol Biol 11, 62.

Tazi, L., Pérez-Losada, M., Gu, W.M., Yang, Y., Xue, L., Crandall, K.A., Viscidi, R.P.,

2010. Population dynamics of Neisseria gonorrhoeae in Shanghai, China: a comparative study. Bmc Infect Dis 10, -.

Tee, K.K., Pybus, O.G., Li, X.J., Han, X.X., Shang, H., Kamarulzaman, A., Takebe, Y.,

2008. Temporal and spatial dynamics of human immunodeficiency virus type 1 circulating recombinant forms 08_BC and 07_BC in Asia. J Virol 82, 9206-9215.

Temin, H.M., 1991. Sex and Recombination in Retroviruses. Trends Genet 7, 71-74.

Templeton, A.R., Reichert, R.A., Weisstein, A.E., Yu, X.F., Markham, R.B., 2004.

Selection in context: Patterns of natural selection in the glycoprotein 120 region of human immunodeficiency virus 1 within infected individuals. Genetics 167, 1547-1561.

Tibayrenc, M., 2005. Bridging the gap between molecular epidemiologists and evolutionists. Trends in Microbiology 13, 575-580.

Tissot, C., Mechti, N., 1995. Molecular Cloning of a New Interferon-induced Factor That

Represses Human Immunodeficiency Virus Type 1 Long Terminal Repeat Expression.

Journal of Biological Chemistry 270, 14891-14898.

Tobin, N.H., Learn, G.H., Holte, S.E., Wang, Y., Melvin, A.J., McKernan, J.L., Pawluk,

D.M., Mohan, K.M., Lewis, P.F., Mullins, J.I., Frenkel, L.M., 2005. Evidence that low-level viremias during effective highly active antiretroviral therapy result from two processes: Expression of archival virus and replication of virus. J Virol 79, 9625-9634.

Tumas, K.M., Poszgay, J.M., Avidan, N., Ksiazek, S.J., Overmoyer, B., Blank, K.J.,

Prystowsky, M.B., 1993. Loss of Antigenic Epitopes as the Result of Env Gene

Recombination in Retrovirus-Induced Leukemia in Immunocompetent Mice. Virology

192, 587-595.

UNAIDS, 2010. Global report: UNAIDS report on the global AIDS epidemic 2010. Joint

United Nations Programme on HIV/AIDS (UNAIDS).

Van Heuverswyn, F., Peeters, M., 2007. The origins of HIV and implications for the global epidemic. Current Infectious Disease Reports 9, 338-346. van Marle, G., Gill, M.J., Kolodka, D., McManus, L., Grant, T., Church, D.L., 2007.

Compartmentalization of the gut viral reservoir in HIV-1 infected patients. Retrovirology

4, -.

Wain-Hobson, S., 1997. Down or out in blood and lymph? Nature 387, 123-124.

Wakeley, J., 2004. Recent trends in population genetics: More data! More math! Simple models? J Hered 95, 397-405.

Wakeley, J., 2008. Coalescent Theory: An Introduction. Roberts & Company Publishers.,

Greenwood Village, CO.

73 Wang, T.H., Donaldson, Y.K., Brettle, R.P., Bell, J.E., Simmonds, P., 2001.

Identification of Shared Populations of Human Immunodeficiency Virus Type 1 Infecting

Microglia and Tissue Macrophages outside the Central Nervous System. J. Virol. 75,

11686-11699.

Wei, X., Decker, J.M., Wang, S., Hui, H., Kappes, J.C., Wu, X., Salazar-Gonzalez, J.F.,

Salazar, M.G., Kilby, J.M., Saag, M.S., Komarova, N.L., Nowak, M.A., Hahn, B.H.,

Kwong, P.D., Shaw, G.M., 2003. Antibody neutralization and escape by HIV-1. Nature

422, 307-312.

Welch, D., Nicholls, G.K., Rodrigo, A., Solomon, W., 2005. Integrating genealogy and epidemiology: the ancestral infection and selection graph as a model for reconstructing host virus histories. Theor. Popul. Biol. 68, 65.

Wertheim, J., Worobey, M., 2007. A challenge to the ancient origin of SIVagm based on

African green monkey mitochondrial genomes. PLoS Pathog. 3, e95.

Wong, J., Ignacio, C., Torriani, F., Havlir, D., Fitch, N., Richman, D., 1997. In vivo compartmentalization of human immunodeficiency virus: evidence from the examination of pol sequences from autopsy tissues. J. Virol. 71, 2059-2071.

Worobey, M., Gemmel, M., Teuwen, D.E., Haselkorn, T., Kunstman, K., Bunce, M.,

Muyembe, J.-J., Kabongo, J.-M.M., Kalengayi, R.M., Van Marck, E., Gilbert, M.T.P.,

Wolinsky, S.M., 2008. Direct evidence of extensive diversity of HIV-1 in Kinshasa by

1960. Nature 455, 661-664.

Worobey, M., Santiago, M.L., Keele, B.F., Ndjango, J.-B.N., Joy, J.B., Labama, B.L.,

Dhed'a, B.D., Rambaut, A., Sharp, P.M., Shaw, G.M., Hahn, B., H., 2004. Origin of

AIDS: Contaminated polio vaccine theory refuted. Nature 428, 820-820.

74 Worobey, M., Telfer, P., Souquiere, S., Hunter, M., Coleman, C.A., Metzger, M.J., Reed,

P., Makuwa, M., Hearn, G., Honarvar, S., Roques, P., Apetrei, C., Kazanji, M., Marx,

P.A., 2010. Island Biogeography Reveals the Deep History of SIV. Science 329, 1487-

1487.

Xin, K.Q., Ma, X.H., Crandall, K.A., Bukawa, H., Ishigatsubo, Y., Kawamoto, S.,

Okuda, K., 1995. Dual Infection with Hiv-1 Thai Subtype-B and Subtype-E. Lancet 346,

1372-1373.

Zehender, G., De Maddalena, C., Canuti, M., Zappa, A., Amendola, A., Lai, A., Galli,

M., Tanzi, E., 2010a. Rapid molecular evolution of human bocavirus revealed by

Bayesian coalescent inference. Infect Genet Evol 10, 215-220.

Zehender, G., Ebranati, E., Lai, A., Santoro, M.M., Alteri, C., Giuliani, M., Palamara, G.,

Perno, C.F., Galli, M., Lo Presti, A., Ciccozzi, M., 2010b. Population Dynamics of HIV-

1 Subtype B in a Cohort of Men-Having-Sex-With-Men in Rome, Italy. JAIDS Journal of

Acquired Immune Deficiency Syndromes 55, 156-160.

Zhu, T., Wang, N., Carr, A., Wolinsky, S., Ho, D., 1995. Evidence for coinfection by multiple strains of human immunodeficiency virus type 1 subtype B in an acute seroconvertor. J. Virol. 69, 1324-1327.

75

Chapter 2:

Genetic Diversity and Molecular Epidemiology of HIV Transmission

Abstract

The high genetic diversity of HIV is one of its most impactful features, as it has consequences for global distribution, vaccine design, therapy success, disease progression, transmissibility, and viral load testing. Studying HIV diversity helps to understand its origins, migration patterns, current distribution, and transmission events. New advances in sequencing technologies based on the parallel acquisition of data are now used to characterize within-host and population processes in depth. Additionally, we have seen similar advances in statistical methods designed to model the past history of lineages (the phylodynamic framework) to ultimately gain better insights into the evolutionary history of HIV. We can, for example, estimate population size changes, lineage dispersion over geographic areas and epidemiological parameters solely from sequence data. In this article, we review some of the evolutionary approaches used to study transmission patterns and processes in HIV.

Genetic Diversity of HIV

Genetic diversity (GD) is probably one of the most important concepts in biology. In its simplest definition, GD refers to any and every kind of genetic variation at the individual, population, inter-population, or species level. GD has a large impact on conservation biology [1], the study of human origins [2], molecular epidemiology [3], domestication [4], fitness [5], and disease [6]. In HIV-1, higher levels of GD have been associated with clinical outcomes such as immune escape of selected variants [7], emergence of drug resistance mutations and the consequent therapy failure [8], and even disease progression [9, 10]. GD has also been used to study the geographic and temporal spread of HIV-1, shedding light on global and regional population dynamics.

HIV-1 genetic diversity stems from at least three different sources: multiple introductions of HIV-1 into the human population [11-13], the low fidelity and high recombinogenic power [14, 15] of its reverse transcriptase [16], and its high virus turnover [17]. HIV-1 and -2 genetic diversity has been classified into discrete groups and subtypes that largely correspond to geographic structure [18] (Fig. 1). HIV-1 includes four groups, which represent different introductions into human populations, namely group M (Main, Major), N (New, Non-major), O (Outlier) and, recently, P [19]. HIV-1 group M is the most commonly detected variant and is, in turn, further divided into subtypes A-D, F-H, J and K. Also, circulating recombinant forms (CRFs) carrying genetic information from two or more subtypes have been detected and continue to be added (49 to date [20]). HIV-2 includes groups A-G. It is worth noting that HIV-2 CRFs have been reported only once [21]. Genetic variation among HIV subtypes is tremendous, with within-subtype divergences reaching up to 17% and between-subtype divergences of 17 to 35%. For comparison, human and chimpanzee divergence can reach up to 3.9% if substitutions and insertions/deletions are considered [22].

Early on, researchers noted differences in transmissibility between HIV-1 and -2 [23]. Some transmission differences are explained on the basis of structural and evolutionary differences in env genes [24]. It is clear that HIV-2 exhibits lower rates of transmission, almost no vertical transmission, and long incubation periods [25]. Moreover, HIV-2 infected patients have reduced immune activation and low viremia, and rarely develop disease [26]. Within HIV-1 groups and subtypes there are indications of biological and clinical differences. For instance, studies have shown that women infected with HIV-1 subtype A are less likely to develop AIDS [27]. Also, differences among subtypes have been reported in relation to chemokine coreceptor usage (tissue tropism) [28] and transmission [29].

Figure 1: HIV-1 recombinants and subtypes - Phylogenetic tree representation of HIV-1 recombinants and discrete subtypes. CRF = circulating recombinant form; cpx = complex recombinant pattern. A-D, F-H, J and K denote HIV-1 subtypes. Subtypes E and I are not shown, as these turned out to be recombinant forms of other subtypes. Color pattern as in Figure 3.

Measures of Genetic Diversity

Given the broad utility of understanding population-level GD in HIV, GD is an essential parameter to estimate. It can be estimated directly from nucleotide sequence data using different approaches.

Traditional measures of genetic diversity

The most intuitive way of measuring GD is simply to count the mismatches in a sequence alignment or the number of polymorphic sites [30], although this poses some problems, such as multiple hits at the same position, different probabilities of change in coding sequences, transition/transversion bias, etc. In general, measures of GD based on sequence data can be classified into (i) summary statistics and (ii) coalescent estimators. Some commonly used summary statistics are nucleotide diversity [31], haplotype diversity [31], allelic diversity, gene diversity and theta [32]. More recently, theta (Θ), expressed as the substitution rate-scaled effective population size (Neμ), has been estimated under a coalescent framework that explicitly takes evolutionary history into account [33-36], which differentiates this model from approaches based on summary statistics. The coalescent model has been further generalized to account for varying population sizes, different time scales, structure, recombination and selection, and it is now probably the most widely used framework for measuring genetic diversity. Detailed descriptions of algorithms for Θ estimation can be found in [37-39].
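As a minimal illustration of the summary statistics mentioned above, the following Python sketch computes the number of segregating (polymorphic) sites, nucleotide diversity (the mean proportion of pairwise differences) and Watterson's estimator of theta from a set of aligned sequences. The function names and the toy alignment are hypothetical and serve only as an illustration. The sketch assumes an ungapped alignment of equal-length sequences and uses raw mismatch counts, so it does not correct for multiple hits or for transition/transversion bias.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


def watterson_theta(seqs):
    """Watterson's estimator per site: S / (a_n * L),
    where a_n = 1 + 1/2 + ... + 1/(n - 1) for n sequences of length L."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / (a_n * len(seqs[0]))


# Toy alignment (invented sequences, for illustration only).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACGTAC"]

print("S     =", segregating_sites(toy_alignment))
print("pi    =", nucleotide_diversity(toy_alignment))
print("theta =", watterson_theta(toy_alignment))
```

Both estimators are per-site quantities here; under a neutral, constant-size model they are expected to estimate the same parameter, which is why their comparison is often informative.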

Novel approaches

Newly developed approaches intend to estimate diversity parameters by taking advantage of the massive amounts of data that next-generation sequencing (NGS) technologies can deliver [40]. New methodologies have focused on characterizing intra-host diversity by capturing low-frequency variants [41]. Recent implementations take advantage of Bayesian inference to correct errors [42] and to infer haplotypes and their frequencies (as low as 0.1% [43]) [44]. Also, the determination of full HIV genomic sequences and related measures of diversity is now feasible, which opens unexplored possibilities to comprehensively address how HIV mutates under selective pressures [45]. To date, most applications of NGS technologies to “ultra-deep sequencing” of viral populations have focused on drug resistance characterization, fluctuations in genetic diversity through disease progression, and certain events in HIV biology, such as tropism switch, transmission bottlenecks, immune escape [46, 47], epistasis [48] and superinfections [49].

Geographic and Temporal Spread of Diversity

Phylodynamics, or the description of infectious disease behavior that arises from the blending of evolutionary and ecological processes, has become a very active subject in virology and epidemiology, especially after recent statistical developments (see Table 1) [50-64]. These new methods often use the coalescent under a Maximum Likelihood or Bayesian Inference framework and have been used in HIV epidemiology to answer questions ranging from within-host and transmission dynamics to epidemiology and the global spread of disease.

Table 1. Summary of software used for phylodynamic inferences.

Inference                                                                  Implementation
Migration, spatial dispersion                                              Migrate-n [134], BEAST [135], IMa2, Lamarc [136]
Substitution rates                                                         BEAST
Recombination rates and recombination breakpoints                          Lamarc, LDhat [137], RDP3 [138]
Changes in population sizes                                                Migrate-n, BEAST
Divergence time estimation                                                 BEAST, Multidivtime
Leaf ages                                                                  BEAST
Reproductive number                                                        BEAST
Growth rate                                                                BEAST, Lamarc, Migrate-n
Population divergence times                                                IMa2
Haplotype reconstruction from NGS data and genetic diversity estimation    ShoRAH
Detection of selection                                                     HyPhy [139], ADAPTSITE [140], TREESAAP [141]
Assigning samples to populations, inferring the number of populations      Structurama, Structure, StructHDP
Ancestral state reconstruction                                             BEAST, MESQUITE [142]

NGS = Next-Generation Sequencing.

Estimating phylodynamics

To get the most out of these methods, the sampling strategy is of paramount importance [65]. Sparse sampling can lead to inappropriate inferences, e.g., the inferred direct transmission of HIV-1 subtype C from East Africa to South America [66]. Since phylodynamic inferences rely on “time trees” or dated phylogenies, it is also necessary to calibrate the molecular clock model in use. Because HIV lacks a fossil record, time-stamped sequence data are vital to produce reliable inferences. Also, incorporating independent prior knowledge about substitution rates will help any virus dating effort, generally increasing statistical power. Specialized databases, e.g., for influenza [67] and HIV (http://www.hiv.lanl.gov/), may help in this regard, as they can provide clinically relevant information along with the genetic data (e.g., time of collection).
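The role of time-stamped sequences in clock calibration can be illustrated with a root-to-tip regression. This is a simple strict-clock heuristic chosen here for illustration and is not the Bayesian dating implemented in the software listed in Table 1. Under a constant rate, the genetic distance from the root to each tip grows linearly with sampling date, so the slope of the regression estimates the substitution rate and the x-intercept estimates the date of the root. The function name, sampling years and distances below are invented for the example.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, root_date): the slope (substitutions/site/year) and the
    x-intercept (the date at which the fitted distance equals zero).
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    return rate, -intercept / rate


# Hypothetical data: five tips sampled between 1985 and 2005, with distances
# generated from a perfect clock (0.002 subs/site/year, root in 1950).
sampling_years = [1985, 1990, 1995, 2000, 2005]
root_to_tip = [0.07, 0.08, 0.09, 0.10, 0.11]

rate, root_date = root_to_tip_regression(sampling_years, root_to_tip)
print("rate =", rate, "root date =", root_date)
```

With real data the points scatter around the line, and a weak or negative slope is a warning that the sampling window is too narrow to calibrate the clock without external rate information.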

In essence, most phylodynamic methods capitalize on analyzing distributions of trees, usually obtained by sampling from the posterior distribution of a model given the data. They work on the premise that the shape of a tree reflects dynamic processes impacting the data, such as population size changes, selective processes, and changes in substitution rates (Fig. 2). In short, the coalescent models used in phylodynamic analyses describe a probability distribution on ancestral genealogies given a population history. If we can simulate the genealogies underlying a collection of sequence data, by extension we can infer its population history.
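To make the link between population history and genealogies concrete, the sketch below simulates the time to the most recent common ancestor (TMRCA) under the simplest coalescent model. It assumes a haploid population of constant size N, with time measured in generations, so that while k lineages remain the waiting time to the next coalescence is exponentially distributed with rate k(k-1)/(2N). The function names and parameter values are hypothetical; the models used in practice (varying size, structure, serial sampling) are considerably richer.

```python
import random


def simulate_tmrca(n_samples, pop_size, rng):
    """One draw of the TMRCA (in generations) for n_samples lineages under
    the constant-size coalescent (haploid population of size pop_size)."""
    time = 0.0
    for k in range(n_samples, 1, -1):
        # Rate at which any one of the k*(k-1)/2 pairs of lineages coalesces.
        rate = k * (k - 1) / (2.0 * pop_size)
        time += rng.expovariate(rate)
    return time


def expected_tmrca(n_samples, pop_size):
    """Analytical expectation under the same model: 2N * (1 - 1/n)."""
    return 2.0 * pop_size * (1.0 - 1.0 / n_samples)


rng = random.Random(42)  # fixed seed so the run is reproducible
draws = [simulate_tmrca(10, 1000, rng) for _ in range(5000)]
mean_tmrca = sum(draws) / len(draws)

print("simulated mean TMRCA:", mean_tmrca)
print("expected TMRCA:      ", expected_tmrca(10, 1000))
```

Because the expected tree depth scales with N, deeper genealogies point to larger populations; inference methods exploit this relationship in the reverse direction.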

Figure 2: Phylodynamic patterns – Population, selection and spatial dynamic patterns and their respective idealized trees.

Global spread

The best hypothesis we have regarding the origin and dispersion of HIV indicates that HIV-1 and -2 originated in Africa during the first half of the last century, and that they were the product of several cross-species transmissions between humans and non-human primates [12, 13, 68]. Globally, ~33 million people were living with HIV as of 2009. In the same year, two million infected people died and the epidemic grew at a rate of 7400 new infections per day, more than 97% of which occurred in low- and middle-income countries [69]. The global distribution of groups and subtypes has remained rather constant over the last ten years [70]. Although both HIV lineages spread exponentially at the beginning of the epidemic, HIV-2 is now mostly distributed in Western Africa. In turn, HIV-1 is distributed worldwide, with group M accounting for most infections, while groups O, N and P appear to be concentrated in Central Africa. The times to the most recent common ancestors (TMRCAs) of HIV lineages have been dated using modern molecular clock techniques as follows: 1905-1942 for HIV-2 group A and 1914-1945 for HIV-2 group B [71, 72]. Similarly, HIV-1 TMRCA estimates were as follows: 1894-1931 for group M, 1932-1966 for group N and 1914-1925 for group O [13, 71, 73].

Regional spread

Founder-effect events are thought to play a major role in the spread of HIV out of Africa, although other factors, such as viral selective advantages, sociocultural factors and human genetic background, cannot be ruled out completely. The Democratic Republic of Congo (DRC) is one of the places in which HIV-1 diversity is greatest, and probably the site where cross-species transmission occurred [74-76]. Evidence for this comes from two archival samples, DRC60 and ZR59 [77], plus the presence of almost all group M subtypes [70, 78] (Fig. 3).

Studies worldwide have attempted to infer regional spatial and temporal spread, particularly of HIV-1. In the U.S., HIV-1 subtype B seems to have emerged from a single migration in 1969 out of Haiti, the place with the highest subtype B diversity. In turn, Haitian HIV-1 subtype B emerged in 1966 from the DRC [79-81].

South Africa has recently shown an increase in HIV infections. Almost 6 million people are infected, the majority with HIV-1 subtype C, which now accounts for 50% of all infected individuals worldwide [70, 82]. Subtype C was first reported in 1990 and its TMRCA was dated to 1958 [83, 84]. The spread of subtype C seems to have occurred eastward from South Africa to India [85, 86] and probably also to China, while some founder events have also been identified from East Africa to South America and to Israel [87, 88] (Fig. 3). The introduction of HIV-1 subtype C into South America dates to the 1980s, most likely through Brazil [87, 88]. Nonetheless, in a more comprehensive study, South American subtype C appears to be more closely related to UK subtype C, with these two groups related to East African isolates [66], which stresses the importance of including global isolates in studies of HIV phylogeography.

Besides being present in the Western world, HIV-1 subtype B is also present in Asia, where a variant termed subtype B’ is thought to have been introduced through Thailand in 1985 [89]. From there, HIV-1 subtype B’ expanded into Asia, coexisting with the pandemic subtype B and others, and fueling the development of CRFs across the continent [89-92]. It is worth noting that CRFs represent 20% of all HIV-1 infections, with half of these infections involving CRF02_AG and CRF01_AE [70].

HIV studies implementing phylodynamic methods have addressed a variety of questions, including epidemic origins [79-81, 83-86, 89-92], correlations between epidemiological data and changes in population size or GD, viral spread over geographic regions, and evolutionary processes associated with certain risk groups, such as men who have sex with men (MSM) and intravenous drug users (IDUs). Some studies [93-95] have shown that currently circulating subtype B has been introduced multiple times into European (UK and Italy) MSM populations at intervals ranging from 14 to 30 months. Others have looked at dissemination patterns and transmission routes between risk groups to explain the distribution of HIV strains (e.g., [96]), suggesting spread from heterosexual patients to IDUs.

Figure 3: HIV-1 group M global distribution – Countries are color coded according to their last reported prevalence [70]. Pie charts represent subtypes and circulating recombinant forms distributions over the globe. Arrows represent potential migration routes for A, B, and C subtypes.

Local transmissions

Phylogenies are relevant to reconstructing transmission histories. A successful case in which sequence data have been collated with a known transmission history is, for instance, the Swedish transmission chain [97]. However, caution should be taken, as convergent evolution might be a common theme in HIV-1 evolution [98]. Phylogenetic inference of transmission histories has also been accepted as evidence in court trials, such as the Florida dentist case [99], a Swedish rape case [100], and healthcare-related cases in Baltimore [101] and Louisiana [102]. The reader is referred to a detailed review of the use of HIV phylogenies in forensics [103].

HIV evolution has been thoroughly studied across transmissions because of the opportunity for treatment afforded by a reduction in GD [7, 104]. Most of the work has focused on monitoring discordant couples (i.e., couples in which one partner is HIV positive and the other is not) and on testing whether there is a reduction in GD down to one virus at the transmission event. Edwards et al. [104] showed using phylodynamic methods that the reduction in GD (<1%) is no different between horizontal (homo- or heterosexual) and vertical (mother-to-child) transmission. Understanding GD at transmission events has therapeutic implications, as less diverse populations of small size are strongly influenced by genetic drift, decreasing the chance of transmission of high-fitness variants. Indications that single virus variants were transmitted horizontally came from studies using Sanger sequencing and single-genome amplification coupled with phylogenetic and mathematical modeling [105, 106]. These initial observations were additionally supported by the enhanced capabilities of “ultra-deep sequencing”, which revealed that early HIV variants explored an extensive sequence space within epitope regions. Interestingly, as the infection proceeded, reversion to the canonical subtype sequence occurred at positions under immune pressure but not at positions that were not under pressure, even in the earliest samples, suggesting that immune pressure is present earlier than previously known [7].

Within host dynamics

Different evolutionary forces are in action when we look at viral evolution in within-host data. The viral population is targeted by both cellular and humoral immune responses, resulting in relatively strong diversifying selection. As a result, when reconstructing within-host evolutionary histories, we observe a ladder-like pattern (continual immune selection in Fig. 2 and [107]), as opposed to the strong spatial structure observed in HIV population phylogenies (spatial dynamics in Fig. 2). However, whether within-host HIV-1 evolution is governed by natural selection or genetic drift (deterministic or stochastic models, respectively) has been the subject of considerable debate. This dispute stems from disparate estimates of effective population sizes [108]; the smaller the population size, the more susceptible the population is to genetic drift. However, signatures of natural selection have been demonstrated using different estimators [109]. Recombination also plays an important role in shaping within-host viral populations. Conventionally, it can purge “bad mutations” from the gene pool and “put together” novel combinations of genome regions, increasing allelic diversity. Although it is not entirely clear whether recombination is associated with increased fitness, simulations have shown that under strong drug pressure recombination will favor the appearance of drug resistance variants [110].

Within-host variation has been extensively studied because of the development of drug resistance, viral reservoirs, and clinical conditions related to some tissues, such as dementia and lymphoma. For example, phylodynamic analyses of post mortem brain tissues have revealed that HIV evolves at different rates in different brain compartments, apparently not because of any selective pressure, but rather because of inherent drift associated with macrophage-tropic viral expansion after immune failure [111]. Within-host variation has also been addressed under a phylodynamic framework for co-receptor usage [112]. For example, co-receptor dynamics in tissues and peripheral blood mononuclear cells (PBMC) showed temporal structure between CCR5-tropic (R5) virus and the appearance of CXCR4-tropic (X4) variants, i.e., the majority of X4 virus found in thymus tissue seemed to come from PBMC viruses [112]. Similarly, a high degree of compartmentalization and differences in population size have been described in HIV from macrophages found in tumor and non-tumor tissues, along with an intermixing of HIV strains obtained from axillary lymph nodes [113].

Population structure

Population structure refers to the degree of subdivision or differentiation a population exhibits. It has evolutionary consequences, as subdivided populations can evolve somewhat independently. Examples of structured HIV populations can be found in within-host data from different tissues, which usually exhibit a large degree of compartmentalization [113], as well as in among-host data, i.e., founder effects and the global HIV-1 distribution (Fig. 3). Traditional methods for estimating population structure include summary statistics that measure the diversity of randomly chosen sequence markers within the same sub-population relative to that found in the entire population. These include the fixation index (Fst) and its relatives (Gst, Rst, Dst). Summary statistic methods have been used to assess compartmentalization in lymphocyte reservoir populations [114, 115].
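The idea of comparing within-subpopulation diversity with total diversity can be sketched with one simple sequence-based formulation of the fixation index, Fst = (πT − πS) / πT, where πS is the mean pairwise difference within subpopulations (averaged over subpopulations) and πT is the mean pairwise difference in the pooled sample. This is only one of several estimators and is not necessarily the one used in the studies cited above; the function names and toy sequences are hypothetical. The sketch assumes at least two aligned sequences per subpopulation and a polymorphic pooled sample, and with small samples the estimate can be negative.

```python
from itertools import combinations


def mean_pairwise_diff(seqs):
    """Mean proportion of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    length = len(seqs[0])
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / (len(pairs) * length)


def fst(subpops):
    """Fst = (pi_T - pi_S) / pi_T from a list of subpopulations, each a list
    of aligned sequences (e.g., one list per tissue compartment)."""
    pi_s = sum(mean_pairwise_diff(p) for p in subpops) / len(subpops)
    pooled = [seq for p in subpops for seq in p]
    pi_t = mean_pairwise_diff(pooled)
    return (pi_t - pi_s) / pi_t


# Toy compartments (invented sequences, for illustration only).
fully_separated = [["AAAA", "AAAA"], ["TTTT", "TTTT"]]
partly_separated = [["AAAA", "AAAT"], ["AATT", "AATT"]]

print("Fst, fully separated: ", fst(fully_separated))
print("Fst, partly separated:", fst(partly_separated))
```

Values near 1 indicate that almost all diversity lies between compartments, whereas values near 0 indicate free mixing.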

New methods take advantage of the flexibility of the Bayesian framework to accommodate uncertainties in parameter estimation and to incorporate population history by using the coalescent (see [116]). Previous developments used a fixed number of populations (K) to compute posterior probabilities of assigning individuals to populations [117]. Recently, the model has been modified so that K is treated as a random variable that follows a Dirichlet process prior, so it is no longer necessary to fix K to an arbitrary value [118, 119]. It is likely that these methods will soon be used in studies analyzing HIV tissue differentiation, viral reservoirs, and their link to plasma GD.
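What it means for K to follow a Dirichlet process prior can be illustrated with the Chinese restaurant process, a standard sequential construction of the partitions induced by a Dirichlet process. The sketch below only draws partitions from the prior; it performs no inference and is not the algorithm of any of the programs cited above. The function name, the concentration parameter alpha and the sample sizes are assumptions made for the example.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Assign n_items to clusters one at a time under a Chinese restaurant
    process with concentration alpha; returns the list of cluster sizes."""
    cluster_sizes = []
    for i in range(n_items):
        # Item i opens a new cluster with probability alpha / (i + alpha);
        # otherwise it joins an existing cluster with probability
        # proportional to that cluster's current size.
        r = rng.uniform(0, i + alpha)
        if not cluster_sizes or r < alpha:
            cluster_sizes.append(1)
            continue
        r -= alpha
        for idx, size in enumerate(cluster_sizes):
            if r < size:
                cluster_sizes[idx] += 1
                break
            r -= size
        else:
            cluster_sizes[-1] += 1  # guard against floating-point rounding
    return cluster_sizes


rng = random.Random(7)  # fixed seed so the run is reproducible
# Prior draws of K (number of populations) for 50 sampled individuals.
k_draws = [len(crp_partition(50, 1.0, rng)) for _ in range(1000)]
print("mean prior K:", sum(k_draws) / len(k_draws))
print("range of K:  ", min(k_draws), "to", max(k_draws))
```

The number of clusters varies from draw to draw, which is exactly the property that removes the need to fix K in advance; larger values of alpha favor more populations.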

Implications of the Genetic Diversity of HIV

Estimating the GD of HIV and mapping its distribution across space and time has clinical implications in terms of disease progression, drug resistance, and vaccine development.

Disease progression and drug resistance

Disease progression is seemingly related to GD. Several studies have found a correlation between them when estimating relative and absolute substitution rates from time-stamped sequence data [10, 120, 121]. In general, three phases can be identified on the basis of diversity: 1) diversity increases linearly in association with initial features of infection and R5 HIV populations; 2) diversity levels off or even decreases, which could be correlated with the appearance of X4 HIV populations; 3) CD4+ T cells decline and T cell homeostasis fails, with a general reduction in GD. However, others have reported no relationship between disease progression and substitution rates when these are analyzed separately as adaptive and neutral categories of variation [9].

Disease progression is related to Transmitted Drug Resistance (TDR) and the clinical impact of low-frequency variants. TDR variants are thought to exhibit diminished fitness compared to “wild types”, although additional compensatory mutations might restore fitness levels [122]. Low-frequency variants, typically characterized using NGS, have clinical significance, especially when the drug has a low genetic barrier to resistance, i.e., when few mutations rather than several are needed to confer resistance. Non-nucleoside reverse transcriptase inhibitors and protease inhibitors are examples of low and high genetic barriers, respectively [123, 124]. TDRs seem fairly common, ~30% (B and non-B subtypes), whereas subjects with multiple protease TDRs are infrequent [125].

Vaccine strategies

Designing an HIV vaccine is an extraordinarily difficult challenge. A successful vaccine must be capable of eliciting broadly cross-reactive neutralizing antibodies in order to cope with the extreme GD of HIV [121]. Strategies have considered HIV geographic distribution and HLA allele frequencies in different countries or regions [120]. This makes sense in light of HIV geographic structure (see above). Thus, in southwest China a vaccine must be capable of neutralizing mainly HIV-1 M C/B’ recombinants, whereas a South African vaccine must consider, among others, HIV-1 M subtype C variants (Fig. 3) [126, 127]. Although this strategy seems sound, several considerations have led to the search for a global vaccine. Even in countries where the HIV epidemic is dominated by one subtype, e.g., India, there are other subtypes that might increase in frequency if the vaccine targets just one variant. Also, apart from cost concerns, regions with high HIV prevalence, such as central and southern Africa, have complex epidemics that will be less impacted by single-subtype vaccines (pie charts in Fig. 3). Some researchers have tried to account for HIV GD by looking for potential T-cell epitopes within the three-dimensional structure of env [128].

Novel strategies for vaccine design include, but are not limited to, poly-epitope vaccines, use of conserved regions of the proteome, central vaccines (ancestral state, consensus, center-of-the-tree), and polyvalent mosaic vaccines. For details the reader is referred to Korber et al. [129]. Central vaccines are appealing from a phylogenetic standpoint and have also proven useful [130]. The idea is to maximize GD coverage by inferring the ancestral sequence of a particular clade or tree region, under the assumption that such a sequence, once translated, will elicit antibodies against any and every member of the clade or group of interest. In effect, central vaccines are designed to minimize the genetic distance to circulating strains.
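The distance-minimizing rationale can be illustrated with the consensus variant of a central sequence, which is the simplest of the three to compute because it does not require a tree. Taking the most frequent state in each alignment column yields a sequence that minimizes the summed Hamming distance to the sampled strains. The sketch below uses invented nucleotide sequences and hypothetical function names; ancestral-state and center-of-the-tree sequences would additionally require a phylogeny and a substitution model, and real designs operate on much larger protein alignments.

```python
from collections import Counter


def consensus(seqs):
    """Most frequent state in each column of an ungapped alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def total_hamming(candidate, seqs):
    """Summed number of mismatches between a candidate and every strain."""
    return sum(sum(a != b for a, b in zip(candidate, s)) for s in seqs)


# Toy set of circulating strains (invented, for illustration only).
strains = ["ACGT", "ACGA", "TCGT", "ACTT", "ACGT"]

central = consensus(strains)
print("consensus:", central)
print("distance from consensus to all strains:", total_hamming(central, strains))
for s in strains:
    print("distance from", s, "to all strains:", total_hamming(s, strains))
```

No sampled strain has a smaller summed distance to the set than the consensus, which is the sense in which a central sequence is closer on average to circulating diversity than any single isolate.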

Recent research has focused on characterizing broadly neutralizing antibodies (bNAbs) that occur naturally. Independently and using different approaches, Wu et al. [131] and Scheid et al. [132] converged in identifying CD4 binding sites in antibodies derived from multiple unrelated HIV-infected individuals. These promising results have obvious implications for vaccine design, as bNAbs can prevent new cells from becoming infected.

Conclusion

Understanding the geographic and temporal spread of HIV genetic diversity is crucial for effective intervention strategies to combat infection, whether through educational programs, drug treatments, or vaccine strategies. New sequencing technologies have the potential to improve our understanding of the patterns and processes that lead to the temporal and spatial spread of HIV. Methods such as “ultra-deep sequencing” have already impacted our understanding of HIV in terms of within-host variation: emergence and transmission of drug resistance, identification of super-infections, and characterization of early dynamics and immune escape. At the population level, NGS has provided data to test for co-evolving sites and has added to our knowledge about the clinical impact of low-frequency variants. Phylogenetic approaches provide a sound probabilistic framework for analyzing NGS data to gain better insights into the impact of genetic diversity.

Future Perspective

NGS data have already demonstrated their utility in HIV biology; nevertheless, there are sequencing applications and aspects of phylodynamics that are presently beyond the reach of current sequencing technologies, leaving room for additional innovation. For the continued progress of phylodynamics, it is imperative to capitalize on and incorporate, in a “biologist-accessible” manner, the power of this technology and of the newly arriving “third-generation sequencing technologies”, namely single molecule sequencing (PacBio, Helicos) and semiconductor sequencing technology (Ion Torrent) [133]. Additionally, an increase in the number of epidemiological models coupled with phylogenetic models will provide researchers with more flexibility when applying these tools. Examples of this are models borrowed from epidemiology [62] and other recent advances to estimate epidemiological parameters directly from sequence data [63, 64]. Over the next 5–10 years, with this rapid increase in our capacity to generate data, much effort will be needed to expand our models of HIV evolution to incorporate the large volume of data and to test key hypotheses concerning HIV evolution and disease progression.

Executive Summary

• HIV high genetic diversity stems at least from three different sources, namely

high mutation and recombination rates, high replication error and multiple

introductions into human populations.

• Traditional measures of genetic diversity (summary statistics and coalescent

estimators) have helped to understand HIV diversity and distribution. Next-

Generation Sequencing (NGS) is providing a new dimension to explore within

host and population diversity in depth.

• The integration of ecological, epidemiological and evolutionary processes through

the use of sequence data under a phylogenetic framework (phylodynamic) has

allowed us to investigate changes in population size over time, estimate relevant

parameters as the reproductive number and substitution rates, and most notably,

the dispersion of lineages over geographic area.

• Through the use of phylogenetics, we now know that HIV-1 and HIV-2 originated in Central and West Africa, respectively. HIV-1, sub-classified into groups and subtypes, is present worldwide and exhibits a broad geographic distribution, probably because of several founder-effect events.

• Evolutionary forces within and among hosts differ. Within-host HIV-1 evolution seems to be governed by continual immune selection, whereas among-host evolution seems to be ruled by founder effects, as reflected in the strong geographic structure observed.

• HIV's high genetic diversity has implications for almost every aspect of its biology, including vaccine design, drug treatment, disease progression, viral reservoirs, transmissibility, and viral load testing. It also allows us to apply ambitious statistical models to interrogate these very same aspects. Moreover, this high genetic diversity has made HIV a model system for testing new developments in phylogenetics and evolutionary theory.

• Over the next 5–10 years, it is very likely that phylodynamic studies will integrate with NGS and third-generation sequencing, which, in turn, will require HIV biologists to better understand the strengths and weaknesses of these methods.

References

Papers of special note have been highlighted as:

• of interest

•• of considerable interest

1. Laikre L, Allendorf FW, Aroner LC et al.: Neglect of Genetic Diversity in

Implementation of the Convention on Biological Diversity. Conservation Biology

24(1), 86-88 (2010).

2. Tishkoff SA, Reed FA, Friedlaender FOR et al.: The Genetic Structure and

History of Africans and African Americans. Science 324(5930), 1035-1044

(2009).

3. De Oliveira T, Pybus O, Rambaut A et al.: Molecular Epidemiology: HIV-1 and

HCV sequences from Libyan outbreak. Nature 444(7121), 836-837 (2006).

4. Groeneveld L, Lenstra J, Eding H et al.: Genetic diversity in farm animals--a

review. Anim Genet 41 Suppl 1, 6-31 (2010).

5. Agashe D, Falk J, Bolnick D: Effects of Founding Genetic Variation on

Adaptation to a Novel Resource. Evolution 65(9), 2481-2491 (2011).

6. Merlo LM, Maley CC: The role of genetic diversity in cancer. J Clin Invest

120(2), 401-403 (2010).

7. Fischer W, Ganusov V, Giorgi E et al.: Transmission of Single HIV-1 Genomes

and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing.

PLoS ONE 5(8), e12303 (2010).

• Characterization of early HIV dynamics by ultra-deep sequencing showing

the reduction of GD at the point of transmission and early diversification of

transmitted viruses.

8. Ross LL, Weinberg WG, Dejesus E et al.: Impact of Low Abundance HIV

Variants on Response to Ritonavir-Boosted Atazanavir or Fosamprenavir Given

Once Daily with Tenofovir/Emtricitabine in Antiretroviral-Naive HIV-Infected

Patients. Aids Res Hum Retrov 26(4), 407-417 (2010).

9. Carvajal-Rodriguez A, Posada D, Pérez-Losada M et al.: Disease progression and

evolution of the HIV-1 env gene in 24 infected infants. Infect Genet Evol 8(2),

110-120 (2008).

10. Lemey P, Pond SLK, Drummond AJ et al.: Synonymous substitution rates predict

HIV disease progression as a result of underlying replication dynamics. Plos

Comput Biol 3(2), 282-292 (2007).

11. Salemi M, Strimmer K, Hall WW et al.: Dating the common ancestor of SIVcpz

and HIV-1 group M and the origin of HIV-1 subtypes using a new method to

uncover clock-like molecular evolution. FASEB J. 15(2), 276-278 (2001).

12. Sharp P, Bailes E, Gao F, Beer B, Hirsch V, Hahn B: Origins and evolution of

AIDS viruses: estimating the time-scale. Biochem. Soc. Trans. 28(2), 275-282

(2000).

13. Korber B, Muldoon M, Theiler J et al.: Timing the Ancestor of the HIV-1

Pandemic Strains. Science 288(5472), 1789-1796 (2000).

14. Hu W, Temin H: Retroviral recombination and reverse transcription. Science

250(4985), 1227-1233 (1990).

15. Shriner D, Rodrigo AG, Nickle DC, Mullins JI: Pervasive Genomic

Recombination of HIV-1 in Vivo. Genetics 167(4), 1573-1583 (2004).

16. Preston B, Poiesz B, Loeb L: Fidelity of HIV-1 reverse transcriptase. Science

242(4882), 1168-1171 (1988).

17. Wei X, Ghosh SK, Taylor ME et al.: Viral dynamics in human immunodeficiency

virus type 1 infection. Nature 373(6510), 117-122 (1995).

18. Holmes EC: On the origin and evolution of the human immunodeficiency virus

(HIV). Biol Rev 76(2), 239-254 (2001).

19. Plantier J, Leoz M, Dickerson J et al.: A new human immunodeficiency virus

derived from gorillas. Nat Med 15(8), 871-872 (2009).

20. Robertson DL, Anderson JP, Bradac JA et al.: HIV-1 Nomenclature Proposal.

Science 288(5463), 55 (2000).

21. Ibe S, Yokomaku Y, Shiino T et al.: Hiv-2 Crf01_Ab: First Circulating

Recombinant Form of Hiv-2. Jaids-J Acq Imm Def 54(3), 241-247 (2010).

• Circulating recombinant forms also occur in HIV-2. Using phylogenetic

methods, the authors demonstrated the recombinant nature of HIV-2

isolates.

22. Cheng Z, Ventura M, She X et al.: A genome-wide comparison of recent

chimpanzee and human segmental duplications. Nature 437(7055), 88-93 (2005).

23. Peeters M, Sharp PM: Genetic diversity of HIV-1: the moving target. Aids 14,

S129-S140 (2000).

24. Barroso H, Borrego P, Bartolo I et al.: Evolutionary and Structural Features of the

C2, V3 and C3 Envelope Regions Underlying the Differences in HIV-1 and HIV-

2 Biology and Infection. Plos One 6(1), (2011).

25. Kanki PJ, Travers KU, Marlink RG et al.: Slower heterosexual spread of HIV-2

than HIV-1. The Lancet 343(8903), 943-946 (1994).

26. Leligdowicz A, Yindom L-M, Onyango C et al.: Robust Gag-specific T cell

responses characterize viremia control in HIV-2 infection. The Journal of Clinical

Investigation 117(10), 3067-3074 (2007).

27. Kanki PJ, Hamel DJ, Sankale JL et al.: Human immunodeficiency virus type 1

subtypes differ in disease progression. J Infect Dis 179(1), 68-73 (1999).

28. Tscherning C, Alaeus A, Fredriksson R et al.: Differences in chemokine

coreceptor usage between genetic subtypes of HIV-1. Virology 241(2), 181-188

(1998).

29. Kiwanuka N, Laeyendecker O, Quinn TC et al.: HIV-1 subtypes and differences

in heterosexual HIV transmission among HIV-discordant couples in Rakai,

Uganda. Aids 23(18), 2479-2484 (2009).

30. Tajima F: Evolutionary Relationship of DNA-Sequences in Finite Populations.

Genetics 105(2), 437-460 (1983).

31. Nei M: Phylogenetic analysis in molecular evolutionary genetics. Annu Rev Genet

30, 371-403 (1996).

32. Watterson GA: On the number of segregating sites in genetical models without

recombination. Theoretical Population Biology 7(2), 256-276 (1975).

33. Kuhner MK: Coalescent genealogy samplers: windows into population history.

24(2), 86-93 (2009).

• Summary of the diversity and features of current coalescent-based estimators of

phylodynamic-related parameters. Good reference to compare current

implementations.

34. Kuhner MK, Smith LP: Comparing likelihood and Bayesian coalescent estimation

of population parameters. Genetics 175(1), 155-165 (2007).

35. Kuhner MK: Robustness of coalescent estimators to between-lineage mutation

rate variation. Molecular biology and evolution 23(12), 2355-2360 (2006).

36. Kuhner MK, Yamato J, Felsenstein J: Estimating effective population size and

mutation rate from sequence data using Metropolis-Hastings sampling. Genetics

140, 1421-1430 (1995).

37. Crandall K, Posada D, Vasco D: Effective population sizes: missing measures and

missing concepts. Anim Conserv 2(4), 317-319 (1999).

38. Wang J: Estimation of effective population sizes from data on genetic markers.

Philosophical Transactions of the Royal Society B: Biological Sciences

360(1459), 1395-1409 (2005).

39. Schwartz MK, Tallmon DA, Luikart G: Review of DNA-based census and

effective population size estimators. Anim Conserv 1(4), 293-299 (1998).

40. Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet

11(1), 31-46 (2010).

• Excellent review covering next-generation sequencing technologies, their

pros and cons, and applications.

41. Vrancken B, Lequime S, Theys K, Lemey P: Covering All Bases in HIV

Research: Unveiling a Hidden World of Viral Evolution. Aids Rev 12(2), 89-102

(2010).

42. Zagordi O, Klein R, Daumer M, Beerenwinkel N: Error correction of next-

generation sequencing data and reliable estimation of HIV quasispecies. Nucleic

Acids Res 38(21), 7400 - 7409 (2010).

43. Zagordi O, Geyrhofer L, Roth V, Beerenwinkel N: Deep sequencing of a

genetically heterogeneous sample: local haplotype reconstruction and read error

correction. J Comput Biol 17(3), 417 - 428 (2010).

44. Zagordi O, Bhattacharya A, Eriksson N, Beerenwinkel N: ShoRAH: estimating

the genetic diversity of a mixed sample from next-generation sequencing data.

BMC Bioinformatics 12(1), 119 (2011).

• Original description of an algorithm that can take NGS data and return

genetic diversity measures and haplotype number and frequencies using a

Bayesian framework.

45. Willerth SM, Pedro HaM, Pachter L, Humeau LM, Arkin AP, Schaffer DV:

Development of a Low Bias Method for Characterizing Viral Populations Using

Next Generation Sequencing Technology. Plos One 5(10), e13564 (2010).

46. Henn M, Boutwell C, Lennon N et al.: P09-20 LB. Ultra-deep sequencing of full-

length HIV-1 genomes identifies rapid viral evolution during acute infection.

Retrovirology 6(Suppl 3), P400 (2009).

47. Fischer W, Keele B, Bhattacharya T et al.: P09-21 LB. Deep sequencing of HIV-

1 from acute infection: low initial diversity, and rapid but variable CTL escape.

Retrovirology 6(0), 1-1 (2009).

48. Poon AFY, Swenson LC, Dong WWY et al.: Phylogenetic Analysis of

Population-Based and Deep Sequencing Data to Identify Coevolving Sites in the

nef Gene of HIV-1. Molecular biology and evolution 27(4), 819-832 (2010).

49. Redd AD, Collinson-Streng A, Martens C et al.: Identification of HIV

superinfection in seroconcordant couples in Rakai, Uganda using next generation

deep sequencing. J. Clin. Microbiol., JCM.00804-00811 (2011).

50. Drummond A, Ho S, Phillips M, Rambaut A: Relaxed phylogenetics and dating

with confidence. Plos Biol 4(5), 699-710 (2006).

51. Thorne JL, Kishino H: Divergence time and evolutionary rate estimation with

multilocus data. Syst Biol 51(5), 689-702 (2002).

52. Drummond A, Suchard M: Bayesian random local clocks, or one rate to rule them

all. Bmc Biol 8, (2010).

53. Drummond A, Rambaut A, Shapiro B, Pybus O: Bayesian coalescent inference of

past population dynamics from molecular sequences. Molecular biology and

evolution 22(5), 1185-1192 (2005).

54. Lemey P, Rambaut A, Welch JJ, Suchard MA: Phylogeography Takes a Relaxed

Random Walk in Continuous Space and Time. Molecular biology and evolution

27(8), 1877-1885 (2010).

55. Lemey P, Rambaut A, Drummond AJ, Suchard MA: Bayesian Phylogeography

Finds Its Roots. Plos Comput Biol 5(9), - (2009).

56. Beerli P, Felsenstein J: Maximum likelihood estimation of a migration matrix and

effective population sizes in n subpopulations by using a coalescent approach.

Proceedings of the National Academy of Sciences, U.S.A. 98(8), 4563-4568

(2001).

57. Hey J: Isolation with Migration Models for More Than Two Populations.

Molecular biology and evolution 27(4), 905-920 (2010).

58. Shapiro B, Ho SYW, Drummond AJ, Suchard MA, Pybus OG, Rambaut A: A

Bayesian Phylogenetic Method to Estimate Unknown Sequence Ages. Molecular

biology and evolution 28(2), 879-887 (2011).

59. Choi S, Hey J: Joint Inference of Population Assignment and Demographic

History. Genetics, (2011).

60. Huelsenbeck JP, Andolfatto P: Inference of population structure under a Dirichlet

process model. Genetics 175(4), 1787-1802 (2007).

61. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using

multilocus genotype data. Genetics 155(2), 945-959 (2000).

62. Volz EM, Pond SLK, Ward MJ, Brown AJL, Frost SDW: Phylodynamics of

Infectious Disease Epidemics. Genetics 183(4), 1421-1430 (2009).

• Excellent article that deals with phylodynamic patterns and their associated

processes. It also describes algorithms to integrate epidemiological and

phylogenetic models.

63. Stadler T: Sampling-through-time in birth-death trees. Journal of Theoretical

Biology 267(3), 396-404 (2010).

64. Stadler T, Kouyos R, Von Wyl V et al.: Estimating the basic reproductive number

from viral sequence data. Molecular biology and evolution, (2011).

65. Stack JC, Welch JD, Ferrari MJ, Shapiro BU, Grenfell BT: Protocols for sampling

viral sequences to study epidemic dynamics. J R Soc Interface 7(48), 1119-1127

(2010).

66. De Oliveira T, Pillay D, Gifford R, Resistance UCGHD: The HIV-1 Subtype C

Epidemic in South America Is Linked to the United Kingdom. Plos One 5(2),

(2010).

67. Bao Y, Bolotov P, Dernovoy D et al.: The influenza virus resource at the national

center for biotechnology information. J Virol 82(2), 596-601 (2008).

68. Salemi M, Strimmer K, Hall WW et al.: Dating the common ancestor of SIVcpz

and HIV-1 group M and the origin of HIV-1 subtypes using a new method to

uncover clock-like molecular evolution. Faseb J 15(2), 276-278 (2001).

69. Unaids: Global report: UNAIDS report on the global AIDS epidemic 2010.

(2010).

70. Hemelaar J, Gouws E, Ghys PD, Osmanov S, Who-Unaids: Global trends in

molecular epidemiology of HIV-1 during 2000-2007. Aids 25(5), 679-689 (2011).

71. Wertheim JO, Worobey M: Dating the Age of the SIV Lineages That Gave Rise

to HIV-1 and HIV-2. Plos Comput Biol 5(5), e1000377 (2009).

72. Lemey P, Pybus OG, Wang B, Saksena NK, Salemi M, Vandamme A-M: Tracing

the origin and history of the HIV-2 epidemic. Proceedings of the National

Academy of Sciences of the United States of America 100(11), 6588-6592 (2003).

73. Lemey P, Pybus OG, Rambaut A et al.: The molecular population genetics of

HIV-1 group O. Genetics 167(3), 1059-1068 (2004).

74. Gao F, Bailes E, Robertson D et al.: Origin of HIV-1 in the chimpanzee Pan

troglodytes troglodytes. Nature 397(6718), 436-441 (1999).

75. Santiago ML, Bibollet-Ruche F, Bailes E et al.: Amplification of a Complete

Simian Immunodeficiency Virus Genome from Fecal RNA of a Wild

Chimpanzee. J. Virol. 77(3), 2233-2242 (2003).

76. Santiago ML, Rodenburg CM, Kamenya S et al.: SIVcpz in Wild Chimpanzees.

Science 295(5554), 465- (2002).

77. Worobey M, Gemmel M, Teuwen DE et al.: Direct evidence of extensive

diversity of HIV-1 in Kinshasa by 1960. Nature 455(7213), 661-664 (2008).

78. Vidal N, Peeters M, Mulanga-Kabeya C et al.: Unprecedented Degree of Human

Immunodeficiency Virus Type 1 (HIV-1) Group M Genetic Diversity in the

Democratic Republic of Congo Suggests that the HIV-1 Pandemic Originated in

Central Africa. J. Virol. 74(22), 10498-10507 (2000).

79. Gilbert M, Rambaut A, Wlasiuk G, Spira T, Pitchenik A, Worobey M: The

emergence of HIV/AIDS in the Americas and beyond. Proc Natl Acad Sci U S A,

(2007).

80. Pérez-Losada M, Jobes DV, Sinangil F, Crandall KA, Posada D, Berman PW:

Phylodynamics of HIV-1 from a Phase-III AIDS Vaccine Trial in North America.

Molecular biology and evolution 27(2), 417-425 (2010).

81. Robbins KE, Lemey P, Pybus OG et al.: U.S. Human Immunodeficiency Virus

Type 1 Epidemic: Date of Origin, Population History, and Characterization of

Early Strains. J. Virol. 77(11), 6359-6366 (2003).

82. Unaids: AIDS Epidemic Update 2010. 1-100 (2010).

83. Rousseau CM, Learn GH, Bhattacharya T et al.: Extensive intrasubtype

recombination in South African human immunodeficiency virus type I subtype C

infections. J Virol 81(9), 4492-4500 (2007).

84. Ayehunie S, Johansson B, Sonnerborg A et al.: New Subtype of Hiv-1 in

Ethiopia. Lancet 336(8720), 942-942 (1990).

85. Sridharan G, Kandathil AJ, Ramalingam S, Kannangai R, David S: Molecular

epidemiology of HIV. Indian J Med Res 121(4), 333-344 (2005).

86. Shankarappa R, Gupta P, Chatterjee R et al.: Human immunodeficiency virus

type 1 env sequences from Calcutta in eastern India: Identification of features that

distinguish subtype C sequences in India from other subtype C sequences. J Virol

75(21), 10479-10487 (2001).

87. Fontella R, Schrago C, Soares M: On the origin of HIV-1 subtype C in South

America. Aids 22(15), 2001-2011 (2008).

88. Bello G, Passaes C, Guimaraes ML et al.: Origin and evolutionary history of

HIV-1 subtype C in Brazil. Aids 22(15), 1993-2000 (2008).

89. Deng X, Yang R, Liu H, Shao Y, Rayner S: The epidemic origin and molecular

properties of B ': a founder strain of the HIV-1 transmission in Asia. Aids 22(14),

1851-1858 (2008).

90. Piyasirisilp S, Mccutchan FE, Carr JK et al.: A recent outbreak of human

immunodeficiency virus type 1 infection in southern China was initiated by two

highly homogeneous, geographically separated strains, circulating recombinant

form AE and a novel BC recombinant. J Virol 74(23), 11286-11295 (2000).

91. De Silva U, Warachit J, Sattagowit N et al.: Genotypic Characterization of HIV

Type 1 env gp160 Sequences from Three Regions in Thailand. Aids Res Hum

Retrov 26(2), 223-227 (2010).

92. Su L, Graf M, Zhang YZ et al.: Characterization of a virtually full-length human

immunodeficiency virus type 1 genome of a prevalent intersubtype (C/B')

recombinant strain in China. J Virol 74(23), 11367-11376 (2000).

93. Hué S, Pillay D, Clewley JP, Pybus OG: Genetic analysis reveals the complex

structure of HIV-1 transmission within defined risk groups. Proceedings of the

National Academy of Sciences of the United States of America 102(12), 4425-

4429 (2005).

• Phylodynamic study disentangling the complex structure of HIV in

populations of MSM in the UK. It shows that HIV has been introduced

several times into that population.

94. Zehender G, Ebranati E, Lai A et al.: Population Dynamics of HIV-1 Subtype B

in a Cohort of Men-Having-Sex-With-Men in Rome, Italy. JAIDS Journal of

Acquired Immune Deficiency Syndromes 55, 156-160 (2010).

95. Lewis F, Hughes GJ, Rambaut A, Pozniak A, Brown AJL: Episodic sexual

transmission of HIV revealed by molecular phylodynamics. Plos Med 5(3), 392-

402 (2008).

96. Liao HA, Tee KK, Hase S et al.: Phylodynamic analysis of the dissemination of

HIV-1 CRF01_AE in Vietnam. Virology 391(1), 51-56 (2009).

97. Leitner T, Escanilla D, Franzén C, Uhlén M, Albert J: Accurate reconstruction of

a known HIV-1 transmission history by phylogenetic tree analysis. Proceedings

of the National Academy of Sciences 93(20), 10864-10869 (1996).

98. Lemey P, Derdelinckx I, Rambaut A et al.: Molecular Footprint of Drug-Selective

Pressure in a Human Immunodeficiency Virus Transmission Chain. J. Virol.

79(18), 11981-11989 (2005).

99. Ou C-Y, Ciesielski CA, Myers G et al.: Molecular Epidemiology of HIV

Transmission in a Dental Practice. Science 256(5060), 1165-1171 (1992).

100. Albert J, Wahlberg J, Leitner T, Escanilla D, Uhlen M: Analysis of a rape case by

direct sequencing of the human immunodeficiency virus type 1 pol and gag genes.

J. Virol. 68(9), 5918-5924 (1994).

101. Holmes EC, Zhang LQ, Simmonds P, Rogers AS, Brown AJL: Molecular

Investigation of Human-Immunodeficiency-Virus (Hiv) Infection in a Patient of

an Hiv-Infected Surgeon. Journal of Infectious Diseases 167(6), 1411-1414

(1993).

102. Metzker ML, Mindell DP, Liu XM, Ptak RG, Gibbs RA, Hillis DM: Molecular

evidence of HIV-1 transmission in a criminal case. Proceedings of the National

Academy of Sciences of the United States of America 99(22), 14292-14297

(2002).

103. Bernard E, Azad Y, Vandamme A, Weait M, Geretti A: HIV forensics: pitfalls

and acceptable standards in the use of phylogenetic analysis as evidence in

criminal investigations of HIV transmission. HIV Medicine 8(6), 382-387 (2007).

104. Edwards C, Holmes E, Wilson D et al.: Population genetic estimation of the loss

of genetic diversity during horizontal transmission of HIV-1. Bmc Evol Biol 6, -

(2006).

•• Relevant study showing the loss of GD across vertical and horizontal

transmissions. It also illustrates the use of phylodynamic approaches to

answer clinically important questions.

105. Keele BF, Giorgi EE, Salazar-González JF et al.: Identification and

characterization of transmitted and early founder virus envelopes in primary HIV-

1 infection. Proceedings of the National Academy of Sciences 105(21), 7552-7557

(2008).

106. Salazar-González JF, Bailes E, Pham KT et al.: Deciphering Human

Immunodeficiency Virus Type 1 Transmission and Early Envelope

Diversification by Single-Genome Amplification and Sequencing. J. Virol. 82(8),

3952-3970 (2008).

107. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, Holmes EC:

The genomic and epidemiological dynamics of human influenza A virus. Nature

453(7195), 615-U612 (2008).

108. Brown AJL: Analysis of HIV-1 env gene sequences reveals evidence for a low

effective number in the viral population. Proceedings of the National Academy of

Sciences of the United States of America 94(5), 1862-1865 (1997).

109. Crandall KA, Kelsey CR, Imamichi H, Lane HC, Salzman NP: Parallel evolution

of drug resistance in HIV: Failure of nonsynonymous/synonymous substitution

rate ratio to detect selection. Molecular biology and evolution 16(3), 372-382

(1999).

110. Carvajal-Rodríguez A, Crandall K, Posada D: Recombination favors the evolution

of drug resistance in HIV-1 during antiretroviral therapy. Infection, Genetics and

Evolution 7(4), 476-483 (2007).

111. Salemi M, Lamers SL, Yu S, De Oliveira T, Fitch WM, Mcgrath MS:

Phylodynamic analysis of human immunodeficiency virus type 1 in distinct brain

compartments provides a model for the neuropathogenesis of AIDS. J Virol

79(17), 11343-11352 (2005).

112. Salemi M, Burkhardt BR, Gray RR, Ghaffari G, Sleasman JW, Goodenow MM:

Phylodynamics of HIV-1 in Lymphoid and Non-Lymphoid Tissues Reveals a

Central Role for the Thymus in Emergence of CXCR4-Using Quasispecies. PLoS

ONE 2(9), - (2007).

113. Salemi M, Lamers SL, Huysentruyt LC et al.: Distinct Patterns of HIV-1

Evolution within Metastatic Tissues in Patients with Non-Hodgkins Lymphoma.

PLoS ONE 4(12), - (2009).

114. Potter SJ, Lemey P, Achaz G et al.: HIV-1 compartmentalization in diverse

leukocyte populations during antiretroviral therapy. J Leukoc Biol 76(3), 562-570

(2004).

115. Potter SJ, Lemey P, Dyer WB et al.: Genetic analyses reveal structured HIV-1

populations in serially sampled T lymphocytes of patients receiving HAART.

Virology 348(1), 35-46 (2006).

116. Pearse DE, Crandall KA: Beyond FST: Analysis of population genetic data for

conservation. Conservation Genetics 5(5), 585-602 (2004).

117. Hubisz MJ, Falush D, Stephens M, Pritchard JK: Inferring weak population

structure with the assistance of sample group information. Molecular Ecology

Resources 9(5), 1322-1332 (2009).

118. Huelsenbeck JP, Andolfatto P, Huelsenbeck ET: Structurama: bayesian inference

of population structure. Evol Bioinform Online 7, 55-59 (2011).

119. Shringarpure S, Won D, Xing EP: StructHDP: automatic inference of number of

clusters and population structure from admixed genotype data. Bioinformatics

27(13), i324-332 (2011).

120. Shankarappa R, Margolick JB, Gange SJ et al.: Consistent viral evolutionary

changes associated with the progression of human immunodeficiency virus type 1

infection. J Virol 73(12), 10489-10502 (1999).

121. Lee HY, Perelson AS, Park SC, Leitner T: Dynamic Correlation between

Intrahost HIV-1 Quasispecies Evolution and Disease Progression. Plos Comput

Biol 4(12), - (2008).

122. Llibre JM, Schapiro JM, Clotet B: Clinical Implications of Genotypic Resistance

to the Newer Antiretroviral Drugs in HIV-1–Infected Patients with Virological

Failure. Clinical Infectious Diseases 50(6), 872-881 (2010).

123. Van Laethem K, De Munter P, Schrooten Y et al.: No response to first-line

tenofovir+lamivudine+efavirenz despite optimization according to baseline

resistance testing: Impact of resistant minority variants on efficacy of low genetic

barrier drugs. Journal of clinical virology : the official publication of the Pan

American Society for Clinical Virology 39(1), 43-47 (2007).

124. Pingen M, Nijhuis M, De Bruijn JA, Boucher CaB, Wensing AMJ: Evolutionary

pathways of transmitted drug-resistant HIV-1. Journal of Antimicrobial

Chemotherapy 66(7), 1467-1480 (2011).

125. Lataillade M, Chiarella J, Yang R et al.: Prevalence and clinical significance of

HIV drug resistance mutations by ultra-deep sequencing in antiretroviral-naive

subjects in the CASTLE study. PLoS ONE 5(6), e10952 (2010).

126. Burgers WA, Shephard E, Monroe JE et al.: Construction, characterization, and

immunogenicity of a multigene modified vaccinia Ankara (MVA) vaccine based

on HIV type 1 subtype C. AIDS Res Hum Retroviruses 24(2), 195-206 (2008).

127. Chen Z, Huang Y, Zhao X, Ba L, Zhang W, Ho DD: Design, construction, and

characterization of a multigenic modified vaccinia Ankara candidate vaccine

against human immunodeficiency virus type 1 subtype C/B'. J Acquir Immune

Defic Syndr 47(4), 412-421 (2008).

128. Korber B, Gnanakaran S: The implications of patterns in HIV diversity for

neutralizing antibody induction and susceptibility. Curr Opin HIV AIDS 4(5),

408-417 (2009).

•• Excellent review covering novel approaches to vaccine design and their

known performances. It summarizes novel attempts to cope with HIV GD in

the design of vaccines.

129. Korber BT, Letvin NL, Haynes BF: T-cell vaccine strategies for human

immunodeficiency virus, the virus with a thousand faces. J Virol 83(17), 8300-

8314 (2009).

130. Rolland M, Jensen MA, Nickle DC et al.: Reconstruction and function of

ancestral center-of-tree human immunodeficiency virus type 1 proteins. J Virol

81(16), 8507-8514 (2007).

131. Wu X, Zhou T, Zhu J et al.: Focused Evolution of HIV-1 Neutralizing Antibodies

Revealed by Structures and Deep Sequencing. Science 333(6049), 1593-1602

(2011).

132. Scheid JF, Mouquet H, Ueberheide B et al.: Sequence and Structural

Convergence of Broad and Potent HIV Antibodies That Mimic CD4 Binding.

Science 333(6049), 1633-1637 (2011).

133. Schadt EE, Turner S, Kasarskis A: A window into third-generation sequencing.

Human Molecular Genetics 19(R2), R227-R240 (2010).

134. Beerli P, Palczewski M: Unified Framework to Evaluate Panmixia and Migration

Direction Among Multiple Sampling Locations. Genetics 185(1), 313-326 (2010).

135. Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling

trees. Bmc Evol Biol 7, - (2007).

136. Kuhner MK: LAMARC 2.0: maximum likelihood and Bayesian estimation of

population parameters. Bioinformatics 22(6), 768-770 (2006).

137. Auton A, Mcvean G: Recombination rate estimation in the presence of hotspots.

Genome Res 17(8), 1219-1227 (2007).

138. Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P: RDP3: a

flexible and fast computer program for analyzing recombination. Bioinformatics

26(19), 2462-2463 (2010).

139. Pond SLK, Frost SDW, Muse SV: HyPhy: hypothesis testing using phylogenies.

Bioinformatics 21(5), 676-679 (2005).

140. Suzuki Y, Gojobori T, Nei M: ADAPTSITE: detecting natural selection at single

amino acid sites. Bioinformatics 17(7), 660-661 (2001).

141. Woolley S, Johnson J, Smith MJ, Crandall KA, Mcclellan DA: TreeSAAP:

Selection on Amino Acid Properties using phylogenetic trees. Bioinformatics

19(5), 671-672 (2003).

142. Maddison WP, Maddison DR: Mesquite: a modular system for evolutionary

analysis. (2011).


Section II

Molecular Survey Approaches and Method Comparison


Chapter 3:

Pathogen Typing in the Genomics Era: MLST and the Future of Molecular Epidemiology

Abstract

Multi-locus sequence typing (MLST) is a high-resolution genetic typing approach to identify species and strains of pathogens impacting human health, agriculture (animals and plants), and biosafety. In this review, we outline the general concepts behind MLST, molecular approaches for obtaining MLST data, analytical approaches for analyzing MLST data, and the contributions MLST studies have made in a wide variety of areas. We then look at the future of MLST and its relative strengths and weaknesses with respect to whole-genome sequence typing approaches, which are moving into the research arena at an ever-increasing pace. Throughout the paper, we provide exemplar references for these various aspects of MLST. The literature is simply too vast to make this review comprehensive; nevertheless, we have attempted to include enough references in a variety of key areas to introduce the reader to the broad applications and complications of MLST data.

Introduction

The vast majority of bacteria are harmless or beneficial, but the few pathogenic strains are a major cause of human disease and death. Bacterial pathogens are the etiological agents of a wide range of infections, including syphilis, cholera, and tuberculosis, among others. Understanding the processes controlling transmission relies first and foremost on the ability to identify and accurately distinguish between strains of infectious pathogens. Accurate and efficient strain identification is also essential for epidemiological surveillance and the subsequent design of public health control strategies (Comas et al., 2009; Schulte and Perera, 1993). Over the last decades, different molecular techniques have been extensively exploited to identify isolates and localized disease outbreaks, but their poor portability usually hindered, rather than elucidated, bacterial epidemiology (Maiden, 2006; Urwin and Maiden, 2003). To overcome this problem, molecular microbiology took advantage of existing knowledge of bacterial evolution and population biology, the easy access to and low cost of high-throughput Sanger sequencing, and Internet database resources to propose the nucleotide sequence-based approach of Multilocus Sequence Typing (MLST; Maiden et al., 1998). This procedure allows for the unambiguous characterization of isolates of infectious agents using sequences of internal fragments of, usually, seven housekeeping genes (i.e., constitutive genes required for the maintenance of basic cellular functions). Gene regions of approximately 450-500 bp are sequenced, and each unique sequence found at a locus within a species is assigned an allele number. Each isolate is then characterized by the alleles at each of the seven loci, which constitute its allelic profile or sequence type (ST).
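The allele-numbering and sequence-typing logic described above can be illustrated with a minimal sketch. It assumes that allele numbers and STs are simply assigned in order of first appearance, whereas real MLST schemes assign them through curated databases; the function name, locus names, isolate names, and toy sequences are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple


def type_isolates(
    isolates: Dict[str, Dict[str, str]], loci: List[str]
) -> Dict[str, Tuple[Tuple[int, ...], int]]:
    """Return {isolate: (allelic profile, sequence type)}.

    Assumption: numbers are handed out in order of first appearance,
    which is an illustrative stand-in for a curated MLST database.
    """
    allele_ids: Dict[str, Dict[str, int]] = {locus: {} for locus in loci}
    st_ids: Dict[Tuple[int, ...], int] = {}
    typed: Dict[str, Tuple[Tuple[int, ...], int]] = {}

    for name, fragments in isolates.items():
        profile = []
        for locus in loci:
            seq = fragments[locus].upper()
            known = allele_ids[locus]
            # Each unique sequence at a locus receives a new allele number.
            if seq not in known:
                known[seq] = len(known) + 1
            profile.append(known[seq])
        key = tuple(profile)
        # Each unique combination of alleles receives a new ST.
        if key not in st_ids:
            st_ids[key] = len(st_ids) + 1
        typed[name] = (key, st_ids[key])
    return typed


# Hypothetical seven-locus scheme with toy 8-bp "fragments"
# (real fragments are approximately 450-500 bp).
LOCI = [f"locus{i}" for i in range(1, 8)]
BASE = {locus: "ACGTACGT" for locus in LOCI}

isolates = {
    "iso_A": dict(BASE),
    "iso_B": dict(BASE),                           # identical to iso_A
    "iso_C": {**BASE, "locus3": "ACGTACGA"},       # one new allele at locus3
}

typed = type_isolates(isolates, LOCI)
for name, (profile, st) in typed.items():
    print(name, profile, f"ST{st}")
```

In this toy example, isolates that share the same alleles at all seven loci receive the same ST, whereas a single nucleotide difference at one locus yields a new allele number and therefore a new ST.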

The MLST approach provides an accurate assessment of species and sometimes even strains, and has the added advantage of providing population genetic insights into levels and directionality of gene flow. This genetics-based species diagnosis is much more accurate than conventional immunological assays for determining species and strain. Often, these phenotypic assays do not reflect underlying genealogical information (e.g., Lewis-Rogers et al., 2009). Thus, misdiagnoses can easily occur without relevant genealogical information analyzed in an evolutionary and population genetic framework (Crandall and Pérez-Losada, 2008). MLST approaches provide such high-resolution genealogical data.

The first studies on bacterial population structure in the 1980s were fundamental to the development of MLST (Feil et al., 1999). These studies revealed genetic exchange through recombination as a major driving force in the evolution of most prokaryotes (Maiden, 2006). This finding shifted the predominant paradigm in bacterial population genetics from the “clonal model” to a broader concept of panmictic and partially clonal models (Smith et al., 1993). Consequently, inferring genetic relatedness among isolates based on single markers was unreliable, and a new method contrasting information from independent markers was needed. The MLST scheme played a major role in investigating the extent of genetic structure in bacterial populations and rapidly became the cornerstone technique for molecular typing of pathogenic microorganisms (Maiden, 2006).

As currently used, MLST has achieved high levels of discrimination and has provided meaningful data to understand the evolution and epidemiology of pathogens.

But given the recent advances in sequencing technologies, the question naturally arises: what is the future of the MLST scheme in the genomic era? Technological advances in high-throughput genome sequencing platforms (e.g., 454 Roche, Illumina/Solexa, Ion Torrent, and ABI SOLiD) promise to improve the resolution of molecular epidemiological studies to the most accurate level yet seen, and will likely provide unprecedented insights into the evolution of bacterial populations. Here we review the past, present, and future of the MLST approach. Because of the extensive literature published on the topic, this review cannot be comprehensive in its scope.

Instead, we hope to provide a summary of how the MLST scheme transformed molecular epidemiological studies (section 1), how it is now integrated with next-generation sequencing techniques (section 2), how it can be efficiently analyzed (section 3), and how it has contributed to understanding the molecular epidemiology and evolution of bacterial pathogens (section 4). Moreover, future challenges for MLST in light of genomic-era data are also discussed (section 5).

MLST databases: origins and recent advances via Internet resources

The MLST approach provided for the first time the reproducibility and portability needed to develop a worldwide pathogen-typing database easily accessible to public health and research communities. The MLST scheme was first developed and made available via the Internet for the species Neisseria meningitidis (Maiden et al., 1998), and this trend grew rapidly to include other bacterial species (Enright and Spratt, 1998; Heym et al., 2002; Kriz et al., 2002). The first MLST website was implemented early on in the software MLSTdB, which was structured as a single combined database (Chan et al., 2001). This first online resource worked well for the small datasets initially produced, but as the number of available schemes increased, several limitations such as data redundancy, isolate bias, and access became apparent (Pérez-Losada et al., 2011). Consequently, a reworked version of the original software, namely MLSTdbNET (Jolley et al., 2004), was developed in order to provide a network database structure. The premise behind this new tool was the creation of separate databases to store isolate-specific information and allelic profiles, so that any number of isolate databases could be constructed. Those databases are actively curated to avoid the accumulation of sequencing errors that could lead to illusory alleles and ST profiles (Jolley, 2009). However, data retrieved from the databases comprise reported diversity, but are unstructured and do not necessarily represent natural populations (Urwin and Maiden, 2003).

MLST databases are now available for at least 79 organisms (75 bacteria, 3 fungi, and 1 protozoan) and offer three main types of queries: 1) allele sequence identification and comparison, 2) allelic profile identification and comparison, and 3) matching of isolates. More recently, a Bayesian model-based method also offers the possibility to automatically relate unidentified isolates to information deposited in curated databases (Cheng et al., 2011). This method can be used with any MLST dataset through the software BAPS 5.4 (http://web.abo.fi/fak/mnf/mate/jc/software/baps.html). The query functionality has also been implemented online for the Staphylococcus aureus database deposited at http://www.mlst.net, and it is expected to become available shortly for other MLST databases hosted at the same website (Cheng et al., 2011). Most MLST schemes are available at the websites hosted at the University of Oxford in the United Kingdom (http://pubmlst.org) and the United Kingdom’s Imperial College (http://www.mlst.net), although some schemes can also be found at the Environmental Research Institute, Cork, Ireland (http://mlst.ucc.ie) and the Pasteur Institute, Paris (http://www.pasteur.fr/mlst).

The internationally mirrored PubMLST website provides access not only to the abovementioned MLSTdbNET database, but also to the antigen sequence software (agdbNET) for bacterial typing (Jolley and Maiden, 2006), and to the recently developed Bacterial Isolate Genome Sequence Database (BIGSDB), which implements a combined taxonomic and typing approach for the whole domain of bacteria (Jolley and Maiden, 2010).

Given the success of website technologies, recent efforts have exploited the potential of Internet resources to incorporate geospatial information in bacterial epidemiological studies (Aanensen et al., 2009; Baker et al., 2010; Grundmann et al., 2010). The websites www.spatialepidemiology.net/ and maps.mlst.net/, for example, provide precise locality data related to strain distribution and also provide a map-based interface for displaying and analyzing epidemiological information. Moreover, the portal www.eMLSA.net enables species identification by means of a taxonomic platform. The integration of genomic and epidemiological data together with geographic information through MLST databases will greatly improve our ability to track and prevent infectious pathogens and associated diseases.

Table 1. Comparison of most common bacterial typing techniques (adapted from Foxman et al., 2005). See section 1 for abbreviations referred to typing methods.


The MLST scheme: a comparison with other bacterial typing methods

To be useful, a strain typing method should provide enough discriminatory power to distinguish between isolates from unlinked sources and to be sufficiently reliable to cluster isolates from the same source (Killgore et al., 2008; Unemo and Dillon, 2011).

Since its proposal in 1998, MLST rapidly emerged as the state-of-the-art technique for bacterial molecular typing over other techniques (Fig. 1). Unfortunately, the MLST scheme is not a panacea for all questions pertaining to molecular epidemiology, and alternative methods exist that offer complementary or even better discriminatory power at different temporal scales (see Table 1). In addition, cost is pivotal when choosing a bacterial typing technique if a considerable number of isolates needs to be investigated.

Figure 1: Number of publications related to bacterial typing methods as a function of time. Abbreviations are defined in section 1. WGS = Whole-Genome Sequencing.

Currently, the main drawback of the MLST method is that the selection of housekeeping loci requires a reference genome (Parkhill et al., 2003; Sreevatsan et al., 1997). Moreover, the lack of diversity throughout entire genomes or housekeeping genes in some pathogens, as well as the presence of recently emerged species or recent population bottlenecks, may leave the MLST scheme with very limited discriminatory power (Harbottle et al., 2006; Pourcel et al., 2004; Torpdahl et al., 2005). Until the development of MLST, the most widely used technique for indexing allelic variation was multilocus enzyme electrophoresis (MLEE). A major drawback of MLEE is that only genetic changes altering the electrophoretic properties of the studied protein can be detected (about one-twentieth of all possible mutations), and consequently synonymous mutations are overlooked. Alternative gel-based methods, such as pulsed-field gel electrophoresis (PFGE), restriction fragment length polymorphism (RFLP), or amplified fragment length polymorphism (AFLP), offer a more affordable alternative to the MLST scheme and can provide better resolution at short temporal scales in some bacterial species (Melles et al., 2007). However, the MLST approach is usually preferred because in all these gel-based approaches, comparison of results between laboratories is often problematic and a high level of expertise is needed to interpret and translate banding patterns.

Another multiple-locus technique is the multiple-locus variable number of tandem repeats analysis (MLVA), which is based on the analysis of polymorphic repeated sequences (VNTR). Comparative studies between MLVA and MLST have yielded similar results (Elberse et al., 2011; Malachowa et al., 2005; Schouls et al., 2006; Top et al., 2004), and in recently originated species, the MLVA approach has higher discriminatory power (Vergnaud and Pourcel, 2006). This technique shares all the advantages of the MLST scheme in terms of portability and reproducibility at a lower cost, but VNTR may evolve too quickly to provide reliable phylogenetic relationships among closely related strains, and the size difference may not always reflect the real number of tandem repetitions because of the presence of insertions and deletions (Li et al., 2009).

Recently, a new methodology based on high resolution melting (HRM) curves has been proposed to distinguish single-base variation and so identify SNPs without the burden of sequencing (Erali et al., 2008; Taylor, 2009). After amplification, PCR products are characterized by their dissociation (melting) curves. This method provides a rapid, closed-tube, highly efficient, and low-cost strategy for detecting base substitutions and small insertions or deletions (Millat et al., 2009). However, the detection of an unidentified melting profile demands sequencing to identify the new profile, which increases the cost.

In addition, the Ribosomal Multilocus Sequence Typing method (rMLST) has been proposed to index the molecular variation of 53 genes encoding bacterial ribosome protein subunits (Jolley et al., 2012). This novel method pursues the integration of a taxonomic and typing method in a similarly curated MLST scheme. The data generated can be easily accessed and accommodated in the abovementioned database BIGSDB, a reference genome is not required, the targeted loci are conserved across the whole bacterial domain, and the reanalysis of existing allele designations is not required (Jolley and Maiden, 2010). Although more expensive, rMLST is likely to provide better resolution than previous methodologies, which, coupled with the decreasing cost of sequencing DNA, makes it a promising technique. The method still requires further exploration, but it certainly has the potential to provide a universal bacterial typing method extending the idea of the MLST scheme.

Finally, in order to achieve greater resolution, a method has been developed that relies on the presence or absence of pan-genomic or distributed genes among bacterial species that have the same MLST profile. This clustering method leverages the massive amount of whole-genome information that is being accumulated and demonstrates its utility by resolving close strain relationships (Hall et al., 2010). This novel method is affordable if microarrays are used; however, the cost increases dramatically if whole-genome sequencing is required.

Population genetics and phylogenetics under the MLST scheme

The MLST scheme was originally proposed for the identification of highly related bacterial genotypes (Maiden et al., 1998), but the genealogical information inherent in the DNA sequences also allowed one to address questions about species boundaries, population dynamics, and phylogenetics (Spratt, 1999). Different mechanisms for the exchange of genetic material among bacteria had been known for years (Lorenz and Wackernagel, 1994), but their role in population structure was widely assumed to be negligible. This paradigm changed radically after several studies revealed extensive genetic exchange caused by recombination (e.g., DuBose et al., 1988), which implied a broad spectrum of bacterial populations ranging from fully clonal (recombination does not effectively occur) to non-clonal populations (genetic diversity is randomized by frequent events of genetic exchange). Subsequent evidence showed that those extremes are rare in nature, and that most bacterial populations exhibit high levels of recombination, but not sufficient to prevent the emergence of clonal lineages (Spratt, 1999).

With population genetic and phylogenetic studies of bacterial species, then, one is forced to examine the role of genetic recombination (Posada et al., 2002). In this regard, the MLST scheme tries to overcome this problem by combining several neutral molecular markers scattered across the genome that are relatively short in length, thereby avoiding complications due to recombination (Maiden et al., 1998). As in all population studies, the sampling strategy is critical to avoid bias towards certain isolates and to accurately assess the overall genetic variation in the population. Housekeeping genes usually offer enough resolution to accurately infer population parameters and reconstruct phylogenetic relationships. However, there is no single core of universal genes that can be used across all pathogens (but see Jolley et al., 2012), since recombination, substitution, and selection rates vary across loci and species (Pérez-Losada et al., 2006); therefore, choosing the appropriate set of loci ultimately relies upon the biology of the individual species under study (Spratt, 1999). Molecular phylogenetic studies based on microbial populations face problems that are not often encountered in typical evolutionary studies (Fraser et al., 2007). Bacterial species typically exist as clusters of genetically related strains (Acinas et al., 2004), but finding those clusters may not be straightforward, since high rates of recombination can render phylogenetic trees meaningless and misleading (Posada and Crandall, 2002). In addition, isolates tend to be very closely related, and frequently both the parent strains and their descendants are included in the same sample (Hall and Barlow, 2006). Thus, recombination requires a different paradigm for visualizing genealogical relationships as networks instead of trees (Posada and Crandall, 2001) and special approaches for estimating population genetic parameters that accommodate the biological reality of recombination (Schierup and Hein, 2000).

Housekeeping genes: diversity levels and phylogenetic resolution

The MLST approach uses only a small fraction of the genome (usually 6-7 housekeeping genes of approximately 450-500 bp), which is assumed to be a representative sample of the entire genome diversity (Didelot and Maiden, 2010). Protein-encoding housekeeping genes are viewed as the most reliable markers, since they are presumed to evolve slowly by the random accumulation of neutral variation, providing much more reliable data for both accurate typing and phylogeny estimation.

Levels of genetic polymorphism in housekeeping genes are usually high enough to assess population structure and strain relatedness (Maiden, 2006). However, how much genetic variability is necessary to accurately infer inter- and intra-species evolutionary relationships remains an open question; similarly, the correlation between gene function and phylogenetic resolution has barely been addressed (Cooper and Feil, 2006; Ferreira et al., 2012; Zeigler, 2003). For example, contrary to expectations, Kuhn et al. (2006), Robinson et al. (2005), and Cooper and Feil (2006) showed for Staphylococcus aureus that the inclusion of rapidly evolving genes under diversifying selection did not hamper the accurate inference of evolutionary parameters; in fact, in the same studies, standard MLST genes provided the poorest phylogenetic resolution. These results suggested that loci selection, at least at the intra-species level, should be based primarily on nucleotide diversity rather than gene function (Cooper and Feil, 2006). Hence, if higher resolution is required, including more fast-evolving genes (such as those subject to positive diversifying selection) might be more beneficial than adding more MLST genes (Maiden, 2006).

It is not clear what values of genetic variability yield better phylogenetic estimates or why variation greater than 1% generally does not improve resolution (Cooper and Feil, 2006). As a general rule, it has been suggested that loci with at least the average diversity of all genes may have the potential to accurately trace relationships in molecular epidemiological studies (Cooper and Feil, 2006). The presence of “sufficient diversity” is a critical factor when analyzing closely related strains within species. This issue becomes less problematic at higher taxonomic levels, and in that case, MLST data are likely to provide the appropriate framework for studying molecular epidemiology in microbial pathogens. Several studies have tried to identify a universal set of housekeeping genes for bacterial typing and prediction of phylogenetic relatedness at different taxonomic levels (Stackebrandt et al., 2002; Zeigler, 2003). These studies have shown that a careful selection of single genes could be sufficient for discriminating between bacterial species, but the inference of intrageneric evolutionary relationships may be difficult when a small set of genes is used (Zeigler, 2003).
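To make the notion of per-locus diversity concrete, the following minimal sketch computes nucleotide diversity as the average proportion of differing sites over all pairs of aligned sequences, and then lists the loci whose diversity is at least the average across loci. It assumes pre-aligned, equal-length sequences with no gaps or ambiguity codes; the locus names, sequences, and function names are hypothetical, and this is not the procedure used in the studies cited above.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average proportion of differing sites over all pairs of aligned sequences."""
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(aligned, 2))
    # Total number of mismatched sites summed over every pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (length * len(pairs))


def loci_at_or_above_mean(alignments):
    """Return per-locus diversity and the loci whose diversity is >= the mean."""
    pi = {locus: nucleotide_diversity(seqs) for locus, seqs in alignments.items()}
    mean_pi = sum(pi.values()) / len(pi)
    return pi, [locus for locus, value in pi.items() if value >= mean_pi]


# Hypothetical toy alignments (real MLST fragments are ~450-500 bp).
alignments = {
    "locA": ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"],
    "locB": ["GGGGCCCCAA", "GGGGCCCCAA", "GGGGCCCCAA", "GGGGCCCCAA"],
}
pi, informative = loci_at_or_above_mean(alignments)
```

In this toy example locA has a diversity of 0.1 (10%) and locB has none, so only locA meets the "at least average diversity" rule of thumb.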

More recently, cutting-edge approaches based on full-genome sequences have been applied with the expectation that including more genetic data will buffer the effect of non-informative loci (Schürch and van Soolingen, 2012). However, Ferreira et al. (2012) have pointed out the need for a careful examination of genomic features such as polymorphism dispersion, intergenic region sizes, and positively selected loci ratios, since these factors may impact recombination and mutation rates differently, resulting in non-convergent and incongruent phylogenies. In agreement with previous studies, Ferreira et al. also showed that inclusion of positively selected genes did not prevent the accurate inference of the evolutionary parameters, and curiously, non-coding regions yielded similar results (Ferreira et al., 2012). Although this study relates to a specific bacterial species, it provides valuable clues about the potential of non-standard loci as markers for MLST. Currently, most inferences on bacterial evolution have been and are still produced using MLST data. However, next-generation sequencing platforms now provide the means to capture multiple non-standard target loci to detect single nucleotide polymorphisms or to sequence full genomes. Such methods are briefly described in the next section.

Sequencing approaches to MLST

Next-generation sequencing (NGS) is permeating many aspects of biology, including those endeavors typically related to MLST (Metzker, 2010). Although Sanger sequencing is still used far more than NGS, as revealed by a simple Web of Science search (Sanger/NGS = 2 × 10^6/5 × 10^4), the latter is gaining popularity for reasons such as affordability (when sequencing large numbers of samples), scalability, and marker discovery (gene mining). In this section, we present a review of the sequencing approaches currently used in relation to MLST.

Sanger sequencing

Traditional Sanger sequencing still enjoys great popularity, primarily because of its low cost at small scales and perceived superior quality when it comes to error rates (Hoff, 2009). In a nutshell, to carry out a Sanger reaction we need a single-stranded DNA molecule, deoxynucleotide triphosphates (along with tagged dideoxynucleotide chain terminators), and a primer that will be extended by a DNA polymerase. Tagged amplicons of different lengths are then fractionated via electrophoresis or with a chromatography capillary column so that “color” tags are read and a digital consensus sequence is inferred. Sanger sequencing provides unambiguous DNA sequence markers that can be used to design MLST schemes. Read lengths, or the mean/mode length achieved by a sequencing method, are typically longer in Sanger than those generated by other sequencing approaches, which may reduce the number of loci required for accurate bacterial characterization. Additionally, Sanger is amenable to sequencing single molecules and therefore reduces the potential impact of artificial recombination, implying that all detectable recombinant signals come from real biological events (Salazar-Gonzalez et al., 2008). Moreover, post-processing in Sanger sequencing is simple compared to NGS, which makes it the preferred choice in laboratories lacking strong bioinformatic capabilities.

Sanger sequencing is still the gold standard for generating DNA sequence data (Harismendy et al., 2009). One of its more attractive features is its low error rate (from 0.0001% to 1%), which seems to depend on the algorithms used for post-processing (Ewing and Green, 1998; Ewing et al., 1998). NGS techniques such as pyrosequencing, on the other hand, report error rates of 0.49-2.8% (Harismendy et al., 2009), though the technologies are improving regarding sequencing chemistry and software post-processing.
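Error rates such as those quoted above are commonly expressed on the logarithmic quality scale associated with the phred base caller (Ewing and Green, 1998), where Q = -10 log10(P) and P is the probability that a base call is wrong. The sketch below simply converts between the two representations; the function names are illustrative and not taken from any particular software.

```python
import math


def phred_quality(error_probability):
    """Convert a per-base error probability into a Phred-scaled quality score."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10 ** (-quality / 10)


# The Sanger error-rate range quoted in the text, expressed as probabilities.
q_high_error = phred_quality(0.01)   # 1% error rate
q_low_error = phred_quality(1e-6)    # 0.0001% error rate
```

Under this convention, an error rate of 1% corresponds to Q20 and an error rate of 0.0001% to Q60.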

Next-generation sequencing

Although Sanger sequencing can still fulfill the needs of many microbiology labs, the prospects that NGS technologies offer, along with the scale of their benefits, mean that NGS will likely surpass Sanger sequencing (Castro-Nallar et al., 2012). Large-scale sequencing efforts using Sanger require expensive infrastructure and laborious bench work (Medini et al., 2008). Several platforms and chemistries are available within NGS; large-scale projects, however, can be done on a bench-top machine with ease (see Hui et al., 2012 for a review).

NGS contributes to the development of MLST schemes in at least two ways. First, traditional MLST schemes need a reference genome in order to develop appropriate markers (Table 2). Currently, there are many genomes available from which one can extract marker information (3,334 complete and 11,056 incomplete; GOLD database; http://www.genomesonline.org). In fact, software implementations such as PhyloMark are already accessible and can aid with genome-wide marker examinations. The aim of PhyloMark is to identify the minimum number of markers that recapitulate a full-genome phylogeny (Sahl et al., 2012) (Fig. 2B). Due to NGS technologies, the number of available bacterial genomes is increasing at a fast pace. However, a large proportion of bacteria still lack genome information, and thus the abovementioned strategies cannot be applied. Second, NGS has proved useful in generating sequence data when little is known about the target organisms by providing the raw material to extract markers for MLST schemes (Fig. 2C). Furthermore, NGS read lengths are now falling within the size range of the genes (450-500 bp) used in MLST (http://454.com), and with the addition of multiplexing IDs (MID), it is possible to pool large numbers of samples and still get the benefit of sequencing sample targets with high depth (coverage). Sanger sequencing is a mature technology with little room to improve. In contrast, NGS technologies are rapidly evolving in a complementary, non-overlapping manner. For instance, pyrosequencing is improving regarding both homopolymer errors and read lengths. On the other hand, Illumina systems do not provide reads as long as those from pyrosequencing, but their coverage is greater, which could be advantageous for assessing bacterial genetic diversity in intra-host dynamics.

To date, several methodologies have been put forward to improve traditional MLST schemes, many of which take advantage of NGS. In general, they fall into a category in which, given the presence or absence of genetic information, they use some sort of gene/genomic region targeting (Fig. 2A and 2B) or enrichment (Fig. 2C) to obtain potential marker sequences that can be used in downstream MLST applications. If genomic information of the group of interest is available, it is possible to develop markers that would resemble genomic relationships (Fig. 2B). In turn, if no information is available except from related taxa, then it is possible to design sequence capture experiments (usually with probes) to develop or discover new markers (Fig. 2C). Alternatively, if no genomic information exists for the group of interest, an enriched de novo approach can also be applied to discover new markers.

Figure 2: Schematic diagram showing direct sequencing approaches to obtain and discover genetic markers for MLST analysis. Left (2A and 2B) and right (2C) panels show approaches when genomic information is available or not, respectively. TAS = Targeted-Amplicon Sequencing; HiMLST = High-Throughput MLST; AHE = Anchored Hybrid Enrichment; UCE = Ultra-Conserved Elements Enrichment. See section 2 for other abbreviations and further detail.

With the decreasing cost of NGS, new affordable applications have arisen to perfect or create new ways of generating and analyzing sequence-typing data. A natural step toward high-throughput sequence typing is to combine the power of NGS with the targeting of sequences for which some extent of variability is already known. Methods relying on targeting known genes (Fig. 2A and 2B) or enriching genomic fractions to discover new markers (Fig. 2C) are now available (Table 2). In general, these methods, though not yet heavily used, promise to overcome some of the limitations of the classic MLST approach. For instance, the lack of a reference genome might not be a limitation, since by performing enrichment steps prior to NGS it is possible to single out large homologous regions of the genome that can be scaled up to analyze larger datasets and/or more populations.

One example is the Targeted-Amplicon Sequencing method (TAS; Fig. 2A), which capitalizes on NGS to sequence a large number of regions from large numbers of pooled samples (Bybee et al., 2011). Given its relatively longer reads (800 bp) compared to other NGS technologies, pyrosequencing has been the preferred choice for targeted approaches. Recently, a method was made available in which MLST genes are amplified in a two-step PCR using sequence-specific primers that have attached MID (HiMLST), similar to what is routinely done when adding a restriction site to a target gene (Boers et al., 2012). Then, samples from multiple strains or species are pooled and sequenced as usual in a 454 Roche machine. It is worth noting that Roche 454 technology is able to deliver reads of up to 800 bp (using the GS FLX+ system), which may be particularly useful for MLST analysis (www.454.com). This method is essentially the same as the TAS method published earlier but specifically designed for MLST. Both approaches use MID multiplexing capabilities, so costs are lowered by pooling samples. A simple post-processing procedure ensures that sequences are obtained on a per-strain/species basis, for example by using the BarcodeCruncher software (http://crandalllab.byu.edu/ComputerSoftware.aspx). The reported HiMLST protocol was able to profile 575 isolates from several bacterial species (7 genes). In addition, the TAS protocol was able to obtain sequences from 6 genes over 44 taxa in a quarter plate (Table 2; Boers et al., 2012; Bybee et al., 2011).
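The post-processing step that sorts pooled reads back into samples can be sketched as follows. This is a minimal illustration that assumes each read begins with an exact copy of its MID tag and that no tag is a prefix of another; the tags, sample names, and the `demultiplex` function are hypothetical, and this is not a description of how BarcodeCruncher works internally.

```python
from collections import defaultdict


def demultiplex(reads, mids):
    """Sort pooled reads by their leading MID tag and trim the tag off.

    reads: iterable of read sequences (strings)
    mids:  dict mapping MID tag sequence -> sample name
    """
    by_sample = defaultdict(list)
    for read in reads:
        for tag, sample in mids.items():
            if read.startswith(tag):
                by_sample[sample].append(read[len(tag):])  # keep insert only
                break
        else:
            # No tag matched exactly (e.g. a sequencing error in the tag).
            by_sample["unassigned"].append(read)
    return dict(by_sample)


# Hypothetical MID tags and pooled reads.
mids = {"ACGAGT": "strain_1", "TCTCTA": "strain_2"}
reads = ["ACGAGTGGATCC", "TCTCTAGGATTC", "ACGAGTGGATCA", "NNNNNNGGATCC"]
sorted_reads = demultiplex(reads, mids)
```

A production pipeline would typically also tolerate a limited number of mismatches in the tag and check the primer sequence that follows it.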

On the other hand, examples of directed sequencing by enrichment are: 1) Anchored Hybrid Enrichment/Ultra-Conserved Elements (Faircloth et al., 2012; Lemmon et al., 2012) and 2) the PRGmatic approach (Hird et al., 2011). Although these methods were originally developed for phylogenomics and high-level systematics (i.e., phylogenies of species), they are also applicable to MLST, since multiple informative markers are often needed to resolve genealogical relationships among individuals. Enrichment methods (or sequence capture methods) can be of help when little is known about the species under scrutiny or when the objective is to discover new MLST markers. The PRGmatic approach, for example, uses restriction enzyme-digested, size-selected genomic DNA sequenced by pyrosequencing. It then clusters aligned reads by identity into alleles and then into loci. A great innovation of the method is that it generates a Provisional Reference Genome (PRG) that is further used to align reads and generate sequences for each locus. In turn, the Anchored Hybrid Enrichment method (or Ultra-Conserved Elements enrichment by Faircloth et al., 2012), probably a more powerful approach in terms of finding loci, attempts to “capture” conserved genomic regions using probes and then sequence them using the Illumina platform. The post-processing is fairly straightforward in terms of bioinformatic burden, though trained personnel are probably necessary to automate post-processing by writing tailored computer scripts. Although this method is more powerful regarding the number of loci recovered, it is likely to be more expensive as well. In particular, DNA library generation could be an economic burden for a medium-sized laboratory in terms of initial investment (Table 2). However, per-base or per-locus sequencing costs are very low compared to other NGS-based methods. Other enrichment methods are discussed elsewhere (Cronn et al., 2012; Mamanova et al., 2010).

In principle, due to their higher sequencing power (up to 854 loci; Table 2), all the abovementioned approaches should help to overcome some of the problems standard MLST schemes may encounter, such as lack of diversity in genomes or housekeeping genes, or more importantly, limited ability to detect patterns in emergent species or in species under demographic or selective processes. Very few studies looking at bacterial evolution and epidemiology using these methodologies have been published so far. As sequencing costs keep decreasing, we foresee an increase in MLST studies using NGS. Coupling NGS to MLST is a challenge, and new strategies are starting to emerge. Recently, for example, Singh et al. (2012) developed a hairpin-primed multiple amplification method that can amplify numerous target genes simultaneously.

Table 2. Comparison of NGS-based methods used in gene mining and sequencing.

Analysis of MLST

Methods of analysis of MLST data can be classified into two basic strategies: a) those that rely on allele and ST designations to estimate relatedness among isolates (allele-based methods) and so ignore the number of nucleotide differences between alleles; and b) those that rely on nucleotide sequences directly to estimate relatedness and population parameters (nucleotide-based methods) (Table 3). The allele-based approach is thought to work well in non-clonal organisms (e.g., Helicobacter pylori), while nucleotide-based approaches are preferable for clonal organisms (e.g., Staphylococcus aureus), since the former approaches are likely to be misleading because they cannot distinguish between single-base changes in multiple loci and multiple mutations in the same number of loci (Maiden, 2006). In practice, most microbes show some degree of clonality (clonal complex) in their populations; hence, in principle, both types of analyses could be carried out in population and epidemiological studies (e.g., Tazi et al., 2010).
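The difference between the two strategies can be illustrated with a toy comparison. In this minimal sketch, the allelic distance counts the loci at which two isolates carry different alleles, whereas the nucleotide distance counts the differing sites across all loci. The two-locus sequences and allele numbers are hypothetical and chosen only to show that isolates differing at the same number of loci can differ by very different numbers of nucleotides.

```python
def allelic_distance(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def nucleotide_distance(seqs_a, seqs_b):
    """Number of differing sites summed over all (aligned) loci."""
    return sum(
        sum(x != y for x, y in zip(seq_a, seq_b))
        for seq_a, seq_b in zip(seqs_a, seqs_b)
    )


# Hypothetical two-locus example: one reference isolate and two variants.
ref_seqs, ref_profile = ["AAAAAAAA", "CCCCCCCC"], (1, 1)
point_seqs, point_profile = ["AAAAAAAT", "CCCCCCCC"], (2, 1)   # one substitution
multi_seqs, multi_profile = ["TTTTAAAA", "CCCCCCCC"], (3, 1)   # four substitutions

allelic = (
    allelic_distance(ref_profile, point_profile),
    allelic_distance(ref_profile, multi_profile),
)
nucleotide = (
    nucleotide_distance(ref_seqs, point_seqs),
    nucleotide_distance(ref_seqs, multi_seqs),
)
```

Both variants are one allele away from the reference, yet one differs by a single nucleotide and the other by four, which is the information allele-based methods discard.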

Allele-based methods

These types of methods first require coding the DNA sequences from each locus into numbers using information available in public MLST databases (see section 1). If no match is found, a new number is assigned in order of discovery. Several computational programs, such as Sequence Typing Analysis and Retrieval System (STARS), have been developed for this task. Once alleles have been assigned, data are entered in the MLST databases to acquire an ST profile. At this point, exploratory analysis (e.g., allele and profile frequencies, polymorphism estimates, codon usage, etc.) can be performed using the Sequence Type Analysis and Recombinational Tests (START2) software (Jolley et al., 2001). Relatedness among STs can then be displayed using methods of cluster reconstruction such as the simple Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and the Based Upon Related Sequence Types (eBURST) approach. The former method uses a matrix of distances among STs to estimate isolate relatedness, while eBURST (Feil et al., 2004) infers patterns of evolutionary descent among isolates using a simple model of clonal expansion and diversification. A new globally optimized version (goeBURST) has also been developed that identifies alternative patterns of descent using graphic matroids (Francisco et al., 2009). Recently, a new approach (PHYLOViZ) has been released for microbial epidemiological and population analysis that allows for the integration of allelic profiles from MLST or MLVA methods (although Single Nucleotide Polymorphism data can also be included) and associated epidemiological data (Francisco et al., 2012). PHYLOViZ uses goeBURST for representing the possible evolutionary relationships between strains.

Allele-based methods have the advantage of simplicity and speed, which are crucial for efficient epidemiological surveillance and public health management, but they disregard much of the evolutionary information contained at the nucleotide level. They are, therefore, better suited for exploratory data analysis than for fine statistical inference (Didelot and Falush, 2007). A larger and more sophisticated set of nucleotide-based methods exists to estimate isolate relationships and population parameters.
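As a rough illustration of the allele-coding step and of what allele-based distances discard, the following Python sketch assigns allele numbers in order of discovery and compares an allelic-profile distance with a nucleotide distance. The locus names, the toy sequences, and the `known_alleles` lookup (a stand-in for a public MLST database) are hypothetical, and the sketch is not the procedure implemented in STARS, START2, or eBURST.

```python
def assign_alleles(sequences_by_locus, known_alleles=None):
    """Code each distinct sequence at a locus as an integer allele number.

    sequences_by_locus: {locus: {isolate: sequence}}
    known_alleles: optional {locus: {sequence: number}}, a hypothetical
    stand-in for a public MLST database. Sequences without a match receive
    the next free number, in order of discovery.
    """
    known = {locus: dict(table) for locus, table in (known_alleles or {}).items()}
    profiles = {}
    for locus, sequences in sequences_by_locus.items():
        table = known.setdefault(locus, {})
        for isolate, sequence in sequences.items():
            if sequence not in table:
                table[sequence] = max(table.values(), default=0) + 1
            profiles.setdefault(isolate, {})[locus] = table[sequence]
    return profiles


def allelic_distance(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(profile_a[locus] != profile_b[locus] for locus in profile_a)


def nucleotide_distance(sequences_by_locus, isolate_a, isolate_b):
    """Number of mismatched aligned positions, summed over all loci."""
    total = 0
    for sequences in sequences_by_locus.values():
        seq_a, seq_b = sequences[isolate_a], sequences[isolate_b]
        total += sum(x != y for x, y in zip(seq_a, seq_b))
    return total


# Toy data (hypothetical): B differs from A by one base, C by four bases,
# all within the same locus.
toy = {
    "locus1": {"A": "AAAA", "B": "AAAT", "C": "TTTT"},
    "locus2": {"A": "CCCC", "B": "CCCC", "C": "CCCC"},
}
profiles = assign_alleles(toy)
for other in ("B", "C"):
    print(other,
          allelic_distance(profiles["A"], profiles[other]),
          nucleotide_distance(toy, "A", other))
```

In this toy case both B and C sit at an allelic distance of 1 from A, although C differs from A by four nucleotides and B by only one. This is the information that allele-based methods leave out.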

Table 3. List of population genetics programs listed in this review including their functionalities and online links.


Nucleotide-based methods

Any analysis of nucleotide data usually begins with an alignment (i.e., estimation of site homology; Rosenberg, 2009). Several fast and accurate strategies for aligning gene regions and genomes are implemented in MAFFT (Katoh et al., 2005) and MAUVE (Darling et al., 2010), respectively. After the alignment has been generated, we need to determine the model of evolution that best fits the data. Model choice is a critical issue, and the chosen model (or lack thereof) will affect all subsequent phylogenetic (section 3.2.1) and population (section 3.2.2) analyses (Kelsey et al., 1999). Model fit is usually assessed within a maximum likelihood or Bayesian phylogenetic framework and under multiple criteria, such as the Akaike or Bayesian Information Criterion and marginal likelihoods (see Baele et al., 2012; Posada and Buckley, 2004; Xie et al., 2011). These and other model choice strategies are implemented in jModelTest 2 (Darriba et al., 2012).
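To make the information criteria concrete, the sketch below computes the Akaike and Bayesian Information Criteria from a model's maximized log-likelihood and number of free parameters, and picks the model with the lowest score. The log-likelihood values are invented for illustration. The parameter counts exclude branch lengths, and the use of alignment length as the sample size in the BIC is a common convention that is assumed here rather than taken from the text.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(scores):
    """Name of the model with the lowest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical maximized log-likelihoods and free substitution parameters
# (branch lengths excluded) for three nested nucleotide models.
candidates = {
    "JC69": (-2050.0, 0),
    "HKY85": (-2010.0, 4),
    "GTR": (-2008.0, 8),
}
n_sites = 1000  # assumed alignment length, used as the sample size for BIC

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
print(best_model(aic_scores), best_model(bic_scores))
```

With these made-up values both criteria prefer HKY85: the small likelihood gain of GTR does not offset its four additional parameters.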

Phylogenetic relatedness

Phylogenetic reconstruction methods can be divided into two types, those that proceed algorithmically (e.g., UPGMA, Neighbor-joining) and those based on optimality criteria. Here we will focus on the latter since we find this feature particularly important for analyzing MLST data; a more extensive review of phylogenetic methods can be found in Pérez-Losada et al. (2007c).

Maximum likelihood (ML) inference attempts to identify the topology that explains the evolution of a set of aligned sequences under a given model of evolution with the greatest likelihood (Felsenstein, 1981). RAxML (Stamatakis, 2006), GARLI (Zwickl, 2006) or PHYML (Guindon et al., 2010) implement the ML criterion efficiently and accurately and can handle datasets of >1,000 sequences. Confidence in the estimated ML relationships (i.e., clade support) can be assessed using the nonparametric bootstrap procedure (Felsenstein, 1985).

Bayesian inference (BI) combines the prior probability of a phylogeny with the likelihood to produce a posterior probability distribution of trees, which can be interpreted as the probability that the tree(s) is (are) correct (Huelsenbeck et al., 2001). BI has the advantage over ML approaches of both accounting for uncertainty in the estimated phylogeny and model parameters and allowing for hypothesis testing. Clade support is estimated by summarizing the frequency of each clade across a distribution of trees through a consensus analysis. Bayesian phylogenies are estimated using Metropolis-coupled Markov chain Monte Carlo (MC3) methods, as implemented in programs like MrBayes (Ronquist and Huelsenbeck, 2003) or BEAST (Drummond and Rambaut, 2007). The output generated by these programs can then be evaluated in Tracer (Rambaut and Drummond, 2009) to confirm that MC3 chains have mixed well and converged.
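The idea of clade support as a frequency across a tree distribution can be sketched in a few lines of Python. Here each sampled tree is represented, as a simplifying assumption, by the set of clades it contains (each clade being a frozenset of taxon names), and burn-in samples are assumed to have been discarded already. This is not the summary procedure of MrBayes or BEAST, only an illustration of the principle.

```python
from collections import Counter


def clade_support(sampled_trees):
    """Proportion of sampled trees that contain each clade.

    sampled_trees: list of trees, each given as an iterable of clades,
    where a clade is a frozenset of taxon names (a simplified, assumed
    representation of a tree).
    """
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))  # count each clade at most once per tree
    n_trees = len(sampled_trees)
    return {clade: count / n_trees for clade, count in counts.items()}


# Four hypothetical post-burn-in samples for the taxa A, B and C.
samples = [
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("BC"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABC")],
]
support = clade_support(samples)
print(support[frozenset("AB")], support[frozenset("BC")])
```

In this toy sample the clade (A, B) appears in three of the four trees and so receives a support of 0.75, whereas (B, C) receives 0.25.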

Standard phylogenetic methods assume a lack of recombination, an assumption violated by many microorganisms. Hence, if recombination is suspected in our data, we should first detect and eliminate recombinant regions or identify breakpoints (see section 3.2.2 below), so alignments can then be subdivided into non-recombinant regions and analyzed separately. Alternatively, one could use an approach that takes homologous recombination into account while inferring clonal relationships between the members of a sample. Such a method is implemented in ClonalFrame (Didelot and Falush, 2007) within a Bayesian coalescent framework. Similarly, phylogenetic strategies that assume a reticulated model of evolution (network) instead of a bifurcating tree may be better when recombination is substantial (Posada and Crandall, 2001); the Union of Maximum Parsimonious trees (Cassens et al., 2005) and TCS (Templeton et al., 1992) are two such approaches and both perform well under relatively low levels of diversity and recombination (Woolley et al., 2008). Another broadly used network approach is SplitsTree4 (Huson and Bryant, 2006). An interesting application of the network strategy has recently been developed by Plucinski et al. (2011) to infer local and global properties of the host populations in commensal bacteria.

Often gene trees differ even when sampled from the same population. This can be the result of molecular processes (e.g., recombination) or stochastic variation (e.g., incomplete lineage sorting). New coalescent methods have been developed to deal with stochastic variation in gene trees. Among these, the Bayesian-based BEST (Liu, 2008), STEM (Kubatko et al., 2009), and *BEAST (Heled and Drummond, 2010) approaches are well suited to estimate the joint posterior distribution of gene trees and the organism tree using multilocus molecular data.

Population dynamics

The evolution of DNA sequences in natural populations can be described by parameters like recombination, mutation, growth and selection rates. Indeed, the accurate estimation of these parameters is key for understanding the dynamics and evolutionary history of those populations, their epidemiology, and ultimately for applying efficient public health control strategies. Population parameters are more efficiently estimated using explicit statistical models of evolution such as the coalescent approach, hence here we describe some population parameter estimators based on such models.

Recombination is generally defined as the exchange of genetic information between two nucleotide sequences. A comprehensive review of statistical methods for detecting and estimating recombination rates is presented in Posada et al. (2002), although new methods have since been developed (e.g., Jeffrey, 2004; Lefebvre and Labuda, 2008; Padhukasahasram et al., 2006; Wang and Rannala, 2008, 2009) and revised (e.g., Auton and McVean, 2012; Martin et al., 2011; Stumpf and McVean, 2003). Posada et al. (2002) concluded that multiple methods should be used to detect or estimate recombination. Consequently, software packages like RDP4 (Martin et al., 2010) have been developed to implement up to eight recombination estimators that allow the user to draw conclusions based on the outcome of multiple tests.

Genetic diversity is the most important population parameter and is usually estimated in relation to recombination as the rate of recombination to mutation (r/m), so the relative impact of each force on generating microbe genetic diversity can be assessed (Feil et al., 1999). Reviews of classical and coalescent statistical methods for estimating genetic diversity can be found in Pearse and Crandall (2004), Excoffier and Heckel (2006) and Waples and Gaggiotti (2006); nonetheless, newer methods have been developed since these reviews (e.g., Bashalkhanov et al., 2009).

Growth rates reflect the variation of genetic diversity over time. Growth can be estimated under a certain demographic model (e.g., exponential) or without dependence on a pre-specified model, such as the Bayesian skyline plot (Drummond et al., 2005) or the Skyride model (Minin et al., 2008), both implemented in BEAST. Interestingly, BEAST also allows for the analysis of temporally spaced sequence data. Recombination, genetic diversity, and exponential growth rates can all be estimated in LAMARC (Kuhner, 2006).

The standard method for estimating selection in protein-coding DNA sequences is through the ratio of nonsynonymous (dN) to synonymous (dS) substitution rates, dN/dS (ω). ω > 1 indicates adaptive or diversifying selection, ω < 1 purifying selection, and ω ≈ 1 lack of selection (neutral evolution). ω is usually estimated within a ML phylogenetic framework and assuming an explicit model of codon substitution. If significant evidence (usually obtained through likelihood ratio tests) of adaptive selection is obtained, then Bayesian tests can be applied to detect amino acid sites under selection (e.g., Yang et al., 2005). These methods are implemented and described in more detail in PAML (Yang, 2007). If recombination is present, other methods exist that can estimate recombination and selection rates simultaneously (OmegaMap; Wilson and McVean, 2006), or account for the former while estimating the latter (HYPHY; Kosakovsky Pond et al., 2005).
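The likelihood ratio test mentioned above can be sketched as follows. Twice the difference in log-likelihood between a codon model that allows ω > 1 and a nested null model that does not is compared against a chi-square distribution. The log-likelihood values are hypothetical, and the sketch assumes that the two models differ by an even number of free parameters (two in the example), for which the chi-square tail probability has a simple closed form. It does not reproduce PAML output or its interface.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (test statistic, p-value) for two nested models."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return statistic, chi2_sf_even_df(statistic, df)


# Hypothetical log-likelihoods: null model (no class of sites with omega > 1)
# versus an alternative with two extra parameters that allows omega > 1.
statistic, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-995.0, df=2)
print(statistic, p_value)
```

A small p-value here would be taken as evidence that the model allowing adaptive selection fits significantly better, at which point site-level Bayesian tests become relevant.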

Other key factors in pathogen dynamics are the time of emergence of the epidemic and the geographical distribution of pathogens. New probabilistic models have recently been developed within the Bayesian framework (Lemey et al., 2009; Lemey et al., 2010) that allow the inference and hypothesis testing of divergence times, ancestral locations and historical patterns of migration (i.e., phylogeographic history). Those parameters can be estimated in BEAST and SPREAD (Bielejec et al., 2011) and visualized using virtual globe software like Google Earth (www.google.com/earth/index.html). Such methods have already begun to be applied to the analysis of MLST and/or genome and SNP (see section 5) data (Gray et al., 2011; McAdam et al., 2012; Weinert et al., 2012). Similarly, divergence times and ancestral states can also be estimated in LAMARC.

Applications of MLST

The popularity of MLST is driven by its ease of use and discriminating power. Consequently, over the last few years we have seen not only an increase in MLST schemes (Fig. 1) and sequence types available, but also in the diversity of their applications. Although primarily developed for the characterization of organisms (typing), MLST sequence data have also been applied to other aspects of molecular epidemiology (e.g., disease transmission, evolution of virulence) and public health (e.g., monitoring vaccination programs), as well as to other areas such as phylogenetics, taxonomy, speciation, population genetics, biosafety, and even the inference of human migrations. Below we list a series of examples taken from the most recently published literature showing some of those applications.

Molecular epidemiology and public health

MLST has become the routine typing approach for the identification of clinical specimens. Accurate and quick characterization of organisms is crucial for epidemiological surveillance (Brehony et al., 2007; Trotter et al., 2007), detection and management of disease outbreaks (Byrnes et al., 2010; Palazzo et al., 2011; Vanderkooi et al., 2011), estimation of prevalence rates (Haran et al., 2012; Ibarz-Pavon et al., 2011; Sproston et al., 2011), and the study of horizontal (Stensvold et al., 2012; Walker et al., 2012) and vertical (Makino et al., 2011; Martin et al., 2012) transmission of infectious agents. Interestingly, new epidemic models have recently been developed that make use of MLST data to infer social network structure in ubiquitous commensal bacteria (Plucinski et al., 2011). MLST has also helped to investigate the emergence and spread of antibiotic resistance to meticillin, erythromycin, macrolides and quinolones (Atkinson et al., 2009; De Francesco et al., 2011; Egger et al., 2012; Haran et al., 2012; Ibarz-Pavon et al., 2011; Pérez-Losada et al., 2007a; Tazi et al., 2010) and virulence (including virulence factors and genes and disease associations) (Ch'ng et al., 2011; Dingle et al., 2011; Matsunari et al., 2012; Schultsz et al., 2012; Springman et al., 2009). It has also been used to monitor the effects of vaccination programs (pre- and post-vaccine) (Adetifa et al., 2012; Climent et al., 2010; Hanage et al., 2011; Maiden and Stuart, 2002; Pichon et al., 2009; Stefanelli et al., 2009), improve vaccination strategies (Hanage et al., 2011; Racloz and Luiz, 2010; Stefanelli et al., 2009), and design new vaccines and new approaches to vaccination (Bambini et al., 2009; Pizza et al., 2000; Urwin et al., 2004) against Streptococcus pneumoniae and Neisseria meningitidis. Finally, MLST has also contributed to the identification of sources of human infection from natural hosts (e.g., livestock animals and dogs) and environmental (e.g., animal-derived food) reservoirs (Bessell et al., 2012; Gripp et al., 2011; Ngo et al., 2011; O'Mahony et al., 2011), to the identification of host or niche associations (Hotchkiss et al., 2011; Sheppard et al., 2010a; Sproston et al., 2011) and zoonotic transmissions (Sahin et al., 2012; Sakwinska et al., 2011; Walther et al., 2012), and to the study of biological interactions like symbiosis in Wolbachia from insects (Russell et al., 2009).

Phylogenetics, taxonomy, and speciation

MLST data have been used to infer clone and species relationships and phylogroups in pathogenic (e.g., Actinomyces) and beneficial (e.g., Oenococcus oeni and Trypanosoma cruzi) microbiota (Bilhere et al., 2009; Bridier et al., 2010; Henssge et al., 2011; Yeo et al., 2011), separate and validate similar or sibling species of Streptococcus oralis and Lactobacillus delbrueckii (Do et al., 2009; Tanigawa and Watanabe, 2011) and identify new ones in, for example, the genera Bartonella, Bacillus and Burkholderia (Chaloner et al., 2011; Guinebretiere et al., 2012; Vanlaere et al., 2009; Vanlaere et al., 2008), suggest new taxonomic classifications (e.g., Lactococcus lactis) (Passerini et al., 2010), validate COI barcodes in Wolbachia (Smith et al., 2012), and discuss the bacterial species concept (Godreuil et al., 2005; Vos, 2011). MLST data are particularly useful for species diagnosis, as they provide both genealogical information and information on recombination (see below), which is critical for bacterial species identification (Dykhuizen and Green, 1991; Fraser et al., 2007), as revealed in Streptococcus (Ahmad et al., 2009).

Population structure and dynamics

MLST has been instrumental in confirming the clonal structure of many organisms like Staphylococcus aureus (see Pérez-Losada et al., 2006 for a review), but also in identifying epidemic clonal complexes in other taxa like Staphylococcus haemolyticus (Cavanagh et al., 2012), Yersinia pseudotuberculosis (Ch'ng et al., 2011) or Streptococcus suis (Schultsz et al., 2012), or even in taxa considered non-clonal, such as Pseudomonas aeruginosa (Maatallah et al., 2011) or Burkholderia pseudomallei (Dale et al., 2011).

MLST data have been used to infer population structure at both temporal (de Filippis et al., 2012; Pérez-Losada et al., 2007b; Sproston et al., 2011) and geographical scales (Jorgensen et al., 2011) in, for example, Neisseria and Campylobacter, and to infer the epidemiological processes that may be responsible for the contemporary geographic distributions of diseases (phylogeography). For example, phylogeographic structure driven by host immunity has been detected in Staphylococcus aureus from West China (Fan et al., 2009), while human activity has driven differentiation in Clostridium difficile isolates from North America, Europe, and Australia (Stabler et al., 2012). Similar studies based on MLST data have determined the geographic origin of Mannheimia haemolytica in European cattle and sheep (Petersen et al., 2009).

Another major contribution of MLST to bacterial population genetics has been the assessment of the relative impact of recombination and point mutation (the r/m ratio) in bacteria and archaea (Vos and Didelot, 2009) and within and among clones of, for example, Neisseria meningitidis, Staphylococcus aureus, Yersinia pseudotuberculosis or Streptococcus dysgalactiae (Basic-Hammer et al., 2010; Ch'ng et al., 2011; Feil et al., 2000; Feil et al., 1999; McMillan et al., 2010; McMillan et al., 2011) or among species of Streptococcus (Ahmad et al., 2009; Do et al., 2010). MLST has also effectively identified the impact of selection in Orientia tsutsugamushi, Neisseria meningitidis, Bacillus cereus, Group B Streptococcus or Vibrio parahaemolyticus (Duong et al., 2012; Jolley et al., 2005; Raymond et al., 2010; Springman et al., 2009; Yan et al., 2011) and the contributors to population genetic diversity (see also Pérez-Losada et al., 2006).

Similarly, MLST has provided insights on past population dynamics (epidemiological history), inferred as the variation in relative genetic diversity (or population size) since some time in the past, usually the time of emergence of the disease, in Neisseria gonorrhoeae (Pérez-Losada et al., 2007a; Pérez-Losada et al., 2007b; Tazi et al., 2010).

Other applications

MLST data have also been applied to biosafety research such as the detection of contamination with Staphylococcus aureus in Portuguese public buses (Simoes et al., 2010), US West Coast public marine beaches (Soge et al., 2009), and in the working environment of many Swiss microbial laboratories (Schmidlin et al., 2010). Besides farm animals (above), MLST has also been applied in plant agriculture to identify genomospecies of Pseudomonas syringae causing bacterial leaf spot on parsley (Bull et al., 2011) and assess nodule occupancy of soybean by Bradyrhizobium (van Berkum et al., 2011), or to study the evolution of agriculture-associated disease caused by Campylobacter coli in farm animals from Scotland (Sheppard et al., 2010b). Another interesting application has been the tracing of ancient human migrations worldwide (Falush et al., 2003) or across India (Devi et al., 2007) and Malaysia (Tay et al., 2009), using Helicobacter pylori MLST data from human gastric mucosa.

Overall, MLST studies have both increased our knowledge of the diversity, population structure and dynamics of bacterial pathogens worldwide (basic research), and helped to design better strategies of control and treatment of the diseases caused by those pathogens (applied research), which ultimately has contributed to improve public health.

MLST in the genomic era

With advances in DNA sequencing technologies comes the natural question of whether or not MLST will continue to have utility in the next-gen $1,000 human genome era. The great advantage of MLST is the unlinked survey of genetic variation at the DNA sequence level at relatively low cost (Okoro et al., 2012). Yet next-gen sequencing (NGS) technologies are rapidly making these advantages moot. NGS also relieves some of the disadvantages of MLST (detailed above), including the need to have a genome of the target organism to begin with, the lack of broad application of individual loci across a diversity of species [because levels of genetic diversity and amounts of recombination vary across species for the same locus; but see Jolley et al. (2012)], and shorter read lengths to avoid complications of recombination. Next we highlight two approaches for incorporating NGS fully into pathogen typing, first through Single Nucleotide Polymorphism (SNP) analysis and second through whole genome sequence analysis. We then consider the bioinformatic implications and complications of dealing with this totally different volume of data and the associated challenges.

SNP discovery and typing

The first typing approach taking full advantage of whole-genome sequence data is that of SNP analysis. The central idea here is to get not just a single reference genome, as is the case with MLST typing, but a number of reference genomes to identify polymorphic sites within the genome. These sites or SNPs can then be used as diagnostic markers for specific species and/or strains within species, depending on the extent of variation in the species. Ideally, for species diagnostics based on SNPs, one is looking for fixed differences between species. Thus, the method becomes problematic if only a few reference genomes are used to establish whether variants are fixed or not within a species. This problem becomes worse when trying to diagnose strains within species, as many more samples are needed to effectively determine fixation of SNPs within strains and differences among strains. However, the advantage of SNPs is that they can provide broader genomic representation with less linkage (thereby lessening the potential impact of recombination). They are also relatively evolutionarily stable. Because these are genotypic data with character state information, they are amenable to robust phylogenetic and population genetic analyses (detailed above). SNP analyses have been used in pathogen population genetics for a number of years now with highly effective results (e.g., Filliol et al., 2006). Initially, SNPs were relatively expensive characters to develop for species typing; however, they have become highly efficient and effective for a variety of species. For example, Holt et al. (2010) used a survey of 2,000 SNPs to identify strains of Salmonella enterica serovar Typhi causing a typhoid outbreak in children from Kathmandu, Nepal.
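The distinction between polymorphic sites and fixed differences can be illustrated with a short Python sketch that scans a set of aligned sequences. The sequence names, the toy alignment, and the grouping into two species are hypothetical, and real SNP discovery pipelines work from read mappings or genome alignments and handle quality filtering, which this sketch ignores.

```python
BASES = set("ACGT")


def polymorphic_sites(alignment):
    """0-based columns showing more than one unambiguous base.

    alignment: {sequence name: aligned sequence}, all of equal length.
    Gaps and ambiguity codes are ignored when looking for variation.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for column in range(length):
        observed = {alignment[name][column].upper() for name in names} & BASES
        if len(observed) > 1:
            sites.append(column)
    return sites


def fixed_differences(alignment, groups):
    """Polymorphic columns where every group is monomorphic for its own base.

    groups: {group name: [sequence names]}, e.g. species or strains.
    """
    fixed = []
    for column in polymorphic_sites(alignment):
        states = [{alignment[name][column].upper() for name in members}
                  for members in groups.values()]
        if all(len(state) == 1 and state <= BASES for state in states):
            alleles = [next(iter(state)) for state in states]
            if len(set(alleles)) == len(alleles):
                fixed.append(column)
    return fixed


# Toy alignment (hypothetical): column 2 separates the two species,
# column 5 varies within species 1 only.
alignment = {
    "sp1_a": "ACGTAC",
    "sp1_b": "ACGTAT",
    "sp2_a": "ACCTAC",
    "sp2_b": "ACCTAC",
}
groups = {"sp1": ["sp1_a", "sp1_b"], "sp2": ["sp2_a", "sp2_b"]}
print(polymorphic_sites(alignment), fixed_differences(alignment, groups))
```

With only two genomes per group, as in this example, a site can easily look fixed by chance, which is the sampling problem described above.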

Whole genome sequence typing (WGST)

With costs of whole genome sequencing coming down significantly and the need for whole genome data for both MLST and SNP approaches, recent studies have simply bypassed these subsequent approaches and used the whole genome data per se for typing. The advantage of whole genome sequence typing (WGST) is clear: the highest resolution of genealogical data possible. This resolution has been instrumental in differentiating strains of, for example, Chlamydia trachomatis, where the relationships based on ompA (clinical) typing are masked by recombination (Harris et al., 2012). Indeed, that study demonstrates how whole genome data allow for the identification, and therefore accommodation, of recombination within the dataset and subsequent phylogenetic analyses. Demonstrating the resolving power of WGST against other genetic (SNPs) and phenotypic (RFLP, VNTR) approaches in distinguishing strains of Mycobacterium tuberculosis, Schürch and van Soolingen (2012) argued that WGST will become the sole diagnostic tool for tuberculosis, including genetic characterization and drug resistance and susceptibility testing. However, others have argued for a more integrated approach (combining SNP analysis with WGST), especially while sequencing costs are still high and therefore subject studies to sampling bias (Pearson et al., 2009). But with technological advances occurring regularly, we are quickly moving to the full capacity of WGST (see Fig. 1 - WGS) as a standard operating procedure (Vogel et al., 2012). This is ideal, as many studies have shown that unique genetic elements can only be revealed through whole genome sequencing and comparative genomics (Rasko et al., 2011).

Bioinformatic considerations

Despite the significant promise of next generation sequencing techniques leading to whole genome sequence typing for pathogens, the move to whole genome analysis is not without challenges. The most significant of these is the ability to analyze this new volume of data in a reasonable and efficient manner. With WGST comes the need for genome assembly, which can be fraught with difficulty (Schatz et al., 2010) and can thereby introduce errors in assembled genomes that will appear as strain-specific variation. Thus, great care must be taken with analyses of whole genome data both at the assembly stage and in downstream analyses. One approach to deal with this volume of data is to relate these whole genome sequence data back to MLST (Larsen et al., 2012). However, this approach then loses the advantages of WGST over MLST, including a broader survey of genetic signatures that are often critical in identifying causal agents of pathogenic outbreaks (e.g., Eppinger et al., 2011). An alternative approach is to map raw sequence reads to a reference database of pathogens for rapid and efficient identification of pathogens associated with a next-gen sequencing run from a biological sample (Clement et al., 2010; Francis et al., 2012). This approach has the advantage of avoiding the assembly step altogether, but requires a robust reference library of genomes to query against. No doubt substantial methodological advances will occur as more and more whole genome sequence data sets become available for consideration.

Conclusions and prospects

MLST has played a crucial role in diagnosing pathogens of human disease as well as agents of bioterrorism. Rapid identification of such agents is crucial to our ability to identify, track, and treat such outbreaks. MLST has proven to be a high-resolution genetic approach that provides data amenable to sophisticated phylogenetic and population genetic analyses. However, with the decrease in cost of genome sequencing, researchers are already moving to whole genome sequence analyses for such studies. We are clearly in the transition phase from MLST to whole genome sequence typing, and this shift provides extensive opportunity for the development of novel methodologies to accommodate this increased volume of high-resolution genomic information.

References

Aanensen, D.M., Huntley, D.M., Feil, E.J., Spratt, B.G., 2009. EpiCollect: linking smartphones to web applications for epidemiology, ecology and community data collection. PLoS One 4, e6968.

Acinas, S.G., Klepac-Ceraj, V., Hunt, D.E., Pharino, C., Ceraj, I., Distel, D.L., Polz, M.F., 2004. Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430, 551-554.

Adetifa, I.M., Antonio, M., Okoromah, C.A., Ebruke, C., Inem, V., Nsekpong, D., Bojang, A., Adegbola, R.A., 2012. Pre-vaccination nasopharyngeal pneumococcal carriage in a Nigerian population: epidemiology and population biology. PLoS One 7, e30548.

Ahmad, Y., Gertz, R.E., Jr., Li, Z., Sakota, V., Broyles, L.N., Van Beneden, C., Facklam, R., Shewmaker, P.L., Reingold, A., Farley, M.M., Beall, B.W., 2009. Genetic relationships deduced from emm and multilocus sequence typing of invasive Streptococcus dysgalactiae subsp. equisimilis and S. canis recovered from isolates collected in the United States. J Clin Microbiol 47, 2046-2054.

Atkinson, S.R., Paul, J., Sloan, E., Curtis, S., Miller, R., 2009. The emergence of meticillin-resistant Staphylococcus aureus among injecting drug users. J Infect 58, 339-345.

Auton, A., McVean, G., 2012. Estimating recombination rates from genetic variation in humans, in: Anisimova, M. (Ed.), Evolutionary Genomics. Humana Press, pp. 217-237.

Baele, G., Lemey, P., Bedford, T., Rambaut, A., Suchard, M.A., Alekseyenko, A.V., 2012. Improving the Accuracy of Demographic and Molecular Clock Model Comparison While Accommodating Phylogenetic Uncertainty. Molecular biology and evolution 29, 2157-2167.

Baker, S., Hanage, W.P., Holt, K.E., 2010. Navigating the future of bacterial molecular epidemiology. Current opinion in microbiology 13, 640-645.

Bambini, S., Muzzi, A., Olcen, P., Rappuoli, R., Pizza, M., Comanducci, M., 2009. Distribution and genetic variability of three vaccine components in a panel of strains representative of the diversity of serogroup B meningococcus. Vaccine 27, 2794-2803.

Bashalkhanov, S., Pandey, M., Rajora, O., 2009. A simple method for estimating genetic diversity in large populations from finite sample sizes. BMC Genetics 10, 84.

Basic-Hammer, N., Vogel, V., Basset, P., Blanc, D.S., 2010. Impact of recombination on genetic variability within Staphylococcus aureus clonal complexes. Infect Genet Evol 10, 1117-1123.

Bessell, P.R., Rotariu, O., Innocent, G.T., Smith-Palmer, A., Strachan, N.J., Forbes, K.J., Cowden, J.M., Reid, S.W., Matthews, L., 2012. Using sequence data to identify alternative routes and risk of infection: a case-study of Campylobacter in Scotland. BMC Infect Dis 12, 80.

Bielejec, F., Rambaut, A., Suchard, M.A., Lemey, P., 2011. SPREAD: spatial phylogenetic reconstruction of evolutionary dynamics. Bioinformatics 27, 2910-2912.

Bilhere, E., Lucas, P.M., Claisse, O., Lonvaud-Funel, A., 2009. Multilocus sequence typing of Oenococcus oeni: detection of two subpopulations shaped by intergenic recombination. Appl Environ Microbiol 75, 1291-1300.

Boers, S.A., van der Reijden, W.A., Jansen, R., 2012. High-Throughput Multilocus Sequence Typing: Bringing Molecular Typing to the Next Level. PLoS One 7, e39630.

Brehony, C., Jolley, K.A., Maiden, M.C., 2007. Multilocus sequence typing for global surveillance of meningococcal disease. FEMS Microbiol Rev 31, 15-26.

Bridier, J., Claisse, O., Coton, M., Coton, E., Lonvaud-Funel, A., 2010. Evidence of distinct populations and specific subpopulations within the species Oenococcus oeni. Appl Environ Microbiol 76, 7754-7764.

Bull, C.T., Clarke, C.R., Cai, R., Vinatzer, B.A., Jardini, T.M., Koike, S.T., 2011. Multilocus sequence typing of Pseudomonas syringae sensu lato confirms previously described genomospecies and permits rapid identification of P. syringae pv. coriandricola and P. syringae pv. apii causing bacterial leaf spot on parsley. Phytopathology 101, 847-858.

Bybee, S.M., Bracken-Grissom, H., Haynes, B.D., Hermansen, R.A., Byers, R.L., Clement, M.J., Udall, J.A., Wilcox, E.R., Crandall, K.A., 2011. Targeted Amplicon Sequencing (TAS): A Scalable Next-Gen Approach to Multilocus, Multitaxa Phylogenetics. Genome Biology and Evolution 3, 1312-1323.

Byrnes, E.J., 3rd, Li, W., Lewit, Y., Ma, H., Voelz, K., Ren, P., Carter, D.A., Chaturvedi, V., Bildfell, R.J., May, R.C., Heitman, J., 2010. Emergence and pathogenicity of highly virulent Cryptococcus gattii genotypes in the northwest United States. PLoS Pathog 6, e1000850.

Cassens, I., Mardulyn, P., Milinkovitch, M.C., 2005. Evaluating intraspecific "network" construction methods using simulated sequence data: do existing algorithms outperform the global maximum parsimony approach? Systematic Biology 54, 363-372.

Castro-Nallar, E., Crandall, K.A., Pérez-Losada, M., 2012. Genetic diversity and molecular epidemiology of HIV transmission. Future Virology 7, 239-252.

Cavanagh, J.P., Klingenberg, C., Hanssen, A.M., Fredheim, E.A., Francois, P., Schrenzel, J., Flaegstad, T., Sollid, J.E., 2012. Core genome conservation of Staphylococcus haemolyticus limits sequence based population structure analysis. J Microbiol Methods 89, 159-166.

Ch'ng, S.L., Octavia, S., Xia, Q., Duong, A., Tanaka, M.M., Fukushima, H., Lan, R., 2011. Population structure and evolution of pathogenicity of Yersinia pseudotuberculosis. Appl Environ Microbiol 77, 768-775.

Chaloner, G.L., Palmira, V., Birtles, R.J., 2011. Multi-locus sequence analysis reveals profound genetic diversity among isolates of the human pathogen Bartonella bacilliformis. PLoS Negl Trop Dis 5, e1248.

Chan, M.S., Maiden, M.C.J., Spratt, B.G., 2001. Database-driven multi locus sequence typing (MLST) of bacterial pathogens. Bioinformatics 17, 1077-1083.

Cheng, L., Connor, T.R., Aanensen, D.M., Spratt, B.G., Corander, J., 2011. Bayesian semi-supervised classification of bacterial samples using MLST databases. BMC bioinformatics 12, 302.

Clement, N.L., Snell, Q., Clement, M.J., Hollenhorst, P.C., Purwar, J., Graves, B.J., Cairns, B.R., Johnson, W.E., 2010. The GNUMAP algorithm: Unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics 26, 38-45.

Climent, Y., Urwin, R., Yero, D., Martinez, I., Martin, A., Sotolongo, F., Maiden, M.C., Pajon, R., 2010. The genetic structure of Neisseria meningitidis populations in Cuba before and after the introduction of a serogroup BC vaccine. Infect Genet Evol 10, 546-554.

Comas, I., Homolka, S., Niemann, S., Gagneux, S., 2009. Genotyping of genetically monomorphic bacteria: DNA sequencing in Mycobacterium tuberculosis highlights the limitations of current methodologies. PLoS One 4, e7815.

Cooper, J.E., Feil, E.J., 2006. The phylogeny of Staphylococcus aureus–which genes make the best intra-species markers? Microbiology 152, 1297-1305.

Crandall, K.A., Pérez-Losada, M., 2008. Epidemiological and evolutionary dynamics of pathogens, in: Baquero, F., Nombela, C., Cassell, G.H., Gutiérrez-Fuentes, J.A. (Eds.), Evolutionary Biology of Bacterial and Fungal Pathogens. ASM Press, Washington, DC, pp. 21-30.

Cronn, R., Knaus, B.J., Liston, A., Maughan, P.J., Parks, M., Syring, J.V., Udall, J., 2012. Targeted enrichment strategies for next-generation plant biology. Am J Bot 99, 291-311.

Dale, J., Price, E.P., Hornstra, H., Busch, J.D., Mayo, M., Godoy, D., Wuthiekanun, V., Baker, A., Foster, J.T., Wagner, D.M., Tuanyok, A., Warner, J., Spratt, B.G., Peacock, S.J., Currie, B.J., Keim, P., Pearson, T., 2011. Epidemiological tracking and population assignment of the non-clonal bacterium, Burkholderia pseudomallei. PLoS Negl Trop Dis 5, e1381.

Darling, A.E., Mau, B., Perna, N.T., 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5, e11147.

Darriba, D., Taboada, G.L., Doallo, R., Posada, D., 2012. jModelTest 2: more models, new heuristics and parallel computing. Nat Meth 9, 772-772.

de Filippis, I., de Lemos, A.P., Hostetler, J.B., Wollenberg, K., Sacchi, C.T., Harrison, L.H., Bash, M.C., Prevots, D.R., 2012. Molecular epidemiology of Neisseria meningitidis serogroup B in Brazil. PLoS One 7, e33016.

De Francesco, M.A., Caracciolo, S., Gargiulo, F., Manca, N., 2011. Phenotypes, genotypes, serotypes and molecular epidemiology of erythromycin-resistant Streptococcus agalactiae in Italy. Eur J Clin Microbiol Infect Dis.

Devi, S.M., Ahmed, I., Francalacci, P., Hussain, M.A., Akhter, Y., Alvi, A., Sechi, L.A., Megraud, F., Ahmed, N., 2007. Ancestral European roots of Helicobacter pylori in India. BMC Genomics 8, 184.

Didelot, X., Falush, D., 2007. Inference of bacterial microevolution using multilocus sequence data. Genetics 175, 1251-1266.

Didelot, X., Maiden, M.C.J., 2010. Impact of recombination on bacterial evolution. Trends in microbiology 18, 315-322.

Dingle, K.E., Griffiths, D., Didelot, X., Evans, J., Vaughan, A., Kachrimanidou, M., Stoesser, N., Jolley, K.A., Golubchik, T., Harding, R.M., Peto, T.E., Fawley, W., Walker, A.S., Wilcox, M., Crook, D.W., 2011. Clinical Clostridium difficile: clonality and pathogenicity locus diversity. PLoS One 6, e19993.

Do, T., Gilbert, S.C., Clark, D., Ali, F., Fatturi Parolo, C.C., Maltz, M., Russell, R.R., Holbrook, P., Wade, W.G., Beighton, D., 2010. Generation of diversity in Streptococcus mutans genes demonstrated by MLST. PLoS One 5, e9073.

Do, T., Jolley, K.A., Maiden, M.C., Gilbert, S.C., Clark, D., Wade, W.G., Beighton, D., 2009. Population structure of Streptococcus oralis. Microbiology 155, 2593-2602.

Drummond, A.J., Rambaut, A., 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7, 214.

Drummond, A.J., Rambaut, A., Shapiro, B., Pybus, O.G., 2005. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol 22, 1185-1192.

DuBose, R.F., Dykhuizen, D.E., Hartl, D.L., 1988. Genetic exchange among natural isolates of bacteria: Recombination within the phoA gene of Escherichia coli. Proceedings of the National Academy of Sciences, U.S.A. 85, 7036-7040.

Duong, V., Blassdell, K., May, T.T., Sreyrath, L., Gavotte, L., Morand, S., Frutos, R., Buchy, P., 2012. Diversity of Orientia tsutsugamushi clinical isolates in Cambodia reveals active selection and recombination process. Infect Genet Evol (in press).

Dykhuizen, D.E., Green, L., 1991. Recombination in Escherichia coli and the definition of biological species. Journal of Bacteriology 173, 7257-7268.

Egger, R., Korczak, B.M., Niederer, L., Overesch, G., Kuhnert, P., 2012. Genotypes and antibiotic resistance of Campylobacter coli in fattening pigs. Vet Microbiol 155, 272-278.

Elberse, K.E.M., Nunes, S., Sá-Leão, R., van der Heide, H.G.J., Schouls, L.M., 2011. Multiple-Locus Variable Number Tandem Repeat Analysis for Streptococcus pneumoniae: Comparison with PFGE and MLST. PLoS One 6, e19668.

Enright, M.C., Spratt, B.G., 1998. A multilocus sequence typing scheme for Streptococcus pneumoniae: identification of clones associated with serious invasive disease. Microbiology 144, 3049-3060.

Eppinger, M., Mammel, M.K., Leclerc, J.E., Ravel, J., Cebula, T.A., 2011. Genomic anatomy of Escherichia coli O157:H7 outbreaks. Proc Natl Acad Sci U S A 108, 20142-20147.

Erali, M., Voelkerding, K.V., Wittwer, C.T., 2008. High resolution melting applications for clinical laboratory medicine. Experimental and molecular pathology 85, 50-58.

Ewing, B., Green, P., 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8, 186-194.

Ewing, B., Hillier, L., Wendl, M.C., Green, P., 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8, 175-185.

Excoffier, L., Heckel, G., 2006. Computer programs for population genetics data analysis: a survival guide. Nat Rev Genet 7, 745-758.

Faircloth, B.C., McCormack, J.E., Crawford, N.G., Harvey, M.G., Brumfield, R.T., Glenn, T.C., 2012. Ultraconserved Elements Anchor Thousands of Genetic Markers Spanning Multiple Evolutionary Timescales. Systematic Biology 61, 717-726.

Falush, D., Wirth, T., Linz, B., Pritchard, J.K., Stephens, M., Kidd, M., Blaser, M.J., Graham, D.Y., Vacher, S., Pérez-Pérez, G.I., Yamaoka, Y., Mégraud, F., Otto, K., Reichard, U., Katzowitsch, E., Wang, X., Achtman, M., Suerbaum, S., 2003. Traces of human migrations in Helicobacter pylori populations. Science 299, 1582-1585.

Fan, J., Shu, M., Zhang, G., Zhou, W., Jiang, Y., Zhu, Y., Chen, G., Peacock, S.J., Wan, C., Pan, W., Feil, E.J., 2009. Biogeography and virulence of Staphylococcus aureus. PLoS One 4, e6216.

Feil, E.J., Enright, M.C., Spratt, B.G., 2000. Estimating the relative contributions of mutation and recombination to clonal diversification: a comparison between Neisseria meningitidis and Streptococcus pneumoniae. Research in microbiology 151, 465-469.

Feil, E.J., Li, B.C., Aanensen, D.M., Hanage, W.P., Spratt, B.G., 2004. eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J Bacteriol 186, 1518-1530.

Feil, E.J., Maiden, M.C.J., Achtman, M., Spratt, B.G., 1999. The relative contributions of recombination and mutation to the divergence of clones of Neisseria meningitidis. Molecular Biology and Evolution 16, 1496-1502.

Felsenstein, J., 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17, 368-376.

Felsenstein, J., 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783-791.

Ferreira, R., Borges, V., Nunes, A., Nogueira, P.J., Borrego, M.J., Gomes, J.P., 2012. Impact of Loci Nature on Estimating Recombination and Mutation Rates in Chlamydia trachomatis. G3: Genes|Genomes|Genetics 2, 761-768.

Filliol, I., Motiwala, A.S., Cavatore, M., Qi, W., Hazbon, M.H., Bobadilla del Valle, M., Fyfe, J., Garcia-Garcia, L., Rastogi, N., Sola, C., Zozio, T., Guerrero, M.I., Leon, C.I., Crabtree, J., Angiuoli, S., Eisenach, K.D., Durmaz, R., Joloba, M.L., Rendon, A., Sifuentes-Osornio, J., Ponce de Leon, A., Cave, M.D., Fleischmann, R., Whittam, T.S., Alland, D., 2006. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol 188, 759-772.

Francis, O.E., Bendall, M., Clement, N.L., Snell, Q., Schaalje, G.B., Clement, M.J., Crandall, K.A., Johnson, W.E., 2012. Species identification and strain attribution with unassembled sequencing data. submitted.

Francisco, A.P., Bugalho, M., Ramirez, M., Carrico, J.A., 2009. Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. BMC bioinformatics 10, 152.

Francisco, A.P., Vaz, C., Monteiro, P.T., Melo-Cristino, J., Ramirez, M., Carrico, J.A., 2012. PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods. BMC bioinformatics 13, 87.

Fraser, C., Hanage, W.P., Spratt, B.G., 2007. Recombination and the nature of bacterial speciation. Science 315, 476-480.

Godreuil, S., Cohan, F., Shah, H., Tibayrenc, M., 2005. Which species concept for pathogenic bacteria? An E-Debate. Infect Genet Evol 5, 375-387.

Gray, R.R., Tatem, A.J., Johnson, J.A., Alekseyenko, A.V., Pybus, O.G., Suchard, M.A., Salemi, M., 2011. Testing Spatiotemporal Hypothesis of Bacterial Evolution Using Methicillin-Resistant Staphylococcus aureus ST239 Genome-wide Data within a Bayesian Framework. Molecular Biology and Evolution 28, 1593-1603.

Gripp, E., Hlahla, D., Didelot, X., Kops, F., Maurischat, S., Tedin, K., Alter, T., Ellerbroek, L., Schreiber, K., Schomburg, D., Janssen, T., Bartholomaus, P., Hofreuter, D., Woltemate, S., Uhr, M., Brenneke, B., Gruning, P., Gerlach, G., Wieler, L., Suerbaum, S., Josenhans, C., 2011. Closely related Campylobacter jejuni strains from different sources reveal a generalist rather than a specialist lifestyle. BMC Genomics 12, 584.

Grundmann, H., Aanensen, D.M., Van Den Wijngaard, C.C., Spratt, B.G., Harmsen, D., Friedrich, A.W., 2010. Geographic distribution of Staphylococcus aureus causing invasive infections in Europe: a molecular-epidemiological analysis. PLoS medicine 7, e1000215.

Guindon, S., Dufayard, J.-F., Lefort, V., Anisimova, M., Hordijk, W., Gascuel, O., 2010. New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Systematic Biology 59, 307-321.

Guinebretiere, M.H., Auger, S., Galleron, N., Contzen, M., De Sarrau, B., De Buyser, M.L., Lamberet, G., Fagerlund, A., Granum, P.E., Lereclus, D., De Vos, P., Nguyen-The, C., Sorokin, A., 2012. Bacillus cytotoxicus sp. nov. is a new thermotolerant species of the Bacillus cereus group occasionally associated with food poisoning. Int J Syst Evol Microbiol in press.

Hall, B.G., Barlow, M., 2006. Phylogenetic analysis as a tool in molecular epidemiology of infectious diseases. Annals of epidemiology 16, 157-169.

Hall, B.G., Ehrlich, G.D., Hu, F.Z., 2010. Pan-genome analysis provides much higher strain typing resolution than multi-locus sequence typing. Microbiology 156, 1060-1068.

Hanage, W.P., Bishop, C.J., Lee, G.M., Lipsitch, M., Stevenson, A., Rifas-Shiman, S.L., Pelton, S.I., Huang, S.S., Finkelstein, J.A., 2011. Clonal replacement among 19A Streptococcus pneumoniae in Massachusetts, prior to 13 valent conjugate vaccination. Vaccine 29, 8877-8881.

Haran, K.P., Godden, S.M., Boxrud, D., Jawahir, S., Bender, J.B., Sreevatsan, S., 2012. Prevalence and characterization of Staphylococcus aureus, including methicillin-resistant Staphylococcus aureus, isolated from bulk tank milk from Minnesota dairy farms. J Clin Microbiol 50, 688-695.

Harbottle, H., White, D., McDermott, P., Walker, R., Zhao, S., 2006. Comparison of multilocus sequence typing, pulsed-field gel electrophoresis, and antimicrobial susceptibility typing for characterization of Salmonella enterica serotype Newport isolates. Journal of clinical microbiology 44, 2449-2457.

Harismendy, O., Ng, P., Strausberg, R., Wang, X., Stockwell, T., Beeson, K., Schork, N., Murray, S., Topol, E., Levy, S., Frazer, K., 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology 10, R32.

Harris, S.R., Clarke, I.N., Seth-Smith, H.M., Solomon, A.W., Cutcliffe, L.T., Marsh, P., Skilton, R.J., Holland, M.J., Mabey, D., Peeling, R.W., Lewis, D.A., Spratt, B.G., Unemo, M., Persson, K., Bjartling, C., Brunham, R., de Vries, H.J., Morre, S.A., Speksnijder, A., Bebear, C.M., Clerc, M., de Barbeyrac, B., Parkhill, J., Thomson, N.R., 2012. Whole-genome analysis of diverse Chlamydia trachomatis strains identifies phylogenetic relationships masked by current clinical typing. Nat Genet 44, 413-419, S411.

Heled, J., Drummond, A.J., 2010. Bayesian inference of species trees from multilocus data. Mol Biol Evol 27, 570-580.

Henssge, U., Do, T., Gilbert, S.C., Cox, S., Clark, D., Wickstrom, C., Ligtenberg, A.J., Radford, D.R., Beighton, D., 2011. Application of MLST and pilus gene sequence comparisons to investigate the population structures of Actinomyces naeslundii and Actinomyces oris. PLoS One 6, e21430.

Heym, B., Le Moal, M., Armand-Lefevre, L., Nicolas-Chanoine, M.H., 2002. Multilocus sequence typing (MLST) shows that the ‘Iberian’ clone of methicillin-resistant Staphylococcus aureus has spread to France and acquired reduced susceptibility to teicoplanin. Journal of Antimicrobial Chemotherapy 50, 323-329.

Hird, S.M., Brumfield, R.T., Carstens, B.C., 2011. PRGmatic: an efficient pipeline for collating genome-enriched second-generation sequencing data using a ‘provisional-reference genome’. Molecular Ecology Resources 11, 743-748.

Hoff, K., 2009. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 10, 520.

Holt, K.E., Baker, S., Dongol, S., Basnyat, B., Adhikari, N., Thorson, S., Pulickal, A.S., Song, Y., Parkhill, J., Farrar, J.J., Murdoch, D.R., Kelly, D.F., Pollard, A.J., Dougan, G., 2010. High-throughput bacterial SNP typing identifies distinct clusters of Salmonella Typhi causing typhoid in Nepalese children. BMC Infect Dis 10, 144.

Hotchkiss, E.J., Hodgson, J.C., Lainson, F.A., Zadoks, R.N., 2011. Multilocus sequence typing of a global collection of Pasteurella multocida isolates from cattle and other host species demonstrates niche association. BMC Microbiol 11, 115.

Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P., 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310-2314.

Huson, D.H., Bryant, D., 2006. Application of Phylogenetic Networks in Evolutionary Studies. Molecular Biology and Evolution 23, 254-267.

Ibarz-Pavon, A.B., Morais, L., Sigauque, B., Mandomando, I., Bassat, Q., Nhacolo, A., Quinto, L., Soriano-Gabarro, M., Alonso, P.L., Roca, A., 2011. Epidemiology, molecular characterization and antibiotic resistance of Neisseria meningitidis from patients ≤ 15 years in Manhica, rural Mozambique. PLoS One 6, e19717.

Jeffrey, D.W., 2004. Estimating recombination rates using three-site likelihoods. Genetics 167, 1461-1473.

Jolley, K.A., 2009. Internet-based sequence-typing databases for bacterial molecular epidemiology. Methods in molecular biology (Clifton, NJ) 551, 305.

Jolley, K.A., Bliss, C.M., Bennett, J.S., Bratcher, H.B., Brehony, C.M., Colles, F.M., Wimalarathna, H.M., Harrison, O.B., Sheppard, S.K., Cody, A.J., 2012. Ribosomal Multi-Locus Sequence Typing: universal characterisation of bacteria from domain to strain. Microbiology.

Jolley, K.A., Chan, M.S., Maiden, M.C.J., 2004. mlstdbNet–distributed multi-locus sequence typing (MLST) databases. BMC bioinformatics 5, 86.

Jolley, K.A., Feil, E.J., Chan, M.S., Maiden, M.C., 2001. Sequence type analysis and recombinational tests (START). Bioinformatics 17, 1230-1231.

Jolley, K.A., Maiden, M.C.J., 2006. AgdbNet–antigen sequence database software for bacterial typing. BMC bioinformatics 7, 314.

Jolley, K.A., Maiden, M.C.J., 2010. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC bioinformatics 11, 595.

Jolley, K.A., Wilson, D.J., Kriz, P., McVean, G., Maiden, M.C., 2005. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis. Mol Biol Evol 22, 562-569.

Jorgensen, F., Ellis-Iversen, J., Rushton, S., Bull, S.A., Harris, S.A., Bryan, S.J., Gonzalez, A., Humphrey, T.J., 2011. Influence of season and geography on Campylobacter jejuni and C. coli subtypes in housed broiler flocks reared in Great Britain. Appl Environ Microbiol 77, 3741-3748.

Katoh, K., Kuma, K., Toh, H., Miyata, T., 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33, 511-518.

Kelsey, C.R., Crandall, K.A., Voevodin, A.F., 1999. Different models, different trees: The geographic origin of PTLV-I. Molecular Phylogenetics and Evolution 13, 336-347.

Killgore, G., Thompson, A., Johnson, S., Brazier, J., Kuijper, E., Pepin, J., Frost, E.H., Savelkoul, P., Nicholson, B., Van Den Berg, R.J., 2008. Comparison of seven techniques for typing international epidemic strains of Clostridium difficile: restriction endonuclease analysis, pulsed-field gel electrophoresis, PCR-ribotyping, multilocus sequence typing, multilocus variable-number tandem-repeat analysis, amplified fragment length polymorphism, and surface layer protein A gene sequence typing. Journal of clinical microbiology 46, 431-437.

Kosakovsky Pond, S.L., Frost, S.D.W., Muse, S.V., 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics 21, 676-679.

Kriz, P., Kalmusova, J., Felsberg, J., 2002. Multilocus sequence typing of Neisseria meningitidis directly from cerebrospinal fluid. Epidemiology and infection 128, 157-160.

Kubatko, L.S., Carstens, B.C., Knowles, L.L., 2009. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 25, 971-973.

Kuhn, G., Francioli, P., Blanc, D., 2006. Evidence for clonal evolution among highly polymorphic genes in methicillin-resistant Staphylococcus aureus. Journal of bacteriology 188, 169-178.

Kuhner, M.K., 2006. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics 22, 768-770.

Larsen, M.V., Cosentino, S., Rasmussen, S., Friis, C., Hasman, H., Marvig, R.L., Jelsbak, L., Sicheritz-Ponten, T., Ussery, D.W., Aarestrup, F.M., Lund, O., 2012. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol 50, 1355-1361.

Lefebvre, J.F., Labuda, D., 2008. Fraction of informative recombinations: A heuristic approach to analyze recombination rates. Genetics 178, 2069-2079.

Lemey, P., Rambaut, A., Drummond, A.J., Suchard, M.A., 2009. Bayesian phylogeography finds its roots. PLoS Comput Biol 5, e1000520.

Lemey, P., Rambaut, A., Welch, J.J., Suchard, M.A., 2010. Phylogeography takes a relaxed random walk in continuous space and time. Mol Biol Evol 27, 1877-1885.

Lemmon, A.R., Emme, S.A., Lemmon, E.M., 2012. Anchored Hybrid Enrichment for Massively High-Throughput Phylogenomics. Systematic Biology 61, 727-744.

Lewis-Rogers, N., Bendall, M.L., Crandall, K.A., 2009. Phylogenetic relationships and molecular adaptation dynamics of human rhinoviruses. Molecular Biology and Evolution 26, 969-981.

Li, W., Raoult, D., Fournier, P.-E., 2009. Bacterial strain typing in the genomic era. FEMS Microbiology Reviews 33, 892-916.

Liu, L., 2008. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics 24, 2542-2543.

Lorenz, M.G., Wackernagel, W., 1994. Bacterial gene transfer by natural genetic transformation in the environment. Microbiological Reviews 58, 563.

Maatallah, M., Cheriaa, J., Backhrouf, A., Iversen, A., Grundmann, H., Do, T., Lanotte, P., Mastouri, M., Elghmati, M.S., Rojo, F., Mejdi, S., Giske, C.G., 2011. Population structure of Pseudomonas aeruginosa from five Mediterranean countries: evidence for frequent recombination and epidemic occurrence of CC235. PLoS One 6, e25617.

Maiden, M.C., Stuart, J.M., 2002. Carriage of serogroup C meningococci 1 year after meningococcal C conjugate polysaccharide vaccination. Lancet 359, 1829-1831.

Maiden, M.C.J., 2006. Multilocus sequence typing of bacteria. Annu. Rev. Microbiol. 60, 561-588.

Maiden, M.C.J., Bygraves, J.A., Feil, E., Morelli, G., Russell, J.E., Urwin, R., Zhang, Q., Zhou, J., Zurth, K., Caugant, D.A., 1998. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences 95, 3140.

Makino, H., Kushiro, A., Ishikawa, E., Muylaert, D., Kubota, H., Sakai, T., Oishi, K., Martin, R., Ben Amor, K., Oozeer, R., Knol, J., Tanaka, R., 2011. Transmission of intestinal Bifidobacterium longum subsp. longum strains from mother to infant, determined by multilocus sequencing typing and amplified fragment length polymorphism. Appl Environ Microbiol 77, 6788-6793.

Malachowa, N., Sabat, A., Gniadkowski, M., Krzyszton-Russjan, J., Empel, J., Miedzobrodzki, J., Kosowska-Shick, K., Appelbaum, P.C., Hryniewicz, W., 2005. Comparison of multiple-locus variable-number tandem-repeat analysis with pulsed-field gel electrophoresis, spa typing, and multilocus sequence typing for clonal characterization of Staphylococcus aureus isolates. Journal of clinical microbiology 43, 3095-3100.

Mamanova, L., Coffey, A.J., Scott, C.E., Kozarewa, I., Turner, E.H., Kumar, A., Howard, E., Shendure, J., Turner, D.J., 2010. Target-enrichment strategies for next-generation sequencing. Nat Meth 7, 111-118.

Martin, D.P., Lemey, P., Lott, M., Moulton, V., Posada, D., Lefeuvre, P., 2010. RDP3: a flexible and fast computer program for analyzing recombination. Bioinformatics 26, 2462-2463.

Martin, D.P., Lemey, P., Posada, D., 2011. Analysing recombination in nucleotide sequences. Molecular Ecology Resources 11, 943-955.

Martin, V., Maldonado-Barragan, A., Moles, L., Rodriguez-Banos, M., Campo, R.D., Fernandez, L., Rodriguez, J.M., Jimenez, E., 2012. Sharing of bacterial strains between breast milk and infant feces. J Hum Lact 28, 36-44.

Matsunari, O., Shiota, S., Suzuki, R., Watada, M., Kinjo, N., Murakami, K., Fujioka, T., Kinjo, F., Yamaoka, Y., 2012. Association between Helicobacter pylori virulence factors and gastroduodenal diseases in Okinawa, Japan. J Clin Microbiol 50, 876-883.

McAdam, P.R., Templeton, K.E., Edwards, G.F., Holden, M.T.G., Feil, E.J., Aanensen, D.M., Bargawi, H.J.A., Spratt, B.G., Bentley, S.D., Parkhill, J., Enright, M.C., Holmes, A., Girvan, E.K., Godfrey, P.A., Feldgarden, M., Kearns, A.M., Rambaut, A., Robinson, D.A., Fitzgerald, J.R., 2012. Molecular tracing of the emergence, adaptation, and transmission of hospital-associated methicillin-resistant Staphylococcus aureus. Proceedings of the National Academy of Sciences 109, 9107-9112.

McMillan, D.J., Bessen, D.E., Pinho, M., Ford, C., Hall, G.S., Melo-Cristino, J., Ramirez, M., 2010. Population genetics of Streptococcus dysgalactiae subspecies equisimilis reveals widely dispersed clones and extensive recombination. PLoS One 5, e11741.

McMillan, D.J., Kaul, S.Y., Bramhachari, P.V., Smeesters, P.R., Vu, T., Karmarkar, M.G., Shaila, M.S., Sriprakash, K.S., 2011. Recombination drives genetic diversification of Streptococcus dysgalactiae subspecies equisimilis in a region of streptococcal endemicity. PLoS One 6, e21346.

Medini, D., Serruto, D., Parkhill, J., Relman, D.A., Donati, C., Moxon, R., Falkow, S., Rappuoli, R., 2008. Microbiology in the post-genomic era. Nat Rev Micro 6, 419-430.

Melles, D.C., van Leeuwen, W.B., Snijders, S.V., Horst-Kreft, D., Peeters, J.K., Verbrugh, H.A., van Belkum, A., 2007. Comparison of multilocus sequence typing (MLST), pulsed-field gel electrophoresis (PFGE), and amplified fragment length polymorphism (AFLP) for genetic typing of Staphylococcus aureus. Journal of microbiological methods 69, 371-375.

Metzker, M.L., 2010. Sequencing technologies - the next generation. Nat Rev Genet 11, 31-46.

Millat, G., Chanavat, V., Julia, S., Crehalet, H., Bouvagnet, P., Rousson, R., 2009. Validation of high-resolution DNA melting analysis for mutation scanning of the LMNA gene. Clinical biochemistry 42, 892-898.

Minin, V.N., Bloomquist, E.W., Suchard, M.A., 2008. Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Mol Biol Evol 25, 1459-1471.

Ngo, T.H., Tran, T.B., Tran, T.T., Nguyen, V.D., Campbell, J., Pham, H.A., Huynh, H.T., Nguyen, V.V., Bryant, J.E., Tran, T.H., Farrar, J., Schultsz, C., 2011. Slaughterhouse pigs are a major reservoir of Streptococcus suis serotype 2 capable of causing human infection in southern Vietnam. PLoS One 6, e17943.

O'Mahony, E., Buckley, J.F., Bolton, D., Whyte, P., Fanning, S., 2011. Molecular epidemiology of Campylobacter isolates from poultry production units in southern Ireland. PLoS One 6, e28490.

Okoro, C.K., Kingsley, R.A., Connor, T.R., Harris, S.R., Parry, C.M., Al-Mashhadani, M.N., Kariuki, S., Msefula, C.L., Gordon, M.A., de Pinna, E., Wain, J., Heyderman, R.S., Obaro, S., Alonso, P.L., Mandomando, I., MacLennan, C.A., Tapia, M.D., Levine, M.M., Tennant, S.M., Parkhill, J., Dougan, G., 2012. Intracontinental spread of human invasive Salmonella Typhimurium pathovariants in sub-Saharan Africa. Nat Genet advance online publication.

Padhukasahasram, B., Wall, J.D., Marjoram, P., Nordborg, M., 2006. Estimating recombination rates from single-nucleotide polymorphisms using summary statistics. Genetics 174, 1517-1528.

Palazzo, I.C., Pitondo-Silva, A., Levy, C.E., da Costa Darini, A.L., 2011. Changes in vancomycin-resistant Enterococcus faecium causing outbreaks in Brazil. J Hosp Infect 79, 70-74.

Parkhill, J., Sebaihia, M., Preston, A., Murphy, L.D., Thomson, N., Harris, D.E., Holden, M.T.G., Churcher, C.M., Bentley, S.D., Mungall, K.L., 2003. Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nature genetics 35, 32-40.

Passerini, D., Beltramo, C., Coddeville, M., Quentin, Y., Ritzenthaler, P., Daveran-Mingot, M.L., Le Bourgeois, P., 2010. Genes but not genomes reveal bacterial domestication of Lactococcus lactis. PLoS One 5, e15306.

Pearse, D.E., Crandall, K., 2004. Beyond Fst: Analysis of population genetic data for conservation. Conservation Genetics 5, 585-602.

Pearson, T., Okinaka, R.T., Foster, J.T., Keim, P., 2009. Phylogenetic understanding of clonal populations in an era of whole genome sequencing. Infect Genet Evol 9, 1010-1019.

Pérez-Losada, M., Browne, E.B., Madsen, A., Wirth, T., Viscidi, R.P., Crandall, K.A., 2006. Population genetics of microbial pathogens estimated from multilocus sequence typing (MLST) data. Infection, Genetics and Evolution 6, 97-112.

Pérez-Losada, M., Crandall, K.A., Bash, M.C., Dan, M., Zenilman, J., Viscidi, R.P., 2007a. Distinguishing importation from diversification of quinolone-resistant Neisseria gonorrhoeae by molecular evolutionary analysis. BMC Evol Biol 7, 84.

Pérez-Losada, M., Crandall, K.A., Zenilman, J., Viscidi, R.P., 2007b. Temporal trends in gonococcal population genetics in a high prevalence urban community. Infect Genet Evol 7, 271-278.

Pérez-Losada, M., Porter, M.L., Tazi, L., Crandall, K.A., 2007c. New methods for inferring population dynamics from microbial sequences. Infect Genet Evol 7, 24-43.

Pérez-Losada, M., Porter, M.L., Viscidi, R.P., Crandall, K.A., 2011. Multilocus sequence typing of pathogens, in: Tibayrenc, M. (Ed.), Genetics and evolution of infectious diseases. Elsevier Inc., pp. 503-521.

Petersen, A., Christensen, H., Kodjo, A., Weiser, G.C., Bisgaard, M., 2009. Development of a multilocus sequence typing (MLST) scheme for Mannheimia haemolytica and assessment of the population structure of isolates obtained from cattle and sheep. Infect Genet Evol 9, 626-632.

Pichon, B., Bennett, H.V., Efstratiou, A., Slack, M.P., George, R.C., 2009. Genetic characteristics of pneumococcal disease in elderly patients before introducing the pneumococcal conjugate vaccine. Epidemiol Infect 137, 1049-1056.

Pizza, M., Scarlato, V., Masignani, V., Giuliani, M.M., Arico, B., Comanducci, M., Jennings, G.T., Baldi, L., Bartolini, E., Capecchi, B., Galeotti, C.L., Luzzi, E., Manetti, R., Marchetti, E., Mora, M., Nuti, S., Ratti, G., Santini, L., Savino, S., Scarselli, M., Storni, E., Zuo, P., Broeker, M., Hundt, E., Knapp, B., Blair, E., Mason, T., Tettelin, H., Hood, D.W., Jeffries, A.C., Saunders, N.J., Granoff, D.M., Venter, J.C., Moxon, E.R., Grandi, G., Rappuoli, R., 2000. Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science 287, 1816-1820.

Plucinski, M.M., Starfield, R., Almeida, R.P., 2011. Inferring social network structure from bacterial sequence data. PLoS One 6, e22685.

Posada, D., Buckley, T.R., 2004. Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology 53, 793-808.

Posada, D., Crandall, K.A., 2001. Intraspecific gene genealogies: trees grafting into networks. Trends in Ecology and Evolution 16, 37-45.

Posada, D., Crandall, K.A., 2002. The effect of recombination on the accuracy of phylogeny estimation. Journal of Molecular Evolution 54, 396-402.

Posada, D., Crandall, K.A., Holmes, E.C., 2002. Recombination in evolutionary genomics. Annual Review of Genetics 36, 75-97.

Pourcel, C., Andre-Mazeaud, F., Neubauer, H., Ramisse, F., Vergnaud, G., 2004. Tandem repeats analysis for the high resolution phylogenetic analysis of Yersinia pestis. BMC microbiology 4, 22.

Racloz, V.N., Luiz, S.J., 2010. The elusive meningococcal meningitis serogroup: a systematic review of serogroup B epidemiology. BMC Infect Dis 10, 175.

Rambaut, A., Drummond, A.J., 2009. Tracer: MCMC trace analysis tool, 1.4.1 ed. Institute of Evolutionary Biology, Edinburgh. http://tree.bio.ed.ac.uk/software/tracer/.

Rasko, D.A., Worsham, P.L., Abshire, T.G., Stanley, S.T., Bannan, J.D., Wilson, M.R., Langham, R.J., Decker, R.S., Jiang, L., Read, T.D., Phillippy, A.M., Salzberg, S.L., Pop, M., Van Ert, M.N., Kenefic, L.J., Keim, P.S., Fraser-Liggett, C.M., Ravel, J., 2011. Bacillus anthracis comparative genome analysis in support of the Amerithrax investigation. Proc Natl Acad Sci U S A 108, 5027-5032.

Raymond, B., Wyres, K.L., Sheppard, S.K., Ellis, R.J., Bonsall, M.B., 2010. Environmental factors determining the epidemiology and population genetic structure of the Bacillus cereus group in the field. PLoS Pathog 6, e1000905.

Robinson, D.A., Monk, A.B., Cooper, J.E., Feil, E.J., Enright, M.C., 2005. Evolutionary genetics of the accessory gene regulator (agr) locus in Staphylococcus aureus. Journal of bacteriology 187, 8312-8321.

Ronquist, F., Huelsenbeck, J.P., 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572-1574.

Rosenberg, M.S., 2009. Sequence Alignment. University of California Press, Berkeley, CA, p. 337.

Russell, J.A., Goldman-Huertas, B., Moreau, C.S., Baldo, L., Stahlhut, J.K., Werren, J.H., Pierce, N.E., 2009. Specialization and geographic isolation among Wolbachia symbionts from ants and lycaenid butterflies. Evolution 63, 624-640.

Sahin, O., Fitzgerald, C., Stroika, S., Zhao, S., Sippy, R.J., Kwan, P., Plummer, P.J., Han, J., Yaeger, M.J., Zhang, Q., 2012. Molecular evidence for zoonotic transmission of an emergent, highly pathogenic Campylobacter jejuni clone in the United States. J Clin Microbiol 50, 680-687.

Sahl, J.W., Matalka, M.N., Rasko, D.A., 2012. Phylomark, a Tool To Identify Conserved

Phylogenetic Markers from Whole-Genome Alignments. Applied and Environmental

Microbiology 78, 4884-4892.

Sakwinska, O., Giddey, M., Moreillon, M., Morisset, D., Waldvogel, A., Moreillon, P.,

2011. Staphylococcus aureus host range and human-bovine host shift. Appl Environ

Microbiol 77, 5908-5915.

179 Salazar-Gonzalez, J.F., Bailes, E., Pham, K.T., Salazar, M.G., Guffey, M.B., Keele, B.F.,

Derdeyn, C.A., Farmer, P., Hunter, E., Allen, S., Manigart, O., Mulenga, J., Anderson,

J.A., Swanstrom, R., Haynes, B.F., Athreya, G.S., Korber, B.T., Sharp, P.M., Shaw,

G.M., Hahn, B.H., 2008. Deciphering human immunodeficiency virus type 1 transmission and early envelope diversification by single-genome amplification and sequencing. J Virol 82, 3952-3970.

Schatz, M.C., Delcher, A.L., Salzberg, S.L., 2010. Assembly of large genomes using second-generation sequencing. Genome Res 20, 1165-1173.

Schierup, M.H., Hein, J., 2000. Consequences of recombination on traditional phylogenetic analysis. Genetics 156, 879-891.

Schmidlin, M., Alt, M., Vogel, G., Voegeli, U., Brodmann, P., Bagutti, C., 2010.

Contaminations of laboratory surfaces with Staphylococcus aureus are affected by the carrier status of laboratory staff. J Appl Microbiol 109, 1284-1293.

Schouls, L.M., Van Der Ende, A., Damen, M., Van De Pol, I., 2006. Multiple-locus variable-number tandem repeat analysis of Neisseria meningitidis yields groupings similar to those obtained by multilocus sequence typing. Journal of clinical microbiology

44, 1509-1518.

Schulte, P.A., Perera, F., 1993. Molecular epidemiology: principles and practices.

Recherche 67, 02.

Schultsz, C., Jansen, E., Keijzers, W., Rothkamp, A., Duim, B., Wagenaar, J.A., van der

Ende, A., 2012. Differences in the population structure of invasive Streptococcus suis strains isolated from pigs and from humans in The Netherlands. PLoS One 7, e33854.

180 Schürch, A.C., van Soolingen, D., 2012. DNA fingerprinting of Mycobacterium tuberculosis: From phage typing to whole-genome sequencing. Infection, Genetics and

Evolution 12, 602-609.

Sheppard, S.K., Colles, F., Richardson, J., Cody, A.J., Elson, R., Lawson, A., Brick, G.,

Meldrum, R., Little, C.L., Owen, R.J., Maiden, M.C., McCarthy, N.D., 2010a. Host association of Campylobacter genotypes transcends geographic variation. Appl Environ

Microbiol 76, 5269-5277.

Sheppard, S.K., Dallas, J.F., Wilson, D.J., Strachan, N.J.C., McCarthy, N.D., Jolley,

K.A., Colles, F.M., Rotariu, O., Ogden, I.D., Forbes, K.J., Maiden, M.C.J., 2010b.

Evolution of an Agriculture-Associated Disease Causing Campylobacter coli Clade:

Evidence from National Surveillance Data in Scotland. PLoS ONE 5, e15708.

Simoes, R.R., Aires-de-Sousa, M., Conceicao, T., Antunes, F., da Costa, P.M., de

Lencastre, H., 2010. High prevalence of EMRSA-15 in Portuguese public buses: a worrisome finding. PLoS One 6, e17630.

Singh, P., Foley, S.L., Nayak, R., Kwon, Y.M., 2012. Multilocus sequence typing of

Salmonella strains by high-throughput sequencing of selectively amplified target genes.

Journal of microbiological methods 88, 127-133.

Smith, J.M., Smith, N.H., O'Rourke, M., Spratt, B.G., 1993. How clonal are bacteria?

Proceedings of the National Academy of Sciences 90, 4384.

Smith, M.A., Bertrand, C., Crosby, K., Eveleigh, E.S., Fernandez-Triana, J., Fisher, B.L.,

Gibbs, J., Hajibabaei, M., Hallwachs, W., Hind, K., Hrcek, J., Huang, D.W., Janda, M.,

Janzen, D.H., Li, Y., Miller, S.E., Packer, L., Quicke, D., Ratnasingham, S., Rodriguez,

J., Rougerie, R., Shaw, M.R., Sheffield, C., Stahlhut, J.K., Steinke, D., Whitfield, J.,

181 Wood, M., Zhou, X., 2012. Wolbachia and DNA barcoding insects: patterns, potential, and problems. PLoS One 7, e36514.

Soge, O.O., Meschke, J.S., No, D.B., Roberts, M.C., 2009. Characterization of methicillin-resistant Staphylococcus aureus and methicillin-resistant coagulase-negative

Staphylococcus spp. isolated from US West Coast public marine beaches. J Antimicrob

Chemother 64, 1148-1155.

Spratt, B.G., 1999. Multilocus sequence typing: molecular typing of bacterial pathogens in an era of rapid DNA sequencing and the internet. Current opinion in microbiology 2,

312-316.

Springman, A.C., Lacher, D.W., Wu, G., Milton, N., Whittam, T.S., Davies, H.D.,

Manning, S.D., 2009. Selection, recombination, and virulence gene diversity among group B streptococcal genotypes. J Bacteriol 191, 5419-5427.

Sproston, E.L., Ogden, I.D., MacRae, M., Dallas, J.F., Sheppard, S.K., Cody, A.J.,

Colles, F.M., Wilson, M.J., Forbes, K.J., Strachan, N.J., 2011. Temporal variation and host association in the Campylobacter population in a longitudinal ruminant farm study.

Appl Environ Microbiol 77, 6579-6586.

Sreevatsan, S., Pan, X., Stockbauer, K.E., Connell, N.D., Kreiswirth, B.N., Whittam,

T.S., Musser, J.M., 1997. Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proceedings of the National Academy of Sciences 94, 9869.

Stabler, R.A., Dawson, L.F., Valiente, E., Cairns, M.D., Martin, M.J., Donahue, E.H.,

Riley, T.V., Songer, J.G., Kuijper, E.J., Dingle, K.E., Wren, B.W., 2012. Macro and

182 micro diversity of Clostridium difficile isolates from diverse sources and geographical locations. PLoS One 7, e31559.

Stackebrandt, E., Frederiksen, W., Garrity, G.M., Grimont, P.A.D., Peter, K., Maiden,

M.C.J., Nesme, X., Rossell, R., Swings, J., Tr, H.G., 2002. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. International journal of systematic and evolutionary microbiology 52, 1043-1047.

Stamatakis, A., 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-2690.

Stefanelli, P., Fazio, C., Sofia, T., Neri, A., Mastrantonio, P., 2009. Serogroup C meningococci in Italy in the era of conjugate menC vaccination. BMC Infect Dis 9, 135.

Stensvold, C.R., Alfellani, M., Clark, C.G., 2012. Levels of genetic diversity vary dramatically between Blastocystis subtypes. Infect Genet Evol 12, 263-273.

Stumpf, M.P.H., McVean, G.A.T., 2003. Estimating recombination rates from population-genetic data. Nat Rev Genet 4, 959-968.

Tanigawa, K., Watanabe, K., 2011. Multilocus sequence typing reveals a novel subspeciation of Lactobacillus delbrueckii. Microbiology 157, 727-738.

Tay, C.Y., Mitchell, H., Dong, Q., Goh, K.L., Dawes, I.W., Lan, R., 2009. Population structure of Helicobacter pylori among ethnic groups in Malaysia: recent acquisition of the bacterium by the Malay population. BMC Microbiol 9, 126.

Taylor, C.F., 2009. Mutation scanning using high-resolution melting. Biochemical

Society Transactions 37, 433.

183 Tazi, L., Pérez-Losada, M., Gu, W., Yang, Y., Xue, L., Crandall, K.A., Viscidi, R.P.,

2010. Population dynamics of Neisseria gonorrhoeae in Shanghai, China: A comparative study. BMC Infect Dis 10, 13.

Templeton, A.R., Crandall, K.A., Sing, C.F., 1992. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III. Cladogram estimation. Genetics 132, 619-633.

Top, J., Schouls, L.M., Bonten, M.J.M., Willems, R.J.L., 2004. Multiple-locus variable- number tandem repeat analysis, a novel typing scheme to study the genetic relatedness and epidemiology of Enterococcus faecium isolates. Journal of clinical microbiology 42,

4503-4511.

Torpdahl, M., Skov, M.N., Sandvang, D., Baggesen, D.L., 2005. Genotypic characterization of Salmonella by multilocus sequence typing, pulsed-field gel electrophoresis and amplified fragment length polymorphism. Journal of microbiological methods 63, 173-184.

Trotter, C.L., Chandra, M., Cano, R., Larrauri, A., Ramsay, M.E., Brehony, C., Jolley,

K.A., Maiden, M.C., Heuberger, S., Frosch, M., 2007. A surveillance network for meningococcal disease in Europe. FEMS Microbiol Rev 31, 27-36.

Unemo, M., Dillon, J.A.R., 2011. Review and International Recommendation of Methods for Typing Neisseria gonorrhoeae Isolates and Their Implications for Improved

Knowledge of Gonococcal Epidemiology, Treatment, and Biology. Clinical microbiology reviews 24, 447-458.

Urwin, R., Maiden, M.C.J., 2003. Multi-locus sequence typing: a tool for global epidemiology. Trends in microbiology 11, 479-487.

184 Urwin, R., Russell, J.E., Thompson, E.A., Holmes, E.C., Feavers, I.M., Maiden, M.C.,

2004. Distribution of surface protein variants among hyperinvasive meningococci: implications for vaccine design. Infect Immun 72, 5955-5962. van Berkum, P., Elia, P., Song, Q., Eardly, B.D., 2011. Development and Application of a Multilocus Sequence Analysis Method for the Identification of Genotypes Within

Genus Bradyrhizobium and for Establishing Nodule Occupancy of Soybean (Glycine max

L. Merr). Molecular Plant-Microbe Interactions 25, 321-330.

Vanderkooi, O.G., Church, D.L., MacDonald, J., Zucol, F., Kellner, J.D., 2011.

Community-based outbreaks in vulnerable populations of invasive infections caused by

Streptococcus pneumoniae serotypes 5 and 8 in Calgary, Canada. PLoS One 6, e28547.

Vanlaere, E., Baldwin, A., Gevers, D., Henry, D., De Brandt, E., LiPuma, J.J.,

Mahenthiralingam, E., Speert, D.P., Dowson, C., Vandamme, P., 2009. Taxon K, a complex within the Burkholderia cepacia complex, comprises at least two novel species,

Burkholderia contaminans sp. nov. and Burkholderia lata sp. nov. Int J Syst Evol

Microbiol 59, 102-111.

Vanlaere, E., Lipuma, J.J., Baldwin, A., Henry, D., De Brandt, E., Mahenthiralingam, E.,

Speert, D., Dowson, C., Vandamme, P., 2008. Burkholderia latens sp. nov., Burkholderia diffusa sp. nov., Burkholderia arboris sp. nov., Burkholderia seminalis sp. nov. and

Burkholderia metallica sp. nov., novel species within the Burkholderia cepacia complex.

Int J Syst Evol Microbiol 58, 1580-1590.

Vergnaud, G., Pourcel, C., 2006. Multiple locus VNTR (variable number of tandem repeat) analysis. Molecular identification, systematics, and population structure of prokaryotes. Springer-Verlag, Berlin, Germany, 83-104.

185 Vogel, U., Szczepanowski, R., Claus, H., Junemann, S., Prior, K., Harmsen, D., 2012. Ion torrent personal genome machine sequencing for genomic typing of Neisseria meningitidis for rapid determination of multiple layers of typing information. J Clin

Microbiol 50, 1889-1894.

Vos, M., 2011. A species concept for bacteria based on adaptive divergence. Trends

Microbiol 19, 1-7.

Vos, M., Didelot, X., 2009. A comparison of homologous recombination rates in bacteria and archaea. ISME J 3, 199-208.

Walker, A.S., Eyre, D.W., Wyllie, D.H., Dingle, K.E., Harding, R.M., O'Connor, L.,

Griffiths, D., Vaughan, A., Finney, J., Wilcox, M.H., Crook, D.W., Peto, T.E., 2012.

Characterisation of Clostridium difficile hospital ward-based transmission using extensive epidemiological data and molecular typing. PLoS Med 9, e1001172.

Walther, B., Hermes, J., Cuny, C., Wieler, L.H., Vincze, S., Abou Elnaga, Y., Stamm, I.,

Kopp, P.A., Kohn, B., Witte, W., Jansen, A., Conraths, F.J., Semmler, T., Eckmanns, T.,

Lubke-Becker, A., 2012. Sharing more than friendship--nasal colonization with coagulase-positive staphylococci (CPS) and co-habitation aspects of dogs and their owners. PLoS One 7, e35197.

Wang, Y., Rannala, B., 2008. Bayesian inference of fine-scale recombination rates using population genomic data. Philos T R Soc B 363, 3921-3930.

Wang, Y., Rannala, B., 2009. Population genomic inference of recombination rates and hotspots. P Natl Acad Sci USA 106, 6215-6219.

186 Waples, R.S., Gaggiotti, O., 2006. INVITED REVIEW: What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular Ecology 15, 1419-1439.

Weinert, L.A., Welch, J.J., Suchard, M.A., Lemey, P., Rambaut, A., Fitzgerald, J.R.,

2012. Molecular dating of human-to-bovid host jumps by Staphylococcus aureus reveals an association with the spread of domestication. Biology Letters.

Wilson, D.J., McVean, G., 2006. Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172, 1411-1425.

Woolley, S.M., Posada, D., Crandall, K.A., 2008. A comparison of phylogenetic network methods using computer simulation. PLoS ONE 3, 1-12.

Xie, W., Lewis, P.O., Fan, Y., Kuo, L., Chen, M.-H., 2011. Improving Marginal

Likelihood Estimation for Bayesian Phylogenetic Model Selection. Systematic Biology

60, 150-160.

Yan, Y., Cui, Y., Han, H., Xiao, X., Wong, H.C., Tan, Y., Guo, Z., Liu, X., Yang, R.,

Zhou, D., 2011. Extended MLST-based population genetics and phylogeny of Vibrio parahaemolyticus with high levels of recombination. Int J Food Microbiol 145, 106-112.

Yang, Z., 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol

24, 1586-1591.

Yang, Z., Wong, W.S., Nielsen, R., 2005. Bayes empirical bayes inference of amino acid sites under positive selection. Mol Biol Evol 22, 1107-1118.

Yeo, M., Mauricio, I.L., Messenger, L.A., Lewis, M.D., Llewellyn, M.S., Acosta, N.,

Bhattacharyya, T., Diosque, P., Carrasco, H.J., Miles, M.A., 2011. Multilocus sequence

187 typing (MLST) for lineage assignment and high resolution diversity studies in

Trypanosoma cruzi. PLoS Negl Trop Dis 5, e1049.

Zeigler, D.R., 2003. Gene sequences useful for predicting relatedness of whole genomes in bacteria. International journal of systematic and evolutionary microbiology 53, 1893-

1900.

Zwickl, D.J., 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion, Department of

Biological Sciences. The University of Texas at Austin, Austin, TX.


Chapter 4:

A Survey of Genomic Tools for Molecular Epidemiology

Abstract

The post-genomic era is characterized by the direct acquisition and analysis of genomic data, with many applications including enhancing the understanding of microbial epidemiology and pathology. However, there are a number of molecular approaches to survey pathogen diversity, and the impact of these different approaches on parameter estimation and inference is not entirely clear. We sequenced whole genomes of the bacterial pathogens Burkholderia pseudomallei, Yersinia pestis, and Brucella spp. (60 new genomes), and combined them with 55 genomes from GenBank to address how different molecular survey approaches (whole genomes, SNPs, and MLST) impact downstream inferences on molecular evolutionary parameters, evolutionary relationships, and trait character associations. We selected isolates for sequencing to represent variability in time of isolation, geographic origin, and host range. We found that substitution rate estimates vary widely among approaches, and that SNP and genomic datasets yielded different but strongly supported phylogenies. MLST yielded poorly supported phylogenies, especially in our low-diversity dataset, i.e., Y. pestis. Trait associations showed that B. pseudomallei and Y. pestis phylogenies are significantly associated with geography, irrespective of the molecular survey approach used, while the Brucella spp. phylogeny appears to be strongly associated with geography and host origin. We contrast inferences made among monomorphic (clonal) and non-monomorphic bacteria, and between intra- and interspecific datasets. We also discuss our results in light of the underlying assumptions of the different approaches.

Introduction

Genomic data coupled with phylogenetic methodology have enhanced the ability to track infectious disease epidemics through space and time (Baker et al. 2010). For example, studies have tracked and characterized epidemics at local, regional, global, and even historical scales, respectively investigating multidrug-resistant Staphylococcus aureus in hospital settings (Kos et al. 2012; Köser et al. 2012), inferring continental origins of food pathogens (Goss et al. 2014), explaining seasonal influenza dynamics (Lemey et al. 2014), and characterizing ancient oral pathogens (Warinner et al. 2014). Such studies provide valuable information regarding migration rates, directionalities of spread, unique variants, genetic diversity, and drug resistance, and inform policy-makers about infection patterns associated with human activities (Bos et al. 2011; Morelli et al. 2010; Zhang et al. 2010). Accordingly, applications of analytical tools to large datasets are abundant in clinical pathology, bioforensics, biosurveillance, and molecular epidemiology (Reimer et al. 2011; Wilson et al. 2013).

Whole-genome sequencing (WGS) has become an affordable approach for such studies (Bertelli & Greub 2013; Chen et al. 2013; Cornejo et al. 2013; Croucher et al. 2013; Pérez-Lago et al. 2013; Sheppard et al. 2013; Wielgoss et al. 2013). New technologies make it possible to compile datasets that were inconceivable twenty years ago (Chewapreecha et al. 2014; Marttinen et al. 2012; Nasser et al. 2014; Sheppard et al. 2013), which, in turn, is prompting scientists to ask new questions regarding pathogen distribution, diversity, origin, and phenotype (Butler et al. 2013; Castillo-Ramirez et al. 2012; Grad & Waldor 2013; Holt et al. 2012; Spoor et al. 2013). To date, massive amounts of data have accumulated in publicly available databases (SRA; EBI; GOLD; NCBI), yet concerns have been raised that data analysis, and not data generation, has become a rate-limiting step in molecular epidemiology and pathology (Nielsen et al. 2010; Pybus et al. 2013).

Because there are now a variety of molecular survey approaches, namely whole-genome sequencing (WGS), multi-locus sequence typing (MLST), and single nucleotide polymorphism (SNP) data, each with different costs and resolving power, we explored the impact of these approaches on inferences of population dynamics, transmission patterns, and parameter estimation. For instance, tracking the origin of bioterrorism agents depends on identifying diagnostic mutations, as in the anthrax attacks of 2001 (Read et al. 2002), and on understanding the extent to which sampling strategy and choice of molecular survey approach affect temporal and spatial inferences.

Here, we set out to investigate how molecular survey approaches compare, using three select agents as models, namely Yersinia pestis (causative agent of plague), Burkholderia pseudomallei (causative agent of melioidosis), and Brucella spp. (causative agents of brucellosis, a febrile disease). These bacteria are relevant from health and biosecurity perspectives, and a sizable amount of genomic and supporting information (date of collection, geographic location, and host) exists for them. They also allow for interesting contrasts, including intraspecific comparisons (Y. pestis vs. B. pseudomallei), the former a monomorphic (clonal) bacterium and the latter a polymorphic one, as well as interspecific comparisons (Y. pestis and B. pseudomallei vs. Brucella spp.).

We present and analyze new draft genomic sequences for 20 Brucella spp., 20 Y. pestis, and 20 B. pseudomallei isolates, which we combine with publicly available genomes (totaling 115 genomes) to compare inferences on evolutionary relationships, dates and rates, and geographic and host structure. We selected 20 isolates of each group to represent a diversity of isolation dates, geographic locations, and pathogen hosts, and applied different molecular survey approaches (WGS, SNPs, MLST) to test whether these approaches recover equivalent evolutionary relationships, evolutionary rates, and divergence dates, and whether phylogenies inferred with these approaches represent equivalent geographic and host structures.

We hypothesized that phylogenies from MLST data would be quantitatively less resolved than phylogenies from SNPs or genomes, especially for low-diversity bacterial datasets. Likewise, we hypothesized that substitution rates and dates would vary widely among datasets according to their genetic diversity estimates, and that inferences from MLST would vary significantly from SNP and genomic inferences. In addition, we contrast inferences made among monomorphic and non-monomorphic bacteria, and between intra- and interspecific datasets, in light of the underlying assumptions of the different approaches.

Methods

Strain selection and sequencing

DNA was isolated from 20 strains each of Burkholderia pseudomallei, Yersinia pestis, and Brucella spp. from the Brigham Young University Select Agent Archive. Samples were selected for sequencing to provide a range of 1) times of isolation, 2) geographic origins, and 3) host associations (Table 1). DNA isolation followed standard protocols for select agents and was conducted at the Brigham Young University BSL-3 facility. All DNA preparations received a Certification of Sterility before being prepared for sequencing: 10% of the final DNA preparation from each isolate was plated on appropriate agar and, after a minimum of five days of incubation at 37 °C, the samples showed no growth, indicating that they contained no viable organisms.

Table 1: Summary of genomes sequenced and collected in this study. Metadata on strain source, host, location, and date of collection are also provided when available.

NCBI Species Strain Source Host Location Date of Accession Collection Number SRX286342 Burkholderia 5 Public Health Sheep Australia 1949 pseudomallei Laboratory Service, London SRX286347 Burkholderia 6 Public Health Human Bangladesh 1960 pseudomallei Laboratory Service, London SRX286346 Burkholderia 9 Public Health Human Pakistan 1988 pseudomallei Laboratory Service, London SRX286345 Burkholderia 18 Public Health Monkey Indonesia 1990 pseudomallei Laboratory Service, London SRX286357 Burkholderia 24 Public Health Horse France 1976 pseudomallei Laboratory Service, London SRX286354 Burkholderia 25 Public Health Soil Madagasca 1977 pseudomallei Laboratory r Service, London SRX286353 Burkholderia 31 Public Health Water Kenya 1992 pseudomallei Laboratory Drain Service, London SRX286352 Burkholderia 33 Public Health Manure France 1976 pseudomallei Laboratory Service, London SRX286350 Burkholderia 35 Public Health Human Vietnam 1963 pseudomallei Laboratory Service, London SRX286348 Burkholderia 68 Public Health Human Fiji 1992 pseudomallei Laboratory Service, London SRX286359 Burkholderia 91 Public Health Sheep Australia 1984 pseudomallei Laboratory Service, London SRX286361 Burkholderia 104 Public Health Goat Australia 1990 pseudomallei Laboratory Service, London

194 SRX286363 Burkholderia 208 Public Health Human Ecuador 1990 pseudomallei Laboratory Service, London SRX286364 Burkholderia 4075 Public Health Human Holland 1999 pseudomallei Laboratory Service, London SRX286418 Burkholderia Darwi Royal Darwin Human Australia 2003 pseudomallei n-035 Hospital SRX286420 Burkholderia Darwi Royal Darwin Dog Australia 1992 pseudomallei n-051 Hospital SRX286421 Burkholderia Darwi Royal Darwin Pig Australia 1992 pseudomallei n-060 Hospital SRX286422 Burkholderia Darwi Royal Darwin Bird Australia 1994 pseudomallei n-077 Hospital SRX286423 Burkholderia Darwi Royal Darwin Soil Australia 2006 pseudomallei n-150 Hospital SRX286344 Burkholderia 80800 Utah Department Human USA 2008 pseudomallei 117 of Health NC_017832. Burkholderia 1026b [email protected] Human Thailand 1993 1 pseudomallei ngton.edu NC_017831. 1 NC_009078. Burkholderia 1106a JCVI Human Thailand 1993 1 pseudomallei NC_009076. 1 NC_012695. Burkholderia MSH LANL DOE JGI Human Australia 1996 1 pseudomallei R346 NC_006351. Burkholderia k9624 Sanger Institute Human Thailand 1996 1 pseudomallei 3 NC_006350. 1 NC_018529. Burkholderia BPC0 Third military Human China 2008 1 pseudomallei 06 technical NC_018527. university 1 NZ_CM000 Burkholderia 1106b JCVI Human Thailand 1996 774.1 pseudomallei NZ_CM000 775.1 NZ_CM000 Burkholderia 1710a JCVI Human Thailand 1996 833.1 pseudomallei NZ_CM000 832.1 NC_007435. Burkholderia 1710b JCVI Human Thailand 1999 1 pseudomallei NC_007434. 1

195 NC_009074. Burkholderia 668 JCVI Human Australia 1995 1 pseudomallei NC_009075. 1 NZ_CM001 Burkholderia Bp22 GIS Human Singapore 1989 156.1 pseudomallei NZ_CM001 157.1 NC_007651 Burkholderia E264 JCVI soil Thailand 1994 NC_007650 thailandensis SRX278648 Brucella 1004, National Animal Bovine MO, USA 1990 abortus Strain Disease Center 2032 SRX278790 Brucella 1007, National Animal Bovine FL, USA 1990 abortus Strain Disease Center 2045 SRX278791 Brucella 1019, National Animal Bovine TN, USA 1990 abortus Strain Disease Center 2038 SRX278792 Brucella 1022, National Animal Bovine GA, USA 1990 abortus Strain Disease Center 2073 SRX278793 Brucella 1146, National Animal Elk MT, USA 1992 abortus Strain Disease Center 8-953 SRX278794 Brucella 1668, National Animal Elk WY, USA 2000 abortus Strain Disease Center 00- 666 SRX278891 Brucella YELL INEEL Bison WY, USA 1999 abortus -99- (amnioti 067 c fluid) SRX282032 Brucella 1614, National Animal Bovine TX, USA 2000 abortus Strain Disease Center Weinh eimer 4 SRX282039 Brucella canis 1107, National Animal Canine MO, USA 1990 Strain Disease Center 1-107 SRX282040 Brucella 1253, National Animal Caprine unknown 1994 melitensis Strain Disease Center Ether, L657 SRX282041 Brucella BA New Mexico human NM, USA 2003 melitensis 4837 Department of Health SRX282042 Brucella 70000 Utah Department blood, UT, USA 2000 melitensis 565 of Health human

196 SRX282044 Brucella 80600 Utah Department blood, UT, USA 2006 melitensis 020 of Health human SRX282045 Brucella 80800 Utah Department human CA, USA 2008 melitensis 076 of Health SRX282046 Brucella 1156, National Animal desert unknown, 1992 neotomae Strain Disease Center wood USA 5K33, rat ATCC #2345 9 SRX282047 Brucella ovis 1117, National Animal Ovine GA, USA 1991 Strain Disease Center 1-507 SRX282048 Brucella ovis 1698, National Animal Ovine Ft. Collins, 2001 Strain Disease Center (semen) CO, USA 13551 -2114; 1985: Dhyat t SRX282050 Brucella 70100 Utah Department blood, USA- UT 2001 species 304 of Health human SRX282053 Brucella suis 1103, National Animal Porcine SC, USA 1990 Strain Disease Center 2483 SRX282057 Brucella suis 1108, National Animal Porcine NJ, USA 1990 Strain Disease Center 1-138 NC_016777. Brucella A133 Macrogen bovine Korea unknown 1 abortus 34 NC_016795. 1 NC_006932. Brucella bv 1, USDA bovine WY, USA unknown 1 abortus 9-941 NC_006933. 1 NC_010740. Brucella S19 Crasta OR bovine unknown, 1923 1 abortus USA NC_010742. 1 NC_010103. Brucella canis ATCC DOE JGI Dog unknown unknown 1 23365 NC_010104. 1 NC_016796. Brucella canis HSK National dog South unknown 1 A521 Veterinary Korea NC_016778. 41 Research and 1 Quarantine

197 NC_012442. Brucella ATCC LANL human India 1963 1 melitensis 23457 NC_012441. 1 NC_017244. Brucella M28 Chinese National sheep China 1955 1 melitensis Human Genome NC_017245. Center at Shanghai 1 NC_003317. Brucella bv 1, Integrated goat unknown, unknown 1 melitensis 16M Genomics Inc USA NC_003318. 1 NC_007618. Brucella bv. 1 Lawrence Standar unknown unknown 1 melitensis Abort Livermore d NC_007624. us National Lab laborato 1 2308 ry strain NC_017246. Brucella M5- Chinese National Standar unknown unknown 1 melitensis 90 Human Genome d NC_017247. Center at Shanghai laborato 1 ry strain NC_017248. Brucella bv. 3 China Agricultural bovine Inner 2007 1 melitensis NI Univ Mongolia, NC_017283. China 1 CP001578.1 Brucella CCM Sudic S vole Czech 2000 CP001579.1 microti 4915 Republic NC_009505. Brucella ovis ATCC J. Craig Venter sheep Australia 1960 1 25840 Institute NC_009504. 1 NC_015858. Brucella B2/94 Zygmunt,M.S. seal UK 1994 1 pinnipedialis NC_015857. 1 NC_016775. Brucella suis VBI2 Harold R. Garner Bovine, TX, USA unknown 1 2 milk NC_016797. 1 NC_004311. Brucella suis bv 1, J. Craig Venter pig unknown, 1950 2 1330 Institute USA NC_004310. 3 NC_010167. Brucella suis ATCC LANL DOE JGI hare UK 1951 1 23445 NC_010169. 1 NC_009667. Ochrobactrum DOE JGI Arsenic Australia 1988 1 anthropi ATCC al cattle- NC_009668. 49188 dipping 1 fluid

198 SRX282065 Yersinia pestis 4954 New Mexico Human NM, USA 1987 Department of Health SRX282089 Yersinia pestis 1901b New Mexico Human NM, USA 1983 Department of Health SRX282090 Yersinia pestis Java Michigan State unknow Far East unknown (D88) University n SRX282091 Yersinia pestis Kimb Michigan State unknow Near East unknown erley University n (D17) SRX282092 Yersinia pestis KUM Michigan State unknow Manchuria, unknown A University n China (D11) SRX282093 Yersinia pestis TS Michigan State unknow Far East unknown (D5) University n SRX282094 Yersinia pestis 86071 New Mexico Dog NM, USA unknown 16 Department of Health SRX282095 Yersinia pestis 1866 New Mexico Squirrel NM, USA unknown Department of Health SRX282096 Yersina pestis 4139 New Mexico cat NM, USA 1995 Department of Health SRX286281 Yersinia pestis 4412 New Mexico Human NM, USA 1991 Department of Health SRX286283 Yersinia pestis 2965 New Mexico Human NM, USA 1995 Department of Health SRX286290 Yersinia pestis 2055 New Mexico Human NM, USA 1998 Department of Health SRX286302 Yersinia pestis 2106 New Mexico Human NM, USA 2001 Department of Health SRX286303 Yersinia pestis 2772 New Mexico Cat NM, USA 1984 Department of Health SRX286304 Yersinia pestis 3357 New Mexico mountai NM, USA 1999 Department of n lion Health SRX286305 Yersinia pestis AS New Mexico Rodent NM, USA 2004 2509 Department of Health SRX286306 Yersinia pestis AS New Mexico rabbit, United 2009 20090 Department of liver/spl States, 0596 Health een Santa Fe, NM

199 SRX286307 Yersinia pestis V- New Mexico Llama Las Vegas, unknown 6486 Department of NM, USA Health SRX286340 Yersinia pestis KIM Michigan State Human Iran/Kurdis 1968 (D27) University tan SRX286341 Yersinia pestis AS20 New Mexico liver/spl Santa Fe, 2009 09015 Department of een, NM, USA 09 Health prairie dog NC_017168. Yersinia pestis A112 LANL ground California 1939 1 2 squirrel NC_010159. Yersinia pestis Angol JCVI Human Angola unknown 1 a NC_008150. Yersinia pestis Antiq DOE JGI Human Congo 1965 1 ua PRJNA544 Yersinia pestis B4200 JCVI Marmot China 2003 73 3004 a baibacin a PRJNA545 Yersinia pestis CA88 DOE JGI Human California 1988 63 -4125 NC_003143. Yersinia pestis CO92 Sanger Institute Human/ Colorado 1992 1 cat NC_017154. Yersinia pestis D106 Chinese Center for Apodem Yulong 2006 1 004 Disease Control us County, and Prevention chevrier China i NC_017160. Yersinia pestis D182 Chinese Center for Apodem Yunnan, 1982 1 038 Disease Control us China and Prevention chevrier i PRJNA544 Yersinia pestis E1979 JCVI Eotheno China 1979 71 001 mys miletus PRJNA544 Yersinia pestis F1991 JCVI Flavus China 1991 69 016 rattivecu s PRJNA543 Yersinia pestis FV-1 The Translational Prairy Arizona 2001 99 Genomics dog Research Institute PRJNA553 Yersinia pestis India DOE JGI Human India unknown 39 195 PRJNA543 Yersinia pestis IP275 The Institute for Human Madagasca 1995 83 Genomic Research r NC_009708. Yersinia IP317 JCVI Human Russia 1966 1 pseudotubercu 58 losis PRJNA544 Yersinia pestis K197 JCVI Marmot China 1973 75 3002 a himalay

200 a PRJNA424 Yersinia pestis KIM JCVI Human Iran/Kurdis 1968 95 D27 tan NC_004088. Yersinia pestis KIM1 Genome Center of Human Iran/Kurdis 1968 1 0+ Wisconsin tan NC_017265. Yersinia pestis Medie Virginia Human China 1940 1 valis Bioinformatics str. Institute Harbi n 35 PRJNA544 Yersinia pestis MG05 JCVI Human Madagasca 2005 77 -1020 r NC_005810. Yersinia pestis Micro Academy of Microtu China 1970 1 tus Military Medical s brandti 91001 Sciences, The Institute of Microbiology and Epidemiology, China NC_008149. Yersinia pestis Nepal Genome Center of Human/ Nepal 1967 1 516 Wisconsin soil PRJNA553 Yersinia pestis Pestoi DOE JGI Human FSU 1960 43 des A PRJNA586 Yersinia pestis Pestoi DOE JGI Human FSU unknown 19 des F PRJNA553 Yersinia pestis PEXU ERIC-BRC Rodent Brazil 1966 41 2 PRJNA544 Yersinia pestis UG05 JCVI Human Uganda 2005 79 -045 PRJNA473 Yersinia pestis Z1760 CCDC Marmot Tibet 1976 17 03 a himalay ana


The DNA samples were prepared for multiplexed (single-end, 82 cycles) sequencing using an Illumina GAIIx genome analyzer (Illumina Inc., San Diego, CA). For each isolate, genomic libraries were prepared using the Nextera DNA Sample Prep Kit. Post-library quality control and quantification were done using BioAnalyzer 2100 high-sensitivity chips and KAPA SYBR FAST Universal 2X qPCR Master Mix. Post-processing of reads was performed with RTA/SCS v1.9.35.0 and CASAVA 1.8.0. Reads were trimmed back to the Q30 level using CLCBio's quality_trim program, and CutAdapt v0.95 was used to excise adapter and transposon contamination.

All sequencing run data and metadata were deposited in the Sequence Read Archive (SRA) under three projects, SRP022877, SRP022862, and SRP023117, for Y. pestis, Brucella spp., and B. pseudomallei, respectively.

Dataset Collection

Short reads were quality filtered (average read quality > 30 Phred) and mapped against reference genomes employing the Burrows-Wheeler Transform algorithm, as implemented in SOAP (Li et al. 2008). The resulting SAM/BAM files were filtered for duplicate reads that might have arisen by PCR, and consensus sequences were called in Geneious 6.1.6 (Kearse et al. 2012; Li et al. 2009). We additionally retrieved full genomes, along with host, collection date, and country of origin metadata, for B. pseudomallei (11), Brucella spp. (18), and Y. pestis (26) from the GenBank, GOLD, IMG, Patric, Broad Institute, and Pathema databases and resources, bringing the total to 115 genomes (Table 1; geographic distribution in Fig. 1) (Benson et al. 2010; Brinkac et al. 2010; Gillespie et al. 2011; Liolios et al. 2008; Markowitz et al. 2012). From the assembled genomes we derived all datasets as described below.
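The read-level quality filter described above (average read quality > 30 Phred) can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names are hypothetical, the FASTQ records are toy data, and a Phred+33 quality encoding is assumed.

```python
from statistics import mean

def mean_phred(quality_line, offset=33):
    """Average Phred score of one FASTQ quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality_line)

def filter_reads(fastq_lines, min_mean_quality=30):
    """Yield (header, sequence, quality) for reads whose mean quality exceeds the cutoff."""
    records = zip(*[iter(fastq_lines)] * 4)  # a FASTQ record spans 4 lines
    for header, seq, _plus, qual in records:
        if mean_phred(qual) > min_mean_quality:
            yield header, seq, qual

# Toy reads (hypothetical): 'I' encodes Phred 40, '5' encodes Phred 20.
reads = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "5555",
]
kept = list(filter_reads(reads))
```

Only the first toy read passes the cutoff; real data would be streamed from a file rather than held in a list.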

Multi-locus sequence type markers for B. pseudomallei, namely ace, gltB, gmhD, lepA, lipA, narK, and ndh, were retrieved from the PubMLST database (http://bpseudomallei.mlst.net). For Brucella spp., we used the markers of Whatmore et al. (2007), i.e., gap, aroA, glk, dnaK, gyrB, trpE, cobQ, omp25, and int-hyp. Likewise, for Y. pestis we obtained the markers aarF, dfp, galR, glnS, hemA, rfaE, and speA from PubMLST (Yersinia spp.; http://pubmlst.org/yersinia/). In addition, we obtained markers from Achtman et al. (1999) (dmsA, glnA, manB, thrA, tmk, and trpE) and from Revazishvili (2008) (16S rDNA, gyrB, yhsp, psaA and recA). We created a custom BLAST (Altschul et al. 1990) database with our new genome sequences combined with the publicly available genomes for all three species groups.

We created datasets based on SNPs by searching for k-mers of length 25 (SNP at position 13) among unaligned genomes, without conditioning on a reference, as implemented in kSNP 2.0 (Gardner & Hall 2013; Gardner & Slezak 2010). We kept all non-homoplastic SNPs that were shared among all taxa in a given dataset (core SNP subset) and used them to build matrices for downstream analyses.
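The idea of reference-free SNP discovery with k = 25 and the variable base at position 13 can be sketched as follows. This is a simplified illustration, not the kSNP implementation: it ignores reverse complements and homoplasy filtering, and the function names and toy genomes are hypothetical.

```python
from collections import defaultdict

K = 25          # k-mer length stated in the text
MID = K // 2    # 0-based index 12, i.e. position 13 of the k-mer

def kmers(seq, k=K):
    """Yield every k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def candidate_core_snps(genomes):
    """Return {flanks: {genome: central base}} for loci found in all genomes with >1 allele."""
    contexts = defaultdict(dict)
    for name, seq in genomes.items():
        for kmer in kmers(seq):
            flank = (kmer[:MID], kmer[MID + 1:])
            contexts[flank].setdefault(name, set()).add(kmer[MID])
    snps = {}
    for flank, by_genome in contexts.items():
        # core locus: present in every genome, with one unambiguous central base in each
        in_all = len(by_genome) == len(genomes)
        unambiguous = all(len(bases) == 1 for bases in by_genome.values())
        if in_all and unambiguous:
            alleles = {name: next(iter(bases)) for name, bases in by_genome.items()}
            if len(set(alleles.values())) > 1:
                snps[flank] = alleles
    return snps

# Toy genomes (hypothetical): identical 12-base flanks around one variable base.
left, right = "ACGTTGCAAGCT", "TTGACCGGATCA"
genomes = {"g1": left + "A" + right, "g2": left + "G" + right, "g3": left + "A" + right}
snps = candidate_core_snps(genomes)
```

Because no reference genome enters the comparison, the set of loci does not depend on which genome is chosen as a reference.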

Figure 1: Geographic distribution of isolates used in this study (map legend: Burkholderia pseudomallei, Brucella spp., Yersinia pestis)

We created full genome datasets by aligning complete genome sequences in Mauve 2.3.1 (Darling et al. 2010) and then using the resulting multiple sequence alignment, directly and/or in reduced form, for phylogenetic inference. The reduced full genome dataset consisted of Locally Collinear Blocks (LCBs) detected by Mauve that were longer than 10 Kb and present across all taxa in a given dataset, randomly subsampled up to a total of 300 Kb.
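The reduction step can be pictured with a short sketch under stated assumptions: blocks are represented as hypothetical (id, length, number of taxa present) tuples, whole blocks are added in random order, and any block that would exceed the 300 Kb cap is skipped. The text does not specify how the cap was enforced, so this rule is an assumption.

```python
import random

def subsample_blocks(blocks, n_taxa, min_len=10_000, cap=300_000, seed=0):
    """Randomly pick core blocks longer than min_len until the cap would be exceeded.

    blocks: list of (block_id, alignment_length, n_taxa_present) tuples (hypothetical layout).
    """
    eligible = [b for b in blocks if b[1] > min_len and b[2] == n_taxa]
    rng = random.Random(seed)
    rng.shuffle(eligible)
    chosen, total = [], 0
    for block_id, length, _ in eligible:
        if total + length > cap:
            continue  # assumption: skip blocks that would overshoot the cap
        chosen.append(block_id)
        total += length
    return chosen, total

# Toy blocks (hypothetical lengths): lcb2 is too short, lcb3 is missing from one taxon.
blocks = [
    ("lcb1", 120_000, 5), ("lcb2", 9_000, 5), ("lcb3", 150_000, 4),
    ("lcb4", 100_000, 5), ("lcb5", 90_000, 5), ("lcb6", 50_000, 5),
]
chosen, total = subsample_blocks(blocks, n_taxa=5)
```

Fixing the seed makes the subsample reproducible, which matters when the same reduced dataset feeds several downstream analyses.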

Diversity and Phylogenetic Analyses

We measured genetic diversity as the substitution rate-scaled effective population size Θ for all molecular survey approaches (MLST, SNP, WGS), as implemented in the ‘pegas’ package in R (Paradis 2010). We inferred phylogenies both with and without assuming a molecular clock. Clock phylogenies were inferred using Bayesian Inference (BI) and Markov Chain Monte Carlo (MCMC) simulations as implemented in Beast 1.7.5 (restricting the analysis to those sequences with recorded dates), using the Beagle library to speed up the analysis (Ayres et al. 2012; Drummond et al. 2012). We assumed a General Time Reversible (GTR) substitution model for all three data approaches, with a discrete gamma distribution (4 categories) to model rate heterogeneity (MLST datasets were partitioned by gene with a model fit per gene; rate heterogeneity was not modeled for SNP datasets). We also tried to partition the genome dataset by gene, but phylogenetic inference did not reach convergence. Briefly, MCMC simulations were run until a single chain reached convergence, as diagnosed by its trace and ESS values (>400; runs ranged from 2E8 to 2E9 steps; 10% burnin) in Tracer 1.5 (http://tree.bio.ed.ac.uk/software/tracer/), and tree distributions were summarized in TreeAnnotator 1.7.5 (10-20% of trees were discarded as burnin). The molecular clock (strict clock model) was calibrated using isolate collection dates and a uniform distribution (from 0 to 1) as clock prior. We also used BI for non-clock phylogenies as implemented in MrBayes 3.2 (Ronquist et al. 2012), where we ran 8 chains (6 heated) for 2E7 generations each. As in the clock phylogenies, we used visual inspection of the traces as well as the average standard deviation of split frequencies to assess convergence. All trees were rooted using outgroups (Yersinia pseudotuberculosis IP31758, Ochrobactrum anthropi, and Burkholderia thailandensis E264).

To compare tree topologies across molecular survey approaches and among chromosomes, we applied two topology metrics, Robinson-Foulds (RF, Robinson & Foulds 1981) and Matching Splits Clusters (MC, Bogdanowicz & Giaro 2012), as implemented in TreeCmp (Bogdanowicz et al. 2012). We also assessed to what extent phylogenies and traits (host range, sample collection site, and sampling date) were correlated through Bayesian Tip-Significance testing, estimating the Association Index (AI, Wang et al. 2001) and Parsimony Score (PS, Slatkin & Maddison 1989) as implemented in BaTs (Parker et al. 2008). Figures were plotted using the ggplot2 (Wickham 2009) and APE (Paradis et al. 2004) packages, and highest posterior density (HPD) intervals were estimated using the TeachingDemos package (Snow & Snow 2013).
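For readers unfamiliar with HPD intervals, the following sketch shows one common empirical estimate: the shortest interval that contains 95% of the posterior samples. It is written in the spirit of an empirical HPD calculation and is not the TeachingDemos code; the function names and sample values are hypothetical.

```python
def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (empirical HPD estimate)."""
    xs = sorted(samples)
    n = len(xs)
    window = max(1, round(mass * n))  # number of samples the interval must contain
    starts = range(n - window + 1)
    best = min(starts, key=lambda i: xs[i + window - 1] - xs[i])
    return xs[best], xs[best + window - 1]

def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Toy posterior samples of a node age (hypothetical values, one extreme outlier).
ages_a = list(range(10, 29)) + [1000]
ages_b = list(range(20, 40))
hpd_a = hpd_interval(ages_a)
hpd_b = hpd_interval(ages_b)
```

The outlier in the first sample falls outside the interval, and the overlap check mirrors how node age estimates from different data approaches are compared in the Results.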

Results and Discussion

Sequencing technologies and statistical phylogenetic methods are arming researchers with powerful tools to track infectious agents over space and time with unprecedented resolution (Holt et al. 2012; Lewis et al. 2010). However, with multiple molecular survey approaches and a battery of analytical methods available, it is not clear how these interact.

Using 115 genomic sequences (60 this study + 55 GenBank), we compiled datasets to test whether different data approaches yielded dissimilar inferences regarding genetic diversity, substitution rates and node ages, tree topologies, structure, and phylogenies. We use the term “molecular survey approaches” to refer to the MLST, SNP, or WGS approaches, and the term “species datasets”, or simply “dataset”, to refer to sequence data belonging to B. pseudomallei, Brucella spp., or Y. pestis. Given the difficulty that current algorithm implementations have in reading and analyzing whole bacterial genomes, we decided to randomly sub-sample core homologous regions to compile genomic data that we termed “genome” (see Methods for details).

Diversity and datasets

Dataset sizes varied in length by data approach, species, and genomic partition (chromosome I/II). Notably, we intended to include as many genes as possible for the MLST schemes, which resulted in partitioned datasets ranging from 7 to 18 genes. In the case of Y. pestis, the MLST dataset was larger than the SNP dataset due to the low variability in this species. The interspecific dataset, i.e., Brucella spp., was the smallest for all data approaches (fewest sites), whereas the intraspecific datasets (Y. pestis; B. pseudomallei) were one or two orders of magnitude longer (Table 2; square brackets).

In order to characterize the present genetic diversity of our datasets, we estimated effective population size (Θ; segregating sites) and nucleotide diversity (π). Nucleotide diversity ranked higher for SNPs compared to other approaches for the same species, as these data contain binary variable sites only (Table 2; rounded brackets). Nucleotide diversity was higher for B. pseudomallei than for Brucella spp. and Y. pestis when SNP data were analyzed. However, this was not observed for either MLST or genome data, where nucleotide diversity ranked higher for Brucella spp. compared to B. pseudomallei. Y. pestis nucleotide diversity was consistently lower compared to other datasets across molecular approaches. Θ estimates were higher for B. pseudomallei than for the other datasets for SNP and genome data, but not for MLST data, where Brucella spp. yielded the largest Θ (Table 2; no brackets).
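The two diversity summaries can be made concrete with a small worked example. The sketch assumes that the segregating-sites estimator is Watterson's Θ (the number of segregating sites S divided by the harmonic number a_n) and that π is the mean pairwise difference per site; the alignment is toy data and the function names are hypothetical, not the pegas code.

```python
from itertools import combinations

def segregating_sites(alignment):
    """Count alignment columns with more than one distinct state."""
    return sum(1 for column in zip(*alignment) if len(set(column)) > 1)

def watterson_theta(alignment):
    """Watterson's estimator: S / a_n, with a_n = sum of 1/i for i = 1..n-1."""
    n = len(alignment)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(alignment) / a_n

def nucleotide_diversity(alignment):
    """Mean number of pairwise differences per site."""
    length = len(alignment[0])
    pairs = list(combinations(alignment, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)

# Toy alignment (hypothetical): 4 sequences, 8 sites, 2 segregating sites.
alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
theta = watterson_theta(alignment)
pi = nucleotide_diversity(alignment)
```

In an alignment restricted to variable sites, every column contributes to the pairwise differences, which is why per-site π is inflated for SNP-only data relative to MLST or genome alignments.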

Table 2: Genetic diversity and dataset length for different species and data approaches.

First row = theta parameter (Θ, segregating sites); second row = nucleotide diversity (π); third row = dataset length in [base pairs]

MLST SNP Genome

chromosome I chromosome II chromosome I chromosome II

36.04 7800 4300 4660 2180 B. pseudomallei (4.44E-3) (0.958E-3) (0.969E-3) (5.67E-3) (7.14E-3) [108654] [3518] [31189] [17313] [289172]

106.38 863.48 221.107 601.67 1931.65 (0.130E-3) Brucella spp. (8.45E-3) (0.669E-3) (0.639E-3) (6.55E-3) [36223] [4409] [3628] [929] [24110]

35.49 3211.87 517.86 Y. pestis (4.08E-3) (0.475E-3) (3.74E-3) [20498] [14116] [281149]

Rates and Ages

We tested whether different data approaches resulted in different inferences regarding substitution rates and node ages while holding other parameters constant, i.e., clock calibrations, substitution model, tip dates, coalescent tree priors, and taxa (partition schemes differed; see Methods for details). Substitution rate estimates were always higher for SNP data compared to genome data, irrespective of the species dataset used (Fig. 2). Rates estimated from MLST data largely overlapped with estimates from genome data for Y. pestis and Brucella spp., including median values (highlighted in Fig. 2). However, this was not the case for the B. pseudomallei dataset where, although the distributions overlapped, the median substitution rate estimated from MLST data was up to an order of magnitude higher than the genome-based estimates (MLST rate median = 6.30E-7; genome chr I = 6.17E-8; genome chr II = 2.48E-7; SNP chr I = 1.06E-6; SNP chr II = 9.94E-7; rates in substitutions per site per year).
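The B. pseudomallei medians quoted above can be compared directly. The short calculation below uses only those five values; the dictionary layout and function name are illustrative.

```python
import math

# Median substitution rates for B. pseudomallei (substitutions per site per year),
# as quoted in the text.
medians = {
    "MLST": 6.30e-7,
    "genome chr I": 6.17e-8,
    "genome chr II": 2.48e-7,
    "SNP chr I": 1.06e-6,
    "SNP chr II": 9.94e-7,
}

def fold_vs(reference, rates):
    """Ratio of the reference median to every other median."""
    ref = rates[reference]
    return {name: ref / rate for name, rate in rates.items() if name != reference}

folds = fold_vs("MLST", medians)
for name, fold in folds.items():
    print(f"MLST / {name}: {fold:.2f}x ({math.log10(fold):+.2f} log10 units)")
```

On these numbers the MLST median is roughly ten times the genome chromosome I median and about 2.5 times the genome chromosome II median, while both SNP medians exceed it.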

Figure 2: Substitution rates for all datasets as estimated from different data approaches. Genome/SNP chr I/II refers to estimates from different chromosomes. Note different scale for species rates

Remarkably, when collecting node ages and comparing them across data approaches, we found that highest posterior density intervals (95% HPD) overlapped substantially in the case of the Y. pestis and Brucella spp. datasets (Fig. 3B-C). We observed the same trend with SNP and genome approaches when analyzing B. pseudomallei datasets, but not with MLST data (Fig. 3A). Interestingly, for Y. pestis node age estimates, we observed that 95% HPD intervals were narrower for SNP and genome data than for MLST data. This suggests that, although different molecular survey approaches result in markedly different substitution rate estimates, the 95% HPD intervals of node ages largely overlap and thus are not significantly different.

Substitution rate estimates differ substantially (up to 2 orders of magnitude), though their posterior distributions overlap to various degrees. Generally, substitution rate estimates drawn from SNP data were higher than those from MLST and genome data. However, node ages are largely consistent across molecular survey approaches, especially for Brucella spp. data (interspecific and intermediate diversity dataset). This supports the practice of using SNP data coupled to Bayesian inference coalescent methods to infer divergence times, even though traditional reversible substitution models were not specifically designed for this molecular approach. Substitution models based on models for discrete morphological character changes have been suggested, but are not widely popular (Lewis 2001).

Phylogenies and Topology Comparisons

We wanted to determine whether different data approaches produce different phylogenies and to quantify the extent of any observed differences in topology (dataset sizes in Table 2). We inferred phylogenies for every species under all three molecular survey approaches, partitioned by chromosome when appropriate, without assuming a molecular clock and rooted with outgroups (see Methods). We used two topology metrics: Matching Clusters (MC), the rooted version of Matching Splits (Bogdanowicz & Giaro 2012), and R-F Clusters (RC), the rooted version of the Robinson-Foulds metric (Bogdanowicz et al. 2012; Robinson & Foulds 1981). MC distances reflect the minimal number of cluster (or clade) movements needed so that the two phylogenies are topologically equivalent. RC distances measure the average number of cluster differences between two phylogenies. In general, we found that the phylogenies inferred using MLST data are less resolved and more poorly supported (posterior probabilities) than those inferred from either SNP or genome data, for all species datasets (Fig. 4 and Supplemental Figure 2). This is also reflected in MC distances, where topologies inferred from MLST data are as distant from SNP/genome-based topologies as SNP and genome topologies are from each other, or more so, with some exceptions (Table 3).
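The cluster-based Robinson-Foulds distance can be sketched on toy trees. The example assumes rooted trees written as nested tuples and takes the distance as half the number of clusters found in one tree but not the other, a convention consistent with the fractional R-F Cluster values in Table 3; it is not the TreeCmp code, and the MC metric (which requires a minimum-weight matching between clusters) is not shown.

```python
def clusters(tree):
    """Non-trivial clusters (sets of descendant leaves) of a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leafset = frozenset().union(*(leaves(child) for child in node))
        found.add(leafset)
        return leafset

    root = leaves(tree)
    found.discard(root)  # the root cluster is shared by all trees on the same taxa
    return found

def rf_cluster_distance(tree_a, tree_b):
    """Half the size of the symmetric difference between the two cluster sets (assumed convention)."""
    return len(clusters(tree_a) ^ clusters(tree_b)) / 2

# Toy rooted trees (hypothetical taxa A-E) differing in one clade.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
distance = rf_cluster_distance(tree_1, tree_2)
```

Each conflicting clade counts equally here regardless of its depth in the tree, which is one reason RC and MC can rank the same pair of trees differently.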

[Figure 3 plot: panels A (Burkholderia pseudomallei), B (Brucella spp.), and C (Yersinia pestis), each split by data approach (mlst, snp, genome); x-axis: Nodes; y-axis: Node ages]

Figure 3: Median node ages in years. Burkholderia pseudomallei (A), Brucella spp. (B), and Yersinia pestis (C) median estimates and their 95% highest posterior density (HPD) interval according to data approach (only chromosome I shown; see Supplemental Figure 1). Nodes are numbered from youngest to oldest

For Brucella spp. and Y. pestis, MC distances are clearly higher between SNP/genome and MLST; however, RC distances do not follow this trend. Since the MC metric weights differences in deep branches more heavily than RC does, these results suggest that SNP and genome topologies have backbones more similar to each other than to MLST topologies. Likewise, MLST topologies are more similar to the others at the tips than deep in the tree (Fig. 4 and Table 3). Of course, determining which approach is more accurate would require a dataset of known evolutionary history, but SNP and genome approaches appear to be more consistent with one another, especially for the deeper nodes.

[Figure 4 trees: three panels (MLST, SNP, Genome) with isolate names and collection years at the tips and posterior probabilities at the nodes]

Figure 4: Burkholderia pseudomallei phylogenies by survey approach. The MLST phylogeny (left) is less resolved and more poorly supported than the SNP (center) and genome (right) phylogenies (only chromosome I shown)

Slowly evolving pathogens can be difficult to track because their populations accrue fewer substitutions, and genomic changes may or may not reflect ecological processes such as host switches or geographic spread (see below for association testing).

For instance, phylogenies inferred using MLST data were less resolved and more poorly supported than their SNP and genome counterparts, even though in some cases (e.g., Brucella spp. / Y. pestis) the MLST dataset was larger than the SNP dataset. This argues for the need to acquire genome data, as those data constitute the ultimate source of genealogical information, especially when analyzing monomorphic or clonal species such as Y. pestis (Achtman 2008; Achtman et al. 1999). We also found that strongly supported phylogenies, e.g., those based on SNP and genome data, can support conflicting hypotheses and thus can be misleading. For instance, B. pseudomallei clades including isolates 1106a, 1106b, Bp22, and BPC006 all show posterior probabilities = 1, yet their relationships differ between approaches; this is a caveat when analyzing SNP/genome data and drawing conclusions about relationships among isolates.


Table 3: Topology distances among phylogenies inferred using different data approaches. Genome/SNP-I/II = chromosome I or II; R-F Cluster = Robinson-Foulds metric for rooted trees. Rows marked with an asterisk show tree comparisons between different chromosomes under the same molecular survey approach

Species | Tree Comparisons | Matching Cluster | R-F Cluster
B. pseudomallei | mlst vs snp-I | 181 | 16
B. pseudomallei | mlst vs snp-II | 162 | 17
B. pseudomallei | mlst vs genome-I | 149 | 18
B. pseudomallei | mlst vs genome-II | 116 | 18
B. pseudomallei * | snp-I vs snp-II | 33 | 7
B. pseudomallei | snp-I vs genome-I | 56 | 17
B. pseudomallei | snp-I vs genome-II | 91 | 17
B. pseudomallei | snp-II vs genome-I | 47 | 19
B. pseudomallei | snp-II vs genome-II | 72 | 16
B. pseudomallei * | genome-I vs genome-II | 63 | 14
Brucella spp. | mlst vs snp-I | 34 | 5
Brucella spp. | mlst vs snp-II | 10 | 1.5
Brucella spp. | mlst vs genome-I | 73 | 4.5
Brucella spp. | mlst vs genome-II | 56 | 5
Brucella spp. * | snp-I vs snp-II | 24 | 5.5
Brucella spp. | snp-I vs genome-I | 61 | 5.5
Brucella spp. | snp-I vs genome-II | 40 | 5
Brucella spp. | snp-II vs genome-I | 63 | 5
Brucella spp. | snp-II vs genome-II | 50 | 4.5
Brucella spp. * | genome-I vs genome-II | 67 | 7.5
Y. pestis | mlst vs snp | 223 | 13.5
Y. pestis | mlst vs genome | 103 | 8
Y. pestis | snp vs genome | 124 | 8.5

Tips association: geography, time, and hosts

Phylogenetic inference is often performed to infer ecological processes that leave a genomic imprint, and phylogeny-trait associations are essential to elucidate these processes. Accordingly, we estimated the Association Index (AI) and Parsimony Score (PS) for three traits (sampling location, sampling time, and host), and tested whether different answers were obtained depending on the molecular survey approach. Results for B. pseudomallei showed significant association with sampling location and sampling time, but not with host, for most of the datasets (AI and PS; Table 4). Likewise, Y. pestis datasets were significantly associated with sampling location and, to some extent, with sampling time and host. Interestingly, Brucella spp. showed significant genetic structure associated with both sampling location and host, but not sampling time (Table 4).
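The Parsimony Score underlying these tests is the minimum number of trait changes needed to explain the tip states on a tree. A minimal sketch of that count using Fitch's algorithm is given below, assuming a rooted binary tree written as nested tuples; the tree, trait labels, and function name are hypothetical. This is not the BaTs implementation, which additionally compares the observed score against a null distribution obtained by shuffling tip traits across a posterior sample of trees.

```python
def parsimony_score(tree, trait):
    """Fitch parsimony: minimum number of trait changes on a rooted binary tree."""
    changes = 0

    def states(node):
        nonlocal changes
        if not isinstance(node, tuple):
            return {trait[node]}
        child_sets = [states(child) for child in node]
        common = set.intersection(*child_sets)
        if common:
            return common
        changes += 1  # children disagree, so one change is required at this node
        return set.union(*child_sets)

    states(tree)
    return changes

# Toy tree with host traits (hypothetical isolates a-e).
tree = ((("a", "b"), "c"), ("d", "e"))
clustered = {"a": "human", "b": "human", "c": "human", "d": "rodent", "e": "rodent"}
scattered = {"a": "human", "b": "rodent", "c": "human", "d": "rodent", "e": "human"}
ps_clustered = parsimony_score(tree, clustered)
ps_scattered = parsimony_score(tree, scattered)
```

A trait that tracks the phylogeny needs few changes (low PS), whereas a trait scattered across the tips needs more; significance is then a matter of how the observed score compares with the shuffled-trait scores.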

Irrespective of the molecular survey approach used, phylogenies derived from B. pseudomallei showed a significant association with sampling location, but not with host, suggesting similar evolutionary forces acting on B. pseudomallei in different hosts, or that B. pseudomallei isolates are highly endemic to the sites from which they were isolated. Similarly, Brucella spp. phylogenies were associated with both sampling location and host, irrespective of the data approach used, most likely reflecting metabolic and geographic constraints on gene flow. Interestingly, for Y. pestis, no significant association of host and MLST data was observed, most likely reflecting lack of signal, given the absence of resolution of phylogenies in its posterior distribution.

Table 4: Trait-phylogeny association statistics. Significant associations (p value < 0.05) were found between traits (sampling location/host/time) and phylogenies inferred by using different data approaches. Association index (AI); Parsimony Score (PS); genome/SNP-I/II = chromosome I or II

Trait | Statistic | B. pseudomallei | Brucella spp. | Y. pestis
Sampling Location | AI | MLST, genome-I, genome-II, SNP-I, SNP-II | MLST, genome-I, genome-II, SNP-I, SNP-II | MLST, genome, SNP
Sampling Location | PS | MLST, genome-I, genome-II, SNP-II | MLST, SNP-II | MLST, genome, SNP
Host | AI | none | MLST, genome-I, genome-II, SNP-I, SNP-II | genome, SNP
Host | PS | none | MLST, genome-I, genome-II, SNP-I, SNP-II | none
Time | AI | genome-I, genome-II | none | genome
Time | PS | none | none | genome


Molecular survey approaches have different sets of assumptions and properties that must be considered before an analysis is done. Likewise, the statistical models employed may suit certain data approaches and not others. Here, we used popular phylogenetic methods for all molecular approaches to test whether congruent inferences could be obtained, even though some might violate particular model assumptions. The MLST method targets housekeeping genes that are likely to be maintained across taxonomic levels and hence are amenable to evolutionary inference. Yet, they are likely to be subject to selective pressures, violating the neutrality assumption of most phylogenetic methods (Roje 2014). Other trade-offs of MLST have been discussed elsewhere, mainly with respect to utility and how they can be refashioned in the post-genomic era (Maiden et al. 2013; Pérez-Losada et al. 2013). Sampling bias can also influence phylogenetic analysis (Lachance & Tishkoff 2013). Here, we obtained SNP data without using reference data and included globally sampled genomes to diminish ascertainment bias. However, standard nucleotide substitution models, such as GTR, and Bayesian Inference methods, which typically factor in invariable sites, are not designed to account for datasets composed only of binary variable sites; this influences branch length estimation and affects parameter estimates such as substitution rate and divergence time. Nonetheless, they have been used to date the spread of bacteria and other pathogens (Comas et al. 2013; Holt et al. 2010; Holt et al. 2012; Okoro et al. 2012; Pepperell et al. 2013). We speculate, based on these results, that analysis of SNP data to survey genomic variation is robust and can produce inferences that are not substantially different from WGS data.

Conclusions

The field of bacterial population genomics is advancing rapidly, with larger datasets (more taxa, more sites) increasingly available, including whole genomes, making possible greater resolution and more powerful exploration of complex issues (Chewapreecha et al. 2014; Nasser et al. 2014). The results of analyses reported here show that the molecular survey approach used can have a critical impact on substitution rate and phylogenetic inference. However, node dates and trait associations are relatively consistent irrespective of the survey tool used. We found that substitution rates vary widely depending on the approach taken, and that SNP and genomic datasets yield different, but strongly supported, phylogenies. Overall, inferences were more sensitive to molecular survey approach in the low-diversity Y. pestis dataset than in the B. pseudomallei and Brucella spp. datasets.

Substitution rate estimates are important because, coupled to sampling dates, they allow tracking infections in space and time, and thus provide an essential epidemiological tool for monitoring and control of infectious diseases. The results presented strongly suggest that future studies need to consider discordances between inferences derived from different molecular survey methods, especially with respect to substitution rate estimates.

Importantly, for whole genome analysis, a subset of the data currently has to be selected in order to run existing software to estimate population genetic parameters. Clearly, there is a need to expand the range of methods to include whole genome data analysis; as bacterial genomics matures, current methods will need to be modified and extended to handle the stream of data now being generated.

Supplemental Information

(http://dx.doi.org/10.6084/m9.figshare.1091392)

Supplemental File 1: SRA Accession numbers for all 60 genomes contributed in this study

Supplemental Figure 1: Median node ages for Burkholderia pseudomallei (A), Brucella spp. (B)

Supplemental Figure 2: Phylogenies by molecular survey approach. Brucella spp. phylogenies (A) and Y. pestis phylogenies (B)

Supplemental File 2: Datasets, Clock and non-clock phylogenies for all molecular survey approaches and species

References

Achtman M. 2008. Evolution, Population Structure, and Phylogeography of Genetically Monomorphic Bacterial Pathogens. Annual Review of Microbiology 62:53-70.

Achtman M, Zurth K, Morelli G, Torrea G, Guiyoule A, and Carniel E. 1999. Yersinia pestis, the cause of plague, is a recently emerged clone of Yersinia pseudotuberculosis. Proceedings of the National Academy of Sciences 96:14043-14048.

Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. 1990. Basic local alignment search tool. Journal of molecular biology 215:403-410.

Ayres DL, Darling A, Zwickl DJ, Beerli P, Holder MT, Lewis PO, Huelsenbeck JP, Ronquist F, Swofford DL, and Cummings MP. 2012. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Systematic biology 61:170-173.

Baker S, Hanage WP, and Holt KE. 2010. Navigating the future of bacterial molecular epidemiology. Current opinion in microbiology 13:640-645.

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, and Sayers EW. 2010. GenBank. Nucleic Acids Research 38:D46-D51.

Bertelli C, and Greub G. 2013. Rapid bacterial genome sequencing: methods and applications in clinical microbiology. Clinical Microbiology and Infection 19:803-813.

Bogdanowicz D, and Giaro K. 2012. Matching split distance for unrooted binary phylogenetic trees. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 9:150-160.

Bogdanowicz D, Giaro K, and Wróbel B. 2012. Treecmp: comparison of Trees in polynomial Time. Evolutionary Bioinformatics Online 8:475.

Bos KI, Schuenemann VJ, Golding GB, Burbano HA, Waglechner N, Coombes BK, McPhee JB, DeWitte SN, Meyer M, Schmedes S, Wood J, Earn DJD, Herring DA, Bauer P, Poinar HN, and Krause J. 2011. A draft genome of Yersinia pestis from victims of the Black Death. Nature 478:506-510.

Brinkac LM, Davidsen T, Beck E, Ganapathy A, Caler E, Dodson RJ, Durkin AS, Harkins DM, Lorenzi H, and Madupu R. 2010. Pathema: a clade-specific bioinformatics resource center for pathogen research. Nucleic Acids Research 38:D408-D414.

Butler MI, Stockwell PA, Black MA, Day RC, Lamont IL, and Poulter RTM. 2013. Pseudomonas syringae pv. actinidiae from Recent Outbreaks of Kiwifruit Bacterial Canker Belong to Different Clones That Originated in China. PLoS ONE 8:e57464.

Castillo-Ramirez S, Corander J, Marttinen P, Aldeljawi M, Hanage W, Westh H, Boye K, Gulay Z, Bentley S, Parkhill J, Holden M, and Feil E. 2012. Phylogeographic variation in recombination rates within a global clone of methicillin-resistant Staphylococcus aureus. Genome Biology 13:R126.

Chen C, Zhang W, Zheng H, Lan R, Wang H, Du P, Bai X, Ji S, Meng Q, Jin D, Liu K, Jing H, Ye C, Gao GF, Wang L, Gottschalk M, and Xu J. 2013. Minimum core genome sequence typing of bacterial pathogens: a unified approach for clinical and public health microbiology. Journal of Clinical Microbiology.

Chewapreecha C, Harris SR, Croucher NJ, Turner C, Marttinen P, Cheng L, Pessia A, Aanensen DM, Mather AE, and Page AJ. 2014. Dense genomic sampling identifies highways of pneumococcal recombination. Nature genetics 46:305-309.

Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, Parkhill J, Malla B, Berg S, and Thwaites G. 2013. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nature genetics.

Cornejo OE, Lefébure T, Pavinski Bitar PD, Lang P, Richards VP, Eilertson K, Do T, Beighton D, Zeng L, Ahn S-J, Burne RA, Siepel A, Bustamante CD, and Stanhope MJ. 2013. Evolutionary and Population Genomics of the Cavity Causing Bacteria Streptococcus mutans. Molecular Biology and Evolution 30:881-893.

224 Croucher NJ, Finkelstein JA, Pelton SI, Mitchell PK, Lee GM, Parkhill J, Bentley SD,

Hanage WP, and Lipsitch M. 2013. Population genomics of post-vaccine changes

in pneumococcal epidemiology. Nat Genet 45:656-663.

Darling AE, Mau B, and Perna NT. 2010. progressiveMauve: multiple genome alignment

with gene gain, loss and rearrangement. PLoS ONE 5:e11147.

Drummond AJ, Suchard MA, Xie D, and Rambaut A. 2012. Bayesian phylogenetics with

BEAUti and the BEAST 1.7. Molecular Biology and Evolution 29:1969-1973.

Gardner SN, and Hall BG. 2013. When Whole-Genome Alignments Just Won't Work:

kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of

Hundreds of Microbial Genomes. PLoS ONE 8:e81760.

Gardner SN, and Slezak T. 2010. Scalable SNP analyses of 100+ bacterial or viral

genomes. J Forensic Res 1:107.

Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, Driscoll T,

Hix D, Mane SP, and Mao C. 2011. PATRIC: the comprehensive bacterial

bioinformatics resource with a focus on human pathogenic species. Infection and

immunity 79:4286-4298.

Goss EM, Tabima JF, Cooke DE, Restrepo S, Fry WE, Forbes GA, Fieland VJ, Cardenas

M, and Grünwald NJ. 2014. The Irish potato famine pathogen Phytophthora

infestans originated in central Mexico rather than the Andes. Proceedings of the

National Academy of Sciences:201401884.

Grad YH, and Waldor MK. 2013. Deciphering the Origins and Tracking the Evolution of

Cholera Epidemics with Whole-Genome-Based Molecular Epidemiology. mBio 4.

225 Holt KE, Baker S, Dongol S, Basnyat B, Adhikari N, Thorson S, Pulickal AS, Song Y,

Parkhill J, and Farrar JJ. 2010. High-throughput bacterial SNP typing identifies

distinct clusters of Salmonella Typhi causing typhoid in Nepalese children. BMC

infectious diseases 10:144.

Holt KE, Baker S, Weill F-X, Holmes EC, Kitchen A, Yu J, Sangal V, Brown DJ, Coia

JE, Kim DW, Choi SY, Kim SH, da Silveira WD, Pickard DJ, Farrar JJ, Parkhill

J, Dougan G, and Thomson NR. 2012. Shigella sonnei genome sequencing and

phylogenetic analysis indicate recent global dissemination from Europe. Nat

Genet 44:1056-1059.

Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper

A, Markowitz S, and Duran C. 2012. Geneious Basic: an integrated and

extendable desktop software platform for the organization and analysis of

sequence data. Bioinformatics 28:1647-1649.

Kos VN, Desjardins CA, Griggs A, Cerqueira G, Van Tonder A, Holden MTG, Godfrey

P, Palmer KL, Bodi K, Mongodin EF, Wortman J, Feldgarden M, Lawley T, Gill

SR, Haas BJ, Birren B, and Gilmore MS. 2012. Comparative Genomics of

Vancomycin-Resistant Staphylococcus aureus Strains and Their Positions within

the Clade Most Commonly Associated with Methicillin-Resistant S. aureus

Hospital-Acquired Infection in the United States. mBio 3.

Köser CU, Holden MTG, Ellington MJ, Cartwright EJP, Brown NM, Ogilvy-Stuart AL,

Hsu LY, Chewapreecha C, Croucher NJ, Harris SR, Sanders M, Enright MC,

Dougan G, Bentley SD, Parkhill J, Fraser LJ, Betley JR, Schulz-Trieglaff OB,

Smith GP, and Peacock SJ. 2012. Rapid Whole-Genome Sequencing for

226 Investigation of a Neonatal MRSA Outbreak. New England Journal of Medicine

366:2267-2275.

Lachance J, and Tishkoff SA. 2013. SNP ascertainment bias in population genetic

analyses: Why it is important, and how to correct it. BioEssays.

Lemey P, Rambaut A, Bedford T, Faria N, Bielejec F, Baele G, Russell CA, Smith DJ,

Pybus OG, and Brockmann D. 2014. Unifying viral genetics and human

transportation data to predict the global transmission dynamics of human

influenza H3N2. PLoS pathogens 10:e1003932.

Lewis PO. 2001. A likelihood approach to estimating phylogeny from discrete

morphological character data. Systematic biology 50:913-925.

Lewis T, Loman NJ, Bingle L, Jumaa P, Weinstock GM, Mortiboy D, and Pallen MJ.

2010. High-throughput whole-genome sequencing to dissect the epidemiology of

Acinetobacter baumannii isolates from a hospital outbreak. Journal of Hospital

Infection 75:37-41.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, and

Durbin R. 2009. The sequence alignment/map format and SAMtools.

Bioinformatics 25:2078-2079.

Li R, Li Y, Kristiansen K, and Wang J. 2008. SOAP: short oligonucleotide alignment

program. Bioinformatics 24:713-714.

Liolios K, Mavromatis K, Tavernarakis N, and Kyrpides NC. 2008. The Genomes On

Line Database (GOLD) in 2007: status of genomic and metagenomic projects and

their associated metadata. Nucleic Acids Research 36:D475-D479.

227 Maiden MCJ, van Rensburg MJJ, Bray JE, Earle SG, Ford SA, Jolley KA, and McCarthy

ND. 2013. MLST revisited: the gene-by-gene approach to bacterial genomics. Nat

Rev Micro 11:728-736.

Markowitz VM, Chen I-MA, Palaniappan K, Chu K, Szeto E, Grechkin Y, Ratner A,

Jacob B, Huang J, and Williams P. 2012. IMG: the integrated microbial genomes

database and comparative analysis system. Nucleic Acids Research 40:D115-

D122.

Marttinen P, Hanage WP, Croucher NJ, Connor TR, Harris SR, Bentley SD, and

Corander J. 2012. Detection of recombination events in bacterial genomes from

large population samples. Nucleic Acids Research 40:e6.

Morelli G, Song Y, Mazzoni CJ, Eppinger M, Roumagnac P, Wagner DM, Feldkamp M,

Kusecek B, Vogler AJ, Li Y, Cui Y, Thomson NR, Jombart T, Leblois R,

Lichtner P, Rahalison L, Petersen JM, Balloux F, Keim P, Wirth T, Ravel J, Yang

R, Carniel E, and Achtman M. 2010. Yersinia pestis genome sequencing

identifies patterns of global phylogenetic diversity. Nat Genet 42:1140-1143.

Nasser W, Beres SB, Olsen RJ, Dean MA, Rice KA, Long SW, Kristinsson KG,

Gottfredsson M, Vuopio J, and Raisanen K. 2014. Evolutionary pathway to

increased virulence and epidemic group A Streptococcus disease derived from

3,615 genome sequences. Proceedings of the National Academy of

Sciences:201403138.

Nielsen CB, Cantor M, Dubchak I, Gordon D, and Wang T. 2010. Visualizing genomes:

techniques and challenges. Nat Methods 7:S5-S15.

228 Okoro CK, Kingsley RA, Connor TR, Harris SR, Parry CM, Al-Mashhadani MN,

Kariuki S, Msefula CL, Gordon MA, de Pinna E, Wain J, Heyderman RS, Obaro

S, Alonso PL, Mandomando I, MacLennan CA, Tapia MD, Levine MM, Tennant

SM, Parkhill J, and Dougan G. 2012. Intracontinental spread of human invasive

Salmonella Typhimurium pathovariants in sub-Saharan Africa. Nat Genet

44:1215-1221.

Paradis E. 2010. pegas: an R package for population genetics with an integrated–modular

approach. Bioinformatics 26:419-420.

Paradis E, Claude J, and Strimmer K. 2004. APE: analyses of phylogenetics and

evolution in R language. Bioinformatics 20:289-290.

Parker J, Rambaut A, and Pybus OG. 2008. Correlating viral phenotypes with phylogeny:

accounting for phylogenetic uncertainty. Infection, Genetics and Evolution 8:239-

246.

Pepperell CS, Casto AM, Kitchen A, Granka JM, Cornejo OE, Holmes EC, Birren B,

Galagan J, and Feldman MW. 2013. The Role of Selection in Shaping Diversity

of Natural M. tuberculosis Populations. PLoS pathogens 9:e1003543.

Pérez-Lago L, Comas I, Navarro Y, González-Candelas F, Herranz M, Bouza E, and

García-de-Viedma D. 2013. Whole Genome Sequencing analysis of intrapatient

microevolution in Mycobacterium tuberculosis: Potential impact on the inference

of tuberculosis transmission. Journal of Infectious Diseases 209:98-108.

Pérez-Losada M, Cabezas P, Castro-Nallar E, and Crandall KA. 2013. Pathogen typing in

the genomics era: MLST and the future of molecular epidemiology. Infection,

Genetics and Evolution 16:38-53.

229 Pybus OG, Fraser C, and Rambaut A. 2013. Evolutionary epidemiology: preparing for an

age of genomic plenty. Philosophical Transactions of the Royal Society B:

Biological Sciences 368.

Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, Jiang L, Holtzapple E, Busch

JD, Smith KL, and Schupp JM. 2002. Comparative genome sequencing for

discovery of novel polymorphisms in Bacillus anthracis. Science 296:2028-2033.

Reimer AR, Van Domselaar G, Stroika S, Walker M, Kent H, Tarr C, Talkington D,

Rowe L, Olsen-Rasmussen M, Frace M, Sammons S, Dahourou GA, Boncy J,

Smith AM, Mabon P, Petkau A, Graham M, Gilmour MW, and Gerner-Smidt P.

2011. Comparative genomics of Vibrio cholerae from Haiti, Asia, and Africa.

Emerg Infect Dis 17:2113-2121.

Revazishvili T, Rajanna C, Bakanidze L, Tsertsvadze N, Imnadze P, O’Connell K,

Kreger A, Stine O, Morris J, and Sulakvelidze A. 2008. Characterisation of

Yersinia pestis isolates from natural foci of plague in the Republic of Georgia,

and their relationship to Y. pestis isolates from other countries. Clinical

Microbiology and Infection 14:429-436.

Robinson D, and Foulds LR. 1981. Comparison of phylogenetic trees. Mathematical

Biosciences 53:131-147.

Roje DM. 2014. Evaluating the Effects of Non-Neutral Molecular Markers on Phylogeny

Inference. PLoS ONE 9:e87428.

Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu

L, Suchard MA, and Huelsenbeck JP. 2012. MrBayes 3.2: efficient Bayesian

230 phylogenetic inference and model choice across a large model space. Systematic

biology 61:539-542.

Sheppard SK, Didelot X, Meric G, Torralbo A, Jolley KA, Kelly DJ, Bentley SD, Maiden

MCJ, Parkhill J, and Falush D. 2013. Genome-wide association study identifies

vitamin B5 biosynthesis as a host specificity factor in Campylobacter.

Proceedings of the National Academy of Sciences 110:11923-11927.

Slatkin M, and Maddison WP. 1989. A cladistic measure of gene flow inferred from the

phylogenies of alleles. Genetics 123:603-613.

Snow G, and Snow MG. 2013. Package ‘TeachingDemos’.

Spoor LE, McAdam PR, Weinert LA, Rambaut A, Hasman H, Aarestrup FM, Kearns

AM, Larsen AR, Skov RL, and Fitzgerald JR. 2013. Livestock Origin for a

Human Pandemic Clone of Community-Associated Methicillin-Resistant

Staphylococcus aureus. mBio 4.

Wang T, Donaldson Y, Brettle R, Bell J, and Simmonds P. 2001. Identification of shared

populations of human immunodeficiency virus type 1 infecting microglia and

tissue macrophages outside the central nervous system. Journal of virology

75:11686-11699.

Warinner C, Rodrigues JFM, Vyas R, Trachsel C, Shved N, Grossmann J, Radini A,

Hancock Y, Tito RY, and Fiddyment S. 2014. Pathogens and host immunity in

the ancient human oral cavity. Nature genetics 46:336-344.

Whatmore A, Perrett L, and MacMillan A. 2007. Characterisation of the genetic diversity

of Brucella by multilocus sequencing. BMC Microbiology 7:34.

231 Wickham H. 2009. ggplot2: elegant graphics for data analysis: Springer Publishing

Company, Incorporated.

Wielgoss S, Barrick JE, Tenaillon O, Wiser MJ, Dittmar WJ, Cruveiller S, Chane-Woon-

Ming B, Médigue C, Lenski RE, and Schneider D. 2013. Mutation rate dynamics

in a bacterial population reflect tension between adaptation and genetic load.

Proceedings of the National Academy of Sciences 110:222-227.

Wilson MR, Allard MW, and Brown EW. 2013. The forensic analysis of foodborne

bacterial pathogens in the age of whole-genome sequencing. Cladistics 29:449-

461.

Zhang Z, Chen D, Chen Y, Liu W, Wang L, Zhao F, and Yao B. 2010. Spatio-Temporal

Data Comparisons for Global Highly Pathogenic Avian Influenza (HPAI) H5N1

Outbreaks. PLoS ONE 5:e15314.

232

Section III

Case Studies Related To Food Production

Chapter 5:

Molecular Phylodynamics And Protein Modeling Of Infectious Salmon Anemia Virus (ISAV)

Abstract

Background: ISAV is a member of the Orthomyxoviridae family that affects salmonids with disastrous results. It was first detected in Norway in 1984 and has since been reported in Canada, the United States, Scotland, and the Faroe Islands. Recently, an outbreak was recorded in Chile with negative consequences for the local fishing industry. However, few studies have examined available data to test hypotheses about the phylogeographic partitioning of the infecting viral population, its population dynamics, or the evolutionary rates and demographic history of ISAV. To explore these issues, we collected relevant sequences of the genes coding for both surface proteins from Chile, Canada, and Norway. We addressed questions regarding their phylogenetic relationships, evolutionary rates, and demographic history using modern phylogenetic methods.

Results: A recombination breakpoint was consistently detected in the Hemagglutinin-Esterase (he) gene at either side of the Highly Polymorphic Region (HPR), whereas no recombination breakpoints were detected in the Fusion protein (f) gene. The inferred evolutionary relationships of ISAV recovered the 2007 Chilean outbreak group as a monophyletic clade for f that is sister to the Norwegian isolates. Their tMRCA is consistent with epidemiological data, and the demographic history was successfully recovered, showing a profound bottleneck followed by population expansion. Finally, selection analyses detected ongoing diversifying selection in f and he codons associated with protease processing and the HPR, respectively.

Conclusions: Our results are consistent with the Norwegian origin hypothesis for the Chilean outbreak clade. In particular, the ISAV HPR0 genotype is not the ancestor of all ISAV strains, although SK779/06 (HPR0) shares a common ancestor with the Chilean outbreak clade. Our analyses suggest that ISAV shows hallmarks typical of RNA viruses that can be exploited in epidemiological and surveillance settings. In addition, we hypothesize that genetic diversity of the HPR is governed by recombination, probably due to template switching, and that novel fusion gene proteolytic sites confer a selective advantage on the isolates that carry them. Additionally, protein modeling allowed us to relate the results of phylogenetic studies to the predicted structures. This study demonstrates that phylogenetic methods are important tools to predict future outbreaks of ISAV and other salmon pathogens.

Background

Infectious salmon anemia virus (ISAV) is a pathogen that has been associated with high fish mortality in the aquaculture industry [1]. The cumulative mortality associated with each outbreak of ISAV in Norway and other countries is very high, reaching up to 100% of fish stock in some cases [2-6]. ISAV has been classified as the only member of the genus Isavirus within the Orthomyxoviridae family, which includes the influenza viruses [7]. Virions consist mainly of a membranous envelope with two surface glycoproteins and a matrix protein surrounding a ribonucleoprotein complex. The ribonucleoprotein complex, in turn, is formed by the association of the nucleoprotein, three polymerase subunits, and the genomic RNA. Virion morphology varies from spherical to pleiomorphic. The enveloped virus particles are 45-140 nm in diameter; however, highly pleiomorphic particles up to 700 nm in the longest dimension are occasionally observed. The genome organization is consistent with that of other members of the Orthomyxoviridae, comprising 8 segments of single-stranded negative-sense RNA ((-) ssRNA), in which each segment encodes one protein except segments 7 and 8, which encode two (European strains) or three (Canadian strains) and two proteins, respectively [8]. Four major structural proteins have been identified: a 68 kDa nucleoprotein, a 22 kDa matrix protein, a 42 kDa surface glycoprotein named haemagglutinin-esterase with receptor-binding and receptor-destroying activity, and a 50 kDa surface glycoprotein with fusion activity, encoded by genome segments 3, 8, 6, and 5, respectively. Segments 1, 2, and 4 encode the putative viral polymerase subunits PB1, PB2, and PA. ORF1 of segment 7 encodes a nonstructural protein with interferon antagonistic properties, while ORF2 encodes a nuclear export protein. The smaller ORF1 of segment 8 encodes the matrix protein, while the larger ORF2 encodes an RNA-binding structural protein, also with interferon antagonistic properties [4, 9-11].

Regarding the genes coding for surface proteins, he genes exhibit a length polymorphism close to the 3’ end, called the highly polymorphic region (HPR), that has been related to virulence, although this relationship is not entirely clear [12-14]. In turn, f genes seem to be more conserved in length, with the exception of the region next to the putative protease-processing site (PPPS) [15, 16]. The fusion protein in ISAV is synthesized as a precursor protein designated F0. In order to exert its biological function, F0 must be cleaved by cellular proteases to generate F1 and F2 [17]. Several insertions of 10-12 amino acids (IN1-4), identical to sequences from other genomic segments, have been suggested to confer novel protease cleavage sites [15], and thus they could be subject to positive selection.

Unlike other members of the Orthomyxoviridae family that commonly infect mammals or birds, ISAV naturally infects fishes of the Salmonidae family. Although natural outbreaks of ISA have only been recorded in farmed Atlantic salmon (Salmo salar), the virus has also been isolated from rainbow trout (Oncorhynchus mykiss) [3] and Coho salmon (Oncorhynchus kisutch) in Chile [18]. The disease was first reported in Norway and the virus was associated with the disease in 1984. At present, ISAV has been reported in Canada (New Brunswick in 1996 and Nova Scotia in 2000), the USA (Maine in 2001), Scotland (1998), the Faroe Islands (2000; Denmark), and recently in Chile [6, 19, 20]. Because of its worldwide distribution and the potential for serious economic losses, ISAV has been listed as a non-exotic disease for the European Union (EU) and is therefore monitored closely by the European Community Reference Laboratory for Fish Diseases (http://www.crl-fish.eu/).

In general, phylogenetic studies of infectious diseases have centered on human and zoonotic diseases and have paid little attention to enzootic pathogens, with a few exceptions such as viruses causing crop-related plant diseases [21-24]. Given their human health impact, phylogenetic studies of orthomyxoviruses have been largely devoted to influenza viruses, in particular influenza A [25, 26]. To date, questions related to global distributions, molecular evolution and adaptations, sources of genetic diversity, and emergence of new isolates have been extensively addressed in the influenza literature. However, little has been done with its relatives, e.g., ISAV, even though the same genetic structure and, presumably, similar evolutionary forces might be acting on ISAV populations.

In 1999 an ISAV belonging to the North American (NA) genotype was isolated from Coho salmon in Chile, and the first outbreak of ISAV in marine-farmed Atlantic salmon in the Southern Hemisphere occurred in the same country starting in June 2007. Given that there are no known natural hosts for ISAV in Chile, only a few studies have investigated whether the outbreak source came from North America or Europe by using basic phylogenetic methods [6, 15, 19, 20].

In this study, we apply more sophisticated phylogenetic and population genetic methods to currently available sequence data to address questions regarding the 2007 Chilean ISAV outbreak. In particular, we asked (i) whether recombination is a relevant force acting on the f and he genes, (ii) whether the age estimates of the Chilean clade agree with epidemiological data, and (iii) whether population processes leave marks in ISAV genomes that allow us to reconstruct its demographic history. Finally, we developed improved structural homology models onto which positively selected sites were mapped in order to relate them to functional traits.

Results and discussion

We included 70 isolates from Canada, Chile, and Norway for which both genes, f and he, were available along with their sampling dates. The lengths of the final alignments were 1365 and 1269 bp, respectively. We selected TVM as the best-fit nucleotide substitution model for both genes based on the Akaike Information Criterion (AIC) [27]. Rate heterogeneity was estimated as the gamma-distributed shape parameter α = 0.5740 and 0.6570 for f and he, respectively. We also tested whether the data were best explained by a strict or a relaxed molecular clock. Bayes factor hypothesis testing favored the relaxed model in both cases (10.44 and 21.82 log BF for f and he, respectively). This was further supported by the coefficient of variation and ucld.stdev parameters.
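The two model comparisons described above can be illustrated with a minimal sketch. It assumes that the AIC is computed as 2k - 2 ln L and that the log Bayes factor is the difference between the log marginal likelihoods of the relaxed and strict clock models. All likelihood values and parameter counts below are hypothetical placeholders, not values from this study, and the function names are ours.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion, AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model_by_aic(fits: dict) -> str:
    """Return the model name with the lowest AIC.

    `fits` maps a model name to (log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda name: aic(*fits[name]))


def log_bayes_factor(log_ml_model_a: float, log_ml_model_b: float) -> float:
    """Log Bayes factor of model A over model B from log marginal likelihoods."""
    return log_ml_model_a - log_ml_model_b


# Hypothetical fits: (log-likelihood, free substitution-model parameters).
fits = {
    "JC": (-3100.0, 0),
    "HKY": (-3050.0, 4),
    "TVM": (-3020.0, 7),
    "GTR": (-3019.8, 8),
}
print(best_model_by_aic(fits))

# Hypothetical log marginal likelihoods for relaxed and strict clocks.
print(round(log_bayes_factor(-3000.0, -3010.44), 2))
```

In this toy comparison GTR has a slightly higher likelihood than TVM, but the gain does not offset the penalty for its extra parameter, which is the kind of trade-off the AIC is meant to capture.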

Genetic diversity and recombination

In order to explore the data and to see whether recombination plays a recognizable role in ISAV evolution, we estimated the population genetic parameter theta (θ) and the recombination rate (r) for each population. When calculated under the infinite-sites assumption, the Canadian and Chilean populations were more diverse than the older Norwegian population (0.08579, 0.06247, and 0.01809, respectively). In turn, when using a coalescent sampler estimator, the Norwegian population turned out to be, as expected, more than three times as diverse (0.04774; 95% CI 0.0198-0.08448) as the young Chilean population (0.0142; 95% CI 0.008368-0.020745), and more diverse than the Canadian population (0.03523; 95% CI 0.021083-0.051501) as well. Although at first this seems contradictory, coalescent-based estimators of genetic diversity are far more reliable than those based on segregating sites [28]. By taking into account evolutionary history, coalescent-based estimates are more robust to model assumptions and allow us to determine factors influencing the observed patterns (see below). This overall result has practical implications: less diverse populations, i.e., genetically homogeneous ones, would be more susceptible to vaccines and antiviral treatments, whereas more structured populations, i.e., those in which individuals are genetically similar within populations but different among populations, make effective vaccine and antiviral deliveries more difficult. Population genetic inferences can therefore be instrumental in fish health surveillance programs, as demonstrated in human health settings [29-32].
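As an illustration of a segregating-sites estimator of θ, the sketch below implements Watterson's estimator per site, θ_W = S / (a_n L), where S is the number of segregating sites, L is the alignment length, and a_n is the sum of 1/i for i from 1 to n - 1 over n sequences. The text does not state which infinite-sites estimator was used, so the choice of Watterson's estimator is an assumption. The toy alignment is hypothetical.

```python
def segregating_sites(alignment):
    """Count alignment columns with more than one distinct nucleotide.

    Gaps and ambiguity codes are ignored when comparing states.
    """
    valid = set("ACGT")
    count = 0
    for column in zip(*alignment):
        states = {base.upper() for base in column if base.upper() in valid}
        if len(states) > 1:
            count += 1
    return count


def watterson_theta(alignment):
    """Watterson's estimator of theta per site: S / (a_n * L)."""
    n = len(alignment)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(alignment[0])
    if length == 0 or any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    a_n = sum(1.0 / i for i in range(1, n))  # harmonic number H_(n-1)
    return segregating_sites(alignment) / (a_n * length)


# Hypothetical toy alignment: 4 sequences, 8 sites, 2 segregating sites.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
print(watterson_theta(toy))
```

Because this estimator depends only on the count of variable sites and the sample size, it ignores the genealogy of the sample, which is one reason it can disagree with coalescent-based estimates.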

Figure 1. Schematic representation mapping of positively selected sites and tertiary structure models for both ISAV surface proteins.

(A) Structural model of the HE protein from ISAV by homology with the HEF protein from influenza C virus (PDB ID 1FLC) with V134 and S340 residues in Van der Waals representation. (B) Alignment of secondary structural prediction (using PSIPRED) for sequence and model of ISAV HE against HEF protein. The alpha helix structural prediction is shown as red cylinders and beta strands as yellow arrows. (C) Structural model of the Fusion protein from ISAV 901 (without insertion) by homology with HEF protein from influenza C virus (PDB ID 1FLC) and HE from Influenza A virus (PDB ID 3EYJ). (D) Superposition of models of the ISAV fusion protein with and without different insertions. (E) Zoomed image of the alpha helix that contains putative fusion peptides and different insertions, with the A279 residue in Van der Waals representation. (F) Alignment of secondary structural prediction (using PSIPRED) for different sequences and models of ISAV Fusions against HEF protein. The alpha helix structure prediction is shown as red cylinders and beta strands as yellow arrows. Positively selected sites were deemed as such when the dN/dS rate ratio was greater than 1 with a p-value less than 0.1. Amino acid numbering is according to ISAV752 09 (ADF36500 for Hemagglutinin and ADF36499 for Fusion). Black arrows denote positively selected sites. Blue labels indicate landmarks in the genes, i.e., GS, glycosylation site; IN, insertion; RB, recombination breakpoint; PPPS, putative protease-processing site; HPR, Highly Polymorphic Region; TM, Transmembrane domain.

Negative-sense (-) ssRNA viruses are less prone to recombination than their (+) sense counterparts [33, 34]. In principle this would arise because (-) ssRNA genomes cannot serve as templates for negative-sense strand synthesis once the genome is uncoated, which renders copy-choice recombination less likely. In addition, (-) ssRNA genomes are generally protected by viral nucleoproteins that prevent the genome from being exposed to the cytoplasm (and cell machinery) and from coming into close contact with other genome copies, which also makes copy-choice recombination unlikely. Also, the segmented genome structure of some (-) ssRNA viruses further reduces the role of recombination, because shuffling segments (reassortment) yields the same outcome. In agreement with this hypothesis, our results showed a low overall recombination force, r = 1.173×10⁻³ (1.90×10⁻⁴ - 3.50×10⁻³), expressed as the ratio of the recombination rate per site to the substitution rate per site. However, recombination does happen in human [35-37] and avian [38] strains of influenza viruses as well as in some segments of ISAV, namely segments 6 and 5 [12-14]. We therefore tested whether recombination breakpoints could be detected in our dataset. Using a suite of methods [39], we consistently detected one possible recombination breakpoint in the vicinity of the he HPR. This result was confirmed by substitution and phylogenetic algorithms and was further corroborated by phylogenetic topological incongruence using a genetic algorithm (GARD; Figure 1B). Since full-length HPR (HPR0) has been detected only in natural populations (not virulent) and different HPR alleles co-occur in the same fish-farming location, these results support the hypothesis that the HPR is the product of template switching during replication (unequal recombination), and open the possibility of testing it in controlled experimental settings. Early indications of recombination in the he gene were reported in 2001, when researchers looked for topological incongruence and possible recombination patterns ’by eye’ within Norwegian he sequences [40]. Interestingly, they found that recombination was associated with the geographic location of the isolates and that it occurred at either side of the HPR. Our results provide statistical support for this initial observation, as we detected a topological incongruence at nucleotide 900 (KH-adjusted p-value = 2×10⁻⁴; Figure 1B). We also detected recombination breakpoint signals at either side of the HPR; however, it was not possible to detect parental sequences with confidence (RDP, MaxChi, Chimaera, and GENECONV; Figure 1B). This evidence contradicts the hypothesis of Cunningham et al. [41] that the most plausible mechanism of recombination is that each HPR comes from partial deletions of a precursor HPR0.

Regarding the f gene, we did not detect any recombination breakpoints. At first this was surprising, since small non-homologous fragments have been observed at around positions 266-276, which might confer novel protease cleavage sites (IN1-4) and therefore have a fitness impact. We think there are at least two potential reasons for our inability to detect recombination in this gene region. First, small fragments are hard to detect with current methods because of the low information they contain. Second, the methods are not designed to explicitly look for non-homologous recombination. Also, recombination would be hard to detect if the INs are evolutionarily linked to the f gene and/or have similar substitution rates.

Phylogenetic inference and past population dynamics

In order to estimate phylogenetic relationships, we used Bayesian phylogenetic methods to test the Chilean outbreak isolates for monophyly, to estimate rates of molecular evolution across the genomes, and to estimate divergence times for clades of particular interest (e.g., geographically isolated outbreaks). Recombination has detrimental effects on phylogeny estimation and phylogeny-based analyses [34, 42, 43], and thus the recombinant portion of the he gene was not included in these analyses. Segmented viruses have the potential to reassort their segments in novel combinations, which in phylogenetic terms has the same effect as recombination. Therefore, we estimated phylogenies and molecular evolution parameters separately for each gene, since concatenating both datasets would yield inaccurate inferences if ISAV exhibits reassortment between segments.

In the case of the f gene, the Chilean outbreak clade was inferred as monophyletic, suggesting that ISAV was introduced once into Chilean fish farming (Figure 2). However, the he gene tree (additional file 2) shows that outbreak isolates fall into two clades, i.e., they are paraphyletic: one clade is composed of most of the outbreak isolates and the other is composed of three isolates that are also closely related to isolates from Norway (VT11282007-38 / -39 and 2006B13364 in additional file 2). For both genes, posterior probability support was close to or equal to 1, both for the monophyletic f outbreak clade and for the paraphyletic he outbreak clades. We speculate that this might be the consequence of reassortment, as he genes from Chile exhibit two evolutionary pathways. Interestingly, the HPR0 allele (SK779/06HPR0 in Figure 2) is suggested to be the precursor of all HPR alleles by differential deletion of the polymorphic region [12]. However, in both gene trees, SK779/06HPR0 (not virulent) does not share a most recent common ancestor with all the alleles, although it does with the Chilean outbreak alleles. This result suggests that the ancestor of the HPR0 allele is not necessarily the precursor of all modern HPRs. Nevertheless, this result agrees with the hypothesis that the ISAV that finally led to the Chilean outbreak was not virulent [20].

Figure 2 - Bayesian Phylogenetic Inference for ISAV f gene

Maximum clade credibility (MCC) phylogeny depicting the evolutionary relationships of isolates from Europe, Canada and Chile. Colors indicate ancestral locations as inferred from Ancestral State Reconstruction (ASR). Red = Chile, Blue = Canada and Yellow = Europe. Support is expressed as posterior probabilities for major nodes. Inset = boxplots of evolutionary rates for the f and he genes and density plots for node age estimates (black dashed line = Chilean he minor clade; black line = Chilean he major clade; grey line = Chilean f clade). Isolate NBISA01 belongs to the Canadian clade and it was reported in Chile in 2001.

Overall, these results are in rough agreement with previously published phylogenies obtained using other methods. However, our analysis fully resolved relationships in crown nodes and allowed us to estimate molecular evolutionary parameters while taking phylogenetic and parameter uncertainty into account. We focused on two events in the evolutionary history of the virus: the time at which Norwegian and Chilean isolates diverged, and the time at which the Chilean epidemic clade expanded. Whether it was in Norway or in Chile, the most recent common ancestor (MRCA) of Norwegian and Chilean isolates was estimated to have existed in 1995 [1992-1996, 95% Highest Posterior Density (HPD)] and the Chilean clade expansion was dated to 2003 (2001-2005, 95% HPD) for the f gene. For the he gene, in turn, the corresponding estimates were 1989 (1982-1994, 95% HPD) and 2001 (1996-2004, 95% HPD). This is in agreement with reported distance-based analyses for the f gene but not for the he gene [20]. However, those methodologies do not provide confidence estimates in a probabilistic framework, which, in turn, would allow for meaningful comparisons. For instance, our estimates show that the two segments have different evolutionary histories, since they coalesce at significantly different points back in time (0.83 and 0.94 posterior probabilities for the he and f Chilean clades, respectively). Together with the paraphyly of the he outbreak clade, the results suggest that the ancestor of Norwegian and Chilean isolates underwent reassortment of at least both surface protein genes. Ancestral State Reconstruction (ASR) clearly shows that Chilean isolates came from Norway, as previously hypothesized, and this result is consistent across both genes (color-coded in Figure 2 and additional file 2).
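As an illustration of how interval estimates such as the 95% HPD values above are summarized from MCMC output, the following minimal Python sketch computes the narrowest interval that contains a given fraction of posterior samples. The function name and the toy sample values are hypothetical and are not taken from the analyses in this study; Tracer reports these intervals directly.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return (low, high) of the narrowest interval holding `mass` of the samples.

    Assumes a unimodal posterior, so that the HPD region is a single interval.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of consecutive sorted samples that must fall inside the interval.
    k = max(1, math.ceil(mass * n))
    best_low, best_high = xs[0], xs[k - 1]
    # Slide a window of k samples and keep the narrowest one.
    for i in range(1, n - k + 1):
        low, high = xs[i], xs[i + k - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Hypothetical toy "posterior sample" of node ages (not real estimates).
toy_samples = [0, 10, 11, 12, 13, 14, 30, 40, 50, 60]
print(hpd_interval(toy_samples, mass=0.5))
```

Unlike an equal-tailed interval, the HPD interval is not forced to cut the same probability from each tail, which matters for skewed posteriors such as node ages.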

Like other (-) ssRNA viruses, ISAV exhibited extremely high substitution rates. Similar to influenza viruses, the ISAV replication machinery lacks proofreading activity, which has been proposed as the source of the increased variability in RNA viruses [33, 44]. Accordingly, substitution rates were estimated at 7.83 × 10⁻⁴ substitutions per site per year (s/s/y) (4.31 × 10⁻⁴ to 1.18 × 10⁻³, 95% HPD) for the f gene and 1.039 × 10⁻³ s/s/y (3.87 × 10⁻⁴ to 1.86 × 10⁻³, 95% HPD) for the he gene (Figure 2).

An interesting aspect of phylogenetic analyses is the ability to infer processes from patterns along the time axis. We hypothesized that if ISAV is such a fast-evolving entity, its genome must contain information that could be used to infer ecological and/or population processes. For ISAV, we inferred a population-level bottleneck, i.e., a population going through one or more generations of small size followed by population growth, perhaps associated with the outbreak it recently experienced around 2007, when it was first officially reported in Chile (www.sernapesca.cl). We reconstructed substitution rate-scaled effective population size (Nef) through time to see whether it is possible to recover that bottleneck using the non-parametric Bayesian Skyline Plot (BSP) model [45] (Figure 3). Interestingly, the analysis shows a strong reduction in population size around 2007 followed by nearly exponential growth. In particular, the greatest reduction in population size was inferred to have occurred in 2006.42 for the f gene and in 2006.02 for the he gene (Figure 3). One assumption of analyses based on the coalescent, and thus of the BSP, is that the dataset comprises one non-structured population. Although not explored in depth in the literature, population structure could, in principle, lead to biased estimates of substitution rates and associated parameters. This, in turn, can produce patterns in the BSP that reflect changes in the degree of structure instead of changes in the overall population size, especially when using a single marker [46, 47]. To address this potential problem, we calculated the Parsimony Score (PS) and Association Index (AI) for both genes under study. These statistics aim to assess the degree of association of a discrete character, geographic location in this case, with a posterior distribution of trees. Both PS and AI statistics were significant (p-value < 0.05), suggesting that it is possible to reject the null hypothesis of one non-structured population. Considering that two independent genes showed similar patterns, that the BSP correlates well with surveillance data, and that our substitution rate estimates agree with those of others [20, 48], we think it is safe to assume that these results reflect a population bottleneck. Certainly, simulation studies will help to assess how robust BSP estimates are to population structure.
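The Parsimony Score used above can be sketched for a single tree as follows. This is a minimal illustration, assuming a strictly bifurcating tree encoded as nested tuples and hypothetical tip names; it counts the minimum number of location changes by Fitch parsimony. Significance testing as performed in BaTS additionally compares the scores observed over the posterior sample of trees against a null distribution obtained by randomizing tip states, which is not shown here.

```python
def fitch_parsimony_score(tree, states):
    """Minimum number of state changes on a bifurcating tree (Fitch parsimony).

    tree   -- nested 2-tuples whose leaves are tip names (strings)
    states -- dict mapping each tip name to its discrete state (e.g. country)
    """
    def visit(node):
        if isinstance(node, str):          # a tip: its state set is known
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        shared = left_set & right_set
        if shared:                         # children agree: no extra change
            return shared, left_cost + right_cost
        # children disagree: one change is needed on this node's descendants
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical tips: two Norwegian, two Chilean, one Canadian isolate.
locations = {"N1": "Norway", "N2": "Norway",
             "C1": "Chile", "C2": "Chile", "K1": "Canada"}
clustered = ((("N1", "N2"), ("C1", "C2")), "K1")   # locations form clades
scattered = ((("N1", "C1"), ("N2", "C2")), "K1")   # locations interleaved
print(fitch_parsimony_score(clustered, locations))
print(fitch_parsimony_score(scattered, locations))
```

A lower score than expected under randomized tip labels indicates that isolates from the same location cluster together on the tree, i.e., that the sample is geographically structured.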

Although the suggestion is controversial, Vike et al. [19] proposed that the source of ISAV in the Chilean outbreak was infected eggs brought from Norway, which could be explained by the fact that until 2008 Norway was the main provider of embryonated eggs (www.sernapesca.cl). It is clear that the Chilean outbreak clade has a sister relationship with Norwegian isolates (closely related to ISAV8 (97/09/615); see Figure 2), that the basal clades and ancestral nodes in the phylogeny are from Norway, and that the tMRCA is around 1989-1996, which matches the date when egg importation grew from less than 20 million to 50 million imported units (www.sernapesca.cl). In addition, the BSP analysis shows that the ISAV population has increased exponentially since 2006, suggesting that the virus population remained constant from 2001 (tMRCA of the Chilean outbreak clade) until then and that the outbreak was publicly reported one year after this predicted population expansion. These results constitute valuable information for fish health officials, who can make informed future decisions based on the contingency measures that were implemented at the time of transmission from Norway to Chile. For instance, it is very interesting to note that in 2011 the main ISAV genotype detected in Chile corresponded to HPR0 (www.sernapesca.cl). Conversely, HPR7b was nearly 80% prevalent between epidemic years 2007 and 2008 [20]. This could be attributed to the change of source material from Norway to Iceland in 2009. For a future study, it would be interesting to address the origin of newly detected avirulent HPR0 strains that would come from a country other than Norway.

Figure 3 - Bayesian Skyline Plot reconstruction for the f and he genes.

Black and red lines represent median estimates of effective population size, scaled by generation time in years (τ), for the f and he genes, respectively. Grey and yellow lines represent the 95% Highest Posterior Density (HPD) intervals for the f and he genes, respectively.

Selection analysis and structure models

In order to find potentially relevant sites in the primary sequence of both surface genes and to determine possible structure-function relations, we used a two-rate Fixed-Effects Likelihood (two-rate FEL) approach to find positively selected sites (dN/dS rate ratio > 1). In a gene-based approach, we found that both genes are evolving under purifying selection (dN/dS rate ratio < 1), as expected for fast-evolving entities with relatively small genomes (0.2038 and 0.1382 for the he and f genes, respectively).
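The dN/dS ratio above rests on the distinction between nonsynonymous changes (which alter the encoded amino acid) and synonymous changes (which do not). The short Python sketch below illustrates only that distinction for a pair of codons, assuming the standard genetic code; it is not the FEL method itself, which fits codon substitution models by maximum likelihood, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitution(codon_a, codon_b):
    """Label a codon pair as 'identical', 'synonymous' or 'nonsynonymous'."""
    a, b = codon_a.upper(), codon_b.upper()
    if a == b:
        return "identical"
    if CODON_TABLE[a] == CODON_TABLE[b]:
        return "synonymous"
    return "nonsynonymous"


# GTT and GTC both encode Val; CTT encodes Leu.
print(classify_substitution("GTT", "GTC"))
print(classify_substitution("GTT", "CTT"))
```

A site with a Val/Leu polymorphism, for example, would contribute to the nonsynonymous rate, whereas third-position changes within the Val codon family would contribute to the synonymous rate.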

Using a site-by-site approach, we found that the he gene has two sites under positive selection, at codon positions 134 and 340 (Figure 1B; numbering according to the ISAV752 isolate). Amino acid Val/Leu134 is located on the globular distal domain (Figure 1A), which is related to the main function of the HE protein, i.e., cell receptor recognition [49]. Amino acid Ser/Arg/Gly/Ala340 lies upstream of the HPR region, located in the stalk of the protein (Figure 1A), which prompted us to hypothesize that it might be involved in recombination. Interestingly, the expression of recombinant ISAV HEΔ339-351 results in decreased expression levels and in impaired salmon serum recognition and hemadsorption activity. In contrast, HEΔ349-351 shows normal parameters [49], suggesting that the amino acid at position 340 is under positive selection because of its importance for HE activity.

We also found three sites under positive selection in the f gene, within codons that code for amino acid residues 130, 170, and 279 (Figure 1F). It was not possible to reliably model the structure of the f gene protein product. However, we were able to map amino acid Ala/Gly/Thr279, which lies immediately downstream of insertions (INs) [6]. This is an interesting finding given that insertions in this region have been reported to confer novel PPPS [15], as outlined in Figure 1F. In addition, Ala/Gly/Thr279 lies immediately downstream of Arg278 (Arg267 in ISAV fusion protein without insertion), a putative cleavage site [17] that has been suggested as a virulence factor [50]. Conversely, a recent study suggested that substitution of Gln/His266 by Leu266 (two positions upstream of our predicted site) in ISAV fusion protein without insertion is associated with increased virulence, as seen in the Chilean ISAV 901 strain [13, 15]. This has epidemiological implications, as isolates carrying these elements would have an advantage over those that do not. This finding predicts that the frequency of these alleles will increase in the populations where they are present, e.g., IN4, which is only found in Chile with almost 80% prevalence. It has been suggested that INs close to the cleavage site expose this region to the solvent, probably helping in recognition and processing. Our improved model shows that insertions lie inside the alpha helix structure, which is enlarged with the subsequent addition of residues, leading to the consequent exposure of PPPSs (Figure 1D and 1E), in comparison with ISAV901 fusion protein without insertion (Figure 1C). These findings also support the notion that INs serve as virulence determinants. For instance, the unique insertion, IN4, provides a cleavage site for Asp-N endopeptidase metalloprotease, which has been detected in the skin mucosa of salmonids [15]. Since the HPR in he as well as the INs in f are associated with virulence and are also under natural selection pressure, we think that experimental confirmation of these results will lead to a better understanding of molecular determinants of pathogenicity for both surface proteins.

Conclusions

In summary, this study investigated the evolutionary history of ISAV, an economically important infection for salmon culture worldwide. Although less exploited in non-human diseases, phylogenetic and phylodynamic analyses characterize pathogens in space and time, which makes them well suited for surveillance programs. For instance, identification of the origin of an outbreak could aid officials in formulating management policies. Though not pursued in this study, phylogeographic analyses could help identify virus sources and venues from which pathogens are spreading into different populations or fisheries. Reconstructing past population dynamics could help monitor how an infectious agent is changing or reacting to sanitary interventions in terms of its effective population size. Also, examining molecular sequences for positive selection, all in a structural context, can identify determinants of virulence or other molecular adaptations.

In this study we found that ISAV, like its relatives, (i) exhibits a low recombination rate, though recombination is present in the he surface gene. We hypothesized that HPR region diversity is governed by recombination, probably due to template switching. (ii) We also provided more evidence in favor of the Norwegian origin hypothesis in a coherent phylogenetic and statistical framework. (iii) Contrary to the widely accepted view [41], the ISAV HPR0 genotype is not the precursor of all ISAV he genotypes, but appears to share a common ancestor with the Chilean outbreak clade. Finally, (iv) we identified potential determinants of fitness, recognized by being under positive selection in a structural context; the protein models allowed us to relate the results of phylogenetic studies to the predicted structures, which suggest an important structural function for the predicted positively selected sites. Our analysis shows that the ISAV outbreak remained undetected in Chile from 2001 onward and that the population started to increase prior to the identification of the outbreak. Thus, we demonstrate the utility of evolutionary analyses in characterizing temporal, spatial, and mechanistic aspects of ISAV outbreaks, as well as their use in predicting future outbreaks of ISAV and other pathogens.

Undoubtedly, collaboration between basic experimental research and phylogenetic fields will help us understand the dynamics of ISAV spread in space and time, as it has proven fruitful in studies of human health and emerging infectious diseases.

Methods

Sequences and alignment

For this study, we focused on the two genes with the most sequences available, which have also been linked to virulence and used extensively for genotyping, namely the Fusion (f) and Hemagglutinin-esterase (he) genes. Sequences dating from 1989 to 2009 were obtained from GenBank according to several criteria. First, in order to implement a molecular clock, we selected only full-length sequences with reported isolation dates. Second, we only used isolates for which both target genes were sequenced. After these criteria were applied, seventy full-length sequences from each gene were retrieved from GenBank (accession numbers in additional file 3) from strains isolated in Canada, Norway, and Chile. To avoid alignment artifacts and to arrive at an optimal statement of homology, nucleotide sequences were visualized in SeaView 4.2.7 [51] and then converted to amino acid sequences for alignment with MAFFT [52] using the L-INS-i algorithm, thereby preserving the open reading frame of these protein-coding sequences. Amino acid sequences were then back-translated into their original nucleotide sequences for further analyses.
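The back-translation step can be sketched as threading each original coding sequence onto its aligned protein sequence, so that every alignment gap becomes a three-nucleotide gap and codons stay intact. The following Python function is a minimal illustration under the assumption that the coding sequence contains exactly one codon per aligned residue (no trailing stop codon); the function name and the toy sequences are hypothetical and do not reproduce the tools used in this study.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an unaligned coding sequence onto a gapped protein alignment row."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if residues != len(codons):
        raise ValueError("protein and coding sequence lengths disagree")
    codon_iter = iter(codons)
    # Each gap column expands to three gap characters; each residue
    # consumes the next original codon, so no nucleotide is altered.
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)


# Toy example: protein row "M-VL" aligned, original CDS ATG GTT CTT.
print(thread_codons("M-VL", "ATGGTTCTT"))
```

Because the original nucleotides are reused, and not regenerated from the amino acids, synonymous variation among isolates is retained in the codon alignment.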

Genetic diversity and recombination

Genetic diversity (θ) within populations was estimated both by the traditional segregating sites approach of Watterson [53] (implemented in DnaSP v5 [54]) and by a coalescent Bayesian Markov Chain Monte Carlo method (BMCMC [55]; implemented in LAMARC 2.1.3 [56]). Recombination breakpoint analysis was performed with a suite of methods that look for different types of recombination evidence (RDP, GENECONV, MaxChi, and Chimaera as primary methods, and Bootscanning and SiScan as secondary methods; summarized in [57]), as implemented in RDP3 [39]. Window and step sizes were varied to refine primary evidence of recombination, as suggested in the user manual. In addition, recombination detection was performed on both genes using a genetic algorithm approach that looks for topological incongruence (GARD) [58], as implemented on the datamonkey.org server [59].
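For reference, Watterson's estimator divides the number of segregating sites S by the harmonic number a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of sequences. The Python sketch below computes it for a small alignment; it is a minimal illustration that assumes equal-length sequences without gaps or ambiguous characters, and the function name and toy alignment are hypothetical (DnaSP was the program used for the estimates reported here).

```python
def watterson_theta(sequences):
    """Return (theta per locus, theta per site) from aligned sequences."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # A site is segregating if more than one nucleotide is observed there.
    segregating = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    # Harmonic number a_n = sum_{i=1}^{n-1} 1/i.
    a_n = sum(1.0 / i for i in range(1, n))
    theta_locus = segregating / a_n
    return theta_locus, theta_locus / length


# Toy alignment: 4 sequences, 4 sites, 2 of them segregating.
toy = ["AAAA", "AAAT", "AATT", "AAAA"]
print(watterson_theta(toy))
```

Dividing by the alignment length gives the per-site value, which is the form usually compared across genes of different lengths.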

Phylogenetic inference, molecular clock, substitution rates and demographic history

The best-fit model of evolution was selected under the Akaike Information Criterion (AIC) [27], and parameter values were obtained using model averaging, as implemented in jModelTest 1.0.1 [60]. The best-fit model was used for subsequent analyses; nevertheless, base frequencies and rates were co-estimated with the phylogeny. A phylogenetic tree was inferred using Bayesian Inference (BI) as the optimality criterion, as implemented in the BEAST 1.6.1 package [61]. Posterior probabilities (PP) were approximated by running four independent analyses of 1 × 10⁸ steps, each sampled every 1 × 10⁵ steps. A burn-in period was defined using Tracer [57] by testing for convergence on likelihood scores. Burn-in was defined as the period before convergence, and the samples from that period were discarded accordingly. Also, we assessed convergence and mixing by checking the effective sample size statistic (ESS > 500) and the trace itself. A strict molecular clock was rejected on the basis of Bayes Factor hypothesis testing between an uncorrelated lognormal clock (H1) and a strict clock (H0), as implemented in Tracer [62]. A lognormal prior with a mean centered on the reported Influenza A HA gene substitution rate [26] and tip dates were used to calibrate the clock. The same analysis was run 'on empty' to assess the influence of the selected priors. The extent of geographic structure was tested by estimating the Association Index (AI), Parsimony Score (PS), and Maximum Clade number (MC) statistics in BaTS [63]. Past population dynamics of both genes were inferred using the Bayesian Skyline Plot (BSP) model as a tree prior [45]. Ancestral State Reconstruction (ASR) was performed using the method of Lemey et al. [64] implemented in BEAST, as described online (http://beast.bio.ed.ac.uk/Tutorials#Phylogeography_tutorials).
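The clock-model comparison above reduces to a difference between the log marginal likelihoods of the two models. The Python sketch below shows that arithmetic together with the commonly used Kass and Raftery evidence categories on the 2 ln(BF) scale. Both the choice of that scale and the numerical values are assumptions made for illustration; they are not the thresholds or estimates used in this study, where the marginal likelihoods were obtained from Tracer.

```python
def two_ln_bayes_factor(ln_ml_h1, ln_ml_h0):
    """Return 2 * ln(BF) for H1 over H0 from log marginal likelihoods."""
    return 2.0 * (ln_ml_h1 - ln_ml_h0)


def interpret_evidence(two_ln_bf):
    """Evidence for H1 on the Kass and Raftery scale (an assumed convention)."""
    if two_ln_bf < 0:
        return "favors H0"
    if two_ln_bf < 2:
        return "not worth more than a bare mention"
    if two_ln_bf < 6:
        return "positive"
    if two_ln_bf < 10:
        return "strong"
    return "very strong"


# Hypothetical log marginal likelihoods: relaxed clock (H1) vs strict clock (H0).
value = two_ln_bayes_factor(-5200.0, -5212.5)
print(value, interpret_evidence(value))
```

With these made-up numbers the relaxed clock would be preferred; the same function applied with the arguments swapped returns a negative value, indicating support for the strict clock instead.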

Selection analyses and structure models

Both genes, he and f, were investigated for evidence of positive selection. First, a global analysis was performed using a counting method (SLAC). Then, a site-by-site analysis was performed using the Fixed-Effects Likelihood method as implemented in HyPhy 2.0 [65, 66]. Differences in selective pressures between the Chilean clade and the rest of the tree were estimated with the SelectionLRT.bf script, also within the HyPhy 2.0 package.

To assess whether selected sites lie in structurally and functionally relevant regions of the proteins, f and he 3D structure models were built using the following protocol. The sequences of the he and the f genes were obtained from GenBank (ADF36510.1; ABG65799.1; AAX46277.1; AAX46259.1; ADF36509.1 and ADF36499.1). Alignments for the HE protein were obtained with MODELLER 9v9 [67], and for F the CLUSTALW server was used [68]. To verify that the alignments were suitable, we used the secondary structure prediction server PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/) [69]. In order to generate the F protein model, we used two templates: the hemagglutinin-esterase-fusion glycoprotein structure of influenza C virus (PDB ID: 1FLC; 3.20 Å) and the hemagglutinin-fusion glycoprotein structure of influenza A virus (PDB ID: 3EYJ; 2.60 Å), with identities of 21% and 20%, respectively. For the HE protein model, we used the hemagglutinin-esterase-fusion glycoprotein structure of influenza C virus (PDB ID: 1FLC), with 19% identity, as template. Finally, both models were minimized (50,000 steps) and a trajectory of 1 ns was obtained by molecular dynamics using the CHARMM force field [70] in the NAMD program [71]. The quality of the models was tested with the ANOLEA server [72].

References

1. Asche F, Hansen H, Tveteras R, Tveteras S: The Salmon Disease Crisis in

Chile. Mar Resour Econ 2009, 24(4):405-411.

2. Bouchard D, Keleher W, Opitz HM, Blake S, Edwards KC, Nicholson BL:

Isolation of infectious salmon anemia virus (ISAV) from Atlantic salmon in

New Brunswick, Canada. Dis Aquat Organ 1999, 35(2):131-137.

3. Geoghegan F: First isolation and identification of ISAV in Ireland. In: 6th

Annual Meeting of EU National Reference Laboratories for Fish Diseases.

Brussels, Belgium; 2002.

4. Kibenge FSB, Munir K, Kibenge MJT, Joseph T, Moneke E: Infectious salmon

anemia virus: causative agent, pathogenesis and immunity. Anim Health Res

Rev 2004, 5(1):65-78.

5. Lovely J, Dannevig B, Falk K, Hutchin L, MacKinnon A, Melville K, Rimstad E,

Griffiths S: First identification of infectious salmon anaemia virus in North

America with haemorrhagic kidney syndrome. Dis Aquat Organ 1999,

35(2):145-148.

6. Godoy M, Aedo A, Kibenge M, Groman D, Yason C, Grothusen H, Lisperguer A,

Calbucura M, Avendano F, Imilan M et al: First detection, isolation and

molecular characterization of infectious salmon anaemia virus associated

with clinical disease in farmed Atlantic salmon (Salmo salar) in Chile. BMC

Vet Res 2008, 4(4):28.

7. Kawaoka Y, Cox N, Haller O, Hongo S, Kaverin N, Klenk H, Lamb R, McCauley

J, Palese P, Rimstad E et al: Infectious Salmon Anaemia Virus - Eighth Report

of the International Committee on Taxonomy of Viruses. New York: Elsevier

Academic Press; 2005.

8. Cottet L, Rivas-Aravena A, Cortez-San Martin M, Sandino AM, Spencer E:

Infectious salmon anemia virus--genetics and pathogenesis. Virus Res 2010,

155(1):10-19.

9. García-Rosado E, Markussen T, Kileng O, Baekkevold ES, Robertsen B,

Mjaaland S, Rimstad E: Molecular and functional characterization of two

infectious salmon anaemia virus (ISAV) proteins with type I interferon

antagonizing activity. Virus Res 2008, 133(2):228-238.

10. Rimstad E, Mjaaland S: Infectious salmon anaemia virus. An orthomyxovirus

causing an emerging infection in Atlantic salmon. APMIS 2002, 110(4):273-282.

11. Kibenge FS, Xu H, Kibenge M, Qian B, Joseph T: Characterization of gene

expression on genomic segment 7 of infectious salmon anaemia virus.

Virology J 2007, 4:34-48.

12. Mjaaland S, Hungnes O, Teig A, Dannevig BH, Thorud K, Rimstad E:

Polymorphism in the infectious salmon anemia virus hemagglutinin gene:

importance and possible implications for evolution and ecology of infectious

salmon anemia disease. Virology 2002, 304(2):379-391.

13. Markussen T, Jonassen CM, Numanovic S, Braaen S, Hjortaas M, Nilsen H,

Mjaaland S: Evolutionary mechanisms involved in the virulence of infectious

salmon anaemia virus (ISAV), a piscine orthomyxovirus. Virology 2008,

374(2):515-527.

14. Kibenge FSB, Kibenge MJT, Wang Y, Qian B, Hariharan S, Sandi: Mapping of

putative virulence motifs on infectious salmon anemia virus surface

glycoprotein genes. J Gen Virol 2007, 88(11):3100-3111.

15. Cottet L, Cortez-San Martín M, Tello M, Olivares E, Rivas-Aravena A, Vallejos

E, Sandino AM, Spencer E: Bioinformatic Analysis of the Genome of

Infectious Salmon Anemia Viruses associated with outbreaks of high

mortality in Chile. J Virol 2010, 84(22):11916-11928.

16. Nylund A, Devold M, Karlsen M: Sequence analysis of the fusion protein gene

from infectious salmon anemia virus isolates: evidence of recombination and

reassortment. J Gen Virol 2006, 87(Pt 7):2031-2040.

17. Aspehaug V, Mikalsen AB, Snow M, Biering E, Villoing S: Characterization of

the infectious salmon anemia virus fusion protein. J Virol 2005, 79(19):12544-

12553.

18. Kibenge FSB, Grate ON, Johnson G, Arriagada R, Kibenge MJT, Wadowska D:

Isolation and identification of infectious salmon anaemia virus (ISAV) from

Coho salmon in Chile. Dis Aquat Organ 2001, 45(1):9-18.

19. Vike S, Nylund S, Nylund A: ISA virus in Chile: evidence of vertical

transmission. Arch Virol 2009, 154(1):1-8.

20. Kibenge F, Godoy M, Wang Y, Kibenge M, Gherardelli V, Mansilla S, Lisperger

A, Jarpa M, Larroquete G, Avendano F et al: Infectious salmon anaemia virus

(ISAV) isolated from the ISA disease outbreaks in Chile diverged from ISAV

isolates from Norway around 1996 and was disseminated around 2005, based

on surface glycoprotein gene sequences. Virology J 2009, 6:88.

21. Elena SF, Bedhomme S, Carrasco P, Cuevas JM, de la Iglesia F, Lafforgue G,

Lalić J, Pròsper A, Tromas N, Zwart MP: The evolutionary genetics of

emerging plant RNA viruses. Mol Plant Microbe Interact 2011, 24(3):287-293.

22. Castaño A, Ruiz L, Elena SF, Hernández C: Population differentiation and

selective constraints in Pelargonium line pattern virus. Virus Research 2011,

155(1):274-282.

23. Wu B, Alexandra, Liu Y, Zhou G, Wang X, Elena SF: Dynamics of molecular

evolution and phylogeography of Barley yellow dwarf virus-PAV. PloS One

2011, 6(2):e16896-e16896.

24. Lalić J, Agudelo-Romero P, Carrasco P, Elena SF: Adaptation of tobacco etch

potyvirus to a susceptible ecotype of Arabidopsis thaliana capacitates it for

systemic infection of resistant ecotypes. Philos Trans R Soc Lond B Biol Sci

2010, 365(1548):1997-2007.

25. Bhatt S, Holmes EC, Pybus OG: The genomic rate of molecular adaptation of

the human influenza A virus. Mol Biol Evol 2011, 28(9):2443-2451.

26. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, Holmes EC:

The genomic and epidemiological dynamics of human influenza A virus.

Nature 2008, 453(7195):615-619.

27. Akaike H: A new look at the statistical model identification. IEEE Trans

Automat Contr 1974, 19(6):716-723.

28. Felsenstein J: Estimating effective population size from samples of sequences:

inefficiency of pairwise and segregating sites as compared to phylogenetic

estimates. Genet Res 1992, 59(2):139-147.

29. Hughes GJ, Fearnhill E, Dunn D, Lycett SJ, Rambaut A, Leigh Brown AJ, UK

HIV Drug Resistance Collaboration: Molecular Phylodynamics of the

Heterosexual HIV Epidemic in the United Kingdom. PLoS Pathog 2009,

5(9):e1000590.

30. Bennett SN, Drummond AJ, Kapan DD, Suchard MA, Munoz-Jordan JL, Pybus

OG, Holmes EC, Gubler DJ: Epidemic Dynamics Revealed in Dengue

Evolution. Mol Biol Evol 2010, 27(4):811-818.

31. Tazi L, Pérez-Losada M, Gu W, Yang Y, Xue L, Crandall K, Viscidi R:

Population dynamics of Neisseria gonorrhoeae in Shanghai, China: a

comparative study. BMC Infectious Diseases 2010, 10(1):13.

32. Zehender G, De Maddalena C, Giambelli C, Milazzo L, Schiavini M, Bruno R,

Tanzi E, Galli M: Different evolutionary rates and epidemic growth of

hepatitis B virus genotypes A and D. Virology 2008, 380(1):84-90.

33. Holmes EC: The Evolutionary Genetics of Emerging Viruses. Annu Rev Ecol

Evol S 2009, 40:353-372.

34. Posada D, Crandall KA, Holmes EC: Recombination in evolutionary genomics.

Annu Rev Genet 2002, 36:75-97.

35. Boni MF, de Jong MD, van Doorn HR, Holmes EC: Guidelines for Identifying

Homologous Recombination Events in Influenza A Virus. PLoS One 2010,

5(5):e10434.

36. Boni MF, Zhou Y, Taubenberger JK, Holmes EC: Homologous Recombination

Is Very Rare or Absent in Human Influenza A Virus. J Virol 2008,

82(10):4807-4811.

37. He CQ, Han GZ, Wang D, Liu W, Li GR, Liu XP, Ding NZ: Homologous

recombination evidence in human and swine influenza A viruses. Virology

2008, 380(1):12-20.

38. Kosakovsky Pond SL, Posada D, Stawiski E, Chappey C, Poon AFY, Hughes G,

Fearnhill E, Gravenor MB, Leigh Brown AJ, Frost SDW: An Evolutionary

Model-Based Algorithm for Accurate Phylogenetic Breakpoint Mapping and

Subtype Prediction in HIV-1. PLoS Comput Biol 2009, 5(11):e1000581.

39. Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P: RDP3: a

flexible and fast computer program for analyzing recombination.

Bioinformatics 2010, 26(19):2462-2463.

40. Devold M, Falk K, Dale B, Krossøy B, Biering E, Aspehaug V, F N, A N: Strain

variation, based on the hemagglutinin gene, in Norwegian ISA virus isolates

collected from 1987 to 2001: indications of recombination. Dis Aquat Organ

2001, 47(2):119-128.

41. Cunningham CO, Gregory A, Black J, Simpson I, Raynard RS: A novel variant

of the infectious salmon anaemia virus (ISAV) haemagglutinin gene suggests

mechanisms for virus diversity. Bull Eur Assoc Fish Pathol 2002, 22(6):366-

374.

42. Posada D: Unveiling the molecular clock in the presence of recombination.

Mol Biol Evol 2001, 18(10):1976-1978.

43. Posada D, Crandall KA: The effect of recombination on the accuracy of

phylogeny estimation. J Mol Evol 2002, 54(3):396-402.

44. Domingo-Calap P, Sanjuán R: Experimental Evolution of RNA versus DNA

Viruses. Evolution 2011, 65(10):2987-2994.

45. Drummond AJ, Rambaut A, Shapiro B, Pybus O: Bayesian coalescent inference

of past population dynamics from molecular sequences. Mol Biol Evol 2005,

22(5):1185-1192.

46. Ho S, Shapiro B: Skyline-plot methods for estimating demographic history

from nucleotide sequences. Mol Ecol Resour 2011, 11(3):423-434.

47. Pannell JR: Coalescence in a Metapopulation with Recurrent Local

Extinction and Recolonization. Evolution 2003, 57(5):949-961.

48. Duffy S, Shackelton LA, Holmes EC: Rates of evolutionary change in viruses:

patterns and determinants. Nat Rev Genet 2008, 9(4):267-276.

49. Mikalsen AB, Sindre H, Mjaaland S, Rimstad E: Expression, antigenicity and

studies on cell receptor binding of the hemagglutinin of infectious salmon

anemia virus. Arch Virol 2005, 150(8):1621-1637.

50. Kibenge FS, Kibenge MJ, Wang Y, Qian B, Hariharan S, McGeachy S: Mapping

of putative virulence motifs on infectious salmon anemia virus surface

glycoprotein genes. J Gen Virol 2007, 88(Pt 11):3100-3111.

51. Gouy M, Guindon S, Gascuel O: SeaView Version 4: A Multiplatform

Graphical User Interface for Sequence Alignment and Phylogenetic Tree

Building. Mol Biol Evol 2010, 27(2):221-224.

52. Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in

accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33(2):511-

518.

53. Watterson GA: On the number of segregating sites in genetical models

without recombination. Theor Popul Biol 1975, 7(2):256-276.

54. Librado P, Rozas J: DnaSP v5: a software for comprehensive analysis of DNA

polymorphism data. Bioinformatics 2009, 25(11):1451-1452.

55. Kuhner MK, Yamato J, Felsenstein J: Estimating effective population size and

mutation rate from sequence data using Metropolis-Hastings sampling.

Genetics 1995, 140(4):1421-1430.

56. Kuhner MK: LAMARC 2.0: maximum likelihood and Bayesian estimation of

population parameters. Bioinformatics 2006, 22(6):768-770.

57. Posada D, Crandall KA: Evaluation of methods for detecting recombination

from DNA sequences: Computer simulations. Proc Natl Acad Sci U S A 2001,

98(24):13757-13762.

58. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD: GARD: a

genetic algorithm for recombination detection. Bioinformatics 2006,

22(24):3096-3098.

59. Delport W, Poon AFY, Frost SDW, Kosakovsky Pond SL: Datamonkey 2010: a

suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics

2010, 26(19):2455-2457.

60. Posada D: jModelTest: Phylogenetic Model Averaging. Mol Biol Evol 2008,

25(7):1253-1256.

61. Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by

sampling trees. BMC Evol Biol 2007, 7(1):214.

62. Suchard MA, Weiss RE, Sinsheimer JS: Bayesian Selection of Continuous-

Time Markov Chain Evolutionary Models. J Mol Evol 2001, 18(6):1001-1013.

63. Parker J, Rambaut A, Pybus OG: Correlating viral phenotypes with phylogeny:

Accounting for phylogenetic uncertainty. Infect Genet Evol 2008, 8(3):239-

246.

64. Lemey P, Rambaut A, Drummond AJ, Suchard MA: Bayesian Phylogeography

Finds Its Roots. PLoS Comput Biol 2009, 5(9).

65. Kosakovsky Pond SL, Frost SDW: Not So Different After All: A Comparison

of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol

2005, 22(5):1208-1222.

66. Kosakovsky Pond SL, Frost SDW, Muse SV: HyPhy: hypothesis testing using

phylogenies. Bioinformatics 2004, 21(5):676-679.

67. Sali A, Blundell TL: Comparative Protein Modelling by Satisfaction of Spatial

Restraints. J Mol Biol 1993, 234(3):779-815.

68. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the

sensitivity of progressive multiple sequence alignment through sequence

weighting, position-specific gap penalties and weight matrix choice. Nucleic

Acids Res 1994, 22(22):4673-4680.

69. Jones DT: Protein secondary structure prediction based on position-specific

scoring matrices. J Mol Biol 1999, 292(2):195-202.

70. Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M:

CHARMM: A program for macromolecular energy, minimization, and

dynamics calculations. J Comput Chem 1983, 4(2):187-217.

71. Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C,

Skeel RD, Kalé L, Schulten K: Scalable molecular dynamics with NAMD. J

Comput Chem 2005, 26(16):1781-1802.

72. Guex N, Peitsch MC: SWISS-MODEL and the Swiss-Pdb Viewer: An

environment for comparative protein modeling. Electrophoresis 1997,

18(15):2714-2723.


Chapter 6:

Population Genomics and Phylogeography of an Australian Dairy Factory Derived Lytic Bacteriophage

Abstract

In this study, we present the full genomic sequences and evolutionary analyses of a serially sampled population of 28 Lactococcus lactis-infecting phage belonging to the 936-like group in Australia. Genome sizes were consistent with previously available genomes, ranging in length from 30.9 kbp to 32.1 kbp, and the genomes consisted of 55 to 65 ORFs. We analyzed their genetic diversity and found that regions of high diversity are correlated with regions of high recombination rate (p-value = 0.01). Phylogenetic inference showed two major clades that correlate well with known host range. Using the extended Bayesian Skyline model, we found that population size has remained mostly constant through time. Moreover, the dispersion pattern of these genomes is in agreement with human-driven dispersal, as suggested by phylogeographic analysis. In addition, selection analysis found evidence of positive selection on codon positions of the Receptor Binding Protein (RBP). Likewise, positively selected sites in the RBP were located within the neck and head regions in the crystal structure, both known determinants of host range. Our study demonstrates the utility of phylogenetic methods applied to whole genome data collected from populations of phage for providing insights into applied microbiology.

Introduction

Bacteriophage (phage) are among the fastest evolving and most abundant entities in nature (Hendrix, 2003). Phage are ubiquitous, co-occurring with bacteria in environments as varied as soils, oceans, and even human intestines. They are also present in any industrial process that capitalizes on bacterial metabolism, e.g., the biotechnological production of chemicals and food products (Brussow et al., 2004; Hendrix, 2003). In order to acidify milk during the production of fermented foods such as cheese, buttermilk, and sour cream, different strains of Lactococcus lactis, a Gram-positive bacterium, are used as starter organisms, primarily to ferment lactose to lactic acid. In particular, the cheese industry has been troubled by phage infections that can delay or halt fermentation. Given that phage are found in raw milk and can survive pasteurization (Madera et al., 2004), strict infection control measures and careful strain selection are required to mitigate potential phage-induced dairy fermentation failures. The L. lactis phage are members of one of the largest phage orders, Caudovirales, and are highly diverse both genetically and morphologically. This order contains three families, Myoviridae (with long, contractile tails), Siphoviridae (with long, noncontractile tails), and Podoviridae (with short tails). Lactococcal phage are mainly members of the Siphoviridae family, with a few members from the Podoviridae family. The three most prevalent groups of L. lactis phage isolated from dairy environments are c2, 936, and P335; the first two are virulent lytic phage, and the last has been reported to include both virulent and temperate phage.

Probable sources of phage infecting dairy fermentations include raw milk, growth supplements, starter strains possessing temperate phage integrated into their genomes, factory equipment, and workers. Lytic phage infections can cause bacterial cell lysis, with subsequent consequences on the rate of acid production in the fermentation process. In cheese factories, delays to fermentation can cause significant difficulties in a process that is based on a perishable starting material (milk) that cannot be stored in the event of delay. Phage infections can also lead to negative repercussions in flavor and texture of the final product, which can result in significant economic losses.

Previous research has deciphered some portions of the replicative cycle of 936-like phage, with special attention to particle adsorption and naturally occurring phage resistance mechanisms (Boucher et al., 2000; De Haard et al., 2005; Ledeboer et al., 2002; Tremblay et al., 2006). The first interaction of a phage particle and a bacterium is mediated through the specific recognition between the phage receptor-binding protein (RBP), located at the tip of the tail, and the host cell receptor distributed over the cell surface. It is known that L. lactis phage adsorb initially to the cell surface and likely bind to various carbohydrates containing rhamnose, glucose, or galactose (Tremblay et al., 2006). This adsorption step is reversible for c2 phage, which need a secondary interaction with the bacterial cell wall through a predicted membrane attached protein (PIP). However, 936 and P335 phage do not use a secondary receptor. A number of naturally occurring plasmid and chromosomally-encoded phage resistance mechanisms have been described in L. lactis strains. Among these, more than twenty Abortive infection (Abi) systems have been described (Boucher et al., 2000; Chopin et al., 2005). These phage resistance mechanisms act after phage adsorption, DNA penetration, and early gene expression and generally result in death of the infected cell and a diminished number of phage progeny. In addition, a novel anti-phage strategy has been developed by raising antibodies against 936 RBP in llama (Lama glama) with substantial inhibitory results (De Haard et al., 2005). However, as in other pathogen-host interactions, the arms race is usually led by the pathogen and, in this system, lactococcal phage have evolved mechanisms to avoid cell defenses and escape host cell resistance mutations.

Caudovirales evolution is thought to be driven by the horizontal exchange of genes between distantly related phage and hosts. This mosaicism is inferred from the observation of abrupt changes in sequence similarity (Casjens, 2005). However, particular to 936-like phage, this hypothesis awaits experimental and informatic evidence.

The 936-like phage are the species of lactococcal phage most frequently isolated from dairy fermentations (Deveau et al., 2006), have been readily isolated from whey samples during the cheese production process around the world (Crutz-Le Coq et al., 2002; de Fabrizio et al., 1991; Deveau et al., 2006; Fortier et al., 2006; Hejnowicz et al., 2009; Madera et al., 2004; Suárez et al., 2008), and probably occur wherever L. lactis occurs. The evolution of the 936-like group is not well studied and, while numerous complete genomes have been sequenced, most previous studies have utilized random samples within a factory or from a variety of factories. These analyses have been essentially based on sequence comparison but not in the context of a rigorous phylogenetic framework (Crutz-Le Coq et al., 2002; Fortier et al., 2006; Rousseau and Moineau, 2009). In addition, nothing is known about their evolution from a population genetics approach, their phylodynamics, or their dispersion over a determined geographic region (Pybus and Rambaut, 2009). Moreover, open reading frames (ORFs) involved in key functions have never been analyzed for diversifying selection in a phylogenetic context.

In the present study, we inferred from full genome sequences the phylogenetic relationships within an Australian population of 936-like phage sampled serially from dairy factories over an 8-year period (1994-2001). In addition, we tested whether the population size remained constant through time, and how the isolates dispersed over the geographic region from which they were sampled. Finally, we tested for evidence of diversifying selection in a set of relevant genes. Thus, the aim of the work was to provide insights into the historical relationships of this group of phage; in particular, to learn how they have dispersed through Australian factories, whether the population has changed in size through time, and whether there are alleles that may provide an increased fitness to the phage that carry them.

Materials and Methods

Sequences and Alignment

Phage samples were collected by the Cultures Division of Dairy Innovation Australia Ltd. as part of routine screening of factory whey samples for phage and spanned a time window from 1994 to 2001. Phage were categorized by factory of origin (factories were identified with random letters [A to I]), date of collection, and known host range. Twenty-eight genomes were selected for complete genome sequencing by pyrosequencing (Roche Genome Sequencer FLX by Department of Primary Industry - Victoria) supplemented, on an as-needed basis, by Sanger sequencing in selected regions to resolve ambiguities and low quality positions. Phage sequences were computer annotated with the Genome Annotation Transfer Utility (Tcherepanov et al., 2006) and manually curated.

From the 28 complete phage genomes, each ORF was taken and aligned individually with its orthologs. DNA sequences were visualized in Seaview 4.2.7 (Gouy et al., 2010), translated into protein, and then aligned with MAFFT using the L-INS-i algorithm (Katoh et al., 2005). Protein sequences were back-translated into their original nucleotide sequences for further analyses. Additionally, the resulting alignment was inspected by eye to correct obviously misaligned positions. Since not all phage isolates have the same gene content or gene order, extended regions of gaps were placed in the alignment wherever a gene was absent from some isolates. It is known that insertions and deletions are common in the evolution of phage, so these were interpreted as novel characters (sensu Egan and Crandall, 2008). Consequently, in further analyses gaps were considered as a fifth character unless otherwise stated.
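The back-translation step described above can be illustrated with a minimal Python sketch. It threads an unaligned coding sequence onto its gapped protein alignment, codon by codon. The function name and the toy sequences are hypothetical, and the sketch assumes the coding sequence is in frame and carries no stop codon; it is not the procedure implemented in Seaview or MAFFT.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its gapped protein alignment.

    Each residue in the aligned protein consumes one codon from the coding
    sequence; each protein gap ('-') becomes a three-column nucleotide gap.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) % 3 != 0 or n_residues != len(codons):
        raise ValueError("coding sequence does not match the aligned protein")

    next_codon = iter(codons)
    columns = []
    for aa in aligned_protein:
        columns.append("---" if aa == "-" else next(next_codon))
    return "".join(columns)


# Toy example with made-up sequences (not taken from the phage data set).
print(back_translate("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

Aligning at the protein level and then restoring the nucleotides keeps codons intact, which matters for the codon-based selection analyses later in the chapter.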

Genetic diversity and Recombination

Genetic diversity (Θ) was estimated for the Australian isolates using both a Watterson estimate (Watterson, 1975) assuming infinite sites, as implemented in DnaSP ver. 5 (Librado and Rozas, 2009), and a modified Watterson estimate that relaxes the infinite sites model assumption, as implemented in Pairwise, LDhat 2.5 (Hudson, 2001; McVean et al., 2002). The recombination parameter, rho (ρ = 2Ner), was estimated using a composite-likelihood method as implemented in Interval, also part of the LDhat 2.5 package. The correlation between Θ and ρ was tested using a non-parametric test of correlation (Kendall's τ test (Kendall, 1938)) as implemented in PASW18.
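For readers unfamiliar with the infinite-sites Watterson estimate, the following Python sketch computes it per site from a small alignment as S / a_n, where S is the number of segregating sites and a_n is the sum of 1/i for i from 1 to n-1. The function name and toy alignment are hypothetical. As an assumption of this sketch, columns containing gaps are skipped when counting segregating sites; this is a simplification and is not claimed to reproduce the DnaSP or LDhat implementations.

```python
def watterson_theta_per_site(sequences):
    """Watterson's estimator of theta per site for equal-length aligned sequences.

    theta_W = S / a_n, with S the number of segregating sites and
    a_n = sum_{i=1}^{n-1} 1/i for a sample of n sequences.
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    segregating = 0
    for column in zip(*sequences):
        if "-" in column:
            continue  # assumption of this sketch: ignore columns with gaps
        if len(set(column)) > 1:
            segregating += 1

    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n / length


# Toy alignment (made-up data): 2 segregating sites, n = 3, 4 columns.
toy = ["ACGT", "ACGA", "ATGA"]
print(watterson_theta_per_site(toy))
```

In the toy alignment, S = 2 and a_3 = 1.5, so the estimate is 2 / 1.5 / 4, about 0.33 per site.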

Phylogenetic Inference

A phylogeny for phage isolates was estimated using Bayesian inference (BI) as implemented in BEAST 1.5.3 (Drummond and Rambaut, 2007). This method was chosen because of its ability to account for uncertainty in phylogenetic estimates, the variety of clock models and tree priors available, and the relative speed of the computation. The best-fit model of evolution was selected under the Akaike Information Criterion (AIC) (Akaike, 1974), with parameter values estimated using model averaging as implemented in jModelTest 1.0.1 (Posada, 2008). The best-fit model was used for subsequent analyses; nevertheless, base frequencies and rates were estimated in BEAST. The molecular clock hypothesis was tested on the data against an uncorrelated lognormal relaxed molecular clock by Bayes factors as implemented in BEAST 1.5.3 (Drummond et al., 2006). The molecular clock was calibrated with sampling dates and a uniform prior distribution for the mean rate of substitution ranging from 10⁻³ to 10⁻⁸ substitutions per site per year (s/s/y) (Holmes, 2009). Bayesian posterior probabilities were determined by running 4 chains (independent runs) of 100 million steps each. Parameters were sampled every 10,000 steps. A burn-in of 10% was used to discard parameters initially sampled. Autocorrelation was assessed by checking the Effective Sample Size statistic (ESS > 200). Convergence and mixing of the Markov chains were assessed using Tracer 1.5 (http://beast.bio.ed.ac.uk/Main_Page). Different runs tended to converge on the same parameter values in the stationary phase. Mixing was evaluated qualitatively by looking at oscillations in the parameter space explored by the trace of each chain. Trees were summarized in TreeAnnotator (http://beast.bio.ed.ac.uk/Main_Page) by targeting the Maximum Clade Credibility (MCC) tree; support values represent posterior probability estimates.
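The burn-in and ESS checks described above can be sketched in Python as follows. This is a simple estimator that sums sample autocorrelations until the first non-positive lag; it is an illustration of the idea only and is not the algorithm used by Tracer. The function names and the simulated chains are hypothetical.

```python
import random


def discard_burn_in(samples, fraction=0.10):
    """Drop the first `fraction` of the sampled states (10% in the text)."""
    return samples[int(len(samples) * fraction):]


def effective_sample_size(samples):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first lag whose autocorrelation is not
    positive (a common, simple truncation rule; an assumption of this sketch).
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        raise ValueError("ESS is undefined for a constant trace")

    acf_sum = 0.0
    for lag in range(1, n):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * variance)
        if acf <= 0:
            break
        acf_sum += acf
    return n / (1.0 + 2.0 * acf_sum)


# Simulated traces (made-up data): an autocorrelated chain versus independent draws.
rng = random.Random(1)
independent = [rng.gauss(0, 1) for _ in range(2000)]
correlated = [0.0]
for _ in range(1999):
    correlated.append(0.9 * correlated[-1] + rng.gauss(0, 1))

for name, trace in (("independent", independent), ("correlated", correlated)):
    kept = discard_burn_in(trace)
    print(name, len(kept), round(effective_sample_size(kept)))
```

With the settings reported in the text, each run of 100 million steps sampled every 10,000 steps yields about 10,000 sampled states, of which roughly 9,000 remain after the 10% burn-in. A strongly autocorrelated trace gives an ESS well below the number of retained samples, which is why the ESS > 200 threshold is checked.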

Past Population Dynamics

Past population dynamics of the Australian phage were inferred from all the putative protein-coding genes in BEAST 1.5.4, using the Extended Bayesian Skyline Plot (eBSP) model as tree prior and an uncorrelated relaxed molecular clock (Drummond et al., 2005). The number of grouped intervals was co-estimated with the relative genetic diversity through time. Bayes factors were used to test different tree priors representing different demographic scenarios: expansion, exponential growth, constant population size, and the eBSP.

Phylogeographic Analysis

Phylogeographic analyses were performed by ancestral reconstruction of discrete states in a Bayesian Statistical Framework (Lemey et al., 2009). The most parsimonious description of the diffusion process was identified by a discrete analysis (see http://beast.bio.ed.ac.uk/Tutorials). The inferences were summarized and geographical ancestral states were color-coded onto the tree topology.

Selection Analysis

Adaptive selection was assessed by estimating the ratio of nonsynonymous to synonymous nucleotide substitution rates (dN/dS) on a site-by-site basis as implemented in DataMonkey.org (Delport et al., 2010) using Fixed-Effects Likelihood (FEL), Random-Effects Likelihood (REL), and Single Likelihood Ancestor Counting (SLAC). Positively selected sites were mapped onto the crystal structure of the Receptor Binding Protein (RBP, PDB accession 2BSD (Spinelli et al., 2005)) using Cn3D 4.1 software available at ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.1.msi.

Figure 1: Schematic genome alignment of Australian cheese factory derived 936-like phage used for this study. Phage genomes are represented as colored arrows indicative of the direction of transcription of protein and tRNA encoding genes. Those orthologs conserved in all phage are colored dark blue while those present in a subset of phage are in a variety of colors. The gray ortholog present in all genomes is an intact gene in some genomes and an apparent pseudogene in others. Black lines indicate the boundaries between those regions encoded during the early, middle, and late phases of transcription. The tRNA genes encoded by some phage are also indicated by olive green arrows.

Results

Genome organization, Genetic Diversity and Recombination

The genome length of lactococcal phage belonging to the 936-like species has been reported to range from 28 to 32 Kbp and to consist of 50-64 ORFs (Deveau et al., 2006; Rousseau and Moineau, 2009). Similarly, the new phage genomes analyzed for this study range in length from 30.9 Kbp to 32.1 Kbp (31.5 on average) and consisted of 55 to 65 ORFs (61 on average) that, to a great extent, were conserved in gene order and sequence homology (Fig. 1). Homologues of 41 ORFs were present in all 28 phage studied from a total pool of 74 ORFs. The remaining 33 ORFs not present in all phage but detected in one or more phage were primarily located in the early-transcribed region.

For phylogenetic analysis, all the ORFs from all the phage (not just orthologs common to all phage) were assembled into a coding genome that, when aligned, had a length of 34,086 bases (no missing data) (Fig. 2). The genetic diversity across the coding genome was measured using a sliding window approach that looked at genetic diversity (Θ/site) and recombination rate (ρ/kb) (Fig. 2). Three main regions of relatively high genetic diversity (Θ per site > 0.10) were identified. The leftmost peak is located in the late-expressed region and mainly spans the neck portal protein. The second peak is located close to the late/early gene expression junction and includes some genes whose presence is variable across the phage. This area of increased diversity includes part of the tape measure protein, the holin, the lysin, and some early genes of unknown function, with the largest peak over the receptor binding protein. The rightmost peak spans the start of the early region and all of the middle expressed genes. This region encodes the DNA polymerase subunit, the Holliday junction endonuclease, and several genes of unknown function. Θ was estimated at 0.03009 using a likelihood method that relaxes the infinite sites assumption (LDhat) and at 0.04497 using the infinite sites Watterson estimate (DnaSP 5). The recombination rate averaged across all segregating sites was estimated as ρ = 0.0015, while the recombination rate for each site varied considerably and correlated significantly with regions of high genetic diversity (R = 0.665, two-tailed p-value = 0.01, Kendall's τ test of correlation).
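To make the rank-based correlation reported above concrete, here is a minimal Python sketch of the Kendall τ-b coefficient computed from paired per-window values of diversity and recombination rate. The function name and the window values are hypothetical and are not the data behind Fig. 2. The sketch returns only the coefficient, not the p-value that a statistics package would also report.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numbers.

    Pairs are classified as concordant, discordant, or tied in x or y only;
    pairs tied in both variables contribute to neither term.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1

    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Hypothetical per-window values (illustrative only).
theta_per_window = [1, 2, 3, 4, 5]
rho_per_window = [1, 3, 2, 5, 4]
print(kendall_tau_b(theta_per_window, rho_per_window))
```

In this toy example 8 of the 10 pairs are concordant and 2 are discordant, giving τ = 0.6. Because the statistic depends only on ranks, it makes no assumption about the distribution of Θ or ρ across windows.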

Figure 2: Plot of recombination rate and genetic diversity for aligned phage genomes. Genetic diversity plotted as Θ per site (black line) and recombination rate ρ per kb (gray line) as a function of position along the alignment, as determined in DnaSP 5.0 and LDhat 2.5, respectively. A schematic representation of all the orthologs present in the concatenated alignment used for phylogenetic analysis, annotated with selected genes to provide a point of reference, is below the graph. Orthologs colored red are early transcripts while those in blue represent late and middle transcripts (left and right, respectively).

Figure 3: Maximum clade credibility (MCC) phylogeny of the twenty-eight 936-like phage. Branch values represent posterior probability support. Branches are colored according to the ancestral state of geographic location. Taxa names indicate factory key_phage_year of isolation. The tree is mid-point rooted. On the grid, the first column indicates host tropism (black: 222/385; grey: 92/818/962 L. lactis host strains). The second, third, and fourth columns are codons under positive selection 155, 165, and 167, respectively. Blue: methionine; white: gap; light blue: leucine; dark blue: phenylalanine; yellow: tyrosine; orange: serine; red: threonine; green: alanine. Arrows below the time line relate to the dispersion analysis described in Fig. 5.

Figure 4: Plot of genetic diversity over time. Extended Bayesian Skyline Plot describing population changes as a function of time. The Y-axis represents relative genetic diversity and the X-axis represents years backward in time from the most contemporaneous sample, i.e., 2001. Mean and median are shown with credibility intervals.

Phylogenomic Analysis

The best-fit model of evolution for the phage genome alignment, chosen out of 88 implemented models, was TVM+I+G (transversion model: variable base frequencies, variable transversion rates, equal transition rates). Because one of the purposes of this study was to infer historical relationships with respect to divergence times, a molecular clock was tested using Bayes factors. The relaxed uncorrelated lognormal model was then used to infer a phylogeny (Fig. 3). The tree prior used was chosen based on Bayes Factors (Table 1). Four demographic models were tested and the one with the highest support, eBSP, was chosen for subsequent analysis. The phylogenetic inference showed strong support for all the clades and a spatio-temporal relationship between the isolates. However, clades and subclades did not map cleanly onto geographic location. For example, most factory E isolates were closely related to each other (subclade within lower large clade in Fig. 3), while the others were scattered throughout the tree, suggesting potential transmission routes (see below). In this particular case, factory E isolates did not form a monophyletic group since some members are related to phage isolated from other locations. The same applies for isolates from other disparate locations. The structure of the tree matched perfectly with the known host specificity, i.e., the upper clade contains all isolates capable of infecting L. lactis ASCC host strains 222 and 385. Likewise, the lower clade contains all isolates capable of infecting L. lactis ASCC host strains 92, 818 and 962. Branch colors represent the localities (factories) where the phage was present and where the ancestral state was inferred. Additionally, we obtained age estimates for the upper and lower clades. Likely values for the upper clade age range from 14 to 128 years (95% credible interval, CI) with a mean of 60 years. In turn, the mean age for the lower clade was estimated at 110 years with a 95% CI from 27 to 236 years; this is much older than the upper clade, although the respective credible intervals overlap.

Figure 5: Dispersal pattern of Australian 936-like phages. Phylogeographic analysis for timepoints I to IV (root to tips; linked to positions designated in Fig. 3) showing the dispersion routes of the ancestral lineages that explain the current sampling geographic distribution (letters indicate phage from different geographic locations). I to IV are 4 "time slices" arbitrarily chosen to describe the dispersal pattern.

Table 1: Bayes Factors hypothesis testing on demographic tree priors for Australian 936 phages

Model                            Constant     Expanding    Exponential  Extended Bayesian  Likelihood
                                 population   population   growth       Skyline Plot
Constant population              -            15.07        16.794       -31.877            -92650.146
Expanding population             -15.07       -            1.725        -46.947            -92684.846
Exponential growth               -16.794      -1.725       -            -48.671            -92688.817
Extended Bayesian Skyline Plot   31.877       46.947       48.671       -                  -92576.747
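The pairwise values in Table 1 can be recovered from its Likelihood column with the following Python sketch. It rests on an assumption not stated in the text: that the Likelihood column holds natural-log marginal likelihoods and that the pairwise entries are log10 Bayes factors (row model versus column model). Under that assumption the tabulated values are reproduced to within rounding. The dictionary keys and function name are hypothetical.

```python
import math

# Values copied from the Likelihood column of Table 1; treating them as
# natural-log marginal likelihoods is an assumption of this sketch.
LOG_MARGINAL_LIKELIHOOD = {
    "constant": -92650.146,
    "expanding": -92684.846,
    "exponential": -92688.817,
    "eBSP": -92576.747,
}


def log10_bayes_factor(row_model, column_model, table=LOG_MARGINAL_LIKELIHOOD):
    """log10 Bayes factor of row_model over column_model."""
    return (table[row_model] - table[column_model]) / math.log(10)


models = list(LOG_MARGINAL_LIKELIHOOD)
for row in models:
    cells = ["-" if row == col else f"{log10_bayes_factor(row, col):.3f}" for col in models]
    print(f"{row:12s}", *cells)
```

For instance, constant versus expanding gives (-92650.146 + 92684.846) / ln 10, about 15.07, and eBSP versus constant gives about 31.88, matching the table. Positive values favor the row model, so the eBSP row is positive against every alternative.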

Phylodynamics

The phage dataset was tested against four demographic models including the extended Bayesian Skyline Plot (eBSP). The eBSP model had the highest support and was chosen for subsequent analysis. To estimate the historical diversity of this group of phage, an extended Bayesian Skyline Plot analysis was performed (Fig. 4). Mean and median values for relative genetic diversity (Y axis) together with credibility intervals were plotted through time (X axis; time 0 represents the most recent sampling, i.e., 2001). The plot shows a rather constant genetic diversity and population size, with slight variation in the credible intervals. The genetic diversity and population sizes are in part dependent on the number of hosts available for infection. Since the production processes in the factories are standardized and the time window is small (8 years for a DNA virus), it is possible that the analysis represents the limited/constant capacity of the factories during those years or that the temporal signal was not present in the sequences.

Phylogeography

The phylogeographic pattern of dispersal shows multiple sources for several isolates (Fig. 5). The diagram depicts a matrix of factories reflecting the relative distances between factories, but not linked to any geographic features. The pattern suggests human-driven dispersion throughout the region rather than dispersion by natural means, because distant locations are connected in different directions instead of, for example, following natural elements like wind directions or bird migrations. For instance, ancestral lineages in factories C, D, and E evolved from an ancestor present in factory F. Likewise, the factory F ancestral lineage gave rise to lineages present in factories G and I (Fig. 3).


Figure 6: Molecular mapping of positively selected sites detected in the receptor binding protein (RBP). A, monomer structure with 3 sites mapped on it; B, homotrimer (active structure) with sites mapped in each chain. Molecular structure retrieved from the NCBI structure database, PDB code 2BSD.

Selection Analysis

Four protein-coding genes were chosen for selection analysis: the receptor binding protein gene (RBP, host specificity), tape measure protein (tail length determinant), major capsid protein (head structure) and lysin (bacterial cell wall degradation). Sites in which the dN/dS rate ratio was statistically greater than one were considered to be under positive selection (Sharp, 1997). The REL method found more sites than the SLAC and FEL methods (Table 2). Since the crystal structure of the RBP has been resolved (Spinelli et al., 2005), the selected sites were mapped onto the structure to visualize their positions in a 3D context (Fig. 6). The functional protein is a homo-trimer (Fig. 6B) in which each monomer is intertwined with the others. All positively selected sites were located at the head and neck structures where host range specificity is known to reside. Accordingly, these sites were mapped to the tips of the phylogeny along with host specificity (Fig. 3 grid).

Table 2: Positively selected sites (dN/dS > 1) in Australian 936 phage genes

Gene (overall dN/dS)                 SLAC*      FEL*       REL**                           Consensus (at least 2 methods)
Receptor binding protein (0.44962)   165, 167   155, 167   45, 57, 137, 139, 155, 167,     155, 167
                                                           173, 223, 229, 233, 259, 261
Major capsid protein (0.194507)      Not found  Not found  Not found                       ------
Endolysin (0.207828)                 Not found  Not found  Not found                       ------
Tape measure protein (0.21852)       Not found  Not found  42, 158, 241, 319, 320, 390,    ------
                                                           527, 530, 584, 678, 729, 738,
                                                           762, 770, 802, 818, 847, 849,
                                                           854
Neck portal protein (0.472364)       Not found  Not found  Not found                       ------

* p-value < 0.1; ** Bayes Factor > 50. All analyses performed online at datamonkey.org.
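The "Consensus (at least in 2 methods)" column of Table 2 can be reproduced with a short Python sketch that counts how many methods flag each codon. The site lists below are transcribed from the RBP row of Table 2; which list belongs to which method is my reading of the table layout and should be treated as an assumption. The function and variable names are hypothetical.

```python
from collections import Counter

# RBP codon sites per method, as read from Table 2 (column assignment assumed).
RBP_SITES = {
    "SLAC": {165, 167},
    "FEL": {155, 167},
    "REL": {45, 57, 137, 139, 155, 167, 173, 223, 229, 233, 259, 261},
}


def consensus_sites(calls, min_methods=2):
    """Return the sorted sites reported by at least `min_methods` methods."""
    counts = Counter(site for sites in calls.values() for site in sites)
    return sorted(site for site, n in counts.items() if n >= min_methods)


print(consensus_sites(RBP_SITES))                  # sites supported by two or more methods
print(consensus_sites(RBP_SITES, min_methods=3))   # sites supported by all three methods
```

Under this reading, codons 155 and 167 are supported by at least two methods, in agreement with the consensus column, and codon 167 is the only site flagged by all three.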

Discussion

Genome organization, Genetic Diversity and Recombination

It has been commonly assumed that phage evolve in a modular fashion, by shuffling conserved cassettes of genome regions, due to the observed similarity among these regions when comparing distant phage (Casjens, 2005). Here, we estimated recombination rates and genetic diversity over all segregating sites (Fig. 2). Low recombination rates were inferred over the first ~8 kb, which represents late expressed genes. This observation is somewhat expected since this region contains packaging and morphogenesis genes; therefore, changes here are likely to be removed by purifying selection due to structural and functional constraints, i.e., a change here is more likely to yield a protein product unable to form viable viral particles. In addition, over the central region of the genome alignment, a spike centered on the receptor binding protein (~17.5 kb) and another region between 27.7 and 33.4 kb exhibit high recombination rates. The former spike is also part of the late expressed genes, whereas the latter region lies within early/middle expressed genes. However, it seems that genetic diversity along the genome is more the result of nucleotide substitutions than of recombination, as evidenced by the ratio Θ/ρ = 20.06. Further experimental analyses will shed more light on the relative role of recombination in the evolution of lactococcal phage. For now, the correlation of higher genetic diversity with higher recombination rate (Fig. 2) is suggestive of an important role for recombination in generating novel combinations of alleles in the phage populations, and the discrete regions of high recombination rate suggest that modularity could be a common feature in this group.
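As a quick arithmetic check, the ratio quoted above appears to follow from the two genome-wide estimates given in the Results: the LDhat estimate of Θ (0.03009) and the average recombination rate ρ (0.0015). That these are the two values behind the ratio is an inference from the numbers, not something the text states.

```latex
\frac{\Theta}{\rho} = \frac{0.03009}{0.0015} \approx 20.06
```

A ratio well above one indicates that, per site, mutation contributes more to the observed diversity than recombination does.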

Phylogenomic Analysis

In order to infer historical relationships between isolates, a relaxed uncorrelated molecular clock was used and calibrated with sampling dates (Drummond and Rambaut, 2007; Drummond et al., 2002; Drummond et al., 2005). The phylogeny inferred here shows two major clades that share a common ancestor. Interestingly, each clade contains all the isolates able to infect one of the two main groups of L. lactis host strains used in this study (Fig. 3). Ancestral states were color-mapped in the topology showing a geographical structure that can be further used to infer phylogeographic patterns (see below). For the lower clade, it is clear that ancestral lineages were present somewhere near factory E and then were dispersed and independently evolved to give rise to all the diversity sampled that is capable of infecting L. lactis ASCC host strains 92, 818 and 962 in different sampled localities.

Although the credible intervals of the timing estimates for both clades partially overlap, the mean values are 50 years apart. This suggests that the lower clade has had more time to diversify and disperse than the upper clade. This is evidenced by the dispersion pattern inferred (Fig. 5), in which isolates with 92, 818 and 962 host strain tropism exhibited a wider range. In contrast, the upper (younger) clade has a narrower geographic range (Fig. 5). It is worth mentioning that although the tips of the tree are associated with this host range, changes in host strains could have occurred in the past in response to phage detection, so that ancestors of the sampled phage might have been associated with other L. lactis strains.

Phylodynamics

Demographic analysis revealed a constant population size/genetic diversity through time. If we assume that effective population size in viruses, looking back in time, reflects the number of infections that occurred, we could say that this result (Fig. 4) might represent a stable number of outbreaks through time (for the time period analyzed). Further information could shed light on this issue, as has been done in other studies where census sizes or recorded demographic/outbreak information has been correlated with past population dynamics inferences (Bennett et al., 2010; Pérez-Losada et al., 2005; Tazi et al., 2010). Interestingly, Bayes Factor analysis did not show the strongest support for the constant demographic model (Table 1). Instead, the eBSP model showed the strongest support. This could be explained because this model, in a piecewise fashion, tries to accommodate population dynamics, giving an overall best fit of the model to the data (Heled and Drummond, 2008).

Phylogeography

The dispersal pattern (Fig. 5) suggests artificial dispersal because of the multiple routes and long distances the lineages have taken to reach the present distribution. If the dispersal of lineages over the geographic region were explained by a natural carrier or other natural means, we would expect geographically close factories (e.g., factories H, D and B) to harbor closely related phage lineages, as neighboring factories would share phage having a most recent common ancestor. However, this seems not to be the case. Given the complex nature of the dairy manufacturing industry, with its large number of salespersons, technicians, and consumable suppliers, combined with the lack of historical records of these movements, a well-designed prospective study would be more appropriate to provide additional evidence for the artificial movement hypothesis.

Natural Selection

The final step of this study was to look for diversifying selection in some biologically important genes (Table 2 and Fig. 6). We used three methods to achieve some degree of confidence in the potentially selected sites. It has been shown that with large datasets (>50 taxa) all methods tend to converge to the same answer (Pond and Frost, 2005). However, when using small datasets (in this case, 28 taxa) a consensus-based inference is advisable. One of the methods, REL, assumes that each rate is represented by a simple distribution and is commonly recognized as a liberal method. On the other hand, the SLAC method provides a “quick and dirty” alternative. In turn, the FEL method was used to infer the more conservative estimate of positively selected sites. Not surprisingly, REL found many more sites under selection than FEL and SLAC when examining the RBP and tape measure genes. Regarding the major capsid protein and endolysin genes, it was not possible to find any sites under positive selection with any of the three methods, most likely because of the little variation in those genes, the presence of identical sequences that reduced the dataset even more, and the inherent conservativeness of the methods. Similarly, no positively selected sites were found in the tape measure protein gene sequence under the SLAC and FEL methods. However, positively selected sites were found in RBP, the protein that determines host specificity.

Fortunately, the crystal structure of an ortholog of this protein has been resolved (p2 phage, 936-like (Spinelli et al., 2005)), so it was possible to map the selected sites onto the structure (Fig. 6). The active form of the protein is a homo-trimer whose main domains are: the shoulders, a β-sandwich attached to the phage tail; the neck, an interlaced β-prism; and the head, the receptor recognition domain composed of a seven-stranded β-barrel (from bottom up in Fig. 6B). We mapped positively selected sites found by all three methods and these were located in the neck and one toward the head. The neck is a rigid structure, homologous to structures in viruses infecting hosts of different kingdoms (adenovirus, reovirus, and other phage as well). It is thought that this structure, along with the head, has co-evolved with the host and thus is responsible for host range (Spinelli et al., 2005). In fact, across 936-like phage, the shoulder structure is highly conserved (~90% amino acid identity), whereas the neck and head show greater variability (down to 15%, figure 2E in (Spinelli et al., 2005)). Interestingly, character states in codon 155 were related to monophyletic groups and to a large degree related to host specificity, with the upper clade having a methionine and the lower clade having either a gap or a leucine. Likewise, codons 165 and 167 also followed this pattern. It is known that RBP is the sole determinant of host specificity (Dupont et al., 2004). Thus, the fact that these sites were found to be under positive selection and located in relevant domains could indicate that they are somehow involved in the phage-cell interaction, either directly, by making contacts with the cell ligand, or indirectly, by serving as a scaffold for a proper conformation. The use of this observation as a predictor of host specificity awaits experimental confirmation.

Summary

In this study, we showed that recombination is concentrated within a few regions of the genome and that these regions of increased recombination rate are correlated with regions of increased genetic diversity, providing evidence for the modular nature of phage genome structure. In addition, our whole genome phylogenetic analysis of a population of 936-like phage shows that isolates cluster together according to their host tropism. Our phylogeographic analysis suggests that the mechanism of dispersion is most likely human-associated movement and agrees with timing estimates. Furthermore, positively selected sites in the receptor-binding protein, the sole determinant of host range (Dupont et al., 2004), lie in the domains that interact with the host receptor and correlate well with host specificity. A comprehensive assessment of 936-like phage evolutionary features will necessitate extensive sequencing efforts along with the recording of isolate characteristics. Nevertheless, our results clearly show the dominant role of host specificity in the evolutionary dynamics of these phage and demonstrate the utility of evolutionary approaches to key questions in applied microbiology.

References

Akaike, Hirotugu 1974. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19: 716-723.

Bennett, S. N., A. J. Drummond, D. D. Kapan, M. A. Suchard, J. L. Munoz-Jordan, O. G. Pybus, E. C. Holmes and D. J. Gubler 2010. Epidemic Dynamics Revealed in Dengue Evolution. Molecular Biology and Evolution 27: 811-818.

Boucher, Isabelle, Eric Emond, Eric Dion, Diane Montpetit and Sylvain Moineau 2000. Microbiological and Molecular Impacts of AbiK on the Lytic Cycle of Lactococcus Lactis Phages of the 936 and P335 Species. Microbiology 146: 445-453.

Brussow, Harald, Carlos Canchaya and Wolf-Dietrich Hardt 2004. Phages and the Evolution of Bacterial Pathogens: From Genomic Rearrangements to Lysogenic Conversion. Microbiol. Mol. Biol. Rev. 68: 560-602.

Casjens, Sherwood R. 2005. Comparative Genomics and Evolution of the Tailed-Bacteriophages. Current Opinion in Microbiology 8: 451-458.

Chopin, Marie-Christine, Alain Chopin and Elena Bidnenko 2005. Phage Abortive Infection in Lactococci: Variations on a Theme. Current Opinion in Microbiology 8: 473-479.

Crutz-Le Coq, Anne-Marie, Benedicte Cesselin, Jacqueline Commissaire and Jamila Anba 2002. Sequence Analysis of the Lactococcal Bacteriophage bIL170: Insights into Structural Proteins and HNH Endonucleases in Dairy Phages. Microbiology 148: 985-1001.

de Fabrizio, S. V., R. A. Ledford, Y. S. C. Shieh, J. Brown and J. L. Parada 1991. Comparison of Lactococcal Bacteriophage Isolated in the United States and Argentina. International Journal of Food Microbiology 13: 285-293.

De Haard, Hans J. W., Sandra Bezemer, Aat M. Ledeboer, Wally H. Muller, Piet J. Boender, Sylvain Moineau, Marie-Cecile Coppelmans, Arie J. Verkleij, Leon G. J. Frenken and C. Theo Verrips 2005. Llama Antibodies against a Lactococcal Protein Located at the Tip of the Phage Tail Prevent Phage Infection. J. Bacteriol. 187: 4531-4541.

Delport, Wayne, Art F. Y. Poon, Simon D. W. Frost and Sergei L. Kosakovsky Pond 2010. Datamonkey 2010: A Suite of Phylogenetic Analysis Tools for Evolutionary Biology. Bioinformatics 26: 2455-2457.

Deveau, Helene, Simon J. Labrie, Marie-Christine Chopin and Sylvain Moineau 2006. Biodiversity and Classification of Lactococcal Phages. Appl. Environ. Microbiol. 72: 4338-4346.

Drummond, A. J. and A. Rambaut 2007. BEAST: Bayesian Evolutionary Analysis by Sampling Trees. BMC Evolutionary Biology 7: -.

Drummond, AJ, S Ho, M Phillips and A Rambaut 2006. Relaxed Phylogenetics and Dating with Confidence. PLoS Biology 4: e88.

Drummond, AJ, G Nicholls, A Rodrigo and W Solomon 2002. Estimating Mutation Parameters, Population History and Genealogy Simultaneously from Temporally Spaced Sequence Data. Genetics 161: 1307-1320.

Drummond, AJ, A Rambaut, B Shapiro and O Pybus 2005. Bayesian Coalescent Inference of Past Population Dynamics from Molecular Sequences. Molecular Biology and Evolution 22: 1185-1192.

Dupont, Kitt, Finn Kvist Vogensen, Horst Neve, Jose Bresciani and Jytte Josephsen 2004. Identification of the Receptor-Binding Protein in 936-Species Lactococcal Bacteriophages. Appl. Environ. Microbiol. 70: 5818-5824.

Egan, Ashley N. and Keith A. Crandall 2008. Incorporating Gaps as Phylogenetic Characters across Eight DNA Regions: Ramifications for North American Psoraleeae (Leguminosae). Molecular Phylogenetics and Evolution 46: 532-546.

Fortier, Louis-Charles, Ali Bransi and Sylvain Moineau 2006. Genome Sequence and Global Gene Expression of Q54, a New Phage Species Linking the 936 and C2 Phage Species of Lactococcus Lactis. J. Bacteriol. 188: 6101-6114.

Gouy, Manolo, Stéphane Guindon and Olivier Gascuel 2010. SeaView Version 4: A Multiplatform Graphical User Interface for Sequence Alignment and Phylogenetic Tree Building. Molecular Biology and Evolution 27: 221-224.

Hejnowicz, Monika S., Marcin Golebiewski and Jacek Bardowski 2009. Analysis of the Complete Genome Sequence of the Lactococcal Bacteriophage bIBB29. International Journal of Food Microbiology 131: 52-61.

Heled, J. and A. J. Drummond 2008. Bayesian Inference of Population Size History from Multiple Loci. BMC Evolutionary Biology 8: -.

Hendrix, Roger W. 2003. Bacteriophage Genomics. Current Opinion in Microbiology 6: 506-511.

Holmes, E. C. 2009. The Evolutionary Genetics of Emerging Viruses. Annual Review of Ecology, Evolution, and Systematics 40: 353-372.

Hudson, Richard R. 2001. Two-Locus Sampling Distributions and Their Application. Genetics 159: 1805-1817.

Katoh, Kazutaka, Kei-ichi Kuma, Hiroyuki Toh and Takashi Miyata 2005. MAFFT Version 5: Improvement in Accuracy of Multiple Sequence Alignment. Nucleic Acids Research 33: 511-518.

Kendall, M. G. 1938. A New Measure of Rank Correlation. Biometrika 30: 81-93.

Ledeboer, A. M., S. Bezemer, J. J. W. de Haard, I. M. Schaffers, C. T. Verrips, C. van Vliet, E.-M. Dusterhoft, P. Zoon, S. Moineau and L. G. J. Frenken 2002. Preventing Phage Lysis of Lactococcus Lactis in Cheese Production Using a Neutralizing Heavy-Chain Antibody Fragment from Llama. J. Dairy Sci. 85: 1376-1382.

Lemey, P., A. Rambaut, A. J. Drummond and M. A. Suchard 2009. Bayesian Phylogeography Finds Its Roots. PLoS Computational Biology 5: -.

Librado, P. and J. Rozas 2009. DnaSP v5: A Software for Comprehensive Analysis of DNA Polymorphism Data. Bioinformatics 25: 1451-1452.

Madera, Carmen, Cristina Monjardin and Juan E. Suarez 2004. Milk Contamination and Resistance to Processing Conditions Determine the Fate of Lactococcus Lactis Bacteriophages in Dairies. Appl. Environ. Microbiol. 70: 7365-7371.

McVean, Gil, Philip Awadalla and Paul Fearnhead 2002. A Coalescent-Based Method for Detecting and Estimating Recombination from Gene Sequences. Genetics 160: 1231-1241.

Pérez-Losada, Marcos, Raphael P. Viscidi, James C. Demma, Jonathan Zenilman and Keith A. Crandall 2005. Population Genetics of Neisseria Gonorrhoeae in a High-Prevalence Community Using a Hypervariable Outer Membrane porB and 13 Slowly Evolving Housekeeping Genes. Mol Biol Evol 22: 1887-1902.

Pond, Sergei L. Kosakovsky and Simon D. W. Frost 2005. A Genetic Algorithm Approach to Detecting Lineage-Specific Variation in Selection Pressure. Molecular Biology and Evolution 22: 478-485.

Posada, D. 2008. jModelTest: Phylogenetic Model Averaging. Molecular Biology and Evolution 25: 1253-1256.

Pybus, O. G. and A. Rambaut 2009. Evolutionary Analysis of the Dynamics of Viral Infectious Disease. Nature Reviews Genetics 10: 540-550.

Rousseau, Genevieve M. and Sylvain Moineau 2009. Evolution of Lactococcus Lactis Phages within a Cheese Factory. Appl. Environ. Microbiol. 75: 5336-5344.

Sharp, Paul M. 1997. In Search of Molecular Darwinism. Nature 385: 111-112.

Spinelli, Silvia, Aline Desmyter, C Theo Verrips, Hans J W de Haard, Sylvain Moineau and Christian Cambillau 2005. Lactococcal Bacteriophage P2 Receptor-Binding Protein Structure Suggests a Common Ancestor Gene with Bacterial and Mammalian Viruses. Nature Structural & Molecular Biology 13: 85-89.

Suárez, V., S. Moineau, J. Reinheimer and A. Quiberoni 2008. Argentinean Lactococcus Lactis Bacteriophages: Genetic Characterization and Adsorption Studies. Journal of Applied Microbiology 104: 371-379.

Tazi, Loubna, Marcos Pérez-Losada, Weiming Gu, Yang Yang, Lin Xue, Keith Crandall and Raphael Viscidi 2010. Population Dynamics of Neisseria Gonorrhoeae in Shanghai, China: A Comparative Study. BMC Infectious Diseases 10: 13.

Tcherepanov, V., A. Ehlers and C. Upton 2006. Genome Annotation Transfer Utility (GATU): Rapid Annotation of Viral Genomes Using a Closely Related Reference Genome. BMC Genomics 7: -.

Tremblay, Denise M., Mariella Tegoni, Silvia Spinelli, Valerie Campanacci, Stephanie Blangy, Celine Huyghe, Aline Desmyter, Steve Labrie, Sylvain Moineau and Christian Cambillau 2006. Receptor-Binding Protein of Lactococcus Lactis Phages: Identification and Characterization of the Saccharide Receptor-Binding Site. J. Bacteriol. 188: 2400-2410.

Watterson, G. A. 1975. On the Number of Segregating Sites in Genetical Models without Recombination. Theoretical Population Biology 7: 256-276.

Summary

The conclusions of this work cover three main aspects: 1) it reviews the literature pertinent to phylogenetic approaches to population genetics and molecular evolution applied to microorganisms, 2) it provides a methodological overview of molecular survey approaches and evaluates their relative advantages, and 3) it demonstrates the utility of such approaches for pathogens related to food production and select agents.

In particular, in this work I addressed three outstanding issues. In Section I, I reviewed the current literature on the application of phylogenetic methods to the model organism Human Immunodeficiency Virus. This section covers at length the history of discoveries aided by phylogenetic analysis, from high-level processes such as global migration down to our ability to follow transmission chains among a few individuals.

In Section II, chapter 2, I found that substitution rates vary widely among survey approaches (MLSTs, SNPs, and genomes), and that SNP and genomic datasets might yield different but highly supported phylogenies. MLST schemes need to be revamped, as they typically fall short of the resolution needed to track infections. Multi-gene approaches similar to those applied in phylogenomic systematics have been suggested but are not yet widely adopted. If we are to capitalize on the sheer volume of data in the post-genomic era, methodological advances, or at least new implementations, are needed that can make use of this abundance.

In Section III, parts 2 and 3, I demonstrated the utility of phylogenetic methods for outbreak investigations in salmon aquaculture (origin, demography, and natural selection of Infectious Salmon Anemia Virus) and in cheese production (gene flow and determinants of host specificity in 936-like Lactococcus phage). Both studies demonstrate that human-driven contamination (gene flow) has been behind impairments in food production, and thus they highlight the open niche for, and benefit of, applied phylogenetic studies. Phylogenetic methods applied to food production pathogens have a direct impact on policy and practice, which in turn directly affect society.

In an age of data-driven science, this work also makes 60 bacterial genomes, 28 bacteriophage genomes, and 2 viral genomes publicly available. However, its main contribution lies in connecting phylogenetic approaches, which until now have been applied mainly to human pathogens, with non-model organisms such as pathogens of economic importance. This paves the way for combining current molecular biology with population approaches and thus contributing to the safety of food production and manufacture.
