DISS. ETH NO. 26372

RESISTANCE IS FUTILE: TAMING THE BEASTS OF ANTIBIOTIC RESISTANCE IN MYCOBACTERIUM TUBERCULOSIS

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by

JŪLIJA PEČERSKA, M.Sc. ETH Zurich, Zurich, Switzerland

born on 24.01.1990

citizen of Latvia

accepted on the recommendation of Prof. Dr. Tanja Stadler (examiner), Prof. Dr. Sebastien Gagneux (co-examiner), Prof. Dr. François Balloux (co-examiner)

2019

ACKNOWLEDGMENTS

These past 5 years have been a journey I never anticipated, but I am extremely happy that I did end up taking it on. It has not always been smooth sailing, but if I had the choice to do it all over again knowing what I know now, I would do it without question. This work has helped me feel like I am in the right place and that I belong to something bigger: the vast and glorious scientific community.

I am extremely grateful to my supervisor, Tanja Stadler, for guiding me through all these years; I could not have wished for better guidance and support. My time as a student with Tanja has allowed me to work on exciting scientific projects that are relevant to global wellbeing, as well as to participate in activities that improve the life of students locally within the department. Thanks to Tanja I could rediscover and nourish my passion for teaching, which will most likely guide my future career choices.

I am very grateful to Sebastien Gagneux for kind guidance in collaborative projects as well as for being on my committee. I am also very grateful to François Balloux for agreeing to join the committee for my PhD examination.

Many thanks to my collaborators in different projects, and in particular to Denise Kühnert, Conor Meehan, Sebastian Gygli, Andrej Trauner and Mark Tanaka. Your insight and support have been invaluable and the conversations we had were always enlightening and fun.

A huge thanks goes out to the wonderful cEvo group members, current and former: Tim, Marc, Venelin, Rachel, Jana, Sarah, Nicola, Jérémie, Sasha, Chi and David. And a special thanks to Veronika, Joëlle and Carsten for sometimes being my antagonists (especially when it comes to teaching), but for being amazing friends to me despite our differences. And yet another thanks to cEvo for rising to the occasion and working together on the Taming the Beast workshop series. Each of these workshops has been unique and exciting in its own way and I am endlessly grateful that I could teach in so many of them.

A loving thank you to all the friends who have supported me, worried about me and incessantly made fun of me throughout this long journey, and in particular to Alëna, Alina, Jelizaveta, Maria, Irina and Dmitry. I appreciate every one of you and I could not do without your support.

Last but not least, a huge thank you goes to my family: my parents and grandparents who brought me up to be who I am, my dearest sister Anna Esther who is my closest friend, and my partner Jaroslavs who has supported me in my crazy endeavours (and put up with my scientific journey for all these years). I love you!


CONTENTS

1 Introduction
  1.1 Tuberculosis epidemiology
  1.2 Bayesian phylodynamic inference for epidemiology
  1.3 Outline
2 Existing approaches to TB and MDR-TB modelling
3 Transmission fitness costs in MDR-TB
4 Pyrazinamide resistance fitness costs in MDR-TB in Georgia
5 Transmission time and clustering methods in TB epidemiology
6 Quantifying the fitness cost of HIV drug resistance
7 Transmission of hepatitis B and D in an African community
8 Bayesian analyses made understandable
9 Discussion and conclusions

Bibliography


ABSTRACT

Tuberculosis (TB) was declared a global public health emergency more than a quarter of a century ago, and while significant progress has been made in reducing mortality and infection rates, there is still a long way to go to the ultimate goal of eradicating TB. While we have efficient treatment regimens, even the shortest regimen for fully drug-sensitive TB is already 6 months long. This regimen, while effective at treating drug-sensitive cases, provides plenty of room for resistance development due to sheer treatment length. Resistance development is further sped up by the characteristics of the bacterium and of the disease that it causes in humans.

In this thesis, I estimated the relative transmission fitness of drug-resistant Mycobacterium tuberculosis strains in relation to drug-sensitive strains using phylodynamics. Phylodynamic analysis has been used over the past 20 years to quantify population dynamic processes in a plethora of different population types. The methods have been used to study population dynamics on drastically different scales: from species dynamics on macroscopic scales to virus and other dynamics on microscopic scales. A lot has been done on estimating epidemiological dynamics; however, most of this work has focused on fast-evolving viruses, and the methods have not been extensively tested on much slower bacterial pathogens, of which Mycobacterium tuberculosis is a prominent example.

In Chapter 1, I give a general introduction to the topic at hand and justify the importance of whole genome sequencing of ongoing TB outbreaks. I briefly describe the current knowledge of TB dynamics as well as the gaps in knowledge that still require extensive research. I then introduce Bayesian phylodynamic methods for epidemiological analyses, highlighting in particular the characteristics of these analyses that matter most for TB epidemiology.
In Chapter 2, I cover current approaches to TB modelling, describing a range of models from evolutionary to epidemiological to a combination of both. The chapter shows the diversity of modelling approaches, as well as areas of development for further modelling efforts.

In Chapter 3, I describe a proof-of-concept simulation study and an empirical analysis of a multi-drug resistant TB (MDR-TB) dataset. In the simulation study I use an efficient implementation of the multi-type birth-death (MTBD) model in BEAST2 to analyse simulated TB epidemics, showing that even though the simulated dynamics are much more complex than the MTBD assumptions, the analyses recover the parameters accurately and precisely. I then analyse an MDR-TB lineage 4 dataset from Kinshasa using the MTBD model, estimating the relative transmission fitness of pyrazinamide-resistant MDR-TB strains.

In Chapter 4, I analyse a different dataset, this time of two TB lineages from Georgia, quantifying the transmission fitness costs of pyrazinamide resistance. The two datasets, Kinshasa and Georgia, show very different estimated transmission fitness costs for pyrazinamide resistance: the former shows the expected transmission fitness cost, while the latter shows a signal for no reduction in fitness associated with the studied resistance. This is a worrying result, as the current consensus is that pyrazinamide resistance fitness costs are high, and pyrazinamide is one of the few first-line drugs used in both drug-sensitive and MDR-TB treatment regimens. Low fitness costs, however, would mean that resistance is very likely to spread on a population-wide level.

Chapter 5 covers the most commonly used modern TB genotyping techniques and the correlation between different cluster definitions based on the genotyping data and the times spanned by the clusters.

Chapter 6 describes the application of the computational approach used in Chapter 3 and Chapter 4 to an HIV dataset to estimate the relative fitness of different drug resistance mutants.

Chapter 7 contains a study on the epidemiological dynamics of active and occult cases of Hepatitis B spreading in an African rural community with no access to treatment.

Chapter 8 covers the online resource hub set up after running the first "Taming the Beast" workshop in the series. The demand for phylodynamic expertise cannot be met even with multiple workshops running in different parts of the world, so we have created a portal containing all the existing tutorials to allow scientists to learn the necessary skills online.

Finally, in Chapter 9, I discuss the work presented in this thesis as well as the outlook for further research in the area.

ZUSAMMENFASSUNG

Tuberculosis (TB) was declared an internationally relevant health emergency more than a quarter of a century ago. Although significant progress has been made in reducing mortality and infection rates, there is still a long way to go to reach the ultimate goal of eradicating TB. There are several effective therapies, but even the shortest therapy lasts 6 months. This therapy is effective for treating drug-sensitive tuberculosis infections, but its length favours the development of resistant strains during treatment. Resistance development is further accelerated by the characteristics of the bacterium and of the disease that it causes in humans.

In this doctoral thesis I estimate the relative transmission fitness of antibiotic-resistant Mycobacterium tuberculosis strains in relation to the fitness of drug-sensitive strains using phylodynamic methods. Phylodynamic analyses have been used over the past 20 years to quantify population dynamic processes from the macroscopic level (such as species evolution) down to the microscopic world (such as viruses). Most studies, however, concerned fast-evolving viruses, while more slowly evolving bacteria such as Mycobacterium tuberculosis were neglected.

In Chapter 1, I introduce the topic and describe why it is becoming increasingly important to sequence whole genomes in ongoing TB outbreaks. I give an overview of the current state of knowledge of TB population dynamics and highlight the gaps in our knowledge of this topic. I then introduce Bayesian phylodynamic methods for epidemiological analyses, with particular attention to the analysis of TB epidemiology.

In Chapter 2, I describe the different approaches to modelling TB: evolutionary, epidemiological and a combination of the two. This chapter shows both the diversity of these models and the areas in which existing models need to be extended.

In Chapter 3, I explain how a simulation study demonstrates the correctness of our approach and how I analyse an empirical multi-drug resistant TB (MDR-TB) dataset. In the simulation study I use an efficient implementation of the multi-type birth-death (MTBD) model in BEAST2 and show that, although the simulated dynamics are more complex than the MTBD model assumes, the model parameters can be estimated accurately and precisely. I then use the MTBD model to analyse the MDR-TB lineage 4 dataset from Kinshasa and to estimate the relative transmission fitness of the pyrazinamide-resistant MDR-TB strains.

In Chapter 4, I describe my analysis of a different dataset, this time of TB lineages 2 and 4 from Georgia, in which I quantify the transmission fitness costs of resistance development. The transmission fitness costs estimated from the two datasets, Kinshasa and Georgia, are very different: the Kinshasa dataset shows the expected reduction in fitness, whereas the Georgia dataset contains a signal that the studied resistance does not lead to a reduction in fitness. This is a worrying result, since the current assumption is that the fitness costs of developing pyrazinamide resistance are very high, and pyrazinamide is one of the few first-line drugs used for both drug-sensitive and MDR-TB. Low fitness costs, however, mean that resistant strains are more likely to develop quickly.

Chapter 5 summarises the modern TB genotyping techniques that are currently most commonly used and explains the correlation between different definitions of clusters based on these genotyping data and the time that these clusters span.

Chapter 6 describes how the relative fitness of different drug resistance mutations in an HIV dataset is determined using the methods presented in Chapter 3 and Chapter 4.

Chapter 7 explains how I determined the epidemiological dynamics of active and occult cases of Hepatitis B spreading in a rural African area without access to therapies.

Chapter 8 describes the online resource that we created after the first "Taming the Beast" workshop. The demand for phylodynamic expertise is currently so high that it cannot be met solely by the various workshops that we organise around the world; the online portal, in which all tutorials are stored, allows all interested scientists to learn the necessary skills online by themselves.

In Chapter 9, I discuss the results of my doctoral thesis and offer an outlook on future relevant questions in this field.

1 INTRODUCTION

Contrary to popular belief, tuberculosis (TB) is far from eradicated and poses a severe threat to public health worldwide. In 2017, an estimated 10 million people developed TB and 1.6 million people died from TB or co-morbidities. Moreover, about half a million of the new cases were rifampicin resistant (RR) or multidrug resistant (MDR), i.e. resistant to at least two of the main first-line drugs: the aforementioned rifampicin and, additionally, isoniazid. Three countries accounted for almost half of the world's cases of RR and MDR-TB in 2017: India at 24%, China at 13% and Russia at 10%. While developed countries certainly have a much lower prevalence (e.g. European countries only accounted for about 3% of the global cases), they are not safe from either TB or MDR-TB (WHO, 2018). Moreover, while definite progress has been made in the percentage of cases tested for drug susceptibility, in 2017 only 30% of the new and previously treated TB cases were tested for rifampicin resistance, which leaves room for improved coverage (WHO, 2018).

A small and well-controlled recent outbreak in Switzerland was closely studied first by Genewein et al., 1993, using an older genotyping technique, and then by Kühnert et al., 2018, using phylodynamics on whole genome sequencing (WGS) data. While the sampling dates ranged from 1987 to 2011, the phylodynamic analyses confirmed a previous hypothesis that the outbreak most likely peaked around 1990. Kühnert et al. inferred that most of the transmission events likely occurred between 1990 and 1991, whereas the majority of cases were only reported in 1993. Not only does this show that TB outbreaks may happen even in highly monitored conditions with adequate healthcare, but it also illustrates how far we still have to go in order to develop rapid and efficient diagnostic and analysis tools. WHO declared TB a global public health emergency in 1993; however, we still lack the means to effectively track and prevent TB outbreaks.
This thesis describes an attempt to quantify the relative transmission fitness of drug resistance, and that of MDR-TB in particular. The relative transmission fitness is defined as the ratio of the transmission rate of the drug-resistant strain to that of the drug-sensitive strain, and therefore represents the relative decrease or increase in speed of spread for the drug-resistant strain. In this thesis I present an approach to Mycobacterium tuberculosis phylodynamic analysis, verified by an extensive simulation study, which allows us to quantify the relative transmission fitness of drug resistance in TB using Bayesian inference. I was able to use this method to quantify the relative fitness of pyrazinamide-resistant MDR-TB strains compared to pyrazinamide-sensitive MDR-TB, showing that this specific resistance seems to have highly variable effects on the transmission fitness of the bacterium. In order to quantify the relative transmission fitness of multi-drug resistant strains compared to sensitive strains, we need highly diverse datasets with multiple clusters of both sensitive and MDR strains. Such datasets are not easy to acquire, for various reasons explained later in the thesis, and due to the lack of such data we could not quantify the transmission fitness of multi-drug resistance relative to sensitive strains. The hope is, nevertheless, that using the approaches described in this thesis others will be able to answer this question once such data become available.
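As a minimal worked illustration of the definition of relative transmission fitness, assume that \lambda_s and \lambda_r denote the transmission rates of the drug-sensitive and the drug-resistant strain. These symbols and the numbers below are hypothetical, chosen only for illustration, and are not results from the analyses in this thesis.

```latex
\[
f = \frac{\lambda_r}{\lambda_s},
\qquad
\lambda_s = 1.0\ \text{yr}^{-1},\quad
\lambda_r = 0.7\ \text{yr}^{-1}
\;\Longrightarrow\;
f = \frac{0.7}{1.0} = 0.7
\]
```

Under these assumed values the resistant strain spreads at 70% of the speed of the sensitive strain, which corresponds to a 30% transmission fitness cost. A value of f = 1 would indicate no cost, and f > 1 a transmission advantage for the resistant strain.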


Phylodynamics, a term coined by Grenfell et al., 2004, describes the relatively young field that unifies epidemiological and evolutionary approaches to analyse pathogen dynamics. These kinds of analyses can be used to uncover the features of many different epidemics using genetic sequencing data. Moreover, these analyses can be performed using different statistical frameworks, e.g. the maximum likelihood (ML) or the Bayesian framework. In this thesis I focus on the one-step Bayesian approach, which allows us to simultaneously infer the epidemiological parameters, the evolutionary parameters and the phylogenetic tree. This approach has been used in the analyses of many different pathogens such as HIV and HBV, examples of which are also presented in this thesis. Nevertheless, the majority of the work presented here is focused on Mycobacterium tuberculosis. Thus, the next sections provide a brief introduction to the basic epidemiological dynamics of Mycobacterium tuberculosis and to the methodology of Bayesian phylodynamic analysis. I conclude the introduction with a short overview of the chapters in this thesis.

1.1 tuberculosis epidemiology

TB is a disease caused by Mycobacterium tuberculosis which most commonly manifests as a pulmonary infection in humans. It is spread when a person with pulmonary infectious TB coughs, sneezes or talks (or even sings (Sepkowitz, 1996)), which disperses the bacteria into the air where they can be inhaled by a susceptible individual. However, only about 5% of the people that come in contact with the bacteria will develop active disease within a short period of time (e.g. a year), while another 5% may develop the disease within their lifetime. Around 90% of the human population will likely never develop the disease even after coming in contact with the bacteria, for reasons still unclear. This rough estimate was originally proposed by Blower et al., 1995, and is currently widely used, often as a rule of thumb for defining the infection risk. Mycobacterium tuberculosis typically resides deep in the alveoli of the lungs, while some bacteria stay dormant within the fibrous regions and the granulomas. The variety of different reservoirs generally increases the risk of relapse. It also increases the risk of even multi-drug therapy turning into functional monotherapy when some drugs fail to reach the bacteria or are not effective in the specific environment, especially so when coupled with inadequate therapy (Colijn et al., 2011). Additionally, it seems that the inherent asymmetry in the division process of Mycobacterium tuberculosis contributes to overall population heterogeneity and to variable degrees of antibiotic resistance in the same population (Aldridge et al., 2012). This dictates the standard drug regimens that are used to treat TB, e.g. the WHO recommended short-term treatment, which comprises isoniazid, rifampicin, ethambutol and pyrazinamide on a daily basis for two months, with continued therapy of isoniazid and rifampicin for a further four months. 
This is the shortest treatment course currently available and it already lasts six months, which gives plenty of opportunity for poor adherence, which in turn may cause relapse and drug resistance. However, relapse and resistance may happen even in the case of good adherence to treatment, and resistance to most commonly used drugs is never reversed (Andersson and Hughes, 2010). Different works have attempted to quantify the resistance acquisition rates (e.g. Ford et al., 2013; Steenwinkel et al., 2012), but so far there are no conclusive estimates. It has also been shown that resistance-conferring substitutions can reduce the fitness cost of existing resistances through epistatic interactions (Borrell et al., 2013). This further indicates that compensation is much more likely than resistance reversion through backward mutation (Allen et al., 2017). Moreover, the only vaccine currently available is the BCG vaccine, which only protects children from extra-pulmonary TB, but has little effect on the prevalence of pulmonary TB in adults in an endemic setting (Gomes et al., 2004).

Pulmonary TB is characterised by a relatively long latency, which can vary from one year to perhaps life-long and which greatly limits the possibilities for contact tracing. For other, faster developing diseases, or sexually-transmitted pathogens, one can often trace contacts to build up information on linked cases and thus track the spread of the disease. While close contacts of TB patients can be monitored for disease symptoms and/or given pre-emptive treatment, we almost certainly cannot establish the source of the infection, especially if the latent period was long and no previous contacts are known. This makes it almost impossible to discover the actual length of the latency within a patient, which is further complicated by the lack of methods for detecting latent TB infection (Colangeli et al., 2014).
Similarly, another aspect of TB dynamics that is complicated by the lack of methods for detecting latent TB is the possibility of relapse in seemingly recovered patients. Unless a relapsed patient had samples collected in their first episode of TB, it is almost impossible to tell whether they have been newly infected or are suffering from a relapse of a previous infection that was stowed away in a reservoir.

From an evolutionary perspective Mycobacterium tuberculosis seems to be a fairly straightforward bacterium to analyse. The Mycobacterium tuberculosis genome shows no sign of recombination (Godfroid, Dagan, and Kupczok, 2018) or horizontal gene transfer (Cole et al., 1998) and, moreover, there is no evidence of the presence of plasmids (Zainuddin and Dale, 1990), which altogether simplifies the inference of phylogenetic relationships, as the relationships can be displayed by a tree rather than a network. However, Mycobacterium tuberculosis is also characterised by extremely slow growth (one generation in 24 hours) (Gordon and Parish, 2018), which in turn influences the rate of evolution, which is relatively slow compared to other bacteria (Duchene et al., 2016). This apparent "slowness" has many diverse impacts on the way Mycobacterium tuberculosis is observed and studied, and specifically it influences data gathering procedures. Older methods of Mycobacterium tuberculosis genotyping (e.g. spoligotyping, or spacer oligonucleotide typing, a polymerase chain reaction-based method (Driscoll, 2009), and MIRU-VNTR, typing of loci containing a variable number of tandem repeats (Vergnaud and Pourcel, 2009)) are based on limited regions of the genome, and thus show very little observable variance. However, WGS data shows noticeable variance between patients within an outbreak, and thus provides a higher information content.
Moreover, mounting evidence suggests that between-strain variation in Mycobacterium tuberculosis is higher than previously thought (Borrell and Gagneux, 2011; Gagneux and Small, 2007; Roetzer et al., 2013), which can give steady ground for analysis, especially when working with WGS data. Using WGS to study TB dynamics shows promising results, highlighting the differences in rates of recent transmission by TB lineage (Guerra-Assuncao et al., 2015), inferring the difference in rates of evolution between active and latent disease stages (Colangeli et al., 2014), and investigating the possible reasons for the high prevalence of drug resistance even when globally recommended treatment schemes are used (Cohen et al., 2015). Nevertheless, despite the range of models currently available for WGS data analysis, better resolution in data gathering is necessary to reach the full analysis potential that would let us e.g. distinguish between reinfections and relapses (Hatherell et al., 2016). Moreover, better understanding of Mycobacterium tuberculosis within-host evolution and mixed infection will further improve our interpretation of WGS data (Hatherell et al., 2016).

1.2 bayesian phylodynamic inference for epidemiology

Phylodynamics is a relatively new approach developed in the last 20 years to quantify population dynamics simultaneously with evolutionary dynamics. When looking at epidemiology, the population dynamics are, in fact, epidemiological dynamics. Unlike traditional epidemiology, which mainly uses incidence and prevalence data for analyses, phylodynamic approaches require some kind of discrete characters that allow the methods to reconstruct evolutionary relationships. Genetic data, e.g. in the form of pathogen genetic sequences, provides a great source of information that can be used for phylodynamic analyses. Mycobacterium tuberculosis, as discussed above, shows no evidence of horizontal gene transfer, which allows us to reconstruct the ancestral relationships of the sequences in the form of a phylogenetic tree rather than having to reconstruct more complex relationships in the form of networks. In addition, in order for the genetic sequences to be informative enough for trees to be built, the pathogen in question has to belong to the so-called measurably evolving populations (Drummond et al., 2003), which Mycobacterium tuberculosis seemingly does. The shape of the reconstructed phylogenetic tree is defined by both the evolutionary and the epidemiological processes (i.e. the transmission process will be different depending on the speed of spread of the pathogen), which consequently allows us to infer parameters of both processes from the genetic data.

Phylodynamic analyses pose an exceptionally hard computational problem; in fact, they are NP-hard¹. The point of the inference is to find the posterior probability distribution of the model parameters given the data, which is generally impossible to calculate directly. Thus we use Bayes' theorem to evaluate the distribution based on other probabilities: the probability of the data given the model parameters, the priors on the model parameters, and the marginal likelihood of the data:

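\[
P(\tau, \theta, \eta \mid D)
= \frac{P(D \mid \tau, \theta, \eta)\, P(\tau \mid \eta)\, P(\eta)\, P(\theta)}{P(D)}
= \frac{P(D \mid \tau, \theta)\, P(\tau \mid \eta)\, P(\eta)\, P(\theta)}{P(D)}
\]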

where:

D is the data (e.g. the sequence alignment),

τ is the tree topology with branch lengths,

θ are the parameters of the evolutionary model,

η are the parameters of the epidemiological (or population) model,

P(D | τ, θ, η) is the phylodynamic likelihood, or the probability of the data given the model parameters,

P(τ | η) is the probability of the tree given the parameters of the population model,

P(η) is the prior probability of the population model parameters, and

P(θ) is the prior probability of the evolutionary model parameters.

¹ Non-deterministic polynomial-time hard (NP-hard) problems are problems that are at least as hard as every problem in NP (any problem in NP can be transformed into an NP-hard problem in polynomial time) and for which there is no currently known polynomial-time algorithm.

The marginal likelihood P(D) is expressed in the form of the following integral:

\[
P(D) = \int_{\tau, \theta, \eta} P(D \mid \tau, \theta)\, P(\tau \mid \eta)\, P(\eta)\, P(\theta)\, d\tau\, d\theta\, d\eta
\]

Importantly, Bayesian phylodynamic analyses estimate the posterior distributions for all estimated parameters as well as the posterior distribution over the tree space. This means that Bayesian analyses always provide the degree of uncertainty that we have in any of the estimates. This is particularly important when data is scarce, as it can be tempting to draw conclusions on the basis of a single estimate (e.g. from maximum likelihood inference or even from the median estimate of a Bayesian inference), while the uncertainty can be so great that no reliable conclusion can be drawn.

In the context of this thesis it is important to discuss the influence of the likelihood of the data (the probability of the data given the parameters) and of the priors on the model parameters². The phylodynamic likelihood should have the most impact on the posterior, as it numerically describes the fit of a model to the data, being the probability of the data given the model parameters. For that impact to be significant, however, the data needs to contain enough information for inference, which is why older types of TB genotyping data do not always serve the purpose of such analyses, often providing the same set of characters for multiple sequences.

Bayesian phylodynamic approaches also force us to include any prior knowledge that we may have about the analysed system³. Moreover, if the data has little information on the parameters of the processes, the influence of the priors may overpower the likelihood. Well-set priors are therefore vitally important for analyses where the data contains little information. For example, Mycobacterium tuberculosis shows little evolution on the timescales of the clustered outbreaks that I studied in this thesis. Setting a strict and precise prior on the evolutionary parameters allows us to eliminate part of the uncertainty in our final estimates.
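The interplay between likelihood and prior described above can be illustrated with a deliberately simplified sketch. It is not the phylodynamic likelihood and not the BEAST2 implementation: it assumes a toy model in which k substitutions are observed over a total of `total_time` years and follow a Poisson distribution with rate mu, with a Gamma(shape, rate) prior on mu. All function names, parameter values and data below are hypothetical. The sampler is a basic random-walk Metropolis algorithm, which only ever uses ratios of unnormalised posterior values, so the marginal likelihood P(D) is never computed.

```python
import math
import random


def log_unnormalised_posterior(mu, k, total_time, shape, rate):
    """Return log[P(D | mu) * P(mu)] up to an additive constant.

    The marginal likelihood P(D) is deliberately left out: it does not
    depend on mu, so it cancels in the acceptance ratio below.
    """
    if mu <= 0.0:
        return -math.inf
    # Toy Poisson likelihood of k substitutions over total_time years
    # (terms that do not depend on mu are dropped).
    log_likelihood = k * math.log(mu * total_time) - mu * total_time
    # Gamma(shape, rate) prior on mu (normalising constant dropped).
    log_prior = (shape - 1.0) * math.log(mu) - rate * mu
    return log_likelihood + log_prior


def sample_posterior(k, total_time, shape=20.0, rate=40.0,
                     n_steps=40000, burn_in=5000, step=0.1, seed=1):
    """Random-walk Metropolis sampler for mu; returns post-burn-in samples."""
    rng = random.Random(seed)
    mu = shape / rate  # start the chain at the prior mean
    current = log_unnormalised_posterior(mu, k, total_time, shape, rate)
    samples = []
    for i in range(n_steps):
        proposal = mu + rng.gauss(0.0, step)  # symmetric proposal
        proposed = log_unnormalised_posterior(proposal, k, total_time, shape, rate)
        # Accept with probability min(1, posterior ratio); P(D) has cancelled.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            mu, current = proposal, proposed
        if i >= burn_in:
            samples.append(mu)
    return samples


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical strict prior with mean shape / rate = 0.5.
    little = sample_posterior(k=0, total_time=2.0)    # very little data
    much = sample_posterior(k=60, total_time=200.0)   # much more data
    # For this conjugate toy model the exact posterior mean is
    # (shape + k) / (rate + total_time), which allows a sanity check.
    print("little data:", mean(little), "exact:", 20.0 / 42.0)
    print("much data:  ", mean(much), "exact:", 80.0 / 240.0)
```

With very little data the posterior mean stays close to the prior mean of 0.5, whereas with more data it moves towards the value supported by the data (60 / 200 = 0.3). The conjugate Gamma-Poisson pair was chosen here only because it provides an exact answer against which the sampler can be checked; real phylodynamic posteriors have no such closed form.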

1.3 outline

This thesis mainly consists of the work I have done on estimating relative transmission fitness for drug resistance in Mycobacterium tuberculosis. It also includes some work on other infectious diseases (HIV and HBV) as well as the work on teaching the members of the scientific community how to use BEAST2 for Bayesian phylogenetic and phylodynamic analyses. Chapter 2 presents existing approaches to TB modelling. It includes both epidemiological and evolutionary models, as well as a combination of the two, providing a systematic overview of past and present approaches to modelling TB epidemics. It also outlines current challenges

2 While marginal likelihood, of course, has a direct effect on the posterior, most of the time it is impossible to calculate directly, especially for models with many parameters. This computational problem can be avoided by operating on ratios of posterior values, thus cancelling out the marginal likelihood. BEAST2, among other software implementations of Bayesian phylodynamic inference tools, uses the Bayesian MCMC algorithm to walk the space of posterior probability. 3 While non-informative priors seem like a way to avoid informing specific parameter estimates, these priors still in fact specify very strong assumptions. For example, when specifying a uniform prior from 0 to infinity on a specific parameter, one may think they are being as non-informative as possible, while in fact the prior puts an extremely high weight on any values e.g. larger than one. 6 introduction

in modelling and analysis and areas of development that need to be addressed in the future. This chapter has been published as part of the book titled “Strain Variation in the Mycobacterium tuberculosis Complex: Its Role in Biology, Epidemiology and Control”, with me as the first author. Chapter 3, Chapter 4 and Chapter 5 are on TB clustering and phylodynamic analysis. Chapter 3 first describes the simulation study that I ran in order to verify a method for quantifying relative transmission fitness for MDR-TB strains on simulated TB data. After a successful simulation study I used the method to analyse the relative fitness of MDR-TB with additional resistance to pyrazinamide, which showed a strong signal for a high cost of such resistance in the dataset. This chapter, for which I am the first author, is currently under review in journal Epidemics. Chapter 4 describes results obtained by the same the method on a similar dataset from Georgia and then a comparison of the results to the results from Chapter 3, which shows a strong influence of location and strain type on the final relative fitness estimates. The manuscript, for which I am the first author, is currently in preparation for submission. Chapter 5 describes the different genotyping methods currently used for TB and their influence on the timeframes spanned by the clusters defined using different existing approaches. It shows that while all genotyping methods are useful in certain situations, in order to evaluate the parameters of an epidemic in as close to real time as possible whole genome sequencing is required. This chapter, where I assisted in setting up the phylodynamic analyses, was published in EBioMedicine with me as a middle author. Chapter 6, describes the successful usage of the computational approach presented in Chap- ter 3 to infer the relative fitness of drug-resistant HIV strains. 
This work, where I implemented the reparametrisation of the model to estimate relative fitness, was published in PLOS Pathogens, with me as a middle author. Chapter 7 is about the spread of Hepatitis B (HBV) and D (HDV) viruses in an untreated African rural community. My work there was on analysing the clustering of the two different phenotypes of HBV, active and occult, in the population in question. This work, where I performed multi-type phylodynamic analyses on HBV data, was published in mSystems, with me as a middle author. Chapter 8 is about the aftermath of the BEAST2 workshop “Taming the Beast”, which was initiated and first organised by our research group in 2016. This chapter describes the online resource hub which allows users from all around the globe to learn the basics of using Bayesian phylodynamic methods for the analyses of genetic sequencing data and to learn about the influence of the assumptions that these methods make. This work was published in Systematic Biology, where I am a shared first author. Finally, Chapter 9 holds the discussion of the work presented in this thesis and an outlook on the work that still lies ahead.

2 EXISTING APPROACHES TO TB AND MDR-TB MODELLING

This chapter was published as part of a book on strain variation in Mycobacterium tuberculosis and describes a wide range of approaches to TB mathematical modelling. The models described form the basis of our understanding of how TB evolves, spreads, and interacts with the human population as a whole as well as with individual patients. This work looks at evolutionary modelling of TB and one of the major obstacles to efficient within- and between-patient strain tracking, which is the relatively slow evolution of Mycobacterium tuberculosis (Duchene et al., 2016). This poses a particular issue, as a wide range of existing models were originally developed for faster, mainly viral, pathogens, or for pathogens with much simpler dynamics (e.g. with no latency or chance of relapse) (Biek et al., 2015). This chapter also covers different approaches to epidemiological modelling of TB, from simple models that only approximate the spread of a single strain and disregard any patient- or strain-specific information, to complex models that include such epidemiologically characteristic traits as latency and relapse, as well as patient- or strain-specific parameters. Additionally, this work describes methods that integrate the epidemiological and the evolutionary models in an attempt to gather a more complete understanding of the dynamics of TB. This chapter overall reinforces one of the main messages of this thesis, which is that whole genome sequencing will likely become the norm, superseding other genotyping methods, and this calls for improved analysis methodology. Broader sampling and more varied datasets will allow us to develop better methods and consequently to improve our predictive capabilities. And, of course, while there is plenty of work to be done in theoretical model development, even more work needs to be done in the implementation of numerically stable and efficient methods of parameter estimation.
This work was published in November 2017 as a chapter titled “Mathematical models for the epidemiology and evolution of Mycobacterium tuberculosis” in the book “Strain Variation in the Mycobacterium tuberculosis Complex: Its Role in Biology, Epidemiology and Control”, DOI: 10.1007/978-3-319-64371-7, where I am the first author. Following is an author post-print of the chapter.

15. Mathematical models for the epidemiology and evolution of Mycobacterium tuberculosis

Jūlija Pečerska, James Wood, Mark M. Tanaka∗ and Tanja Stadler

Abstract This chapter reviews the use of mathematical and computational models to facilitate understanding of the epidemiology and evolution of Mycobacterium tuberculosis. First, we introduce general epidemiological models, and describe their use with respect to epidemiological dynamics of a single strain and of multiple strains of M. tuberculosis. In particular, we discuss multi-strain models that include drug sensitivity and drug resistance. Second, we describe models for the evolution of M. tuberculosis within and between hosts, and how the resulting diversity of strains can be assessed by considering the evolutionary relationships among different strains. Third, we discuss developments in integrating evolutionary and epidemiological models to analyse M. tuberculosis genetic sequencing data. We conclude the chapter with a discussion of the practical implications of modelling, particularly modelling strain diversity, for controlling the spread of tuberculosis, and future directions for research in this area.

Jūlija Pečerska
Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland, telephone: +41 61 387 34 48, e-mail: [email protected]

James Wood
School of Public Health and Community Medicine, UNSW Sydney, telephone: +61 2 9385 8769, e-mail: [email protected]

Mark Tanaka (∗Corresponding author)
School of Biotechnology & Biomolecular Sciences, and Evolution & Research Centre, UNSW Sydney, telephone: +61 2 9385 2038, e-mail: [email protected]

Tanja Stadler
Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland, telephone: +41 61 387 34 10, e-mail: [email protected]

1 Introduction

The causative agent of tuberculosis, Mycobacterium tuberculosis, emerged as a human pathogen around 70,000 years ago [34, 62], although conflicting estimates point to much later dates of around 5,000 years ago [16]. Forms of tuberculosis such as M. bovis that infect non-human animals evolved from human tuberculosis, indicating that the disease first appeared in humans before adapting to other animals. Mounting genetic evidence indicates that strain-to-strain variation in M. tuberculosis is more extensive than previously thought [15, 63]. Seven major lineages of modern-day tuberculosis have been identified [65] and specific strains are highly associated with geographic location [63, 75]. Molecular methods have helped identify finer scale variation within lineages, which we discuss in more detail in Section 3.

The increased availability of data both at the epidemiological and molecular level allows us to start raising complex questions about data interpretation and analysis. For instance, how do we understand and predict tuberculosis epidemics on the population level? How do we best use molecular data to shed light on the transmission dynamics of different M. tuberculosis lineages? These questions typically require collated data analysis under specific assumptions on the properties of M. tuberculosis, such as, for instance, the mechanisms of mutational change. Mathematical, or computational, modelling is a methodology that enables the precise description of assumptions in order to investigate model behaviour, qualitatively or quantitatively. Defined models can be combined with data and thus provide answers to scientific questions concerning the given dataset. This approach has been instrumental in the understanding of the physical sciences, and it has become more widely used in biology as biological data have become increasingly refined and quantitative in nature.
Mathematical models that are applied in biology range from being extremely simple and generic to being complex and specific. Simple models often enable an understanding of complex phenomena, while complex models have the advantage of being more realistic and detailed and thus may offer detailed quantitative insight. In the words of statistician George Box, however, “all models are wrong but some are useful”. The aim of modelling is to shed light on a phenomenon rather than to create a maximally realistic description of it.

In the study of infectious diseases models can extend our understanding of an epidemic by allowing us to predict population dynamics from basic knowledge of the natural history of a disease. Models can help evaluate the effects of any potential or actual interventions at the population level. By providing precise quantitative predictions mathematical models also play a role in drawing inferences from observational data, for example, by producing estimates of parameters relating to disease transmission.

In this chapter we consider how mathematical and computational models can be used to understand the variation in M. tuberculosis that has been revealed using molecular techniques. Two different modelling traditions are pertinent to this topic. First, epidemiological models address the dynamics of infectious diseases at the population level and enable researchers to consider possible outcomes including the effects of intervention strategies. Second, models of molecular evolution and population genetics concern the processes by which genomes undergo change. These models are generic in that they have not been developed for any particular species, and can be applied to M. tuberculosis to understand its variation and to reconstruct its evolutionary history. We will describe both of these approaches and their applications to M. tuberculosis. We will further discuss progress made in combining epidemiological and evolutionary elements within the same framework to analyse the diversity of M. tuberculosis.

2 Epidemiological modelling and analysis

In this section we will focus on epidemiological models and their application to M. tuberculosis. We will primarily consider models that assume that the host population is homogeneous, ignoring possible effects of heterogeneity in host behaviour on the dynamics of the epidemic. We will begin with generic models that describe epidemiological dynamics of a single disease variant and then describe models of TB epidemiology with heterogeneity in the pathogen population, e.g. due to the occurrence of drug-resistant strains.

2.1 Epidemiological modelling of M. tuberculosis

Epidemiological models traditionally separate host populations into distinct compartments according to their infection status. In the simplest scenario, an individual is either susceptible to infection, infectious, or recovered and therefore immune to reinfection. The numbers of individuals in each compartment are tracked by S(t), I(t) and R(t) respectively, where t stands for time. In what follows we will drop the “(t)” from the notation except where it is clearer to retain it. Typically, a susceptible host is assumed to transition to the infected state at a rate proportional to the number of infected individuals I (say βI), and an infected individual transitions to the recovered state at a constant rate (say γ). Without host birth or host death events, the total number of individuals in the three compartments is constant, N = S + I + R, where N is the total population size. The structure of this model is illustrated in the left panel of Figure 1. This model, which is known as the Susceptible-Infectious-Recovered (SIR) model, was initially studied in depth by Kermack and McKendrick [83] and has since been elaborated upon in many ways [42, 81].

[Figure 1: three compartmental diagrams. Panel (1): Susceptible S, Infectious I, Recovered R. Panel (2): Susceptible S, Exposed E, Infectious I, Recovered R. Panel (3): Susceptible S, Exposed E, Infectious I, Treated T.]

Fig. 1 Examples of compartmental models of infectious disease dynamics. (1) Classic SIR model with transmission rate β and recovery rate γ. (2) A more complex model with an exposed class E. In the case of tuberculosis, a proportion (1 − p) of new cases enter a state of latent infection modelled with E while the remainder (p) progress to active disease I. Latent infections reactivate at rate ν, active cases recover at rate γ and recovered individuals regress to the disease state at rate ω. (3) When modelling antibiotic use and drug resistance it is useful to modify the model to include a state for infected treated individuals T. In this model, active cases are detected and treated at rate τ, stop treatment at rate φ and treated individuals return to the uninfected class S at rate σ. Not shown are death from each compartment or birth into S. We note that published models of TB dynamics are varied and while they are in general similar to the structures shown in (2) and (3) there are differences that reflect differences in the questions being addressed.

The SIR dynamics can be modelled deterministically or stochastically. In the deterministic version, the change in the compartment sizes follows the ordinary differential equations

\[ \frac{d}{dt} S(t) = -\beta I(t) S(t) \tag{1} \]

\[ \frac{d}{dt} I(t) = \beta I(t) S(t) - \gamma I(t) \tag{2} \]

and R(t) = N − S(t) − I(t). If an epidemic starts with a single introduction of the infection into the population, initial conditions are set as S(0) = N − 1 and I(0) = 1.

In the stochastic formulation of the SIR model, S, I and R are integer-valued rather than real, and when an infection occurs I increases by one and S decreases by one. Given a very small time interval ∆t, the probability for infection to happen is assumed to be βSI∆t + o(∆t). The term βSI∆t is the probability for precisely one infection event to happen in time interval ∆t. The term o(∆t) summarises the probability for more than one infection event to happen, with the term o(∆t)/∆t approaching zero as ∆t approaches zero. This means that the waiting time until an infection event where I increases by one and S decreases by one is exponentially distributed with parameter βSI.
Similarly, upon recovery I decreases by one and R increases by one, and this event occurs with probability γI∆t + o(∆t).

The dynamics of the SIR model are well understood and are well described in multiple sources such as the text by Keeling and Rohani [81]. The assumptions of the SIR model are clearly too narrow to be directly applicable to M. tuberculosis. In particular, M. tuberculosis infection is characterised by a long and highly variable incubation period known as latent infection. Furthermore, hosts generally do not have strong immune protection against further infection and asymptomatic hosts often relapse to disease years after an acute infection. However, the basic methodology of SIR modelling can be modified to reflect the natural history of M. tuberculosis.

Extension of the SIR model to M. tuberculosis began with the work of H. Waaler et al. [138], with the inclusion of long-term latency in the form of non-symptomatic cases as a key feature. The model divides individuals into noninfected, noncase, actual disease case, and recovered compartments. The noncase individuals are the latently infected individuals that do not show symptoms immediately upon infection but can potentially progress (with some rate ν). The actual disease case individuals show symptoms and are thus infectious. Individuals move into the recovered compartment after an active M. tuberculosis infection. Recovered individuals may not have cleared the pathogen and thus may relapse with some rate ω.

Figure 1 (middle) shows a simple version of a model structure that captures key features, in particular latency and relapse, of M. tuberculosis natural history. Note that the R compartment here represents “recovery” from active tuberculosis, but individuals in this compartment have not necessarily cleared the M. tuberculosis pathogen, which is why they may relapse.
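As a minimal sketch of the model structure in Figure 1 (middle), the flows described in the figure caption can be written as ordinary differential equations and integrated numerically. The equations below are one possible reading of the caption for a closed population without births or deaths. The function names, the parameter values and the fixed-step Runge-Kutta integrator are illustrative assumptions and are not taken from the cited studies.

```python
def latency_relapse_rhs(state, beta, p, nu, gamma, omega):
    """Right-hand side of an S-E-I-R model with latency and relapse.

    Assumed reading of Figure 1 (2): a fraction p of new infections
    progresses directly to I, the rest enters E; E -> I at rate nu,
    I -> R at rate gamma, R -> I at rate omega.
    """
    s, e, i, r = state
    new_infections = beta * i * s
    return (
        -new_infections,
        (1.0 - p) * new_infections - nu * e,
        p * new_infections + nu * e - gamma * i + omega * r,
        gamma * i - omega * r,
    )


def rk4_step(f, state, dt, *params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, *params)
    k2 = f(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)), *params)
    k3 = f(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)), *params)
    k4 = f(tuple(x + dt * k for x, k in zip(state, k3)), *params)
    return tuple(
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_latency_relapse(n=1000.0, beta=0.01, p=0.05, nu=0.01,
                             gamma=0.5, omega=0.01, t_end=50.0, dt=0.01):
    """Integrate from a single introduction: S(0) = N - 1, I(0) = 1.

    All default values are hypothetical and chosen only for illustration.
    Returns the list of (S, E, I, R) states at each time step.
    """
    state = (n - 1.0, 0.0, 1.0, 0.0)
    trajectory = [state]
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(latency_relapse_rhs, state, dt,
                         beta, p, nu, gamma, omega)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    s, e, i, r = simulate_latency_relapse()[-1]
    print(f"S={s:.1f} E={e:.1f} I={i:.1f} R={r:.1f}")
```

Setting p = 1 and ω = 0 removes latency and relapse, in which case the sketch reduces to the SIR equations (1) and (2).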
The approach to incorporating latency has varied due to the lack of detailed quantitative information about the long-term dynamics of infection and the immune response within humans. Blower et al. [11] popularised the use of a dichotomous short-term/long-term characterisation of latency based around the rule of thumb that around 5% of infections progress quickly to active disease and about another 5% progress slowly over the remainder of a person’s life-span. This would be captured by p = 0.05 in Figure 1 (middle). In these models, a fraction of infected individuals progress immediately to active disease while the remaining fraction enter a latent state and progress to disease at a low rate. More recently, a modified form of this dichotomous transition has been introduced that more accurately captures the timing of active disease in relation to infection through stratification of latency into two stages (see for instance Dowdy and Cohen [44]).

Models of tuberculosis epidemiology have been used to characterise the decline of M. tuberculosis epidemics in the US and UK [11] and to determine the contribution of endogenous reactivation and exogenous reinfection to the overall risk of disease [137]. Blower et al. [11] used a model that allowed for infection, reactivation, and relapse, and showed that the apparent decline might be explained as a temporary effect following a large epidemic. In the work by Vynnycky and Fine [137], a more data-driven approach to M. tuberculosis infection risk estimation was taken, without modelling the underlying transmission process directly. The model was designed to evaluate the impact of new infections compared to reinfection and reactivation of the disease. The results suggest that in the UK reinfection made a strong epidemiological impact during the first half of the 20th century, but had negligible effects by 1980, by which time the incidence reached its lowest point.
This approach is less relevant to more recent epidemiological history in countries such as the UK, where cases have been increasingly driven by migration from high-incidence settings.

The observed variation in the effect of the BCG vaccine has also been investigated, for instance through the work by Gomes et al. [70], drawing on the earlier model-based analysis by Fine and Vynnycky [59]. The latter work was aimed at explaining the differential success of the BCG vaccine in different settings, ranging from high efficacy observed in the UK trials to no efficacy in the large Chingleput trial in India. This paper noted how exposure to M. tuberculosis and/or related mycobacteria could have a confounding effect on the estimates of vaccine efficacy. Animal studies have shown that the BCG vaccine did not provide a further benefit over the immunity derived from tuberculosis infection. The study by Fine and Vynnycky showed how variation in this background level of immunity affected estimates of vaccine efficacy, with these estimates becoming negligibly small when natural infection rates were very high and provided similar levels of protection as vaccination. Gomes et al. [70] explore these issues in a population dynamic model and link the analysis with the concept of a reinfection threshold, defined as the point at which the reproductive number (see Section 2.2) in a population composed of previously exposed individuals becomes greater than 1.

Further developments in tuberculosis modelling have focused on potential effects of variation in treatment strategies. Treatment strongly affects incidence, prevalence and mortality from TB and reduces the average duration of infectiousness and thus reduces the possibility of transmission. Treatment is typically incorporated in models through a modification of the rate of cure (as in Figure 1, right), although in models seeking to capture treatment programs more precisely, this may be described through multiple model states and transitions.
Such models have been used to examine the traditional World Health Organisation DOTS (directly observed treatment, short-course) approach [52] and various means of improvement, for example, through active case finding, changes in diagnosis and treatment regimens and wider application of preventive treatment of latent tuberculosis infection [51]. Of particular interest within the context of this book are those models that have looked at interactions between treatment programs and the development of resistance, which we cover in more detail in Section 5.

Modelling studies have also aimed to understand the origins of tuberculosis [25] and the longer term evolution of M. tuberculosis and its characteristics, such as latency [24, 152] and virulence [7].

2.2 Transmission parameters and their estimation

Epidemiological modelling has identified a critical quantity in infectious disease dynamics known as the basic reproduction number or R0 [4, 42]. This quantity is defined as the average number of new infectious cases arising from a typical case in a completely susceptible population. One of the principal reasons for interest in R0 is that it constitutes a threshold quantity for a large class of infectious disease models; namely, R0 > 1 implies that the disease can persist whereas R0 < 1 leads to disease elimination. Despite this simplicity the mathematical definition of R0 is a function of model structure and reflects details of disease epidemiology. Under the SIR model, R0 = βS(0)/γ. However, even quite simple models of TB such as defined in [11] lead to more complex expressions, with R0 defined as the sum

\[ R_0 = R_0^{\mathrm{FAST}} + R_0^{\mathrm{SLOW}} + R_0^{\mathrm{RELAPSE}}. \]

Each component here is a function of multiple parameters. The SIR model formulation of R0 resembles only the $R_0^{\mathrm{FAST}}$ component whereas the transmission potential for tuberculosis is complicated by the processes of reactivation ($R_0^{\mathrm{SLOW}}$) and relapse ($R_0^{\mathrm{RELAPSE}}$). Other studies have additionally included the process of reinfection [70]. For a general derivation and discussion of the basic reproduction number we refer readers to the text by Diekmann & Heesterbeek [41].

In practice, populations are not usually fully susceptible and so the production of new cases is slower than indicated by R0. New cases are better understood through the related quantity known as the effective reproduction number, usually labelled Re. This quantity is defined as the average number of new infectious cases produced by a typical case regardless of the susceptible proportion. At the start of a local epidemic Re = R0, and over time it decreases to a value < 1 if the epidemic ends and is unable to persist, or remains close to 1 if the disease persists endemically.
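For the SIR model the threshold behaviour can be read off the deterministic equation for the infectious class: dI/dt = βIS − γI = γI(βS/γ − 1), so the number of infectious hosts grows exactly when βS/γ > 1. The short sketch below evaluates these quantities. Treating βS(t)/γ as the effective reproduction number is the usual SIR convention and is an assumption here, as are the function names and the numerical values.

```python
def basic_reproduction_number(beta, gamma, s0):
    """R0 under the SIR model: beta * S(0) / gamma."""
    return beta * s0 / gamma


def effective_reproduction_number(beta, gamma, s_t):
    """Assumed SIR convention: Re(t) = beta * S(t) / gamma."""
    return beta * s_t / gamma


def infectious_growth_rate(beta, gamma, s_t, i_t):
    """dI/dt rewritten as gamma * I * (Re - 1)."""
    re_t = effective_reproduction_number(beta, gamma, s_t)
    return gamma * i_t * (re_t - 1.0)


if __name__ == "__main__":
    # Hypothetical values: N = 1000 with a single introduction.
    beta, gamma, n = 0.0005, 0.25, 1000
    r0 = basic_reproduction_number(beta, gamma, n - 1)
    growth = infectious_growth_rate(beta, gamma, n - 1, 1)
    print(f"R0 = {r0:.3f}, initial dI/dt = {growth:.4f}")
```

In this parametrisation the epidemic stops growing once the susceptible pool falls to S = γ/β, the point at which Re = 1.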
Knowledge of reproduction numbers is important for informing strategies for controlling infectious diseases but estimating these quantities poses some challenges. In particular, direct observation of the average number of secondary infections produced by infectious individuals is not feasible with conventional epidemiological methods. Instead reproduction numbers are typically estimated implicitly through comparisons of model outcomes with infection history data or incidence and prevalence time-series. Earlier approaches are summarised in [43], while examples of approaches for stochastic models and structured communities are provided in the text by Becker [8].

Empirical data can be used to estimate parameters which in turn allow investigation of the long term characteristics of epidemics using epidemiological models. The dynamics can also be studied through computer simulations of stochastic formulations of models, which can in turn be compared with data that directly reflect transmission such as contact tracing studies or analysis of M. tuberculosis infection within households. Brooks-Pollock et al. [18], for example, used incidence data across multiple-person households to estimate the relative contributions of community and households to M. tuberculosis transmission.

For tuberculosis, the differing components of the reproduction number as described above make this more challenging. Estimation of the basic reproduction number is also complicated by the absence of good diagnostic markers for immunity. As such, in contrast to infections such as influenza and measles, practical epidemiological models of tuberculosis have focused more on projections of disease incidence than analysis of the reproduction number and the potential for elimination [51]. However, the growing availability of molecular data provides opportunities to overcome some of the issues outlined above. In particular, molecular data provide information that can potentially allow models to distinguish between recent infection, reactivation and relapse [119]. Stochastic formulations of the epidemiological models in combination with molecular data, rather than deterministic formulations in combination with the traditional epidemiological data discussed here, will be the focus of Section 4.
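To illustrate what a computer simulation of a stochastic formulation can look like, the sketch below simulates the stochastic SIR model of Section 2.1 event by event. It uses the standard Gillespie-style construction, which is an assumption of this example rather than a method named in the text: the waiting time to the next event is exponential with rate βSI + γI, and the event is an infection with probability βSI/(βSI + γI). This is equivalent to the infection and recovery events having independent exponential waiting times with parameters βSI and γI. Function names and parameter values are hypothetical.

```python
import random


def gillespie_sir(n=200, beta=0.002, gamma=0.2, seed=1):
    """Simulate one stochastic SIR epidemic until no infectious hosts remain.

    Starts from a single introduction, S(0) = n - 1 and I(0) = 1.
    Returns a list of (time, S, I, R) tuples, one per event.
    """
    rng = random.Random(seed)
    s, i, r = n - 1, 1, 0
    t = 0.0
    history = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        # Waiting time to the next event of either type.
        t += rng.expovariate(total_rate)
        if rng.random() < infection_rate / total_rate:
            s, i = s - 1, i + 1   # infection event
        else:
            i, r = i - 1, r + 1   # recovery event
        history.append((t, s, i, r))
    return history


if __name__ == "__main__":
    final_time, s, i, r = gillespie_sir()[-1]
    print(f"epidemic ended at t={final_time:.2f} with {r} recovered hosts")
```

Repeating the simulation with different seeds gives a distribution of outbreak sizes, which is the kind of output that can be compared with household or contact tracing data.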

2.3 Modelling heterogeneous epidemics

Thus far we have discussed models in which infections are homogeneous with all prevalent infectious cases represented by the single I compartment. However, pathogen populations may be variable, for example, in levels of drug resistance. Such variation could cause hosts infected with different strains to transmit with different rates. One can model a simple case by subdividing the I class into I_S and I_R classes: the hosts infected with a drug sensitive strain, and the hosts infected with a drug resistant strain respectively. Correspondingly, one needs to divide the E class into E_S and E_R. Hosts may move from I_S to I_R through resistance evolution, and potentially move from I_R to I_S through loss of resistance.

Similarly, there may be variation in the host population both in terms of transmission risks (e.g. geographic variation in incidence) and in risks of developing active TB, for example through HIV co-infection. Again, we can divide the compartments according to this variation, for example into I_1 and I_2. As done above, we subdivide the S class as well as the E class as the difference in behaviour does not depend on the infection status. Figure 2 illustrates an example of how variation in pathogens (left) and variation in hosts (right) could be modelled, extending the homogeneous epidemic model displayed on the right of Figure 1.

Such models have been used to model different types of variation in viral epidemics, such as geographical variation, drug resistance levels and super-spreader behaviour. For M. tuberculosis, outside of settings with high HIV prevalence, the main source of variation that has been considered to date is in relation to drug resistance and its interaction with treatment programs, which we now discuss in more detail.

As with other infectious diseases, the rise of antimicrobial drug resistance is a problem for the control of M. tuberculosis.
While multi-drug resistant (MDR) or extensively drug-resistant (XDR) tuberculosis is likely to be the result of treatment failure, it also occurs with lower frequency in new cases. These issues are discussed in more detail in Chapters 11 and 14 but here we briefly describe key features of epidemiological models of resistance.

In the context of population dynamics, drug resistant strains have two critical properties. First, they can persist longer in patients because standard treatment is less effective. Second, they may come with a fitness cost that lowers their rate of transmission. The cost may be due to a trade-off between the original function of the gene and the resistance phenotype [Chapter 14]. The fitness cost has been measured in vitro [64] and it has been observed to be variable across M. tuberculosis strains. Such in vitro measurements assess the replicative capacity of a strain, which differs from the ability of the strain to infect new hosts, i.e. the transmission fitness, but the two fitness concepts are assumed to be linked. While in vitro strains have shown a replication fitness disadvantage when compared to their rifampicin-susceptible ancestors, some of the clinically derived strains show no fitness costs. One reason for this variability is the possibility that further mutations occur that lower or compensate for the cost of resistance. Thus, it is unlikely that drug resistant M. tuberculosis will easily revert to sensitivity [26].

[Figure 2: two compartmental diagrams, (1) pathogen variation and (2) host variation, with compartments S, E, I and T subdivided by numeric subscripts.]

Fig. 2 Examples of compartmental models with pathogen variation and host variation. (1) When pathogen variation is taken into account the infected classes are partitioned into multiple states corresponding to the multiple pathogen types. The variation can be with respect to resistance or other genetic characters. Here we do not show transitions between types, which depend on the details of the model. (2) When host variation is taken into account the hosts are partitioned into susceptibility classes. For simplicity the figure does not show births and deaths for either model.

The dynamics of drug resistance in M. tuberculosis have been studied using compartmental models as introduced above and shown in Figure 2 (left). The first such models were introduced by Blower and colleagues [10, 12], modelling two types of infection: drug sensitive and drug resistant. Models have since been extended and refined, maintaining a structure similar to that shown in Figure 3 (see also [104]). This figure is a special case of models with pathogen heterogeneity as shown at left in Figure 2, with the addition of a transition between sensitive and resistant treatment classes.
[Figure 3: compartmental diagram with Susceptible S; Latent E_DS and E_DR; Infectious I_DS and I_DR; Treated T_DS and T_DR.]

Fig. 3 An example of a compartmental model of tuberculosis with drug resistance evolution and transmission. This model is an extension of the treated TB model shown in Figure 1 (right) in which drug sensitive M. tuberculosis can evolve resistance de novo and subsequently transmit to new hosts. The subscripts indicate drug sensitive (DS) and drug resistant (DR) infection states. For simplicity, rates are omitted from this diagram.

Such models on the evolution and spread of drug resistance have been studied to address a number of epidemiological problems, in particular to clarify the importance and variability of the reproductive and transmission fitness of drug resistant strains. Such insight allows us to quantify the future burden of drug resistant strains on an epidemiological scale. The replicative fitness of resistant strains is variable [50, 64], but there is evidence for lower rates of transmission of resistant strains [50, 93]. On the epidemiological scale this cost is balanced by the advantage resistant strains have in patients on treatment [93]. Furthermore, since the cost of resistance can be lowered by compensatory mutations, the resulting variation in M. tuberculosis fitness means that without adequate control strategies, resistant strains will likely dominate in the long run [10, 26, 93, 118, 133]. Models can be used to explore control scenarios (e.g. rates of detection and cure success) that could lead to the control of drug-resistant tuberculosis [48, 49]. Modelling also allows quantification of the relative importance of the two factors contributing to the spread of drug resistant strains, namely de novo evolution of resistance versus transmitted resistance. Modelling studies strongly suggest that in most settings a large majority of drug resistant cases are due to the transmission of resistant strains rather than the de novo acquisition of resistance [26, 82, 93, 133].
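The following sketch shows how a model with the structure of Figure 3 could be used to separate transmitted from acquired resistance. Because the figure omits rates, every rate below is an assumption: the resistant strain transmits at (1 − cost) times the sensitive rate, resistance is acquired during treatment at a hypothetical rate rho, and resistant cases are cured more slowly (sigma_r < sigma_s). The population is closed, all numerical values are invented for illustration, and the output should not be read as a result of the cited studies.

```python
def two_strain_rhs(y, beta_s, cost, p, nu, tau, phi, sigma_s, sigma_r, rho):
    """Assumed rates for an S, E_DS, E_DR, I_DS, I_DR, T_DS, T_DR model.

    The last two entries of y accumulate resistant infections caused by
    transmission and by de novo acquisition during treatment.
    """
    s, e_s, e_r, i_s, i_r, t_s, t_r, _transmitted, _acquired = y
    inf_s = beta_s * i_s * s
    inf_r = (1.0 - cost) * beta_s * i_r * s   # transmission fitness cost
    acquired = rho * t_s                      # de novo resistance
    return [
        -(inf_s + inf_r) + sigma_s * t_s + sigma_r * t_r,
        (1.0 - p) * inf_s - nu * e_s,
        (1.0 - p) * inf_r - nu * e_r,
        p * inf_s + nu * e_s - tau * i_s + phi * t_s,
        p * inf_r + nu * e_r - tau * i_r + phi * t_r,
        tau * i_s - (phi + sigma_s + rho) * t_s,
        tau * i_r + acquired - (phi + sigma_r) * t_r,
        inf_r,
        acquired,
    ]


def rk4_step(f, y, dt, **params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(y, **params)
    k2 = f([a + 0.5 * dt * b for a, b in zip(y, k1)], **params)
    k3 = f([a + 0.5 * dt * b for a, b in zip(y, k2)], **params)
    k4 = f([a + dt * b for a, b in zip(y, k3)], **params)
    return [a + dt / 6.0 * (b + 2.0 * c + 2.0 * d + e)
            for a, b, c, d, e in zip(y, k1, k2, k3, k4)]


def simulate_two_strain(n=1000.0, t_end=50.0, dt=0.01, **overrides):
    """Integrate from one drug sensitive case; returns the final state."""
    params = dict(beta_s=0.01, cost=0.2, p=0.05, nu=0.01, tau=1.0,
                  phi=0.1, sigma_s=1.5, sigma_r=0.3, rho=0.02)
    params.update(overrides)
    y = [n - 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        y = rk4_step(two_strain_rhs, y, dt, **params)
    return y


def transmitted_fraction(y):
    """Share of resistant infections that arose through transmission."""
    transmitted, acquired = y[7], y[8]
    total = transmitted + acquired
    return transmitted / total if total > 0 else float("nan")


if __name__ == "__main__":
    final = simulate_two_strain()
    print(f"transmitted share of resistance: {transmitted_fraction(final):.2f}")
```

Varying cost, rho and the cure rates in such a sketch is one way to see how the balance between de novo and transmitted resistance depends on the assumed fitness cost and treatment success.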

3 Molecular evolution of M. tuberculosis

The fields of population genetics and molecular evolution are dedicated to analysing and understanding genetic variation within and among species. Many models and methods have been developed for these purposes, most of which are general and not specific to pathogens. In this section we discuss applications of this theory to M. tuberculosis. We start with standard models and methods in molecular evolution and phylogenetics, and then discuss some of the specific issues that arise in applying these methods to genetic data obtained from M. tuberculosis isolates.

All organisms, including M. tuberculosis, may accumulate mutations through replication. Many of these changes are never observed because the resulting mutants may suffer a large fitness disadvantage and are eliminated from the population. Other mutations, however, rise in frequency and may become fixed in a population. A mutation which is fixed in a population is called a substitution. To define this term more precisely, it is necessary to clarify what population is being considered. The evolution of M. tuberculosis occurs on at least two different population scales. First, the global population of M. tuberculosis may undergo evolutionary substitution. This process of substitution is important to consider when comparing M. tuberculosis to other species. Second, evolutionary substitution occurs at a local level within hosts. In this case, mutants arise within each host and may reach fixation in the host. In this section we are primarily concerned with substitution at the local host scale rather than the global population scale.

Genetic changes accumulate within hosts so that bacterial populations in different hosts generally diverge. This genetic divergence provides information about the evolutionary history of the bacterium, which we can represent with a phylogenetic tree. A similar but distinct tree concept used in population genetics studies and implemented in some software is the coalescent, which describes genealogical history viewed backwards in time (for further details see [139]). Coalescents are also known as gene trees.

Phylogenies are appropriate when there is no horizontal gene transfer or recombination and no convergent evolution. In contrast to many other bacteria, M. tuberculosis exhibits remarkably little recombination [91] and there is currently no evidence it carries plasmids [151]. This makes the analysis of genomes relatively straightforward as only substitutions on a phylogenetic tree need to be modelled. In particular, phylogenies of M. tuberculosis based on different genetic loci have tree-like structures rather than being reticulate or networks. On the other hand, some genes in the M. tuberculosis genome may be under strong natural selection and can therefore exhibit convergent evolution. In particular, sites conferring drug resistance are known to undergo convergence and are therefore typically excluded from any phylogenetic analysis.

In the next subsections we will first discuss evolutionary models for markers of M. tuberculosis and then evolutionary models for all nucleotides within the M. tuberculosis genome. We will then illustrate how these models are used in population genetics to assess the within- and between-host variation of M. tuberculosis strains.

3.1 Evolutionary models for markers of M. tuberculosis

In an effort to characterise and understand the transmission of tuberculosis, isolates of M. tuberculosis have been genotyped for many years using a variety of molecular methods [Chapter1]. Early methods included typing based on the mobile genetic element IS6110 [3, 21, 119], spacer-oligonucleotide typing or spoligotyping [80] and MIRU-VNTR typing [126]. In recent years, the declining cost of DNA sequencing has enabled the use of whole-genome sequencing (WGS) as a strategy for studying M. tuberculosis transmission [67, 140]. WGS allows a fine-scale genetic characterisation of M. tuberculosis strains within a population, and has the advantage over previous technologies of minimising the impact of parallel evolution [Chapter3].

The genetic variation observed at marker loci ultimately arises from mutation processes. To exploit this variation for studying M. tuberculosis epidemiology it is useful to know how and at what rate the underlying genetic loci mutate and undergo within-host substitution [128]. The simplest model to apply is a Poisson process in which mutations appear randomly at a constant rate per unit time and reach fixation instantly. This model and elaborations of it have been used to estimate mutation rates. An alternative approach is to compare the extent of variation using different markers to estimate rates against known mutation rates [110].

Genomic variation in M. tuberculosis is partly due to the movement of the mobile genetic element IS6110. The rate of movement of IS6110 has been estimated to be around 0.1 to 0.3 changes per year using serial isolates of M. tuberculosis [36, 113, 141]. Spoligotype variation is due to deletion of repeats at a CRISPR locus [6]. The evolutionary process at this locus, with an estimated rate of around 0.02 to 0.09 per year [110], is slower than changes due to IS6110 movement. VNTR loci mutate by expanding or contracting in the number of repeats. The stepwise mutation models developed for microsatellites in humans [38] are likely to apply well to bacteria [134]. Under the simplest version of such models, repeats increase or decrease by a single step and with equal probability. Estimates of the mutation rate of VNTR loci have varied considerably, with low rates around $10^{-5}$ per locus per year [71] and $10^{-3.9}$ [146] and high rates of $10^{-3}$ to $10^{-2}$ [1, 108, 110]. The variation in these estimates may reflect the diversity of models, methods and data used to obtain them.

Single nucleotide polymorphisms (SNPs) arise through point mutation, which can occur throughout the entire genome. Whole genome sequences analysed with phylogenetic and similar methods have yielded conflicting estimates of mutation rates varying from $3 \times 10^{-9}$ [34] to $10^{-7}$ [16, 140]. The higher rates are supported by in vitro studies of mutation rates [61] but it should be noted that the “long-term” rate of evolution is likely to be lower because estimates based on recent variation include polymorphisms that ultimately will not be fixed in the population (e.g. deleterious mutations) [76]. Mutation rates also vary substantially depending on genetic background [61]. Whether mutation rates during latent infection are equal to [60, 88] or lower than [29] rates during active infection is not settled. We suggest that the uncertainty in mutation rates can be further addressed in the future through modelling and careful consideration of assumptions underlying models and data.

One of the challenges in using genetic markers for phylogenetic analysis is the occurrence of parallel evolution (or homoplasy), by which identical states are reached by mutation in independent infections. Such events can potentially undermine the phylogenetic analysis of M. tuberculosis [33], although a simulation study has shown that the impact of homoplasy is not necessarily large on the epidemiological scale [112]. The arrival of low-cost genome sequencing has removed the obstacle of homoplasy since it does not strongly affect genome-wide SNPs (after removing SNPs implicated in drug resistance).
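
The Poisson and stepwise mutation models described above can be illustrated with a short simulation. The following is a minimal sketch rather than a published method: the function name `simulate_stepwise_locus`, its parameters and the floor of one repeat are assumptions made for illustration only.

```python
import random


def simulate_stepwise_locus(start_repeats, rate_per_year, years, rng):
    """Simulate one VNTR locus under a symmetric single-step mutation model.

    Mutation events arrive as a Poisson process with constant rate
    `rate_per_year`, so waiting times between events are exponential.
    Each event adds or removes one repeat with equal probability and is
    assumed to reach fixation instantly. Returns the final repeat count.
    """
    repeats = start_repeats
    t = rng.expovariate(rate_per_year)  # time of the first mutation event
    while t < years:
        step = rng.choice((-1, 1))
        # Assumption: the locus never drops below one repeat.
        repeats = max(1, repeats + step)
        t += rng.expovariate(rate_per_year)
    return repeats


def expected_mutations(rate_per_year, years):
    """Expected number of events of a Poisson process over `years`."""
    return rate_per_year * years
```

For example, `simulate_stepwise_locus(4, 1e-3, 50.0, random.Random(1))` follows a single locus for 50 years at a rate within the higher range quoted above; repeating the call over many loci and seeds gives a distribution of repeat counts that could be compared with observed MIRU-VNTR profiles.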

3.2 Evolutionary models for whole genomes of M. tuberculosis

A wide array of molecular evolution models of substitution have been developed for nucleotide changes along a genome, covering a range of complexities. These models all assume that each nucleotide evolves independently of the other nucleotides. The simplest is the Jukes-Cantor model (JC69) which assumes that all substitutions among nucleotide bases occur at an equal rate and that base frequencies are all equal [79]. Kimura’s two parameter model (K80) allows different transition and transversion rates, while keeping equal base frequencies [84], whereas Felsenstein’s model (F81) keeps equal rates while allowing varying nucleotide base frequencies [56]. More complex models include the HKY85 model which does not assume equal base frequencies and accounts for the difference between transitions and transversions [73], the TN93 model which distinguishes not only transitions/transversions, but also differentiates between purine and pyrimidine transitions [127], and the generalised time-reversible (GTR) model, which is the least restrictive time-reversible model possible [131].

Generally, the rate of substitution has been inferred to vary across sites. To account for such variation, it is common to model the substitution rate across sites with either a continuous gamma distribution or a discrete approximation with a suitable number of rate classes [147, 149]. These substitution models can then be used in combination with genetic data to compute likelihoods or genetic distances for phylogenetic reconstruction, as we next discuss.
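
To make the difference between JC69 and K80 concrete, the sketch below computes the standard closed-form pairwise distances under both models. It assumes aligned sequences of equal length containing only A, C, G and T, with no gaps or ambiguity codes; the function names are illustrative and not taken from any particular package.

```python
import math

PURINES = {"A", "G"}


def site_differences(seq1, seq2):
    """Return the proportions of transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


def jc69_distance(seq1, seq2):
    """JC69 distance: d = -(3/4) ln(1 - 4p/3), with p the proportion of differing sites."""
    ts, tv = site_differences(seq1, seq2)
    p = ts + tv
    return -0.75 * math.log(1 - 4 * p / 3)


def k80_distance(seq1, seq2):
    """K80 distance: d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), P transitions, Q transversions."""
    ts, tv = site_differences(seq1, seq2)
    return -0.5 * math.log((1 - 2 * ts - tv) * math.sqrt(1 - 2 * tv))
```

Both functions raise a `ValueError` from `math.log` when the sequences are so divergent that the correction is undefined. For the closely related M. tuberculosis genomes discussed here, the two distances will usually be nearly identical, since few sites differ.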

3.3 Phylogenetic reconstruction

A number of approaches have been developed for reconstruction of phylogenetic trees from genetic data. Distance-based methods use a measure of genetic distance – an estimate of the degree of dissimilarity between two sequences – to reconstruct evolutionary history. Other methods optimise a criterion measuring how well a tree explains the genetic sequence alignment over the space of possible phylogenetic trees.

In distance-based methods, similar sets of taxa are grouped together whereas more distant ones are placed further apart on a tree according to entries in a pairwise distance matrix. The branch lengths of the inferred phylogeny are a close, but in general imperfect, representation of the inter-sequence distance matrix. An example of a distance-based method is the neighbour-joining method which is a bottom-up clustering algorithm that joins nodes of the tree according to the shortest distance between two existing nodes [114]. It is statistically consistent¹, but does not maximise a criterion for measuring how well a tree explains the data.

Alternative methods of tree reconstruction from genetic data include maximum parsimony, maximum likelihood and Bayesian tree reconstruction, which all search tree space while optimising a criterion. The maximum parsimony method chooses the tree that minimises the number of substitutions required to explain the data. This method is quick and easy but it has been shown to be statistically inconsistent [55]. The maximum likelihood method searches the tree space to maximise the probability of the data given a particular tree structure (the likelihood of the tree given the data). Bayesian methods also search the tree space, yielding a posterior distribution of trees. As such, Bayesian methods produce multiple trees; they also accommodate prior distributions on trees and parameters. In the maximum likelihood and Bayesian frameworks, the substitution model is used to evaluate the likelihood of trees and model parameters when considering the data. Thus, the methods allow multiple substitutions, parallel substitution, convergent evolution and back substitution along branches, and the complexity of the substitution model can be adjusted to improve the fit to the data. Both frameworks are statistically consistent [125, 148], and are currently the most widely used phylogenetic reconstruction methods.

Phylogenetic reconstruction methods produce trees with branch lengths measured in numbers of substitutions. Exceptions are the Bayesian methods assuming a clock model [46, 153], which lead to branch lengths in units of calendar time. Branch lengths in calendar time are important for quantifying the timing of epidemic spread. Recently, a method to infer branch lengths in calendar time based on a tree with branch lengths in numbers of substitutions was introduced (the LSD (least-squares dating) software [132]). For more details on phylogenetic methods, we refer readers to the texts by Yang [150] or Felsenstein [57].

¹ Statistical consistency of a phylogenetic method means that when given infinitely long genetic sequences, the method – employing the model under which the sequences evolved – will recover the true underlying phylogeny.
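
As an illustration of the parsimony criterion mentioned above, the following sketch scores a fixed rooted binary tree with Fitch's algorithm, which counts the minimum number of substitutions needed to explain an alignment on that tree. The nested-tuple tree encoding and the function names are assumptions chosen for brevity; searching over tree space, which a full parsimony method also requires, is not shown.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum substitutions) for one alignment column.

    `tree` is either a taxon name (a leaf) or a pair (left, right) of subtrees;
    `states` maps each taxon name to its nucleotide at this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no extra substitution is needed.
        return shared, left_cost + right_cost
    # Otherwise one substitution is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all columns; `alignment` maps taxon name to sequence."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total
```

Comparing `parsimony_score` across candidate topologies identifies the most parsimonious of them; for instance, with taxa `a` and `b` sharing one base and `c` and `d` another, the tree `(("a", "b"), ("c", "d"))` scores lower than `(("a", "c"), ("b", "d"))`.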

3.4 Classification of TB using genetic data

The application of molecular technologies and phylogenetic methods also enables the classification of pathogen isolates into broad classes. In relation to M. tuberculosis, extensive genetic research has identified the structure of the complex of closely related M. tuberculosis species and the seven major extant lineages [65]. The classification of M. tuberculosis can be further refined by considering relationships within lineages using fast-evolving molecular markers. Using phylogenetic methods, attempts have been made to date the first introduction of M. tuberculosis into the human population and to describe the patterns of world-wide M. tuberculosis distribution [62]. The suitability of alternative genetic approaches can be evaluated by comparing phylogenies reconstructed from different types of markers to a “gold standard” phylogeny in order to identify flaws in commonly used methods and provide means of quick typing of isolates [58].

When the isolates in a data set are closely related to each other – such as isolates from a single outbreak – an alternative approach to phylogenies is to show the direct relationships among the genotypes in graphs such as minimum spanning trees. The underlying assumption is that all substitutions that occurred are observed, so that complex substitution models to account for hidden ancestral substitutions are not needed. This approach to visualisation and classification is aided by specifically modelling the processes of substitution underlying the genetic markers. Models of spoligotype evolution have been used to show relationships among isolates [19, 103, 111, 116, 117]. Relationships among isolates based on MIRU-VNTR can also be visualised using graphs [143]. Large international databases based on multiple genotyping methods including MIRU-VNTR and spoligotypes aid in the classification of isolates [37, 143].

SNPs obtained from whole genome sequencing can also be visualised through graphs within outbreaks [140] or through phylogenies when analysing highly divergent isolates with ancient origins [16, 34, 146]. It is anticipated that future epidemiological studies will increasingly make use of whole genome sequencing.

3.5 Population genetics of TB

So far we have considered models of substitution (i.e. fixation of a mutation in a population), and how the variation between isolates can be represented in a tree. Population genetics “zooms into” the process of substitution by modelling the origins and dynamics of genetic variation including the process of fixation. A natural null model in population genetics is the neutral model in which all genetic variants are selectively equivalent. In this model the process of mutation generates new variants (alleles) while randomness – genetic drift – leads to loss of variation or to fixation (i.e. substitution). The dynamic balance between these two processes – mutation and drift – has been characterised along with properties of samples from a population in such balance [53].

The theory of this balance between mutation and drift is generally applicable to bacteria including M. tuberculosis. Because most viable mutations are expected to have zero or negligible effect on bacterial phenotype, variation at marker loci can be considered selectively neutral, as a first approximation. An important exception is antibiotic resistance genes which are removed for the purposes of phylogenetic analysis because they are known to be under selection [34]. Exceptions to strict neutrality can also occur at marker loci: for instance, movement of IS6110 can lead to deleterious or even advantageous effects. Moreover, in the absence of recombination – as is the case for M. tuberculosis – any mutation under positive selection will be linked to neutral variation throughout the genome which may hitchhike to fixation in a selective sweep [120]. Nevertheless, a first step to analysing most molecular variation is to treat it as selectively neutral to understand broad patterns in data using theory from population genetics [54].
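
The role of genetic drift in the neutral model can be illustrated with a small Wright-Fisher simulation. This is a sketch under simplifying assumptions that are ours rather than the chapter's: a haploid population of constant size, a single neutral allele, and no new mutation, so that only loss or fixation by drift is shown. The function names are hypothetical.

```python
import random


def wright_fisher(pop_size, initial_count, rng, max_generations=100_000):
    """Follow a neutral allele until it is lost or fixed.

    Each generation, every one of the `pop_size` offspring picks its parent
    uniformly at random, so the allele count is resampled binomially.
    Returns (final allele count, generations elapsed).
    """
    count = initial_count
    generations = 0
    while 0 < count < pop_size and generations < max_generations:
        freq = count / pop_size
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        generations += 1
    return count, generations


def fixation_fraction(pop_size, initial_count, replicates, seed=0):
    """Fraction of replicate populations in which the allele reaches fixation."""
    rng = random.Random(seed)
    fixed = sum(
        wright_fisher(pop_size, initial_count, rng)[0] == pop_size
        for _ in range(replicates)
    )
    return fixed / replicates
```

Under neutrality the probability of fixation equals the initial allele frequency, so `fixation_fraction(20, 10, 400)` should be close to 0.5, while a single new mutant in a population of size N fixes with probability 1/N.
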
Neutral models have often been found to describe adequately the distribution of cluster sizes under IS6110-RFLP typing and spoligotyping, and the fit can be improved compared to the standard Wright-Fisher population genetic model by allowing the infected population to expand according to a birth-death process [92]. These simple population models focus on genetic variation without considering details of the population dynamics or the natural history of disease. For example, at the epidemiological level the population of interest is the set of infected hosts, and the replication process is the transmission process. However, standard population genetic models do not explicitly account for the process of transmission nor other processes such as host recovery, death or latent infection. Thus there is a need to integrate population genetic models with epidemiological dynamics such as those described in Section 2. Section 4 will describe progress towards this goal.

3.6 Models of within-host variation and mixed infections

At epidemiological scales, it is convenient to assume that each infected case corresponds to a single strain and that mutation leads to the instantaneous replacement of the ancestor, but in reality more than one strain can exist within an infection [115, 142]. Such infections are called mixed or complex infections. This bacterial variation may be due to mutation within the host or reinfection of the host by another strain. In order to understand the source and consequences of variation, models of bacterial dynamics at the within-host level are needed. Such modelling has led, for example, to methods to detect mixed infection [106] and to classify whether the variation is due to reinfection or mutation [23] using genetic data. In cases where within-host variation is due to mutation, serial isolates of M. tuberculosis from an infection can be used to estimate mutation rates [108, 113, 129]. A benefit of these estimates is that unlike “snapshot” data they make use of temporal information. Within-host data can also be used to study population genetic statistics to quantify the action of natural selection [101].

We note that the dynamics of M. tuberculosis within patients are highly complex and involve a large number of interactions between the pathogen and the host. The roles of the immune system, inter-cellular signals and spatial effects have been modelled [66, 145]. In those models, the variation in disease dynamics is due to the complex interactions between M. tuberculosis and the host response. To be able to draw conclusions for population level dynamics, models must suppress some of the complexities of the intra-host dynamics while focusing on competition between strains. For example, different M. tuberculosis strains can be modelled by imposing structure on the pathogen population (Figure 2, left), and different immune responses can be modelled by imposing structure on the host population (Figure 2, right).

Just as drug resistance is an important source of variation on the epidemiological scale, it is also important on the within-host scale. Models that combine the population dynamics of M. tuberculosis with the pharmacokinetics of drugs provide a quantitative description of the emergence of resistance within a patient, which can then be used to optimise treatment regimens to minimise drug resistance [90]. The effects of nonadherence and drug synergies can be considered under such models [90]. Within-host modelling of multidrug resistant M. tuberculosis has led to predictions that such strains can arise at an unexpectedly high rate from apparently pansensitive within-host populations because of standing genetic variation in those populations [31, 90].

4 Integrating epidemiological and evolutionary models

In Section 2.1 we described models of the epidemiology of disease and of M. tuberculosis in particular. In Section 3 we described how models from the fields of molecular evolution and population genetics can be applied to genetic data from M. tuberculosis isolates. We suggest that to fully understand genetic data, it is necessary to combine both kinds of approaches. We will describe developments in this area, including the integration of epidemiological and evolutionary models that involve the analysis of phylogenies – a young field dubbed phylodynamics [72].

4.1 Between-host variation, clustering and transmission

A central goal of molecular epidemiology is to draw inferences about disease transmission using genetic information [69, 95, 98, 128]. Above we introduced phylogenetic trees to investigate the diversity of M. tuberculosis strains, namely to infer the past history (or state) of the epidemic. In this section, we go a step further and assess the transmission dynamics in addition to the state.

The degree to which isolates cluster into identical genotypes carries information about the extent of recent transmission. The underlying assumption is that genotypes evolve on the same timescale as the process of disease transmission so that each cluster of isolates represents a set of cases that arose recently through transmission, whereas isolates not connected via recent transmission differ through accumulated mutations. Simple clustering statistics have therefore been used to quantify recent transmission of tuberculosis [3, 119]. Widely used are the “n” and “n-1” statistics ([69], also known as $\mathrm{RTI}_{n}$ and $\mathrm{RTI}_{n-1}$ respectively [128]). These are defined as:

\[ \mathrm{RTI}_{n} = \frac{n_c}{n} \tag{3} \]

\[ \mathrm{RTI}_{n-1} = \frac{n_c - c}{n} \tag{4} \]

where $n_c$ is the number of isolates in clusters (of 2 or more isolates), $c$ is the number of clusters and $n$ is the total number of isolates. These can be alternatively written as

\[ \mathrm{RTI}_{n} = 1 - \frac{u}{n} \tag{5} \]

\[ \mathrm{RTI}_{n-1} = 1 - \frac{g}{n} \tag{6} \]

where $g$ is the number of distinct genotypes in the sample and $u$ is the number of unique genotypes in the sample (also called non-clustered isolates or singletons). The two forms agree because $n = n_c + u$ and $g = c + u$.

Mathematical modelling has contributed to the goal of improving inferences from these data. First, modelling and simulation studies have shown that incomplete sampling leads to an underestimation of the extent of recent transmission which also interferes with assessing risk factors for recent transmission [14, 68, 69, 99, 100]. Second, the degree of clustering cannot easily be compared across different types of markers because different markers mutate at different rates, a fact not accounted for in these simple clustering statistics [128]. Thus, in order to interpret patterns of clustering in terms of disease transmission it is important to know the mutation rates of different markers. Furthermore, separate clusters may not be completely different strains – a single mutation event could split a cluster into two clusters, which should be treated as a single epidemiological cluster of cases. Ideally then, methods of inference from genetic data would both incorporate the speed of evolution and account for sampling.

An approach to analysing genetic data is to use mathematical models that account for disease transmission, marker mutation and sampling. The extent of transmission can be quantified by estimating the effective reproductive number of the pathogen. Because models can be complex and difficult to work with directly, computational methods that approximate the likelihood have been applied to analyse data using models [2, 93, 130]. These models do not consider the phylogenetic history of the genetic data. An alternative approach is to augment the data by using trees which permits exact calculation of likelihoods conditional on trees [121]. This approach requires a clear definition of trees that represent the evolutionary and epidemiological history of a sample of bacterial isolates. The next section introduces the concept of the transmission tree as a step towards integrating epidemiology and phylogeny.
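
The “n” and “n-1” clustering statistics defined in this section are simple to compute from a list of genotype labels, one per isolate. The sketch below is illustrative: the function name and the dictionary keys are our own, and genotypes are assumed to be represented as hashable labels such as spoligotype or MIRU-VNTR strings.

```python
from collections import Counter


def clustering_statistics(genotypes):
    """Compute the "n" and "n-1" recent transmission statistics.

    `genotypes` holds one genotype label per isolate. A cluster is a
    genotype shared by two or more isolates.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one isolate is required")
    counts = Counter(genotypes)
    g = len(counts)                                         # distinct genotypes
    u = sum(1 for size in counts.values() if size == 1)     # unique genotypes
    c = g - u                                               # number of clusters
    n_c = n - u                                             # isolates in clusters
    return {
        "n": n, "n_c": n_c, "c": c, "u": u, "g": g,
        "RTI_n": n_c / n,
        "RTI_n_minus_1": (n_c - c) / n,
    }
```

For the sample `["A", "A", "A", "B", "B", "C", "D"]` there are seven isolates, two clusters and two singletons, giving 5/7 for the “n” statistic and 3/7 for the “n-1” statistic. As noted in the text, these values ignore sampling fraction and marker mutation rate, so they should be compared across studies with caution.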

4.2 Transmission trees

In order to characterise the connection between epidemiological and evolutionary relationships it is necessary to introduce the concept of transmission trees to be studied in light of phylogenetic trees. If all of the infected individuals and the times and sources of every new infection were known, we could represent the spread of an infection as a bifurcating tree. In such a tree, which we will refer to as the transmission tree, each branch represents an infectious case, each bifurcation event represents a secondary case and the root branch represents the initial introduction of the infection to the population. One of the very few perfect reconstructions of a transmission tree via complete contact tracing comes from a 1980 study on a quarantined US naval vessel described by Houk [77].

Unfortunately, in most epidemics it is difficult to achieve complete sampling and to determine the precise timing of events; nor can we perform contact tracing, as the infection event might have taken place years prior to the onset of symptoms. Instead, genetic sequencing data are used to estimate the transmission tree relying on the following observation. Upon each infection a subset of the genotypes occurring in the donor is transferred to the recipient, producing a new case. From the time of that transmission event, the pathogen populations in the two distinct infections evolve independently of each other within the two hosts. Thus patients close in the transmission chain have M. tuberculosis genomes closer to each other compared to patients distant in the transmission chain. The phylogenetic tree of the pathogen sequences puts similar sequences close together and serves as a proxy for the transmission chain. Note that upon a bifurcation, we do not know which patient is the donor or the recipient in this reconstructed tree.

Genetic sequencing data represent only a subset of active tuberculosis infections because of incomplete sampling and financial or other constraints. Reconstructed transmission chains are therefore also incomplete. Figure 4 shows an example of a transmission tree in an epidemic with SIR dynamics, with incomplete sampling. By observing the reconstructed phylogenetic trees and interpreting them as a proxy for the transmission chain, one can draw conclusions about multiple characteristics of the epidemic, such as possible hotspots of infection (e.g. households where multiple family members were infected), and identify whether a specific patient could have been a source of infection in a cluster. Methods to infer the exact transmission tree of M. tuberculosis from a phylogenetic tree, including the direction of infection, have recently been proposed in [39, 40]. Essentially, these methods assign a corresponding sampled or unsampled infected host to lineages.

Fig. 4 An example of a transmission tree depicting SIR dynamics. The tree shows sampled branches with solid lines and unsampled ones with dotted lines, with samples shown as orange circles. The curves underneath the tree show the changes in the number of infected individuals and the recovered individuals. [Plot axes: number infected; number removed.]

4.3 Phylodynamic methods

As sequencing of genomes becomes more cost-effective, fast and reliable, increasing amounts of sequence data are sampled from ongoing epidemics, and phylogenetic trees as well as transmission trees are thus being reconstructed. This increase in sequence data has stimulated the development of phylodynamic methods that combine evolutionary and epidemiological analyses to quantify the parameters of the epidemiological models. In fact, the structure and branch lengths of the reconstructed phylogenetic tree contain information on the different compartments and rates of movement (i.e. dynamics) between the compartments in the underlying epidemiological model. For example, in the case of an SIR model with all patients being sampled upon recovery, the waiting times for a new branching event will be exponentially distributed with rate λ = βI and the branches will go extinct (the individual will stop being infectious) with rate γ. Incomplete sampling requires the development of sophisticated statistical tools integrating over all non-sampled patients [124, 136]. The dependency will be more complex for a model including a latent or other non-infectious class, or allowing for heterogeneous pathogen or host population as in Figure 2 [87, 122, 135]. In such a case the reconstructed phylogenetic tree is extended by labelling each tip in the tree with the information on which compartment the corresponding host belongs to: if tips from the same compartment cluster in the tree we conclude that there is transmission within the compartment, while if tips from the same compartment are spread over the tree we conclude independent migration into that compartment. The phylodynamic methods allow us to quantify the rates of transmission and migration in the epidemiological models based on the phylogenetic trees with the tip labels.
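
The link between SIR dynamics and tree events can be sketched with a Gillespie-style stochastic simulation, in which each infection corresponds to a branching event in the transmission tree and each removal to the end of a branch. This is an illustrative sketch only: it assumes the common frequency-scaled form β S I / N for the total infection rate, which may differ from the parameterisation intended in the text, and the function name and event labels are our own.

```python
import random


def gillespie_sir(beta, gamma, susceptible, infected, rng, t_max=float("inf")):
    """Simulate a stochastic SIR epidemic and return its event history.

    Returns a list of (time, event) pairs, where event is "infection"
    (a branching event in the transmission tree) or "removal" (a branch
    ends, e.g. the patient recovers and is sampled).
    """
    s, i = susceptible, infected
    n = s + i
    t = 0.0
    events = []
    while i > 0:
        infection_rate = beta * s * i / n   # assumed frequency-scaled form
        removal_rate = gamma * i
        total_rate = infection_rate + removal_rate
        # The waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        if rng.random() < infection_rate / total_rate:
            s -= 1
            i += 1
            events.append((t, "infection"))
        else:
            i -= 1
            events.append((t, "removal"))
    return events
```

Running `gillespie_sir(0.5, 0.2, 50, 1, random.Random(42))` yields one realisation of the epidemic. Phylodynamic inference works in the opposite direction, using the timing of branching and sampling events in a reconstructed tree to estimate parameters such as β and γ.
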
There are two main approaches to infer epidemiological parameters from pathogen sequencing data, which we call the two-step and the one-step approaches.

In the two-step approach, one first produces a phylogenetic tree with branch lengths in calendar units, using a tree reconstruction method as discussed in Section 3.3. The reconstructed tree is used as fixed input in the second analysis step to infer epidemiological parameters (see e.g. [86, 87, 122, 124, 135, 136]). Most of these methods are available within the software package BEAST2 [17] and stand-alone implementations are mentioned in the individual papers. This parameter inference based on a fixed tree can be done using maximum likelihood (ML) or Bayesian inference. ML inference focuses on finding the combination of parameters that was the most likely to have produced the phylogenetic tree that is being studied under the given epidemiological model. The ML framework does not make use of prior knowledge of the parameters of the underlying models. In contrast, Bayesian methods can incorporate prior distributions of parameters and yield posterior distributions over the parameter space. Posterior distributions are a natural way to interpret the uncertainty in the resulting estimates. The incorporation of prior distributions allows for the better use of all the information available but requires great care in prior specification, as inappropriate priors can significantly and incorrectly influence the results of the analysis.

One-step approaches simultaneously estimate trees and parameters from the genetic sequences, typically in a Bayesian framework. In the one-step Bayesian approach the uncertainty in the phylogenetic trees is naturally integrated out. In other words, the posterior distributions of our epidemiological parameters take into account tree uncertainty. This will become particularly useful for M. tuberculosis analyses as the low diversity in whole genome M. tuberculosis strains leads to high uncertainty in trees. For this one-step approach we jointly assume an epidemiological model such as the SIR model, which gives rise to a probability distribution over the tree space, and an evolutionary model such as GTR, which defines the probability distribution of the alignment of sequences. The output is a posterior distribution of trees, epidemiological parameters and evolutionary parameters. Software packages BEAST2 [17] and BEAST1 [47] can simultaneously infer the evolutionary history and the epidemiological parameters under some simple epidemiological models.²

This one-step approach has been used for viral datasets such as HCV [107], HIV [123], or influenza [87]. Phylodynamics on viruses can be done based on single genes as the substitution rates are high enough to see differences in single genes of the virus in donor compared to recipient. The slower-evolving M. tuberculosis requires whole genomes to reconstruct the phylogenetic trees and the epidemiological parameters. With next-generation sequencing technologies, such whole genomes become increasingly available, opening the door for phylodynamic analysis.

Such analyses have been done for assessing the rate of transmitted drug resistance. In [20] the tips in the reconstructed phylogenetic tree were labelled according to the drug resistance status. Short inter-sequence distance was used to infer transmission links and to assess the transmission fitness costs in drug-resistant strains.

² BEAST2 started out as a re-design of BEAST1; however, over the course of time the two platforms continued to evolve independently with new features being implemented in both.

5 Practical implications

Mathematical modelling helps us explain and predict the dynamics of tuberculosis, including the origins and future of strain diversity. Models aid in estimating the rates of transmission and reactivation, which in turn can influence the design of population interventions. Models that incorporate at least the conclusions from strain diversity studies are therefore important in targeting interventions to achieve the WHO goals of eliminating M. tuberculosis.

5.1 Classification and outbreaks

In a practical sense, model-based analysis of genetic diversity data for M. tuberculosis can be useful both for reactive purposes, such as outbreak and contact investigations, and for longer-term policy definition addressing problems such as drug resistance.

In relation to M. tuberculosis cases and contact interventions, modelling has most potential to be useful in high-resource environments where data collection on cases and contacts provides nearly complete strain information, including genotyping and sequencing of samples. A recent demonstration study suggested that WGS-based testing and identification of resistance was a more cost-effective solution for resistance testing and case follow-up than existing methods [105]. Application of modelling tools such as BEAST to analyse genotype and whole genome sequencing data [67, 102] for M. tuberculosis isolates can inform outbreak investigations by more precisely identifying links, which in turn has implications for epidemiological analyses of risk factors for transmission and disease. In addition, modelling can be used to better understand measurement processes and biases in the collection of samples; for instance, the simulations conducted by Piazotta et al. [106] show how estimates of the prevalence of mixed infection can be corrected based on modelled properties of the infection and sample collection process.

A related question that can be informed by diversity data is the estimation of the proportions of recent transmission, relapse and reactivation. Determining this balance is important, as it can help decide which aspects of treatment and prevention programs require attention. For instance, a high rate of recent transmission would suggest prioritising case finding and treatment success rates, while high rates of reactivation might suggest the need for increased use of preventive therapy.
Studies using WGS to investigate transmission were recently reviewed by Hatherell et al. [74], who suggest that while these approaches are very helpful, improvements are still needed not only in data fidelity but also in the refinement of models of transmission trees and in the development of model-based thresholds for genetic distance to distinguish linked from unlinked cases.
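The idea of a genetic distance threshold for distinguishing linked from unlinked cases can be sketched as follows. The sketch counts pairwise SNP differences between aligned sequences and groups isolates by single linkage. The sequences, the function names and the threshold value are hypothetical; a model-based threshold of the kind called for above would have to be derived from the substitution rate and the sampling process rather than fixed by hand.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def link_clusters(sequences, threshold):
    """Group isolates whose pairwise SNP distance is at most `threshold`.

    sequences: dict mapping isolate name to aligned sequence.
    Clusters are formed by single linkage, so two isolates end up together
    if a chain of sufficiently close pairs connects them.
    """
    parent = {name: name for name in sequences}

    def find(name):
        # Follow parent pointers to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sequences, 2):
        if snp_distance(sequences[a], sequences[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in sequences:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical aligned SNP sites for four isolates.
    toy = {
        "A": "ACGTACGTAC",
        "B": "ACGTACGTAT",
        "C": "ACGTACGAAT",
        "D": "TTGTTCGAGG",
    }
    print(link_clusters(toy, threshold=1))
```

Single linkage is used here only because it is the simplest choice. It can chain together cases that are not directly linked, which is one reason a fixed cut-off is a crude substitute for an explicit transmission model.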

5.2 Correlation between pathogen genetics and host outcomes

A separate question that can be addressed with genetic data is whether differences in infection and disease outcomes are due to pathogen diversity or to factors unrelated to pathogen characteristics. For example, differences in progression to disease between infections with M. africanum and with M. tuberculosis have been observed in a large epidemiological cohort study in The Gambia [78]. While such differences in disease natural history between related species might be expected, this does raise the question of whether infection, progression or disease outcomes differ between strain groupings and whether this might need attention in terms of disease control. Evidence for this variation, discussed in Chapter 5 and in the comprehensive review by Coscolla and Gagneux [35], demonstrates a range of differences between M. tuberculosis lineages at molecular, in vivo and in vitro levels. However, epidemiological evidence for special properties of, for example, the Beijing strain is less conclusive and has not yet been a major focus in modelling studies. We note that Comas and Gagneux [32] have argued for a “systems epidemiology” approach using computational models to address such questions, and we expect increased use of mathematical techniques such as those summarised here.

5.3 Dynamics of resistance

A major practical focus for modelling strain diversity is the phenomenon of multidrug-resistant (MDR) and extensively drug-resistant (XDR) tuberculosis. While poor individual outcomes and the high cost of treating such cases have been influential in altering WHO policy for detecting and treating MDR-TB, models have played a key part in showing the potential for expansion of resistant strains [10, 26, 48, 49] and in identifying the need for drug-sensitivity testing as part of approaches to mitigate these effects. Implementation of treatment programs for MDR-TB in lower income settings has been facilitated by the development of molecular genetic tests such as GeneXpert [13] and by more detailed modelling studies that directly assess and compare the cost-effectiveness of relevant population treatment strategies.

These issues have been explored primarily through deterministic epidemiological models, with strain heterogeneity simplified to sensitive and resistant classes, including the potential for latent infection with both resistant and sensitive TB [30]. These approaches are valuable for assessing the impacts of relative fitness under strong treatment-related selection, either for active TB [26] or for preventive therapy [27] in latently infected populations.

Extensions of such models have also been applied to describe trends in MDR+ resistance through an expanded classification of resistance properties, from mono-resistance through to MDR and XDR-TB [96]. Models with this additional detail on resistance allow prediction of the effects of molecular testing for resistance and of appropriate treatment on epidemic trajectories and the prevalence of MDR+ resistance. Modelling studies have in general found that tests such as GeneXpert offer acceptable value for money even in lower resource settings, and these studies have helped facilitate a rapid scale-up in testing and treatment for MDR-TB since 2013 [144]. There has been a recent proliferation of studies assessing the effectiveness and cost-effectiveness of such strategies, as summarised by Zwerling et al. [154], who note that molecular tests have generally had positive cost-effectiveness findings but that future models need to make greater use of setting-specific parameters in relation to treatment and diagnostic programs.

In 2015 WHO established its End TB Strategy, which sets the ambitious goal of reducing new cases by 90% by 2035. This target does not seem achievable without reducing the burden of reactivation TB through treatment of latent tuberculosis [109], which suggests that the prevalence of resistance in latent infections will have a substantial impact on the success of preventive treatment strategies.
It will be important to detect and quantify mixed infection involving sensitive and resistant bacteria [28, 97], through both data collection and the use of models to correct biases in observations.

Models have also considered how increases in the prevalence of resistance relate to the variance of reproductive fitness among resistant strains. In accordance with expectations from evolutionary theory, variance in fitness enhances the success of resistance even if the mean fitness is relatively low [85]. In particular, these results indicate that while resistance that emerges under treatment will in general be poorly transmissible, transmission will nevertheless become dominated by the resistant strains with the highest fitness [85] in the absence of rapid identification and appropriate treatment for MDR-TB. Such models have the advantage of illustrating how high-fitness resistant strains can increase in prevalence in poorly controlled epidemics.

Dynamic models that consider strain diversity more directly have been less common and have been concerned with scientific questions such as the estimation of fitness costs of resistance. For instance, stochastic epidemiological models with genotype evolution have been used to estimate the relative fitness of drug-resistant strains and the relative importance of transmission of resistant strains versus acquired resistance [93]. The practical implication of such studies is to emphasise the value of reducing the average period of infectiousness for individuals with resistant M. tuberculosis infections. Models can then be used as decision tools to help guide us to the most effective means of achieving this goal, for example through increasing overall case-finding, reducing the time between identification of a case and tests for resistance, or improving treatment compliance and outcomes.
These findings can have both policy and research implications, for instance in suggesting the characteristics of potential diagnostics [45] or treatment regimens that would be most beneficial [154].
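The effect of fitness variance among resistant strains can be illustrated with a minimal deterministic sketch. It is not a reimplementation of any of the cited models: it assumes a simple SIS-type structure without latency, one drug-sensitive strain, several resistant strains that arise by acquired resistance and differ only in relative transmission fitness, and slower removal of resistant infections to mimic less effective treatment. All parameter values and names are hypothetical.

```python
def simulate_strain_competition(
    beta=2.0,                  # transmission rate of the sensitive strain
    gamma_s=1.0,               # removal rate of sensitive infections (treated)
    gamma_r=0.8,               # removal rate of resistant infections (slower)
    acquire=0.01,              # rate of acquired resistance under treatment
    fitness=(0.3, 0.6, 0.9),   # relative fitness of each resistant strain
    i_s0=0.01,                 # initial prevalence of the sensitive strain
    t_end=1000.0,
    dt=0.05,
):
    """Forward-Euler integration of a toy multi-strain SIS model.

    Returns the final prevalence of the sensitive strain and a list with the
    final prevalence of each resistant strain (fractions of the population).
    """
    i_s = i_s0
    i_r = [0.0] * len(fitness)
    for _ in range(int(round(t_end / dt))):
        susceptible = 1.0 - i_s - sum(i_r)
        # Acquired resistance is split equally between the resistant strains.
        seeding = acquire * i_s / len(fitness)
        d_s = beta * susceptible * i_s - (gamma_s + acquire) * i_s
        d_r = [
            beta * f * susceptible * x - gamma_r * x + seeding
            for f, x in zip(fitness, i_r)
        ]
        i_s += dt * d_s
        i_r = [x + dt * dx for x, dx in zip(i_r, d_r)]
    return i_s, i_r


if __name__ == "__main__":
    sensitive, resistant = simulate_strain_competition()
    print(f"sensitive: {sensitive:.4f}")
    for f, x in zip((0.3, 0.6, 0.9), resistant):
        print(f"resistant (relative fitness {f}): {x:.4f}")
```

With these hypothetical values the reproduction number of the sensitive strain is about 1.98, while those of the resistant strains are 0.75, 1.5 and 2.25. A resistant strain with the mean relative fitness of 0.6 would therefore not displace the sensitive strain, but the fittest resistant strain does, which mirrors the qualitative point made above about variance in fitness.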

5.4 Future directions

Tuberculosis modelling is a rapidly growing field, with a number of key directions in which modelling research is progressing. In relation to models of M. tuberculosis variation, it is particularly important to refine models for whole genome sequencing (WGS) data analysis because much future data will be of this kind. We discussed some of the recent developments in this broad area in Section 4, but further work can be done. For example, more realistic models could be developed to link dynamics at within- and between-host levels.

Models of within-host M. tuberculosis infection should ideally feature a more fine-grained characterisation of the natural history of disease, including interactions between the immune system and the pathogen population and features of TB infection such as granulomas. Research in this area has moved from more theoretical explorations [22] to studies of potential biomarkers [94] and enhancements to therapy [89]. One other open topic of research is the quantification of substitution rates for M. tuberculosis in the latent and in the acute stage of the disease.

Another aspect of tuberculosis epidemiology that has been relatively underexplored is the variation in host susceptibility to infection and disease. We briefly described approaches to modelling host variation in Section 2.3. Future work may extend such models to consider coevolution between the host immune system and M. tuberculosis. We note that variation in hosts can be due to genetic or non-genetic factors and that, although genetic susceptibility to tuberculosis infection has been studied [9], the important sources of host variation are arguably non-genetic factors such as age, HIV status and other immunosuppression, diabetes, BCG vaccination and living conditions including nutrition, crowding and smoking behaviour. While host factors such as these are commonly included in risk-prediction models for non-communicable diseases (e.g. coronary artery disease), their adoption in transmission models for M. tuberculosis has been slow. However, there is increasing recognition that social determinants of transmission and disease are important in determining the characteristics of M. tuberculosis epidemics and in setting priorities for control [5]. As host factors are often closely linked to key characteristics of treatment programs (e.g. compliance with treatment), there can be flow-on effects on pathogen variation. As such, it is likely that there is value in integrated approaches that take both host and pathogen variation into account. We see these approaches as valuable not only in explaining existing epidemiology but also in more local prediction of epidemic outcomes under changes to control measures.

References

[1] Aandahl RZ, Reyes JF, Sisson SA, Tanaka MM (2012) A model-based Bayesian estimation of the rate of evolution of VNTR loci in Mycobacterium tuberculosis. PLoS Comput Biol 8(6):e1002573, DOI 10.1371/journal.pcbi.1002573, URL http://dx.doi.org/10.1371/journal.pcbi.1002573
[2] Aandahl RZ, Stadler T, Sisson SA, Tanaka MM (2014) Exact vs. approximate computation: reconciling different estimates of Mycobacterium tuberculosis epidemiological parameters. Genetics 196(4):1227–1230, DOI 10.1534/genetics.113.158808
[3] Alland D, Kalkut G, Moss A, McAdam R, Hahn J, Bosworth W, Drucker E, Bloom B (1994) Transmission of tuberculosis in New York City. An analysis by DNA fingerprinting and conventional epidemiologic methods. N Engl J Med 330(24):1710–6
[4] Anderson RM, May RM (1979) Population biology of infectious diseases: Part I. Nature 280:361–367
[5] Andrews JR, Basu S, Dowdy DW, Murray MB (2015) The epidemiological advantage of preferential targeting of tuberculosis control at the poor. Int J Tuberc Lung Dis 19(4):375–380
[6] Aranaz A, Romero B, Montero N, Alvarez J, Bezos J, de Juan L, Mateos A, Domínguez L (2004) Spoligotyping profile change caused by deletion of a direct variable repeat in a Mycobacterium tuberculosis isogenic laboratory strain. J Clin Microbiol 42(11):5388–5391, DOI 10.1128/JCM.42.11.5388-5391.2004, URL http://dx.doi.org/10.1128/JCM.42.11.5388-5391.2004
[7] Basu S, Galvani AP (2009) The evolution of tuberculosis virulence. Bull Math Biol 71(5):1073–1088, DOI 10.1007/s11538-009-9394-x, URL http://dx.doi.org/10.1007/s11538-009-9394-x
[8] Becker N (2015) Modeling to Inform Infectious Disease Control. Chapman & Hall/CRC Biostatistics Series, Taylor & Francis, URL https://books.google.com.au/books?id=F_MQrgEACAAJ
[9] Bellamy R, Beyers N, McAdam KP, Ruwende C, Gie R, Samaai P, Bester D, Meyer M, Corrah T, Collin M, Camidge DR, Wilkinson D, Hoal-Van Helden E, Whittle HC, Amos W, van Helden P, Hill AV (2000) Genetic susceptibility to tuberculosis in Africans: a genome-wide scan. Proc Natl Acad Sci U S A 97(14):8005–8009, DOI 10.1073/pnas.140201897, URL http://dx.doi.org/10.1073/pnas.140201897

[10] Blower SM, Gerberding JL (1998) Understanding, predicting and controlling the emergence of drug-resistant tuberculosis: a theoretical framework. J Mol Med (Berl) 76(9):624–636
[11] Blower SM, McLean AR, Porco TC, Small PM, Hopewell PC, Sanchez MA, Moss AR (1995) The intrinsic transmission dynamics of tuberculosis epidemics. Nat Med 1:815–821
[12] Blower SM, Small PM, Hopewell PC (1996) Control strategies for tuberculosis epidemics: new models for old problems. Science 273:497–500
[13] Boehme CC, Nabeta P, Hillemann D, Nicol MP, Shenai S, Krapp F, Allen J, Tahirli R, Blakemore R, Rustomjee R, Milovic A, Jones M, O’Brien SM, Persing DH, Ruesch-Gerdes S, Gotuzzo E, Rodrigues C, Alland D, Perkins MD (2010) Rapid molecular detection of tuberculosis and rifampin resistance. N Engl J Med 363(11):1005–1015
[14] Borgdorff MW, van den Hof S, Kalisvaart N, Kremer K, van Soolingen D (2011) Influence of sampling on clustering and associations with risk factors in the molecular epidemiology of tuberculosis. Am J Epidemiol 174(2):243–251, DOI 10.1093/aje/kwr061, URL http://dx.doi.org/10.1093/aje/kwr061
[15] Borrell S, Gagneux S (2011) Strain diversity, epistasis and the evolution of drug resistance in Mycobacterium tuberculosis. Clin Microbiol Infect 17(6):815–20, DOI 10.1111/j.1469-0691.2011.03556.x, URL http://www.ncbi.nlm.nih.gov/pubmed/21682802
[16] Bos KI, Harkins KM, Herbig A, Coscolla M, Weber N, Comas I, Forrest SA, Bryant JM, Harris SR, Schuenemann VJ, et al (2014) Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature 514(7523):494–497
[17] Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10(4):e1003537, DOI 10.1371/journal.pcbi.1003537, URL http://www.ncbi.nlm.nih.gov/pubmed/24722319
[18] Brooks-Pollock E, Becerra MC, Goldstein E, Cohen T, Murray MB (2011) Epidemiologic inference from the distribution of tuberculosis cases in households in Lima, Peru. J Infect Dis 203(11):1582–9, DOI 10.1093/infdis/jir162, URL http://www.ncbi.nlm.nih.gov/pubmed/21592987
[19] Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, Al-Hajoj SA, Allix C, Aristimuño L, Arora J, Baumanis V, et al (2006) Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiology 6(1):23
[20] Casali N, Nikolayevskyy V, Balabanova Y, Harris SR, Ignatyeva O, Kontsevaya I, Corander J, Bryant J, Parkhill J, Nejentsev S, Horstmann RD, Brown T, Drobniewski F (2014) Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nat Genet 46(3):279–86, DOI 10.1038/ng.2878, URL http://www.ncbi.nlm.nih.gov/pubmed/24464101

[21] Cave MD, Eisenach KD, McDermott PF, Bates JH, Crawford JT (1991) IS6110: conservation of sequence in the Mycobacterium tuberculosis complex and its utilization in DNA fingerprinting. Mol Cell Probes 5(1):73–80
[22] Chang ST, Linderman JJ, Kirschner DE (2005) Multiple mechanisms allow Mycobacterium tuberculosis to continuously inhibit MHC class II-mediated antigen presentation by macrophages. Proc Natl Acad Sci U S A 102(12):4530–4535
[23] Chindelevitch L, Colijn C, Moodley P, Wilson D, Cohen T (2016) ClassTR: Classifying within-host heterogeneity based on tandem repeats with application to Mycobacterium tuberculosis Infections. PLoS Comput Biol 12(2):e1004475, DOI 10.1371/journal.pcbi.1004475, URL http://dx.doi.org/10.1371/journal.pcbi.1004475
[24] Chisholm RH, Tanaka MM (2016) The emergence of latent infection in the early evolution of Mycobacterium tuberculosis. Proc Biol Sci 283(1831), DOI 10.1098/rspb.2016.0499, URL http://dx.doi.org/10.1098/rspb.2016.0499
[25] Chisholm RH, Trauer JM, Curnoe D, Tanaka MM (2016) Controlled fire use in early humans might have triggered the evolutionary emergence of tuberculosis. Proc Natl Acad Sci U S A 113(32):9051–9056, DOI 10.1073/pnas.1603224113
[26] Cohen T, Murray M (2004) Modeling epidemics of multidrug-resistant M. tuberculosis of heterogeneous fitness. Nat Med 10(10):1117–1121, DOI 10.1038/nm1110, URL http://dx.doi.org/10.1038/nm1110
[27] Cohen T, Lipsitch M, Walensky RP, Murray M (2006) Beneficial and perverse effects of isoniazid preventive therapy for latent tuberculosis infection in HIV-tuberculosis coinfected populations. Proc Natl Acad Sci U S A 103(18):7042–7047, DOI 10.1073/pnas.0600349103, URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-33646493800&partnerID=40&md5=affd24e21a2915b31375d88cb57caf88
[28] Cohen T, van Helden PD, Wilson D, Colijn C, McLaughlin MM, Abubakar I, Warren RM (2012) Mixed-strain Mycobacterium tuberculosis infections and the implications for tuberculosis treatment and control. Clin Microbiol Rev 25(4):708–719, DOI 10.1128/CMR.00021-12, URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-84867198547&partnerID=40&md5=bb6952b52ec058f2821621f8f8b3c52d
[29] Colangeli R, Arcus VL, Cursons RT, Ruthe A, Karalus N, Coley K, Manning SD, Kim S, Marchiano E, Alland D (2014) Whole genome sequencing of Mycobacterium tuberculosis reveals slow growth and low mutation rates during latent infections in humans. PLoS One 9(3):e91024, DOI 10.1371/journal.pone.0091024, URL http://dx.doi.org/10.1371/journal.pone.0091024
[30] Colijn C, Cohen T, Murray M (2009) Latent coinfection and the maintenance of strain diversity. Bull Math Biol 71(1):247–263, DOI 10.1007/s11538-008-9361-y, URL http://dx.doi.org/10.1007/s11538-008-9361-y
[31] Colijn C, Cohen T, Ganesh A, Murray M (2011) Spontaneous emergence of multiple drug resistance in tuberculosis before and during therapy. PLoS One 6(3):e18327, DOI 10.1371/journal.pone.0018327, URL http://dx.doi.org/10.1371/journal.pone.0018327
[32] Comas I, Gagneux S (2011) A role for systems epidemiology in tuberculosis research. Trends in microbiology 19(10):492–500
[33] Comas I, Homolka S, Niemann S, Gagneux S (2009) Genotyping of genetically monomorphic bacteria: DNA sequencing in Mycobacterium tuberculosis highlights the limitations of current methodologies. PLoS One 4(11):e7815, DOI 10.1371/journal.pone.0007815, URL http://dx.doi.org/10.1371/journal.pone.0007815
[34] Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, Parkhill J, Malla B, Berg S, Thwaites G, et al (2013) Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nat Genet 45(10):1176–1182
[35] Coscolla M, Gagneux S (2014) Consequences of genomic diversity in Mycobacterium tuberculosis. Semin Immunol 26(6):431–44, DOI 10.1016/j.smim.2014.09.012, URL https://www.ncbi.nlm.nih.gov/pubmed/25453224
[36] de Boer AS, Borgdorff MW, de Haas PE, Nagelkerke NJ, van Embden JD, van Soolingen D (1999) Analysis of rate of change of IS6110 RFLP patterns of Mycobacterium tuberculosis based on serial patient isolates. J Infect Dis 180(4):1238–1244, DOI 10.1086/314979, URL http://dx.doi.org/10.1086/314979
[37] Demay C, Liens B, Burguière T, Hill V, Couvin D, Millet J, Mokrousov I, Sola C, Zozio T, Rastogi N (2012) SITVITWEB–A publicly available international multimarker database for studying Mycobacterium tuberculosis genetic diversity and molecular epidemiology. Infection, Genetics and Evolution 12(4):755–766
[38] Di Rienzo A, Peterson AC, Garza JC, Valdes AM, Slatkin M, Freimer NB (1994) Mutational processes of simple-sequence repeat loci in human populations. Proc Natl Acad Sci U S A 91(8):3166–3170
[39] Didelot X, Gardy J, Colijn C (2014) Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol Biol Evol 31(7):1869–1879, DOI 10.1093/molbev/msu121, URL http://dx.doi.org/10.1093/molbev/msu121
[40] Didelot X, Fraser C, Gardy J, Colijn C (2017) Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol
[41] Diekmann O, Heesterbeek JAP (2000) Mathematical epidemiology of infectious diseases: model building, analysis and interpretation. John Wiley & Sons
[42] Diekmann O, Heesterbeek JAP, Metz JAJ (1990) On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations. J Math Biol 28(4), DOI 10.1007/bf00178324
[43] Dietz K (1993) The estimation of the basic reproduction number for infectious diseases. Statistical methods in medical research 2(1):23–41
[44] Dowdy DW, Dye C, Cohen T (2013) Data needs for evidence-based decisions: a tuberculosis modeler’s ‘wish list’. Int J Tuberc Lung Dis 17(7):866–77, DOI 10.5588/ijtld.12.0573, URL http://www.ncbi.nlm.nih.gov/pubmed/23743307
[45] Dowdy DW, Houben R, Cohen T, Pai M, Cobelens F, Vassall A, Menzies NA, Gomez GB, Langley I, Squire SB, White R (2014) Impact and cost-effectiveness of current and future tuberculosis diagnostics: the contribution of modelling. Int J Tuberc Lung Dis 18(9):1012–1018
[46] Drummond AJ, Ho SY, Phillips MJ, Rambaut A (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4(5):e88, DOI 10.1371/journal.pbio.0040088, URL https://www.ncbi.nlm.nih.gov/pubmed/16683862
[47] Drummond AJ, Suchard MA, Xie D, Rambaut A (2012) Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29(8):1969–1973, DOI 10.1093/molbev/mss075
[48] Dye C, Espinal MA (2001) Will tuberculosis become resistant to all antibiotics? Proc R Soc Lond B Biol Sci 268(1462):45–52
[49] Dye C, Williams BG (2000) Criteria for the control of drug-resistant tuberculosis. Proc Natl Acad Sci U S A 97(14):8180–8185
[50] Dye C, Williams BG (2009) Slow elimination of multidrug-resistant tuberculosis. Sci Transl Med 1(3):3ra8, DOI 10.1126/scitranslmed.3000346, URL http://dx.doi.org/10.1126/scitranslmed.3000346
[51] Dye C, Williams BG (2010) The population dynamics and control of tuberculosis. Science 328(5980):856–861
[52] Dye C, Garnett GP, Sleeman K, Williams BG (1998) Prospects for worldwide tuberculosis control under the WHO DOTS strategy. The Lancet 352(9144):1886–1891, DOI 10.1016/S0140-6736(98)03199-7, URL http://www.sciencedirect.com/science/article/pii/S0140673698031997
[53] Ewens WJ (1972) The sampling theory of selectively neutral alleles. Theor Popul Biol 3(1):87–112
[54] Ewens WJ (2004) Mathematical Population Genetics 1: Theoretical Introduction, vol 27, 2nd edn. Springer
[55] Felsenstein J (1978) Cases in Which Parsimony or Compatibility Methods Will Be Positively Misleading. Systematic Zoology 27(4):401–410, DOI 10.2307/2412923
[56] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17(6):368–76, DOI 10.1007/bf01734359, URL http://www.ncbi.nlm.nih.gov/pubmed/7288891
[57] Felsenstein J (2004) Inferring phylogenies, vol 2. Sinauer Associates, Sunderland

[58] Filliol I, Motiwala AS, Cavatore M, Qi W, Hazbon MH, Bobadilla del Valle M, Fyfe J, Garcia-Garcia L, Rastogi N, Sola C, Zozio T, Guerrero MI, Leon CI, Crabtree J, Angiuoli S, Eisenach KD, Durmaz R, Joloba ML, Rendon A, Sifuentes-Osornio J, Ponce de Leon A, Cave MD, Fleischmann R, Whittam TS, Alland D (2006) Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol 188(2):759–72, DOI 10.1128/JB.188.2.759-772.2006, URL http://www.ncbi.nlm.nih.gov/pubmed/16385065
[59] Fine PE, Vynnycky E (1998) The effect of heterologous immunity upon the apparent efficacy of (e.g. BCG) vaccines. Vaccine 16(20):1923–1928, DOI 10.1016/S0264-410X(98)00124-8, URL http://www.sciencedirect.com/science/article/pii/S0264410X98001248
[60] Ford CB, Lin PL, Chase MR, Shah RR, Iartchouk O, Galagan J, Mohaideen N, Ioerger TR, Sacchettini JC, Lipsitch M, Flynn JL, Fortune SM (2011) Use of whole genome sequencing to estimate the mutation rate of Mycobacterium tuberculosis during latent infection. Nat Genet 43(5):482–486, DOI 10.1038/ng.811, URL http://dx.doi.org/10.1038/ng.811
[61] Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB, Cohen T, Johnston JC, Gardy J, Lipsitch M, Fortune SM (2013) Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet 45(7):784–790
[62] Gagneux S (2012) Host-pathogen coevolution in human tuberculosis. Philos T Roy Soc B 367(1590):850–859
[63] Gagneux S, Small PM (2007) Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development. Lancet Infect Dis 7(5):328–337, DOI 10.1016/S1473-3099(07)70108-1, URL http://dx.doi.org/10.1016/S1473-3099(07)70108-1
[64] Gagneux S, Long CD, Small PM, Van T, Schoolnik GK, Bohannan BJM (2006) The competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science 312(5782):1944–1946, DOI 10.1126/science.1124410, URL http://dx.doi.org/10.1126/science.1124410
[65] Galagan JE (2014) Genomic insights into tuberculosis. Nat Rev Genet 15(5):307–320, DOI 10.1038/nrg3664, URL http://dx.doi.org/10.1038/nrg3664
[66] Gammack D, Doering CR, Kirschner DE (2004) Macrophage response to Mycobacterium tuberculosis infection. J Math Biol 48(2):218–242
[67] Gardy JL, Johnston JC, Ho Sui SJ, Cook VJ, Shah L, Brodkin E, Rempel S, Moore R, Zhao Y, Holt R, Varhol R, Birol I, Lem M, Sharma MK, Elwood K, Jones SJM, Brinkman FSL, Brunham RC, Tang P (2011) Whole-genome sequencing and social-network analysis of a tuberculosis outbreak. N Engl J Med 364(8):730–739, DOI 10.1056/NEJMoa1003176, URL http://dx.doi.org/10.1056/NEJMoa1003176

[68] Glynn JR, Bauer J, de Boer AS, Borgdorff MW, Fine PE, Godfrey-Faussett P, Vynnycky E (1999) Interpreting DNA fingerprint clusters of Mycobacterium tuberculosis. European Concerted Action on Molecular Epidemiology and Control of Tuberculosis. Int J Tuberc Lung Dis 3(12):1055–1060
[69] Glynn JR, Vynnycky E, Fine PE (1999) Influence of sampling on estimates of clustering and recent transmission of Mycobacterium tuberculosis derived from DNA fingerprinting techniques. Am J Epidemiol 149(4):366–371
[70] Gomes MGM, Franco AO, Gomes MC, Medley GF (2004) The reinfection threshold promotes variability in tuberculosis epidemiology and vaccine efficacy. Proc Biol Sci 271(1539):617–623, DOI 10.1098/rspb.2003.2606, URL http://dx.doi.org/10.1098/rspb.2003.2606
[71] Grant A, Arnold C, Thorne N, Gharbia S, Underwood A (2008) Mathematical modelling of Mycobacterium tuberculosis VNTR loci estimates a very slow mutation rate for the repeats. J Mol Evol 66(6):565–574, DOI 10.1007/s00239-008-9104-6, URL http://dx.doi.org/10.1007/s00239-008-9104-6
[72] Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, Mumford JA, Holmes EC (2004) Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303(5656):327–332, DOI 10.1126/science.1090727, URL http://dx.doi.org/10.1126/science.1090727
[73] Hasegawa M, Kishino H, Yano Ta (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22(2):160–174
[74] Hatherell HA, Colijn C, Stagg HR, Jackson C, Winter JR, Abubakar I (2016) Interpreting whole genome sequencing for investigating tuberculosis transmission: a systematic review. BMC Med 14:21, DOI 10.1186/s12916-016-0566-x, URL https://www.ncbi.nlm.nih.gov/pubmed/27005433
[75] Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, Homolka S, Roach JC, Kremer K, Petrov DA, Feldman MW, Gagneux S (2008) High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol 6(12):e311, DOI 10.1371/journal.pbio.0060311, URL http://dx.doi.org/10.1371/journal.pbio.0060311
[76] Ho SYW, Phillips MJ, Cooper A, Drummond AJ (2005) Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Mol Biol Evol 22(7):1561–1568, DOI 10.1093/molbev/msi145, URL http://dx.doi.org/10.1093/molbev/msi145
[77] Houk VN (1980) Spread of tuberculosis via recirculated air in a naval vessel: the Byrd study. Ann N Y Acad Sci 353:10–24, URL http://www.ncbi.nlm.nih.gov/pubmed/6939378
[78] de Jong BC, Hill PC, Aiken A, Awine T, Antonio M, Adetifa IM, Jackson-Sillah DJ, Fox A, Deriemer K, Gagneux S, Borgdorff MW, McAdam KP, Corrah T, Small PM, Adegbola RA (2008) Progression to active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis lineage in The Gambia. J Infect Dis 198(7):1037–43, DOI 10.1086/591504, URL https://www.ncbi.nlm.nih.gov/pubmed/18702608
[79] Jukes TH, Cantor CR (1969) Evolution of protein molecules. Mammalian protein metabolism 3(21):132
[80] Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, Bunschoten A, Molhuizen H, Shaw R, Goyal M, van Embden J (1997) Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol 35(4):907–914
[81] Keeling MJ, Rohani P (2008) Modeling infectious diseases in humans and animals. Princeton University Press
[82] Kendall EA, Fofana MO, Dowdy DW (2015) Burden of transmitted multidrug resistance in epidemics of tuberculosis: a transmission modelling analysis. The Lancet Respiratory Medicine 3(12):963–972, DOI 10.1016/s2213-2600(15)00458-0
[83] Kermack WO, McKendrick AG (1927) Contributions to the mathematical theory of epidemics. Proc Roy Soc Ser 115:700–721
[84] Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16(2):111–120
[85] Knight GM, Colijn C, Shrestha S, Fofana M, Cobelens F, White RG, Dowdy DW, Cohen T (2015) The Distribution of Fitness Costs of Resistance-Conferring Mutations Is a Key Determinant for the Future Burden of Drug-Resistant Tuberculosis: A Model-Based Analysis. Clin Infect Dis 61(Suppl 3):S147–S154, DOI 10.1093/cid/civ579, URL http://dx.doi.org/10.1093/cid/civ579
[86] Kühnert D, Stadler T, Vaughan TG, Drummond AJ (2014) Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model. J R Soc Interface 11(94):20131106, DOI 10.1098/rsif.2013.1106, URL http://dx.doi.org/10.1098/rsif.2013.1106
[87] Kühnert D, Stadler T, Vaughan TG, Drummond AJ (2016) Phylodynamics with migration: A computational framework to quantify population structure from genomic data. Mol Biol Evol, DOI 10.1093/molbev/msw064, URL http://dx.doi.org/10.1093/molbev/msw064
[88] Lillebaek T, Norman A, Rasmussen EM, Marvig RL, Folkvardsen DB, Andersen ÅB, Jelsbak L (2016) Substantial molecular evolution and mutation rates in prolonged latent Mycobacterium tuberculosis infection in humans. Int J Med Microbiol, DOI 10.1016/j.ijmm.2016.05.017, URL http://dx.doi.org/10.1016/j.ijmm.2016.05.017
[89] Linderman JJ, Kirschner DE (2015) In silico models of M. tuberculosis infection provide a route to new therapies. Drug Discovery Today: Disease Models 15:37–41, DOI 10.1016/j.ddmod.2014.02.006, URL http://www.sciencedirect.com/science/article/pii/S1740675714000115, computational models of lung diseases

[90] Lipsitch M, Levin BR (1997) The within-host population dynamics of an- tibacterial chemotherapy: conditions for the evolution of resistance. Ciba Found Symp 207:112–27; discussion 127–30 [91] Liu X, Gutacker MM, Musser JM, Fu YX (2006) Evidence for re- combination in Mycobacterium tuberculosis. J Bacteriol 188(23):8169–77, DOI 10.1128/JB.01062-06, URL http://www.ncbi.nlm.nih.gov/ pubmed/16997954 [92] Luciani F, Francis AR, Tanaka MM (2008) Interpreting genotype cluster sizes of Mycobacterium tuberculosis isolates typed with IS6110 and spoligotyping. Infect Genet Evol 8(2):182–190, DOI 10.1016/j.meegid.2007.12.004, URL http://dx.doi.org/10.1016/j.meegid.2007.12.004 [93] Luciani F, Sisson SA, Jiang H, Francis AR, Tanaka MM (2009) The epidemiological fitness cost of drug resistance in Mycobacterium tuber- culosis. Proc Natl Acad Sci U S A 106(34):14,711–14,715, DOI 10. 1073/pnas.0902437106, URL http://dx.doi.org/10.1073/pnas. 0902437106 [94] Marino S, Gideon HP, Gong C, Mankad S, McCrone JT, Lin PL, Linderman JJ, Flynn JL, Kirschner DE (2016) Computational and empirical studies pre- dict Mycobacterium tuberculosis-specific T cells as a biomarker for infection outcome. PLoS Comput Biol 12(4):e1004,804 [95] Mathema B, Kurepina NE, Bifani PJ, Kreiswirth BN (2006) Molecular epi- demiology of tuberculosis: current insights. Clin Microbiol Rev 19(4):658– 685, DOI 10.1128/CMR.00061-05, URL http://dx.doi.org/10. 1128/CMR.00061-05 [96] Menzies NA, Cohen T, Lin HH, Murray M, Salomon JA (2012) Population Health Impact and Cost-Effectiveness of Tuberculosis Diagnosis with Xpert MTB/RIF: A Dynamic Simulation and Eco- nomic Evaluation. PLoS Medicine 9(11), DOI 10.1371/journal.pmed. 1001347, URL https://www.scopus.com/inward/record. uri?eid=2-s2.0-84870266092&partnerID=40&md5= 557a31f6d3f2341c03f44cb526513bdf [97] Mills HL, Cohen T, Colijn C (2013) Community-wide isoniazid preventive therapy drives drug-resistant tuberculosis: A model- based analysis. 
Science Translational Medicine 5(180), DOI 10.1126/scitranslmed.3005260, URL https://www.scopus. com/inward/record.uri?eid=2-s2.0-84877765955& partnerID=40&md5=feec00c0015ae17a30d3181030db7f0c [98] Murray M (2002) Determinants of cluster distribution in the molecular epi- demiology of tuberculosis. Proc Natl Acad Sci U S A 99(3):1538–1543 [99] Murray M (2002) Sampling bias in the molecular epidemiology of tubercu- losis. Emerg Infect Dis 8(4):363–9 [100] Murray M, Alland D (2002) Methodological problems in the molecular epi- demiology of tuberculosis. Am J Epidemiol 155(6):565–571 [101] O’Neill MB, Mortimer TD, Pepperell CS (2015) Diversity of Mycobacterium tuberculosis across Evolutionary Scales. PLoS Pathog 11(11):e1005,257, 34 J. Pecerska,ˇ J. Wood, M. M. Tanaka and T. Stadler

DOI 10.1371/journal.ppat.1005257, URL http://dx.doi.org/10. 1371/journal.ppat.1005257 [102] Outhred AC, Holmes N, Sadsad R, Martinez E, Jelfs P, Hill-Cawthorne GA, Gilbert GL, Marais BJ, Sintchenko V (2016) Identifying Likely Transmis- sion Pathways within a 10-Year Community Outbreak of Tuberculosis by High-Depth Whole Genome Sequencing. PLoS One 11(3):e0150,550, DOI 10.1371/journal.pone.0150550, URL https://www.ncbi.nlm.nih. gov/pubmed/26938641 [103] Ozcaglar C, Shabbeer A, Vandenberg S, Yener B, Bennett KP (2011) Sub- lineage structure analysis of Mycobacterium tuberculosis complex strains using multiple-biomarker tensors. BMC 12 Suppl 2:S1, DOI 10.1186/1471-2164-12-S2-S1, URL http://dx.doi.org/10.1186/ 1471-2164-12-S2-S1 [104] Ozcaglar C, Shabbeer A, Vandenberg SL, Yener B, Bennett KP (2012) Epi- demiological models of Mycobacterium tuberculosis complex infections. Math Biosci 236(2):77–96, DOI 10.1016/j.mbs.2012.02.003, URL http: //dx.doi.org/10.1016/j.mbs.2012.02.003 [105] Pankhurst LJ, Del Ojo Elias C, Votintseva AA, Walker TM, Cole K, Davies J, Fermont JM, Gascoyne-Binzi DM, Kohl TA, Kong C, Lemaitre N, Niemann S, Paul J, Rogers TR, Roycroft E, Smith EG, Supply P, Tang P, Wilcox MH, Wordsworth S, Wyllie D, Xu L, Crook DW, Group CTS (2016) Rapid, com- prehensive, and affordable mycobacterial diagnosis with whole-genome se- quencing: a prospective study. Lancet Respir Med 4(1):49–58, DOI 10.1016/ S2213-2600(15)00466-X, URL https://www.ncbi.nlm.nih.gov/ pubmed/26669893 [106] Plazzotta G, Cohen T, Colijn C (2015) Magnitude and sources of bias in the detection of mixed strain M. tuberculosis infection. Journal of theoretical biology 368:67–73 [107] Pybus OG, Drummond AJ, Nakano T, Robertson BH, Rambaut A (2003) The epidemiology and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach. 
Mol Biol Evol 20(3):381–7, DOI 10.1093/molbev/msg043, URL https://www.ncbi.nlm.nih.gov/ pubmed/12644558 [108] Ragheb MN, Ford CB, Chase MR, Lin PL, Flynn JL, Fortune SM (2013) The mutation rate of mycobacterial repetitive unit loci in strains of M. tuberculosis from cynomolgus macaque infection. BMC Genomics 14:145, DOI 10.1186/1471-2164-14-145, URL http://dx.doi.org/ 10.1186/1471-2164-14-145 [109] Rangaka MX, Cavalcante SC, Marais BJ, Thim S, Martinson NA, Swami- nathan S, Chaisson RE (2015) Controlling the seedbeds of tuberculosis: di- agnosis and treatment of tuberculosis infection. Lancet 386(10010):2344– 53, DOI 10.1016/S0140-6736(15)00323-2, URL https://www.ncbi. nlm.nih.gov/pubmed/26515679 [110] Reyes JF, Tanaka MM (2010) Mutation rates of spoligotypes and variable numbers of tandem repeat loci in Mycobacterium tuberculosis. Infect Genet Modelling MTB 35

Evol 10(7):1046–1051, DOI 10.1016/j.meegid.2010.06.016, URL http:// dx.doi.org/10.1016/j.meegid.2010.06.016 [111] Reyes JF, Francis AR, Tanaka MM (2008) Models of deletion for visualizing bacterial variation: an application to tuberculosis spoligotypes. BMC Bioin- formatics 9:496, DOI 10.1186/1471-2105-9-496, URL http://dx.doi. org/10.1186/1471-2105-9-496 [112] Reyes JF, Chan CHS, Tanaka MM (2012) Impact of homoplasy on variable numbers of tandem repeats and spoligotypes in Mycobacterium tuberculosis. Infect Genet Evol 12(4):811–818, DOI 10.1016/j.meegid.2011.05.018, URL http://dx.doi.org/10.1016/j.meegid.2011.05.018 [113] Rosenberg NA, Tsolaki AG, Tanaka MM (2003) Estimating change rates of genetic markers using serial samples: applications to the transposon IS6110 in Mycobacterium tuberculosis. Theor Popul Biol 63(4):347–363 [114] Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–25, URL http: //www.ncbi.nlm.nih.gov/pubmed/3447015 [115] Sergeev R, Colijn C, Cohen T (2011) Models to understand the population- level impact of mixed strain M. tuberculosis infections. J Theor Biol 280(1):88–100, DOI 10.1016/j.jtbi.2011.04.011, URL http://dx.doi. org/10.1016/j.jtbi.2011.04.011 [116] Shabbeer A, Cowan LS, Ozcaglar C, Rastogi N, Vandenberg SL, Yener B, Bennett KP (2012) TB-Lineage: an online tool for classification and analysis of strains of Mycobacterium tuberculosis complex. Infect Genet Evol 12(4):789–797, DOI 10.1016/j.meegid.2012.02.010, URL http:// dx.doi.org/10.1016/j.meegid.2012.02.010 [117] Shabbeer A, Ozcaglar C, Yener B, Bennett KP (2012) Web tools for molec- ular epidemiology of tuberculosis. Infect Genet Evol 12(4):767–781, DOI 10.1016/j.meegid.2011.08.019, URL http://dx.doi.org/10.1016/ j.meegid.2011.08.019 [118] Shrestha S, Knight GM, Fofana M, Cohen T, White RG, Cobelens F, Dowdy DW (2014) Drivers and trajectories of resistance to new first-line drug reg- imens for tuberculosis. 
Open Forum Infect Dis 1(2):ofu073, DOI 10.1093/ ofid/ofu073, URL http://dx.doi.org/10.1093/ofid/ofu073 [119] Small PM, Hopewell PC, Singh SP, Paz A, Parsonnet J, Ruston DC, Schecter GF, Daley CL, Schoolnik GK (1994) The epidemiology of tuberculosis in San Francisco: A population-based study using conventional and molecular methods. N Engl J Med 330:1703–1709 [120] Smith NH, Hewinson RG, Kremer K, Brosch R, Gordon SV (2009) Myths and misconceptions: the origin and evolution of Mycobacterium tuberculosis. Nat Rev Microbiol 7(7):537–544, DOI 10.1038/nrmicro2165, URL http: //dx.doi.org/10.1038/nrmicro2165 [121] Stadler T (2011) Inferring epidemiological parameters on the basis of allele frequencies. Genetics 188(3):663–672, DOI 10.1534/genetics.111.126466, URL http://dx.doi.org/10.1534/genetics.111.126466 36 J. Pecerska,ˇ J. Wood, M. M. Tanaka and T. Stadler

[122] Stadler T, Bonhoeffer S (2013) Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods. Philosophical Transactions of the Royal Society B: Biological Sciences 368(1614) [123] Stadler T, Kouyos RD, von Wyl V, Yerly S, Boni¨ J, Burgisser¨ P, Klimkait T, Joos B, Rieder P, Xie D, Gunthard¨ HF, Drummond A, Bonhoeffer S, the Swiss HIV Cohort Study (2012) Estimating the basic reproductive number from viral sequence data. Mol Biol Evol 29:347–357 [124] Stadler T, Kuhnert D, Bonhoeffer S, Drummond AJ (2013) Birth-death sky- line plot reveals temporal changes of epidemic spread in HIV and hep- atitis C virus (HCV). Proc Natl Acad Sci U S A 110(1):228–33, DOI 10.1073/pnas.1207965110, URL http://www.ncbi.nlm.nih.gov/ pubmed/23248286 [125] Steel M (2013) Consistency of bayesian inference of resolved phylogenetic trees. Journal of theoretical biology 336:246–249 [126] Supply P, Allix C, Lesjean S, Cardoso-Oelemann M, Rusch-Gerdes¨ S, Willery E, Savine E, de Haas P, van Deutekom H, Roring S, et al (2006) Pro- posal for standardization of optimized mycobacterial interspersed repetitive unit-variable-number tandem repeat typing of Mycobacterium tuberculosis. Journal of clinical microbiology 44(12):4498–4510 [127] Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular biology and evolution 10(3):512–526 [128] Tanaka MM, Francis AR (2005) Methods of quantifying and visualising outbreaks of tuberculosis using genotypic information. Infect Genet Evol 5(1):35–43, DOI 10.1016/j.meegid.2004.06.001, URL http://dx.doi. org/10.1016/j.meegid.2004.06.001 [129] Tanaka MM, Rosenberg NA, Small PM (2004) The control of copy number of IS6110 in Mycobacterium tuberculosis. 
Mol Biol Evol 21(12):2195–2201, DOI 10.1093/molbev/msh234, URL http://dx.doi.org/10.1093/ molbev/msh234 [130] Tanaka MM, Francis AR, Luciani F, Sisson SA (2006) Using ap- proximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data. Genetics 173(3):1511–1520, DOI 10.1534/genetics.106.055574, URL http://dx.doi.org/10.1534/ genetics.106.055574 [131] Tavare´ S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on mathematics in the life sciences 17:57–86 [132] To TH, Jung M, Lycett S, Gascuel O (2016) Fast Dating Using Least-Squares Criteria and Algorithms. Syst Biol 65(1):82–97, DOI 10.1093/sysbio/syv068, URL http://www.ncbi.nlm.nih.gov/pubmed/26424727 [133] Trauer JM, Denholm JT, McBryde ES (2014) Construction of a mathemat- ical model for tuberculosis transmission in highly endemic regions of the Asia-Pacific. J Theor Biol 358:74–84, DOI 10.1016/j.jtbi.2014.05.023, URL http://dx.doi.org/10.1016/j.jtbi.2014.05.023 Modelling MTB 37

[134] Vogler AJ, Keys C, Nemoto Y, Colman RE, Jay Z, Keim P (2006) Effect of repeat copy number on variable-number tandem repeat mutations in Es- cherichia coli O157:H7. J Bacteriol 188(12):4253–4263, DOI 10.1128/JB. 00001-06, URL http://dx.doi.org/10.1128/JB.00001-06 [135] Volz EM (2012) Complex population dynamics and the coalescent under neu- trality. Genetics 190(1):187–201, DOI 10.1534/genetics.111.134627, URL http://dx.doi.org/10.1534/genetics.111.134627 [136] Volz EM, Kosakovsky Pond SL, Ward MJ, Leigh Brown AJ, Frost SDW (2009) Phylodynamics of infectious disease epidemics. Genetics 183(4):1421–1430, DOI 10.1534/genetics.109.106021, URL http://dx. doi.org/10.1534/genetics.109.106021 [137] Vynnycky E, Fine PEM (1997) The natural history of tuberculosis: the impli- cations of age-dependent risks of disease and the role of reinfection. Epidemi- ology and Infection 119(2):183–201, DOI Doi10.1017/S0950268897007917, URL ://WOS:A1997YD95900010 [138] Waaler H, Geser A, Andersen S (1962) The use of mathematical models in the study of the epidemiology of tuberculosis. American Journal of Public Health and the Nations Health 52(6):1002–1013 [139] Wakeley J (2009) Coalescent theory: an introduction. Roberts and Co., Greenwood Village, CO [140] Walker TM, Ip CL, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, Eyre DW, Wilson DJ, Hawkey PM, Crook DW, et al (2013) Whole-genome se- quencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis 13(2):137–146 [141] Warren RM, van der Spuy GD, Richardson M, Beyers N, Borgdorff MW, Behr MA, van Helden PD (2002) Calculation of the stability of the IS6110 banding pattern in patients with persistent Mycobacterium tuberculosis dis- ease. J Clin Microbiol 40(5):1705–1708 [142] Warren RM, Victor TC, Streicher EM, Richardson M, Beyers N, Gey van Pittius NC, van Helden PD (2004) Patients with active tuberculosis often have different strains in the same sputum specimen. 
Am J Respir Crit Care Med 169(5):610–614, DOI 10.1164/rccm.200305-714OC, URL http:// dx.doi.org/10.1164/rccm.200305-714OC [143] Weniger T, Krawczyk J, Supply P, Harmsen D, Niemann S (2012) Online tools for polyphasic analysis of Mycobacterium tuberculosis complex geno- typing data: Now and next. Infection, Genetics and Evolution 12(4):748–754 [144] WHO, et al (2016) Global tuberculosis report 2016. Tech. rep., World Health Organization [145] Wigginton JE, Kirschner D (2001) A model to predict cell-mediated immune regulatory mechanisms during human infection with Mycobacterium tuber- culosis. J Immunol 166(3):1951–1967 [146] Wirth T, Hildebrand F, Allix-Beguec´ C, Wolbeling¨ F, Kubica T, Kre- mer K, van Soolingen D, Rusch-Gerdes¨ S, Locht C, Brisse S, Meyer A, Supply P, Niemann S (2008) Origin, spread and demography of the Mycobacterium tuberculosis complex. PLoS Pathog 4(9):e1000,160, DOI 38 J. Pecerska,ˇ J. Wood, M. M. Tanaka and T. Stadler

10.1371/journal.ppat.1000160, URL http://dx.doi.org/10.1371/ journal.ppat.1000160 [147] Yang Z (1994) Maximum likelihood phylogenetic estimation from DNA se- quences with variable rates over sites: approximate methods. J Mol Evol 39(3):306–14, DOI 10.1007/BF00160154, URL http://www.ncbi. nlm.nih.gov/pubmed/7932792 [148] Yang Z (1994) Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Sys- tematic biology 43(3):329–342 [149] Yang Z (1996) Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology & Evolution 11(9):367–372, DOI 10.1016/ 0169-5347(96)10041-0 [150] Yang Z (2014) Molecular evolution: a statistical approach. OUP Oxford [151] Zainuddin Z, Dale J (1990) Does Mycobacterium tuberculosis have plas- mids? Tubercle 71:43–49, DOI 10.1016/0041-3879(90)90060-L, URL http://www.sciencedirect.com/science/article/pii/ 004138799090060L [152] Zheng N, Whalen CC, Handel A (2014) Modeling the potential impact of host population survival on the evolution of M. tuberculosis latency. PLoS One 9(8):e105,721, DOI 10.1371/journal.pone.0105721, URL http: //dx.doi.org/10.1371/journal.pone.0105721 [153] Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins. Evolving genes and proteins 97:97–166 [154] Zwerling A, Gomez GB, Pennington J, Cobelens F, Vassall A, Dowdy DW (2016) A simplified cost-effectiveness model to guide decision-making for shortened anti-tuberculosis treatment regimens. The International Journal of Tuberculosis and Lung Disease 20(2):257–260

TRANSMISSION FITNESS COSTS IN MDR-TB 3

This paper consists of two major parts: a simulation study and an empirical data analysis. In the simulation study, I simulate a tracked short-term TB epidemic with varying basic reproductive ratios for the two types of strains, drug-sensitive and drug-resistant. I then analyse the simulated phylogenetic trees, with tips labelled by their resistance type, using the multi-type birth-death (MTBD) model (Kühnert et al., 2016), which has an efficient implementation in BEAST2 (Bouckaert et al., 2014). While the model used for simulation is a high-level description of the dynamics, we argue that its assumptions are reasonable for the epidemics we wish to study: short-term epidemics with available unified treatment strategies.

The major difference between the simulation and analysis models lies in the fundamental dynamics of TB. The Susceptible-Exposed-Infected-Treated (SEIT) model used for simulation contains TB latency and a treatment period with a possibility of relapse, among other TB-specific dynamics. MTBD, on the other hand, is a relatively simple birth-death model without complex dynamics, which nevertheless turns out to be powerful enough to estimate the relative transmission fitness of the drug-resistant strains. The simulation study shows that, although the SEIT setup is vastly different from MTBD, we still recover the main parameters accurately and precisely, and can thus use the analysis setup to estimate the relative fitness of drug resistance in an empirical dataset.

As we are looking at short-term dynamics, the analyses are limited to a single lineage and to a strict definition of clusters, which span a time period on the same order as the sampling dates (see Chapter 5). We analyse the empirical data with an approach similar to the one used for the simulations. In particular, we perform a cluster-based analysis of a dataset containing sequences from re-treatment cases of MDR-TB, among which some strains additionally exhibit pyrazinamide resistance. We use the resistance statuses to label the tips of the tree accordingly. This approach thus allows us to make use of all available data: full genome sequences, sampling dates, and resistance statuses. The analyses confirm the hypothesis that, in the given dataset, pyrazinamide resistance has a significant transmission fitness cost when compared to strains that are also MDR but sensitive to pyrazinamide (Hertog, Sengstake, and Anthony, 2015).

This work provides a proof of concept for a new type of analysis that can be done on genetic sequencing data where the data can be partitioned into two or more classes (e.g. by resistance status). It allows us to evaluate the impact of an additional drug resistance on the transmission fitness of TB in comparison to a baseline strain type. The BEAST2 configuration files used in the analyses provide a template for similar analyses in other locations or on other pathogens.

This work is currently in review at Epidemics as a manuscript titled “Quantifying transmission fitness costs of multi-drug resistant tuberculosis”, of which I am the first author. The current version of the manuscript follows, together with the supplementary materials.

Quantifying transmission fitness costs of multi-drug resistant tuberculosis.

Jūlija Pečerska (a,g,*), Denise Kühnert (b), Conor J. Meehan (c), Mireia Coscollá (d), Bouke C. de Jong (c), Sebastien Gagneux (e,f), Tanja Stadler (a,g,*)

a Department of Biosystems Science and Engineering, ETHZ, Basel, Switzerland
b Transmission, Infection, Diversification & Evolution Group, Max Planck Institute for the Science of Human History, Jena, Germany
c Unit of Mycobacteriology, Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
d Institute for Integrative Systems Biology (I2SysBio), University of Valencia-CSIC, València, Spain
e Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, Basel, Switzerland
f University of Basel, Basel, Switzerland
g Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

Abstract

As multi-drug resistant tuberculosis (MDR-TB) continues to spread, investigating the transmission potential of different drug-resistant strains becomes an ever more pressing topic in public health. While phylogenetic and transmission tree inferences provide valuable insight into possible transmission chains, phylodynamic inference combines evolutionary and epidemiological analyses to estimate the parameters of the underlying epidemiological processes, allowing us to describe the overall dynamics of disease spread in the population. In this study, we introduce an approach to Mycobacterium tuberculosis (M. tuberculosis) phylodynamic analysis employing an existing computationally efficient model to quantify the transmission fitness costs of drug resistance with respect to drug-sensitive strains. To determine the accuracy and precision of our approach, we first perform a simulation study, mimicking the simultaneous spread of drug-sensitive and drug-resistant tuberculosis (TB) strains. We analyse the simulated transmission trees using the phylodynamic multi-type birth-death model (MTBD; Kühnert et al., 2016) within the BEAST2 framework and show that this model can estimate the parameters of the epidemic well, despite the simplifying assumptions that MTBD makes compared to the complex TB transmission dynamics used for simulation. We then apply the MTBD model to an M. tuberculosis lineage 4 dataset that primarily consists of MDR sequences. Some of the MDR strains additionally exhibit resistance to pyrazinamide, an important first-line anti-tuberculosis drug. Our results support the previously proposed hypothesis that pyrazinamide resistance confers a transmission fitness cost to the bacterium, which we quantify for the given dataset. Importantly, our sensitivity analyses show that the estimates are robust to different prior distributions on the resistance acquisition rate, but are affected by the size of the dataset, i.e. we estimate a higher fitness cost when using fewer sequences for analysis. Overall, we propose that MTBD can be used to quantify the transmission fitness cost for a wide range of pathogens where the strains can be appropriately divided into two or more categories with distinct properties.

Keywords: antibiotic resistance, multi-type birth-death model, phylodynamics, whole genome M. tuberculosis

* Corresponding author. Email addresses: [email protected] (Jūlija Pečerska), [email protected] (Tanja Stadler)

Preprint submitted to Elsevier, December 20, 2019

1. Introduction

Tuberculosis (TB) continues to be a major problem for global public health. A connected and pressing issue is the continued detection of drug-resistant TB, and especially of multidrug-resistant (MDR-TB) strains, which resist treatment by at least two main first-line drugs, rifampicin and isoniazid. Rifampicin-resistant and MDR-TB made up as much as half a million of the 10.6 million new tuberculosis cases worldwide in 2016 (WHO, 2017). In the same year, an estimated 19% of previously treated TB cases were rifampicin- or multidrug-resistant. While MDR-TB is treatable and curable by second-line drugs, there are only a few second-line treatment options, all of which require regimens that last from 9 months up to 2 years and are expensive and toxic (WHO, 2016). While these treatments are normally successful in curing MDR-TB patients, WHO reports that in 2017 an average of 6.2% of MDR-TB cases resisted treatment by the most effective second-line anti-TB drugs, representing the so-called extensively drug-resistant (XDR-TB) cases. In at least 127 countries worldwide, one or more cases of XDR-TB had been reported by the end of 2017 (WHO, 2018). The continuous detection of such strains in transmission clusters and the lack of new anti-TB drugs highlights the need for preventing further transmission of drug-resistant strains (Kendall et al., 2015).

Any treatment selects for drug-resistant strains and any drug resistance is a burden for the individual patient. A resistant strain with high transmission potential may cause a resistant epidemic and thus poses a serious risk for the general population. Thus, public health measures should aim at preventing the emergence of resistant strains with a high transmission potential. In this study we aim to quantify the transmission fitness cost of drug-resistant strains using M. tuberculosis genomic data.

New sequencing technologies allow us to obtain large numbers of M. tuberculosis genome sequences (Meehan et al., 2018; Sengstake et al., 2017; Casali et al., 2014). Extensive genetic sequencing allows us to detect some types of drug resistance earlier compared to phenotypic methods (Yakrus et al., 2014; Miotto et al., 2014; Pankhurst et al., 2016; Colijn & Cohen, 2016), and to detect drug resistance in cases when standardised tests are not available (Horne et al., 2012). New technologies allow real-time whole genome sequencing (WGS) of ongoing epidemics (Walker et al., 2018).

Phylogenetic and transmission analyses of WGS data attempt to reconstruct transmission between infected individuals. Tools for phylogenetic and transmission tree reconstruction from TB WGS data are increasingly becoming available, e.g. Didelot et al. (2014, 2017); Klinkenberg et al. (2017).

Phylodynamic analysis, an approach introduced more than a decade ago by Grenfell et al. (2004), aims at unifying the inference of epidemiological and evolutionary dynamics of pathogens. This approach aims to estimate the parameters of the tree-generating process, e.g. transmission and cure rates in the case of an epidemiological model, jointly with the evolutionary relationships between sampled sequences. There are still a number of challenges to be tackled in phylodynamics (Frost et al., 2015), particularly, since the first generation of phylodynamic tools have been used and validated exclusively on viral sequences. Now that whole genome sequences are readily available for other types of pathogens, rigorous testing needs to be performed to further validate their use (Biek et al., 2015).

Few phylodynamic methodological approaches have been developed specifically for analysing M. tuberculosis datasets. One example is work by Merker et al. (2018), where population size estimates were used to approximate the fitness of strains with compensatory mutations. However, to our knowledge, no previous study has directly estimated epidemiological dynamics such as the relative transmission fitness of drug resistant strains for TB.

In this study, we first investigate the appropriateness of a phylodynamic tool developed for viral pathogens to study the epidemiological dynamics of TB. We first simulate epidemics under an epidemiological model specific for TB, including latent (exposed) periods and treatment periods. We then apply this phylodynamic tool to the simulated data and evaluate the fitness costs for drug-resistant TB strains compared to the drug-sensitive TB strains. The tool is called the multi-type birth-death (MTBD) model (Kühnert et al., 2016), which works within the BEAST2 software framework (Bouckaert et al., 2014). Inference under MTBD is based on a multi-type birth-death-sampling process. It assumes that different strain types (such as drug sensitive and drug resistant) circulate simultaneously within an epidemic, and allows estimation of type-dependent transmission rates based on sequencing data. To ensure computational feasibility, our configuration of MTBD effectively ignores the complex TB dynamics such as latency and treatment.

Throughout, we estimate the relative transmission fitness $r_\lambda$ of drug-resistant strains as the ratio of the drug-resistant strain transmission rate ($\lambda_R$) to the drug-sensitive strain transmission rate ($\lambda_S$), i.e. $r_\lambda = \lambda_R / \lambda_S$. This definition of relative transmission fitness quantifies the average decrease or increase in the number of new cases per unit of time caused by a patient infected with a resistant strain compared to a patient with a drug-sensitive strain. For a given relative fitness $r_\lambda$, the transmission fitness cost is $(1 - r_\lambda) \times 100\%$. This configuration of MTBD has been applied to estimate the relative transmission fitness of drug-resistant mutations for the human immunodeficiency virus (HIV) (Kühnert et al., 2018). However, we present here the first study in which it is applied to a bacterial (and hence much more slowly evolving) pathogen.

Our simulation study shows that MTBD parameter estimates are highly robust when estimating the relative transmission fitness $r_\lambda$ of drug resistant strains, despite long periods of treatment and latency used in the simulation scenarios.

To illustrate the utility of MTBD for TB epidemiological analysis, we apply it to an M. tuberculosis dataset sampled over the course of five years in Kinshasa, the capital of the Democratic Republic of the Congo. The dataset contains sequences from re-treatment cases of TB which mainly exhibit MDR phenotypes. Many of these MDR sequences also carry substitutions that indicate pyrazinamide resistance (Meehan et al., 2018). We set out to test the previously posed hypothesis suggesting that pyrazinamide resistance reduces the ability of a strain to be transmitted from host to host (den Hertog et al., 2015). In the terminology used here this translates to the relative transmission fitness $r_\lambda$ of drug-resistant strains being below one. To test this hypothesis, we quantify the relative transmission fitness of additional pyrazinamide resistance when compared to pyrazinamide-sensitive MDR strains.

2. Materials and Methods

2.1. Simulation study

2.1.1. Simulating epidemics

In order to simulate realistic epidemics we designed an epidemiological model that accounts for the most important aspects of TB dynamics. This model builds upon previous models described in the literature (e.g. Gomes et al., 2007; Cohen et al., 2009; Pinho et al., 2015; Dowdy et al., 2013), and was adjusted to more closely represent the spread of drug-sensitive and drug-resistant M. tuberculosis strains within the scope of a single epidemic. In our modelling framework, strain spread is described by the Susceptible-Exposed-Infectious-Treated (SEIT$_2$) model, shown in Figure 1a. The model is tailored towards M. tuberculosis transmission as follows. Upon exposure to the bacterium only 10% of the susceptible population ($S$ compartment) proceed to infection; others will develop neither disease nor infectiousness and can therefore be ignored in the model (Vynnycky & Fine, 1997). As the sampling for the available dataset has been done within the span of five years, we restrict our simulations to modelling short-term epidemics. Previous infections have no clear effect on immunity to consecutive disease (Verver et al., 2005; Chiang & Riley, 2005; Yew & Leung, 2005), and existing methods of vaccination (e.g. the BCG vaccine) seem to have a negligible effect on infectious disease dynamics in adults in endemic settings (Gomes et al., 2004). Therefore, recovered individuals return to being susceptible after successful treatment.

Each simulated epidemic starts with a single patient infected with a drug-sensitive strain (compartment $I_S$ in Figure 1a) and $N - 1$ susceptible individuals (compartment $S$ in Figure 1a), where $N$ is the total population size. A patient enters the latent (exposed) phase ($E_S$ compartment) upon infection with a drug-sensitive strain, and moves (with rate $\sigma$) to the active phase of TB infection ($I_S$ compartment). Patients in the active phase can transmit to susceptible individuals from the $S$ compartment with rate $\beta_S$, die with rate $\pi$, and start treatment (moving to the $T_S$ compartment) with rate $\tau$. As a patient starts treatment, they will be sampled and the M. tuberculosis genome sequenced with probability $p_S$. We assume that successful treatment always leads to recovery (i.e. the individual moves to the $S$ compartment with rate $\gamma_S$), whereas dropped treatment or otherwise unsuccessful treatment leads to disease relapse (i.e. the individual moves to the $I_S$ compartment again, with rate $\kappa_S$). Furthermore, we assume that there are no co-infections. As all diagnosed cases in the study area are currently getting treatment, there is little possibility for self-cure. Hence, infected individuals never recover without treatment in our model.

To account for the drug-resistant strains (compartments $E_R$, $I_R$, $T_R$ in Figure 1a), we add a rate $\mu$ with which individuals in the treated compartment $T_S$ may develop drug resistance and thus move into a resistant infectious compartment $I_R$. We assume that drug resistance is never lost, i.e. a resistant individual cannot move back to the sensitive class. As it is also possible for drug resistance to be transmitted, we allow new infections which follow the same

in general an SEIT$_m$ model with $m - 1$ resistant classes.

The simulations were performed using the Bayesian inference framework BEAST2 (Bouckaert et al., 2014). The BEAST2 package MASTER (Moments and Stochastic Trees from Event Reactions) (Vaughan & Drummond, 2013) was used to simulate stochastic realisations of epidemic histories. The simulation model was specified as chemical master equations (CMEs) describing the transitions between different SEIT$_m$ model states happening with predefined rates. The stochastic simulator produced a random outcome of the epidemic in the form of a transmission tree. The simulations with initial population sizes $N = 200{,}000$ or $N = 1{,}000$ stopped when 300 or 150 cases were sampled, respectively, which is close to the sample number available in the empirical dataset from Kinshasa. The large population size results in exponentially growing epidemics. The smaller population size of $N = 1{,}000$ resulted in epidemics where the infected population size saturated (see supplementary Figure 9).

To match real life sampling and to ensure identifiability, we only keep trees with a minimum sample of 30 patients with a drug-resistant infection and 30 patients with a drug-sensitive infection, and restart the simulation otherwise. We also restart the simulation in cases when the epidemic died out before reaching the desired number of sampled cases. While this constraint is biasing the tree sample for surviving epidemics, we want to mimic real world data, and we would not be able to obtain accurate estimates for empirical datasets that have fewer sequences.

To mimic real life situations we used a number of different configurations for our model parameters $\beta_x$, $\sigma$, $\tau$, $\gamma_x$, $\kappa_x$, $p_x$, $\pi$, and $\mu$, where $x \in \{S, R\}$. We specify the basic reproductive number $R_{0,S}$ of the sensitive strain type, defined as the expected number of secondary infections caused by a single drug-sensitive infected individual at the start of the simulation prior to changing resistance status. Thus, in case a drug-sensitive individual evolves drug resistance, only the secondary infections caused prior to drug resistance contribute to the $R_{0,S}$. Analogously, the respective basic reproductive number of resistant strains is $R_{0,R}$.

In the simulation setup, we specify $R_{0,x}$ instead of $\beta_x$, $x \in \{S, R\}$. This re-parametrization of $\beta_x$ given the parameters $R_{0,x}$, $\sigma$, $\tau$, $\gamma_x$, $\kappa_x$, $p_x$, $\pi$, $\mu$, $x \in \{S, R\}$, is described in the Supplement. We use combinations of values for $R_{0,S}$ and $R_{0,R}$ for which the drug-resistant strain causes either the same or a slightly lower number of secondary infections than the drug-sensitive strain.
We also dynamics as for drug-sensitive strains. Individuals enter included a case in which the basic reproductive number of the latent phase (ER compartment) upon infection at a the resistant strain R0,R is below the epidemic threshold rate βR and progress to active disease (IR compartment) 1. The R0,x parameter combinations for simulation were at a rate σ. Again, infectious individuals can be treated as follows: (R0,S, R0,R) = (1.3, 1.1), (1.2, 1.1), (1.1, 1.1), (e.g. using second-line treatment in the case of MDR-TB) and (1.2, 0.9). Note that all R0 values are around 1, since and thus move to compartment TR, where they can re- TB is endemic in many countries (Stadler, 2011; Ma et al., cover or relapse. A patient entering the TR class will be 2018). sampled and the M. tuberculosis genome sequenced with The time spent in the exposed and infectious compart- probability pR. We call this model an SEIT2 model, and ments together (prior to first treatment or death) is fixed 3 ?S k ? S IS pS ?


(a) SEIT2 model with drug-sensitive and drug-resistant strains, subscripts S and R respectively. Rates are marked as follows (x ∈ {S, R}): βx - infection rate, σ - disease progression rate, π - death rate, τ - treatment rate, κx - relapse rate, γx - recovery rate, µ - resistance evolution rate.

(b) MTBD2 model with drug-sensitive and drug-resistant strains, subscripts S and R respectively. Rates are marked as follows (x ∈ {S, R}): $\lambda_x^k$ - time-dependent birth rate per interval k, δx - recovery rate, µ - resistance evolution rate.


(c) The MTBD7 model setup for Kinshasa analysis of clusters with at most 6 different types of pyrazinamide resistance substitutions per cluster. Each of the IRn compartments, where n is the compartment number, represents a distinctive resistance mutation within a cluster. Rates are marked as follows (x ∈ {S, R}): $\lambda_x^k$ - time-dependent birth rate per interval k, δx - recovery rate, µ - resistance evolution rate. Resistance cannot be lost, only acquired, and all resistant strains have the same relative transmission fitness.

Figure 1: Different models used for simulation and analysis. In all figures, the compartments are marked as follows: Susceptible - S, Exposed - Ex, Infected - Ix, under Treatment - Tx, where x is either drug-sensitive or drug-resistant, S or R respectively. Sampling probability is marked by px, where x is S or R.
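The compartment transitions of Figure 1a can be illustrated with a small stochastic simulation. The sketch below is not the MASTER specification used in the study; it is a minimal Gillespie-style loop written under several assumptions: the function name `simulate_seit2` is hypothetical, transmission is taken to be proportional to S × I / N, deceased patients are not replaced, every treatment start (including one after relapse) is sampled with probability px, and all rates are finite (so the SIS2 limit is not covered).

```python
import random

def simulate_seit2(n, beta, sigma, tau, pi, gamma, kappa, mu, p,
                   max_samples, seed=0):
    """Stochastic sketch of the SEIT2 compartments (S, Ex, Ix, Tx).

    beta, gamma, kappa and p are dicts keyed by strain type "S" or "R".
    All rates must be finite. Returns the final compartment counts and
    the sampled (time, strain) pairs.
    """
    rng = random.Random(seed)
    # One drug-sensitive infectious patient and n - 1 susceptibles.
    c = {"S": n - 1, "ES": 0, "IS": 1, "TS": 0,
         "ER": 0, "IR": 0, "TR": 0, "dead": 0}
    t, samples = 0.0, []
    while len(samples) < max_samples:
        events = [(mu * c["TS"], "TS", "IR")]  # resistance acquisition
        for x in "SR":
            e, i, tr = "E" + x, "I" + x, "T" + x
            events += [
                # Assumed transmission term: beta * S * I / n.
                (beta[x] * c["S"] * c[i] / n, "S", e),
                (sigma * c[e], e, i),         # progression to active disease
                (pi * c[i], i, "dead"),       # death
                (tau * c[i], i, tr),          # start of treatment
                (gamma[x] * c[tr], tr, "S"),  # recovery
                (kappa[x] * c[tr], tr, i),    # relapse
            ]
        events = [ev for ev in events if ev[0] > 0.0]
        if not events:
            break  # no infected individuals left: the epidemic died out
        weights = [ev[0] for ev in events]
        t += rng.expovariate(sum(weights))  # waiting time to the next event
        _, src, dst = rng.choices(events, weights=weights)[0]
        c[src] -= 1
        c[dst] += 1
        # Assumption: every treatment start is sampled with probability p[x].
        if src[0] == "I" and dst[0] == "T" and rng.random() < p[dst[1]]:
            samples.append((t, dst[1]))
    return c, samples
```

A call such as `simulate_seit2(1000, {"S": 3.0, "R": 2.2}, 2.0, 1.8, 0.2, {"S": 1.0, "R": 0.5}, {"S": 0.1, "R": 0.075}, 0.04, {"S": 0.1, "R": 0.3}, max_samples=150)` uses the rates reported for the simulations with half a year of latency; the β values are placeholders, since the paper derives them from R0,x in the Supplement. Unlike MASTER, this sketch records only sampling times and strain types, not a transmission tree.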

to one time unit, which we set to one year in our simulations. The proportion of time spent in the exposed and infectious compartments, respectively, is varied. The effect of a higher proportion of time spent in the exposed compartment on the tree shape is that bifurcations (new infections) will start to appear on branches later after the start of the branch, since individuals reach the infectious state later. The average time spent in the exposed compartment in the simulations takes the values tE = [0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9] year, which is specified by the rate σ = [∞, 5, 2.5, 2, 5/3, 5/4, 10/9] year⁻¹. Hence, the average times spent in the infectious compartment are tI = 1 − tE = [1, 0.8, 0.6, 0.5, 0.4, 0.2, 0.1] year, specified by τ + π = [1, 5/4, 5/3, 2, 2.5, 5, 10] year⁻¹. Thus, 1/σ + 1/(τ + π) = 1, which keeps the total time of infection before first treatment or death at one year. This way we allow the time a person spends while exposed and not yet infectious to be up to nine times longer than the infectious time. Individuals infected with either sensitive or resistant strains are removed from the infectious pool due to fatal outcomes with probability 0.1 and proceed to treatment with probability 0.9 (i.e. τ = 0.9 × 1/tI). The following rates are used for the recovery rate as consequence of treatment: γS = 1.0 year⁻¹, γR = 0.5 year⁻¹, which takes into account the fact that resistant strains need longer treatment given currently recommended treatment regimens. Mean rates of relapse are set to κS = 0.1 year⁻¹, κR = 0.075 year⁻¹, corresponding to a lower chance of relapse for the resistant strains with appropriate treatment. Drug resistance mutations were acquired at a rate of µ = 0.04 year⁻¹. We assume that M. tuberculosis drug resistance reversal does not occur (Andersson & Hughes, 2010; Casali et al., 2012; Allen et al., 2017). Unfortunately, relapse and drug resistance acquisition rates have not yet been quantified conclusively, so these rates were set in an ad hoc way. We also simulated trees without exposure or possible relapse by setting σ = ∞ and γx = ∞ (essentially setting the time spent in the Ex and Tx compartments to 0, x ∈ {S, R}), referred to as the SIS2 model.

Due to drug resistant strains being of greater clinical interest, a higher proportion of them is sampled compared to sensitive strains. Hence, the sampling probabilities were set to pS = 0.1 and pR = 0.3 in the simulations. To account for the stochasticity of the epidemiological process, we ran a hundred separate instances for each of the different parameter configurations.

2.1.2. Analysis of the simulated epidemics

Employing a full SEITm model for phylodynamic inference would be very demanding computationally (see supplementary section MTBD vs. SEITm). Instead, we analyse the simulated trees using the BEAST2 package MTBD (Kühnert et al., 2016). We configure it to fit a simpler epidemiological model (Figure 1b), and estimate the posterior distribution of the model parameters given each tree simulated using SEIT2. MTBD allows us to estimate a time-dependent transmission rate $\lambda_x^k$, which accounts for possible changes in transmission rates due to e.g. susceptible depletion or a newly introduced effective vaccination strategy. Transmission rates are modelled as piecewise constant rates, i.e. the transmission rate is constant within a user-specified time interval k ($\lambda_x^k$), after which it may change to another constant rate ($\lambda_x^{k+1}$), and so on. The number of time intervals is user-specified. The model estimates a removal (become uninfectious) rate δ. We use an MTBD setup in which all infected individuals are infectious, i.e. the latent and treatment phases are ignored. We further assume that δ is constant through time as the treatment strategies stay the same during the time spanned by our phylogenetic tree. Sensitive and resistant strains have the same δ as any difference in time until treatment is likely negligible. Similarly, MTBD estimates the sampling proportions pS and pR and resistance acquisition rate µ, which are directly comparable between MTBD and SEIT2. We fix the sampling probability in the simulation analysis to the true values (see supplementary section Parameter definitions). We use the MTBD setup which disallows so-called sampled ancestors.

In the classic MTBD setup, one would estimate a separate $\lambda_x^k$ per time interval k and per strain x ∈ {S, R}. However, we argue that the relative transmission fitness $\lambda_R^k / \lambda_S^k$ in real epidemics is independent of the speed of spread (e.g. due to the varying number of susceptible individuals) at a particular point in time, thus the disadvantage a strain has after developing drug resistance stays constant. This means that we assume $r_\lambda = \lambda_R^k / \lambda_S^k$ is constant for all time intervals k. We implemented this assumption in MTBD by estimating rλ, δS, δR, µ and $R_{e,S} = \lambda_S^k / (\delta_S + \mu)$. The latter is called the effective reproductive number which we estimate for the sensitive strain. The effective reproductive number Re,S of the sensitive (resp. Re,R of the resistant) strain type is defined as the expected number of secondary infections caused by a single drug-sensitive (resp. drug-resistant) infected individual at time t, before leaving the class x, x ∈ {S, R}. Thus, analogous to the basic reproductive number, in case a drug-sensitive individual evolves drug resistance, we only count the secondary infections caused prior to evolving the drug resistance. Thus, $R_{e,S}^k = \lambda_S^k / (\delta_S + \mu)$ and $R_{e,R}^k = \lambda_R^k / \delta_R$ for each time interval k.

We evaluate the performance of MTBD for analysing SEIT2 model simulations by comparing the estimated relative transmission rate rλ to the true ratio βR/βS in SEIT2. The prior distributions used for Re,S, rλ, δx, p, and µ are provided in Table 1. We perform all analyses assuming both (i) a constant transmission rate and (ii) a piecewise constant transmission rate over three time intervals. For (ii) we split the time covered by the simulated tree such that each interval has approximately the same number of branching events. For all analyses of the simulation runs the Markov chain Monte Carlo (MCMC) reached an ESS for all parameters of at least 200. For some of the simulation configurations with tE = 0.9 and tI = 0.1 the simulations failed to run due to the epidemics dying out almost instantly, thus they were excluded from the results.

Analysis | Re,S | rλ | δx | µ | px
Simulations | Lognormal(0, 1.25) | Lognormal(0, 0.5) | Lognormal(0, 0.5) | Exp(1.0) | Fixed
Kinshasa | Lognormal(0, 1.25) | Lognormal(0, 0.5) | Lognormal(0, 0.5) | Exp(1.0), Exp(0.2), Exp(50) | Beta(23, 977)

Table 1: Prior distributions for the MTBD parameters.
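The rate arithmetic of Section 2.1.1 and the reproductive-number definitions of Section 2.1.2 can be checked with a few lines of code. This is only an illustrative sketch: the function names `seit_rates` and `effective_r` are hypothetical and are not part of BEAST2, MASTER or the MTBD package.

```python
import math

def seit_rates(t_e, total_time=1.0, p_death=0.1):
    """Rates (per year) for a mean exposed time t_e, with t_e + t_i fixed.

    Mirrors the text: sigma = 1 / t_e, tau + pi = 1 / t_i, and the
    removal rate is split 0.1 / 0.9 between death and treatment.
    """
    t_i = total_time - t_e
    sigma = math.inf if t_e == 0 else 1.0 / t_e  # E -> I progression
    removal = 1.0 / t_i                          # tau + pi
    return {"sigma": sigma,
            "tau": (1.0 - p_death) * removal,
            "pi": p_death * removal}

def effective_r(lambda_s, r_lambda, delta_s, delta_r, mu):
    """Return (Re,S, Re,R) for one time interval.

    Re,S = lambda_S / (delta_S + mu) and Re,R = lambda_R / delta_R,
    with lambda_R = r_lambda * lambda_S.
    """
    lambda_r = r_lambda * lambda_s
    return lambda_s / (delta_s + mu), lambda_r / delta_r
```

For example, `seit_rates(0.5)` gives σ = 2, τ = 1.8 and π = 0.2 per year, matching the tE = 0.5 configuration, and 1/σ + 1/(τ + π) stays equal to one year for every tE in the list above.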

2.2. The Kinshasa dataset

The Kinshasa M. tuberculosis dataset consists of 324 sequences sampled from re-treatment patients, most of which were identified as Lineage 4 (309). The sequences were sampled over the course of 5 years and the sampling calendar dates were recorded. The sequence alignment is 6,567 nucleotides long, not including any known resistance mutations.

Of the 309 sequences, 170 were clustered MDR-TB strains. The Lineage 4 sequences are not all part of one single transmission cluster, but form a number of smaller clusters identified previously (Meehan et al., 2018). These transmission clusters are based on a 12 single-nucleotide polymorphism (SNP) cut-off. This cut-off excluded the mutations known to cause resistance to reduce false clustering due to similar drug resistance profile (any mutations defined by Feuerriegel et al. (2015)). For the purpose of this paper we also removed non-MDR sequences, i.e. sequences that lack one or both isoniazid or rifampicin resistance, and sequences which have different MDR profiles within a cluster.

102 of the clustered strains additionally exhibited pyrazinamide resistance (Meehan et al., 2018). Pyrazinamide was used in Kinshasa as an anti-tuberculosis drug in the form of a fixed combination tablet. Pyrazinamide is an antimycobacterial pro-drug that is activated by the enzyme pyrazinamidase, which is encoded by the non-essential pncA gene in M. tuberculosis (Yadon et al., 2017). As pyrazinamide action depends on the activity of the enzyme encoded by pncA (Njire et al., 2016), multiple different single point mutations in pncA may cause resistance to pyrazinamide by disrupting the enzyme. Little convergent evolution has been detected on that gene (Miotto et al., 2014), and it appears that resistance reversal mutations are extremely unlikely (Andersson & Hughes, 2010). The dataset contains sequences that have 59 different pncA gene mutations and we assume that the relative transmission fitness for each of the different mutations is the same.

The resulting dataset consisted of 33 transmission clusters, sized 2 to 30 strains. Each cluster contained at most 6 different pncA substitutions, meaning that in each cluster we have up to 6 resistant compartments. In the resulting MTBD7 model, each resistant compartment informs the same transmission rate. Drug resistance substitutions are not used to infer phylogenies, but only to categorize strains into different resistance compartments. The division of strains into different compartments does not enforce their clustering on the tree. The same pncA substitution can evolve multiple times along the tree if the tree structure estimated from the genetic sequences favours de novo resistance rather than transmitted resistance (high µ, low Re,R). We however disallow resistance reversal.

The model setup used for analysis is shown in Figure 1c. We performed MTBD analyses assuming a constant Re,S over the whole time period (we report in the Results section that, based on our simulation study, rλ is estimated reliably without assuming time-variation for Re,S). We set the priors for the MTBD parameters as specified in Table 1 and the prior for the substitution rate was set to a Log-normal distribution with the mean of 1.5e-7 and standard deviation of 1.0 (Meehan et al., 2018). While the distribution is Log-normal, the mean is specified not in log, but in real space, such that it translates directly to estimated substitution rates. We set the sampling proportion to be equal for both strain types as all strains were sampled regardless of pyrazinamide resistance status. In particular, we set a narrow prior on the sampling proportion, centering the mean sampling proportion at 2.3%, as estimated by Meehan et al. (2018). The sampling proportion is a lot lower than in the simulations, however the difference in sampling is accounted for in the prior and should not bias the estimates of other parameters, as MTBD accounts for the sampling proportions in the likelihood (see Kühnert et al. (2016)). Then, we estimate phylogenetic trees for each of the 33 clusters; the MTBD and evolutionary parameters are shared across all clusters.

The empirical analysis was done based on several small clusters, while the simulation study was done on one large cluster (and revealed reliable results for that scenario, see the Results section). To validate our empirical analysis approach involving several clusters, we have performed analyses only using one or a few of the Kinshasa clusters. Starting with the largest cluster in the Kinshasa data, we analysed increasing numbers of clusters, sequentially including the smaller cluster sizes. We similarly removed the largest clusters, thereby reducing the size of the dataset.

The relative fitness rλ and the de novo resistance evolution rate µ should be inversely correlated, as new infections and de novo resistance acquisitions are the only routes to a drug-resistant infection. We investigated the robustness of the pyrazinamide resistance acquisition rate µ and rλ estimates by changing the prior on the mutation rate from Exp(1.0), translating to a mean rate of 1, to Exp(0.2), translating to a mean rate of 5. Such priors set the mean time until resistance acquisition to 1 and 0.2 years respectively. We additionally ran the full cluster analysis under a very restrictive prior on µ (Exp(50), translating to a very low mean rate of 0.02) to see whether this will greatly influence the results. In all of the analyses the MCMC reached an ESS for all parameters of at least 175.¹

¹ With the exception of two runs on a small number of clusters, where the prior did not mix; however, as the runs on the incomplete dataset were done to verify the approach and every other run mixed, we disregard those.

3. Results

3.1. Simulation study

We simulate M. tuberculosis phylogenetic trees under an SEIT2 model (shown in Figure 1a), performing 100 simulations for each chosen parameter combination. We then estimate the epidemiological parameters based on the simulated phylogenetic trees using the MTBD package (model shown in Figure 1b) within BEAST v2.0, resulting in a sample of the posterior distribution for each model parameter. We summarise each posterior distribution by computing the median and the 95% highest posterior density (HPD) interval. To evaluate the estimates for all 100 simulated trees together, we report the median of the set of parameter medians.

Here we report the relative transmission fitness rλ, which is defined as the ratio of transmission rates of the two strains: λR/λS. If the rλ is greater than 1, we conclude that the MDR M. tuberculosis strain is fitter than the drug-sensitive strain. Figure 2 and supplementary Figures 1 to 8 show the resulting estimates from all simulations, the simplest ignoring the E and T compartment and the most complex including a long latent period (large tE) and treatment.

First, we discuss the Figures showing the simulated epidemic on a fairly small population size, N = 1,000 (Figure 2 and supplementary Figures 1 to 4). The transmission rates and consequently the effective reproductive number Re decrease through time due to a depletion of susceptibles. As explained in the Materials and Methods section, we assume Re to be either (i) constant or (ii) allow three piecewise constant intervals for Re in the inference. First, we observe that violating the MTBD assumptions by adding the E and T compartments affects neither the estimate of the relative transmission fitness rλ nor the estimates of Re,S and Re,R. When assuming (i) that Re is constant through time, the rλ is estimated very well (in 91-99% of the simulations for each configuration the true value falls within the estimated HPD). Not only is the coverage high, accuracy is good as well. For example, when the relative fitness is relatively high (e.g. rλ ≈ 0.72), in at least 72% of the analyses the HPD interval excludes 1 when it should be excluded due to the fitness cost of resistance. As expected, the estimated Re is lower than the true R0 when a constant Re is used as it averages over the whole time period which includes times with low susceptible counts.

For (ii), the Re,S and Re,R estimates in the first interval of the epidemic correspond to the true R0 and the second and third interval show a drop in the estimated Re due to a decreasing susceptible population size. The rλ is estimated well, as in 86-98% of simulations for each configuration the true value is within the estimated HPD. Here, in the cases when the relative fitness cost is high (e.g. rλ ≈ 0.72), in at least 55% of the analyses, the corresponding HPD interval excludes one.

Given a large population size (supplementary Figures 5 to 8), we estimate all parameters reliably, both when using one and three intervals for Re estimation. Additionally, given more data the estimates of rλ become more precise, as for rλ ≈ 0.72 in at least 81% of the analyses the HPD interval excludes 1 in both single and three interval estimates.

3.2. The Kinshasa dataset

We ran the analyses on the complete dataset using the package MTBD configured as shown in Figure 1c under two different priors on the resistance acquisition rate µ, Exp(1) and Exp(0.2). In particular we assumed a constant Re,S through time as the simulations revealed that this assumption still produces reliable rλ estimates. The two analyses estimate a median relative transmission fitness of approximately 0.64: rλ = 0.6417 (median, 95% HPD: [0.5477, 0.7378]) for Exp(1) and rλ = 0.6403 (median, 95% HPD: [0.5454, 0.7394]) for Exp(0.2) (Figure 3). We translate the estimated relative fitness to a pyrazinamide resistance transmission fitness cost of approximately 36%. To test robustness of the results when the acquisition rate is much slower, we additionally used a prior of Exp(50) on µ. This analysis estimates a slightly higher relative transmission fitness with largely overlapping HPD intervals: rλ = 0.7194 (median, 95% HPD: [0.6261, 0.8063]). The parameter estimates for µ for the Exp(0.2), Exp(1) and Exp(50) priors overlap by a large margin: 0.0774 (median, 95% HPD: [0.0498, 0.1116]), 0.0752 (median, 95% HPD: [0.0466, 0.1071]) and 0.0414 (median, 95% HPD: [0.0277, 0.0572]) (see Supplementary Figure 10).

In order to test robustness of our results to dividing sequences into clusters, we perform incremental analyses, where we first analyse only the biggest cluster, then the two biggest clusters, etc. The results for the analyses under an Exp(1) prior on µ are shown in supplementary Figure 11a. Second, we perform a decremental cluster analysis, where we start with the full dataset and then remove the largest cluster from the analysed dataset, then the two largest clusters, etc. The corresponding results are shown in supplementary Figure 11b. The incremental cluster analyses show a consistent fitness cost and an effect of the dataset size on the fitness cost estimate. As more information is added, the between-host pyrazinamide resistance transmissibility increases (from rλ ≈ 0.5 to rλ ≈ 0.64), and stabilises after a certain dataset size is reached (see supplementary Figure 11a). Furthermore, the prior on µ does

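[Figure 2a plot. x-axis: SIS2; SEIT2 with tE = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8. y-axis: rλ, from 0.0 to 1.8. Percentage of runs whose 95% HPD includes 1 (red), per model: 45, 44, 29, 36, 41, 39, 36. Percentage of runs whose 95% HPD includes the true value (green), per model: 98, 93, 94, 92, 95, 98, 91.]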

(a) rλ estimates for three intervals. The green line shows the true rλ value from the simulations and the number in green indicates the percentage of runs whose 95% HPD includes the truth. The red line indicates the value 1 and the number in red indicates the percentage of runs that produce estimates whose 95% HPD includes 1. If the HPD includes the value 1, the method cannot distinguish between the two strains with different transmission rates.

[Figure 2b and 2c plots. x-axis in both panels: SIS2; SEIT2 with tE = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8. y-axis in both panels: from 0.0 to 1.8. Legend: Re,S in time intervals 1, 2 and 3 (panel b); Re,R in time intervals 1, 2 and 3 (panel c).]

(b) Re,S estimates for three intervals. Blue represents the estimate in the time interval starting at the root of the tree, purple the middle time interval, and red the most recent time interval. Note that due to a decreasing number of susceptible individuals, we only expect the blue bars to appropriately estimate the true basic reproductive number shown as a green horizontal line.

(c) Re,R estimates for three intervals. Blue represents the estimate in the time interval starting at the root of the tree, purple the middle time interval, and red the most recent time interval. Note that due to a decreasing number of susceptible individuals, we only expect the blue bars to appropriately estimate the true basic reproductive number shown as a green horizontal line.

Figure 2: rλ, Re,S and Re,R estimates plotted in relation to the different simulation models, for rλ ≈ 0.72, R0,S = 1.2 and R0,R = 0.9, 1,000 individuals in the population, 150 samples and 3 intervals for Re estimates. Each plot shows the median parameter estimates for 100 simulation runs for each configuration. The points on the vertical lines indicate the median of estimate medians per 100 runs. Similarly, the upper and lower bounds show the median values of the 95% HPD interval limits per 100 runs.
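The summary statistics reported in Figure 2 (median of medians, HPD coverage of the true value, and the share of runs whose HPD excludes 1) can be computed from posterior samples as sketched below. This is an illustrative reimplementation, not the authors' analysis script: the function names are hypothetical, and the HPD interval is approximated as the narrowest window containing the requested fraction of sorted samples, which assumes a unimodal posterior.

```python
import math
from statistics import median

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing a fraction `mass` of the samples."""
    s = sorted(samples)
    n = len(s)
    # Number of points inside the interval; the small offset guards
    # against floating-point error in mass * n.
    k = min(n, max(1, math.ceil(mass * n - 1e-9)))
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

def summarise_runs(posteriors, truth, mass=0.95):
    """Summarise one parameter across runs.

    posteriors: list of posterior sample lists, one per simulated tree.
    truth: the true parameter value used in the simulation.
    """
    intervals = [hpd_interval(p, mass) for p in posteriors]
    covered = sum(lo <= truth <= hi for lo, hi in intervals)
    excludes_one = sum(not (lo <= 1.0 <= hi) for lo, hi in intervals)
    return {
        "median_of_medians": median(median(p) for p in posteriors),
        "coverage": covered / len(posteriors),
        "excludes_one": excludes_one / len(posteriors),
    }
```

Applied to the rλ posteriors of the 100 runs of one configuration, `coverage` corresponds to the green percentages and `1 - excludes_one` to the red percentages in panel (a).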

not affect these results (supplementary Figure 12). Similarly, in the decremental cluster analyses, the cost rises as more sequences are removed from the analyses.

We further performed the incremental cluster analyses without sequence data, while still specifying the cluster sizes, sampling times, and drug resistance types, i.e. information on the composition of prevalence data. For this, all sequence data was replaced by a single unknown nucleotide character in the configuration files. The posterior estimates are shown in supplementary Figure 13. Importantly, rλ for the full dataset is estimated around 0.42. Thus, the same method using prevalence-related data only predicts a relative transmission fitness of 0.42, while adding sequences predicts a relative fitness of around 0.64.

Additionally, we investigated whether the estimated relative transmission fitness in the analyses without genomic data reflects simple data properties such as the proportion of pyrazinamide-resistant strains to all strains, or the diversity of the pyrazinamide-resistant strains. However, we find no clear trend (supplementary Figure 13 vs. supplementary Figure 14). This suggests that the fitness cost is at least in part informed by the sampling times and drug resistance statuses, even when disregarding evolutionary relationships between samples.

4. Discussion

We have shown that the BEAST2 MTBD package can be used to estimate relative transmission fitness for drug-resistant TB strains compared to drug-sensitive strains. Even though our MTBD configuration ignores latency and treatment during infection, it reliably estimates the transmission dynamics for simulations with long latent and treatment phases. This insight is very important beyond the analysis of M. tuberculosis. Many infections can be treated but may relapse. Many phylodynamic tools do not directly model exposure, treatment, or relapse for computational reasons. Here, we show that a phylodynamic tool with such a setup can still robustly estimate relevant epidemiological quantities of epidemics with treatment, latency, relapse, and other dynamics. However, as in every simulation study, only a range of parameters can be investigated. Here we picked parameter values that are realistic for MDR-TB epidemics, and researchers interested in different pathogens are highly encouraged to use our BEAST2 configuration files (in supplement) with modified parameter settings to explore the appropriateness of MTBD for their study system.

Previous works considered fitness cost in vitro for drug resistant M. tuberculosis strains. For example, Gagneux et al. (2006) and others (Mariam et al., 2004; Davies et al., 2000) have assessed fitness costs of drug resistance in M. tuberculosis using competition assays in different media. They measure fitness costs via differences in cell growth, also referred to as in vitro fitness cost. However, in vitro fitness cost is not equivalent to transmission fitness cost. Once transmission fitness cost for a range of drug resistant strains is quantified, one can assess the correlation between in vitro and observational transmission fitness costs. At present, no estimates of pyrazinamide resistance transmission fitness costs in vitro are available, so a direct comparison between different estimates is currently impossible.

Kendall et al. (2015) have used incidence data to estimate relative transmissibility of MDR-TB, showing that a predominant number (median 95.9%; 95% uncertainty range [68.0, 99.6]) of MDR-TB cases are due to transmission rather than de novo resistance acquisition. On the other hand, Burgos et al. (2003) have used M. tuberculosis drug resistance and genotype data to cluster isolates from San Francisco and estimate drug resistance fitness costs in comparison to drug sensitive strains in the form of a proxy for the ratio of reproductive numbers. Their analyses show a high overall fitness cost of drug resistance, showing that on average the estimated reproductive number ratio of any drug-resistant to drug-susceptible TB strains is 0.51 (95% confidence interval [0.37, 0.69], resistant to isoniazid, streptomycin, or both). Moreover, their dataset shows no secondary MDR-TB cases, whereas our studied dataset shows signal for transmitted MDR-TB cases. Luciani et al. (2009) estimate that the relative fitness of drug-resistant strains varies from 0.3 in Venezuela to 1.0 in Cuba and Estonia, showing that depending on the country in question the fitness costs can vary drastically. Our estimate of around 0.64 for the relative fitness of pyrazinamide resistant strains in Kinshasa falls inside the confidence intervals both in Burgos et al. (2003) and Luciani et al. (2009).

In summary, Kendall et al. (2015) have no genomic information available, while Burgos et al. (2003) only make use of genotyping data to cluster isolates by similarity, rather than by inferring evolutionary relationships. In our analyses, we observed higher estimates of transmission fitness costs when ignoring the genomic data and hence the evolutionary relationship among samples (see Supplementary Figure 12). Luciani et al. (2009) estimate fitness costs from genetic data using approximate Bayesian computation, bundling multiple different resistant strain types together, which leads to averaging of possible costs in any different resistance type. While in this study we bundle substitutions on a single gene that confer resistance to a single drug, we make sure that the MDR substitutions are identical within a cluster.

Phylogenetic and transmission analyses have previously been performed on WGS M. tuberculosis data. These analyses have mainly focused on inferring the timing of epidemics and on inferring transmission networks with direction of transmission. Works such as Didelot et al. (2014, 2017) have used phylogenetic trees inferred by BEAST to infer transmission trees. On the other hand Klinkenberg et al. (2017) have implemented simultaneous transmission and phylogenetic tree estimation, which allows for more precise estimates of transmission event times, as unobserved events are unconstrained by the previously estimated phylogenetic trees. However, this latter tool was tested on densely sampled populations, and is thus not applicable to datasets as the one used here. Moreover, none of these tools allow us to infer parameters defining the dynamics of epidemic spread, such as the transmission rates and the relative fitness of distinct strains.

[Figure 3 plot. x-axis: rλ, from 0.4 to 0.9. y-axis: Density, from 0 to 9. Legend: Exponential with rate 0.2; Exponential with rate 1.]

Figure 3: The posterior probability density of rλ for the full dataset analysis of the Kinshasa sequences under 2 different priors on the resistance acquisition rate µ.

Based on phylogenetic and transmission analyses, one can attempt to make qualitative conclusions on specific strain fitness based on the clustering of the samples on phylogenetic and transmission trees. One phylogenetic analysis aiming at assessing transmission fitness cost based on WGS data is Casali et al. (2014), where the authors used whole genome sequences and their reconstructed phylogenetic relationships to investigate the transmission fitness costs of drug resistance. They used the clustering of isolates as an indicator of transmission fitness, where closely clustering isolates with an inferred common ancestor in-

It is important to also set the substitution rates depending on the time scales on which the sequences are available. If the data is available in a similar format to what is analysed here, e.g. in 12 SNP clusters, estimates of the substitution rates for shorter time scales are more reasonable than estimates from the whole evolutionary history of M. tuberculosis. Additionally, it is important to further study the rates of resistance acquisition. While the different priors on this parameter do not make a dramatic difference, including more information on those rates will be beneficial.

In the future, we would like to estimate the relative transmission fitness of isoniazid and rifampicin resistances, the two resistances defining MDR-TB, compared to drug sensitive strains. Such analyses require datasets containing large numbers of both drug-sensitive and drug-resistant strains. Unfortunately, drug-sensitive M. 
tuberculosis strains dicate transmitted resistance, whereas single isolates indi- are often of lower clinical interest and are therefore rarely cate acquired resistance. The authors speculate that the sequenced. Indeed, mainly MDR strains were available most prevalent substitution p.Ile6Leu in the pyrazinamide for Kinshasa, thus such analyses were impossible. To our resistance gene pncA, which does not confer resistance in knowledge, no reasonably sized datasets of linked cases vitro, does confer clinical resistance with no reduced trans- containing both sensitive and resistant strain sequences are missibility. Unfortunately, this specific substitution is not available at the moment. Upon availability of such data present in the dataset from Kinshasa so it was impossible sets, our approach could be employed to compare between- to check this hypothesis. host fitness e.g. between strains with and without compen- Our approach allows us to look at specific resistances satory mutations, between HIV negative and positive pa- and estimate their transmission success in relation to other tients, and between prison and non-prison-associated TB strains circulating in the same epidemic. We quantify the cases. transmission fitness cost of pyrazinamide resistant MDR Overall, we show through simulation that we can use M. tuberculosis to be around 36% relative to pyrazinamide- the modified MTBD method to analyse pathogens where sensitive MDR M. tuberculosis strains. While it would the strains can be appropriately divided into two or more be most interesting to investigate the transmission fitness categories with distinct properties. Importantly, since we costs for each specific resistance mutation, rather than as- employ phylogenetic trees as a model for evolutionary his- suming that the cost is the same for all mutations caus- tories, these pathogens may not recombine drastically on ing pyrazinamide resistance, we need to have a significant an epidemiological scale. 
This is fulfilled for TB and many amount of sequences with identical pncA mutations to be viral pathogens where parts of the genome never recom- able to estimate their specific transmission costs. Our bine. Many other bacteria may not be appropriately anal- dataset, however, does not contain enough sequences ex- ysed with this method due to e.g. frequent plasmid ex- hibiting the same mutation in order to quantify its trans- change. In this work we show that we can estimate the mission fitness cost. Our model does not allow for co- transmission fitness effects of additional resistance in MDR- infection, which could have occurred in the Kinshasa pa- TB. We expect that our method will more generally be tients but is impossible to detect. The long culturing pe- useful to quantify the epidemic spread of drug resistances riod of the strains required before sequencing (at least 6 in a range of pathogens, and in that way may shed light on weeks) results in outcompeting of any mixed infections and optimal treatment strategies aiming to avoid the selection the sequencing of a single dominant clone. We lack infor- for highly transmissible drug resistances causing epidemic mation on possible confounding factors such as HIV co- outbreaks. infection status which could potentially influence our esti- mates. However, the possible association between pyraz- 5. Acknowledgments inamide resistance and HIV status has been previously explored and no significant association was found Budzik This work was supported by SystemsX.ch. et al. (2014) . The BDMM model explicitly accounts for the sampling in the computation of the phylodynamic likelihood. This References in turn means that difference in sampling strategies for the References different types of analysed strains will not bias the results as long as the sampling proportions are properly informed Allen, R. C., Engelstadter, J., Bonhoeffer, S., McDonald, B. A., & – i.e. fixed in the analyses or constrained by strong priors. Hall, A. 
R. (2017). Reversing resistance: different routes and common themes across pathogens. Proc Biol Sci, 284. URL: https://www.ncbi.nlm.nih.gov/pubmed/28954914. doi:10.1098/rspb.2017.1619.

Andersson, D. I., & Hughes, D. (2010). Antibiotic resistance and its cost: is it possible to reverse resistance? Nat Rev Microbiol, 8, 260–71.

Biek, R., Pybus, O. G., Lloyd-Smith, J. O., & Didelot, X. (2015). Measurably evolving pathogens in the genomic era. Trends Ecol Evol, 30, 306–13. URL: https://www.ncbi.nlm.nih.gov/pubmed/25887947. doi:10.1016/j.tree.2015.03.009.

Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C. H., Xie, D., Suchard, M. A., Rambaut, A., & Drummond, A. J. (2014). BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol, 10, e1003537.

Budzik, J. M., Jarlsberg, L. G., Higashi, J., Grinsdale, J., Hopewell, P. C., Kato-Maeda, M., & Nahid, P. (2014). Pyrazinamide resistance, Mycobacterium tuberculosis lineage and treatment outcomes in San Francisco, California. PLoS One, 9, e95645. URL: https://www.ncbi.nlm.nih.gov/pubmed/24759760. doi:10.1371/journal.pone.0095645.

Burgos, M., DeRiemer, K., Small, P. M., Hopewell, P. C., & Daley, C. L. (2003). Effect of drug resistance on the generation of secondary cases of tuberculosis. J Infect Dis, 188, 1878–84.

Casali, N., Nikolayevskyy, V., Balabanova, Y., Harris, S. R., Ignatyeva, O., Kontsevaya, I., Corander, J., Bryant, J., Parkhill, J., Nejentsev, S., Horstmann, R. D., Brown, T., & Drobniewski, F. (2014). Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nat Genet, 46, 279–86.

Casali, N., Nikolayevskyy, V., Balabanova, Y., Ignatyeva, O., Kontsevaya, I., Harris, S. R., Bentley, S. D., Parkhill, J., Nejentsev, S., Hoffner, S. E., Horstmann, R. D., Brown, T., & Drobniewski, F. (2012). Microevolution of extensively drug-resistant tuberculosis in Russia. Genome Res, 22, 735–45. URL: https://www.ncbi.nlm.nih.gov/pubmed/22294518. doi:10.1101/gr.128678.111.

Chiang, C.-Y., & Riley, L. W. (2005). Exogenous reinfection in tuberculosis. The Lancet Infectious Diseases, 5, 629–636.

Cohen, T., Dye, C., Colijn, C., Williams, B., & Murray, M. (2009). Mathematical models of the epidemiology and control of drug-resistant TB. Expert Rev Respir Med, 3, 67–79.

Colijn, C., & Cohen, T. (2016). Whole-genome sequencing of Mycobacterium tuberculosis for rapid diagnostics and beyond. The Lancet Respiratory Medicine, 4, 6–8. doi:10.1016/s2213-2600(15)00510-x.

Davies, A. P., Billington, O. J., Bannister, B. A., Weir, W. R., McHugh, T. D., & Gillespie, S. H. (2000). Comparison of fitness of two isolates of Mycobacterium tuberculosis, one of which had developed multi-drug resistance during the course of treatment. J Infect, 41, 184–7. URL: https://www.ncbi.nlm.nih.gov/pubmed/11023769. doi:10.1053/jinf.2000.0711.

Didelot, X., Fraser, C., Gardy, J., & Colijn, C. (2017). Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol, 34, 997–1007. URL: https://www.ncbi.nlm.nih.gov/pubmed/28100788. doi:10.1093/molbev/msw275.

Didelot, X., Gardy, J., & Colijn, C. (2014). Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol Biol Evol, 31, 1869–79. URL: http://www.ncbi.nlm.nih.gov/pubmed/24714079. doi:10.1093/molbev/msu121.

Dowdy, D. W., Dye, C., & Cohen, T. (2013). Data needs for evidence-based decisions: a tuberculosis modeler's 'wish list'. Int J Tuberc Lung Dis, 17, 866–77.

Feuerriegel, S., Schleusener, V., Beckert, P., Kohl, T. A., Miotto, P., Cirillo, D. M., Cabibbe, A. M., Niemann, S., & Fellenberg, K. (2015). PhyResSE: a web tool delineating Mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data. J Clin Microbiol, 53, 1908–14. URL: https://www.ncbi.nlm.nih.gov/pubmed/25854485. doi:10.1128/JCM.00025-15.

Frost, S. D., Pybus, O. G., Gog, J. R., Viboud, C., Bonhoeffer, S., & Bedford, T. (2015). Eight challenges in phylodynamic inference. Epidemics, 10, 88–92. URL: https://www.ncbi.nlm.nih.gov/pubmed/25843391. doi:10.1016/j.epidem.2014.09.001.

Gagneux, S., Long, C. D., Small, P. M., Van, T., Schoolnik, G. K., & Bohannan, B. J. (2006). The competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science, 312, 1944–6.

Gomes, M. G. M., Franco, A. O., Gomes, M. C., & Medley, G. F. (2004). The reinfection threshold promotes variability in tuberculosis epidemiology and vaccine efficacy. Proc Biol Sci, 271, 617–23.

Gomes, M. G. M., Rodrigues, P., Hilker, F. M., Mantilla-Beniers, N. B., Muehlen, M., Cristina Paulo, A., & Medley, G. F. (2007). Implications of partial immunity on the prospects for tuberculosis control by post-exposure interventions. J Theor Biol, 248, 608–17.

Grenfell, B. T., Pybus, O. G., Gog, J. R., Wood, J. L., Daly, J. M., Mumford, J. A., & Holmes, E. C. (2004). Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 303, 327–32. URL: https://www.ncbi.nlm.nih.gov/pubmed/14726583. doi:10.1126/science.1090727.

den Hertog, A. L., Sengstake, S., & Anthony, R. M. (2015). Pyrazinamide resistance in Mycobacterium tuberculosis fails to bite? Pathog Dis, 73, ftv037.

Horne, D. J., Pinto, L. M., Arentz, M., Lin, S. Y. G., Desmond, E., Flores, L. L., Steingart, K. R., & Minion, J. (2012). Diagnostic accuracy and reproducibility of WHO-endorsed phenotypic drug susceptibility testing methods for first-line and second-line antituberculosis drugs. Journal of Clinical Microbiology, 51, 393–401. doi:10.1128/jcm.02724-12.

Kendall, E. A., Fofana, M. O., & Dowdy, D. W. (2015). Burden of transmitted multidrug resistance in epidemics of tuberculosis: a transmission modelling analysis. The Lancet Respiratory Medicine, 3, 963–972. doi:10.1016/s2213-2600(15)00458-0.

Klinkenberg, D., Backer, J. A., Didelot, X., Colijn, C., & Wallinga, J. (2017). Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput Biol, 13, e1005495. URL: https://www.ncbi.nlm.nih.gov/pubmed/28545083. doi:10.1371/journal.pcbi.1005495.

Kühnert, D., Kouyos, R., Shirreff, G., Pečerska, J., Scherrer, A. U., Böni, J., Yerly, S., Klimkait, T., Aubert, V., Günthard, H. F., Stadler, T., Bonhoeffer, S., & Study, S. H. C. (2018). Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics. PLOS Pathogens.

Kühnert, D., Stadler, T., Vaughan, T. G., & Drummond, A. J. (2016). Phylodynamics with migration: A computational framework to quantify population structure from genomic data. Mol Biol Evol, 33, 2102–16.

Luciani, F., Sisson, S. A., Jiang, H., Francis, A. R., & Tanaka, M. M. (2009). The epidemiological fitness cost of drug resistance in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A, 106, 14711–5. URL: https://www.ncbi.nlm.nih.gov/pubmed/19706556. doi:10.1073/pnas.0902437106.

Ma, Y., Horsburgh, C., White, L. F., & Jenkins, H. E. (2018). Quantifying TB transmission: a systematic review of reproduction number and serial interval estimates for tuberculosis. Epidemiology and Infection, 146, 1478–1494. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6092233/. doi:10.1017/S0950268818001760.

Mariam, D. H., Mengistu, Y., Hoffner, S. E., & Andersson, D. I. (2004). Effect of rpoB mutations conferring rifampin resistance on fitness of Mycobacterium tuberculosis. Antimicrobial Agents and Chemotherapy, 48, 1289–1294. doi:10.1128/aac.48.4.1289-1294.2004.

Meehan, C. J., Moris, P., Kohl, T. A., Pečerska, J., Akter, S., Merker, M., Gehre, F., Lempens, P., Stadler, T., Kaswa, M. K., Kühnert, D., Niemann, S., & de Jong, B. C. (2018). The relationship between transmission time and clustering methods in Mycobacterium tuberculosis epidemiology. In preparation.

Merker, M., Barbier, M., Cox, H., Rasigade, J. P., Feuerriegel, S., Kohl, T. A., Diel, R., Borrell, S., Gagneux, S., Nikolayevskyy, V., Andres, S., Nubel, U., Supply, P., Wirth, T., & Niemann, S. (2018). Compensatory evolution drives multidrug-resistant tuberculosis in Central Asia. eLife, 7. URL: https://www.ncbi.nlm.nih.gov/pubmed/30373719. doi:10.7554/eLife.38200.

Miotto, P., Cabibbe, A. M., Feuerriegel, S., Casali, N., Drobniewski, F., Rodionova, Y., Bakonyte, D., Stakenas, P., Pimkina, E., Augustynowicz-Kopec, E., Degano, M., Ambrosi, A., Hoffner, S., Mansjo, M., Werngren, J., Rusch-Gerdes, S., Niemann, S., & Cirillo, D. M. (2014). Mycobacterium tuberculosis pyrazinamide resistance determinants: a multicenter study. MBio, 5, e01819–14.

Njire, M., Tan, Y., Mugweru, J., Wang, C., Guo, J., Yew, W., Tan, S., & Zhang, T. (2016). Pyrazinamide resistance in Mycobacterium tuberculosis: Review and update. Adv Med Sci, 61, 63–71. URL: https://www.ncbi.nlm.nih.gov/pubmed/26521205. doi:10.1016/j.advms.2015.09.007.

Pankhurst, L. J., del Ojo Elias, C., Votintseva, A. A., Walker, T. M., Cole, K., Davies, J., Fermont, J. M., Gascoyne-Binzi, D. M., Kohl, T. A., Kong, C., Lemaitre, N., Niemann, S., Paul, J., Rogers, T. R., Roycroft, E., Smith, E. G., Supply, P., Tang, P., Wilcox, M. H., Wordsworth, S., Wyllie, D., Xu, L., & Crook, D. W. (2016). Rapid, comprehensive, and affordable mycobacterial diagnosis with whole-genome sequencing: a prospective study. The Lancet Respiratory Medicine, 4, 49–58. doi:10.1016/s2213-2600(15)00466-x.

Pinho, S. T., Rodrigues, P., Andrade, R. F., Serra, H., Lopes, J. S., & Gomes, M. G. (2015). Impact of tuberculosis treatment length and adherence under different transmission intensities. Theor Popul Biol.

Sengstake, S., Bergval, I. L., Schuitema, A. R., de Beer, J. L., Phelan, J., de Zwaan, R., Clark, T. G., van Soolingen, D., & Anthony, R. M. (2017). Pyrazinamide resistance-conferring mutations in pncA and the transmission of multidrug resistant TB in Georgia. BMC Infect Dis, 17, 491.

Stadler, T. (2011). Inferring epidemiological parameters on the basis of allele frequencies. Genetics, 188, 663–672. URL: https://www.genetics.org/content/188/3/663. doi:10.1534/genetics.111.126466.

Vaughan, T. G., & Drummond, A. J. (2013). A stochastic simulator of birth-death master equations with application to phylodynamics. Mol Biol Evol, 30, 1480–93.

Verver, S., Warren, R. M., Beyers, N., Richardson, M., van der Spuy, G. D., Borgdorff, M. W., Enarson, D. A., Behr, M. A., & van Helden, P. D. (2005). Rate of reinfection tuberculosis after successful treatment is higher than rate of new tuberculosis. Am J Respir Crit Care Med, 171, 1430–5.

Vynnycky, E., & Fine, P. E. (1997). The natural history of tuberculosis: the implications of age-dependent risks of disease and the role of reinfection. Epidemiol Infect, 119, 183–201.

Walker, T. M., Merker, M., Knoblauch, A. M., Helbling, P., Schoch, O. D., van der Werf, M. J., Kranzer, K., Fiebig, L., Kröger, S., Haas, W., Hoffmann, H., Indra, A., Egli, A., Cirillo, D. M., Robert, J., Rogers, T. R., Groenheit, R., Mengshoel, A. T., Mathys, V., Haanperä, M., Soolingen, D. v., Niemann, S., Böttger, E. C., Keller, P. M., Avsar, K., Bauer, C., Bernasconi, E., Borroni, E., Brusin, S., Coscollá Dévis, M., Crook, D. W., Dedicoat, M., Fitzgibbon, M., Gagneux, S., Geiger, F., Guthmann, J.-P., Hendrickx, D., Hoffmann-Thiel, S., van Ingen, J., Jackson, S., Jaton, K., Lange, C., Mazza Stalder, J., O'Donnell, J., Opota, O., Peto, T. E. A., Preiswerk, B., Roycroft, E., Sato, M., Schacher, R., Schulthess, B., Smith, E. G., Soini, H., Sougakoff, W., Tagliani, E., Utpatel, C., Veziris, N., Wagner-Wiening, C., & Witschi, M. (2018). A cluster of multidrug-resistant Mycobacterium tuberculosis among patients arriving in Europe from the Horn of Africa: a molecular epidemiological study. The Lancet Infectious Diseases, 18, 431–440. doi:10.1016/s1473-3099(18)30004-5.

WHO (2016). WHO treatment guidelines for drug-resistant tuberculosis, 2016 update. Report WHO. URL: http://www.who.int/tb/areas-of-work/drug-resistant-tb/treatment/resources/en/.

WHO (2017). Global tuberculosis report 2017. Report WHO.

WHO (2018). Global tuberculosis report 2018. Report WHO.

Yadon, A. N., Maharaj, K., Adamson, J. H., Lai, Y. P., Sacchettini, J. C., Ioerger, T. R., Rubin, E. J., & Pym, A. S. (2017). A comprehensive characterization of pncA polymorphisms that confer resistance to pyrazinamide. Nat Commun, 8, 588. URL: https://www.ncbi.nlm.nih.gov/pubmed/28928454. doi:10.1038/s41467-017-00721-2.

Yakrus, M. A., Driscoll, J., Lentz, A. J., Sikes, D., Hartline, D., Metchock, B., & Starks, A. M. (2014). Concordance between molecular and phenotypic testing of Mycobacterium tuberculosis complex isolates for resistance to rifampin and isoniazid in the United States. J Clin Microbiol, 52, 1932–7. URL: https://www.ncbi.nlm.nih.gov/pubmed/24648563. doi:10.1128/JCM.00417-14.

Yew, W. W., & Leung, C. C. (2005). Are some people not safer after successful treatment of tuberculosis? Am J Respir Crit Care Med, 171, 1324–5.

APPENDIX

Quantifying transmission fitness costs of multi-drug resistant tuberculosis.

Jūlija Pečerska [a,g,*], Denise Kühnert [b], Conor J. Meehan [c], Mireia Coscollá [d], Bouke C. de Jong [c], Sebastien Gagneux [e,f], Tanja Stadler [a,g,*]

[a] Department of Biosystems Science and Engineering, ETHZ, Basel, Switzerland
[b] Tide research group, Max Planck Institute for the Science of Human History, Jena, Germany
[c] Unit of Mycobacteriology, Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
[d] Institute for Integrative Systems Biology (I2SysBio), University of Valencia-CSIC, València, Spain
[e] Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health
[f] University of Basel, Basel, Switzerland
[g] Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

1. Supplementary Material

1.1. Parameter definitions

As MASTER simulates using rates as introduced in our model description and in figure 3, we need to calculate the infection rates $\beta_S$ and $\beta_R$ from the different $R_0$ values that we define. As stated in Table 1, $\beta_S$ is the ratio of the basic reproductive number $R_{0,S}$ and the product of (i) the population size $N$ and (ii) the expected total time that an infected individual would be infectious, $t_{\mathrm{infectious},S}$. $\beta_R$ is defined in the same way.

Table 1: Summary of important quantities in the SEIT2 model for $x \in \{S, R\}$.

Quantity | Expression | Description
---------|------------|------------
$p_\pi$ | $\frac{\pi}{\pi+\tau}$ | probability of death when in class $I_x$
$p_\tau$ | $\frac{\tau}{\pi+\tau}$ | probability of treatment when in class $I_x$
$p_{\gamma_x}$ | $\frac{\gamma_x}{\gamma_x+\kappa_x+\mu_x}$ | probability of recovery when in class $T_x$
$p_{\kappa_x}$ | $\frac{\kappa_x}{\gamma_x+\kappa_x+\mu_x}$ | probability of relapse when in class $T_x$
$p_{\mu_S}$ | $\frac{\mu_S}{\gamma_S+\kappa_S+\mu_S}$ | probability of acquiring resistance when in class $T_S$ (0 for class $T_R$)
$t_{E_x}$ | $\frac{1}{\sigma}$ | expected time spent in class $E_x$
$t_{I_x}$ | $\frac{1}{\pi+\tau}$ | expected time spent in class $I_x$
$t_{T_x}$ | $\frac{1}{\gamma_x+\kappa_x+\mu_x}$ | expected time spent in class $T_x$
$\beta_x$ | $\frac{R_{0,x}}{N\,t_{\mathrm{infectious},x}}$ | infection rate

Additionally, we need to define the true $r_\lambda$ for the simulated data. This value is then compared to the $r_\lambda$ inferred by BDMM. The $r_\lambda$ is defined as $\frac{R_{0,R}\,\delta_S}{R_{0,S}\,\delta_R}$, where $R_{0,S}$ and $R_{0,R}$ are defined in the simulation setup, and $\delta_S$ and $\delta_R$ are defined as the inverse of the expected time of being infected with strain $x$ ($x \in \{S, R\}$), i.e. $\delta_S = \frac{1}{t_{\mathrm{infected},S}}$ and $\delta_R = \frac{1}{t_{\mathrm{infected},R}}$.

To calculate the expected time of being infectious and infected, $t_{\mathrm{infectious}}$ and $t_{\mathrm{infected}}$, for both the sensitive and resistant strains we first define the necessary values, see Table 1. We need to consider the SEIT2 model as shown in figure 3. The expected amount of time spent in each compartment is the expectation of the exponential distribution with the rate of leaving the compartment, which is the sum of all rates that lead to leaving a compartment. Furthermore, the probability of each move between compartments is proportional to the rate of that move. Thus, we can express the expected time in each compartment and the probabilities of moves between compartments and use these quantities to calculate the expected time of infection and infectiousness. For each strain type, the expected time of a patient being infected, $t_{\mathrm{infected},S}$, is the time spent in either of the compartments E, I, or T, starting at the time of infection to the time when an individual returns to class S. The expected time of infectiousness of an individual is the sum of time spent in the I compartment prior to eventually moving back to the S compartment, dying, or evolving resistance. If we trace an individual's progress through the compartments we can see that there is a possibility for looping in the $I \to T \to I \to T \to \dots$ sequence of compartments, which we have to account for.

Thus, we can express the expected times as follows, where $i+1$ is the number of times an individual enters the $I_S$ or $I_R$ class (in simplifying the equations, we use the property of geometric series, $\sum_{i=0}^{\infty} x^i = (1-x)^{-1}$):

*Corresponding author. Email addresses: [email protected] (Jūlija Pečerska), [email protected] (Tanja Stadler)

Preprint submitted to Elsevier, December 20, 2019

\[
\begin{aligned}
t_{\mathrm{infectious},S} &= \sum_{i=0}^{\infty} (p_{\kappa_S} p_\tau)^i \, t_{I_S}
 = \frac{\gamma_S + \kappa_S + \mu_S}{(\gamma_S + \kappa_S + \mu_S)\pi + (\gamma_S + \mu_S)\tau}, \\
t_{\mathrm{infectious},R} &= \sum_{i=0}^{\infty} (p_{\kappa_R} p_\tau)^i \, t_{I_R}
 = \frac{\gamma_R + \kappa_R}{\kappa_R \pi + \gamma_R(\pi + \tau)}, \\
t_{\mathrm{infected},S} &= t_{E_S} + \sum_{i=0}^{\infty} (p_{\kappa_S} p_\tau)^i \left(t_{I_S} + p_\tau t_{T_S}\right)
 = \frac{1}{\sigma} + \frac{\gamma_S + \kappa_S + \mu_S + \tau}{(\gamma_S + \kappa_S + \mu_S)\pi + (\gamma_S + \mu_S)\tau}, \\
t_{\mathrm{infected},R} &= t_{E_R} + \sum_{i=0}^{\infty} (p_{\kappa_R} p_\tau)^i \left(t_{I_R} + p_\tau t_{T_R}\right)
 = \frac{1}{\sigma} + \frac{\gamma_R + \kappa_R + \tau}{\kappa_R \pi + \gamma_R(\pi + \tau)}.
\end{aligned}
\]

Finally, we need to adjust the simulated sampling probability parameters for comparison with the parameters inferred by BDMM. The reason is that in SEITx, an individual has the possibility to get sampled at multiple points in time, at every successive move from the infectious to treated compartment (e.g. after an infection relapse). In BDMM, we can only sample each individual once. Thus we compare $p_{\mathrm{real}}$ to the inferred BDMM sampling probability, $p_x$:

\[
p_{\mathrm{real}} = \sum_{i=0}^{\infty} \left(p_\tau p_\kappa (1 - p)\right)^i p_\tau \, p
 = \frac{(\gamma + \kappa)\, p\, \tau}{\gamma(\pi + \tau) + \kappa(\pi + p\tau)},
\]

where $p_\tau$, $p_\kappa$ are defined in Table 1.

1.2. BDMM vs SEITm

An analysis of our simulated trees with an SEITm-like model would require solving $O(N^{3m})$ differential equations for each likelihood evaluation (Leventhal et al., 2014), where $N$ is the number of susceptible individuals in the population. We obtain this runtime as we have to keep track of all possible numbers of individuals (out of $N$) in the $3m+1$ different compartments, while the total number of individuals is constant ($N$). On the other hand, analysis with BDMM requires solving an order of $m$ ODEs. BDMM is much faster as it is a simplification of SEITm in the following ways:

1. No exposed phase. BDMM does not distinguish between the exposed ($E_{\mathrm{SEIT}}$) and infectious ($I_{\mathrm{SEIT}}$) states in SEIT. BDMM estimates infectious time $t_{I_{\mathrm{BDMM}}} = \frac{1}{\delta}$ as the total time in ANY infected compartment in SEIT ($t_{E_{\mathrm{SEIT}}} + t_{I_{\mathrm{SEIT}}} + t_{T_{\mathrm{SEIT}}}$).
2. No treatment phase. In BDMM, patients leave the system when they get treated. In SEIT, the patient might (with some small probability) relapse and become infectious again after some time in treatment.
3. No $T \to S$ transfer. BDMM does not explicitly account for patients going back to the susceptible pool after recovery. However, since it allows for variable birth rate $\lambda_k$, changes in the susceptible population size will be accounted for.

1.3. Supplementary figures

References

Leventhal, G. E., Günthard, H. F., Bonhoeffer, S., & Stadler, T. (2014). Using an epidemiological model for phylogenetic inference reveals density dependence in HIV transmission. Mol Biol Evol, 31, 6–17.
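As a numerical cross-check of the closed-form expressions in Section 1.1, the following minimal Python sketch compares truncated geometric series with the corresponding closed forms for the expected infectious time, the expected infected time, and the adjusted sampling probability. The function names, the truncation length `n_terms`, and all parameter values are illustrative assumptions and are not taken from the simulation setup. The per-loop term for the infected time is taken as $t_I + p_\tau t_T$, which is what the closed forms imply, and the sampling-probability check assumes no resistance acquisition ($\mu = 0$), as in the closed form given in the text.

```python
def expected_times(sigma, pi, tau, gamma, kappa, mu=0.0, n_terms=200):
    """Return ((t_infectious, t_infected) from a truncated series,
               (t_infectious, t_infected) from the closed forms)."""
    leave_T = gamma + kappa + mu          # total rate of leaving class T
    p_tau = tau / (pi + tau)              # probability of treatment from I
    p_kappa = kappa / leave_T             # probability of relapse from T
    t_E = 1.0 / sigma                     # expected time in class E
    t_I = 1.0 / (pi + tau)                # expected time in class I
    t_T = 1.0 / leave_T                   # expected time in class T

    loop = p_kappa * p_tau                # probability of one I -> T -> I loop
    geom = sum(loop ** i for i in range(n_terms))
    series = (geom * t_I, t_E + geom * (t_I + p_tau * t_T))

    denom = leave_T * pi + (gamma + mu) * tau
    closed = (leave_T / denom, t_E + (leave_T + tau) / denom)
    return series, closed


def p_real(p, pi, tau, gamma, kappa, n_terms=200):
    """Adjusted sampling probability: (truncated series, closed form), mu = 0."""
    p_tau = tau / (pi + tau)
    p_kappa = kappa / (gamma + kappa)
    loop = p_tau * p_kappa * (1.0 - p)    # treated, not sampled, then relapsed
    series = sum(loop ** i for i in range(n_terms)) * p_tau * p
    closed = (gamma + kappa) * p * tau / (gamma * (pi + tau) + kappa * (pi + p * tau))
    return series, closed


def infection_rate(r0, n, t_infectious):
    """beta_x = R_{0,x} / (N * t_infectious,x), following Table 1."""
    return r0 / (n * t_infectious)


if __name__ == "__main__":
    # Illustrative rates only (assumed, not from the paper).
    series, closed = expected_times(sigma=2.0, pi=0.1, tau=1.0,
                                    gamma=0.8, kappa=0.2, mu=0.05)
    print("t_infectious:", series[0], closed[0])
    print("t_infected:  ", series[1], closed[1])
    print("p_real:      ", p_real(0.5, 0.1, 1.0, 0.8, 0.2))
```

Setting `mu=0.0` in `expected_times` gives the resistant-strain expressions, since the closed-form denominator then reduces to $\kappa\pi + \gamma(\pi + \tau)$.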

[Figure 1 panels: (a) $r_\lambda$ estimates for three intervals; (b) $r_\lambda$ estimates for one interval; (c) $R_{e,S}$ estimates for three intervals; (d) $R_{e,S}$ estimates for one interval; (e) $R_{e,R}$ estimates for three intervals; (f) $R_{e,R}$ estimates for one interval. Horizontal axis in each panel: simulation model (SIS2; SEIT2 with $t_E$ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8).]

Figure 1: $r_\lambda$, $R_{e,S}$ and $R_{e,R}$ estimates plotted in relation to the different simulation models, for $r_\lambda \approx 0.72$, $R_{0,S} = 1.2$ and $R_{0,R} = 0.9$, $N = 1{,}000$, 150 samples, and 1 and 3 intervals for $R_e$.

[Figure 2 panels: (a) $r_\lambda$ estimates for three intervals; (b) $r_\lambda$ estimates for one interval; (c) $R_{e,S}$ estimates for three intervals; (d) $R_{e,S}$ estimates for one interval; (e) $R_{e,R}$ estimates for three intervals; (f) $R_{e,R}$ estimates for one interval. Horizontal axis in each panel: simulation model (SIS2; SEIT2 with $t_E$ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8).]

Figure 2: $r_\lambda$, $R_{e,S}$ and $R_{e,R}$ estimates plotted in relation to the different simulation models, for $r_\lambda \approx 0.96$, $R_{0,S} = 1.1$ and $R_{0,R} = 1.1$, $N = 1{,}000$, 150 samples, and 1 and 3 intervals for $R_e$.

[Figure 3 panels: (a) $r_\lambda$ estimates for three intervals; (b) $r_\lambda$ estimates for one interval; (c) $R_{e,S}$ estimates for three intervals; (d) $R_{e,S}$ estimates for one interval; (e) $R_{e,R}$ estimates for three intervals; (f) $R_{e,R}$ estimates for one interval. Horizontal axis in each panel: simulation model (SIS2; SEIT2 with $t_E$ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8).]

Figure 3: $r_\lambda$, $R_{e,S}$ and $R_{e,R}$ estimates plotted in relation to the different simulation models, for $r_\lambda \approx 0.88$, $R_{0,S} = 1.2$ and $R_{0,R} = 1.1$, $N = 1{,}000$, 150 samples, and 1 and 3 intervals for $R_e$.

[Figure 4 panels: (a) $r_\lambda$ estimates for three intervals; (b) $r_\lambda$ estimates for one interval; (c) $R_{e,S}$ estimates for three intervals; (d) $R_{e,S}$ estimates for one interval; (e) $R_{e,R}$ estimates for three intervals; (f) $R_{e,R}$ estimates for one interval. Horizontal axis in each panel: simulation model (SIS2; SEIT2 with $t_E$ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9).]

Figure 4: $r_\lambda$, $R_{e,S}$ and $R_{e,R}$ estimates plotted in relation to the different simulation models, for $r_\lambda \approx 0.81$, $R_{0,S} = 1.3$ and $R_{0,R} = 1.1$, $N = 1{,}000$, 150 samples, and 1 and 3 intervals for $R_e$.

[x-axis of each panel: simulation model (SIS2; SEIT2 with tE = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9). Legends in the three-interval Re panels distinguish intervals 1, 2 and 3.]

(a) rλ estimates for three intervals. (b) rλ estimates for one interval.

(c) Re,S estimates for three intervals. (d) Re,S estimates for one interval.

(e) Re,R estimates for three intervals. (f) Re,R estimates for one interval.

Figure 5: rλ, Re,S and Re,R estimates plotted in relation to the different simulation models, for rλ ≈ 0.72, R0,S = 1.2 and R0,R = 0.9, N = 200,000, 300 samples, and 1 and 3 intervals for Re.

[x-axis of each panel: simulation model (SIS2; SEIT2 with tE = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8). Legends in the three-interval Re panels distinguish intervals 1, 2 and 3.]

(a) rλ estimates for three intervals. (b) rλ estimates for one interval.

(c) Re,S estimates for three intervals. (d) Re,S estimates for one interval.

(e) Re,R estimates for three intervals. (f) Re,R estimates for one interval.

Figure 6: rλ, Re,S and Re,R estimates plotted in relation to the different simulation models, for rλ ≈ 0.96, R0,S = 1.1 and R0,R = 1.1, N = 200,000, 300 samples, and 1 and 3 intervals for Re.

[x-axis of each panel: simulation model (SIS2; SEIT2 with tE = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9). Legends in the three-interval Re panels distinguish intervals 1, 2 and 3.]

(a) rλ estimates for three intervals. (b) rλ estimates for one interval.

(c) Re,S estimates for three intervals. (d) Re,S estimates for one interval.

(e) Re,R estimates for three intervals. (f) Re,R estimates for one interval.

Figure 7: rλ, Re,S and Re,R estimates plotted in relation to the different simulation models, for rλ ≈ 0.88, R0,S = 1.2 and R0,R = 1.1, N = 200,000, 300 samples, and 1 and 3 intervals for Re.

[x-axis of each panel: simulation model (SIS2; SEIT2 with tE = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9). Legends in the three-interval Re panels distinguish intervals 1, 2 and 3.]

(a) rλ estimates for three intervals. (b) rλ estimates for one interval.

(c) Re,S estimates for three intervals. (d) Re,S estimates for one interval.

(e) Re,R estimates for three intervals. (f) Re,R estimates for one interval.

Figure 8: rλ, Re,S and Re,R estimates plotted in relation to the different simulation models, for rλ ≈ 0.81, R0,S = 1.3 and R0,R = 1.1, N = 200,000, 300 samples, and 1 and 3 intervals for Re.

[Plot: number of infected individuals (y-axis, 0 to 175) against time (x-axis, 0 to 25), one trajectory per seed: 1531234861, 1531234862, 1531234863, 1531234864, 1531234865.]

Figure 9: Examples of 5 different simulation instances under the SEIT2 model using 5 different seeds for BEAST2 initialisation, where R0,S = 1.2, R0,R = 1.1 and tE = 0.8. After an initial exponential growth phase the saturation is reached and the overall level of infection in the population is stabilised.

[Plot: posterior density (y-axis, 0 to 9) of rλ (x-axis, 0.4 to 0.9), one curve per prior on µ: Exponential with rate 1, Exponential with rate 0.2, Exponential with rate 50.]

Figure 10: The posterior probability density of rλ for the full dataset analysis of the Kinshasa sequences under 3 different priors on the resistance acquisition rate µ.

[Violin plots, y-axis 0.0 to 1.0. x-axis, cluster sizes included: 30 seq., 15-30 seq., 14-30 seq., 10-30 seq., 7-30 seq., 6-30 seq., 4-30 seq., 3-30 seq., 2-30 seq.]

(a) Incremental cluster analysis, sequentially including smaller clusters.

[Violin plots, y-axis 0.0 to 1.0. x-axis, cluster sizes included: 2-6 seq., 2-7 seq., 2-7 seq., 2-10 seq., 2-14 seq., 2-14 seq., 2-15 seq., 2-30 seq.]

(b) Decremental cluster analysis, sequentially excluding largest clusters.

Figure 11: rλ estimates based on the Kinshasa data for incremental and decremental cluster analyses. The rightmost violin plot in both subfigures is the result of the analysis using all data. Both types of analyses were performed using an Exp(1) prior on µ.

[Violin plots of rλ, µ and rλ + µ, y-axis 0.0 to 1.4. x-axis, cluster sizes included: 30 seq., 15-30 seq., 14-30 seq., 10-30 seq., 7-30 seq., 6-30 seq., 4-30 seq., 3-30 seq., 2-30 seq.]

(a) Incremental cluster analyses with the prior Exp(1) on µ.

(b) Incremental cluster analyses with the prior Exp(0.2) on µ.

Figure 12: Incremental cluster analysis sequentially including smaller clusters with different priors on the rate of pncA evolution, µ, based on the Kinshasa data. We show the rλ, µ, and rλ + µ estimates, where the latter is included to show possible correlation between the parameters.

[Violin plots, y-axis 0.0 to 1.0. x-axis, cluster sizes included: 30 seq., 15-30 seq., 14-30 seq., 10-30 seq., 7-30 seq., 6-30 seq., 4-30 seq., 3-30 seq., 2-30 seq.]

Figure 13: Incremental cluster analysis for the Kinshasa data, sequentially including smaller clusters. Here we do not use sequence data, i.e. we only use prevalence-related data. The rightmost violin plot is the result of the analysis using all clusters.

[Plots over the incremental subsets: 30 seq., 15-30 seq., 14-30 seq., 10-30 seq., 7-30 seq., 6-30 seq., 4-30 seq., 3-30 seq., 2-30 seq. Panel (a) y-axis 0.0 to 4.0, series NR/Nall and NR/NS; panel (b) y-axis 0.0 to 1.0.]

(a) Proportion of pyrazinamide-resistant strains to all strains in the given subset of data (NR/Nall), and the ratio of pyrazinamide-resistant strains to pyrazinamide-sensitive MDR strains (NR/NS).

(b) The Simpson's index of diversity of the different pyrazinamide-resistant strains, 1 − DR. The closer the index is to 1, the higher the diversity of pyrazinamide-resistant strains.

Figure 14: Properties of the datasets in the incremental cluster analysis.
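The diversity index in panel (b) can be illustrated with a short Python sketch. The text does not state which form of Simpson's index was used, so this sketch assumes the finite-sample form D = Σ n_i (n_i − 1) / (N (N − 1)), where n_i is the number of resistant strains carrying substitution i and N is the total number of resistant strains. The function name `simpson_diversity` and the example labels are hypothetical.

```python
from collections import Counter


def simpson_diversity(labels):
    """Return 1 - D for a list of substitution labels.

    Assumption: D is the finite-sample Simpson index,
    D = sum(n_i * (n_i - 1)) / (N * (N - 1)).
    """
    counts = list(Counter(labels).values())
    n_total = sum(counts)
    if n_total < 2:
        # The index is undefined for fewer than two strains;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    d = sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))
    return 1.0 - d


# Hypothetical subset: ten resistant strains, two substitutions.
example = ["p.Leu116Arg"] * 5 + ["p.Asp12Ala"] * 5
print(round(simpson_diversity(example), 4))  # 0.5556
```

A subset in which every resistant strain carries the same substitution gives 0, and a subset in which every strain carries a different substitution gives 1, which matches the reading of the index given in the caption.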


4 PYRAZINAMIDE RESISTANCE FITNESS COSTS IN MDR-TB IN GEORGIA

This chapter includes analyses of lineage 2 and 4 sequences of MDR-TB from Georgia, as well as a comparison of the results between two very different locations – Georgia and Kinshasa. Previous hypotheses proposed a significant transmission fitness cost for pyrazinamide resistance in TB (Hertog, Sengstake, and Anthony, 2015), which was confirmed for the dataset in Kinshasa, but the analyses of the Georgian dataset show a contradictory picture. While the lineage 4 sequences from Kinshasa showed about a 30% transmission fitness cost of pyrazinamide resistance, sequences from Georgia do not show any significant loss in fitness. An analysis of the 72 available lineage 4 sequences does not allow us to exclude the possibility of no fitness difference, while the median estimate for transmission fitness is about 30% higher for the pyrazinamide-resistant strains. Similarly, the lineage 2 analysis does not allow us to exclude the possibility of no transmission fitness cost of resistance. The difference in results may be attributed to many different causes, but our main hypothesis is that it is due to the baseline lineage differences for the different MDR strains as a consequence of the length of the preceding epidemic and treatment availability in the location in question. The precise causes of this disparity certainly require further investigation. A major conclusion of this work is that we should not assume that the properties of drug-resistant strains of Mycobacterium tuberculosis from the same lineage on the scale of an epidemic will be the same regardless of the space and time they were sampled in. Depending on factors that we are not yet able to pinpoint, the differences in fitness may be dramatic, and thus each available dataset has to be investigated independently to estimate the influence of drug resistance on transmission fitness.
This is a manuscript in preparation titled “Pyrazinamide resistance relative transmission fitness comparison of multi-drug resistant tuberculosis in different settings”, where I am the first author.

Pyrazinamide resistance relative transmission fitness comparison of multi-drug resistant tuberculosis in different settings.

Jūlija Pečerska,∗,1,4 Sebastian Gygli,2,3,4 Sebastien Gagneux,2,3,4 Tanja Stadler1,4

1 Department of Biosystems Science and Engineering, ETHZ, Basel, Switzerland
2 Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, Basel, Switzerland
3 University of Basel, Basel, Switzerland
4 Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

∗ Corresponding author: E-mail: [email protected], [email protected].
Associate Editor:

Abstract

Key words: antibiotic resistance, multi-type birth-death model, phylodynamics, whole genome M. tuberculosis

Introduction

Tuberculosis (TB) is a disease caused by Mycobacterium tuberculosis, which is currently one of the top ten causes of death and the leading cause of death from a single infectious agent, according to WHO. TB affects all nations, but the severity of epidemics varies greatly from country to country. Per WHO, high-income countries had under 10 new cases per 100,000 population, whereas countries with the highest TB burden had from 150 up to more than 500 new cases per 100,000 in 2017 (WHO, 2018). The developing country of Georgia had approximately 85 new cases per 100,000 population in 2017, which makes it one of the high incidence countries (Sengstake et al., 2017). TB in Georgia emerged as a public health threat following the breakup of the Soviet Union and the civil war of 1991, which disrupted the work of many health services. The emergence of a high number of drug-resistant strains of TB in the country may be partially a consequence of this disruption, as any interruption in drug supply can increase the probability of drug resistance. In 2017 in Georgia, a measured 11% of new TB cases and 30% of previously treated cases were either rifampicin-resistant (RR-TB) or multi-drug resistant (MDR-TB), i.e. resistant to at least isoniazid and rifampicin. Moreover, in 2015, of the 62 extensively drug-resistant cases (XDR-TB), resistant to isoniazid and rifampicin plus any fluoroquinolone and at least one of three injectable second-line drugs, only 56% were a treatment success (WHO, 2018).

Pyrazinamide is an important first-line drug that is used in the standard short course treatment of drug-sensitive TB (Hopewell et al., 2009). It is also retained in the treatment of MDR-TB, used optionally when drugs from the less toxic

and more efficient categories are not available or recommended (WHO, 2019). It is commonly used in Georgia in both first-line and second-line treatments; moreover, pyrazinamide is the only first-line drug that will be maintained in all regimens in the near future (Diacon et al., 2012). Even with such common use, the prevalence and incidence of pyrazinamide resistance in the setting remain largely unknown (Sengstake et al., 2017). Resistance to pyrazinamide was previously thought to have a significant effect on the transmission fitness of the bacterium, making it much less likely to cause secondary infections (den Hertog et al., 2015).

In this paper we present an analysis of the MDR-TB dataset from Georgia, which additionally contains strains resistant to pyrazinamide. We estimate the relative fitness of pyrazinamide-resistant strains from lineages 2 and 4, and compare our results with a previously analysed lineage 4 dataset with a similar structure from Kinshasa, the capital of the Democratic Republic of the Congo. The analyses show a congruence in the behaviour of the method between the two datasets when analysing transmission clusters rather than making the assumption that the sequences form a single transmission chain, which is unlikely. The analyses also show a significant difference in relative transmission fitness of pyrazinamide-resistant strains between different lineages as well as between strains from the same lineage but sampled from a different location.

Materials and Methods

Data

Pyrazinamide is a pro-drug that is activated by the enzyme pyrazinamidase, and multiple different substitutions in the pncA gene can disrupt the work of the enzyme, which is why we use pncA substitutions as a proxy for pyrazinamide resistance. Little to no convergent evolution was detected on the pncA gene (Miotto et al., 2014), and resistance reversion is extremely unlikely for other drugs in Mycobacterium tuberculosis (Andersson and Hughes, 2010). [How to justify this better?]

The original dataset was published in (?) (missing reference). The dataset consists of 1,492 sequences from Georgia, of which 1,267 are lineage 2 sequences and 225 lineage 4 sequences. Of the 1,267 lineage 2 sequences, 382 were MDR sequences that we could use to infer the transmission fitness costs of pyrazinamide resistance, grouped in clusters of sizes from 3 to 93 sequences (30.15% of all lineage 2 sequences). The 382 clustered sequences contained 78 MDR pyrazinamide-sensitive samples and 304 MDR pyrazinamide-resistant samples. Similarly, of the 225 lineage 4 sequences, 72 were MDR sequences that we could use to infer the transmission fitness costs of pyrazinamide resistance, grouped in clusters of sizes from 3 to 10 sequences (32% of all lineage 4 sequences). The 72 clustered sequences contained 40 MDR pyrazinamide-sensitive samples and

32 MDR pyrazinamide-resistant samples. The clusters were defined using the 12 SNP cutoff, excluding known resistance substitutions. [get details from Sebastian.]

The lineage 2 dataset contains clustered sequences with 61 different pncA substitutions, at most 6 different ones within a cluster, while the lineage 4 dataset has 6 different pncA substitutions, with only a single substitution per cluster. Two of the 6 substitutions in the Georgian lineage 4 dataset are also present in the lineage 2 dataset. We assume that the transmission fitness cost of each of the different pncA substitutions is the same; however, since we assume that each substitution is an independent de novo acquisition, we need to separate them into different compartments if different substitutions are present in the same cluster.

Analyses

The analyses were set up in a fashion similar to (Pečerska et al., 2019). The clustered sequences were used to inform a combined relative fitness for any of the pncA mutants. The data was divided into clusters where each cluster was used to infer its own phylogenetic tree, while all the evolutionary and epidemiological parameters were shared across different clusters. Sequences with the same pncA substitution were marked with the same class within a cluster, while different pncA substitutions within a cluster were necessarily marked with different classes. This is done to prevent situations when e.g. a sequence with a p.Leu116Arg substitution in pncA is inferred to infect a person whose isolate shows a p.Asp12Ala substitution in pncA. Each of these substitutions informs the same relative fitness, but such an infection would mean double resistance evolution and resistance reversal. In other words, a strain with substitution A is not allowed to infect a person whose isolate carries substitution B, since that would mean that the strain with substitution A went back to being wild type and then evolved a different substitution in the same gene.

If known pyrazinamide resistance substitutions were included in the alignments, such situations would be unlikely to arise in inference. Thus, removing known resistance substitutions and labelling the different substitutions within one cluster with different numbers allows us to avoid biasing parameter estimates (e.g. inflating the rate of evolving resistance). The graphical representation of the model is shown in Figure 1a.

We use incremental analyses to investigate the correlation between dataset size, cluster size, cluster number and the accuracy of the estimates. In the incremental analyses we first analyse only the largest cluster, in each subsequent step include the clusters of the next largest size, and finally analyse the full dataset. This is done to compare to the previous analyses in Kinshasa, and to ensure the stability of the method when data is scarce.
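The class labelling and the incremental inclusion scheme described in the Analyses section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the configuration used for the analyses: the function names `assign_classes` and `incremental_subsets`, the tuple-based input format and the cluster identifiers are hypothetical, and substitutions are numbered per cluster in order of first appearance.

```python
from collections import defaultdict


def assign_classes(samples):
    """Label each sample with a class that is local to its cluster.

    samples: list of (cluster_id, substitution) pairs, where
    substitution is None for a pyrazinamide-sensitive sample.
    Returns 'S' for sensitive samples and 'R1', 'R2', ... for
    resistant ones, one number per distinct substitution in a cluster.
    """
    next_index = defaultdict(int)  # per-cluster counter of substitutions
    seen = {}                      # (cluster, substitution) -> label
    labels = []
    for cluster, substitution in samples:
        if substitution is None:
            labels.append("S")
            continue
        key = (cluster, substitution)
        if key not in seen:
            next_index[cluster] += 1
            seen[key] = f"R{next_index[cluster]}"
        labels.append(seen[key])
    return labels


def incremental_subsets(cluster_sizes):
    """Return nested lists of cluster ids, from the largest size only
    up to the full dataset, adding one cluster-size group per step."""
    sizes_desc = sorted(set(cluster_sizes.values()), reverse=True)
    included, subsets = [], []
    for size in sizes_desc:
        included += sorted(c for c, s in cluster_sizes.items() if s == size)
        subsets.append(list(included))
    return subsets


# Hypothetical input: two clusters, two substitutions.
samples = [
    ("c1", None),
    ("c1", "p.Leu116Arg"),
    ("c1", "p.Asp12Ala"),
    ("c1", "p.Leu116Arg"),
    ("c2", "p.Asp12Ala"),
]
print(assign_classes(samples))  # ['S', 'R1', 'R2', 'R1', 'R1']
print(incremental_subsets({"a": 93, "b": 10, "c": 10, "d": 3}))
# [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```

Numbering the classes per cluster keeps the number of resistant compartments bounded by the largest number of distinct substitutions seen in any one cluster, which is the reason a model with at most six resistant compartments suffices for the lineage 2 data.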

[Model diagram: compartments IS, IR1, IR2, ..., IR6.]

(a) The BDMM7 model setup for the analysis of clusters with at most 6 different types of pyrazinamide resistance substitutions per cluster. The compartments are marked as follows: Susceptible - S, Exposed - Ex, Infected - Ix, under Treatment - Tx, where x is either drug-sensitive or drug-resistant, S or R respectively. Sampling probability is marked by px, where x is S or R. Each of the IRn compartments, where n is the compartment number, represents a distinctive resistance substitution within a cluster. Rates are marked as follows (x ∈ {S,R}): λ^k_x - time-dependent birth rate per interval k, δx - recovery rate, µ - resistance evolution rate. Resistance cannot be lost, only acquired, and all resistant strains have the same fitness cost.

We additionally perform the same incremental analyses without sequence data, which is done by replacing the true sequences with a single unknown nucleotide in the configuration of the analyses. This allows us to investigate the amount of information that can be extracted from only the sampling dates, resistance statuses and clustering, completely ignoring any evolutionary relationships that could be inferred from the sequence data.

Unfortunately, lineage 4 sequences only formed 16 clusters of at most 10 sequences, which renders incremental cluster analyses uninformative. [Will we run Joëlle's package?]

Results

[Joëlle's package results]

Lineage and country comparison

Overall, we see a much reduced fitness cost of pyrazinamide resistance in the Georgian dataset when compared to the previous results from the Kinshasa dataset, as shown in Figure 2. The Kinshasa dataset shows a distinct loss in fitness associated with pyrazinamide resistance (median relative fitness 0.6417, 95% HPD [0.5472,0.7374]), while both lineages 2 and 4 from Georgia show no difference or even improved fitness. Lineage 2 from Georgia shows a negligible loss in fitness, median relative fitness 0.9349, 95% HPD [0.7904,1.0974], notably including 1, which means that we cannot rule out the possibility of equal fitness between the pyrazinamide-resistant and pyrazinamide-sensitive strains. On the other hand, lineage 4 from Georgia shows a signal for improved fitness of the pyrazinamide-resistant mutants, median 1.301, 95% HPD interval [0.8832,1.8284]. The 95% HPD is notably wide, however, which is most likely due to the low amount of data and therefore decreased precision of the method. [update GL4]

[Density plot: posterior density (y-axis, 0 to 9) of rλ (x-axis, 0.0 to 3.5) for Georgia L2, Georgia L4 and Kinshasa L4.]

(a) The posterior probability density of rλ for the full dataset analyses of the Georgia lineage 2 (blue) and lineage 4 (green), and Kinshasa lineage 4 (red) sequences. Median value for Georgia L2 is rλ = 0.9349, 95% HPD [0.7904,1.0974], median value for Georgia L4 is rλ = 1.301, 95% HPD [0.8831,1.8284], median value for Kinshasa L4 is rλ = 0.6417, 95% HPD [0.5477,0.7378].

[Density plot: posterior density (y-axis, 0.0 to 2.5) of RE (x-axis, 0 to 6) for Georgia L2, Georgia L4 and Kinshasa L4.]

(b) The posterior probability density of RE for the full dataset analyses of the Georgia lineage 2 (blue) and lineage 4 (green), and Kinshasa lineage 4 (red) sequences. Median value for Georgia L2 is RE = 1.9942, 95% HPD [1.5617,2.5177], median value for Georgia L4 is RE = 1.7514, 95% HPD [1.1175,2.769], median value for Kinshasa L4 is RE = 1.8487, 95% HPD [1.5276,2.2899].

FIG. 2. Comparison of rλ and RE estimates between the datasets from Kinshasa and Georgia.

Incremental analyses

Incremental multi-cluster analyses show a very similar pattern to that visible in the Kinshasa dataset. In the incremental analyses we start with analysing only the largest cluster, gradually including groups of clusters of smaller sizes. This inclusion procedure allows us to track the impact of cluster sizes and general sequence numbers on the results. From that we conclude that only the dataset size matters, while actual cluster sizes matter very little, as long as sequences of both types are present in numbers of the same order. An additional

requirement is that the types should be mixed in at least some of the clusters; however, it is hard to evaluate the extent to which the mixing impacts the results. For example, the results on the Kinshasa dataset show no clear correlation between the amount of mixing (proportions of pyrazinamide-resistant strains or measures of type diversity within clusters) and the estimates of drug resistance fitness costs (see supplementary Figure 12 in (Pečerska et al., 2019)).

As for the Kinshasa dataset, we performed some analyses excluding the genetic sequences, to evaluate the information extracted from the occurrence data alone (cluster sizes and sample dates), shown in Figure 3. The analysis of the 93 sequences that form the largest cluster in fact shows no fitness cost (rλ ≈ 1), which is the most likely value sampled from the prior. As more clusters are added, the cost is estimated more precisely, finally stabilising at about 30% (median rλ = 0.6824, 95% HPD [0.5477,0.8232]). This is in accordance with the results in (Pečerska et al., 2019), namely that the relative fitness is underestimated when only using occurrence data.

The same analyses with all data (including the sequences) show a similar pattern, first seemingly sampling the prior and then stabilising at the value rλ = 0.939, 95% HPD [0.7904,1.0974] (see Figure 4).

Additionally, we used different priors on the resistance acquisition rate µ to check the robustness of the results to this prior. Prior selection for this parameter had little to no effect on the fitness cost estimates for the Kinshasa dataset. Unfortunately, the Georgian analyses proved to be less resilient and show a change in estimated fitness costs depending on the prior. On the other hand, the 95% highest posterior density intervals overlap, so the influence is not very strong.

Discussion

The relative fitness estimates for lineage 2 and lineage 4 sequences from the Georgian dataset thus contradict previous ideas on pyrazinamide resistance, which was hypothesised to have a high fitness cost that limited its spread (den Hertog et al., 2015). When looking at pyrazinamide resistance using pncA substitutions as proxy, we see varying results depending on both the lineage in question and the geographical location the dataset was sampled from. While the lineage 4 dataset from Kinshasa shows an approximately 30% loss in fitness for the pyrazinamide-resistant strains when compared to pyrazinamide-sensitive MDR-TB, the dataset from Georgia shows a very different result. While lineage 2 strains from Georgia show a relative fitness that seems to be identical between the two strain types, lineage 4 pyrazinamide-resistant strains from Georgia actually show a fitness advantage when compared to pyrazinamide-sensitive MDR-TB. However, the cause of the difference between the two datasets is not clear. The difference may stem from various

factors, perhaps from the intrinsic differences in lineage fitness, which, however, does not explain the disparity between the results for lineage 4 in both locations.

[Violin plots, y-axis 0.0 to 2.0. x-axis, cluster sizes included: 95 seq., 55-95 seq., 20-95 seq., 19-95 seq., 16-95 seq., 11-95 seq., 10-95 seq., 9-95 seq., 8-95 seq., 6-95 seq., 5-95 seq., 4-95 seq., 3-95 seq.]

FIG. 3. Incremental cluster analysis for the Georgian data, sequentially including smaller clusters. Here we do not use sequence data, i.e. we only use occurrence data. The rightmost violin plot is the result of the analysis using all clusters.

[Violin plots, y-axis 0.0 to 2.0. x-axis, cluster sizes included: 95 seq., 55-95 seq., 20-95 seq., 19-95 seq., 16-95 seq., 11-95 seq., 10-95 seq., 9-95 seq., 8-95 seq., 6-95 seq., 5-95 seq., 4-95 seq., 3-95 seq.]

FIG. 4. Incremental cluster analysis for the Georgian data, sequentially including smaller clusters. The rightmost violin plot is the result of the analysis using all clusters.

It is possible that while additional pyrazinamide resistance indeed either does not greatly reduce transmission fitness, or even improves it, the baseline fitness of the lineage 4 strains in Georgia

is lower than in Kinshasa, judging by the size of the available dataset. In Kinshasa the sampling is close to 2%, and in the dataset of 309 sequences ≈55% cluster to form transmission chains using the 12 SNP cutoff. On the other hand, in Georgia the sampling proportion for all TB patients is around 30%, and of the resulting 1,492 sequences ≈30% cluster using the 12 SNP cutoff. Even though Georgia is a high incidence TB country, lineage 4 sequences are not very common, which could be due to low baseline fitness. The epidemic in Georgia has been under drug pressure for longer, and the pressure has been intermittent, which may have allowed the circulating strains to compensate for the fitness loss in the long term. The situation in the Democratic Republic of the Congo is much less clear, and it is probable that access to antibiotics was limited up until the recent decade, and consequently there was not enough exposure to drugs for the circulating strains to adjust for the fitness loss. [Ask Bouke for reference?] The low fitness cost may be due to the longer treatment history in Georgia, and the longer time the circulating strains had to adapt to the drug pressure.

Additionally, what should be taken into account is that the lineage 4 Georgia dataset is in fact much smaller and the RE estimates should not be taken literally. Additionally, the clustering rate is low, which indicates a low RE in the studied timeframe. At the same time, however, the fitness costs seem to be nonexistent, a finding which still holds regardless of the precision of the RE estimates.

[Density plot: posterior density (y-axis, 0 to 6) of rλ (x-axis, 0.6 to 1.6), one curve per prior on µ: exp(1), exp(5), exp(10).]

FIG. 5. rλ value estimates for analyses of all clusters using different priors on the rate of resistance acquisition µ. The blue density curve shows the rλ posterior estimate for the exp(1) prior on µ, green for exp(5), red for exp(10).
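The 95% HPD intervals quoted in this manuscript summarise posterior samples of rλ and RE, and the comparisons between datasets and priors come down to asking whether an interval contains 1 or whether two intervals overlap. The Python sketch below shows one common way to compute such an interval from a list of samples, as the shortest window containing the requested share of the sorted samples. It assumes a unimodal posterior; the function names `hpd_interval`, `contains` and `intervals_overlap` are hypothetical and are not taken from the software used for the analyses.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples.

    Assumes a unimodal posterior, so that the highest posterior
    density region is a single interval.
    """
    xs = sorted(float(s) for s in samples)
    n = len(xs)
    if n == 0:
        raise ValueError("no samples given")
    # Number of samples inside the interval; the small offset guards
    # against floating-point round-up in mass * n.
    k = min(n, max(1, math.ceil(mass * n - 1e-9)))
    # Slide a window of k consecutive sorted samples, keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def contains(interval, value):
    """True if value lies inside the closed interval."""
    return interval[0] <= value <= interval[1]


def intervals_overlap(a, b):
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Intervals for rλ as reported in the Results text.
georgia_l2 = (0.7904, 1.0974)
kinshasa_l4 = (0.5472, 0.7374)
print(contains(georgia_l2, 1.0))                   # True
print(contains(kinshasa_l4, 1.0))                  # False
print(intervals_overlap(georgia_l2, kinshasa_l4))  # False
```

Applied to the reported intervals, the check reproduces the reading given in the text: equal fitness (rλ = 1) cannot be ruled out for Georgia lineage 2, while the Kinshasa lineage 4 interval lies entirely below 1 and does not overlap with the Georgia lineage 2 interval.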

This paper should warrant further research into pyrazinamide resistance. While the estimates of relative fitness on the Kinshasa lineage 4 dataset are congruent with previous hypotheses of high costs of pyrazinamide resistance, the Georgian analyses show either no cost or improved fitness for pyrazinamide-resistant strains. This is a worrying finding, as pyrazinamide is one of the first-line drugs that may still be used in the treatment of MDR-TB in combination with other drugs, as it was shown to eliminate dormant bacteria (WHO, 2019). This shows that we cannot extrapolate the results on relative fitness either from other locations or from other lineage estimates, and we have to address the question on a per-dataset basis. Drug susceptibility testing for pyrazinamide should be rigorously performed to ensure that pyrazinamide-resistant strains do not get the drug exposure necessary to compensate for any possible fitness cost for the bacterium.

Additionally, the rates of drug resistance acquisition should be studied more. The rates of drug resistance acquisition influence the final estimates of relative fitness, and while setting wide priors that contain little information has low influence on the results, the influence is still visible and can be reduced by improved estimates of the rates.

We are interested in any estimates of relative fitness that would allow us to correlate treatment regimens with more or less potent strains arising as a result. This is only possible with a high level of baseline strains, e.g. pan-sensitive strains that form the backbone of the epidemic. For example, to study pyrazinamide mono-resistance, we would require a dataset that contains the same order of drug-sensitive strains and of pyrazinamide mono-resistant strains. In the particular case of pyrazinamide, access to such data is limited, as it seems that the resistance mainly develops after rifampicin and isoniazid resistances (Huy et al., 2017). The best approximation is to study the possible influence of pyrazinamide resistance on the relative fitness of MDR-TB strains. Additionally, pyrazinamide resistance manifests in many different substitutions, each of which may have a different effect on the transmission fitness of the bacterium. We lack the data to distinguish the costs between strains with specific substitutions, thus we resort to averaging out any possible differences in fitness. Optimally, we would need a large number of sequences with the same pncA substitution embedded within an equal order of wild-type pan-sensitive strains to estimate the effect of a specific substitution on the fitness of the bacterium. However, even in such a case, we still need to study each location/situation separately, as even the same lineage may exhibit very different fitness values in different locations.

Acknowledgments

This work was supported by SystemsX.ch.


References

Andersson, D. I. and Hughes, D. 2010. Antibiotic resistance and its cost: is it possible to reverse resistance? Nat Rev Microbiol, 8(4): 260–71.

den Hertog, A. L., Sengstake, S., and Anthony, R. M. 2015. Pyrazinamide resistance in Mycobacterium tuberculosis fails to bite? Pathog Dis, 73(6): ftv037.

Diacon, A. H., Dawson, R., von Groote-Bidlingmaier, F., Symons, G., Venter, A., Donald, P. R., van Niekerk, C., Everitt, D., Winter, H., Becker, P., Mendel, C. M., and Spigelman, M. K. 2012. 14-day bactericidal activity of PA-824, bedaquiline, pyrazinamide, and moxifloxacin combinations: a randomised trial. The Lancet, 380(9846): 986–993.

Hopewell, P. C., Fair, E. L., and Pai, M. 2009. International standards for tuberculosis care, pages 649–659. Elsevier.

Huy, N. Q., Lucie, C., Hoa, T. T. T., Hung, N. V., Lan, N. T. N., Son, N. T., Nhung, N. V., Anh, D. D., Anne-Laure, B., and Van Anh, N. T. 2017. Molecular analysis of pyrazinamide resistance in Mycobacterium tuberculosis in Vietnam highlights the high rate of pyrazinamide resistance-associated mutations in clinical isolates. Emerg Microbes Infect, 6(10): e86.

Miotto, P., Cabibbe, A. M., Feuerriegel, S., Casali, N., Drobniewski, F., Rodionova, Y., Bakonyte, D., Stakenas, P., Pimkina, E., Augustynowicz-Kopec, E., Degano, M., Ambrosi, A., Hoffner, S., Mansjo, M., Werngren, J., Rusch-Gerdes, S., Niemann, S., and Cirillo, D. M. 2014. Mycobacterium tuberculosis pyrazinamide resistance determinants: a multicenter study. MBio, 5(5): e01819–14.

Pečerska, J., Kühnert, D., Meehan, C. J., Coscollá, M., de Jong, B. C., Gagneux, S., and Stadler, T. 2019. Quantifying transmission fitness costs of multi-drug resistant tuberculosis. Epidemics, in review.

Sengstake, S., Bergval, I. L., Schuitema, A. R., de Beer, J. L., Phelan, J., de Zwaan, R., Clark, T. G., van Soolingen, D., and Anthony, R. M. 2017. Pyrazinamide resistance-conferring mutations in pncA and the transmission of multidrug resistant TB in Georgia. BMC Infect Dis, 17(1): 491.

WHO 2018. Global tuberculosis report 2018.

WHO 2019. WHO consolidated guidelines on drug-resistant tuberculosis treatment. Report.


5 TRANSMISSION TIME AND CLUSTERING METHODS IN TB EPIDEMIOLOGY

This paper investigates different transmission cluster definitions that are currently in wide use, using phylogenetics to estimate the timespans that each clustering method covers. The clustering methods covered include spoligotyping, 24-loci MIRU-VNTR typing, and whole genome based methods such as SNP number cutoffs and core genome multi locus sequence typing (cgMLST). Spoligotyping, or spacer oligonucleotide typing, is a polymerase chain reaction method of genotyping Mycobacterium tuberculosis strains. It is a fast method that was standardised about 20 years ago and has allowed for the creation of large international databases showcasing the strain diversity of Mycobacterium tuberculosis. However, this method fails to differentiate well within large strain families and does not account for convergent evolution (Driscoll, 2009). Another method, often used in combination with spoligotyping, is genotyping using mycobacterial interspersed repetitive units based on variable numbers of tandem repeats (MIRU-VNTR), in particular the variant using 24 different loci. This method was proposed in 2006 as a high-resolution tool for phylogenetic analyses (Supply et al., 2006). However, it has previously been shown that, with this method of genotyping, unlinked cases may show identical patterns, which indicates that better resolution is required for short-term epidemic tracing, as the method covers too broad a range of genetic diversity (Roetzer et al., 2013). Whole genome sequencing (WGS) with different SNP cutoffs has long been used in scenarios where increased resolution is necessary; however, underlying costs and accessibility issues still limit its regular use. This work once again urges the scientific community to make active choices of data gathering and clustering methods when dealing with a particular epidemiological setting.
For example, spoligotyping-derived transmission clusters were associated with transmission events that could be hundreds of years old, while the transmission times estimated from MIRU-VNTR clusters often spanned over three decades. This kind of data, while undoubtedly interesting and useful when studying long term co-evolution of TB with the human population, holds little value when we set out to study recent transmission and short-term epidemics. In order to evaluate the impact of different factors in the short term, e.g. of interventions or new vaccine introductions, we need data that would span the appropriate, short-term, timeframes. Additionally, if, as in this thesis, we want to estimate drug resistance fitness costs, we not only need to sample drug resistant strains, we also need to have an equal order of drug-sensitive samples to provide the backbone of the tree and the baseline for relative transmission fitness estimates. This work was published in November 2018 in EBioMedicine as an article titled “The relationship between transmission time and clustering methods in Mycobacterium tuberculosis epidemiology”, DOI: 10.1016/j.ebiom.2018.10.013, where I am a middle author. Following is the publisher’s version of the article followed by the supplementary materials.

EBioMedicine 37 (2018) 410–416


Research paper

The relationship between transmission time and clustering methods in Mycobacterium tuberculosis epidemiology

Conor J. Meehan a,⁎, Pieter Moris a,b,c, Thomas A. Kohl d,e, Jūlija Pečerska f, Suriya Akter a, Matthias Merker d,e, Christian Utpatel d,e, Patrick Beckert d,e, Florian Gehre a,g,h, Pauline Lempens a, Tanja Stadler f, Michel K. Kaswa a,j, Denise Kühnert i, Stefan Niemann d,e,1, Bouke C. de Jong a,1

a Unit of Mycobacteriology, Biomedical Sciences, Institute of Tropical Medicine, Antwerp 2000, Belgium
b Adrem Data Lab (Adrem), Department of Mathematics and Computer Science, University of Antwerp, Antwerp 2020, Belgium
c Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp 2020, Belgium
d German Center for Infection Research, Partner Site Hamburg-Lübeck-Borstel-Riems, D-23845 Borstel, Germany
e Molecular and Experimental Mycobacteriology, Priority Area Infections, Research Center Borstel, D-23845 Borstel, Germany
f Swiss Institute of Bioinformatics (SIB), 1015 Lausanne, Switzerland
g Vaccines and Immunity Theme, Medical Research Council Unit The Gambia, Serekunda, Gambia
h Department Infectious Diseases Epidemiology, Bernhard Nocht Institute for Tropical Medicine, Hamburg 20359, Germany
i Max Planck Institute for the Science of Human History, 07745 Jena, Germany
j National Tuberculosis Program, Kinshasa, DR Congo

Article history: Received 31 July 2018; Received in revised form 17 September 2018; Accepted 3 October 2018; Available online 16 October 2018

Keywords: Mycobacterium tuberculosis; MDR-TB molecular epidemiology; Transmission; Spoligotyping; MIRU-VNTR; MLST; Whole genome sequencing; Outbreak detection

Background: Tracking recent transmission is a vital part of controlling widespread pathogens such as Mycobacterium tuberculosis. Multiple methods with specific performance characteristics exist for detecting recent transmission chains, usually by clustering strains based on genotype similarities. With such a large variety of methods available, informed selection of an appropriate approach for determining transmissions within a given setting/time period is difficult.

Methods: This study combines whole genome sequence (WGS) data derived from 324 isolates collected 2005–2010 in Kinshasa, Democratic Republic of Congo (DRC), a high endemic setting, with phylodynamics to unveil the timing of transmission events posited by a variety of standard genotyping methods. Clustering data based on Spoligotyping, 24-loci MIRU-VNTR typing, WGS based SNP (Single Nucleotide Polymorphism) and core genome multi locus sequence typing (cgMLST) typing were evaluated.

Findings: Our results suggest that clusters based on Spoligotyping could encompass transmission events that occurred almost 200 years prior to sampling while 24-loci-MIRU-VNTR often represented three decades of transmission. Instead, WGS based genotyping applying low SNP or cgMLST allele thresholds allows for determination of recent transmission events, e.g. in timespans of up to 10 years for a 5 SNP/allele cut-off.

Interpretation: With the rapid uptake of WGS methods in surveillance and outbreak tracking, the findings obtained in this study can guide the selection of appropriate clustering methods for uncovering relevant transmission chains within a given time-period. For high resolution cluster analyses, WGS-SNP and cgMLST based analyses have similar clustering/timing characteristics even for data obtained from a high incidence setting.

© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

⁎ Corresponding author. E-mail address: [email protected] (C.J. Meehan).
1 Equal contribution.

1. Introduction

Despite the large global efforts at curbing the spread of Mycobacterium tuberculosis complex (Mtbc) strains, 10.4 million new patients develop tuberculosis (TB) every year [1]. In addition, the prevalence of multidrug resistant (MDR) Mtbc strains is increasing [1], predominantly through ongoing transmission within large populations [2,3]. The tracking and timing of recent transmission chains allows TB control programs to effectively pinpoint transmission hotspots and employ targeted intervention measures. This is especially important for the transmission of drug resistant strains as it appears that drug resistance may be transmitted more frequently than acquired [2]. Thus, interrupting transmission is key for the control of MDR-TB [3,4]. For the development of the most effective control strategies, there is a strong need for (i) appropriate identification of relevant transmission chains, risk factors and hotspots and (ii) robust timing of when outbreaks first arose.

https://doi.org/10.1016/j.ebiom.2018.10.013

Research in context

Evidence before this study

For nearly 30 years, molecular genotyping tools have been used to define transmission chains/clusters of Mycobacterium tuberculosis strains. A variety of tools are used for such analysis e.g. the presence/absence of spacers sequences (Spoligotyping), the length of tandem repeat patterns (24-loci-MIRU-VNTR) or, more recently, nearly the complete genome by whole genome sequencing (WGS). Each method has been proposed as the gold standard genotyping technique for detecting transmission events in a certain timeframe and selection of the optimal method for a given question is difficult as important parameters (e.g. the time span a particular outbreak can encompass) are not well defined. Based on inferred mutation rates, there have been some time scales proposed for clusters based on WGS SNP-based methods, supported by contact tracing data to confirm epidemiological links. However, there is uncertainty around these timing estimates for SNP-based techniques, limited timing estimates available for classical genotyping techniques and no such estimates for cgMLST approaches. This makes it very difficult for researchers, public health workers and clinicians to correctly interpret reported clustering data. This is especially the case as WGS based methods are becoming rapidly ingrained in surveillance and clinical workflows.

Added value of this study

This study is the first to perform a comparative evaluation of cluster data defined by both classical and WGS-based M. tuberculosis genotyping approaches, especially with regard to transmission timing. While many studies have put forward various methods as the gold standard for M. tuberculosis transmission detection, we have tested clustering data generated by the different methods in a Bayesian statistical framework to elucidate the true fraction of recent transmission each approach is detecting. When specifically looking at recent transmission (e.g. <10 years previous), our results indicate that classical genotyping methods vastly over estimate recent transmission events. This solidifies the need for WGS-based methods when searching for recent outbreaks of M. tuberculosis.

Implications of all the available evidence

Our study allows researchers and public health officials to select the appropriate genotyping method for assessing transmission with respect to the epidemiological setting and a given time-period. We also suggest the incorporation of particular genotyping methods in a cascade system with increasing resolution for various levels of surveillance e.g. from multi-country surveillance down to recent transmission and outbreak analyses. This is particularly important as each method comes with specific costs, infrastructure and computational requirements, human resources, and, last but not least, interpretation complexities, all of which might not be feasible at all sites or scales. Accordingly, our study can aid a cost/benefit analysis for selection of genotyping techniques, that might especially be used in high incidence, low resource settings.

Epidemiological TB studies often apply genotyping methods to Mtbc strains to determine whether two or more patients are linked within a transmission chain (molecular epidemiology) [5]. Contact tracing is the primary non-molecular epidemiological method for investigating transmission networks of TB, mainly based on patient interviews [6]. Although this method is often seen as a gold standard of transmission linking, it does not always match the true transmission patterns, even in low incidence settings [7] and misses many connections [8]. The implementation of molecular genotyping and epidemiological approaches has overcome these limitations and is often used as the main approach for transmission analyses. Classical genotyping has involved IS6110 DNA fingerprinting [9], Spoligotyping (CRISPR-based) [10], and variable-number tandem repeats of mycobacterial interspersed repetitive units (MIRU-VNTR) [11] which is the most common method at the moment [5]. The latter method is based on copy numbers of a sequence in tandem repeat patterns derived from 24 distinct loci within the genome [12]. If two patients have the same classical genotyping pattern such as a 24-loci MIRU-VNTR pattern (or up to one locus difference [12]) they are considered to be within a local transmission chain. The combination of Spoligotyping and MIRU-VNTR-typing, where patterns must match in both methods to be considered a transmission link, is often considered the molecular gold standard for transmission linking and genotyping [12]. However, examples of unlinked patients with identical patterns have been observed, suggesting that this threshold covers too broad a genetic diversity and timespan between infections [7].

The application of (whole genome) sequence (WGS)-based approaches for similarity analysis of Mtbc isolates and cluster determination is known to have high discriminatory power when assessing transmission dynamics [7,13–16], either using core genome multi-locus sequence typing (cgMLST) [17,18] or SNP distances [7,14,15,19]. WGS-based approaches compare the genetic relatedness of the genomes of the clinical strains under consideration, albeit usually excluding large repetitive portions of the genome (>10% for the PE/PPE genes alone [20]), with the assumption that highly similar strains are linked by a recent transmission event [7,14]. Although many SNP cut-offs for linking isolates have been proposed [21], the most commonly employed is based on the finding that a 5 SNP cut-off will cluster the genomes of strains from the majority of epidemiologically linked TB patients, with an upper bound of 12 SNPs between any two linked isolates [14]. The emerging widespread use of WGS has quickly pushed these cut-offs to be considered the new molecular gold standard of recent transmission linking, although SNP distances may vary for technical reasons (e.g. assembly pipelines or filter criteria [22]) and between study populations e.g. high and low incidence settings [19].

In addition to cluster detection, uncovering the timing of transmission events within a given cluster is highly useful information for TB control e.g. for assessing the impact of interventions on the spread of an outbreak or uncovering when MDR-TB transmission first emerged in a particular setting. Accordingly, knowledge of the rate change associated with different genotyping methods is essential for correct timing. The whole genome mutation rate of Mtbc strains has been estimated by several studies as between 10⁻⁷ and 10⁻⁸ substitutions per site per year or ~0.3–0.5 SNPs per genome per year [7,14,23–25], while the rate of change in the MIRU-VNTR loci specifically is known to be quicker (~10⁻³) [26,27]. Since these mutation rates have been shown to also vary by lineage [24,28] and over short periods of time [23], such variation needs to be accounted for when estimating transmission times, e.g. by using Bayesian phylogenetic dating techniques [3,23,26].

Considering the multiple genotyping methods currently available, many of them proposed as a "gold standard", there is an urgent need to precisely define the individual capacity of each method to accurately detect recent transmission events and perform timing of outbreaks. To provide this essential information, this study harnesses the power of WGS-based phylogenetic dating methods to assign timespans onto Mtbc transmission chains encompassed by the different genotypic clustering methods commonly used in TB transmission studies.
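The link between a per-site substitution rate and a per-genome SNP rate is a multiplication by the genome length. The following is a minimal sketch of that conversion, not part of the article: the function name is hypothetical, and the genome length of 4,411,532 bp is an assumption taken from the H37Rv reference rather than from the text. It also ignores the repetitive regions that WGS analyses usually exclude.

```python
# Assumed length of the H37Rv reference genome in base pairs.
# This value is not given in the article; it is used for illustration only.
H37RV_GENOME_LENGTH_BP = 4_411_532


def snps_per_genome_per_year(rate_per_site_per_year,
                             genome_length_bp=H37RV_GENOME_LENGTH_BP):
    """Convert substitutions/site/year into expected SNPs/genome/year."""
    return rate_per_site_per_year * genome_length_bp


# The two bounds of the per-site rate range quoted in the introduction.
for rate in (1e-7, 1e-8):
    per_genome = snps_per_genome_per_year(rate)
    print(f"{rate:.0e} substitutions/site/year -> "
          f"{per_genome:.3f} SNPs/genome/year")
```

Under this assumed genome length, the upper end of the quoted per-site range corresponds to roughly 0.44 SNPs per genome per year.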

2. Materials and methods

2.1. Dataset, ethical approval and sequencing

A set of 324 isolates from Kinshasa, Democratic Republic of Congo were collected from consecutive retreatment TB patients between 2005 and 2010 at TB clinics, servicing an estimated 30% of the population of Kinshasa. This dataset represents approximately 2% of the cases at the time. All isolates were taken from the start of the patient's retreatment phase and were phenotypically resistant to rifampicin (RR-TB) and the majority are also isoniazid resistant (i.e. MDR-TB). Use of the stored isolates without any linked personal information was approved by the health authorities of the DRC and the Institutional Review Board of the ITM in Antwerp (ref no 945/14). Libraries for whole genome sequencing were prepared from extracted genomic DNA with the Illumina Nextera XT kit, and run on the Illumina NextSeq platform in a 2 × 151 bp run according to manufacturer's instructions. Illumina read sets are available on the ENA (https://www.ebi.ac.uk/ena) under the accession number PRJEB27847.

2.2. Genome reconstruction

The MTBseq pipeline [29] was used to detect the SNPs for each isolate using the H37Rv reference genome (NCBI accession number NC_000962.3) [30]. Unambiguous allele calls were based on the following parameters: four forward and four reverse reads indicating the allele, four reads indicating the allele with a phred score of 20 and a 75% allele frequency. All samples had over 95% coverage of H37Rv (median of 98%) with genome depth ranging from 54× to 290× (median of 160.5×). For creation of the SNP alignments, genes known to be involved in drug resistance (as outlined in the PhyResSE list of drug mutations v27 [31]) were excluded from the alignment and additional filtering of sites with ambiguous calls in >5% of isolates and those SNPs within a 12 bp window of each other was also applied.

2.3. Transmission cluster estimation methods

Six standard transmission clustering approaches were chosen for comparison and analysis: Spoligotyping, MIRU-VNTR, Spoligotyping + MIRU-VNTR, SNP-based clustering and cgMLST-based clustering. The latter two approaches were undertaken at 3 different cut-offs (1, 5 and 12 SNPs/alleles). The total SNP distances were calculated, per method, to investigate the range of variability encompassed within each cluster. Maximum SNP distances were derived from pairwise comparisons of isolates within the SNP alignment using custom Python scripts. A clustering rate was calculated for each method using the formula (n_c − c)/n, where n_c is the total number of isolates clustered by a given method, c is the number of clusters, and n is the total number of isolates in the dataset (n = 324).

2.4. Spoligotyping and MIRU-VNTR

Spoligotype patterns were obtained from membranes following the previously published protocol [10]. Isolates were said to be clustered if all 43 spacers matched. Genotyping by MIRU-VNTR was undertaken as previously described [12]. 2 μl of DNA was extracted from cultures and amplified using the 24 loci MIRU-VNTR typing kit (Genoscreen, Lille, France). Analysis of patterns was undertaken using the ABI 3500 automatic sequencer (Applied Biosystems, California, USA) and Genemapper software (Applied Biosystems). Isolates were said to be clustered if all 24 loci matched. Mixed MIRU-VNTR patterns were observed in 18 isolates although this mixing was not observed in the WGS data, likely due to subculturing for sequencing. MIRU-VNTR patterns were also combined with spoligotyping patterns for additional refinement of clusters. Isolates were clustered if both the spoligotyping pattern and the 24 loci MIRU-VNTR pattern matched. Spoligotyping and MIRU-VNTR patterns are available on figshare [32,33].

2.5. SNP and cgMLST cut-off clustering

In this study, we employed the widely used 5 SNP (proposed by Walker et al. [14] as the likely boundary for linked transmission) and 12 SNP cut-offs (proposed maximum boundary) for cluster definition. Additionally, we employed a lower cut-off of 1 SNP to look for clusters of very highly related isolates. Pairwise SNP distances were calculated between all isolates. A loose cluster definition was used, where every isolate in a cluster is at most the SNP cut-off from at least 1 other isolate in the cluster.

An alternative approach to clustering using WGS data is the concept of core genome MLST (cgMLST) patterns [17,18]. BAM files for all isolates are input into Ridom SeqSphere+ software (Ridom GmbH, Münster, Germany) to compile an allelic distance matrix based on the cgMLST v2 scheme consisting of 2891 core Mtbc genes [18]. Loose clusters were then defined using allelic differences of 1, 5 and 12 as cut-offs. These methods are referred to as 1/5/12 cgMLST respectively.

2.6. Estimation of transmission times

To estimate the age and timespan of potential transmission clusters, SNP alignments were created for the four primary clustering types: Spoligotyping, MIRU-VNTR, 12 SNP clusters and 12 allele cgMLST clusters.

A Bayesian approach to transmission time estimation was then undertaken. Each clustering method's alignment was separately input to BEAST-2 v2.4.7 [34] to create a time tree for those isolates. These phylogenies were built using the following priors: GTR + GAMMA substitution model, a log-normal relaxed molecular clock model to account for variation in mutation rates [35] and coalescent constant size demographic model [36], which assumes a low sampling proportion, as observed here [37]. This combination of parameters has been tested previously within a Bayesian framework and been shown to be suitable for lineage 4 isolates [19,25,38,39], including in Brazzaville, the city neighbouring Kinshasa in the Republic of the Congo [40]. The MCMC chain was run six times independently per alignment with a length of at least 400 million, sampled every 40,000th step (Spoligotyping: 400 M; MIRU: 700 M; 12 SNP and cgMLST: 500 M). A log normal prior (mean 1.5 × 10⁻⁷; variance 1.0) was used for the clock model to reflect the previously estimated mutation rate of M. tuberculosis lineage 4 [7,14,23–25], while allowing for variation as previously suggested [23]. A 1/X non-informative prior was selected for the population size parameter of the demographic model. Isolation dates were used as informative heterochronous tip dates and the SNP alignment was augmented with a count of invariant sites for each of the four nucleotide bases to avoid ascertainment bias [41]. Tracer v1.6 was used to determine adequate mixing and convergence of chains (effective sample sizes (ESS) >200 for all except Spoligotyping with ESS >100) after a 25% burn-in. The chains were combined via LogCombiner v2.4.8 [34] to obtain a single chain for each clustering type with high (>700) ESS. The tree samples were combined in the same manner and resampled at a lower frequency to create thinned samples of (minimum) 20,000 trees. Tip date randomisation was undertaken to check for temporal signal of the data. The R package 'TipDatingBeast' [42] was used to randomly reassign tip dates across the 12 SNP-based alignment. Ten repetitions were undertaken and BEAST-2 run as above. Rate mean and tree heights differed significantly between the random date and true dataset log files, suggesting a sufficient temporal signal was present in the data.

The algorithm for estimating the timespan of transmission events encompassed by each method is outlined in Supplemental Fig. 1. Briefly, for each cluster created by the given method, we defined the MRCA node as the internal node that connects all taxa in that cluster. The youngest node was then defined as the tip that is furthest from this MRCA within the clade (i.e. the tip descendant from that node that was sampled closest to the present time). To better account for changes in the mutation rate over short periods [23], all trees estimated and sampled during the Bayesian MCMC process were used instead of only a single summary phylogeny. For each retained tree in the MCMC process, the difference in age between the MRCA node and youngest node was calculated. This gave a distribution of likely maximum transmission event times within that cluster. For each method, these per-cluster aggregated ages were then combined across all clusters to give a per-method distribution of transmission event times represented by the clusters. The 95% Highest Posterior Density (HPD) interval of these distributions was calculated with the LaplacesDemon [43] p.interval function in R v3.4.0 [44].

3. Results

In this study, we assessed five different approaches for generating putative M. tuberculosis transmission clusters: Spoligotyping, MIRU-VNTR, Spoligotyping and MIRU-VNTR, SNP-based clustering using a 12, 5 and 1 SNP cut-off, and cgMLST allele clustering with 12, 5 and 1 allele cut-offs, using a dataset of 324 isolates collected 2005–2010 in Kinshasa, Democratic Republic of Congo (DRC). The dataset contained 309 L4 and 15 L5 isolates, with a maximum of 1671 SNPs between any two isolates. Bayesian phylodynamic dating approaches implemented in BEAST-2 [34] were then utilised to assign timespans to the transmission events estimated by each genotyping method.

As expected, classical genotyping methods clustered the most strains, with the lowest resolution (i.e. highest clustering rate) (Fig. 1, Table 1). WGS-based methods had by far the highest discriminatory power and low SNP cut-offs grouped isolates into smaller clusters (e.g. 2–10 isolates per cluster for a 5 SNP cut-off) (Table 1, Fig. 1). The high percentage of strains in a 12 SNP cluster (75%) suggests high levels of transmission in this population, making it suitable for further transmission analyses, despite the estimated low sampling proportion (2% based on demographic data).

Bayesian phylogenetic dating of the timeframe associated with particular transmission chains showed large differences in estimated cluster ages between the different genotyping approaches used (Table 1), correlating well with the difference in discriminatory power. Cluster ages are defined here as the most ancient transmission event that links any two isolates within a specific cluster (see methods and supplemental Fig. 1). Thus, in phylogenetic terms, the cluster age is the difference in time between when the most recent common ancestor (MRCA) of the entire cluster existed and the date of isolation of the furthest isolate from this ancestor.

The aggregate median ages of clusters derived from Spoligotyping were found to often be several hundred years old (median 178 years

Fig. 1. Clustering of M. tuberculosis isolates. For each approach the inclusion of an isolate into a cluster is outlined in the surrounding circles using GraPhlAn [59]. The ML phylogenetic tree was created using RAxML-NG [60] (see supplemental material) and is rooted between L4 and L5 isolates.
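The loose cluster definition used for the SNP and cgMLST cut-offs (every isolate is within the cut-off of at least one other isolate in its cluster) amounts to single-linkage grouping on the pairwise distance matrix. The sketch below illustrates that idea with a small union-find structure. It is not the authors' custom script: the function name, the isolate labels and the distances are all hypothetical and chosen only to show the chaining effect.

```python
from itertools import combinations


def loose_clusters(isolates, distance, cutoff):
    """Group isolates so that each one is within `cutoff` of at least one
    other member of its cluster (single linkage). Singletons are dropped."""
    parent = {iso: iso for iso in isolates}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of isolates whose distance is at most the cut-off.
    for a, b in combinations(isolates, 2):
        if distance[frozenset((a, b))] <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for iso in isolates:
        groups.setdefault(find(iso), []).append(iso)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical pairwise SNP distances between four made-up isolates.
ISOLATES = ["A", "B", "C", "D"]
DIST = {
    frozenset(("A", "B")): 3,
    frozenset(("B", "C")): 5,
    frozenset(("A", "C")): 8,
    frozenset(("A", "D")): 40,
    frozenset(("B", "D")): 41,
    frozenset(("C", "D")): 44,
}

print(loose_clusters(ISOLATES, DIST, cutoff=5))
print(loose_clusters(ISOLATES, DIST, cutoff=1))
```

In this toy input, A and C are 8 SNPs apart but still share a 5 SNP cluster because both are linked through B, which is why the maximum SNP distance within a loose cluster can exceed the cut-off itself.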

Table 1. Clustering method overview. For each clustering method, the general features are outlined in the table. Median ages and 95% HPD ranges are based upon the BEAST-2 estimates of clade heights (see methods).

Method           | Strains in clusters | Number of clusters | Percent of strains in clusters | Cluster sizes | Maximum SNP distances | Clustering rate | Mean timespan (years) | Timespan 95% HPD (years)
Spoligotyping    | 276                 | 33                 | 85.19                          | 2–39          | 1–685                 | 0.75            | 178.35                | 0.34–7747
MIRU-VNTR        | 207                 | 38                 | 63.89                          | 2–30          | 0–611                 | 0.5216          | 35.58                 | 0–1830
Spoligo-MIRU     | 174                 | 36                 | 53.7                           | 2–25          | 0–611                 | 0.4259          | 36.38                 | 0–1969
12 SNP cluster   | 242                 | 47                 | 74.69                          | 2–34          | 0–23                  | 0.6019          | 23.63                 | 0–102.58
5 SNP cluster    | 147                 | 40                 | 45.37                          | 2–27          | 0–10                  | 0.3302          | 10.86                 | 0–47.07
1 SNP cluster    | 74                  | 29                 | 22.84                          | 2–6           | 0–2                   | 0.1389          | 3.91                  | 0–23.54
12 allele cgMLST | 254                 | 45                 | 78.4                           | 2–39          | 0–51                  | 0.6451          | 24.06                 | 0–112.25
5 allele cgMLST  | 173                 | 42                 | 53.4                           | 2–28          | 0–22                  | 0.4043          | 13.4                  | 0–68.53
1 allele cgMLST  | 80                  | 31                 | 24.69                          | 2–6           | 0–4                   | 0.1512          | 4.73                  | 0–24.65
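The clustering rate column of Table 1 follows from the formula given in the methods, (n_c − c)/n with n = 324. The following minimal sketch recomputes it for a few rows from the "strains in clusters" and "number of clusters" values; the function and variable names are hypothetical and only the numbers come from the table.

```python
def clustering_rate(n_clustered, n_clusters, n_total):
    """Clustering rate (n_c - c) / n as defined in the methods section."""
    return (n_clustered - n_clusters) / n_total


N_ISOLATES = 324  # total number of isolates in the dataset

# (strains in clusters, number of clusters) taken from Table 1.
TABLE_1_ROWS = {
    "Spoligotyping": (276, 33),
    "MIRU-VNTR": (207, 38),
    "12 SNP cluster": (242, 47),
    "5 SNP cluster": (147, 40),
}

for method, (n_c, c) in TABLE_1_ROWS.items():
    print(f"{method}: {clustering_rate(n_c, c, N_ISOLATES):.4f}")
```

Subtracting the number of clusters removes one presumed index case per cluster, so the rate approximates the share of isolates attributable to transmission within the sampled set.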

(95% HPD: 0–7747)) (Table 1). MIRU-VNTR clustering encompassed more recent transmission events than Spoligotyping, but these were still found to be often over three decades old (median 36 years (95% HPD: 0–1830)). The combination of MIRU-VNTR and Spoligotyping resulted in cluster ages similar to MIRU-VNTR alone (Table 1). Clusters based on SNP cut-offs correlated to 23 years using a 12 SNP cut-off (95% HPD: 0–103), 11 years using a 5 SNP cut-off (95% HPD: 0–47), and 4 years using a 1 SNP cut-off (95% HPD: 0–24) (Table 1). Cluster sizes and ages based on cgMLST alleles were similar to the SNP-based clusters (Table 1).

4. Discussion

The term ‘recent transmission’ is often applied to gain a better understanding of the current transmission dynamics of pathogens in a given population. However, little data is available on how recent a likely transmission event occurred when measured with different genotyping methods. To get a better understanding of the discriminatory power of different classical genotyping techniques and WGS-based approaches in relation to outbreak timing, this study has performed an in-depth comparison of clustering rates and dated phylogenies obtained in a collection of 324 Mtbc strains from a high incidence setting (Kinshasa, DRC). With a whole genome phylodynamic approach employed as a gold standard, our study demonstrates that each genotyping method was associated with a specific discriminatory power resulting in clusters representing vastly different time periods of transmission events (Table 1). This has significant implications for data interpretations e.g. when selecting and utilising different genotyping/clustering approaches for epidemiological studies and assessing the effectiveness of public health intervention strategies.

As the most extreme example, Spoligotyping-derived clusters were associated with transmission events that can be several hundred years old. This is due to the low discriminatory power coupled with the high rate of convergent evolution (the same spoligotype pattern found in phylogenetically distant isolates). When convergent patterns are removed, the median and maximum transmission ages drop dramatically (see Supplementary table 1). However, in practice, such pattern removal is impossible without WGS data. Thus, these findings add weight to the previous suggestion that this technique is not suitable for recent transmission studies [45], although it may be of use as a low-cost method of sorting Mtbc strains into the seven primary lineages [46,47]. The transmission times encompassed by MIRU-VNTR clusters often spanned over three decades (Table 1), confirming previous studies showing over-estimation of recent transmission with this method [7,13,19,48]. In line with previous findings [45,49], convergent evolution of 24-loci MIRU-VNTR patterns was rarer than observed for Spoligotyping, but did occur in 16% of MIRU-VNTR-based clusters. Removal of such convergent patterns did not drastically change the median transmission ages for MIRU-VNTR (36 vs 26 years) but did affect the maximum ages (Supplementary table 1). As with Spoligotyping, such patterns cannot be easily detected and thus the impact of convergence in other datasets cannot be estimated. Combination of these two classical methods was similar to MIRU-VNTR alone, further limiting the use of Spoligotyping for molecular epidemiology.

For defining transmission events that occurred in more recent time frames before sampling, WGS-based methods were found to be better suited than classical genotyping methods (Table 1). The 12 SNP cut-off, currently the recommended upper bound for clustering isolates, often defines transmission events that occurred on average two decades prior to sampling, slightly younger in median age to clusters estimated by MIRU-VNTR, but also drastically more recent in maximum ages. This suggests that the 12 SNP cluster method may be a good replacement for MIRU-VNTR as it detects larger transmission networks spanning similar transmission time periods but is less affected by convergent evolution. Isolates clustered at a low (5 SNP) or nearly identical (1 SNP) cut-off were found to represent transmission events occurring over a time span of up to ten years. These findings correlate well with previous studies where confirmed contact tracing-based epidemiological links were found between patients that were two [15,50] and three [7] SNPs apart. The original paper that proposed the 5 and 12 SNP cut-offs found that serial isolates that were 10 years apart differed by, on average, 6 SNPs, also agreeing with the findings presented here [14]. Comparisons between the SNP-based (using almost all genomic differences) and the cgMLST-based (using a defined core set of genes) methods demonstrated that the latter approach gives similar estimates to full SNP approaches. This supports the use of low SNP or cgMLST differences for detection or exclusion of very recent transmission, although basing clustering on such low numbers of SNPs makes robust identification of transmission direction difficult.

The mutation rate of M. tuberculosis has been estimated to be between 10⁻⁷ and 10⁻⁸ substitutions per site per year [3,7,24]. Within the Bayesian analysis employed here, the mutation rate was free to vary between these values but was found to strongly favour ~3 × 10⁻⁸ (ESS > 1000 for all runs; 95% HPD: 4 × 10⁻⁹ to 8 × 10⁻⁸), translating to approximately 0.13 SNPs per genome per year (95% HPD: 0.017 to 0.35). While the mutation rate used here is in line with previous estimates for lineage 4 [24] (which most of this dataset is comprised of), it may be similar in other lineages, although this has only been shown for lineage 2 [3,24]. Thus, per-lineage estimates are required for all seven lineages to ensure similar transmission times are linked to genotyping methods across the whole diversity of the Mtbc.

While this study has many advantages due to its five year population based design in an endemic setting coupled with the application of three different genotyping methods (Spoligotyping, 24-locus MIRU-VNTR and WGS), future confirmatory studies could address the following drawbacks that are inherent to genomic epidemiology [16,22]: 1) studies employing contact tracing and/or digital epidemiology [51] in conjunction with these genotyping methods can help confirm transmission times associated with different clusters and increase the sampling proportion (although these methods also have many limitations); 2) as outlined above, strains of other lineages of the Mtbc should be analysed in a similar fashion to ensure transferability of findings across the entire complex; 3) a broad range of drug resistance profiles should be included to fully assess the impact of such mutations on transmission estimates; 4) improved WGS methods, such as directly from clinical samples to help reduce culture biases [52] and longer reads (e.g. PacBio SMRT or Nanopore MinION) to capture the entire genome, including repetitive regions such as PE/PPE genes known to impact genome remodelling [53,54], will ensure that the maximum diversity between isolates is captured; 6) extensive panels of Spoligotyping and MIRU-VNTR results paired with WGS data will help assess the extent of convergence in these methods and better correlate their clusters with those of low SNP thresholds and 7) standardised SNP calling pipelines appropriate across all lineages, with high true positive/low false negative rates, will ensure that Mtbc molecular epidemiology can be uniformly implemented and comparable across studies. Additionally, extensions of the current WGS-based strategies, such as including within-patient diversity [55,56] (may be missed by single colony picking for WGS) or counting inferred transmissions instead of SNPs [57] are required to truly understand the underlying dynamics of the M. tuberculosis transmission network.

Since each method was found to represent different timespans and clustering definitions, they can be used in a stratified manner in an integrated epidemiological and public health investigation addressing the transmission of Mtbc strains. For instance, although Spoligotyping clusters represented potentially very old transmission events, the low associated cost and its ability to be applied directly on sputum helps reduce culture bias and thus robustly assign lineages. This may aid public health officials in high burden settings understand (changes in) the population structure of the Mtbc lineages, including ruling out instances of relapse or laboratory contamination in case patterns differ. However, due to the problems outlined above, the usefulness of this method in public health initiatives is limited. MIRU-VNTR may serve well as first-line surveillance of potential transmission events in the population, guiding further investigations and resource allocations. However, with the ever decreasing cost and increasing speed of WGS methods, the expense and workload of MIRU-VNTR make it difficult to justify over the vast increase in data gained from genomics.

If classical genotyping methods are employed, any potential transmission hotspots should then be further investigated with contact tracing and/or WGS. Employment of different cut-offs and clustering approaches to WGS data can then address several questions. The 12 SNP/cgMLST allele cluster approaches serve well for high level surveillance targeting larger (older) transmission networks, akin to what is currently often done using MIRU-VNTR (e.g. [15,58]). Recent transmission events can then be detected through employment of low SNP cut-offs (e.g. 5 SNPs for transmission in the past 10 years or 1 SNP for transmission in the past 5 years). In high incidence/low diversity settings where amalgamation of clusters may inadvertently obscure distinct hotspots of transmission at different time points, subdivision into distinct time-dependent clusters can be undertaken using the algorithm presented in such a study in East Greenland [19].

Overall, phylodynamic approaches applied to whole genome sequences, as undertaken here, are recommended to fully investigate the specific transmission dynamics within a study population to account for setting-specific conditions, such as low/high TB incidence, low/high pathogen population diversity, and sparse/dense sampling fractions. As WGS methods become more commonplace and easier to implement in a variety of settings, each genotyping method can be employed as part of an overall evidence gathering program for transmission, placing molecular epidemiological approaches as an integral part in tracking and stopping the spread of TB.

Acknowledgements

The authors would like to thank Armand Van Deun and Koen Vandelannoote for valuable discussion and input and Cecile Uwizeye for aid with spoligotyping.

Funding sources

This work was supported by an ERC grant [INTERRUPTB; no. 311725] to BdJ, FG and CJM; an ERC grant to TS [PhyPD; no. 335529]; an FWO PhD fellowship to PM [grant number 1141217N]; the Leibniz Science Campus EvolLUNG for MM and SN; the German Centre for Infection Research (DZIF) for TAK, MM, CU, PB and SN; a SNF SystemsX grant (TBX) to JP and TS and a Marie Heim-Vögtlin fellowship granted to DK by the Swiss National Science Foundation. The computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government – department EWI.

Declaration of interests

The authors declare there are no conflicts of interest attached to this work.

Author contributions

CJM, FG and BCdJ conceived the study. MKK and BCdJ oversaw the collection of isolates and ethical approval. TAK, SA, MM, PB and SN undertook classic genotyping and sequencing of isolates. CJM, PM, TA, CU and PL undertook WGS assembly and data preparation. CJM undertook all convergence and clustering analyses. CJM, PM, JP, MM, TS and DK undertook all phylodynamics. CJM, PM, SN and BCdJ wrote the manuscript. All authors read and revised the manuscript and approved its final form.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ebiom.2018.10.013.

References

[1] WHO. Global TB Rep 2017:2018.
[2] Kendall EA, Fofana MO, Dowdy DDW, Who, Dye C, Garnett G, et al. Burden of transmitted multidrug resistance in epidemics of tuberculosis: a transmission modelling analysis. Lancet Respir Med 2015 Nov 12;3(12):963–72.
[3] Merker M, Blin C, Mona S, Duforet-Frebourg N, Lecher S, Willery E, et al. Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage. Nat Genet 2015 Jan 19;47(3):242–9.
[4] Shah NS, Auld SC, Brust JCM, Mathema B, Ismail N, Moodley P, et al. Transmission of extensively drug-resistant tuberculosis in South Africa. N Engl J Med 2017 Jan 19;376(3):243–53.
[5] Merker M, Kohl TA, Niemann S, Supply P. The evolution of strain typing in the Mycobacterium tuberculosis complex. Advances in Experimental Medicine and Biology; 2017. p. 43–78.
[6] Fox GJ, Barry SE, Britton WJ, Marks GB. Contact investigation for tuberculosis: a systematic review and meta-analysis. Eur Respir J 2012;41(1).
[7] Roetzer A, Diel R, Kohl TA, Rückert C, Nübel U, Blom J, et al. Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study. (Neyrolles O, editor) PLoS Med 2013 Jan 12;10(2):e1001387.
[8] Bjorn-Mortensen K, Lillebaek T, Koch A, Soborg B, Ladefoged K, Sørensen HCF, et al. Extent of transmission captured by contact tracing in a tuberculosis high endemic setting. Eur Respir J 2017;49(3).
[9] Thierry D, Cave MD, Eisenach KD, Crawford JT, Bates JH, Gicquel B, et al. IS6110, an IS-like element of Mycobacterium tuberculosis complex. Nucleic Acids Res 1990 Jan 11;18(1):188.
[10] Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, et al. Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol 1997 Apr;35(4):907–14.
[11] Supply P, Magdalena J, Himpens S, Locht C. Identification of novel intergenic repetitive units in a mycobacterial two-component system operon. Mol Microbiol 1997 Dec;26(5):991–1003.
[12] Supply P, Allix C, Lesjean S, Cardoso-Oelemann M, Rusch-Gerdes S, Willery E, et al. Proposal for standardization of optimized mycobacterial interspersed repetitive unit-variable-number tandem repeat typing of mycobacterium tuberculosis. J Clin Microbiol 2006 Dec 1;44(12):4498–510.
[13] Wyllie DH, Davidson JA, Grace Smith E, Rathod P, Crook DW, Peto TEA, et al. A quantitative evaluation of MIRU-VNTR Typing against whole-genome sequencing for identifying mycobacterium tuberculosis transmission: a prospective observational cohort study. EBioMedicine 2018 Aug;34:122–30.

[14] Walker TM, Ip CLC, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis 2013 Feb;13(2):137–46.
[15] Walker TM, Merker M, Knoblauch AM, Helbling P, Schoch OD, van der Werf MJ, et al. A cluster of multidrug-resistant Mycobacterium tuberculosis among patients arriving in Europe from the Horn of Africa: a molecular epidemiological study. Lancet Infect Dis 2018 Jan;8.
[16] Comas I. Genomic Epidemiology of Tuberculosis. Cham: Springer; 2017; 79–93.
[17] Kohl TA, Diel R, Harmsen D, Rothgänger J, Walter KM, Merker M, et al. Whole-genome-based Mycobacterium tuberculosis surveillance: a standardized, portable, and expandable approach. J Clin Microbiol 2014 Jul 1;52(7):2479–86.
[18] Kohl TA, Harmsen D, Rothgänger J, Walker T, Diel R, Niemann S. Harmonized genome wide typing of tubercle bacilli using a web-based gene-by-gene nomenclature system. EBioMedicine 2018 Jul 30;0(0).
[19] Bjorn-Mortensen K, Soborg B, Koch A, Ladefoged K, Merker M, Lillebaek T, et al. Tracing mycobacterium tuberculosis transmission by whole genome sequencing in a high incidence setting: a retrospective population-based study in East Greenland. Sci Rep 2016 Sep 12;6:33180.
[20] Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of mycobacterium tuberculosis from the complete genome sequence. Nature 1998 Jun 11;393(6685):537–44.
[21] Hatherell H-A, Colijn C, Stagg HR, Jackson C, Winter JR, Abubakar I. Interpreting whole genome sequencing for investigating tuberculosis transmission: a systematic review. BMC Med 2016 Dec 23;14(1):21.
[22] Guthrie JL, Gardy JL. A brief primer on genomic epidemiology: lessons learned from Mycobacterium tuberculosis. Ann N Y Acad Sci 2017 Jan 1;1388(1):59–77.
[23] Bryant JM, Schürch AC, van Deutekom H, Harris SR, de Beer JL, de Jager V, et al. Inferring patient to patient transmission of Mycobacterium tuberculosis from whole genome sequencing data. BMC Infect Dis 2013 Feb 27;13(1):110.
[24] Duchêne S, Holt KE, Weill F-X, Le Hello S, Hawkey J, Edwards DJ, et al. Genome-scale rates of evolutionary change in bacteria. Microb Genomics 2016 Nov;2(11):e000094.
[25] Eldholm V, Monteserin J, Rieux A, Lopez B, Sobkowiak B, Ritacco V, et al. Four decades of transmission of a multidrug-resistant Mycobacterium tuberculosis outbreak strain. Nat Commun 2015 May 11;6:7119.
[26] Wirth T, Hildebrand F, Allix-Béguec C, Wölbeling F, Kubica T, Kremer K, et al. Origin, spread and demography of the mycobacterium tuberculosis complex. (Achtman M, editor) PLoS Pathog 2008 Sep 26;4(9):e1000160.
[27] Ragheb MN, Ford CB, Chase MR, Lin PL, Flynn JL, Fortune SM. The mutation rate of mycobacterial repetitive unit loci in strains of M. tuberculosis from cynomolgus macaque infection. BMC Genomics 2013 Mar 5;14(1):145.
[28] Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB, Cohen T, et al. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet 2013 Jul;45(7):784–90.
[29] Kohl TA, Utpatel C, Schleusener V, De Filippo MR, Beckert P, Cirillo DM, et al. MTBseq: A Comprehensive Pipeline for Whole Genome Sequence Analysis of Mycobacterium Tuberculosis Complex Isolates. PeerJ; 2018 (in press).
[30] Lew JM, Kapopoulou A, Jones LM, Cole ST. TubercuList – 10 years after. Tuberculosis 2011;91(1):1–7.
[31] Feuerriegel S, Schleusener V, Beckert P, Kohl TA, Miotto P, Cirillo DM, et al. PhyResSE: a web tool delineating mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data. (Carroll KC, editor) J Clin Microbiol 2015 Jun;53(6):1908–14.
[32] Meehan CJ, de Jong BC. Membrane Spoligotype Mtbc Kinshasa 2005–2010; 2018.
[33] Meehan CJ, de Jong BC. [dataset] MIRU-VNTR Mtbc Kinshasa 2005–2010; 2018.
[34] Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, et al. BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. (Prlic A, editor) PLoS Comput Biol 2014 Apr 10;10(4):e1003537.
[35] Drummond AJ, Ho SYW, Phillips MJ, Rambaut A, Rambaut A. Relaxed phylogenetics and dating with confidence. (Penny D, editor) PLoS Biol 2006 Mar 14;4(5):e88.
[36] Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from Molecular Sequences. Mol Biol Evol 2005 Feb 9;22(5):1185–92.
[37] Stadler T, Vaughan TG, Gavryushkin A, Guindon S, Kühnert D, Leventhal GE, et al. How well can the exponential-growth coalescent approximate constant-rate birth–death population dynamics? Proc R Soc Lond B Biol Sci 2015;282(1806).
[38] Lee RS, Radomski N, Proulx J-F, Levade I, Shapiro BJ, McIntosh F, et al. Population genomics of Mycobacterium tuberculosis in the Inuit. Proc Natl Acad Sci U S A 2015 Nov 3;112(44):13609–14.
[39] Didelot X, Fraser C, Gardy J, Colijn C. Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol 2017 Jan 18;34(4) (msw075).
[40] Malm S, Linguissi LSG, Tekwu EM, Vouvoungui JC, Kohl TA, Beckert P, et al. New Mycobacterium tuberculosis complex Sublineage, Brazzaville, Congo. Emerg Infect Dis 2017 Mar;23(3):423–9.
[41] Leaché AD, Banbury BL, Felsenstein J, de Oca A Nieto-M, Stamatakis A, et al. Short tree, long tree, right tree, wrong tree: new acquisition bias corrections for inferring SNP phylogenies. Syst Biol 2015 Nov;64(6):1032–47.
[42] Rieux A, Khatchikian C. TipDatingBeast: Using Tip Dates with Phylogenetic Trees in BEAST (Software for Phylogenetic Analysis); 2018.
[43] Staticat LLC. LaplacesDemon: Complete Environment for Bayesian Inference. Bayesian-Inference.com. R package version 16.0.1 [Internet]; 2016. Available from: https://cran.r-project.org/web/packages/LaplacesDemon/citation.html
[44] R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2017 (Vienna, Austria).
[45] Comas I, Homolka S, Niemann S, Gagneux S. Genotyping of genetically monomorphic bacteria: DNA sequencing in Mycobacterium tuberculosis highlights the limitations of current methodologies. PLoS One 2009 Nov 12;4(11):e7815.
[46] Kato-Maeda M, Gagneux S, Flores LL, Kim EY, Small PM, Desmond EP, et al. Strain classification of Mycobacterium tuberculosis: congruence between large sequence polymorphisms and spoligotypes. Int J Tuberc Lung Dis 2011 Jan;15(1):131–3.
[47] Filliol I, Motiwala AS, Cavatore M, Qi W, Hazbón MH, Bobadilla del Valle M, et al. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol 2006 Jan 15;188(2):759–72.
[48] Stucki D, Ballif M, Egger M, Furrer H, Altpeter E, Battegay M, et al. Standard genotyping overestimates transmission of Mycobacterium tuberculosis among immigrants in a low-incidence country. J Clin Microbiol 2016 Jul 1;54(7):1862–70.
[49] Scott AN, Menzies D, Tannenbaum T-N, Thibert L, Kozak R, Joseph L, et al. Sensitivities and specificities of spoligotyping and mycobacterial interspersed repetitive unit-variable-number tandem repeat typing methods for studying molecular epidemiology of tuberculosis. J Clin Microbiol 2005 Jan;43(1):89–94.
[50] Walker TM, Lalor MK, Broda A, Ortega LS, Morgan M, Parker L, et al. Assessment of Mycobacterium tuberculosis transmission in Oxfordshire, UK, 2007–12, with whole pathogen genome sequences: an observational study. Lancet Respir Med 2014 Apr;2(4):285–92.
[51] Salathé M, Bengtsson L, Bodnar TJ, Brewer DD, Brownstein JS, Buckee C, et al. Digital Epidemiology. PLoS Comput Biol 2012 Jul 26;8(7):e1002616.
[52] Sanoussi CN, Affolabi D, Rigouts L, Anagonou S, de Jong B. Genotypic characterization directly applied to sputum improves the detection of Mycobacterium africanum West African 1, under-represented in positive cultures. (Picardeau M, editor) PLoS Negl Trop Dis 2017 Sep 1;11(9):e0005900.
[53] Phelan JE, Coll F, Bergval I, Anthony RM, Warren R, Sampson SL, et al. Recombination in pe/ppe genes contributes to genetic variation in Mycobacterium tuberculosis lineages. BMC Genomics 2016 Feb 29;17:151.
[54] Ates LS, Dippenaar A, Ummels R, Piersma SR, van der Woude AD, van der Kuij K, et al. Mutations in ppe38 block PE_PGRS secretion and increase virulence of Mycobacterium tuberculosis. Nat Microbiol 2018 Feb 15;3(2):181–8.
[55] Casali N, Broda A, Harris SR, Parkhill J, Brown T, Drobniewski F. Whole genome sequence analysis of a large isoniazid-resistant tuberculosis outbreak in London: A Retrospective Observational Study. (Metcalfe JZ, editor) PLoS Med 2016 Oct 4;13(10):e1002137.
[56] Martin M, Lee RS, Cowley LA, Gardy JL, Hanage WP. Within-host diversity and its utility for Mycobacterium tuberculosis transmission inferences. Microb Genomics October 2018(10) (in press).
[57] Stimson J, Gardy JL, Mathema B, Crudu V, Cohen T, Colijn C. Beyond the SNP Threshold: Identifying Outbreak Clusters Using Inferred Transmissions. bioRxiv, 319707; 2018 May 10.
[58] Guthrie JL, Kong C, Roth D, Jorgensen D, Rodrigues M, Hoang L, et al. Molecular epidemiology of tuberculosis in British Columbia, Canada: a 10-year retrospective study. Clin Infect Dis 2018 Mar 5;66(6):849–56.
[59] Asnicar F, Weingart G, Tickle TL, Huttenhower C, Segata N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ 2015 Jun 18;3:e1029.
[60] Kozlov A. RAxML-NG; 2017. https://doi.org/10.5281/zenodo.593079.

APPENDIX

Supplementary material

Convergence removal

For the classical genotyping methods (Spoligotyping and MIRU-VNTR), convergence of patterns can occur [1,2], resulting in incorrect clustering. Such patterns were removed based on inter-sub-lineage convergence across the phylogenetic tree. Firstly, the SNP alignment of all isolates was used as the basis for creating a maximum likelihood (ML) phylogeny. RAxML-NG version 0.5.1b [3] was used to reconstruct the phylogeny from this alignment using a GTR+GAMMA model of evolution, accounting for ascertainment bias [4] with the Stamatakis reconstituted DNA approach [5] and site repeat optimisation [6], with 20 different starting trees and 100 bootstraps.

All subsequent topology visualisation was undertaken using GraPhlAn [7]. Mtbc lineage and sub-lineage numbering was then applied to all isolates based on the Coll SNP set [8]. If the same clustering pattern was observed in two different sub-lineages, with other patterns seen in-between on the tree, this was flagged as pattern convergence. For example, if isolates with the same Spoligotyping pattern appeared in sub-lineages 4.1 and 4.6 with different patterns in-between, this was confirmed as a convergent pattern.
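The flagging rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the isolates are supplied in left-to-right tip order of the ML tree, so that "in-between on the tree" can be approximated by position in that list, and the function name `convergent_patterns` and the example isolates are hypothetical.

```python
from collections import defaultdict


def convergent_patterns(tips):
    """Flag genotyping patterns that look convergent.

    tips: list of (isolate, sub_lineage, pattern) tuples, assumed to be in
    left-to-right tip order of the phylogeny (an assumption of this sketch).
    A pattern is flagged when it occurs in at least two sub-lineages and
    tips carrying other patterns sit between its first and last occurrence.
    """
    positions = defaultdict(list)   # pattern -> tip indices carrying it
    lineages = defaultdict(set)     # pattern -> sub-lineages it occurs in
    for index, (_, sub_lineage, pattern) in enumerate(tips):
        positions[pattern].append(index)
        lineages[pattern].add(sub_lineage)

    flagged = set()
    for pattern, idx in positions.items():
        if len(lineages[pattern]) < 2:
            continue  # confined to one sub-lineage: not flagged
        between = tips[idx[0]: idx[-1] + 1]
        if any(other != pattern for _, _, other in between):
            flagged.add(pattern)
    return flagged


# Hypothetical example: pattern "P1" appears in sub-lineages 4.1 and 4.6
# with a different pattern ("P2") in between, so it is flagged.
example_tips = [
    ("iso_a", "4.1", "P1"),
    ("iso_b", "4.1", "P2"),
    ("iso_c", "4.6", "P1"),
    ("iso_d", "4.6", "P3"),
    ("iso_e", "4.6", "P3"),
]
print(convergent_patterns(example_tips))
```

A tree-aware version would replace the tip-order shortcut with a check on the clade spanned by the isolates sharing a pattern; the list-based form is used here only to keep the sketch self-contained.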

Convergent evolution was found to affect 39% (12) of Spoligotyping-based clusters and 16% (6) of the MIRU-VNTR clusters. Convergence-free versions of these methods (Spoligotyping, MIRU-VNTR and the combination of both) were then used as input to BEAST2 for divergence dating, as outlined in the main methods. Supplemental table 1 outlines their median transmission ages alongside the 95% HPD.

Supplemental table 1: Clustering method overview with convergent patterns removed from classical methods.

For each clustering method, the general features are outlined in the table. Median ages and 95% HPD ranges are based upon the BEAST-2 estimates of clade heights (see methods).

Method                   Strains in  Number of  Percent of strains  Cluster  Maximum SNP  Clustering  Mean      Timespan
                         clusters    clusters   in clusters         sizes    distances    rate        Timespan  95% HPD
Spoligotyping            118         21         36.42               2-28     0-189        0.2994      76.51     0.81 - 823.21
MIRU-VNTR                121         32         37.35               2-11     0-48         0.2747      26.08     0 - 162.27
Spoligotyping-MIRU-VNTR  50          12         15.43               2-10     2-48         0.1173      32.92     0.8 - 216.31
1 SNP cluster            74          29         22.84               2-6      0-2          0.1389      3.91      0 - 23.54
5 SNP cluster            147         40         45.37               2-27     0-10         0.3302      10.86     0 - 47.07
12 SNP cluster           242         47         74.69               2-34     0-23         0.6019      23.63     0 - 102.58
1 allele cgMLST          80          31         24.69               2-6      0-4          0.1512      4.73      0 - 24.65
5 allele cgMLST          173         42         53.4                2-28     0-22         0.4043      13.4      0 - 68.53
12 allele cgMLST         254         45         78.4                2-39     0-51         0.6451      24.06     0 - 112.25
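The two derived columns of the table can be reproduced from the counts. The sketch below assumes the "n minus one" definition of the clustering rate, (strains in clusters − number of clusters) / total strains, and the collection size of 324 strains given in the main text; the formula is inferred from the table values rather than stated in this section, and the function name is hypothetical.

```python
def clustering_summary(strains_in_clusters, n_clusters, total_strains=324):
    """Return (percent of strains in clusters, clustering rate).

    The clustering rate uses the 'n - 1' convention: one strain per cluster
    is treated as the source case and is not counted as recently transmitted.
    This convention is an assumption that happens to match the table.
    """
    percent = 100 * strains_in_clusters / total_strains
    rate = (strains_in_clusters - n_clusters) / total_strains
    return round(percent, 2), round(rate, 4)


# Convergence-free Spoligotyping row: 118 strains in 21 clusters.
print(clustering_summary(118, 21))
# Convergence-free MIRU-VNTR row: 121 strains in 32 clusters.
print(clustering_summary(121, 32))
```

Both calls return the percentages and clustering rates listed in the corresponding rows, which supports the assumed definition.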

[Figure: six panels numbered 1) to 6), one per step below. The final panels show an example table of cluster ages per MCMC step (cluster 1: ages 13, 11, ...; cluster 2: ages 20, 19, ...) summarised for clustering approach 1 as "Median: 12, 95% HPD: 10-20".]

Supplemental figure 1. Algorithm for estimating transmission times encompassed by different clustering approaches.
Step 1: Extract the tree from the MCMC sampled step.
Step 2: Map the clusters on the tree.
Step 3: Get the time difference between the ancestral node (most recent common ancestor) and the youngest (furthest from the ancestor) sampled tip in each cluster. These are defined as the age of each cluster.
Step 4: Aggregate all these ages across clusters.
Step 5: Repeat for every tree calculated in each MCMC sampled step.
Step 6: Calculate the median and 95% HPD based on all the ages of all clusters in all MCMC steps for the given clustering approach.
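The six steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in the study: each posterior tree is represented by a hypothetical pair of dictionaries (child-to-parent links and a calendar time per node) instead of a BEAST2 tree object, and the HPD is taken as the shortest interval containing 95% of the pooled ages.

```python
import math
from statistics import median


def mrca(parent, tips):
    """Most recent common ancestor of `tips`; `parent` maps child -> parent."""
    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    common = set(path_to_root(tips[0]))
    for tip in tips[1:]:
        common &= set(path_to_root(tip))
    # Walking up from any tip, the first shared node is the MRCA.
    return next(node for node in path_to_root(tips[0]) if node in common)


def cluster_age(parent, time, tips):
    """Step 3: time from the cluster's MRCA to its youngest sampled tip."""
    return max(time[t] for t in tips) - time[mrca(parent, tips)]


def hpd(samples, mass=0.95):
    """Shortest interval that contains `mass` of the samples."""
    xs = sorted(samples)
    k = max(1, math.ceil(mass * len(xs)))
    start = min(range(len(xs) - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def summarise(trees, clusters):
    """Steps 4 to 6: pool ages over clusters and trees, then summarise."""
    ages = [cluster_age(parent, time, tips)
            for parent, time in trees
            for tips in clusters]
    return median(ages), hpd(ages)


# Hypothetical posterior sample of two trees sharing one topology.
parent = {"t1": "A", "t2": "A", "A": "R", "t3": "R"}
tree_1 = (parent, {"R": 1990, "A": 2000, "t1": 2008, "t2": 2010, "t3": 2009})
tree_2 = (parent, {"R": 1985, "A": 1998, "t1": 2008, "t2": 2010, "t3": 2009})
clusters = [["t1", "t2"], ["t1", "t3"]]
print(summarise([tree_1, tree_2], clusters))
```

Pooling ages across clusters before summarising, as in step 4, means that large numbers of young clusters can pull the median down even when a few clusters are very old, which is why the table reports the HPD range alongside the central estimate.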

Supplemental references

1. Scott AN, Menzies D, Tannenbaum T-N, Thibert L, Kozak R, Joseph L, et al. Sensitivities and specificities of spoligotyping and mycobacterial interspersed repetitive unit-variable-number tandem repeat typing methods for studying molecular epidemiology of tuberculosis. J Clin Microbiol. 2005 Jan;43(1):89–94.

2. Driscoll JR. Spoligotyping for Molecular Epidemiology of the Mycobacterium tuberculosis Complex. In: Methods in molecular biology (Clifton, NJ). 2009. p. 117–28.

3. Kozlov A. RAxML-NG. 2017.

4. Lewis PO. A likelihood approach to estimating phylogeny from discrete morphological character data. Syst Biol. 2001;50(6):913–25.

5. Leaché AD, Banbury BL, Felsenstein J, de Oca A nieto-M, Stamatakis A, K. S, et al. Short Tree, Long Tree, Right Tree, Wrong Tree: New Acquisition Bias Corrections for Inferring SNP Phylogenies. Syst Biol. 2015 Nov;64(6):1032–47.

6. Kobert K, Stamatakis A, Flouri T. Efficient Detection of Repeating Sites to Accelerate Phylogenetic Likelihood Calculations. Syst Biol. 2016 Aug 29;66(2):syw075.

7. Asnicar F, Weingart G, Tickle TL, Huttenhower C, Segata N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ. 2015 Jun 18;3:e1029.

8. Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão J, Viveiros M, et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014 Jan 1;5:4812.


6 QUANTIFYING THE FITNESS COST OF HIV DRUG RESISTANCE

Unsurprisingly, another ruthless single-pathogen killer is HIV. While the overall mortality rates from HIV have been going down, they are nowhere near as low as the targets set by WHO, and approximately one million people died of HIV-related causes in 2017 (WHO, 2019b). Moreover, the percentage of transmitted antiretroviral resistance is high, ranging between 10% and 15% in European countries and in North America (Hauser et al., 2017). One major difference between TB and HIV evolutionary dynamics lies in the mechanics of drug resistance. Mycobacterium tuberculosis is a relatively slow-evolving pathogen, which is typically thought to incur a significant fitness cost due to resistance evolution and to be very unlikely to revert to sensitivity (Andersson and Hughes, 2010). HIV, a viral pathogen, on the other hand evolves quickly to escape drug pressure; however, if drug pressure is lifted, a predominantly resistant population may revert to sensitivity. These shifts in the dominant population make resistance testing a necessity, which in turn provides plenty of genetic material for further analysis. High rates of transmitted drug resistance and the routine collection of viral sequence samples for drug resistance testing make HIV a well-suited model pathogen for estimating the relative transmission fitness of drug resistance. Depending on the specific drug-resistance conferring substitution, the number of clustered sequences in this study ranges from 311 to 1014. These numbers are very high compared with the same types of data acquired for TB; in the studies described in Chapter 3 and Chapter 4 of this thesis, the number of clustered sequences across all drug resistance types is at best two hundred.
The rich HIV data allowed us to estimate the relative transmission fitness for 14 different drug resistance mutations, of which four showed a loss in fitness compared with drug-sensitive strains, nine showed no apparent fitness loss, and one appeared to be even fitter than the drug-sensitive strain. My contribution to this work was the implementation of a re-parametrisation of the multi-type birth-death model that allows us to estimate a relative fitness, constant through time, of a resistant variant with respect to a sensitive strain population. This work was published in February 2018 in PLOS Pathogens as an article titled “Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics”, DOI: 10.1371/journal.ppat.1006895, where I am a middle author. The publisher’s version of the article follows, together with the supplementary text and figures.
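The three outcome groups mentioned above (less fit, no apparent fitness loss, fitter) follow from how the article's abstract defines fitness cost: as the ratio of the transmission rate of hosts infected by a resistant strain to that of hosts infected by a sensitive strain, judged by whether the highest posterior density interval of that ratio includes 1. The sketch below only illustrates this decision rule; the function names and all numbers are hypothetical and are not estimates from the study.

```python
def relative_fitness(rate_resistant, rate_sensitive):
    """Transmission ratio of resistant to sensitive hosts (1 = no cost)."""
    return rate_resistant / rate_sensitive


def classify(hpd_low, hpd_high):
    """Classify a mutation from the HPD interval of its transmission ratio."""
    if hpd_high < 1:
        return "less fit than sensitive"
    if hpd_low > 1:
        return "fitter than sensitive"
    return "no apparent fitness difference"


# Hypothetical intervals, for illustration only.
print(relative_fitness(0.8, 1.0))
print(classify(0.60, 0.90))
print(classify(0.85, 1.20))
print(classify(1.05, 1.40))
```

An interval that merely touches 1 is treated here as including it, which is the conservative choice for a rule of this kind.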

105 RESEARCH ARTICLE Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics Denise KuÈhnert1,2,3,4,5¤*, Roger Kouyos1,2, George Shirreff3,6, Jūlija Pečerska4,5, Alexandra U. Scherrer1,2, JuÈ rg BoÈ ni2, Sabine Yerly7, Thomas Klimkait8, Vincent Aubert9, Huldrych F. GuÈ nthard1,2, Tanja Stadler4,5³, Sebastian Bonhoeffer3³, the Swiss HIV Cohort Study¶

1 Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, Zurich, Switzerland, 2 Institute of Medical Virology, University of Zurich, Zurich, Switzerland, 3 Institute of Integrative Biology, ETH Zurich, Zurich, Switzerland, 4 Department of Biosystems Science and Engineering, ETH Zurich, Basel, a1111111111 Switzerland, 5 Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland, 6 School of Medicine, Imperial a1111111111 College London, London, United Kingdom, 7 Laboratory of Virology, Division of Infectious Diseases, Geneva a1111111111 University Hospital, Geneva, Switzerland, 8 Department of Biomedicine, University of Basel, Basel, a1111111111 Switzerland, 9 Division of Immunology and Allergy, University Hospital Lausanne, Lausanne, Switzerland a1111111111 ¤ Current address: Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, Zurich, Switzerland ³ These authors are joint senior authors who contributed equally to this work. ¶ Membership list can be found in the Acknowledgments section. * [email protected] OPEN ACCESS

Citation: Kühnert D, Kouyos R, Shirreff G, Pečerska J, Scherrer AU, Böni J, et al. (2018) Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics. PLoS Pathog 14(2): e1006895. https://doi.org/10.1371/journal.ppat.1006895

Editor: Adi Stern, University of California, San Francisco, UNITED STATES

Received: November 10, 2017
Accepted: January 23, 2018
Published: February 20, 2018

Copyright: © 2018 Kühnert et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The HIV sequence data was obtained as part of the Swiss HIV Cohort Study (SHCS), whose authors may be contacted at www.shcs.ch/contact. Due to the representativeness of the dataset, the sensitivities associated with HIV infections, and to protect the privacy of patients enrolled in the study, a deposition of all sequence data in an open database is not possible at this time.

Funding: DK gratefully acknowledges support from the ETH Zurich Postdoctoral Fellowship Program and the Marie Curie Actions for People COFUND, and the Swiss National Science Foundation (SNSF) for generously funding her research with a Marie Heim-Vögtlin fellowship. RK also acknowledges support from the SNSF (grant BSSGI0_155851). SB, GS and DK gratefully acknowledge support from the ERC (PBDR 268540). GS was also funded by the World Health Organisation (grant number 2013/363982) and the Medical Research Council Centre for Outbreak Analysis and Modelling. JP was funded through a SystemsX grant from the SNSF (TbX). TS is supported in part by the European Research Council under the Seventh Framework Programme of the European Commission (PhyPD grant 335529). This study has been financed in the framework of the Swiss HIV Cohort Study, supported by the Swiss National Science Foundation (SNSF grant 33CS30-134277), the SHCS Projects 470, 528, 569, 683, 724, the SHCS Research Foundation, the Swiss National Science Foundation grant 159868 (to HFG), by the Yvonne-Jacob foundation, an unrestricted research grant from Gilead, Switzerland, to the SHCS research foundation, and by the University of Zurich’s Clinical research Priority Program (CRPP) Viral infectious diseases: Zurich Primary HIV Infection Study (to HFG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

PLOS Pathogens | https://doi.org/10.1371/journal.ppat.1006895 February 20, 2018 1 / 16 Phylodynamic quantification of HIV-1 drug resistance fitness cost

Abstract

Drug resistant HIV is a major threat to the long-term efficacy of antiretroviral treatment. Around 10% of ART-naïve patients in Europe are infected with drug-resistant HIV type 1. Hence it is important to understand the dynamics of transmitted drug resistance evolution. Thanks to routinely performed drug resistance tests, HIV sequence data is increasingly available and can be used to reconstruct the phylogenetic relationship among viral lineages. In this study we employ a phylodynamic approach to quantify the fitness costs of major resistance mutations in the Swiss HIV cohort. The viral phylogeny reflects the transmission tree, which we model using stochastic birth–death–sampling processes with two types: hosts infected by a sensitive or resistant strain. This allows quantification of fitness cost as the ratio between transmission rates of hosts infected by drug resistant strains and transmission rates of hosts infected by drug sensitive strains. The resistance mutations 41L, 67N, 70R, 184V, 210W, 215D, 215S and 219Q (nRTI-related) and 103N, 108I, 138A, 181C, 190A (NNRTI-related) in the reverse transcriptase and the 90M mutation in the protease gene are included in this study. Among the considered resistance mutations, only the 90M mutation in the protease gene was found to have significantly higher fitness than the drug sensitive strains. The following mutations associated with resistance to reverse transcriptase inhibitors were found to be less fit than the sensitive strains: 67N, 70R, 184V, 219Q. The highest posterior density intervals of the transmission ratios for the remaining resistance mutations included in this study all included 1, suggesting that these mutations do not have a significant effect on viral transmissibility within the Swiss HIV cohort. These patterns are consistent with alternative measures of the fitness cost of resistance mutations. Overall, we have developed and validated a novel phylodynamic approach to estimate the transmission fitness cost of drug resistance mutations.

Author summary

The introduction of antiretroviral therapy (ART) has decreased mortality and morbidity rates among HIV-infected people, and improved their quality of life. In fact, the WHO states that antiretroviral therapy programmes averted an estimated 7.8 million deaths worldwide between 2000 and 2014. However, the antiretroviral regimen prescribed to a patient may be unable to control HIV infection. Factors that can contribute to treatment failure include drug resistance, drug toxicity, or poor treatment adherence. In this study we aim to understand the dynamics of transmitted drug resistance by analysing the viral sequence data that was collected for resistance testing. We present a novel approach to quantify how drug resistance impacts virus lineage transmissibility, how fast resistance mutations evolve in sensitive strains and how fast they revert back to the sensitive type. We apply our approach to the Swiss HIV cohort study, and obtain patterns of viral transmission fitness that are consistent with alternative, harder to obtain measures of fitness.

Introduction

The emergence and subsequent spread of drug resistant human immunodeficiency virus type 1 (HIV-1) is a major threat to the long-term efficacy of antiretroviral treatment. Around 10% of antiretroviral therapy (ART)-naïve patients in Europe are infected with drug-resistant HIV-1 and transmitted drug resistance (TDR) has been associated with a higher virological failure rate during treatment [1–9]. The dynamics of TDR depend largely on the respective resistance mutation, and understanding them requires quantifying the fitness cost of each mutation. Estimates of fitness costs, resistance evolution and reversion rates could previously only be obtained by comparing the replication kinetics of the virus after infection of cell cultures or more complicated experimental techniques [10] or through longitudinal cohort studies [11, 12]. These methods are essential in understanding the type of fitness cost related to replication within the host. Here we are interested in a different type of viral fitness, namely the transmission fitness, which describes the success of a viral lineage in transmission between hosts. As pol sequences are routinely collected from infected individuals to test for drug resistance, HIV sequence data is increasingly available.
These sequences can be used to reconstruct the phylogenetic relationship among viral lineages, which is an approximation of the transmission tree. A considerable number of phylogenetic and phylodynamic approaches for the analysis of pathogen outbreaks have been developed in the last decade and have greatly contributed to a better understanding of the dynamics of HIV epidemics [13–19].

In this study we employ a phylodynamic approach to quantify the fitness costs of major resistance mutations using data from the Swiss HIV cohort study (SHCS) and the associated drug resistance database. Our approach is based on stochastic birth–death–sampling processes, which have been shown to be suitable for the modelling of epidemic processes [20]. In terms of the transmission tree a “birth” event corresponds to the infection of a new host, a “death” event corresponds to the host’s removal from the infectious pool (e.g. successful treatment). The removed host may or may not have been sampled before removal, which corresponds to the viral strain being sequenced and included in the SHCS.


Fig 1. The two-type birth–death model with types ‘sensitive’ and ‘resistant’. Virus samples are grouped into the compartments by their resistance status (corresponding to a single major resistance mutation). Transmission at transmission rates λs and λr can only occur within the sensitive and resistant compartment, respectively. In either compartment, removal from the infectious pool occurs at rate δ. The compartments are connected by (exponential) rates of resistance evolution and reversion. https://doi.org/10.1371/journal.ppat.1006895.g001

We consider each major resistance mutation separately such that our model requires exactly two types, sensitive and resistant, between which we assume a simple ‘migration’ process of resistance evolution and reversion, see Fig 1. The aim of this paper is to demonstrate that our phylodynamic approach can shed light on the dynamics of HIV resistance evolution. In fact, the approach we present here allows us to quantify the fitness cost of major resistance mutations (i.e. λr/λs in Fig 1), as well as the rates at which resistance mutations evolve and are reversed, on a population level. Notably, our approach requires only cross-sectional viral sequence data annotated with dates of sampling.
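To make the compartment structure of Fig 1 concrete, the following is a minimal forward-simulation sketch of the two-type process. It is not the inference procedure used in the study: it only tracks the number of hosts in each compartment, it leaves out sampling entirely, and the function name, argument names and default values are assumptions chosen for illustration.

```python
import random


def simulate_two_type(lam_s, lam_r, delta, evo, rev,
                      n_s=1, n_r=0, t_max=10.0, max_hosts=10_000, seed=0):
    """Forward-simulate host counts in the sensitive/resistant compartments.

    All rates are per host per year. Sampling is deliberately omitted, so
    this illustrates the model structure only (hypothetical helper).
    """
    rng = random.Random(seed)
    t = 0.0
    while 0 < n_s + n_r < max_hosts:
        rates = [
            lam_s * n_s,  # 0: sensitive host infects a new host (sensitive)
            lam_r * n_r,  # 1: resistant host infects a new host (resistant)
            delta * n_s,  # 2: sensitive host leaves the infectious pool
            delta * n_r,  # 3: resistant host leaves the infectious pool
            evo * n_s,    # 4: resistance evolution, sensitive -> resistant
            rev * n_r,    # 5: reversion, resistant -> sensitive
        ]
        total = sum(rates)
        if total == 0.0:
            break  # no event can occur any more
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_max:
            break
        event = rng.choices(range(6), weights=rates)[0]
        if event == 0:
            n_s += 1
        elif event == 1:
            n_r += 1
        elif event == 2:
            n_s -= 1
        elif event == 3:
            n_r -= 1
        elif event == 4:
            n_s, n_r = n_s - 1, n_r + 1
        else:
            n_s, n_r = n_s + 1, n_r - 1
    return n_s, n_r
```

In this sketch the transmission ratio λr/λs is simply `lam_r / lam_s`; setting it below one makes resistant lineages grow more slowly than sensitive ones for the same removal rate.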

Materials and methods

SHCS sequence data set

The SHCS is a multicentre, prospective observational study for interdisciplinary human immunodeficiency virus (HIV) research, with an estimated coverage of more than 45% of all HIV cases in Switzerland overall and more than 70% since 1996 [21]. The associated drug resistance database is a central, anonymised collection of all genotypic drug resistance tests performed on SHCS enrolees. The base data set used in this study contains the first HIV-positive (subtype B) samples of treatment naïve SHCS patients. Due to the availability of the SHCS biobank, in total more than 12000 genotypes could be generated retrospectively, making the SHCS one of the globally most densely sampled populations [6]. In particular, all patients with available samples present before the start of ART were retrospectively


sequenced. This is of particular importance because in general, HIV drug resistance testing was not routinely performed before 2003 in Switzerland or even later in other countries. Hence, the sampling dates of the sequences included here range from as early as 1989 to 2015. The resulting 5638 pol sequences were combined with the 4284 most closely related sequences selected from the LANL database using BLAST (allowing up to 10 hits per sequence; expect value = 1 × 10⁻³⁰). The resulting 9922 sequences were aligned against the HXB2 subtype B reference strain and the alignment was stripped of the major resistance mutations listed by the IAS-USA [22]. A maximum likelihood phylogeny of the alignment was reconstructed in FastTree version 2.1.9 [23] assuming a GTR model of molecular evolution with gamma distributed rate heterogeneity. The resulting phylogeny was used to identify all sub-epidemics with at least two samples for which a minimum of 80 per cent of the samples were from Switzerland [14]. In the following we refer to these Swiss sub-epidemics as clusters. For computational reasons any cluster that contained more than 250 sequences was split at the root of the phylogeny and considered as two separate clusters.

For each of the major resistance mutations we selected all clusters that contained at least one sequence in which the respective mutation was present. From the remaining samples that did not have the respective mutation, we removed sequences in which any of the other major resistance mutations were present ( 10%). Hence, the sequences in the resulting clusters are either ‘truly sensitive’, or contain the respective resistance mutation. From the resulting resistance mutation related data sets (RMDS) we included only those that met the following two criteria: summing over all clusters, (i) the total number of sequences is ≥ 25 and (ii) the number of resistant sequences is ≥ 10 (see Table 1).
Numbers of resistant and total samples per RMDS per cluster are illustrated in S1 Fig. More detailed cluster characteristics are given in Table 3 within S1 Text. Exemplary analysis files are provided in S1 File.
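The cluster and RMDS selection rules described above can be sketched as a small filter. This is an illustration under stated assumptions, not the pipeline used in the study: the `Cluster` record and both function names are hypothetical, the thresholds are the ones quoted in the text, and the splitting of clusters with more than 250 sequences and the removal of sequences carrying other resistance mutations are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical summary of one sub-epidemic in the phylogeny."""
    n_total: int      # all sequences in the cluster
    n_swiss: int      # sequences sampled in Switzerland
    n_resistant: int  # sequences carrying the mutation of interest


def is_swiss_cluster(c, min_size=2, min_swiss_fraction=0.8):
    # At least two samples, of which at least 80 per cent are Swiss.
    return c.n_total >= min_size and c.n_swiss / c.n_total >= min_swiss_fraction


def select_rmds(clusters, min_total=25, min_resistant=10):
    """Return the clusters forming an RMDS, or [] if the criteria fail."""
    # Keep Swiss clusters containing at least one resistant sequence.
    kept = [c for c in clusters if is_swiss_cluster(c) and c.n_resistant >= 1]
    total = sum(c.n_total for c in kept)
    resistant = sum(c.n_resistant for c in kept)
    # Criteria (i) and (ii) are evaluated on sums over all kept clusters.
    if total >= min_total and resistant >= min_resistant:
        return kept
    return []
```

Note that both inclusion criteria apply to the sums over clusters, so an RMDS can qualify even when every individual cluster is small.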

Phylodynamic analyses

The 15 RMDS were analysed using the multi-type birth–death model [18] in BEAST2 [24]. A single analysis was set up for each of the resistance mutations. The sequences were annotated with their date of sampling and type (sensitive or resistant) and the phylogenies for each

Table 1. Resistance mutations with numbers of corresponding clusters and samples, related drugs and drug usage dates within Switzerland.

Class   Mutation   # clusters (size ≥ 2)   # sequences in clusters   # resistant samples in clusters   Drugs (SHCS drug codes)   Drug usage ≥ 1%   Drug usage < 1%
nRTI    41L        56                      927                       93                                AZT, D4T                  1987              -
nRTI    67N        23                      667                       39                                AZT, D4T                  1987              -
nRTI    70R        19                      712                       26                                AZT, D4T                  1987              -
nRTI    184V       35                      1011                      44                                3TC, ABC                  1995.5            -
nRTI    210W       18                      481                       26                                AZT, D4T                  1987              -
nRTI    215D       18                      569                       41                                AZT, D4T                  1987              -
nRTI    215S       16                      494                       31                                AZT, D4T                  1987              -
nRTI    215Y       25                      807                       28                                AZT, D4T                  1987              -
nRTI    219Q       20                      605                       28                                AZT, D4T                  1987              -
NNRTI   103N       25                      725                       38                                NVP, EFV                  1997              -
NNRTI   108I       10                      334                       11                                NVP, EFV                  1997              -
NNRTI   138A       46                      1014                      109                               RPV                       2013              -
NNRTI   181C       8                       329                       10                                NVP, EFV                  1997              -
NNRTI   190A       8                       311                       12                                NVP, EFV                  1997              -
PI      90M        14                      389                       38                                NFV, SQV                  1996              2008

The drug field lists three further codes on a third row (FTC, ETV, RPV); their column assignment is not recoverable from the flattened source.

nRTIs: Resistance mutation related to nucleoside/nucleotide reverse transcriptase inhibitors
NNRTIs: Resistance mutation related to non-nucleoside reverse-transcriptase inhibitors
PIs: Resistance mutation related to protease inhibitors
‘Drug usage ≥ 1%’ refers to the time at which the respective drug was prescribed to a minimum of one percent of patients within the SHCS. If multiple drugs are associated with a resistance mutation the earliest date is used. Accordingly, ‘Drug usage < 1%’ refers to the time when the respective drugs are no longer used in ≥ 1% of the patients.
https://doi.org/10.1371/journal.ppat.1006895.t001


cluster were reconstructed jointly with the epidemiological parameters. While the tree topology, tree height and length, the branch rate variation and sampled ancestors [25] were estimated separately for each of the clusters, the substitution rate and epidemiological parameters were estimated jointly for all clusters. The parameters that were estimated jointly for all clusters associated with a particular resistance mutation were informed by the phylogenies of these clusters together. For example, we estimate a single resistance evolution rate for each resistance mutation, such that all clusters contribute to the estimate of the resistance evolution rate. The unit of time used in our analyses is years, such that all rates estimated here are average rates per lineage per year.

Fig 1 depicts the two-type birth–death model employed. Transmission events can only occur within each type, i.e. a transmission event caused by a sensitive strain results in a new infection with a sensitive strain and likewise for resistant strains. The migration rates estimated from sensitive to resistant lineages and back are equivalent to population-level rates of resistance evolution and reversion, respectively. Specifically, moving from the sensitive compartment to the resistant compartment is modelled through a resistance evolution rate; the opposite direction is determined by the resistance reversion rate. Both rates are assumed to be zero before significant usage of the related drug(s) in Switzerland. We consider drug usage as significant whenever the respective drug is prescribed to a minimum of one percent of patients within the SHCS.

The NNRTI-related mutation 138A is an exception because it occurred in 0.5% to 5% of viruses from treatment-naïve patients even before the introduction of the related drug rilpivirine [26, 27]. Hence, we do not assume the resistance evolution and reversion rates for the 138A mutation to be zero at any time. Instead we allow the rates to change in 2013, when usage of the respective drug in Switzerland became significant. The protease inhibitors related to resistance mutation 90M have been used in less than 1% of patients after 2008. Therefore, we allow the resistance evolution and reversion rates for the 90M mutation to change in a piecewise constant fashion, such that they are zero before 1996 and we obtain one estimate for the time during and one for after significant drug usage in Switzerland.

The effective reproduction number of the sensitive type through time, which is the quotient

of the sensitive transmission rate over the removal rate (Rs = λs/δ), is allowed to change over seven-year intervals (before 1994, 1994-2001, 2001-2008, 2008-2015) in a piecewise constant fashion. We estimate a constant between-host transmission ratio

$$ r_\lambda = \frac{\lambda_r}{\lambda_s} $$

between the per lineage resistant transmission rate λr and the sensitive transmission rate λs. Hence, rλ < 1 implies that there is a fitness cost associated with a resistance mutation. On the other hand rλ > 1 suggests that the resistance mutation confers an advantage to the viral strain. We assume a joint removal rate δ for both types.

The analyses were conducted using a HKY substitution model with gamma distributed rate heterogeneity and a relaxed clock model with lognormally distributed branch rates. The substitution rate was fixed to 2.55 × 10⁻³ [28]. For each cluster sampling was assumed to have started at the sampling time of its earliest sequence. We estimate a constant sampling proportion p with its prior distribution centered around 22%, which results from considering that about 50% of the HIV infected population in Switzerland is included in the SHCS, 67% of which have successfully been sequenced [21] and 65% of which are treatment naïve. Upon sampling, individuals are removed from the infectious pool at a removal probability r. This means that we allow sampled ancestors [25] in the reconstructed phylogenies implying that infected


Table 2. Prior distributions for the birth–death model parameters.

Parameter                     Prior distribution
Rs                            LogN(0,1.25)
δ                             LogN(-1,0.5)
rλ                            LogN(0,0.5)
p                             Beta(22,78)
resistance evolution rate     Exp(1)
resistance reversion rate     Exp(1)
removal probability           Unif(0,1)

https://doi.org/10.1371/journal.ppat.1006895.t002

individuals may still transmit to others after they have been diagnosed. The prior distributions employed are listed in Table 2.

Multiple independent instances were run for each RMDS, which were then combined, resulting in a combined Markov chain length of at least 250 million after burn-in. The resulting effective sample size of each estimated parameter is greater than or equal to 200. This unusually long chain length was necessary due to the setup of the analyses: in each analysis we reconstruct the phylogenies for 8 to 56 clusters, many of which are very small (down to 2 samples), but some of which are fairly large (up to 184 samples), which makes operator optimisation difficult.
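The 22% centre of the sampling proportion prior follows from multiplying the three fractions quoted in the text, and it matches the mean of the Beta(22,78) prior in Table 2. The short check below reproduces that arithmetic; the variable and function names are assumptions, and the standard deviation is computed from the usual Beta formulas rather than taken from the paper.

```python
from math import sqrt

# Fractions quoted in the text.
in_cohort = 0.50        # share of the HIV-infected population included in the SHCS
sequenced = 0.67        # share of those successfully sequenced
treatment_naive = 0.65  # share of those that are treatment naive

expected_p = in_cohort * sequenced * treatment_naive


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


prior_mean, prior_sd = beta_mean_sd(22, 78)
print(f"product of fractions: {expected_p:.3f}")
print(f"Beta(22, 78) mean: {prior_mean:.2f}, sd: {prior_sd:.3f}")
```

The product is about 0.218, so Beta(22,78), with mean 22/(22+78) = 0.22, is centred on the stated value.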

Ethics statement

The SHCS was approved by the ethics committees of the participating institutions (Kantonale Ethikkommission Bern, Ethikkommission des Kantons St. Gallen, Comite Departemental d’Ethique des Specialites Medicales et de Medicine Communataire et de Premier Recours, Kantonale Ethikkommission Zurich, Repubblica e Cantone Ticino - Comitato Ethico Cantonale, Commission Cantonale d’Étique de la Recherche sur l’Être Humain, Ethikkommission beider Basel; all approvals are available on http://www.shcs.ch/206-ethic-committee-approval-and-informed-consent), and written informed consent was obtained from all participants.

Results

After applying the selection criteria described in the Methods section, we obtained 15 RMDS with 8 (181C & 190A) to 56 (41L) clusters per resistance mutation comprising a total of 311 (190A) to 1014 (138A) sequences (Table 1). The total number of patients included in the study after applying the above criteria is 2614. Most clusters contain very few resistant samples. Sensitive samples are allowed to occur in multiple RMDS. Histograms of the numbers of (i) all and (ii) resistant samples per cluster are shown in S1 Fig. As an example of the resulting reconstructed cluster phylogenies, Fig 2 shows the maximum clade credibility trees of one cluster per drug class.

We were unable to obtain reliable estimates for the 215Y RMDS, as the MCMC did not converge even after more than 2,400 million combined MCMC steps (for details see Section 3.1 within S1 Text). Hence we present the results from the remaining 14 RMDS only.

We quantify the transmission dynamics of the drug sensitive virus samples by estimating the effective reproduction number, Rs, through time. This yields one estimate for each of the 14 RMDS, which agree with one another (depicted by 14 overlapping violin plots per time interval, Fig 3). Our results suggest that Rs was above 2 before 1994, below the epidemic threshold of one between 1994 and 2001 and around 1 after 2001.
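The piecewise constant parameterisation of Rs can be read as a simple lookup over the four time intervals, from which the transmission rates follow via λs = Rs · δ and λr = rλ · λs. The sketch below shows that bookkeeping only. The function names are assumptions, and the numerical values in the usage lines are placeholders chosen to mimic the qualitative pattern reported above, not estimates from the study.

```python
from bisect import bisect_right

# Interval boundaries used for Rs in the text:
# before 1994, 1994-2001, 2001-2008, 2008-2015.
CHANGE_TIMES = [1994.0, 2001.0, 2008.0]


def piecewise_constant(t, change_times, values):
    """Return the value of the interval that calendar time t falls into."""
    # len(values) must equal len(change_times) + 1.
    return values[bisect_right(change_times, t)]


def transmission_rates(t, rs_values, delta, r_lambda):
    """Sensitive and resistant transmission rates at time t.

    Uses Rs = lambda_s / delta and r_lambda = lambda_r / lambda_s.
    """
    lam_s = piecewise_constant(t, CHANGE_TIMES, rs_values) * delta
    lam_r = r_lambda * lam_s
    return lam_s, lam_r


# Placeholder values for illustration only (not estimates from the paper).
rs_values = [2.5, 0.8, 1.0, 1.0]
lam_s, lam_r = transmission_rates(1990.0, rs_values, delta=0.5, r_lambda=0.43)
```

Because rλ is held constant, any change in Rs between intervals rescales λs and λr by the same factor.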

For each resistance mutation we quantify the between-host fitness cost as the ratio rλ of the transmission rate of the drug resistant strains divided by the transmission rate of the drug sensitive strain. Among the considered resistance mutations, only the 90M mutation in the protease gene was found to have significantly higher fitness than the drug sensitive strains (Fig 4). The following mutations associated with resistance to reverse transcriptase inhibitors were found to be less fit than the sensitive strains: 67N, 70R, 184V, 219Q. The highest posterior density intervals of the transmission ratios for the remaining resistance mutations (41L, 103N,


Fig 2. Maximum clade credibility trees of one cluster per drug class. Summary of the posterior distribution of the reconstructed phylogeny for one exemplary cluster in the 219Q, 138A, 184V, 103N and 90M RMDS. For example, the 103N cluster contains five resistant and seven sensitive samples. It has one sampled ancestor (indicated by the resistant sample with zero branch length), indicating that the respective patient transmitted to at least one other person after having been diagnosed with HIV. https://doi.org/10.1371/journal.ppat.1006895.g002


Fig 3. Estimates of the effective reproduction number Rs of the sensitive strains through time. Time has been partitioned into 4 fixed time intervals: before 1994, 1994-2001, 2001-2008, 2008-2015. For each time interval there are 14 estimates, one from each of the 14 resistance-mutation data sets (RMDS). The violin plots show the 95% HPDs of the Rs estimates. https://doi.org/10.1371/journal.ppat.1006895.g003

108I, 138A, 181C, 190A, 210W, 215D, 215S) all included the threshold one, suggesting that these mutations do not have a significant effect on viral transmissibility within the Swiss HIV Cohort Study.

The estimated rates of resistance evolution and reversion are shown in S2 and S3 Figs, respectively. The median rates of resistance evolution during significant drug usage in Switzerland are between 0.005 and 0.13. The corresponding median rates of resistance reversion range from 0.03 to 0.20. Detailed posterior estimates of the epidemiological parameters are given in Table 4 within S1 Text.

Our results suggest that once the patient samples are taken, the patients are unlikely to continue transmitting to others. In fact, the median number of sampled ancestors for most RMDS clusters is zero. Very few clusters belonging to the 215D, 108I, 181C and 190A data sets have significantly more than zero sampled ancestors. Overall, the probability of being removed from the infectious pool upon sampling is above 89% (smallest median estimate) for patients infected with sensitive strains. There is much uncertainty in the estimates of the removal probability for resistant strains, which range from 44% (215D; 95% HPD: 0.7-84%) to 95% (138A; 95% HPD: 81-100%).
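The significance statements in this section rest on whether the 95% HPD interval of rλ contains the threshold one. As a minimal sketch, assuming the posterior is available as a list of samples, the HPD interval can be taken as the shortest window covering the requested share of the sorted samples. This is a generic textbook construction for unimodal posteriors, not necessarily the exact routine used by the authors' tooling, and both function names are hypothetical.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (unimodal case)."""
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must cover; the small epsilon guards
    # against floating-point noise in mass * n.
    k = max(1, math.ceil(mass * n - 1e-9))
    # Slide a window of k consecutive samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


def interpret_ratio(lower, upper, threshold=1.0):
    """Classify a transmission-ratio HPD interval relative to the threshold."""
    if upper < threshold:
        return "fitness cost"
    if lower > threshold:
        return "fitness advantage"
    return "no significant effect"
```

For instance, the interval reported for 184V in the Discussion, [0.18-0.66], lies entirely below one and is classified as a fitness cost, whereas the 103N interval [0.57-1.41] includes one.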

Discussion

In this study we present a computational method to quantify the transmission cost of HIV-1 drug resistance mutations and assess it in the context of the Swiss HIV Cohort Study, as the


Fig 4. Estimates of the transmission ratio of the resistant strains during consumption in Switzerland. For each resistance mutation we estimate a between-host transmission ratio rλ = λr/λs between the per lineage resistant transmission rate λr and the sensitive transmission rate λs. The respective drug consumption dates are given in Table 1. The violin plots show the 95% HPD intervals of the rλ estimates. The same prior distribution was employed for all analyses (plotted on the far right). https://doi.org/10.1371/journal.ppat.1006895.g004

transmission of drug resistant strains is a big threat to the success of global HIV-containment programs (e.g. test-and-treat) [29]. The base data set used here is intentionally limited to samples from patients that have not undergone antiretroviral therapy (ART) at the time of sampling. Including ART experienced individuals would be problematic, because the transmission rates estimated from ART-naïve vs. ART-experienced individuals are likely confounded by several factors such as behavioural differences due to awareness of HIV status and the effect of ART on viral load and persistence of drug resistance mutations [30–36]. Furthermore, we exclude resistance mutations that occur in very few samples, to avoid identifiability issues in our analyses. Our study includes eight major resistance mutations related to nucleoside reverse transcriptase inhibitors, five related to non-nucleoside reverse transcriptase inhibitors and one related to protease inhibitors.

Our selection criteria serve as a filter for those drug resistance mutations that are moderately prevalent and transmissible in Switzerland. Hence, they represent the most important drug resistance mutations to be considered in Switzerland from a clinical and public health perspective. However, this does not include very recently introduced drugs, as they may not cross our thresholds yet. More specifically, it is unlikely to have samples from ten or more patients with a specific mutation conferring resistance to a drug that was introduced less than two or three years ago. For our data set this would translate to any introductions after 2013.

Our approach is based on the assumption that a viral strain can be annotated with either of two types, sensitive or resistant. This is a strong simplification since the presence of multiple


drug resistance mutations may impact viral fitness in a different way than a single mutation. However, most transmitted virus strains in this study contain either zero or one major resistance mutation. We did not include compensatory mutations in our analyses due to their rare occurrence in the data set. Furthermore, we do not distinguish among different risk groups in this study, as there does not appear to be any significant difference in TDR prevalence according to risk groups [37].

We assume that the transmission tree coincides with the phylogeny, which in turn assumes that there is little or no superinfection and that within-host coalescence time is short compared to the viral transmission time. In addition, we assume a constant between-host transmission ratio rλ = λr/λs, which implies that transmission rate changes due to a changing number of susceptible individuals or a change in public health interventions have the same effect on sensitive and resistant hosts. This effect cancels out in the ratio. Note that between-host transmission fitness and within-host replication fitness are intertwined. Within-host replication fitness is one of the factors that contributes to the overall epidemiological fitness of a viral strain, which is commonly measured by the effective reproduction number [38].

There are two technical assumptions of the phylodynamic model that were made for computational reasons only: (1) the removal rate is constant through time and (2) resistance evolution rates can be averaged over (i) connected time periods of no or very little drug usage (< 1%) and (ii) connected time periods of relevant usage (> 1%), see Fig 1 within S1 Text (scenario A). For three of the RMDS (184V, 103N and 90M) we have performed the same analysis again with these assumptions relaxed to have changes in removal rates at two time points and direct dependence of the resistance evolution rates on the annual drug usage data in the SHCS (scenario B). While we see a small decrease in the estimated transmission ratios, the 95% HPDs largely overlap, resulting in qualitatively equivalent interpretations (see Section 1 within S1 Text, esp. Fig 3 within S1 Text). Additionally we have conducted a simulation study in which we simulate three sets of RMDS representative of the actual RMDS 184V, 103N and 90M under the more complex scenario B and reconstruct the trees and epidemiological parameters under the simpler scenario A. Although this does introduce some bias into our estimates, we can robustly estimate the transmission ratio in all three simulation sets (see Section 2 within S1 Text, esp. Fig 5 within S1 Text).
In particular, the simulation study presented in S1 Text shows that we can differentiate among different types of between-host transmission fitness despite differences in population sizes. While there may be variation among individuals that confounds the potential effects of drug resistance mutations on transmission fitness, the overall patterns of viral fitness identified in this study are consistent with other measures of the fitness cost of resistance mutations such as site-directed mutagenesis or reversion rates [10–12]. Furthermore, Wertheim et al. [39] have recently applied a network approach to a large data set from the United States, obtaining results consistent with those presented here. The 90M mutation in the protease and the 138A mutation in the reverse transcriptase have both previously been associated with TDR even in the absence of drugs [40, 41]. In this study, they are the only resistance mutations that have as many as ten resistant samples per cluster. In fact, the 138A mutation occurs in the SHCS as early as 1995 although the drug (rilpivirine) was only introduced in Switzerland in 2013. It appears to be a natural polymorphism, which is present in 0.5% to 5% of viruses from treatment-naïve patients, although it is more common in subtype C than B [26, 27, 40]. Before 2013 the median resistance evolution and reversion rates are 0.008 (95% HPD: [0.005-0.01]) and 0.04 (95% HPD: [0.008-0.08]), respectively. There is much uncertainty in the resistance evolution rate (median 0.14, 95% HPD: [0.00009-0.43]) and reversion rate (median 0.20, 95% HPD: [0.0004-0.82]) after 2013, which is due to the


interval 2013–2015 being relatively short. We estimate a transmission ratio of 1.09 (median; 95% HPD: [0.88, 1.34]) confirming that its fitness is similar to that of sensitive strains. The 90M mutation in the protease is another case of treatment-independent transmission [42, 43]. Our results suggest that it confers a significant fitness advantage over the sensitive lin- eages. While prescription of the drugs (saquinavir and nelfinavir) against which 90M confers resistance has ceased in Switzerland around 2008, the mutation still occurs in later samples. The latest sample showing the 90M mutation within a cluster was obtained in 2011. There are four 90M clusters that contain more than one resistant sample, three of which are driven by men having sex with men (MSM) and one that contains five resistant samples all of which are from heterosexual individuals falling together in a five-sample sub-cluster. Since this appears to reflect the situation in the Swiss HIV epidemic well [44], there does not appear to be a con- founding effect due to risk group. Note that these results are based on a relatively small sample size (Table 3 within S1 Text). However, others have also estimated a fitness advantage for the 90M mutation based on an independent data set [39]. Furthermore, it has been shown previously that 90M prevalence increased in the Swiss HIV cohort in recent years, although it’s occurrence in patients failing treatment is declining [6]. These studies thus support that the 90M mutation has either a fit- ness advantage or at least no significant fitness cost. In contrast, the 184V mutation in the reverse transcriptase stands out as the one with the highest transmission cost (median transmission ratio: 0.43, 95% HPD: [0.18-0.66]). 
Being a major nRTI mutation, 184V is the only resistance mutation in this study that has no cluster with more than two resistant samples (apart from the 215Y RMDS, for which we were not able to obtain reliable results, see Section 3.1 within S1 Text). Hence, it appears that while the 184V mutation evolves frequently under failing treatment, its high between-host transmission cost results in very short transmission chains.

For the 103N mutation in the reverse transcriptase we estimate a transmission ratio of 0.97 (95% HPD: [0.57-1.41]), which may imply that the mutation confers no significant fitness advantage or disadvantage. This major NNRTI mutation is associated with failure of current first-line treatment [2, 9, 45, 46] and is currently the most important NNRTI mutation [47]. The lack of a disadvantage in transmission underlines the clinical importance of the 103N mutation in the context of TDR, particularly in resource-limited settings. For the remaining NNRTI mutations (108I, 181C and 190A) we obtain results very similar to the 103N results, although with larger 95% HPD intervals, particularly for the 108I RMDS, due to small sample sizes (10-12 resistant samples per RMDS).

Extensive propagation of the 103N and 90M mutations is unlikely in Switzerland and other resource-rich countries, where drug resistance testing is performed routinely to identify active drugs. Nevertheless, even in these settings it is crucial to diagnose HIV-positive individuals early to decrease the prevalence of transmitted drug resistance. The transmission potential of these mutations must be taken into account in the management of low-resource HIV epidemics. Although the prevalence of transmitted drug resistance appears to be lower in African countries, for example, this may be due to the later introduction and roll-out of ART.
Hence, there is a large risk of NNRTI-resistant viruses spreading quickly in such settings, which would make the management of the HIV epidemics in resource-limited settings very difficult.

Our results regarding sampled ancestors and the removal probability suggest that patients infected with sensitive strains may be less likely to transmit after diagnosis than patients infected with resistant strains. However, there is considerable uncertainty in the removal probability estimates for resistant strains. This may be overcome by including all type B samples of the cohort (rather than only the treatment-naïve first samples), which will be the subject of


future work. In data sets where transmission frequently occurs after diagnosis and treatment, it would be advisable to include ART status in the analysis.

The merits of molecular epidemiological approaches have been highlighted in previous studies, particularly for cases in which genetic data are combined with epidemiological, demographic and clinical data [19, 41, 48–52]. However, this is one of the first studies presenting a phylogenetic approach that allows direct quantification of population-level fitness costs of HIV resistance mutations from viral sequence data (annotated with sampling date and resistance type) alone. Hence, this approach may become particularly useful in assessing the risk of TDR in resource-limited settings where resistance testing is possible but epidemiological, demographic and clinical data are missing. Furthermore, our approach is applicable not only to HIV but to any measurably evolving pathogen.
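The qualitative reading of the transmission ratios quoted in this discussion can be sketched in a few lines. The snippet below is only an illustration, not part of the published analysis: it assumes that a ratio is read as a cost (or an advantage) when its 95% HPD interval lies entirely below (or above) 1, and the names `RatioEstimate` and `classify` are hypothetical. Only the three estimates whose intervals are stated in the text are included.

```python
from typing import NamedTuple


class RatioEstimate(NamedTuple):
    """Median transmission ratio with its 95% HPD bounds (hypothetical container)."""
    median: float
    hpd_low: float
    hpd_high: float


def classify(est: RatioEstimate) -> str:
    """Assumed reading rule: compare the whole 95% HPD interval with 1."""
    if est.hpd_high < 1.0:
        return "cost"
    if est.hpd_low > 1.0:
        return "advantage"
    return "no clear difference"


# Values as quoted in the text above.
estimates = {
    "138A": RatioEstimate(1.09, 0.88, 1.34),
    "184V": RatioEstimate(0.43, 0.18, 0.66),
    "103N": RatioEstimate(0.97, 0.57, 1.41),
}

labels = {mutation: classify(est) for mutation, est in estimates.items()}
for mutation, label in labels.items():
    print(mutation, label)
```

Under this rule only 184V is labelled as carrying a transmission cost, which matches the interpretation given in the text.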

Supporting information

S1 Fig. Histograms of the number of (i) all and (ii) resistant samples per cluster per resistance mutation. (TIF)

S2 Fig. Estimates of the resistance evolution rates during drug consumption in Switzerland. The violin plots show the 95% HPD intervals of the resistance evolution rate estimates for each resistance mutation. An exponential prior distribution with mean 1 was employed for all analyses. (TIF)

S3 Fig. Estimates of the resistance reversion rates during drug consumption in Switzerland. The violin plots show the 95% HPD intervals of the resistance reversion rate estimates for each resistance mutation. An exponential prior distribution with mean 1 was employed for all analyses. (TIF)

S1 Text. SHCS re-analysis and simulation study. Description and results of reanalysis under complex model, simulation study and supplementary information on clusters and posterior rate estimates. (PDF)

S1 File. XML files. Compressed data file containing analysis files for phylodynamic HIV data analysis under scenarios A and B as well as the XML files for the simulation and reanalysis of RMDS. The analyses were run with BEAST version 2.4.6 and require the bdmm package version 0.2 and its dependencies. (ZIP)

Acknowledgments

We thank the patients who participate in the Swiss HIV Cohort Study (SHCS); the physicians and study nurses, for excellent patient care; the resistance laboratories, for high-quality genotypic drug resistance testing; SmartGene (Zug, Switzerland), for technical support; Brigitte Remy, RN, Martin Rickenbach, MD, Franziska Schöni-Affolter, MD, and Yannick Vallet, MSc, from the SHCS Data Center (Lausanne, Switzerland), for data management; and Danièle Perraudin and Mirjam Minichiello, for administrative assistance.


Members of the Swiss HIV Cohort Study

Aubert V, Battegay M, Bernasconi E, Böni J, Braun DL, Bucher HC, Calmy A, Cavassini M, Ciuffi A, Dollenmaier G, Egger M, Elzi L, Fehr J, Fellay J, Furrer H (Chairman of the Clinical and Laboratory Committee), Fux CA, Günthard HF (President of the SHCS), Haerry D (deputy of ‘Positive Council’), Hasse B, Hirsch HH, Hoffmann M, Hösli I, Kahlert C, Kaiser L, Keiser O, Klimkait T, Kouyos RD, Kovari H, Ledergerber B, Martinetti G, Martinez de Tejada B, Marzolini C, Metzner KJ, Müller N, Nicca D, Pantaleo G, Paioni P, Rauch A (Chairman of the Scientific Board), Rudin C (Chairman of the Mother & Child Substudy), Scherrer AU (Head of Data Centre), Schmid P, Speck R, Stöckle M, Tarr P, Trkola A, Vernazza P, Wandeler G, Weber R, Yerly S.

Author Contributions

Conceptualization: Denise Kühnert, Roger Kouyos, Vincent Aubert, Huldrych F. Günthard, Tanja Stadler, Sebastian Bonhoeffer.
Data curation: Alexandra U. Scherrer, Jürg Böni, Sabine Yerly, Thomas Klimkait.
Formal analysis: Denise Kühnert.
Funding acquisition: Denise Kühnert, Roger Kouyos, Tanja Stadler, Sebastian Bonhoeffer.
Investigation: Denise Kühnert, Roger Kouyos, George Shirreff, Huldrych F. Günthard, Tanja Stadler, Sebastian Bonhoeffer.
Methodology: Denise Kühnert, Roger Kouyos, George Shirreff, Jūlija Pečerska, Tanja Stadler.
Project administration: Denise Kühnert, Tanja Stadler, Sebastian Bonhoeffer.
Resources: Denise Kühnert.
Software: Denise Kühnert.
Supervision: Huldrych F. Günthard, Tanja Stadler, Sebastian Bonhoeffer.
Validation: Denise Kühnert.
Visualization: Denise Kühnert.
Writing – original draft: Denise Kühnert.
Writing – review & editing: Denise Kühnert, Roger Kouyos, Jūlija Pečerska, Huldrych F. Günthard, Tanja Stadler, Sebastian Bonhoeffer.

References

1. Wheeler WH, Ziebell RA, Zabina H, Pieniazek D, Prejean J, Bodnar UR, et al. Prevalence of transmitted drug resistance associated mutations and HIV-1 subtypes in new HIV-1 diagnoses, U.S.-2006. AIDS. 2010; 24(8):1203–12. https://doi.org/10.1097/QAD.0b013e3283388742 PMID: 20395786
2. Wittkop L, Günthard HF, de Wolf F, Dunn D, Cozzi-Lepri A, de Luca A, et al. Effect of transmitted drug resistance on virological and immunological response to initial combination antiretroviral therapy for HIV (EuroCoord-CHAIN joint project): a European multicohort study. Lancet Infect Dis. 2011; 11(5):363–71. https://doi.org/10.1016/S1473-3099(11)70032-9 PMID: 21354861
3. Schmidt D, Kollan C, Fätkenheuer G, Schülter E, Stellbrink HJ, Noah C, et al. Estimating trends in the proportion of transmitted and acquired HIV drug resistance in a long term observational cohort in Germany. PLoS One. 2014; 9(8):e104474. https://doi.org/10.1371/journal.pone.0104474 PMID: 25148412
4. Frange P, Assoumou L, Descamps D, Chéret A, Goujard C, Tran L, et al. HIV-1 subtype B-infected MSM may have driven the spread of transmitted resistant strains in France in 2007-12: impact on susceptibility to first-line strategies. J Antimicrob Chemother. 2015; 70(7):2084–9. PMID: 25885327


5. Mourad R, Chevennet F, Dunn DT, Fearnhill E, Delpech V, Asboe D, et al. A phylotype-based analysis highlights the role of drug-naive HIV-positive individuals in the transmission of antiretroviral resistance in the UK. AIDS. 2015; 29(15):1917–25. https://doi.org/10.1097/QAD.0000000000000768 PMID: 26355570
6. Yang WL, Kouyos R, Scherrer AU, Böni J, Shah C, Yerly S, et al. Assessing the Paradox Between Transmitted and Acquired HIV Type 1 Drug Resistance Mutations in the Swiss HIV Cohort Study From 1998 to 2012. J Infect Dis. 2015; 212(1):28–38. https://doi.org/10.1093/infdis/jiv012 PMID: 25576600
7. Hofstra LM, Sauvageot N, Albert J, Alexiev I, Garcia F, Struck D, et al. Transmission of HIV Drug Resistance and the Predicted Effect on Current First-line Regimens in Europe. Clin Infect Dis. 2016; 62(5):655–63. https://doi.org/10.1093/cid/civ963 PMID: 26620652
8. Hauser A, Hofmann A, Hanke K, Bremer V, Bartmeyer B, Kuecherer C, et al. National molecular surveillance of recently acquired HIV infections in Germany, 2013 to 2014. Euro Surveill. 2017; 22(2). https://doi.org/10.2807/1560-7917.ES.2017.22.2.30436 PMID: 28105988
9. Tostevin A, White E, Dunn D, Croxford S, Delpech V, Williams I, et al. Recent trends and patterns in HIV-1 transmitted drug resistance in the United Kingdom. HIV Med. 2017; 18(3):204–213. https://doi.org/10.1111/hiv.12414 PMID: 27476929
10. Martinez-Picado J, Martínez MA. HIV-1 reverse transcriptase inhibitor resistance mutations and fitness: a view from the clinic and ex vivo. Virus Res. 2008; 134(1–2):104–23. https://doi.org/10.1016/j.virusres.2007.12.021 PMID: 18289713
11. Castro H, Pillay D, Cane P, Asboe D, Cambiano V, Phillips A, et al. Persistence of HIV-1 transmitted drug resistance mutations. J Infect Dis. 2013; 208(9):1459–63. https://doi.org/10.1093/infdis/jit345 PMID: 23904291
12. Yang WL, Kouyos RD, Böni J, Yerly S, Klimkait T, Aubert V, et al. Persistence of transmitted HIV-1 drug resistance mutations associated with fitness costs and viral genetic backgrounds. PLoS Pathog. 2015; 11(3):e1004722. https://doi.org/10.1371/journal.ppat.1004722 PMID: 25798934
13. Alizon S, von Wyl V, Stadler T, Kouyos RD, Yerly S, Hirschel B, et al. Phylogenetic Approach Reveals That Virus Genotype Largely Determines HIV Set-Point Viral Load. PLoS Pathogens. 2010; 6(9):e1001123. https://doi.org/10.1371/journal.ppat.1001123 PMID: 20941398
14. Kouyos RD, von Wyl V, Yerly S, Böni J, Taffe P, Shah C, et al. Molecular epidemiology reveals long-term changes in HIV type 1 subtype B transmission in Switzerland. J Infect Dis. 2010; 201(10):1488–97. https://doi.org/10.1086/651951 PMID: 20384495
15. Stadler T, Kouyos RD, von Wyl V, Yerly S, Böni J, Bürgisser P, et al. Estimating the basic reproductive number from viral sequence data. Molecular Biology and Evolution. 2012; 29:347–357. https://doi.org/10.1093/molbev/msr217 PMID: 21890480
16. Stadler T, Kühnert D, Bonhoeffer S, Drummond AJ. Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proc Natl Acad Sci U S A. 2013; 110(1):228–33. https://doi.org/10.1073/pnas.1207965110 PMID: 23248286
17. Kühnert D, Stadler T, Vaughan TG, Drummond AJ. The Birth-Death SIR model: simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences; 2013.
18. Kühnert D, Stadler T, Vaughan TG, Drummond AJ. Phylodynamics with Migration: A Computational Framework to Quantify Population Structure from Genomic Data. Mol Biol Evol. 2016; 33(8):2102–16. https://doi.org/10.1093/molbev/msw064 PMID: 27189573
19. Rasmussen DA, Kouyos R, Günthard HF, Stadler T. Phylodynamics on local sexual contact networks. PLoS Comput Biol. 2017; 13(3):e1005448. https://doi.org/10.1371/journal.pcbi.1005448 PMID: 28350852
20. Stadler T, Vaughan TG, Gavryushkin A, Guindon S, Kühnert D, Leventhal GE, et al. How well can the exponential-growth coalescent approximate constant-rate birth-death population dynamics? Proc Biol Sci. 2015; 282(1806):20150420. https://doi.org/10.1098/rspb.2015.0420 PMID: 25876846
21. Swiss HIV Cohort Study, Schoeni-Affolter F, Ledergerber B, Rickenbach M, Rudin C, Günthard HF, et al. Cohort profile: the Swiss HIV Cohort study. Int J Epidemiol. 2010; 39(5):1179–89. https://doi.org/10.1093/ije/dyp321 PMID: 19948780
22. Wensing AM, Calvez V, Günthard HF, Johnson VA, Paredes R, Pillay D, et al. 2015 Update of the Drug Resistance Mutations in HIV-1. Top Antivir Med. 2015; 23(4):132–41. PMID: 26713503
23. Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5(3):e9490. https://doi.org/10.1371/journal.pone.0009490 PMID: 20224823
24. Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 2014; 10(4):e1003537. https://doi.org/10.1371/journal.pcbi.1003537 PMID: 24722319


25. Gavryushkina A, Welch D, Stadler T, Drummond AJ. Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Comput Biol. 2014; 10(12):e1003919. https://doi.org/10.1371/journal.pcbi.1003919 PMID: 25474353
26. Sluis-Cremer N, Jordan MR, Huber K, Wallis CL, Bertagnolio S, Mellors JW, et al. E138A in HIV-1 reverse transcriptase is more common in subtype C than B: implications for rilpivirine use in resource-limited settings. Antiviral Res. 2014; 107:31–4. https://doi.org/10.1016/j.antiviral.2014.04.001 PMID: 24746459
27. Theys K, Van Laethem K, Gomes P, Baele G, Pineda-Peña AC, Vandamme AM, et al. Sub-Epidemics Explain Localized High Prevalence of Reduced Susceptibility to Rilpivirine in Treatment-Naive HIV-1-Infected Patients: Subtype and Geographic Compartmentalization of Baseline Resistance Mutations. AIDS Res Hum Retroviruses. 2016; 32(5):427–33. https://doi.org/10.1089/aid.2015.0095 PMID: 26651266
28. Hue S, Pillay D, Clewley JP, Pybus OG. Genetic analysis reveals the complex structure of HIV-1 transmission within defined risk groups. Proc Natl Acad Sci U S A. 2005; 102(12):4425–9. https://doi.org/10.1073/pnas.0407534102 PMID: 15767575
29. WHO | HIV drug resistance report 2017; 2017.
30. Leigh Brown AJ, Frost SDW, Mathews WC, Dawson K, Hellmann NS, Daar ES, et al. Transmission fitness of drug-resistant human immunodeficiency virus and the prevalence of resistance in the antiretroviral-treated population. J Infect Dis. 2003; 187(4):683–6. https://doi.org/10.1086/367989 PMID: 12599087
31. de Mendoza C, Rodriguez C, Corral A, del Romero J, Gallego O, Soriano V. Evidence for differences in the sexual transmission efficiency of HIV strains with distinct drug resistance genotypes. Clin Infect Dis. 2004; 39(8):1231–8. https://doi.org/10.1086/424668 PMID: 15486849
32. Corvasce S, Violin M, Romano L, Razzolini F, Vicenti I, Galli A, et al. Evidence of differential selection of HIV-1 variants carrying drug-resistant mutations in seroconverters. Antivir Ther. 2006; 11(3):329–34. PMID: 16759049
33. Wagner BG, Garcia-Lerma JG, Blower S. Factors limiting the transmission of HIV mutations conferring drug resistance: fitness costs and genetic bottlenecks. Sci Rep. 2012; 2:320. https://doi.org/10.1038/srep00320 PMID: 22432052
34. Rosenbloom DIS, Hill AL, Rabi SA, Siliciano RF, Nowak MA. Antiretroviral dynamics determines HIV evolution and predicts therapy outcome. Nat Med. 2012; 18(9):1378–85. https://doi.org/10.1038/nm.2892 PMID: 22941277
35. Poon AFY, Joy JB, Woods CK, Shurgold S, Colley G, Brumme CJ, et al. The impact of clinical, demographic and risk factors on rates of HIV transmission: a population-based phylogenetic analysis in British Columbia, Canada. J Infect Dis. 2015; 211(6):926–35. https://doi.org/10.1093/infdis/jiu560 PMID: 25312037
36. Winand R, Theys K, Eusébio M, Aerts J, Camacho RJ, Gomes P, et al. Assessing transmissibility of HIV-1 drug resistance mutations from treated and from drug-naive individuals. AIDS. 2015; 29(15):2045–52. https://doi.org/10.1097/QAD.0000000000000811 PMID: 26355575
37. Yerly S, von Wyl V, Ledergerber B, Böni J, Schüpbach J, Bürgisser P, et al. Transmission of HIV-1 drug resistance in Switzerland: a 10-year molecular epidemiology survey. AIDS. 2007; 21(16):2223. https://doi.org/10.1097/QAD.0b013e3282f0b685 PMID: 18090050
38. Keeling MJ, Rohani P. Modeling infectious diseases in humans and animals. Princeton: Princeton University Press; 2008. Available from: http://www.loc.gov/catdir/toc/fy0805/2006939548.html.
39. Wertheim JO, Oster AM, Johnson JA, Switzer WM, Saduvala N, Hernandez AL, et al. Transmission fitness of drug-resistant HIV revealed in a surveillance system transmission network. Virus Evol. 2017; 3(1):vex008. https://doi.org/10.1093/ve/vex008 PMID: 28458918
40. Scherrer AU, Hasse B, von Wyl V, Yerly S, Böni J, Bürgisser P, et al. Prevalence of etravirine mutations and impact on response to treatment in routine clinical care: the Swiss HIV Cohort Study (SHCS). HIV Med. 2009; 10(10):647–56. https://doi.org/10.1111/j.1468-1293.2009.00756.x PMID: 19732174
41. Drescher SM, von Wyl V, Yang WL, Böni J, Yerly S, Shah C, et al. Treatment-naive individuals are the major source of transmitted HIV-1 drug resistance in men who have sex with men in the Swiss HIV Cohort Study. Clin Infect Dis. 2014; 58(2):285–94. https://doi.org/10.1093/cid/cit694 PMID: 24145874
42. Mbisa JL, Fearnhill E, Dunn DT, Pillay D, Asboe D, Cane PA, et al. Evidence of Self-Sustaining Drug Resistant HIV-1 Lineages Among Untreated Patients in the United Kingdom. Clin Infect Dis. 2015; 61(5):829–36. https://doi.org/10.1093/cid/civ393 PMID: 25991470
43. Kouyos RD, Günthard HF. Editorial Commentary: The Irreversibility of HIV Drug Resistance. Clin Infect Dis. 2015; 61(5):837–9. https://doi.org/10.1093/cid/civ400 PMID: 25991467


44. Turk T, Bachmann N, Kadelka C, Böni J, Yerly S, Aubert V, et al. Assessing the danger of self-sustained HIV epidemics in heterosexuals by population based phylogenetic cluster analysis. Elife. 2017; 6. https://doi.org/10.7554/eLife.28721 PMID: 28895527
45. Kuritzkes DR, Lalama CM, Ribaudo HJ, Marcial M, Meyer WA 3rd, Shikuma C, et al. Preexisting resistance to nonnucleoside reverse-transcriptase inhibitors predicts virologic failure of an efavirenz-based regimen in treatment-naive HIV-1-infected subjects. J Infect Dis. 2008; 197(6):867–70. https://doi.org/10.1086/528802 PMID: 18269317
46. Cozzi-Lepri A, Paredes, Phillips AN, Clotet B, Kjaer J, Von Wyl V, et al. The rate of accumulation of non-nucleoside reverse transcriptase inhibitor (NNRTI) resistance in patients kept on a virologically failing regimen containing an NNRTI*. HIV Med. 2012; 13(1):62–72. PMID: 21848790
47. Gupta RK, Gregson J, Parkin N, Haile-Selassie H, Tanuri A, Andrade Forero L, et al. HIV-1 drug resistance before initiation or re-initiation of first-line antiretroviral therapy in low-income and middle-income countries: a systematic review and meta-regression analysis. Lancet Infect Dis. 2017; https://doi.org/10.1016/S1473-3099(17)30702-8
48. Hughes GJ, Fearnhill E, Dunn D, Lycett SJ, Rambaut A, Leigh Brown AJ, et al. Molecular phylodynamics of the heterosexual HIV epidemic in the United Kingdom. PLoS Pathog. 2009; 5(9):e1000590. https://doi.org/10.1371/journal.ppat.1000590 PMID: 19779560
49. Leigh Brown AJ, Lycett SJ, Weinert L, Hughes GJ, Fearnhill E, Dunn DT, et al. Transmission network parameters estimated from HIV sequences for a nationwide epidemic. J Infect Dis. 2011; 204(9):1463–9. https://doi.org/10.1093/infdis/jir550 PMID: 21921202
50. Avila D, Keiser O, Egger M, Kouyos R, Böni J, Yerly S, et al. Social meets molecular: Combining phylogenetic and latent class analyses to understand HIV-1 transmission in Switzerland. Am J Epidemiol. 2014; 179(12):1514–25. https://doi.org/10.1093/aje/kwu076 PMID: 24821749
51. Hue S, Brown AE, Ragonnet-Cronin M, Lycett SJ, Dunn DT, Fearnhill E, et al. Phylogenetic analyses reveal HIV-1 infections between men misclassified as heterosexual transmissions. AIDS. 2014; 28(13):1967–75. https://doi.org/10.1097/QAD.0000000000000383 PMID: 24991999
52. Ragonnet-Cronin M, Lycett SJ, Hodcroft EB, Hue S, Fearnhill E, Brown AE, et al. Transmission of Non-B HIV Subtypes in the United Kingdom Is Increasingly Driven by Large Non-Heterosexual Transmission Clusters. J Infect Dis. 2016; 213(9):1410–8. https://doi.org/10.1093/infdis/jiv758 PMID: 26704616


APPENDIX

Supplementary Text S1

1 Reanalysis of SHCS data including drug usage data and changes in recovery rate

The phylodynamic analyses described in the main text were repeated under a more complex model specification for three of the RMDS: 184V, 103N and 90M. There are two differences in the model specification, see Figure 1 B:
1. Since we know how drug usage changed over time in percentage treated in the Swiss HIV cohort study (SHCS), we have implemented a variation of our model that estimates a resistance evolution rate which is proportional to the drug usage percentage of the relevant drug(s). Hence, we estimate a scaling factor instead of the resistance evolution rate itself. Tables 1-2 contain the relevant drug usage percentages in the SHCS per year.
2. We estimate the removal rate δ as a piecewise constant rate, with rate changes allowed at two time points: (i) in 2000 to account for the introduction of boosted protease inhibitors and (ii) in 2008 to account for the introduction of integrase inhibitors, since both events have likely caused a reduction in the time until infected individuals become virally suppressed and are hence removed from the infectious pool.
The resulting effective reproduction numbers in the susceptible population and the transmission ratios rλ of the respective resistance mutations are summarized in Figures 2-3.
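The two model changes described above can be illustrated with a minimal sketch. It is not the BEAST/bdmm implementation used for the analyses; the function names, the scaling factor and the three removal-rate values in the usage lines are hypothetical, and only the usage fractions (a subset of Table 1) and the change points 2000 and 2008 come from the text.

```python
from bisect import bisect_right

# Fraction of SHCS patients on NFV or SQV, copied from Table 1 (subset of years).
NFV_SQV_USAGE = {1996: 0.0124, 1997: 0.2211, 1998: 0.4006, 1999: 0.4172}

# Times at which the removal rate is allowed to change (from the text).
REMOVAL_CHANGE_TIMES = [2000.0, 2008.0]


def resistance_evolution_rate(scaling_factor, usage_by_year, year):
    """Rate in a given year, proportional to drug usage.

    Only the scaling factor would be estimated; usage is treated as known.
    """
    return scaling_factor * usage_by_year[year]


def removal_rate(t, rates):
    """Piecewise-constant removal rate: one value per interval between change times."""
    if len(rates) != len(REMOVAL_CHANGE_TIMES) + 1:
        raise ValueError("need one rate per interval")
    # bisect_right assigns a change time itself to the later interval.
    return rates[bisect_right(REMOVAL_CHANGE_TIMES, t)]


# Hypothetical parameter values, for illustration only.
example_rate = resistance_evolution_rate(0.5, NFV_SQV_USAGE, 1998)
example_delta = [removal_rate(t, [0.3, 0.4, 0.5]) for t in (1999.5, 2000.0, 2010.0)]
```

Treating the usage curve as fixed input leaves a single free parameter for resistance evolution, which is the stated motivation for estimating a scaling factor rather than the rate itself.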

Fig 1. Schematic presentation of the simulation and analysis setups. The results presented in the main text were obtained using analysis setup A. In the simulation study, scenario B was assumed during simulation, whereas inference was done under scenario A. The SHCS results shown below were inferred with analysis setup B.

Table 1. Percentage of SHCS patients treated with one of the drugs associated with the 90M, 103N and 184V mutations, respectively, per year from 1993 to 2003.
drugs       1993    1994    1995    1996    1997    1998    1999    2000    2001    2002    2003    mutation
NFV or SQV  0       0.0002  0.0012  0.0124  0.2211  0.4006  0.4172  0.3208  0.2684  0.2097  0.1602  90M
EFV or NVP  0       0       0       0.0002  0.0131  0.0514  0.2169  0.2886  0.3151  0.3257  0.3668  103N
3CT or ETC  0.0007  0.0016  0.0265  0.4520  0.6257  0.5923  0.6056  0.6030  0.6078  0.5948  0.6016  184V

The analysis setup was identical to the analyses described in the main text. In particular, the same prior distributions were employed

Table 2. Percentage of SHCS patients treated with one of the drugs associated with the 90M, 103N and 184V mutations, respectively, per year from 2004 to 2014.
drugs       2004    2005    2006    2007    2008    2009    2010    2011    2012    2013    2014    mutation
NFV or SQV  0.1141  0.0812  0.0536  0.0347  0.0024  0.0026  0.0021  0.0015  0.0010  0.0009  0.0003  90M
EFV or NVP  0.3748  0.3523  0.3593  0.3715  0.3844  0.3894  0.4005  0.4088  0.4075  0.3857  0.3484  103N
3CT or ETC  0.6148  0.6497  0.7005  0.8569  0.8156  0.8394  0.8677  0.8952  0.8967  0.9145  0.8921  184V

Fig 2. Estimates of the effective reproduction number Rs of the sensitive strains through time. Time has been partitioned into 4 fixed time intervals: before 1994, 1994-2001, 2001-2008, 2008-2015. For each time interval there are three estimates, one from each of the resistance-mutation data sets 184V, 103N and 90M. The violin plots show the 95% HPDs of the Rs estimates.
[Figure: violin plots; y-axis: Rs; x-axis: time interval (Before 1994, 1994−2001, 2001−2008, 2008−2015).]

2 Simulation study: How well does the simple model capture the complex transmission dynamics?

In order to understand the effect of the simplifying assumptions that we made in the analyses under scenario A, we conducted a simulation study in which we simulated three sets of resistance mutation data sets (RMDS) under the complex model scenario B, which we then analysed assuming the simpler scenario A. The difference between the two scenarios is illustrated in Figure 1. The simulation parameters were chosen such that each of the three sets represents one of the three resistance mutations 184V, 103N and 90M. That is, in the 184V-like set of simulations we assume that there is a between-host fitness cost, for the 103N-like set we assume near-neutrality and for the 90M-like set we assume a fitness advantage for the strains carrying the mutation.

2.1 Simulating resistance-mutation data sets

For each of the three sets we simulated 40 replicates under scenario B. For each replicate a transmission tree was simulated using MASTER [?]. The simulated outbreaks start with a single host infected with a sensitive strain in year 1982. The effective reproduction number of the sensitive strains is piecewise constant, with

Fig 3. Estimates of the transmission ratio of the resistant strains during consumption in Switzerland. For each resistance mutation we estimate a between-host transmission ratio rλ = λr/λs between the per lineage resistant transmission rate λr and the sensitive transmission rate λs. The violin plots show the 95% HPD intervals of the rλ estimates.
[Figure: violin plots; y-axis: transmission ratio; x-axis: resistance mutation (184V, 103N, 90M).]

changes in years 2000, 2007 and 2014. The transmission ratio rλ is set to 1.61 for the 90M-like simulation set, 0.96 for the 103N-like set and 0.41 for the 184V-like set. While the rate of resistance evolution changes annually, proportional to the respective drug usage in the SHCS (Tables 1-2), the resistance reversion rates are constant through time. The recovery rate increases twice, in 2000 and in 2008, with a population average infectious period of about 3.5, 3 and 2.5 years, assuming that infected individuals get diagnosed and successfully treated faster in the recent periods due to major improvements in antiretroviral therapy. The sensitive and resistant sampling proportions are set to zero until the time the first sensitive (or resistant) sample occurred in the respective SHCS RMDS and are a positive constant afterwards. An individual simulation was stopped when 150 samples (representing infected hosts) had been generated and was accepted as valid if it had a minimum of 1 resistant sample and a total (sum over sensitive and resistant samples) of at least 2 samples.

From the transmission trees generated with MASTER, sequence alignments of length 2000 were simulated using SeqGen [?]. The sequences were generated under the Hasegawa-Kishino-Yano (HKY) model of sequence evolution and an uncorrelated relaxed clock model assuming log-normally distributed rates of evolution among the branches of the tree with a mean substitution rate of 2.55 × 10⁻³ and a standard deviation of 0.5. The branch rate variation was included in order to mimic variation in the trees that may be caused by differences in within-host replicative dynamics. For each replicate we simulated 10 sequence alignments representing 10 transmission chains of randomly varying sizes (from 2-150, see Figure 4).
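Two of the simulation settings described above translate directly into small computations, sketched here for illustration. The sketch assumes that the recovery rate is the reciprocal of the mean infectious period (i.e. exponentially distributed infectious periods), which the text does not state explicitly; the function names are hypothetical and this is not the MASTER configuration itself.

```python
# Mean infectious periods in years for the three epochs
# (before 2000, 2000 to 2008, from 2008), as given in the text.
INFECTIOUS_PERIODS = [3.5, 3.0, 2.5]


def recovery_rates(periods):
    """Assumption: rate = 1 / mean infectious period (per year)."""
    return [1.0 / p for p in periods]


def is_valid_replicate(n_sensitive, n_resistant):
    """Acceptance rule from the text: at least 1 resistant sample
    and at least 2 samples in total."""
    return n_resistant >= 1 and (n_sensitive + n_resistant) >= 2


rates = recovery_rates(INFECTIOUS_PERIODS)
print([round(r, 3) for r in rates])
print(is_valid_replicate(1, 1), is_valid_replicate(5, 0), is_valid_replicate(0, 1))
```

Under this assumption the recovery rate rises from roughly 0.29 to 0.40 per year across the three epochs, consistent with the statement that it increases twice.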

2.2 Re-estimating simulated transmission fitness

The analysis setup for the simulated data sets was identical to the SHCS analyses described in the main text. In particular, the same prior distributions were employed,

Fig 4. Overview of the simulation study.

with one exception: for the resistance evolution and reversion parameters an Exp(0.01) prior distribution was used in the simulation study. Although the resistance evolution and reversion rates were robust to different prior distributions in the SHCS analyses, the estimates of those parameters in the simulation study were not; they differed when different prior distributions were employed.

Analysing the RMDS under the simplified model scenario A reduces the number of parameters to be estimated, which lowers the computational complexity and improves identifiability of the epidemiological parameters. The price we usually pay for this is a loss of accuracy. The aim of this simulation study was to understand how well we can still estimate the between-host transmission fitness. The accuracy of the reconstructed transmission ratios rλ, i.e. the fraction of simulation replicates for which the 95% highest posterior density interval contains the true value, is 87%, 73% and 100% for the 90M-like, 103N-like and 184V-like RMDS, respectively; two of these are below the desired 95% accuracy. However, as depicted in Figure 5, for the simulation replicates in which rλ is inaccurately estimated, the estimate is still close to the truth and qualitatively correct. Indeed, all 184V-like simulation replicates yield a rλ significantly smaller than 1, all 90M-like simulation replicates yield a rλ significantly larger than 1, and all 103N-like simulation replicates yield a rλ around or close to 1.
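The accuracy measure used above (the share of replicates whose 95% HPD interval contains the true value) can be sketched as follows. This is a generic illustration rather than the code used for the study: the HPD interval is approximated as the shortest window covering the requested fraction of sorted posterior samples, and the function names are hypothetical.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("no samples")
    k = min(n, max(1, math.ceil(mass * n)))  # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def coverage(replicates, true_value, mass=0.95):
    """Fraction of replicates whose HPD interval contains the true value."""
    hits = 0
    for samples in replicates:
        low, high = hpd_interval(samples, mass)
        if low <= true_value <= high:
            hits += 1
    return hits / len(replicates)


# Toy posteriors: the first covers the value 50, the second does not.
toy_replicates = [list(range(100)), list(range(200, 300))]
toy_coverage = coverage(toy_replicates, true_value=50)
```

With a well-calibrated analysis the coverage is expected to be close to the nominal 95%; values below that, as reported for the 90M-like and 103N-like sets, indicate intervals that miss the truth more often than nominal.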

3 Supplementary information for the scenario A analyses presented in main text

Table 3 gives an overview of the cluster characteristics per RMDS.

mutation  nrSensitive  nrResistant  firstSen  lastSen  firstRes  lastRes
41L       0            2            -         -        2003.21   2007.15
-         7            1            1995.53   2009.68  2006.34   2006.34
-         6            2            1995.45   2003.16  1998.75   2006.32
-         58           3            1993.1    2011.21  1998.52   2004.78
-         3            1            1997.23   2008.29  2003.75   2003.75
-         1            1            1996.37   1996.37  1995.88   1995.88
-         7            1            1995.53   2001.95  1995.76   1995.76
-         3            1            2008.62   2010.83  1998.38   1998.38
-         27           1            1995.78   2013.8   2004.46   2004.46
-         4            1            1991.97   2000.63  1997.28   1997.28
-         2            6            1997.22   2003.39  2001.18   2012.96
-         1            1            1999.24   1999.24  2003.87   2003.87
-         6            3            1997.12   2006.91  1999.17   2001.95
-         42           2            1995.57   2009.02  2001.68   2004.29
-         165          3            1994.92   2014.23  1999.76   2009.98

mutation  nrSensitive  nrResistant  firstSen  lastSen  firstRes  lastRes
-         7            1            1995.94   1999.67  1999.87   1999.87
-         5            1            1992.83   2014.14  1996.25   1996.25
-         1            1            1999.97   1999.97  2001.7    2001.7
-         1            1            2000.89   2000.89  2000.03   2000.03
-         29           1            1995.15   2011.74  2002.82   2002.82
-         1            1            1998.96   1998.96  1997.7    1997.7
-         33           1            1995.16   2012.77  1995.74   1995.74
-         6            1            1997.11   2008.93  2000.81   2000.81
-         7            1            1996.48   2011.27  1996.52   1996.52
-         5            2            1995.63   2010.41  1997.39   2013.78
-         47           7            1994.91   2013.45  1997.77   2001.12
-         0            4            -         -        2006.83   2012.51
-         1            2            2013.41   2013.41  2009.14   2009.2
-         1            6            2007.22   2007.22  2008.19   2012.64
-         12           1            1995.38   2014.26  2009.52   2009.52
-         24           2            1995.88   2014.95  2000.8    2004.5
-         7            1            1996.14   2008.77  2010.82   2010.82
-         61           1            1996.77   2015.19  2010.86   2010.86
-         0            1            -         -        2005.68   2005.68
-         17           1            1995.97   2010.73  1999.55   1999.55
-         3            1            1996.22   2003.38  2001.39   2001.39
-         1            1            1997.44   1997.44  2001.78   2001.78
-         3            1            1996.89   2007.79  1999.8    1999.8
-         3            1            1997.82   2007.45  2001.31   2001.31
-         0            2            -         -        2012.06   2012.11
-         1            2            2011.7    2011.7   2012.14   2012.17
-         36           1            1993.08   2013.45  1997.64   1997.64
-         2            1            2002.02   2002.02  1997.96   1997.96
-         1            1            2008.9    2008.9   1999.44   1999.44
-         4            1            2004.86   2008.86  1999.47   1999.47
-         38           1            1995.92   2012.76  1998.08   1998.08
-         63           1            1995.07   2009.18  1997.1    1997.1
-         16           1            1995.51   2005.65  1998.12   1998.12
-         0            3            -         -        2012.52   2012.6
-         3            1            1997.23   2013.15  1999.02   1999.02
-         6            1            2000.28   2014.52  2000.7    2000.7
-         1            1            2006.91   2006.91  1997.93   1997.93
-         4            1            1995.93   2007.47  1996.16   1996.16
-         14           1            1995.93   2010.69  2010.75   2010.75
-         6            1            1995.9    2006.44  2003.19   2003.19
-         0            1            -         -        2013.19   2013.19
-         3            1            1997.58   2013.41  2006.63   2006.63
-         29           1            1996.13   2012.94  2005.48   2005.48
67N       0            6            -         -        2006.63   2014.31
-         3            1            1997.23   2008.29  2003.75   2003.75
-         27           1            1995.78   2013.8   2004.46   2004.46
-         47           1            1994.91   2013.45  1995.54   1995.54
-         1            2            2007.22   2007.22  1995.59   1999.78
-         1            1            1999.24   1999.24  2003.87   2003.87
-         1            1            2003.73   2003.73  1995.61   1995.61
-         29           1            1996.13   2012.94  1995.67   1995.67
-         3            2            2002.9    2011.88  2007.84   2007.84
-         165          2            1994.92   2014.23  1999.76   2000.69
-         9            1            1996.23   2009.85  2003.2    2003.2
-         58           2            1993.1    2011.21  1998.52   2004.63
-         10           1            2001.03   2013.81  2000.07   2000.07

PLOS 5/15 mutation nrSensitive nrResistant firstSen lastSen firstRes lastRes 31 1 1996.35 2013.54 1996.51 1996.51 69 1 1994.91 2012.12 1997.43 1997.43 0 1 2008.39 2008.39 1 1 1995.91 1995.91 1999.94 1999.94 21 3 1995.56 2014.91 2008.36 2013.05 6 1 2008.29 2009.62 1996.75 1996.75 21 2 1996.87 2014.43 1996.34 2006.13 38 1 1995.92 2012.76 1998.08 1998.08 24 1 1995.88 2014.95 1999.19 1999.19 63 1 1995.07 2009.18 1997.1 1997.1 0 4 2012.53 2014.58 70R 0 4 2007.18 2009.94 1 2 2007.22 2007.22 1995.59 1999.78 1 1 1999.24 1999.24 2003.87 2003.87 4 1 1997.41 2010.08 1997.29 1997.29 9 1 1996.23 2009.85 2003.2 2003.2 5 1 1996.17 2008.14 1995.92 1995.92 165 2 1994.92 2014.23 2001.5 2007.43 31 1 1996.35 2013.54 1996.51 1996.51 42 2 1995.57 2009.02 1991.8 1996.61 5 1 1995.93 2001.3 1996.76 1996.76 183 1 1993.37 2014.33 1997.27 1997.27 69 1 1994.91 2012.12 1997.43 1997.43 3 1 1995.68 1996.72 1997.92 1997.92 1 1 1995.91 1995.91 1999.94 1999.94 6 1 2008.29 2009.62 1996.75 1996.75 11 2 1997.6 2010.48 2001.11 2002.68 24 1 1995.88 2014.95 1999.19 1999.19 123 1 1993.04 2015.1 1997.58 1997.58 3 1 1995.76 2008.87 1995.94 1995.94 184V 61 1 1994.89 2012.54 2006.94 2006.94 0 2 2003.86 2007.36 24 1 1996.98 2012.84 2007.39 2007.39 3 1 1997.23 2008.29 2003.75 2003.75 0 1 2003.21 2003.21 27 2 1995.78 2013.8 2000.29 2004.46 183 1 1993.37 2014.33 2003.84 2003.84 1 1 1999.24 1999.24 2003.87 2003.87 6 2 1995.9 2006.44 2003.19 2004.14 1 1 2000.89 2000.89 2000.03 2000.03 58 2 1993.1 2011.21 1998.52 2004.63 165 2 1994.92 2014.23 1996.28 2011.3 0 1 2008.39 2008.39 8 1 1995.71 2010.83 1999.21 1999.21 8 1 1995.65 2011.15 2004.61 2004.61 5 1 1995.62 2007.74 2010.67 2010.67 1 1 1999.05 1999.05 2010.97 2010.97 11 2 1997.6 2010.48 2001.11 2002.68 3 1 1997.43 2004.12 2000.7 2000.7 17 1 1995.97 2010.73 1999.55 1999.55 3 1 1996.22 2003.38 2001.39 2001.39 42 1 1995.57 2009.02 2001.68 2001.68 3 1 1996.89 2007.79 1999.8 1999.8 4 1 1999.56 2009.1 2003.43 2003.43 36 1 1993.08 2013.45 1997.64 1997.64 32 1 1995.82 
2012.84 1999.15 1999.15

PLOS 6/15 mutation nrSensitive nrResistant firstSen lastSen firstRes lastRes 1 1 2008.9 2008.9 1999.44 1999.44 0 1 1999.85 1999.85 38 1 1995.92 2012.76 1998.08 1998.08 25 1 1996.47 2013.31 1999.95 1999.95 123 1 1993.04 2015.1 1996.93 1996.93 3 1 1997.23 2013.15 1999.02 1999.02 16 1 1995.51 2005.65 1999.03 1999.03 1 1 2007.22 2007.22 1999.78 1999.78 15 1 1995.56 2004.65 2000.03 2000.03 29 1 1995.64 2013.64 1998.27 1998.27 9 1 1997.12 2009.7 1999.5 1999.5 5 1 1995.63 2010.41 2013.78 2013.78 215D 8 3 1996.43 2013.49 2004.23 2007.02 0 1 2007.15 2007.15 27 1 1995.78 2013.8 2004.46 2004.46 15 1 1995.56 2004.65 1998.43 1998.43 2 6 1997.22 2003.39 2001.18 2012.96 165 1 1994.92 2014.23 2001.5 2001.5 5 1 1995.63 2010.41 1997.39 1997.39 47 7 1994.91 2013.45 1997.77 2001.12 1 7 2007.22 2007.22 2008.19 2012.64 24 2 1995.88 2014.95 2000.8 2004.5 61 1 1996.77 2015.19 2010.86 2010.86 0 2 2012.06 2012.11 2 1 2002.02 2002.02 1997.96 1997.96 123 1 1993.04 2015.1 1998.02 1998.02 1 1 1996.22 1996.22 2002.94 2002.94 33 1 2005.75 2014.26 2000.67 2000.67 7 1 1996.48 2011.27 1997.1 1997.1 4 1 1995.93 2007.47 1996.16 1996.16 0 1 2013.92 2013.92 3 1 1997.58 2013.41 2006.63 2006.63 215S 6 1 1995.45 2003.16 2006.32 2006.32 0 1 2001.06 2001.06 40 2 1995.61 2007.47 2003.52 2008.05 5 1 1996.4 2005.48 1996.81 1996.81 0 2 2005.68 2008.3 129 6 1989.67 2012.24 2001.26 2005.53 0 5 2006.83 2012.51 1 2 1997.44 1997.44 2001.78 2008.41 31 1 1996.35 2013.54 2009.38 2009.38 12 1 1995.38 2014.26 2009.52 2009.52 4 1 1995.76 2003.93 2005.84 2005.84 21 2 1996.87 2014.43 1996.34 2006.13 12 1 1995.59 2010.23 1995.99 1995.99 165 1 1994.92 2014.23 2006.51 2006.51 2 1 1997.33 1999.45 1997.22 1997.22 2 1 1994.64 2000.61 2002.72 2002.72 0 1 2013.19 2013.19 33 1 1995.16 2012.77 2003.5 2003.5 215Y 61 1 1994.89 2012.54 2006.94 2006.94 7 1 1995.53 2001.95 1995.76 1995.76 3 1 2008.62 2010.83 1998.38 1998.38 4 1 1991.97 2000.63 1997.28 1997.28 165 2 1994.92 2014.23 1999.76 2005.73 5 1 1992.83 2014.14 1996.25 1996.25

PLOS 7/15 mutation nrSensitive nrResistant firstSen lastSen firstRes lastRes 58 2 1993.1 2011.21 1998.52 2004.63 33 1 1995.16 2012.77 1995.74 1995.74 129 1 1989.67 2012.24 1996.24 1996.24 7 1 1996.48 2011.27 1996.52 1996.52 40 2 1995.61 2007.47 1996.54 1999.11 1 1 1999.05 1999.05 2010.97 2010.97 6 1 2008.29 2009.62 1996.75 1996.75 17 1 1995.97 2010.73 1999.55 1999.55 3 1 1996.22 2003.38 2001.39 2001.39 42 1 1995.57 2009.02 2001.68 2001.68 6 1 1995.45 2003.16 1998.75 1998.75 36 1 1993.08 2013.45 1997.64 1997.64 1 1 2008.9 2008.9 1999.44 1999.44 38 1 1995.92 2012.76 1998.08 1998.08 63 1 1995.07 2009.18 1997.1 1997.1 16 1 1995.51 2005.65 1998.12 1998.12 3 1 1997.23 2013.15 1999.02 1999.02 29 1 1995.64 2013.64 1996.94 1996.94 6 1 1995.9 2006.44 2003.19 2003.19 219Q 3 1 1997.23 2008.29 2003.75 2003.75 27 1 1995.78 2013.8 2004.46 2004.46 47 1 1994.91 2013.45 1995.54 1995.54 1 2 2007.22 2007.22 1995.59 1999.78 15 1 1995.56 2004.65 1997.51 1997.51 1 1 2003.73 2003.73 1995.61 1995.61 29 1 1996.13 2012.94 1995.67 1995.67 3 2 2002.9 2011.88 2007.84 2007.84 31 1 1996.35 2013.54 1996.51 1996.51 183 1 1993.37 2014.33 1997.27 1997.27 1 1 2012.95 2012.95 1998.42 1998.42 1 1 1995.91 1995.91 1999.94 1999.94 165 1 1994.92 2014.23 2000.69 2000.69 5 1 1995.62 2007.74 2010.04 2010.04 21 3 1995.56 2014.91 2008.36 2013.05 6 1 2008.29 2009.62 1996.75 1996.75 21 2 1996.87 2014.43 1996.34 2006.13 0 4 2012.53 2014.58 2 1 2011.46 2011.75 2002.04 2002.04 15 1 1995.81 2009.21 2001.39 2001.39 210W 7 2 1995.72 2010.33 2005.6 2007.03 33 1 1995.16 2012.77 2006.78 2006.78 40 1 1995.61 2007.47 2003.52 2003.52 2 6 1997.22 2003.39 2001.18 2012.96 165 2 1994.92 2014.23 1999.76 2005.73 5 1 1992.83 2014.14 1996.25 1996.25 58 2 1993.1 2011.21 1998.52 2004.63 29 1 1995.15 2011.74 2002.82 2002.82 1 1 2006.77 2006.77 1996.24 1996.24 17 1 1995.97 2010.73 1999.55 1999.55 3 1 1996.22 2003.38 2001.39 2001.39 42 1 1995.57 2009.02 2001.68 2001.68 1 1 2008.9 2008.9 1999.44 1999.44 4 1 2004.86 2008.86 1999.47 1999.47 
38 1 1995.92 2012.76 1998.08 1998.08 3 1 1997.23 2013.15 1999.02 1999.02 1 1 1996.22 1996.22 2002.94 2002.94

PLOS 8/15 mutation nrSensitive nrResistant firstSen lastSen firstRes lastRes 6 1 1995.9 2006.44 2003.19 2003.19 103N 7 5 1995.72 2010.33 2005.6 2008.83 10 1 1995.63 2008.5 2000.45 2000.45 38 1 1995.47 2011.82 2007.38 2007.38 3 1 2011.24 2014.06 2007.35 2007.35 24 1 1996.98 2012.84 2007.39 2007.39 183 1 1993.37 2014.33 2003.84 2003.84 8 1 1996.18 2011.65 2007.83 2007.83 6 2 1995.9 2006.44 2003.19 2004.14 58 3 1993.1 2011.21 2007.83 2009.94 21 1 1995.56 2014.91 2007.97 2007.97 13 2 1996.93 2013.24 2008.37 2012.66 17 2 1995.97 2010.73 2003.91 2006.41 8 3 1995.65 2011.15 2009.16 2009.65 8 2 1995.66 2013.51 2009.77 2013.24 1 1 2004.48 2004.48 2010.5 2010.5 48 1 1995.68 2010.71 2010.52 2010.52 6 1 2003.05 2011.95 2004.96 2004.96 1 1 1999.05 1999.05 2010.97 2010.97 0 1 2011.2 2011.2 42 1 1995.57 2009.02 2001.68 2001.68 33 1 2005.75 2014.26 2005.09 2005.09 1 1 2007.22 2007.22 2011.84 2011.84 3 1 1999.73 1999.77 2006.3 2006.3 4 1 1999.56 2009.1 2003.43 2003.43 123 1 1993.04 2015.1 2003.18 2003.18 21 1 1995.57 2010.67 2010.88 2010.88 108I 3 1 2011.24 2014.06 2007.35 2007.35 24 1 1996.98 2012.84 2007.39 2007.39 21 1 1995.56 2014.91 2007.97 2007.97 183 2 1993.37 2014.33 2002.33 2002.67 38 1 1995.47 2011.82 1998.68 1998.68 5 1 1995.93 2001.3 2005.26 2005.26 2 1 1998.31 2010.72 2011.38 2011.38 42 1 1995.57 2009.02 2001.68 2001.68 4 1 1999.56 2009.1 2003.43 2003.43 1 1 2011.58 2011.58 1996.79 1996.79 138A 2 8 1998.31 2010.72 2006.52 2011.38 25 3 1996.47 2013.31 1999.95 2006.92 0 10 2001.39 2013.95 7 3 1995.88 2003.62 2003.47 2006.99 2 6 1997.33 1999.45 1996.22 2007.24 2 1 1999.12 2010.18 1996.02 1996.02 1 7 1997.77 1997.77 2004.79 2010.02 23 2 1995.51 2008.45 2003.14 2005.38 165 7 1994.92 2014.23 1995.34 2014.46 21 8 1995.56 2014.91 2004.23 2011.49 6 1 1997.11 2008.93 2004.18 2004.18 0 2 2000.03 2001.86 1 6 2000.08 2000.08 2008.02 2014.29 1 2 1996.49 1996.49 2004.49 2006.49 21 1 1996.87 2014.43 1996.05 1996.05 123 2 1993.04 2015.1 1996.06 1996.34 0 2 1997.77 2006.21 1 1 1997.16 
1997.16 1998.78 1998.78 36 1 1993.08 2013.45 1997.08 1997.08

PLOS 9/15 mutation nrSensitive nrResistant firstSen lastSen firstRes lastRes 42 1 1995.57 2009.02 1996.61 1996.61 3 2 1998.32 2003.22 1996.83 1997.67 1 2 2011.35 2011.35 2005.32 2008.82 3 1 2009.18 2010.47 2006.81 2006.81 10 1 2001.03 2013.81 2008.85 2008.85 20 3 2004.47 2013.58 2008.94 2009.04 14 1 1995.64 2007.68 2004.86 2004.86 12 1 1995.38 2014.26 2009.52 2009.52 15 1 1995.56 2004.65 2004.26 2004.26 1 1 2003.69 2003.69 2004.86 2004.86 29 1 1995.64 2013.64 2004.98 2004.98 33 1 1995.16 2012.77 2009.98 2009.98 1 1 2004.48 2004.48 2010.5 2010.5 0 2 2010.94 2011.59 0 2 1996.43 2005.93 3 1 1999.1 2009.31 2002.47 2002.47 21 1 1995.57 2010.67 2002.18 2002.18 1 1 2002.54 2002.54 2000.83 2000.83 14 2 1995.93 2010.69 1998.78 1999.19 3 1 1997.1 2003.04 2006.17 2006.17 3 1 1999.73 1999.77 2006.3 2006.3 58 1 1993.1 2011.21 1997.43 1997.43 47 1 1994.91 2013.45 1998.88 1998.88 61 1 1994.89 2012.54 1997.29 1997.29 63 1 1995.07 2009.18 2008.6 2008.6 10 1 1996.43 2008.42 2003.24 2003.24 0 2 2013.14 2013.2 0 1 2013.75 2013.75 181C 7 1 1995.53 2009.68 2006.34 2006.34 3 1 2011.24 2014.06 2007.35 2007.35 3 1 1994.82 2003.18 2003.66 2003.66 3 1 1997.23 2008.29 2003.75 2003.75 48 1 1995.68 2010.71 2007.22 2007.22 129 3 1989.67 2012.24 2007.3 2010.92 123 1 1993.04 2015.1 1998.02 1998.02 3 1 1997.58 2013.41 2006.63 2006.63 190A 165 5 1994.92 2014.23 2006.51 2013.56 40 1 1995.61 2007.47 2007.3 2007.3 3 1 1994.82 2003.18 2003.66 2003.66 48 1 1995.68 2010.71 2007.22 2007.22 10 1 1995.95 2013.12 2008.86 2008.86 3 1 1999.73 1999.77 2006.3 2006.3 24 1 1995.88 2014.95 1996.16 1996.16 6 1 1995.9 2006.44 2003.19 2003.19 90M 21 5 1994.97 2011.15 2003.77 2006.64 0 5 2007.18 2011.56 3 1 1997.23 2008.29 2003.75 2003.75 3 1 2008.62 2010.83 1998.38 1998.38 1 1 1999.24 1999.24 2003.87 2003.87 1 10 2009.81 2009.81 2005.37 2011.12 58 1 1993.1 2011.21 2004.63 2004.63 29 1 1995.15 2011.74 2002.82 2002.82 6 1 1997.11 2008.93 2000.81 2000.81 1 8 2010.56 2010.56 2008.95 2011.22 38 1 1995.92 2012.76 1998.08 
1998.08 24 1 1995.88 2014.95 1999.19 1999.19

PLOS 10/15 mutation nrSensitive nrResistant firstSen lastSen firstRes lastRes 1 1 2007.22 2007.22 1999.78 1999.78 165 1 1994.92 2014.23 2007.43 2007.43 Table 3. Overview of cluster characteristics per RDMS. Each line corresponds to one cluster, and the header abbreviations refer to the following: mutation - which RMDS does the cluster belong to; nrSensitive - the number of sensitive samples in the cluster; nrResistant - the number of resistant samples in the cluster; firstSen - the time when the first (i.e. oldest) sensitive sequence was sampled ; lastSen - the time when the last (i.e. most recent) sensitive sequence was sampled; firstRes- the time when the first (i.e. oldest) resistant sequence was sampled; lastRes - the time when the last (i.e. most recent) resistant sequence was sampled

Table 4 lists the posterior estimates of the effective reproduction number of the sensitive strains, the transmission ratio and the resistance evolution and reversion rates estimated under scenario A.

mutation  parameter  median  95% HPD
103N  Rs,before1994  2.7418  (2.4045-3.1105)
103N  Rs,1994-2001  0.5945  (0.5011-0.6968)
103N  Rs,2001-2008  0.9082  (0.7752-1.0274)
103N  Rs,2008-2015  0.8969  (0.6164-1.1809)
103N  transmission ratio rλ  0.9581  (0.5592-1.3651)
103N  resistance evolution rate  0.0135  (0.0077-0.0202)
103N  resistance reversion rate  0.0304  (0-0.0937)
108I  Rs,before1994  2.7997  (2.285-3.401)
108I  Rs,1994-2001  0.6462  (0.4968-0.7912)
108I  Rs,2001-2008  0.9477  (0.6266-1.1682)
108I  Rs,2008-2015  0.4519  (0.1064-0.8209)
108I  transmission ratio rλ  0.791  (0.1787-6.4336)
108I  resistance evolution rate  0.0425  (0.0056-0.1396)
108I  resistance reversion rate  1.1013  (0.0156-3.1517)
181C  Rs,before1994  3.5568  (2.9421-4.2367)
181C  Rs,1994-2001  0.525  (0.3895-0.6602)
181C  Rs,2001-2008  0.771  (0.5647-0.9839)
181C  Rs,2008-2015  0.6226  (0.0979-1.1613)
181C  transmission ratio rλ  0.8659  (0.2634-1.7771)
181C  resistance evolution rate  0.0179  (0.0044-0.0466)
181C  resistance reversion rate  0.1504  (0-0.7867)
190A  Rs,before1994  2.5024  (2.0983-2.9422)
190A  Rs,1994-2001  0.5444  (0.4097-0.6809)
190A  Rs,2001-2008  0.9882  (0.81-1.174)
190A  Rs,2008-2015  0.4983  (0.1574-0.8585)
190A  transmission ratio rλ  1.0102  (0.4594-1.7366)
190A  resistance evolution rate  0.0108  (0.0036-0.0213)
190A  resistance reversion rate  0.0682  (0-0.2586)
138A  Rs,before1994  2.7003  (2.4217-2.9803)
138A  Rs,1994-2001  0.4723  (0.3932-0.553)
138A  Rs,2001-2008  1.0012  (0.8925-1.1027)
138A  Rs,2008-2015  0.9414  (0.724-1.1748)
138A  transmission ratio rλ  1.0903  (0.8849-1.3293)
138A  resistance evolution rate (before drug usage)  0.008  (0.0054-0.0109)
138A  resistance evolution rate  0.0463  (2e-04-0.1402)
138A  resistance reversion rate (before drug usage)  0.0379  (0.0084-0.0721)
138A  resistance reversion rate  0.0553  (0-0.2254)
184V  Rs,before1994  2.7151  (2.42-3.0277)
184V  Rs,1994-2001  0.5807  (0.5033-0.6693)
184V  Rs,2001-2008  1.0009  (0.8848-1.1085)
184V  Rs,2008-2015  0.6199  (0.3932-0.8707)
184V  transmission ratio rλ  0.4101  (0.1981-0.6868)
184V  resistance evolution rate  0.0184  (0.0114-0.028)
184V  resistance reversion rate  0.0929  (7e-04-0.2196)
210W  Rs,before1994  2.5632  (2.1858-2.9289)
210W  Rs,1994-2001  0.5631  (0.4532-0.6848)
210W  Rs,2001-2008  0.8596  (0.6987-1.018)
210W  Rs,2008-2015  0.8296  (0.4114-1.2195)
210W  transmission ratio rλ  0.6909  (0.4183-1.024)
210W  resistance evolution rate  0.0086  (0.0042-0.0134)
210W  resistance reversion rate  0.0472  (0-0.1163)
215D  Rs,before1994  2.7622  (2.3947-3.1683)
215D  Rs,1994-2001  0.5336  (0.4287-0.64)
215D  Rs,2001-2008  0.9684  (0.835-1.1116)
215D  Rs,2008-2015  0.8323  (0.5939-1.0529)
215D  transmission ratio rλ  0.9019  (0.6016-1.2766)
215D  resistance evolution rate  0.0065  (0.0029-0.0105)
215D  resistance reversion rate  0.0534  (0.0036-0.1261)
215S  Rs,before1994  2.8829  (2.4775-3.3406)
215S  Rs,1994-2001  0.6086  (0.4863-0.7279)
215S  Rs,2001-2008  0.9317  (0.7827-1.0917)
215S  Rs,2008-2015  0.8535  (0.5275-1.1692)
215S  transmission ratio rλ  0.8336  (0.4841-1.2828)
215S  resistance evolution rate  0.0071  (0.0036-0.012)
215S  resistance reversion rate  0.0561  (0-0.1349)
219Q  Rs,before1994  2.6481  (2.3144-3.0027)
219Q  Rs,1994-2001  0.5924  (0.4872-0.6949)
219Q  Rs,2001-2008  1.0065  (0.8786-1.1449)
219Q  Rs,2008-2015  0.7207  (0.472-0.9785)
219Q  transmission ratio rλ  0.6793  (0.5017-0.8951)
219Q  resistance evolution rate  0.0047  (0.0017-0.0087)
219Q  resistance reversion rate  0.0987  (0.0505-0.1631)
41L  Rs,before1994  2.6184  (2.3361-2.9074)
41L  Rs,1994-2001  0.5251  (0.4403-0.6178)
41L  Rs,2001-2008  0.9373  (0.8182-1.0627)
41L  Rs,2008-2015  0.9991  (0.746-1.255)
41L  transmission ratio rλ  0.7768  (0.5706-1.0052)
41L  resistance evolution rate  0.0121  (0.0081-0.017)
41L  resistance reversion rate  0.0903  (0.0381-0.1455)
67N  Rs,before1994  2.8678  (2.4767-3.2527)
67N  Rs,1994-2001  0.5306  (0.4335-0.642)
67N  Rs,2001-2008  0.9984  (0.8645-1.1194)
67N  Rs,2008-2015  0.8702  (0.6191-1.1162)
67N  transmission ratio rλ  0.6819  (0.5-0.8828)
67N  resistance evolution rate  0.0059  (0.0026-0.01)
67N  resistance reversion rate  0.0888  (0.0451-0.1413)
70R  Rs,before1994  3.2026  (2.8077-3.6536)
70R  Rs,1994-2001  0.6253  (0.5292-0.7162)
70R  Rs,2001-2008  0.7655  (0.626-0.8831)
70R  Rs,2008-2015  0.6788  (0.4019-0.9641)
70R  transmission ratio rλ  0.6137  (0.4241-0.8315)
70R  resistance evolution rate  0.0048  (0.0015-0.0088)
70R  resistance reversion rate  0.1239  (0.0638-0.2014)
90M  Rs,before1994  2.5241  (2.1226-2.9732)
90M  Rs,1994-2001  0.6651  (0.5386-0.8109)
90M  Rs,2001-2008  0.9131  (0.7525-1.089)
90M  Rs,2008-2015  0.5074  (0.1795-0.9026)
90M  transmission ratio rλ  1.6141  (1.1059-2.1688)
90M  resistance evolution rate  0.0073  (0.003-0.0128)
90M  resistance evolution rate (post drug usage)  0.0052  (0-0.0221)
90M  resistance reversion rate  0.0601  (2e-04-0.1444)
90M  resistance reversion rate (post drug usage)  0.0578  (0-0.1883)

Table 4. Posterior Bayesian estimates of epidemiological parameters per RDMS. Each line corresponds to one parameter. We report the median estimates together with the 95% highest posterior density interval (HPD). The parameter "resistance evolution/reversion rate" refers to the time period when drug usage was above 1% in Switzerland. The RMDS 138A has additional estimates for the time before the relevant drug usage became significant ("before drug usage"). The 90M RMDS has additional estimates for the time after the relevant drug usage was ceased ("post drug usage").
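One way to read the transmission ratio rows of Table 4 is to check whether each 95% HPD interval lies entirely below 1, entirely above 1, or includes 1. The following is a minimal sketch of that check for a handful of mutations; the numbers are copied from Table 4, while the helper name `classify` and the choice of 1 as the reference value are assumptions made for illustration, not part of the original analysis.

```python
# Transmission-ratio estimates copied from Table 4:
# mutation -> (median, HPD lower bound, HPD upper bound).
TRANSMISSION_RATIO = {
    "103N": (0.9581, 0.5592, 1.3651),
    "184V": (0.4101, 0.1981, 0.6868),
    "219Q": (0.6793, 0.5017, 0.8951),
    "41L": (0.7768, 0.5706, 1.0052),
    "90M": (1.6141, 1.1059, 2.1688),
}


def classify(lower, upper, reference=1.0):
    """Label an interval by its position relative to the reference value.

    Hypothetical helper: the reference of 1 is an illustrative assumption.
    """
    if upper < reference:
        return "below 1"
    if lower > reference:
        return "above 1"
    return "includes 1"


for mutation, (median, lower, upper) in TRANSMISSION_RATIO.items():
    label = classify(lower, upper)
    print(f"{mutation}: median {median}, 95% HPD ({lower}-{upper}) -> {label}")
```

Under this reading, 184V and 219Q have intervals entirely below 1, 90M has an interval entirely above 1, and 103N and 41L have intervals that include 1.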

3.1 Non-Convergence of 215Y RMDS

Figure 6 shows the traces of the transmission ratio and the resistance reversion rate for the 215Y RMDS. Both parameters "jump" between two areas of the state space, indicating a bimodality of the posterior distribution. This is likely due to a lack of data in the resistant deme: the 28 resistant samples fall into 25 distinct clusters, and in the three clusters that have two (rather than one) resistant samples, those samples fall into different parts of the respective trees. Since the conflict between the two states was not resolved after running the MCMC for more than 2.400 million steps, the results are not reliable and were hence not included.

[Figure 5: three panels, 184V (top), 103N (middle) and 90M (bottom), each showing the transmission ratio for simulation replicates 0 to 40.]

Fig 5. The 95% highest posterior density (HPD) intervals of the reconstructed transmission ratio rλ for each of the 40 replicates (each blue bar corresponds to one simulation replicate) of the 184V-like simulation set (top), the 103N-like set (middle) and the 90M-like simulation set (bottom). The true value is depicted by a green line.
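The comparison summarised in Fig 5 can be sketched in code: compute a 95% HPD interval from each replicate's posterior samples and count how often the interval contains the true value. This is a minimal sketch under stated assumptions: the posterior samples are synthetic Gaussian draws, the true value of 0.4 is arbitrary, and the function names `hpd_interval` and `coverage` are hypothetical. It does not reproduce the simulation study itself.

```python
import random


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval that contains `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, int(round(mass * n)))  # number of samples inside the interval
    # Slide a window over k consecutive order statistics, keep the narrowest.
    best_start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best_start], ordered[best_start + k - 1]


def coverage(intervals, true_value):
    """Fraction of intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


# Toy stand-in for 40 replicates: each "posterior" is a noisy cloud around a
# replicate-specific estimate. These are synthetic numbers, not thesis output.
rng = random.Random(1)
true_ratio = 0.4
intervals = []
for _ in range(40):
    centre = rng.gauss(true_ratio, 0.1)
    posterior = [rng.gauss(centre, 0.1) for _ in range(2000)]
    intervals.append(hpd_interval(posterior))

print(f"coverage of the true value: {coverage(intervals, true_ratio):.2f}")
```

The sliding-window search is the usual sample-based way to find the shortest interval for a unimodal posterior; for a bimodal posterior a single interval of this kind can be misleading.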

Fig 6. Posterior traces of the transmission ratio (black) and the resistance reversion rate (blue) for the 215Y RMDS, indicating a bimodality of the posterior distribution

S1 Fig: Histograms of the number of (i) all and (ii) resistant samples per cluster per resistance mutation.

S2 Fig: Estimates of the resistance evolution rates during drug consumption in Switzerland. The violin plots show the 95% HPD intervals of the resistance evolution rate estimates for each resistance mutation. An exponential prior distribution with mean 1 was employed for all analyses.

S3 Fig: Estimates of the resistance reversion rates during drug consumption in Switzerland. The violin plots show the 95% HPD intervals of the resistance reversion rate estimates for each resistance mutation. An exponential prior distribution with mean 1 was employed for all analyses.

7 TRANSMISSION OF HEPATITIS B AND D IN AN AFRICAN COMMUNITY

This work studies the transmission of Hepatitis B (HBV) and D (HDV) in an African rural community. The study is particularly interesting because the sampling was done for a different disease, Buruli ulcer, and the serum samples were tested retrospectively for HBV and HDV infection markers. The infections were divided into two categories: active and occult cases. Active infections can be transmitted through contact with blood or other bodily fluids of an infected person, and transmission is particularly high between actively interacting children aged 5 or less (WHO, 2019a). Occult infection, on the other hand, seems to be transmissible only perinatally or through medical procedures such as blood transfusion or organ donation, neither of which is available at the sampling site (Hu, 2002). I contributed to this paper by designing and performing the phylodynamic analysis for the active and occult cases. In the analysis, active cases were allowed to transmit and give rise to both active and occult cases, while occult cases did not transmit at all. For 28 sequences, and in particular for all occult cases, in which very little DNA is present in the bloodstream of the patient, no DNA could be amplified by whole-genome PCR, so only pre-S/S gene sequences were available. For the other 39 sequences we could analyse whole genomes. Using Bayesian methods allowed us to make use of all the genetic data available for HBV, while the maximum-likelihood (ML) analyses were done using only the pre-S/S sequences that were available for all samples. In this particular case, both the Bayesian analyses with structure in the population and the ML analyses without structure showed a very similar clustering of the occult and active cases, in which most occult cases in the dataset cluster within sequences from active cases from the same or neighbouring households.
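The transmission structure described above, in which only active cases transmit and each new infection is either active or occult, can be illustrated with a small forward simulation. This is a toy sketch and not the Bayesian phylodynamic analysis itself: the function name `simulate`, the Gillespie-style event scheme, and every rate value are hypothetical choices made for illustration.

```python
import random


def simulate(active, occult, birth_active, birth_occult, removal, t_max, seed=0):
    """Gillespie-style simulation of a two-type process.

    Only active cases transmit; a new infection is active or occult.
    All rates are per individual and purely illustrative assumptions.
    """
    rng = random.Random(seed)
    t = 0.0
    while active + occult > 0:
        rates = [
            birth_active * active,  # active case infects, new case is active
            birth_occult * active,  # active case infects, new case is occult
            removal * active,       # an active case is removed
            removal * occult,       # an occult case is removed
        ]
        total = sum(rates)
        if total == 0:
            break  # nothing can happen any more
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_max:
            break
        event = rng.choices(range(4), weights=rates)[0]
        if event == 0:
            active += 1
        elif event == 1:
            occult += 1
        elif event == 2:
            active -= 1
        else:
            occult -= 1
    return active, occult


# Start from five active cases and follow the process for ten time units.
print(simulate(5, 0, birth_active=0.6, birth_occult=0.3, removal=1.0, t_max=10.0))
```

Because occult cases contribute no transmission events, a population that starts with occult cases only can shrink but never grow in this sketch.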
This work was published in September 2018 in mSystems as an article titled “Transmission of Hepatitis B and D Viruses in an African Rural Community”, DOI: 10.1128/mSystems.00120-18, where I am a middle author. What follows is the publisher’s version of the article, followed by the supplementary text and figures.

RESEARCH ARTICLE Clinical Science and Epidemiology

Transmission of Hepatitis B and D Viruses in an African Rural Community

Carlos Augusto Pinho-Nascimento (a, b, c), Martin W. Bratschi (a, b), Rene Höfer (d), Caroline Cordeiro Soares (e), Louisa Warryn (a, b), Jūlija Pečerska (f), Jacques C. Minyem (h), Izabel C. N. P. Paixão (c), Marcia Terezinha Baroni de Moraes (g), Alphonse Um Boock (h), Christian Niel (e), Gerd Pluschke (a, b), Katharina Röltgen (a, b)

a Swiss Tropical and Public Health Institute, Molecular Immunology, Basel, Switzerland
b University of Basel, Basel, Switzerland
c Laboratory of Molecular Virology, Biology Institute, Fluminense Federal University, Niterói, Brazil
d Jena-Optronik GmbH, Jena, Germany
e Laboratory of Molecular Virology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Brazil
f Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
g Oswaldo Cruz Institute, Fiocruz, Laboratory of Comparative and Environmental Virology, Rio de Janeiro, Brazil
h FAIRMED, Yaoundé, Cameroon

ABSTRACT According to the World Health Organization (WHO), an estimated 257 million people worldwide are chronically infected with hepatitis B virus (HBV), with approximately 15 million of them being coinfected with hepatitis D virus (HDV). To investigate the prevalence and transmission of HBV and HDV within the general population of a rural village in Cameroon, we analyzed serum samples from most (401/448) of the villagers. HBV surface antigen (HBsAg) was detected in 54 (13.5%) of the 401 samples, with 15% of them also containing anti-HDV antibodies. Although Cameroon has integrated HBV vaccination into their Expanded Program on Immunization for newborns in 2005, an HBsAg carriage rate of 5% was found in children below the age of 5 years. Of the 54 HBsAg-positive samples, 49 HBV pre-S/S sequences (7 genotype A and 42 genotype E sequences) could be amplified by PCR. In spite of the extreme geographical restriction in the recruitment of study participants, a remarkable genetic diversity within HBV genotypes was observed. Phylogenetic analysis of the sequences obtained from PCR products combined with demographic information revealed that the presence of some genetic variants was restricted to members of one household, indicative of intrafamilial transmission, which appears to take place at least in part perinatally from mother to child. Other genetic variants were more widely distributed, reflecting horizontal interhousehold transmission. Data for two households with more than one HBV-HDV-coinfected individual indicate that the two viruses are not necessarily transmitted together, as family members with identical HBV sequences had different HDV statuses.

IMPORTANCE This study revealed that the prevalence of HBV and HDV in a rural area of Cameroon is extremely high, underlining the pressing need for the improvement of control strategies. Systematic serological and phylogenetic analyses of HBV sequences turned out to be useful tools to identify networks of virus transmission within and between households. The high HBsAg carriage rate found among children demonstrates that implementation of the HBV birth dose vaccine and improvement of vaccine coverage will be key elements in preventing both HBV and HDV infections. In addition, the high HBsAg carriage rate in adolescents and adults emphasizes the need for identification of chronically infected individuals and linkage to WHO-recommended treatment to prevent progression to liver cirrhosis and hepatocellular carcinoma.

KEYWORDS hepatitis B virus, molecular epidemiology, transmission

Received 11 July 2018 Accepted 21 August 2018 Published 18 September 2018

Citation Pinho-Nascimento CA, Bratschi MW, Höfer R, Soares CC, Warryn L, Pečerska J, Minyem JC, Paixão ICNP, Baroni de Moraes MT, Um Boock A, Niel C, Pluschke G, Röltgen K. 2018. Transmission of hepatitis B and D viruses in an African rural community. mSystems 3:e00120-18. https://doi.org/10.1128/mSystems.00120-18.

Editor Katrine L. Whiteson, University of California, Irvine

Copyright © 2018 Pinho-Nascimento et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

Address correspondence to Katharina Röltgen, [email protected].

September/October 2018 Volume 3 Issue 5 e00120-18 msystems.asm.org 1 Pinho-Nascimento et al.

Despite being entirely vaccine preventable, hepatitis B virus (HBV) infection remains a serious public health concern, with estimates of 257 million chronic HBV surface antigen (HBsAg) carriers worldwide (1). Without treatment, disease will progress to liver cirrhosis and hepatocellular carcinoma in 15 to 40% of the chronically infected patients. According to WHO figures, about 60 million HBsAg carriers live in the WHO African Region, the majority of whom are unaware of their infection and, thus, constitute a reservoir for the inadvertent spread of the virus to others (2, 3). Transmission of HBV in areas of high endemicity commonly occurs perinatally from mother to child or horizontally by exposure to infected blood or other body fluids. In African settings, the predominant route of HBV transmission appears to be horizontal, in particular, from child to child during the first 5 years of life due to close interaction with infected household contacts and playmates (1). HIV-HBV coinfection seems to lead to an increase in the risk of mother-to-child transmission (3). Occult HBV infection (OBI), characterized by the presence of very low levels of HBV DNA in blood and liver and undetectable HBsAg levels, poses a significant risk to individuals receiving blood transfusions or tissue transplants (4).

In a recent systematic review and meta-analysis on the prevalence of HBV infection in Cameroon, an overall pooled HBsAg carriage rate of 11% was reported, with a higher prevalence in rural (13%) than in urban (9%) areas (5). Seroprevalence rates in rural areas of Cameroon were reported for pregnant women (6, 7), Pygmy groups (8), and HIV patients (9), whereas information on the burden of HBV in the general population, including age-specific prevalence rates, is limited. The impact of perinatal versus horizontal HBV transmission in African countries where the disease is hyperendemic, like Cameroon, remains controversial (6, 7, 10, 11).

HBV isolates from around the world have been classified into nine genotypes (A to I), most of which show a more-or-less distinct geographical distribution (12). In sub-Saharan Africa, a predominance of genotypes A (HBV/A) and E (HBV/E) has been reported (13–15). While HBV/E is characterized by comparatively low genetic diversity (14), a new classification scheme for HBV/A consisting of subgenotypes A1, A2, and A4 (previously referred to as A6) as well as quasi-subgenotype A3 (qA3; previously subdivided into A3, A4, A5, and A7) was proposed (16, 17).

The HBV genome consists of a circular, partly double-stranded DNA of approximately 3,200 nucleotides and comprises four partially overlapping open reading frames (ORFs) encoding the surface proteins (pre-S/S; corresponding to the HBsAg), the HBeAg, and the core protein (pre-C/C), the polymerase (P), and the regulatory protein (X). The pre-S/S ORF encodes three different, structurally related envelope proteins, which are synthesized from alternative initiation codons and are referred to as large (L), middle (M), and small (S) proteins, respectively (18). The pre-S/S region is the most variable part of the HBV genome (19) and is therefore the most commonly analyzed region for phylogenetic investigations.

According to WHO estimates, approximately 15 million of all HBsAg carriers worldwide are chronically coinfected with hepatitis D virus (HDV) (20), a defective virus that requires HBsAg to establish an infection. HDV infection occurs either simultaneously with HBV or through superinfection of an HBV-positive individual. Whereas clearance of both viruses is a common outcome of simultaneous infection, the majority of patients with superinfection progress to chronic HDV and hepatitis (21). The prevalences of HDV in Africa vary geographically and are particularly high in West and Central African countries (22). In these settings, transmission of HDV through contact with the blood or other body fluids of an infected person is common (23), while the risk of vertical spread seems to be low (20). Recent data on the HDV seroprevalence among HBsAg carriers in Cameroon ranged from 11% (24) to 18% (25).

HDV exhibits a high degree of genetic heterogeneity and has been classified into eight different clades (1 to 8) (26), of which clade 1 has a worldwide distribution (27). Only a few HDV genotyping studies have been conducted in Africa. One of them reports a cocirculation of several clades (HDV-1, HDV-5, HDV-6, and HDV-7) in Cameroon (25). HDV has a circular RNA genome of approximately 1,700 nucleotides, containing a single ORF encoding the hepatitis delta antigen (HDAg) (28).

The purpose of the present study was to investigate the prevalence and genetic diversity of HBV and HDV in the general population of a remote rural village in Cameroon. By combining these data with available demographic information on the study population, and by assuming that high similarity between HBV or HDV sequences from different individuals is strongly indicative of a related source of infection, intra- and interfamilial patterns of the transmission of these viruses were analyzed.

FIG 1 Age distribution of HBsAg carriers among study participants of Mbandji 2. A stacked graph illustrating the number of HBsAg carriers among the total number of study participants for each age group (left y axis) is shown. Diamonds represent the percentage of HBsAg carriage for each age group (right y axis).

RESULTS

HBsAg and anti-HDV serum antibody positivity in the study population. Serum samples from 401 of the 448 inhabitants (living in 88 households) of the village Mbandji 2 in the Bankim Rural Health Area of Cameroon were available for assessing HBV and HDV infection prevalence. Of the 401 individuals, 222 were children under the age of 15 years. An enzyme-linked immunosorbent assay (ELISA) revealed that 13.5% of all sera (54/401) were HBsAg positive. Between different age groups, HBsAg positivity was highly varied, with the lowest prevalence rates recorded for young children (<5 years) and the elderly population (≥55 years) (Fig. 1). There was no marked gender difference in positivity, as 12.6% (26/206) of the male and 14.4% (28/195) of the female participants tested positive. Fifteen percent (8/54) of the sera positive for HBsAg also contained anti-HDV antibodies. Five of the HBV-HDV-coinfected individuals were adults (one male of unknown age, two males aged 28 and 40 years, and two females aged 39 and 50 years), and three were children 9, 11, and 14 years of age.

Phylogenetic analysis of the circulating HBV and HDV variants. HBV pre-S/S region sequences could be amplified from 91% (49/54) of the HBsAg ELISA-positive sera. To examine the position of the genetic variants circulating in Mbandji 2 within the global HBV phylogeny, these sequences were compared with a representative selection of HBV sequences retrieved from GenBank (Fig. 2). While 86% (42/49) of the Mbandji 2 sequences clustered with HBV/E isolates, 14% (7/49) were closely related to HBV/A isolates. The mean genetic distance of the 42 HBV/E pre-S/S sequences was lower (0.7% ± 0.1%) than that of the seven HBV/A sequences (1.2% ± 0.2%). HBV whole-genome sequences could be amplified from 72% (39/54) of the HBsAg ELISA-positive samples. Results of phylogenetic analyses of these whole genomes are consistent with results obtained with the pre-S/S sequences (see Fig. S1 in the supplemental material). High-resolution phylogenetic analysis of the Mbandji 2 HBV/A sequences demonstrated that all of the seven pre-S/S sequences of this study grouped with strains belonging to quasi-subgenotype A3 (Fig. 3A). Classification into quasi-subgenotype A3 was confirmed by whole-genome analysis of the four complete HBV/A genome sequences obtained (Fig. 3B).
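The mean genetic distances reported above were estimated in MEGA 6.0 with the Kimura 2-parameter model (see Materials and Methods). As a rough illustration only, and not the MEGA implementation, the following Python sketch computes a Kimura 2-parameter distance for two aligned sequences and averages it over all pairs. The function names, the toy sequences, and the handling of gaps and ambiguous bases (skipped pair by pair) are assumptions made for this example.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases for this pair
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def mean_pairwise_distance(seqs):
    """Average K2P distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(k2p_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGCAC"]  # hypothetical alignment
    print(f"mean K2P distance: {mean_pairwise_distance(toy):.4f}")
```

The sketch omits the gamma rate correction (+G) used for the trees in the figure legends, so its values would not be expected to match MEGA output exactly.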

September/October 2018 Volume 3 Issue 5 e00120-18 msystems.asm.org Pinho-Nascimento et al.

FIG 2 Phylogenetic reconstruction of HBV pre-S/S sequences. A maximum-likelihood phylogenetic tree of the 49 HBV pre-S/S sequences obtained in this study together with 53 publicly available sequences covering all HBV genotypes was constructed under the general time-reversible (GTR) model +G +I embedded in MEGA 6.0. Of the 49 Mbandji 2 sequences, 42 (blue dots) and 7 (red dots) clustered with HBV/E and HBV/A isolates, respectively. The tree was drawn to scale, with branch lengths measured as the number of substitutions per site. The tree was rooted using the sequence of a woolly monkey HBV as an outgroup. There were a total of 1,212 positions in the final data set. Bootstrap values (≥80%) are shown at branch nodes.

For HDV, amplification of a 360-bp fragment of the small hepatitis D (sHD) gene region succeeded for three of the eight samples in which anti-HDAg antibodies were detected. Phylogenetic reconstruction together with publicly available sequences of different HDV clades showed a clustering of the three sequences with HDV clade 1 isolates (Fig. 4).

Phylogeography and evidence for familial transmission of HBV and HDV. The geographical distribution of the 88 studied households is depicted in a map of Mbandji 2 in the form of pie charts, corresponding in size to the numbers of study participants (Fig. 5). The 42 study participants infected with variants of HBV/E resided in 25 households, 9 of which were inhabited by more than one HBV/E-infected individual. In order to study the spatial pattern of HBV/E variants within the village, a phylogeny of the 49 pre-S/S sequences was constructed (Fig. 6A) (HBV/A sequences were used as an outgroup). Dots representing the residential homes of each individual with HBV/E sequence information in a map of Mbandji 2 (Fig. 6B) are colored according to the corresponding branches in the HBV/E phylogenetic tree in Fig. 6A, in which information on participant identifiers (IDs), household IDs, gender, family relationships, and age is also provided. The spatial analysis revealed that in most cases, the HBV/E sequences

Molecular Epidemiology of HBV and HDV in Cameroon

FIG 3 High-resolution phylogeny of HBV/A sequences. Maximum-likelihood phylogenetic trees of HBV/A pre-S/S (A) and whole-genome (B) sequences from Mbandji 2 (red dots) together with publicly available sequences covering the described HBV/A subgenotypes were constructed in MEGA 6.0. (A) All of the seven HBV/A pre-S/S sequences of this study clustered with strains of quasi-subgenotype A3 (qA3). Distances were calculated using the Kimura 2-parameter model +G. There were a total of 1,206 positions in the final data set. (B) Classification of the Mbandji 2 HBV/A sequences into qA3 was reconfirmed by whole-genome analysis. Distances were calculated using the Kimura 2-parameter model +G +I. There were a total of 3,221 positions in the final data set. Both trees were drawn to scale, with branch lengths measured as the number of substitutions per site, and were rooted using the sequence of an HBV/E strain as an outgroup (black dot). Bootstrap values (≥80%) are shown at branch nodes.

from infected individuals living in the same households clustered together (Fig. 6), indicating prevailing intrahousehold transmission of HBV. HBV/E pre-S/S sequences with high similarity were found in six of the nine households inhabited by more than one HBV/E-infected individual. This included two genetic clusters with completely identical sequences each: one (cluster E1 in Table 1; light green in Fig. 6) consisting of sequences from three siblings living in household A40 and the other (cluster E2 in Table 1; dark blue in Fig. 6) comprising sequences from three siblings living in household A6 and a couple residing in household A23, as well as five additional individuals from five other households. Another genetic cluster (cluster E3 in Table 1; light blue in Fig. 6) consisted of sequences from a mother with her three children resident in household B19 as well as sequences from three additional individuals from three other households.
While the sequences of the mother and her two youngest children were identical and contained two unique single-nucleotide polymorphisms (SNPs) that were not present in any other sequence in the entire data set, the sequence of the oldest daughter exhibited six additional SNP differences, two of which were uniquely found in her sequence, whereas four also occurred in the almost identical sequence of her cousin living in household B21. It is noteworthy that the mother tested HDV positive but that her three children were HDV negative. Sequences from two siblings living in household A9 (cluster E4 in Table 1; red in Fig. 6) differed by only a single SNP. They clustered together with the sequences of four additional individuals from other households. The mother of the two siblings was HBsAg negative. Sequences of three siblings (cluster E5 in Table 1; orange in Fig. 6) living in household B6 were completely identical. The sequence of a fourth sibling living in this household exhibited three unique SNPs and a 15-bp deletion that was found only in this isolate.


FIG 4 Phylogenetic reconstruction of HDV sequences. An unrooted maximum-likelihood phylogenetic tree of the 3 HDV sequences obtained in this study (green dots) and 41 publicly available sequences covering all currently described HDV genotypes (1 to 8) was constructed with the GTR model +G embedded in MEGA 6.0. The three Mbandji 2 sequences clustered with strains of HDV genotype 1. The tree was drawn to scale, with branch lengths measured as the number of substitutions per site. There were a total of 365 positions in the final data set. Bootstrap values (≥80%) are shown at branch nodes.

The sequences of all four siblings shared two unique SNPs that were not present in any other sequence of the whole data set. Particularly notable is that the two female siblings were identified to be coinfected with HDV but that the two male siblings tested anti-HDV antibody negative. In three of the nine households with more than one HBV/E-infected individual, no particularly close phylogenetic relation was found between sequences of the inhabitants (Fig. 6). Among these households, sequences were derived from (i) a mother (dark blue) living in household A5 with her 11-year-old daughter (light blue), whose sequence was closely related to another child, (ii) an aunt (dark blue) and her niece (dark green) resident in household A21, and (iii) a pair of siblings living in household A34 (light blue and dark green). While one of the siblings was coinfected with HDV, the other tested anti-HDV antibody negative. Residences of the seven individuals from whom HBV/A sequences were obtained are distributed over the entire study area, with two of them living in the same household, B18 (Fig. 5). These two individuals were siblings (9 and 11 years of age) and presented with completely identical sequences (cluster A in Table 1). In one household (B28), an HBV/E sequence was amplified from the serum of a female and an HBV/A sequence from her male partner.


FIG 5 Geographical distribution of HBV/A, HBV/E, and HDV infections in Mbandji 2. Locations of the 88 households participating in the study are displayed in the form of pie charts, with the pies corresponding in size to the numbers of study participants. Individuals identified as being infected with HBV/A (red) and HBV/E (blue) or coinfected with HBV/E and HDV (green) are represented as slices in the pie charts according to their numerical proportion. No household location is shown for study participant 490, coinfected with HBV/E and HDV, because no demographic information was available. Overview images of Cameroon (lower left) and the Mapé Basin (lower middle) are shown to illustrate the geographical location of the study area (black rectangle). The background images, courtesy of ESA Sentinel and the U.S. Geological Survey, are in the public domain.

The eight individuals with HDV infection were all coinfected with HBV/E; their homes (six different households) were located mostly in the southwestern area of the village (Fig. 5). For one of the eight HBV-HDV-coinfected individuals, no demographic information was available.

Deduced amino acid sequence analysis. SNP sites among the 42 available HBV/E pre-S/S sequences and the pre-C/C, X, and P sequences of the 35 complete HBV/E genomes are listed in Table S2. Predicted amino acid sequence polymorphisms and deletions are illustrated in Table S3. Twenty-three of the 35 HBV/E sequences for which full genome data were available were 3,212 nucleotides long. In-frame deletions of 1 to 5 amino acids affecting both the pre-S2 and P regions were detected in the remaining 12 isolates. Deduced amino acid sequences for the pre-S1 region of all 42 HBV/E pre-S/S sequences showed the characteristic features described for genotype E isolates (29), including a length of 118 amino acids due to a single amino acid deletion in the amino terminus, the signature motif Leu3SerTrpThrValProLeuGluTrp11, and a methionine at position 83, which introduces a new translational start codon. Interestingly, another methionine at pre-S1 position 86 was present in 17 of the 42 sequences. Substitutions leading to the loss of the pre-S2 start codon, with consequent abolishment of the M protein synthesis, were detected in the sequences of two individuals (study participants 214 and 522). Ten

FIG 6 Phylogeographic and transmission analysis of HBV in the population of Mbandji 2. (A) A maximum-likelihood phylogenetic tree of the 42 HBV/E and 7 HBV/A pre-S/S sequences was constructed using the Kimura 2-parameter model +G to illustrate inter- and intrafamilial HBV transmission in the village. Individuals are marked with dots colored according to different branches of the phylogenetic tree, and additional demographic information, such as participant ID, household ID, gender, family relationships (for households with sequence information from several HBsAg carriers), and age, is shown. The tree was drawn to scale, with branch lengths measured as the number of substitutions per site. There were a total of 1,209 positions in the final data set. Bootstrap values (≥80%) are shown at branch nodes. Asterisks indicate households in which members with identical or highly similar HBV sequences were identified (high likelihood of intrafamilial transmission). Plus or minus signs highlight households in which family members with identical HBV sequences had differential HDV statuses. (B) The residence of each of the study participants (except for participant 490, for whom no demographic information was available) with HBV/E sequence information is displayed according to the color of the respective dots in the branches of the phylogenetic tree. The background image, courtesy of ESA Sentinel, is in the public domain.

isolates, which constitute the dark-blue branch of the phylogenetic tree (Fig. 6A), had a single amino acid deletion at pre-S2 position 22. Two isolates from individuals living in two different households contained unique amino acid deletions of 4 and 5 amino acids at pre-S2 positions 19 to 22 (study participant 240) and 18 to 22 (study participant 398), respectively. In the S region, all but one of the sequences had a Thr57 residue, like the majority of isolates from northwestern Africa. In contrast, an Ile57 residue present in the majority of HBV/E isolates from southwestern Africa (29) was detected in one sequence belonging to a 7-year-old child (study participant 416), who was the only HBsAg-positive individual in his household. Amino acid substitutions in the “a” determinant of the major hydrophilic region, associated with immune escape, consisting of amino acids 124 to 147 deduced from the S gene sequence (30) were identified in three sequences at positions L127I (study participant 398), L127P (study participant 416), and S140L (study participant 240). Analysis of the predicted amino acid sequences of the 35 HBV/E whole-genome sequences revealed that the amino acid sequence deduced from the pre-C/C region was particularly conserved, except for the sequences of study participants 240, 472, and 522, which contained a number of amino acid replacements in this region (Table S3). A stop codon mutation affecting amino acid sequence position W28* deduced from the pre-C gene region (nucleotide position G1896A in the genome sequence), abolishing HBeAg production, was detected in two sequences (study participants 472 and 522). The four HBV/A sequences with complete genome information were 3,221 nucleotides long with a 2-amino-acid insertion at the carboxy terminus of the core protein, characteristic of HBV/A isolates.
Of the only three SNPs detected among the four isolates, one was nonsynonymous (amino acid position 244 in the P region), whereas the other two were synonymous. One of the two synonymous SNPs found in three of the four sequences was the silent G1888A (nucleotide position in the genome sequence) mutation, which has previously been described as a unique characteristic of subgenotype A1 (31). SNPs among the seven HBV/A pre-S/S sequences as well as the corresponding deduced amino acids are listed in Table S2. All predicted polymorphic amino acid positions detected in the pre-S/S region are illustrated in Table S3. The deduced amino acid sequence from the pre-S1 gene region of six sequences was found to be 119 amino acids long, while that of the seventh isolate, belonging to a 35-year-old male (study participant 203) who was the only participant of his household, contained 120 amino acids. His sequence also contained a mutation leading to the loss of the pre-S2 start codon. No deletions as well as no mutations in the “a” determinant of the major hydrophilic region were detected in any of the HBV/A pre-S/S sequences.

TABLE 1 Transmission of HBV and HDV within households (a)

Each cluster block lists: genotype and genetic cluster (color code); genetic differences at the indicated positions (in pre-S/S region) within households/genetic cluster; unique genetic differences at the indicated positions (in pre-S/S region) within genetic cluster; then one row per individual giving household ID, household member no., relationship (age in yr), anti-HDV antibody status, and the bases at the listed positions (within-household positions first, unique positions after the bar); and finally additional households with one individual infected with a variant belonging to the same cluster.

E1 (light green)
  Differences within households/cluster: None
  Unique differences within cluster at positions: 27, 43, 732
    A40  333, female sibling (10)      Neg  | A T T
    A40  334, female sibling (14)      Neg  | A T T
    A40  668, male sibling (11)        Neg  | A T T
         Consensus sequence                 | C C C
  Additional households: None

E2 (dark blue)
  Differences within households/cluster: None
  Unique differences within cluster at positions: 265, 373, 417–419
    A6   199, female sibling (8)       Neg  | A G DEL
    A6   200, male sibling (12)        Neg  | A G DEL
    A6   201, male sibling (14)        Neg  | A G DEL
    A23  237, male partner (52)        Neg  | A G DEL
    A23  232, female partner (50)      Neg  | A G DEL
         Consensus sequence                 | G A NoDEL
  Additional households: A3, A4, A5, A17, A21

E3 (light blue)
  Differences within households/cluster at positions: 48, 225, 259, 326, 342, 449, 470, 1085
  Unique differences within cluster: None
    B19  441, mother (39)              Pos  T G T C G T C C |
    B19  438, daughter (3)             Neg  T G T C G T C C |
    B19  439, son (5)                  Neg  T G T C G T C C |
    B19  491, daughter (12)            Neg  C A G T T C T T |
    B21  458, male cousin of 491 (24)  Neg  C G G T T C T C |
         Consensus sequence                 C G T C T T C C |
  Additional households: A5, A34

E4 (red)
  Differences within households/cluster at positions: 321
  Unique differences within cluster at positions: 96
    A9   209, female sibling (4)       Neg  A | C
    A9   210, male sibling (6)         Neg  C | C
         Consensus sequence                 C | A
  Additional households: A1, A15, A23, B4

E5 (orange)
  Differences within households/cluster at positions: 257, 407–421, 665, 898, 1085
  Unique differences within cluster at positions: 515, 723
    B6   398, female sibling (11)      Pos  C Del   G A T | G C
    B6   400, female sibling (14)      Pos  T NoDel T C C | G C
    B6   401, male sibling (7)         Neg  T NoDel T C C | G C
    B6   402, male sibling (6)         Neg  T NoDel T C C | G C
         Consensus sequence                 N NoDel T C C | C T
  Additional households: None

A
  Differences within households/cluster: None
  Unique differences within cluster at positions: 159, 342, 414, 467, 468
    B18  443, female sibling (9)       Neg  | G G T C C
    B18  446, female sibling (11)      Neg  | G G T C C
         Consensus sequence                 | A C C A T
  Additional households: A32, B9

(a) HBV/E and HBV/A pre-S/S sequences with high similarity (genetic clusters) were found in members of the same households. To illustrate the composition of the genetic clusters listed in the first column of the table (the color code corresponds to the color scheme in Fig. 6), IDs of affected households, IDs of household members, and their relationships are specified. The presence (or absence) of anti-HDV antibodies in the sera of individuals is indicated. Pos, positive; Neg, negative; Del, deletion; NoDel, no deletion. Genetic differences detected between HBV pre-S/S sequences of household members as well as unique sequence characteristics that were found only in sequences belonging to the same genetic cluster are also listed. To complete the picture, households with only one individual infected with a variant belonging to the same genetic cluster are given in the last column. Boldface indicates genetic differences with respect to the consensus (the nucleotides present in the majority of sequences of this study).
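The household B19 comparison summarized in Table 1 (cluster E3) can be retraced with a few lines of Python. This is a minimal sketch for illustration, not the authors' analysis pipeline: the bases are transcribed from the Table 1 rows for cluster E3, and the dictionary keys and the helper name `differences` are hypothetical.

```python
# Variable pre-S/S positions and bases transcribed from Table 1, cluster E3.
POSITIONS = [48, 225, 259, 326, 342, 449, 470, 1085]
CONSENSUS = "CGTCTTCC"
SEQUENCES = {
    "441_mother": "TGTCGTCC",
    "438_daughter": "TGTCGTCC",
    "439_son": "TGTCGTCC",
    "491_daughter": "CAGTTCTT",
    "458_cousin": "CGGTTCTC",
}


def differences(seq, ref=CONSENSUS):
    """Map position -> base for every listed site where seq differs from ref."""
    return {pos: s for pos, s, r in zip(POSITIONS, seq, ref) if s != r}


mother = differences(SEQUENCES["441_mother"])
daughter = differences(SEQUENCES["491_daughter"])
cousin = differences(SEQUENCES["458_cousin"])

# Differences of the oldest daughter that her cousin carries as well,
# and those found only in her sequence.
shared = {p: b for p, b in daughter.items() if cousin.get(p) == b}
only_daughter = {p: b for p, b in daughter.items() if p not in shared}

if __name__ == "__main__":
    print("mother vs consensus:  ", mother)
    print("daughter vs consensus:", daughter)
    print("shared with cousin:   ", shared)
    print("daughter only:        ", only_daughter)
```

With the values as transcribed, the mother differs from the consensus at two positions, and the 12-year-old daughter at six, four of which she shares with her cousin, in line with the description in the text.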

No mutations in the polymerase region known to be relevant for phenotypic resistance to the five antiretroviral drugs lamivudine, adefovir, entecavir, tenofovir, and telbivudine were found in any of the HBV sequences. Based on the S region amino acid sequence algorithms, Arg122, Lys160, and Leu/Ile127, which were found in 41 HBV/E sequences, and Arg122, Lys160, Pro127, Gly159, and Ser140, which were found in 1 HBV/E sequence (study participant 416), all HBV/E isolates of this study were classified as serotype ayw4. All HBV/A isolates were predicted to belong to serotype ayw1 based on their Arg122, Lys160, Pro127, and Ala159 amino acid

pattern (32).

Detection of OBI and placement of sequences in the HBV phylogeny. In order to detect OBI in the study population, nucleic acids were extracted from the sera of 149 household contacts of the 54 HBsAg-positive individuals. Pre-S/S amplification products were obtained from 19 of the 149 sera. Nucleotide sequencing and phylogenetic analysis revealed that the 19 OBI sequences were highly similar to the sequences of the 49 HBsAg-positive individuals, with 17/19 clustering with HBV/E sequences and the remaining 2 being closely related to the HBV/A sequences identified in this study (Fig. S2). Of the 17 occult HBV/E sequences, 5 were obtained from members of household A6. One of the five sequences, derived from the mother, was identical to the sequences of her three HBsAg-positive children, whereas four clustered with other

sequences. Of the remaining 12 occult HBV/E sequences, 5 were highly similar to the sequences of their household contacts living in households A3, A9, A17, A19, and A23, 3 (from household A15) were located on different branches of the tree, and 4 (from households A4, A18, A35, and B9) showed no particularly close phylogenetic relation to the sequences of their household contacts. HBsAg-positive household contacts living in A18 and B9 were infected with HBV/A. Of the two occult HBV/A sequences, only one was highly similar to the sequence of the HBsAg-positive household contact living in A32, whereas the other HBsAg-positive household contact residing in B4 was infected with HBV/E. Highly similar phylogenetic relations were obtained by means of Bayesian analysis (Fig. S3). In contrast to the S regions of HBsAg-positive individuals, all of which had a Val106 residue, both of the OBI HBV/A sequences had Ile106. No amino acid variation characteristic for all OBI HBV/E sequences was detected.
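The serotype assignments reported earlier in the Results rest on the residues at S protein positions 122, 127, 140, 159, and 160. The lookup below encodes only the three residue patterns stated in this section. It is a hypothetical helper written for illustration, not the full published algorithm (32), and it deliberately returns `None` for any pattern the text does not mention.

```python
def predict_serotype(residues):
    """Return the serotype for the residue patterns named in the text.

    `residues` maps S protein positions (int) to one-letter amino acid codes.
    Only the patterns reported in this study are covered (an assumption of
    this sketch); every other combination yields None.
    """
    if residues.get(122) != "R" or residues.get(160) != "K":
        return None  # patterns in the text all carry Arg122 and Lys160
    r127 = residues.get(127)
    if r127 in ("L", "I"):
        return "ayw4"  # Arg122, Lys160, Leu/Ile127
    if r127 == "P":
        if residues.get(159) == "G" and residues.get(140) == "S":
            return "ayw4"  # Arg122, Lys160, Pro127, Gly159, Ser140
        if residues.get(159) == "A":
            return "ayw1"  # Arg122, Lys160, Pro127, Ala159
    return None


if __name__ == "__main__":
    # Pattern reported for study participant 416 (HBV/E).
    print(predict_serotype({122: "R", 160: "K", 127: "P", 159: "G", 140: "S"}))
    # Pattern reported for the HBV/A isolates.
    print(predict_serotype({122: "R", 160: "K", 127: "P", 159: "A"}))
```

Returning `None` rather than guessing keeps the sketch from implying rules for serotypes that this study did not observe.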

DISCUSSION

Despite the high endemicity of chronic HBV infection and HBV-HDV coinfection in West and Central Africa, data on the epidemiology of these viruses in the region are scarce, and those available are mostly from urban areas. Here, we investigated the prevalence and genetic diversity of HBV and HDV in a remote rural population of the Mbandji 2 community of Cameroon, for which detailed demographic information was available to study familial relationships of transmission. A high prevalence of HBsAg carriers (13.5%) and HBV-HDV coinfections (15%) was detected among inhabitants of the Mbandji 2 village, mainly in individuals born in the era before Cameroon integrated the HBV vaccine into its Expanded Program on Immunization for newborns in 2005 (33). However, the HBsAg prevalence rate in children below the age of 5 years was still high


(5%), which might be explained by poor access to medical care. In rural African settings, prevention of mother-to-child transmission, requiring a combination of routine antenatal screening, antiviral drug treatment during pregnancy, and the integration of HBV birth dose vaccination, needs to be implemented and improved (3). A lower HBV prevalence in the elderly population (>55 years of age) than in younger adults, as observed in our study population, has also been described in another study (34). This may be associated with the high mortality rate of elderly individuals infected with HBV, presumably due to immunosenescence combined with a high prevalence of other chronic conditions accumulated during their lifetimes (35, 36). The extremely high HBsAg carrier rate of >20% in some age groups demonstrates that public health efforts are required to diagnose infections in the community and to ensure access to treatment. None of the known drug resistance mutations was detected in any of the analyzed HBV sequences, which might be explained by the lack of selection pressure due to the limited use of antiviral drugs in this region. By analyzing genomic sequences, evidence of familial transmission of HBV became apparent for 7 of the 11 households in which more than one HBV-positive individual was identified; identical or almost identical HBV sequences were isolated from infected household members. It has been reported previously that apart from perinatal transmission, horizontal transmission during early childhood may be a common mechanism

of HBV infection in sub-Saharan Africa (7, 10, 37). In our study, six households in which siblings were infected with highly similar HBV/E or HBV/A sequences were identified. In one of these (household B19), perinatal transmission from a mother to her three children is most likely. That it is, in any case, an example of intrafamilial transmission is proven by a genomic fingerprint; the viral sequences of the mother and her two youngest children contained two unique SNPs that were not found in any other sequence. With respect to the involvement of the mother in intrafamilial transmission, the situation is less clear for the other five households, in which the mother either had OBI (household A6), was HBsAg negative (household A9), or did not participate in the study (households A40, B6, and B18). Interestingly, the HBV pre-S/S sequence derived from the mother with OBI of household A6 was identical to the sequences of three of her children. Unique SNPs in almost-identical pre-S/S sequences of family members were also identified in households A40 and B6. Routes of HDV and HBV transmission were reported to be similar. Apart from sexual contact, inadvertent intrafamilial spread of HDV seems to be common in regions of high prevalence. The risk of perinatal transmission of HDV is considered to be low (38). Anti-HDV IgG levels, which indicate exposure to HDV and persist long term, even after viral clearance, were detected in eight study participants. In all three cases in which HDV infections occurred in households with more than one HBsAg carrier, HDV coinfection affected only part of the HBV-positive inhabitants. Interestingly, in two of these (households B6 and B19), individuals with completely identical HBV sequences had different HDV statuses.
While in one family (household B6), HBV-HDV coinfection was identified in two of four siblings, in another family (household B19), the mother, but none of her three HBV-infected children, was coinfected. Since the sensitivity and specificity of the applied anti-HDAg antibody kit were reported to be very high (39), these results suggest that HDV is not necessarily transmitted together with HBV from a coinfected donor. However, the possibility that the two siblings (household B6) and the mother (household B19) acquired their HDV infections from different sources cannot be ruled out completely. Of the 49 individuals from whom HBV pre-S/S sequences were obtained in this study, 7 and 42 belonged to genotypes A and E, respectively. The three analyzed HDV sequences showed high similarity to other publicly available HDV clade 1 sequences. Cocirculation of HBV/A and HBV/E genotypes (15) as well as a predominance of HDV clade 1 infections (25) in Cameroon have been reported previously. Intragenotype mean genetic distances of the analyzed HBV/A and HBV/E pre-S/S sequences were determined to be 1.2% and 0.7%, respectively. Within the geographically very confined study population, the mutation rate appeared to be relatively low, since completely

identical genomic sequences were found in inhabitants of the same households and even in individuals living in different households (dark-blue branch of the phylogenetic tree). A maximum of six unique SNPs were observed between the sequences of family members with apparently linked transmission (household B19). In this household, a mother most likely transmitted her HBV variant perinatally to her three children, who were 3, 5, and 12 years of age at the time of blood collection. While the mother and her two younger children exhibited completely identical sequences, the HBV strain of her 12-year-old daughter had accumulated six unique SNPs. In two of the HBV/E sequences and in one HBV/A sequence, we detected mutations leading to the abolishment of the pre-S2 start codon. In addition, deletions within the pre-S2 region of 1 to 5 amino acids were identified in 19 HBV/E sequences (12 from

HBsAg-positive individuals and 7 from individuals with OBI). The pre-S2 gene is known as the most varied region of the HBV genome, and the emergence of pre-S2 mutants is a frequent event that may occur spontaneously or as a consequence of immune or drug pressure (19). Several lines of evidence indicate that deletions in this region tend to accumulate during later stages of persistent HBV infection and are associated with different severe forms of acute and chronic liver disease (19). In our study, the occurrence of a single amino acid deletion in completely identical sequences from 16 participants may reflect frequent transmission of this variant. The site of this deletion

as well as the one in three sequences with unique 4- or 5-amino-acid deletions involves a region with known B cell epitopes (40) and is therefore indicative of the emergence of HBV immune escape variants. The potential correlation between the pre-C mutation G1896A, detected in two sequences of this study, and severity of liver disease remains controversial (41). The prevalence of HBV, particularly in rural areas of Africa, remains very high. Achieving the WHO goals of eliminating HBV (and HDV) as a public health threat by 2030 (2) will require more-effective prevention of new infections via implementation of the HBV birth dose vaccine and improvement of vaccine coverage. Furthermore, treatment protocols for chronic HBV infection suitable for low-income countries (42) will have to be implemented in sub-Saharan Africa.

MATERIALS AND METHODS

Study design. Within the framework of a cross-sectional house-by-house survey for the neglected tropical skin disease Buruli ulcer in the Mapé River Basin (Bankim District) of Cameroon, located in the northwest Adamawa region (43, 44), serum samples from 401 of 448 inhabitants living in 88 different households of the village Mbandji 2 were collected in January 2011. Here, we tested the 401 sera retrospectively for the presence of HBV and HDV infection markers. Nucleic acid extraction and genome sequencing were performed to obtain an overview of the genetic population structure of the viruses in this region. Demographic information on the study population, including age, gender, and GPS coordinates of the participants’ residences, recorded during the survey, was extracted to enable a microepidemiological investigation of HBV and HDV infections.

Ethics statement. Ethical clearance for the collection and testing of human blood samples was obtained from the Cameroon National Ethics Committee (reference no.
Nu172/CNE/SE/201) as well as the Ethics Committee of Basel (EKBB, reference no. 53/11). Written informed consent was obtained from all individuals involved in the study. Parents or guardians provided written consent on behalf of children.

ELISA testing for the presence of HBV and HDV infection markers. All serum samples were screened for the presence of HBsAg as an indicator for an HBV infection by using the bioelisa HBsAg3.0 kit (Biokit, Barcelona, Spain) according to the manufacturer’s instructions. Serum samples with a positive HBsAg ELISA result were subsequently analyzed with the ETI-AB-DELTAK-2 ELISA kit (DiaSorin, Saluggia, Italy) by following the manufacturer’s protocol to test for the presence of anti-HDAg antibodies as a marker for HDV infection. Nucleic acids were extracted from all serum samples of HBsAg-positive individuals as well as from sera of household contacts of these individuals to enable the detection of OBI.

Nucleic acid extraction, PCR, and genome sequencing. Isolation of viral nucleic acids from 200 µl of each of the serum samples was performed by using the High Pure viral nucleic acid kit (Roche Diagnostics, Mannheim, Germany) in accordance with the manufacturer’s instructions. Total nucleic acids were resuspended in 50 µl Tris-EDTA (TE) buffer. The amplification of HBV whole-genome sequences was attempted with a protocol modified from reference 45 using 4 µl of the nucleic acid template, primers P1 (5'-CCGGAAAGCTTGAGCTCTTCTTTTTCACCTCTGCCTAATCA-3') and P2 (5'-CCGGAAAGCTTGAGCTCTTCAAAAAGTTGCATGGTGCTGG-3'), and the following PCR profile: denaturation at 94°C for 4 min followed by 11 cycles at 94°C for 40 s, 55°C for 1 min, and 72°C for 3 min; 11 cycles of 94°C for 40 s, 60°C for 1 min, and 72°C for 5 min; 11 cycles of 94°C for 40 s, 62°C for 1 min, and 72°C for 7 min; 11 cycles of 94°C for 40 s, 62°C for 1 min, and 72°C for 9 min; and a final extension step at 72°C for 10 min. Samples for which


no DNA could be amplified in the HBV whole-genome PCR were subjected to seminested PCR assays targeting the HBV pre-S/S region (34, 46). Amplification was attempted with 3 ␮l of the nucleic acid template and primer pairs PS1 (5=-CCATATTCTTGGGAACAAGA-3=) and P3 (5=-AAAGCCCAAAAGACCCA CAA-3=) in the first assay round and PS1 and S2 (5=-GGGTTTAAATGTATACCCAAAA-3=) in the second assay round. We also amplified a smaller fragment spanning the S gene using primers PS1a (5=-GGAAAACAT CACATCAGGAT-3=) and P3 for the first round of PCR and primers PS1b (5=-AAAATTCGCAGTCCCCAACC-3=) and P3 for the second round. Thermal conditions for all of the described PCR assays included an initial denaturation step at 94°C for 5 min followed by 32 cycles at 94°C for 30 s, 57°C for 30 s, and 72°C for 1 min and a final extension step at 72°C for 10 min. All PCRs were performed using Platinum Taq DNA polymerase and supplied reagents (Invitrogen, Carlsbad, CA) in accordance with product recommenda- tions. For reverse transcription of HDV RNA, 15 ␮l of the nucleic acid template was heated to 95°C for 5 min before being incubated with deoxynucleoside triphosphates (dNTPs) and primer 1302 (5=-GGATTCACC GACAAGGAGAG-3=) at 70°C for 10 min. In a next step, cDNA synthesis was attempted by adding Moloney murine leukemia virus (M-MLV) reverse transcriptase and M-MLV buffer (Sigma) to the mix and incubat- Downloaded from ing it at 37°C for 50 min, followed by subjection to 94°C for 10 min. For cDNA amplification of a 360-bp fragment of the HDV small HD (sHD) gene sequence, nested PCR was performed with 10 ␮l of cDNA and primers 1302 and 853 (5=-CGGATGCCCAGGTCGGACC-3=) in the first assay round and 5 ␮l of the amplicon and primers 5414 (5=-GAGATGCCATGCCGACCCGAAGAG-3=) and 5415 (5=-GAAGGAAGGCCCTCGAGAACA AGA-3=) in the second assay round. 
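As a rough cross-check of the HBV cycling profiles described above, the programmed hold times can be tallied in a few lines of Python. This is only a sketch: the helper `program_seconds` is a hypothetical name introduced here, and ramping time between temperatures is ignored, so real run times on a thermocycler would be longer.

```python
def program_seconds(initial_s, stages, final_s):
    """Sum the programmed hold times of a PCR profile, in seconds.

    stages is a list of (number_of_cycles, [step durations in seconds]).
    Ramp times between temperatures are NOT included (assumption).
    """
    cycling = sum(cycles * sum(steps) for cycles, steps in stages)
    return initial_s + cycling + final_s


# HBV whole-genome PCR: 4 min denaturation, four blocks of 11 cycles with
# extension times of 3, 5, 7 and 9 min, then a 10 min final extension.
whole_genome = program_seconds(
    initial_s=4 * 60,
    stages=[
        (11, [40, 60, 3 * 60]),
        (11, [40, 60, 5 * 60]),
        (11, [40, 60, 7 * 60]),
        (11, [40, 60, 9 * 60]),
    ],
    final_s=10 * 60,
)

# Seminested pre-S/S and S PCRs: 5 min, 32 cycles of 30 s / 30 s / 1 min, 10 min.
seminested = program_seconds(5 * 60, [(32, [30, 30, 60])], 10 * 60)

print(f"whole-genome PCR: {whole_genome} s ({whole_genome / 3600:.2f} h)")
print(f"seminested PCR round: {seminested} s ({seminested / 60:.0f} min)")
```

Under these assumptions the whole-genome program holds for 21,080 s (just under 6 h) and each seminested round for 4,740 s (79 min), which reflects the long extension steps used for the full-length genome.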
Thermal conditions for both rounds of the nested HDV PCR included an initial denaturation step at 94°C for 2 min followed by 35 cycles of 94°C for 30 s, 55°C for 50 s, and 72°C for 1 min and a final extension step at 72°C for 5 min (47). All reactions were performed in a TProfessional basic thermocycler (Biometra). PCR products were resolved on 1% agarose gels and were purified with NucleoSpin gel and PCR cleanup kits (Macherey-Nagel, Düren, Germany). Sequencing was done at Macrogen Inc., Europe (Amsterdam, the Netherlands) using primers listed in Table S1.

Phylogenetic analysis of HBV and HDV isolates and inference of HBV serotypes. Genotyping of HBV was done by analyzing either whole-genome sequences or the pre-S/S gene region. Genotyping of HDV was based on a partial sHD gene region. Sequences were assembled using CodonCode Aligner software version 6.0.2 (CodonCode Corporation) and were aligned with MUSCLE implemented in MEGA version 6.0 (48). Maximum-likelihood phylogenetic analysis was performed with MEGA 6.0 after we inferred the best DNA substitution model for the different data sets. Node support for all constructed phylogenies was assessed with 1,000 bootstrap replicates. Reference sequences included in the phylogenies were retrieved from GenBank, and accession numbers are given in the respective figure legends. MEGA 6.0 was also used to estimate the mean genetic distances among HBV sequences of the same genotype using the Kimura 2-parameter model. Bayesian phylogenetic analysis was performed using the birth-death migration model (BDMM; v0.2.0) (49) within the BEAST2 framework (v2.4.7). For that purpose, we analyzed whole-genome sequences as well as pre-S/S sequences complemented with unknown nucleotides where genetic data were missing. We labeled the samples according to their occult/active status, where the occult cases would not be allowed to transmit. Active cases were allowed to produce both active and occult secondary cases. We applied the general time-reversible (GTR) model with empirical base frequencies as the substitution model and fixed the mean substitution rate to 1.9E–4 (50), allowing for some variation in the rate. The Markov chain Monte-Carlo (MCMC) algorithm was run until 60,000,000 states were sampled and 10% of the samples were discarded as burn-in. The trees were logged on every 10,000th MCMC step, and the tree sample was summarized using TreeAnnotator as a maximum-clade-credibility tree, with 20% of samples discarded as burn-in and median heights used as the node heights in the tree. Previously published mutations in HBV strains associated with “escape,” diminished antibody binding, or phenotypic resistance to antiretroviral drugs were predicted and analyzed using the Geno2pheno[HBV] online tool at http://hbv.geno2pheno.org/index.php. HBV serotypes were determined on the basis of predicted amino acids at positions 122, 160, 127, 159, and 140 of the HBsAg sequence as described previously (32) using the Web-based HBV Serotyper tool (51), which can be accessed at http://hvdr.bioinf.wits.ac.za/SmallGenomeTools.

Data availability. All HBV pre-S/S and whole-genome sequences as well as the sequences of fragments of the sHD gene obtained in this study were deposited in GenBank under accession numbers MG821087 to MG821154, MH580614 to MH580652, and MH595493 to MH595495, respectively.
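To illustrate the Kimura 2-parameter distance mentioned in the phylogenetic analysis above, the following Python sketch computes it for a single pair of aligned sequences using the standard formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is not MEGA's implementation: the function name `k2p_distance` is hypothetical, and skipping sites with gaps or ambiguity codes pair by pair is an assumption made here for simplicity.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a gap or a non-ACGT symbol are
    skipped (assumption; other gap treatments are possible).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy 10-nt sequences (made up for illustration, not study data).
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))  # one transition
print(k2p_distance("AAAAAAAAAA", "CAAAAAAAAA"))  # one transversion
```

A mean within-genotype distance, as reported in the text, would be the average of this quantity over all sequence pairs of the same genotype.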

SUPPLEMENTAL MATERIAL

Supplemental material for this article may be found at https://doi.org/10.1128/mSystems.00120-18.

FIG S1, TIF file, 0.6 MB.
FIG S2, TIF file, 2 MB.
FIG S3, PDF file, 0.4 MB.
TABLE S1, PDF file, 0.1 MB.
TABLE S2, PDF file, 0.1 MB.
TABLE S3, PDF file, 0.3 MB.

September/October 2018 Volume 3 Issue 5 e00120-18 msystems.asm.org 13 Pinho-Nascimento et al.

REFERENCES

1. WHO. 2017. Hepatitis B fact sheet no. 204. WHO, Geneva, Switzerland. www.who.int/mediacentre/factsheets/fs204/en/. Accessed May 2017.
2. WHO. 2017. Global hepatitis report, 2017. WHO, Geneva, Switzerland.
3. Spearman CW, Afihene M, Ally R, Apica B, Awuku Y, Cunha L, Dusheiko G, Gogela N, Kassianides C, Kew M, Lam P, Lesi O, Lohoues-Kouacou MJ, Mbaye PS, Musabeyezu E, Musau B, Ojo O, Rwegasha J, Scholz B, Shewaye AB, Tzeuton C, Sonderup MW, Gastroenterology and Hepatology Association of Sub-Saharan Africa. 2017. Hepatitis B in sub-Saharan Africa: strategies to achieve the 2030 elimination targets. Lancet Gastroenterol Hepatol 2:900–909. https://doi.org/10.1016/S2468-1253(17)30295-9.
4. Seo DH, Whang DH, Song EY, Han KS. 2015. Occult hepatitis B virus infection and blood transfusion. World J Hepatol 7:600–606. https://doi.org/10.4254/wjh.v7.i3.600.
5. Bigna JJ, Amougou MA, Asangbeh SL, Kenne AM, Noumegni SRN, Ngo-Malabo ET, Noubiap JJ. 2017. Seroprevalence of hepatitis B virus infection in Cameroon: a systematic review and meta-analysis. BMJ Open 7:e015298. https://doi.org/10.1136/bmjopen-2016-015298.
6. Ducancelle A, Abgueguen P, Birguel J, Mansour W, Pivert A, Le Guillou-Guillemette H, Sobnangou JJ, Rameau A, Huraux JM, Lunel-Fabiani F. 2013. High endemicity and low molecular diversity of hepatitis B virus infections in pregnant women in a rural district of North Cameroon. PLoS One 8:e80346. https://doi.org/10.1371/journal.pone.0080346.
7. Noubiap JJ, Nansseu JR, Ndoula ST, Bigna JJ, Jingi AM, Fokom-Domgue J. 2015. Prevalence, infectivity and correlates of hepatitis B virus infection among pregnant women in a rural district of the far North region of Cameroon. BMC Public Health 15:454. https://doi.org/10.1186/s12889-015-1806-2.
8. Foupouapouognigni Y, Mba SA, Betsem A, Betsem E, Rousset D, Froment A, Gessain A, Njouom R. 2011. Hepatitis B and C virus infections in the three pygmy groups in Cameroon. J Clin Microbiol 49:737–740. https://doi.org/10.1128/JCM.01475-10.
9. Zoufaly A, Onyoh EF, Tih PM, Awasom CN, Feldt T. 2012. High prevalence of hepatitis B and syphilis co-infections among HIV patients initiating antiretroviral therapy in the north-west region of Cameroon. Int J STD AIDS 23:435–438. https://doi.org/10.1258/ijsa.2011.011279.
10. Kfutwah AK, Tejiokem MC, Njouom R. 2012. A low proportion of HBeAg among HBsAg-positive pregnant women with known HIV status could suggest low perinatal transmission of HBV in Cameroon. Virol J 9:62. https://doi.org/10.1186/1743-422X-9-62.
11. Frambo AA, Atashili J, Fon PN, Ndumbe PM. 2014. Prevalence of HBsAg and knowledge about hepatitis B in pregnancy in the Buea health district, Cameroon: a cross-sectional study. BMC Res Notes 7:394. https://doi.org/10.1186/1756-0500-7-394.
12. Kramvis A. 2014. Genotypes and genetic variability of hepatitis B virus. Intervirology 57:141–150. https://doi.org/10.1159/000360947.
13. Kramvis A, Kew MC. 2007. Epidemiology of hepatitis B virus in Africa, its genotypes and clinical associations of genotypes. Hepatol Res 37:S9–S19. https://doi.org/10.1111/j.1872-034X.2007.00098.x.
14. Andernach IE, Hubschen JM, Muller CP. 2009. Hepatitis B virus: the genotype E puzzle. Rev Med Virol 19:231–240. https://doi.org/10.1002/rmv.618.
15. Forbi JC, Ben-Ayed Y, Xia GL, Vaughan G, Drobeniuc J, Switzer WM, Khudyakov YE. 2013. Disparate distribution of hepatitis B virus genotypes in four Sub-Saharan African countries. J Clin Virol 58:59–66. https://doi.org/10.1016/j.jcv.2013.06.028.
16. Pourkarim MR, Amini-Bavil-Olyaee S, Lemey P, Maes P, Van Ranst M. 2010. Are hepatitis B virus “subgenotypes” defined accurately? J Clin Virol 47:356–360. https://doi.org/10.1016/j.jcv.2010.01.015.
17. Pourkarim MR, Amini-Bavil-Olyaee S, Lemey P, Maes P, Van Ranst M. 2011. HBV subgenotype misclassification expands quasi-subgenotype A3. Clin Microbiol Infect 17:947–949. https://doi.org/10.1111/j.1469-0691.2010.03374.x.
18. Seeger C, Mason WS. 2000. Hepatitis B virus biology. Microbiol Mol Biol Rev 64:51–68. https://doi.org/10.1128/MMBR.64.1.51-68.2000.
19. Pollicino T, Cacciola I, Saffioti F, Raimondo G. 2014. Hepatitis B virus PreS/S gene variants: pathobiology and clinical implications. J Hepatol 61:408–417. https://doi.org/10.1016/j.jhep.2014.04.041.
20. WHO. 2016. Hepatitis D fact sheet. WHO, Geneva, Switzerland. www.who.int/mediacentre/factsheets/hepatitis-d/en/. Accessed May 2017.
21. Negro F. 2014. Hepatitis D virus coinfection and superinfection. Cold Spring Harb Perspect Med 4:a021550. https://doi.org/10.1101/cshperspect.a021550.
22. Stockdale AJ, Chaponda M, Beloukas A, Phillips RO, Matthews PC, Papadimitropoulos A, King S, Bonnett L, Geretti AM. 2017. Prevalence of hepatitis D virus infection in sub-Saharan Africa: a systematic review and meta-analysis. Lancet Glob Health 5:e992–e1003. https://doi.org/10.1016/S2214-109X(17)30298-X.
23. Niro GA, Casey JL, Gravinese E, Garrubba M, Conoscitore P, Sagnelli E, Durazzo M, Caporaso N, Perri F, Leandro G, Facciorusso D, Rizzetto M, Andriulli A. 1999. Intrafamilial transmission of hepatitis delta virus: molecular evidence. J Hepatol 30:564–569. https://doi.org/10.1016/S0168-8278(99)80185-8.
24. Luma HN, Eloumou SAFB, Okalla C, Donfack-Sontsa O, Koumitana R, Malongue A, Nko’Ayissi GB, Noah DN. 2017. Prevalence and characteristics of hepatitis delta virus infection in a tertiary hospital setting in Cameroon. J Clin Exp Hepatol 7:334–339. https://doi.org/10.1016/j.jceh.2017.05.010.
25. Foupouapouognigni Y, Noah DN, Sartre MT, Njouom R. 2011. High prevalence and predominance of hepatitis delta virus genotype 1 infection in Cameroon. J Clin Microbiol 49:1162–1164. https://doi.org/10.1128/JCM.01822-10.
26. Deny P. 2006. Hepatitis delta virus genetic variability: from genotypes I, II, III to eight major clades? Curr Top Microbiol Immunol 307:151–171.
27. Cunha C, Tavanez JP, Gudima S. 2015. Hepatitis delta virus: a fascinating and neglected pathogen. World J Virol 4:313–322. https://doi.org/10.5501/wjv.v4.i4.313.
28. Taylor JM. 2012. Virology of hepatitis D virus. Semin Liver Dis 32:195–200. https://doi.org/10.1055/s-0032-1323623.
29. Kramvis A, Restorp K, Norder H, Botha JF, Magnius LO, Kew MC. 2005. Full genome analysis of hepatitis B virus genotype E strains from South-Western Africa and Madagascar reveals low genetic variability. J Med Virol 77:47–52. https://doi.org/10.1002/jmv.20412.
30. Zuckerman JN, Zuckerman AJ. 2003. Mutations of the surface protein of hepatitis B virus. Antiviral Res 60:75–78. https://doi.org/10.1016/j.antiviral.2003.08.013.
31. Kimbi GC, Kew MC, Kramvis A. 2012. The effect of the G1888A mutation of subgenotype A1 of hepatitis B virus on the translation of the core protein. Virus Res 163:334–340. https://doi.org/10.1016/j.virusres.2011.10.024.
32. Purdy MA, Talekar G, Swenson P, Araujo A, Fields H. 2007. A new algorithm for deduction of hepatitis B surface antigen subtype determinants from the amino acid sequence. Intervirology 50:45–51. https://doi.org/10.1159/000096312.
33. Bekondi C, Zanchi R, Seck A, Garin B, Giles-Vernick T, Gody JC, Bata P, Pondy A, Tetang SM, Ba M, Ekobo CS, Rousset D, Sire JM, Maylin S, Chartier L, Njouom R, Vray M. 2015. HBV immunization and vaccine coverage among hospitalized children in Cameroon, Central African Republic and Senegal: a cross-sectional study. BMC Infect Dis 15:267. https://doi.org/10.1186/s12879-015-1000-2.
34. Ampah KA, Pinho-Nascimento CA, Kerber S, Asare P, De-Graft D, Adu-Nti F, Paixão ICNP, Niel C, Yeboah-Manu D, Pluschke G, Röltgen K. 2016. Limited genetic diversity of hepatitis B virus in the general population of the Offin River Valley in Ghana. PLoS One 11:e0156864. https://doi.org/10.1371/journal.pone.0156864.
35. Ott JJ, Stevens GA, Groeger J, Wiersma ST. 2012. Global epidemiology of hepatitis B virus infection: new estimates of age-specific HBsAg seroprevalence and endemicity. Vaccine 30:2212–2219. https://doi.org/10.1016/j.vaccine.2011.12.116.
36. Marcus EL, Tur-Kaspa R. 1997. Viral hepatitis in older adults. J Am Geriatr Soc 45:755–763. https://doi.org/10.1111/j.1532-5415.1997.tb01484.x.
37. Menendez C, Sanchez-Tapias JM, Kahigwa E, Mshinda H, Costa J, Vidal J, Acosta C, Lopez-Labrador X, Olmedo E, Navia M, Tanner M, Rodes J, Alonso PL. 1999. Prevalence and mother-to-infant transmission of hepatitis viruses B, C, and E in southern Tanzania. J Med Virol 58:215–220. https://doi.org/10.1002/(SICI)1096-9071(199907)58:3<215::AID-JMV5>3.0.CO;2-K.
38. Hughes SA, Wedemeyer H, Harrison PM. 2011. Hepatitis delta virus. Lancet 378:73–85. https://doi.org/10.1016/S0140-6736(10)61931-9.
39. Chow SK, Atienza EE, Cook L, Prince H, Slev P, Lape-Nixon M, Jerome KR. 2016. Comparison of enzyme immunoassays for detection of antibodies to hepatitis D virus in serum. Clin Vaccine Immunol 23:732–734. https://doi.org/10.1128/CVI.00028-16.
40. Chisari FV, Ferrari C. 1995. Hepatitis B virus immunopathogenesis. Annu Rev Immunol 13:29–60. https://doi.org/10.1146/annurev.iy.13.040195.000333.
41. Kim H, Lee SA, Do SY, Kim BJ. 2016. Precore/core region mutations of hepatitis B virus related to clinical severity. World J Gastroenterol 22:4287–4296. https://doi.org/10.3748/wjg.v22.i17.4287.
42. Aberra H, Desalegn H, Berhe N, Medhin G, Stene-Johansen K, Gundersen SG, Johannessen A. 2017. Early experiences from one of the first treatment programs for chronic hepatitis B in sub-Saharan Africa. BMC Infect Dis 17:438. https://doi.org/10.1186/s12879-017-2549-8.
43. Bratschi MW, Bolz M, Minyem JC, Grize L, Wantong FG, Kerber S, Njih Tabah E, Ruf MT, Mou F, Noumen D, Um Boock A, Pluschke G. 2013. Geographic distribution, age pattern and sites of lesions in a cohort of Buruli ulcer patients from the Mape Basin of Cameroon. PLoS Negl Trop Dis 7:e2252. https://doi.org/10.1371/journal.pntd.0002252.
44. Roltgen K, Bratschi MW, Ross A, Aboagye SY, Ampah KA, Bolz M, Andreoli A, Pritchard J, Minyem JC, Noumen D, Koka E, Um Boock A, Yeboah-Manu D, Pluschke G. 2014. Late onset of the serological response against the 18 kDa small heat shock protein of Mycobacterium ulcerans in children. PLoS Negl Trop Dis 8:e2904. https://doi.org/10.1371/journal.pntd.0002904.
45. Gunther S, Li BC, Miska S, Kruger DH, Meisel H, Will H. 1995. A novel method for efficient amplification of whole hepatitis B virus genomes permits rapid functional analysis and reveals deletion mutants in immunosuppressed patients. J Virol 69:5437–5444.
46. Valente F, Lago BV, Castro CA, Almeida AJ, Gomes SA, Soares CC. 2010. Epidemiology and molecular characterization of hepatitis B virus in Luanda, Angola. Mem Inst Oswaldo Cruz 105:970–977. https://doi.org/10.1590/S0074-02762010000800004.
47. Botelho-Souza LF, Souza Vieira D, de Oliveira Dos Santos A, Cunha Pereira AV, Villalobos-Salcedo JM. 2015. Characterization of the genotypic profile of hepatitis Delta virus: isolation of HDV genotype-1 in the western Amazon region of Brazil. Intervirology 58:166–171. https://doi.org/10.1159/000431040.
48. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. 2013. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30:2725–2729. https://doi.org/10.1093/molbev/mst197.
49. Kuhnert D, Stadler T, Vaughan TG, Drummond AJ. 2016. Phylodynamics with migration: a computational framework to quantify population structure from genomic data. Mol Biol Evol 33:2102–2116. https://doi.org/10.1093/molbev/msw064.
50. Andernach IE, Hunewald OE, Muller CP. 2013. Bayesian inference of the evolution of HBV/E. PLoS One 8:e81690. https://doi.org/10.1371/journal.pone.0081690.
51. Bell TG, Kramvis A. 2015. Bioinformatics tools for small genomes, such as hepatitis B virus. Viruses 7:781–797. https://doi.org/10.3390/v7020781.

APPENDIX

S1 Fig: Phylogenetic reconstruction of HBV whole-genome sequences from Mbandji 2 and of worldwide origin. A maximum-likelihood phylogenetic tree of 35 HBV/E (blue dots) and 4 HBV/A (red dots) sequences from Mbandji 2 together with 53 publicly available sequences of worldwide origin (accession numbers are given in the graph) covering all HBV genotypes was constructed under the GTR+G+I model embedded in MEGA 6.0. The tree is drawn to scale, with branch lengths measured according to the number of substitutions per site. The tree was rooted using the sequence of a woolly monkey HBV as an outgroup. Bootstrap values (≥80%) are shown at branch nodes.

Table S1. Primers used for HBV (whole genome and pre-S/S) and HDV (sHD gene fragment) nucleotide sequencing.

Sequencing target       Primer direction   Primer ID (Reference)   5′-Sequence-3′
HBV whole genome        Forward            C1 (1)                  CTGTGGAGTTACTCTCGTTTTTGC
HBV whole genome        Forward            P01 (2)                 GGACTCATAAGGTGGGGAA
HBV whole genome        Forward            P1 (3)                  CCGGAAAGCTTGAGCTCTTCTTTTTCACCTCTGCCTAATCA
HBV whole genome        Forward            PS1 (1)                 CCATATTCTTGGGAACAAGA
HBV whole genome        Forward            PS4 (4)                 ACACTCATCCTCAGGCCATGCAGTG
HBV whole genome        Forward            S1 (2)                  CTTCTCGAGGACTGGGGACC
HBV whole genome        Forward            S4 (5)                  TGCTGCTATGCCTCATCTTCT
HBV whole genome        Forward            S18 (5)                 GGATGATGTGGTATTGGGGGCCA
HBV whole genome        Forward            X1 (1)                  ACCTCCTTTCCATGGCTGCT
HBV whole genome        Forward            X5 (2)                  ACTCTTGGACTCBCAGCAATG
HBV whole genome        Reverse            C8 (2)                  GAGGGAGTTCTTCTTCTAGG
HBV whole genome        Reverse            P2 (3)                  CCGGAAAGCTTGAGCTCTTCAAAAAGTTGCATGGTGCTGG
HBV whole genome        Reverse            P3 (2)                  AAAGCCCAAAAGACCCACAA
HBV whole genome        Reverse            PS2 (1)                 GGTCCCCAGTCCTCGAGAAG
HBV whole genome        Reverse            PS8 (4)                 TTCCTGAACTGGAGCCACCA
HBV whole genome        Reverse            S2 (5)                  GGGTTTAAATGTATACCCAAAGA
HBV PreS/S              Forward            PS1 (1)                 CCATATTCTTGGGAACAAGA
HBV PreS/S              Forward            PS4 (4)                 ACACTCATCCTCAGGCCATGCAGTG
HBV PreS/S              Reverse            PS8 (4)                 TTCCTGAACTGGAGCCACCA
HBV PreS/S              Reverse            S2 (5)                  GGGTTTAAATGTATACCCAAAGA
HBV S                   Forward            PS1a (4)                GGAAAACATCACATCAGGAT
HBV S                   Forward            PS1b (4)                AAAATTCGCAGTCCCCAACC
HBV S                   Reverse            P3 (2)                  AAAGCCCAAAAGACCCACAA
HDV sHD gene fragment   Forward            HDV-E (6)               GAGATGCCATGCCGACCCGAAGAG
HDV sHD gene fragment   Reverse            HDV-A (6)               GAAGGAAGGCCCTCGAGAACAAGA

Primer sequences were published previously by:

(1) Niel C, Moraes MT, Gaspar AM, Yoshida CF, Gomes SA. 1994. J Med Virol 44:180-6.
(2) de Pina-Araujo IIM, Spitz N, Soares CC, Niel C, Lago BV, Gomes SA. 2018. PLoS One 13:e0192595.
(3) Gunther S, Li BC, Miska S, Kruger DH, Meisel H, Will H. 1995. J Virol 69:5437-44.
(4) Ampah KA, Pinho-Nascimento CA, Kerber S, Asare P, De-Graft D, Adu-Nti F, Paixao IC, Niel C, Yeboah-Manu D, Pluschke G, Roltgen K. 2016. PLoS One 11:e0156864.
(5) Bottecchia M, Souto FJ, O KM, Amendola M, Brandao CE, Niel C, Gomes SA. 2008. BMC Microbiol 8:11.
(6) Casey JL, Brown TL, Colan EJ, Wignall FS, Gerin JL. 1993. Proc Natl Acad Sci U S A 90:9016-20.
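Two relationships among the primers in Table S1 can be checked with a short Python sketch. The helper names `reverse_complement` and `shared_prefix` are hypothetical and are not part of the authors' workflow; the sequences are copied from the table above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def shared_prefix(a, b):
    """Longest common 5' prefix of two sequences."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return a[:length]


# Sequences as listed in Table S1.
S1 = "CTTCTCGAGGACTGGGGACC"
PS2 = "GGTCCCCAGTCCTCGAGAAG"
P1 = "CCGGAAAGCTTGAGCTCTTCTTTTTCACCTCTGCCTAATCA"
P2 = "CCGGAAAGCTTGAGCTCTTCAAAAAGTTGCATGGTGCTGG"

# Forward primer S1 and reverse primer PS2 cover the same site on opposite strands.
print(reverse_complement(S1) == PS2)

# The whole-genome primers P1 and P2 share a common 5' tail.
tail = shared_prefix(P1, P2)
print(tail, len(tail))
```

With the sequences as listed, PS2 is the exact reverse complement of S1, and P1 and P2 share the 20-nt 5′ tail CCGGAAAGCTTGAGCTCTTC. Primers containing ambiguity codes (for example the B in X5) would need an extended complement table.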

Table S2. SNPs and variations in the predicted HBV/E and HBV/A amino acid sequences obtained in this study.

2a) Pre-S/S region of the 42 HBV/E sequences:

Nucleotide position   Amino acid position   Nucleotides   Amino acids   No. of sequences   Amino acid change

G/A G 39/2 No 6 2 G/T G 39/1 No 21 7 C/T V 41/1 No 27 9 C/A L 39/3 No 43 15 C/T H/Y 39/3 Yes 48 16 C/T S 39/3 No 76 26 G/A D/N 41/1 Yes 96 32 A/C A 36/6 No 100 34 A/C R 41/1 No 101 34 G/A R/K 41/1 Yes 112 38 A/C R 41/1 No 113 38 G/A R/K 33/9 Yes 117 39 T/C N 40/2 No 129 43 C/T D 41/1 No G/A E/K 40/1 Yes 157 53 G/C E/Q 40/1 Yes* Pre-S1 159 53 A/G E/Q 41/1 Yes* 198 66 C/T F 41/1 No 207 69 A/C P 38/4 No 225 75 G/A G 41/1 No 232 78 C/A P/T 41/1 Yes 254 85 A/C K/T 41/1 Yes 257 86 C/T T/M 25/17 Yes 258 86 A/G T 21/21 No 259 87 T/G L/V 38/4 Yes 265 89 G/A A/T 32/10 Yes 273 91 G/T P 41/1 No 285 95 C/A S 41/1 No 297 99 G/A Q 41/1 No 321 107 C/A I 40/2 No 326 109 C/T P/L 40/2 Yes T/G T 38/3 No 342 114 T/C T 38/1 No 350 117 A/T Q/L 41/1 Yes

355 119 A/G M/V 41/1 Yes 357 119 G/A M/I 41/1 Yes Pre-S2 359 120 A/G Q/R 41/1 Yes 360 120 G/A Q 41/1 No 370 124 A/C T/P 41/1 Yes 373 125 A/G T/A 32/10 Yes 385 129 G/A A/T 41/1 Yes 393 131 G/A Q 41/1 No 401 134 G/A R/K 41/1 Yes 407 136 G/A/(Del) R/N/(Del) 40/1/(1) Yes 408 136 A/G/(Del) R/(Del) 38/1/(2) No 418 140 T/C/(Del) F/L/(Del) 29/1/(12) Yes 421 141 C/A P/T 41/1 Yes 440 147 C/T S/F 41/1 Yes 449 150 T/C V/A 40/2 Yes 466 156 A/T T/S 41/1 Yes 470 157 C/T A/V 38/4 Yes 509 170 C/T P/L 41/1 Yes 513 171 A/G A 41/1 No 515 172 C/G P/R 38/4 Yes 516 172 G/A P 41/1 No

527 176 G/A S/N 41/1 Yes 541 181 T/C F/L 41/1 Yes 555 185 G/A L 41/1 No 606 202 G/A P 41/1 No 607 203 C/A Q/K 41/1 Yes 652 218 G/A A/T 41/1 Yes 665 222 T/G L/R 41/1 Yes 678 226 G/A S 41/1 No 686 229 C/A P/Q 41/1 Yes 689 230 C/T T/I 41/1 Yes 692 231 C/G S/C 41/1 Yes 695 232 A/G N/S 22/20 Yes 722 241 T/C I/T 41/1 Yes 723 241 T/C I 38/4 No 732 244 C/T G 39/3 No S-region 734 245 A/T Y/F 41/1 Yes 750 250 G/A L 41/1 No 765 255 C/A I 39/3 No 843 281 T/C P 41/1 No 853 285 G/A G/R 41/1 Yes

894 298 G/A T 41/1 No a-determinant 898 300 C/A L 41/1 No 899 300 T/C L/P 41/1 Yes 933 311 C/T C 41/1 No 938 313 C/T S/L 41/1 Yes 1032 344 C/T S 41/1 No 1085 362 C/T T/I 40/2 Yes 1127 376 C/A P/Q 41/1 Yes 1161 387 T/G P 40/2 No 1196 399 T/A I/N 41/1 Yes 1197 399 T/C I 25/17 No 2b) Pre-C/C region of the 35 HBV/E whole genome sequences:

Nucleotide Amino acid Nucleotides Amino No. of Amino acid position position acids sequences change 13 5 C/G H/D 34/1 Yes Pre-core 34 12 T/C C/R 34/1 Yes 83 28 G/A W/Stop 33/2 Yes 86 29 G/A G/D 32/3 Yes 96 32 T/C I 34/1 No 121 41 A/T T/S 34/1 Yes 125 42 T/A V/E 34/1 Yes 126 42 G/C V 33/2 No 138 46 G/A S 32/3 No 148 50 T/C S/P 34/1 Yes 156 52 C/T F 31/4 No 170 57 G/A R/K 34/1 Yes 186 62 C/T T 34/1 No 187 63 G/A A 34/1 No 188 63 C/T A/I 34/1 Yes 200 67 A/T Y/F 33/2 Yes 220 74 C/T P/S 34/1 Yes 231 77 T/C C 34/1 No 240 80 C/T H 34/1 No 243 81 T/C H 19/16 No 249 83 A/T A 33/2 No 250 84 C/A L/I 34/1 Yes Core 276 92 G/A G 34/1 No 277 93 G/A E/N 34/1 Yes* A/C E/N 32/1 Yes* 279 93 A/C E/D 32/1 Yes A/T E/D 32/1 Yes 280 94 C/T L 34/1 No 282 94 A/T L 34/1 No 287 96 C/A T/N 33/2 Yes 291 97 A/G L 34/1 No 293 98 C/G A/G 34/1 Yes* T/C A 29/5 No 294 98 T/A A/G 29/1 Yes* 306 102 T/G G 32/3 No T/G V/G 33/1 Yes 308 103 T/C V/A 33/1 Yes 312 104 T/C N 34/1 No 323 108 C/A P/Q 34/1 Yes 326 109 C/G A/G 34/1 Yes 336 112 C/A D/E 34/1 Yes 345 115 T/C V 19/16 No 354 118 C/T V 31/4 No 361 121 A/C N/H 34/1 Yes 363 121 T/C N 34/1 No 369 123 C/A G 31/4 No 373 125 A/C K/Q 34/1 Yes 381 127 G/A R 30/5 No 388 130 T/A L/M 34/1 Yes 402 134 T/G I/M 34/1 Yes 409 137 C/A L/I 34/1 Yes 411 137 C/T L 31/4 No 413 138 C/T T/I 34/1 Yes 420 140 A/G G 34/1 No 423 141 A/G R 34/1 No 427 143 A/C T/L 34/1 Yes* 428 143 C/T T/L 34/1 Yes* 429 143 C/T T 20/15 No A/G I/V 33/1 Yes 433 145 A/T I/L 33/1 Yes 447 149 G/A V 34/1 No 476 159 C/A P/Q 33/2 Yes 477 159 A/C P 34/1 No 483 161 C/T Y 18/17 No 489 163 A/G P 34/1 No 491 164 C/A P/Q 34/1 Yes 519 173 G/T P 34/1 No 524 175 A/C N/T 31/4 Yes A/T T/S 33/1 Yes 526 176 A/G T/A 33/1 Yes 538 180 C/T R/S 34/1 Yes* 539 180 G/C R/S 34/1 Yes* 541 181 A/C R 32/3 No 553 185 C/T P/S 34/1 Yes 610 204 A/C R 34/1 No 625 209 G/T A/S 34/1 Yes T/G S/A 32/2 Yes 628 210 T/C S/P 32/1 Yes 632 211 A/G Q/R 34/1 Yes

2c) X region of the 35 HBV/E whole genome sequences:

Nucleotide Amino acid Nucleotides Amino No. of Amino acid position position acids sequences change 34 12 G/T A/S 34/1 Yes 37 13 A/C R 34/1 No 64 22 A/G S/G 32/3 Yes 65 22 G/A S/N 34/1 Yes 73 25 T/C S/P 32/3 Yes 88 30 G/A V/I 33/2 Yes 101 34 T/C L/P 31/4 Yes 103 35 G/A G/R 33/2 Yes 107 36 A/G D/G 24/11 Yes 109 37 C/T L 32/3 No 111 37 A/G L 26/9 No 127 43 T/A S/T 27/8 Yes 138 46 A/G P 31/4 No 143 48 A/T D/V 31/4 Yes 154 52 C/A H/N 34/1 Yes 192 64 A/G S 32/3 No 201 67 A/T G 28/7 No 217 73 T/G F/V 34/1 Yes 240 80 G/A E 34/1 No 260 87 A/G Q/R 34/1 Yes 262 88 A/C I/L 19/16 Yes* C/A I/L 31/2 Yes* 264 88 C/A I 31/1 No C/T I 31/1 No 280 94 C/T H/Y 34/1 Yes 340 114 G/A D/N 31/4 Yes T/G F/L 33/1 Yes 351 117 T/C F 33/1 No 354 118 G/A K 32/3 No 366 122 G/A E 34/1 No 368 123 T/C L/S 18/17 Yes 384 128 A/G R 30/5 No 387 129 A/G L 34/1 No 389 130 A/T K/M 33/2 Yes 391 131 G/A V/I 33/2 Yes 436 146 G/T A/S 34/1 Yes 453 151 C/G F/L 34/1 Yes

2d) P region of the 35 HBV/E whole genome sequences:

Nucleotide Amino acid Nucleotides Amino No. of Amino acid position position acids sequences change 26 9 G/T R/L 34/1 Yes 31 11 A/C I/L 31/4 Yes* A/T I/L 33/1 Yes* 33 11 A/G I/L 33/1 Yes* 45 15 C/T D 34/1 No 46 16 G/C E/Q 34/1 Yes 48 16 A/C E/D 32/3 Yes 60 20 C/T P 34/1 No 117 39 A/C E/D 34/1 Yes 132 44 G/T Q/H 34/1 Yes T/G L 32/2 No 135 45 T/C L 32/1 No 139 47 A/G N/D 34/1 Yes 153 51 T/G P 34/1 No 171 57 A/C G 34/1 No 199 67 A/T I/L 31/4 Yes 200 67 T/C I/T 34/1 Yes 201 67 A/T I 34/1 No 205 69 G/A V/I 31/4 Yes 228 76 T/A T 32/3 No 240 80 T/C P 34/1 No 252 84 G/A L 34/1 No 267 89 T/C I 34/1 No 268 90 A/C N/H 34/1 Yes 270 90 C/T N 31/4 No 291 97 T/C G 32/3 No 294 98 T/G P 33/2 No 295 99 C/T L 34/1 No 297 99 A/G L 34/1 No 303 101 A/T V 32/3 No 313 105 C/A R 32/3 No 316 106 A/C R 32/3 No 318 106 A/G R 34/1 No 324 108 C/A N/K 34/1 Yes 330 110 C/T V 30/5 No 331 111 A/C M/L 34/1 Yes 360 120 G/A T 29/6 No A/G L 19/15 No 369 123 A/T L/F 19/1 Yes C/G P 32/2 No 372 124 C/T P 32/1 No 375 125 A/G L 34/1 No 381 127 A/G K 32/3 No 387 129 A/C I 33/2 No 393 131 T/G P 34/1 No 402 134 A/C P 34/1 No 411 137 A/G V 27/8 No 426 142 C/T F 34/1 No 433 145 A/C R 33/2 No 450 150 C/T T 34/1 No 453 151 A/T L 34/1 No 462 154 G/A A 34/1 No 465 155 C/T G 27/8 No 480 160 A/G R 21/14 No 486 162 T/G T 20/15 No 506 169 T/G F/C 19/16 Yes 534 178 G/C E/D 34/1 Yes 547 183 G/A A/T 33/2 Yes 562 188 C/T P/S 34/1 Yes 568 190 C/A R 32/3 No 584 195 C/T S/L 32/3 Yes 589 197 C/T H/Y 33/2 Yes 637 213 A/C I/L 31/4 Yes 641 214 A/C Q/P 34/1 Yes 642 214 G/A Q 34/1 No 653 218 A/C Q/P 34/1 Yes 654 218 G/A Q 28/7 No 658 220 T/C S/P 34/1 Yes 698 233 G/C R/T 34/1 Yes 700 234 A/G S/G 34/1 Yes 739 247 C/T H/Y 34/1 Yes 748 250 A/C T/P 31/4 Yes 773 258 C/A P/H 34/1 Yes 795 265 A/C K/N 34/1 Yes 798 266 C/T N 21/14 No 799 267 G/A V/I 18/17 Yes 800 267 T/G V/S 32/3 Yes 806 269 G/A S/N 27/8 Yes 814 272 G/T A/S 34/1 Yes 838 280 G/A V/I 34/1 Yes 862 288 C/A H/N 33/2 Yes 867 289 C/T S 34/1 No T/G S/A 32/2 Yes 883 295 T/C S/P 32/1 
Yes 891 297 A/T S 34/1 No 896 299 A/G H/R 34/1 Yes 900 300 A/G A 34/1 No 911 304 A/C H/P 34/1 Yes 914 305 A/G N/S 27/8 Yes 926 309 G/A S/N 34/1 Yes 934 312 G/A G/R 34/1 Yes 942 314 G/A Q 34/1 No 948 316 G/A/(Del) K/(Del) 33/1/(1) No 981 327 C/T F 34/1 No 990 330 C/T S 34/1 No 1007 336 A/T Y/F 34/1 Yes 1011 337 C/T C 32/3 No 1050 350 C/T P 34/1 No 1054 352 A/G T/A 34/1 Yes 1056 352 C/G T 31/4 No 1148 383 C/A A/E 34/1 Yes 1206 402 T/G S 34/1 No 1227 409 C/A P 34/1 No 1230 410 C/T N 34/1 No 1233 411 C/G L 34/1 No 1236 412 A/G Q 19/16 No 1263 421 T/C N 34/1 No 1264 422 T/C L 31/4 No 1273 425 C/T L 32/3 No 1275 425 A/T L 34/1 No 1306 436 C/A L/I 33/2 Yes 1384 462 T/C S/P 34/1 Yes 1394 465 G/A R/K 34/1 Yes 1435 479 G/A D/N 34/1 Yes 1439 480 C/A S/Y 34/1 Yes 1440 480 T/C S 34/1 No 1474 492 C/T L 34/1 No 1479 493 C/T F 34/1 No 1573 525 C/T L 34/1 No 1626 542 C/T H 34/1 No 1668 556 C/A A 34/1 No 1702 568 T/G S/A 33/2 Yes 1737 579 T/A H/Q 34/1 Yes 1738 580 T/C L 20/15 No C/T P/S 31/3 Yes 1744 582 C/G P/A 31/1 Yes 1746 582 C/T P 34/1 No 1749 583 C/T N 34/1 No 1759 587 A/C R 31/4 No 1761 587 A/G R 34/1 No 1773 591 C/A S 34/1 No 1774 592 C/T L 34/1 No 1777 593 A/C N/H 20/15 Yes 1788 596 T/C G 32/3 No 1805 602 G/T W/F 34/1 Yes 1806 602 G/T W 34/1 No G/A G 32/2 No 1809 603 G/C G 32/1 No 1812 604 A/T S 33/2 No 1821 607 G/A Q 31/4 No 1830 610 C/T I 34/1 No 1832 611 G/A R/K 18/17 Yes 1834 612 A/T L/M 19/16 Yes 1836 612 G/A L 26/9 No 1872 624 C/T N 34/1 No 1875 625 G/A R 32/3 No 1903 635 A/C I/L 32/3 Yes 1920 640 C/T G 32/3 No 1944 648 T/C C 34/1 No 1950 650 T/C Y 34/1 No 1965 655 T/G P 34/1 No 1968 656 G/A L 34/1 No 1972 658 G/A A/T 30/5 Yes 1974 658 G/A A 34/1 No 1979 660 T/C I/T 24/11 Yes 1984 662 T/G S/A 31/4 Yes 1985 662 C/A S/Y 31/4 Yes 2004 668 C/T F 34/1 No 2022 674 C/T A 34/1 No 2032 678 A/C K/Q 30/5 Yes 2041 681 C/A L/M 34/1 Yes T/A L/Q 33/1 Yes 2042 681 T/G L/R 33/1 Yes 2044 682 A/T N/Y 34/1 Yes 2071 691 C/T P/S 34/1 Yes 2072 691 C/A P/Q 31/4 Yes 2073 691 A/T P 34/1 No 
2088 696 G/A V 32/3 No 2112 704 C/T G 32/3 No 2127 709 A/G I/M 34/1 Yes 2139 713 C/T R 34/1 No 2140 714 A/G M/V 32/3 Yes 2212 738 A/C R 31/4 No 2220 740 A/G G 27/8 No G/A A 20/14 No 2223 741 G/T A 20/1 No 2224 742 A/C K/Q 34/1 Yes 2225 742 A/C K/T 30/5 Yes 2232 744 C/T I 33/2 No 2235 745 G/A G 27/8 No 2238 746 T/A T 18/17 No 2247 749 T/C S 31/4 No 2250 750 C/T V 32/3 No 2274 758 A/C S 34/1 No 2313 771 G/T L 34/1 No 2316 772 A/C R 34/1 No 2343 781 A/G S 32/3 No 2344 782 G/A A/T 34/1 Yes 2352 784 T/C N 32/3 No 2367 789 G/A P 33/2 No 2380 794 T/C L 31/4 No 2382 794 G/A L 33/2 No 2386 796 A/G I/V 24/11 Yes 2388 796 C/T I 32/3 No 2390 797 A/G Y/C 26/9 Yes 2406 802 T/A R 27/8 No 2417 806 A/G Q/R 31/4 Yes 2422 808 A/T T/S 31/4 Yes 2433 811 C/A R 34/1 No 2471 824 A/G H/R 32/3 Yes 2480 827 A/T D/V 28/7 Yes 2496 832 T/G A 34/1 No

2e) Pre-S/S region of the 7 HBV/A sequences:

Nucleotide position   Amino acid position   Nucleotides   Amino acids   No. of sequences   Amino acid change
6 2 A/G G 6/1 No 13 5 T/A L/I 6/1 Yes 96 32 T/C P 6/1 No 159 53 G/A P 4/3 No

183 61 A/C G 6/1 No Pre-S1 217 73 G/A G/S 6/1 Yes 233 78 G/C S/T 6/1 Yes 289 97 A/G T/A 6/1 Yes 305 102 G/A G/E 6/1 Yes 318 106 T/A T 6/1 No 342 114 G/C E/D 4/3 Yes

360 120 G/A M/I 6/1 Yes 363 121 G/A Q 6/1 No 376 126 A/C T/P 6/1 Yes 384 128 C/T H 5/2 No 390 130 T/C A 6/1 No 393 131 G/A L 6/1 No 405 135 G/A R 6/1 No 412 138 G/A G/S 6/1 Yes Pre-S2 414 138 T/C G 4/3 No 421 141 T/C F/L 6/1 Yes 423 141 C/T F 6/1 No 428 143 C/T A/V 5/2 Yes 440 147 A/G N/S 6/1 Yes 461 154 T/C V/A 6/1 Yes 467 156 C/A T/N 4/3 Yes 468 156 C/T T/N 4/3 Yes 470 157 C/T T/I 6/1 Yes 486 162 A/G S 6/1 No

533 178 T/C I/T 6/1 Yes 609 203 G/A P 5/2 No 655 219 T/A S/T 6/1 Yes 813 271 G/T L 6/1 No 930 310 A/C S 6/1 No S-region a a 939 313 T/C C 6/1 No 1103 368 T/C V/A 6/1 Yes 1121 374 A/T Y/F 6/1 Yes 1140 380 C/T Y 6/1 No 1141 381 A/C S/R 6/1 Yes 1142 381 G/A S/N 5/2 Yes 1172 391 C/T P/L 6/1 Yes 1188 396 T/C L 6/1 No 1193 398 C/T A/V 5/2 Yes
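The nucleotide and amino acid position columns of Table S2 appear to be linked by ordinary codon arithmetic within each open reading frame: the amino acid position is the nucleotide position divided by three and rounded up. The Python sketch below checks this for a handful of rows copied from the tables above; the function name `codon_coordinates` is hypothetical, and numbering from the first nucleotide of each region is an assumption inferred from the listed values.

```python
def codon_coordinates(nt_pos):
    """Map a 1-based nucleotide position within an ORF to
    (1-based amino acid position, position within the codon 1..3)."""
    aa_pos = (nt_pos + 2) // 3          # ceiling of nt_pos / 3
    codon_pos = (nt_pos - 1) % 3 + 1
    return aa_pos, codon_pos


# (nucleotide position, amino acid position) pairs taken from Table S2.
rows = [(6, 2), (43, 15), (157, 53), (254, 85), (1197, 399), (1193, 398)]

for nt_pos, listed_aa in rows:
    aa_pos, codon_pos = codon_coordinates(nt_pos)
    print(nt_pos, listed_aa, aa_pos, codon_pos)
```

The codon position is a useful companion value: changes at the third codon position (for example nucleotide 6 or 1197) are the ones most often listed as silent, whereas changes at the first or second position usually alter the amino acid.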

Table S3. Replacements in predicted HBV/E and HBV/A amino acid sequences obtained in this study.

3a) Pre-S/S region of the 42 HBV/E sequences as well as the 17 OBI HBV/E sequences:

Pre -S1 Pre -S2 S-region Site 14 15 26 34 38 53 78 85 86 87 89 109 117 1 2 6 7 11 16 18 19 20 21 22 23 29 32 38 39 52 54 3 8 10 30 45 49 56 57 58 59 68 72 112 127 140 165 189 203 226 Consensus N H D R R E P K T L A P Q M Q T T A R R G L Y F P S V T A P P S F G Q A L P T S S I Y G L S W T P I 177 ...... T . . . . . A ...... --- ...... 188 ...... T . . . . . A ...... --- ...... 190 ...... T . . . . . A ...... --- ...... 193 ...... V ...... V ...... 199 ...... T . . . . . A ...... --- ...... 200 ...... T . . . . . A ...... --- ...... 201 ...... T . . . . . A ...... --- ...... 209 . . . . K . . . M ...... N ...... 210 . . . . K . . . M ...... N ...... 212 . . . . K . . . M ...... N ...... 214 . . N . K K . . M . . . . I ...... N ...... N ...... 225 ...... T . . . . . A ...... --- ...... 226 ...... M ...... N ...... 232 ...... T . . . . . A ...... --- ...... 237 ...... T . . . . . A ...... --- ...... 239 . . . . K . . . M ...... N ...... 240 ...... T T ...... K . ------. F . . . L . . . . K ...... L . . Q . 304 ...... V ...... V ...... 306 ...... M ...... S ...... N ...... 310 ...... T ...... 333 . Y . . K ...... N ...... 334 . Y . . K ...... N ...... 398 ...... ------...... R . . . . . R . . . N . . . I . . I . . 400 ...... M ...... R ...... N ...... 401 ...... M ...... R ...... N ...... 402 ...... M ...... R ...... N ...... 408 . . . . K . . . M ...... N ...... 416 ...... Q I . N . . . P . . . . N 438 ...... 439 ...... 441 ...... 491 ...... V . L ...... A . V ...... I . . 458 ...... V . L ...... A . V ...... 472 ...... M ...... C N T F R ...... 476 ...... M ...... N ...... 490 ...... M ...... N ...... 522 . . . K . . . . M . . . L V ...... N ...... 523 ...... M ...... N ...... 525 ...... M ...... L ...... L . . T . . . . N ...... 619 ...... R P ...... 623 ...... T . . . . . A ...... --- ...... 668 . Y . . K ...... N ...... 181 occult ...... T . . . . . A ...... --- ...... 184occult . . . . K . . . M ...... N ...... 194 occult ...... V ...... V ...... 196 occult ...... 
M ...... N ...... 197 occult D ...... V ...... V ...... 198 occult ...... V ...... V ...... 202 occult ...... T . . . . . A ...... --- ...... 211 occult . . . . K . . . M ...... R ...... N . . . . . R . . . 215 occult . . . . K . . . M ...... N ...... 217 occult ...... M ...... R ...... N ...... 220 occult ...... T . . . . . A ...... --- ...... 233 occult ...... T . . . . . A ...... --- ...... 241 occult ...... T T ...... ------. F . . . L . . . K ...... L . . . . 247 occult ...... T . . . . . A ...... --- ...... 259 occult ...... T . . . . . A ...... --- ...... 311 occult ...... M ...... N ...... 412 occult ......

Blue = amino acid variations exclusively detected in OBI sequences

Yellow = relevant amino acid substitutions discussed in the manuscript

3b) Pre-C/C region of the 35 HBV/E whole genome sequences:

Pre -Core Core Site 5 12 28 29 12 13 21 28 34 38 45 55 64 67 69 74 79 80 83 92 96 101 105 108 109 114 116 130 135 146 147 151 156 180 181 182 Consensus H C W G T V S R A Y P L E T A V P A D N K L I L T T I P P N T R P A S Q 177 ...... 188 ...... 190 ...... 193 ...... 199 ...... 200 ...... 201 ...... 209 ...... 210 ...... 226 ...... 232 ...... 239 ...... 240 D . . D ...... S . D N G G ...... Q . T S . . . . . 304 ...... 306 ...... 310 ...... 333 ...... 334 ...... 398 ...... D ...... Q . T ...... 400 . R ...... 401 ...... 402 ...... 408 ...... 416 ...... 438 ...... A . 439 ...... A . 458 ...... 472 . . * D S . . . . F . . . . . A . G E H . M . . . L V . Q T A S S . P R 476 ...... 490 ...... 522 . . * D . . P K I F . I N N . . Q . . . Q . M I I . L . . T . . . S . . 523 ...... 619 . . . . E ...... 623 ...... 668 ......

3c) X region of the 35 HBV/E whole genome sequences:

Site 12 22 25 30 34 35 36 43 48 52 73 87 88 94 114 117 123 130 131 146 151 Consensus A S S V L G D S D H F Q I H D F L K V A F 177 ...... T . . . . L . . . S . . . . 188 ...... T . . . . L . . . S . . . . 190 ...... T . . . . L . . . S . . . . 193 . G . I ...... L . . . S . . . . 199 ...... T . . . . L . . . S . . . . 200 ...... T . . . . L . . . S . . . . 201 ...... T . . . . L . . . S . . . . 209 ...... G ...... 210 ...... G ...... S . 226 . . . . P . . . V ...... 232 ...... T . . . . L . . . S . . . . 239 ...... G ...... 240 ...... R L . . . S M I . L 304 . G . I ...... L . . . S . . . . 306 . . . . P . . . V ...... 310 ...... N . . L . . . S . . . . 333 . . P ...... 334 . . P ...... 398 ...... G ...... Y N . . M I . . 400 ...... G ...... N ...... 401 ...... G ...... N ...... 402 ...... G ...... N ...... 408 ...... G ...... 416 ...... 438 . . . . . R ...... L . . . S . . . . 439 . . . . . R ...... L . . . S . . . . 458 . G ...... L . . . S . . . . 472 S . . . . . G . . . V ...... 476 . . . . P . . . V ...... 490 ...... G ...... 522 ...... G ...... L S . . . . 523 . . . . P . . . V ...... 619 . N ...... L . . . S . . . . 623 ...... T . . . . L . . . S . . . . 668 . . P ......

3d) P region of the 35 HBV/E whole genome sequences:

Site 9 11 16 39 44 47 67 69 90 108 111 123 169 178 183 188 195 197 213 214 218 220 233 234 247 250 258 265 267 269 272 280 288 295 299 304 305 309 312 316 317 318 319 320 321 336 352 383 436 Consensus R I E E Q N I V N N M L F E A S S H I Q Q S R S H T P K V S A V H S H H N S G K R P V F S Y T A L 177 ...... C ...... I N ...... S ...... --- . . . . . 188 ...... C ...... I N ...... S ...... --- . . . . . 190 ...... C ...... I N ...... S ...... --- . . . . . 193 . . D ...... C ...... S ...... 199 ...... C ...... I N ...... S ...... --- . . . . . 200 ...... C ...... I N ...... S ...... --- . . . . . 201 ...... C ...... I N ...... S ...... --- . . . . . 209 ...... L ...... N ...... 210 ...... L ...... 226 ...... T I ...... P ...... 232 ...... C ...... I N ...... S ...... --- . . . . . 239 ...... L ...... N ...... 240 . L ...... C ...... H N I . . . . P ...... ------. . . E . 304 . . D ...... C ...... S ...... 306 ...... I ...... P ...... F . . . 310 ...... C D ...... I ...... N R ...... A . . 333 ...... L ...... 334 ...... L ...... 398 . L . . . . L ...... ------. . . . 400 ...... L ...... 401 ...... L ...... 402 ...... L ...... 408 ...... L ...... 416 ...... H ...... P P . . . Y . . . I ...... I 438 ...... C . . . . Y ...... I . . . . A ...... 439 ...... C . . . . Y ...... I . . . . A ...... 458 . . D ...... C ...... S ...... 472 . L Q . . D ...... P ...... I 476 ...... I ...... P ...... 490 ...... T ...... 522 L L . D H . . . . K L . . . T S ...... T G ...... S . . . R ...... 523 ...... I ...... P ...... 619 ...... F C ...... I . . I . . . P ...... 623 ...... C ...... I N ...... S ...... 668 ...... L ...... --- . . . . .

Site 462 465 479 480 568 579 582 593 602 611 612 635 658 660 662 678 681 682 691 709 714 742 782 796 797 806 808 824 827 Consensus S R D S S H P N W R M I A I S K L N P I M K A I Y Q T H D 177 ...... H . . L . . T ...... C . . . . 188 ...... H . . L . . T ...... C . . . . 190 ...... H . . L . . T ...... C . . . . 193 ...... L . . T ...... V 199 ...... H . . L . . T ...... C . . . . 200 ...... H . . L . . T ...... C . . . . 201 ...... H . . L . . T ...... C . . . . 209 ...... K . L ...... Q . . . . V . R . . . 210 ...... K . L ...... Q . . . . V . R . . . 226 ...... H . K . . . . A ...... S . . 232 ...... H . . L . . T ...... C . . . . 239 ...... K . L ...... Q . . . . V . R . . . 240 ...... F . L ...... 304 ...... L . . T ...... V 306 ...... H . K . . . . A ...... S . . 310 . . . . A . . . . . L . T . . . . . S ...... 333 ...... S H ...... Q ...... R V 334 ...... S H ...... Q ...... R V 398 . . . Y . . . . . K . . T . Y Q . . . . . T . V . . . . . 400 ...... K . . T . Y . . . . . V T . V . . . . . 401 ...... K . . T . Y . . . . . V T . V . . . . . 402 ...... K . . T . Y . . . . . V T . V . . . . . 408 ...... K . L ...... Q . . . . V . R . . . 416 . . N . . Q A ...... Q Y ...... 438 ...... K L ...... 439 ...... K L ...... 458 ...... L ...... V 472 P K ...... K ...... T . V C . . . V 476 ...... H . K . . . . A ...... S . . 490 ...... K ...... V . . . . . 522 ...... K . . . . . Q R . . . . Q . V . . . . . 523 ...... H . K . . . . A ...... S . . 619 . . . . A . . . . . L . . T . . M . . M . . T ...... 623 ...... H . . L . . T ...... C . . . . 668 ...... S H . . . . . T . Q ...... R V

3e) Pre-S/S region of the 7 HBV/A sequences as well as the 2 OBI HBV/A pre-S/S sequences:

Pre -S1 Pre -S2 S-region Site 5 73 78 97 102 114 1 7 19 22 24 28 35 37 38 4 45 106 194 200 207 217 224 Consensus L G S T G E M T G F A N V T T I S V V Y S P A 203 I . T A E D I P . . . S A N I T T . A . N . V 260 . . . . . D . . . . V . . N . . . . . F N . . 299 ...... 420 ...... 443 ...... 446 ...... 471 . S . . . D . . S L V . . N ...... R L V 296 occult ...... I . . . . . 405 occult ...... I . . . . .

Blue = amino acid variation exclusively detected in OBI sequences

Yellow = relevant amino acid substitutions discussed in the manuscript


S2 Fig: Phylogenetic reconstruction of Mbandji 2 HBV sequences from HBsAg-positive individuals and individuals with OBI. A maximum-likelihood phylogenetic tree of the 42 HBV/E and 7 HBV/A (whole-genome and pre-S/S) sequences from HBsAg-positive individuals as well as 17 HBV/E and 2 HBV/A (pre-S/S) sequences from individuals with OBI was constructed using the GTR+G model to illustrate the phylogenetic positions of the occult HBV/E and HBV/A sequences. To analyze whole-genome and pre-S/S sequences at the same time, pre-S/S sequences were complemented with unknown nucleotides (N) for regions of the complete genome flanking the pre-S/S sequences. Occult HBV sequences are in red. Individuals are marked with multicolored dots according to different branches of the phylogenetic tree, and additional demographic information, such as participant ID, household ID, gender, family relationships, and age, is shown. The tree was drawn to scale, with branch lengths measured according to the number of substitutions per site. Bootstrap values (≥80%) are shown at branch nodes.
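The padding step mentioned in the caption (complementing pre-S/S sequences with N so that they can be analyzed together with whole genomes) can be sketched as follows. This is a minimal illustration, not the procedure actually used for the figure: the function name `pad_to_genome`, the toy fragment, and the coordinates are hypothetical, and the sketch assumes a linear coordinate system in which the fragment does not wrap around the origin of the circular HBV genome.

```python
def pad_to_genome(fragment: str, start: int, genome_length: int, unknown: str = "N") -> str:
    """Place `fragment` at 0-based offset `start` and fill both flanks with `unknown`."""
    end = start + len(fragment)
    if start < 0 or end > genome_length:
        raise ValueError("fragment does not fit inside the genome coordinates")
    # Left flank + observed fragment + right flank, all in genome coordinates.
    return unknown * start + fragment + unknown * (genome_length - end)


# Hypothetical toy values, not the study's coordinates or sequences.
GENOME_LENGTH = 20
fragment = "ATGGGAC"
padded = pad_to_genome(fragment, start=5, genome_length=GENOME_LENGTH)
```

After padding, every sequence has the same length, so partial and whole-genome sequences can sit in one alignment; the N positions carry no information about the nucleotide at that site.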

[S3 Fig, panels A and B: time-scaled tree of the Mbandji 2 HBV sequences. Tip labels have the form participantID_householdID_relationship_age_status, where status is "active" or "occult". Panel A: linear time axis running from 500 to 0 years. Panel B: log-scale time axis, with posterior clade probabilities shown at internal nodes.]

S3 Fig: Rooted maximum-clade-credibility tree of Mbandji 2 HBV sequences from HBsAg-positive individuals and individuals with OBI summarized from the Bayesian analysis posterior tree sample. To analyze whole-genome and pre-S/S sequences at the same time, pre-S/S sequences were complemented with unknown nucleotides (N) for regions of the complete genome flanking the pre-S/S sequences. Tree branch length is measured in years. The gray bars indicate the 95% highest posterior density interval of the particular node age (bottom axis). (A) Linear scale; (B) log scale with internal node labels showing the posterior probability of the underlying clade in the tree sample.

8 BAYESIAN ANALYSES MADE UNDERSTANDABLE

Last but not least comes the publication on “Taming the BEAST”, the workshop that the Computational Evolution (cEvo) PhD students first organised in the summer of 2016 with support from post-docs and professors from several groups. The initiative was warmly embraced by the scientific community, and the first edition of the workshop received an overwhelming number of applications, of which we could accept only half. Since the original 2016 workshop in Switzerland, there have already been 5 more workshops in Europe, Australia, New Zealand and North America. However, even though a number of workshops have been organised in different parts of the world, they cannot meet the existing demand for advanced skills in phylogenetics and phylodynamics. Therefore, we decided to set up a web page that hosts all the information on the workshop, including the schedule and the existing tutorials, and that gives BEAST2 developers the option to upload their own tutorials, e.g. when they develop new models. This portal serves as a self-learning platform for users of BEAST2, so that those who cannot attend the workshop still have access to all the materials used for teaching. Phylogenetic and phylodynamic Bayesian analyses can be highly complex to set up and, if set up incorrectly, can produce highly misleading or outright incorrect results. It is therefore vitally important to teach scientists the basics of Bayesian analyses and to ensure that the skills needed to perform complex phylodynamic analyses are not limited to a small group of people. While I highly value the opportunity to be involved in different interesting and relevant projects (e.g. the one described in Chapter 7), a lot of communication was involved in me, as a method expert, joining the project. This would not be necessary if the relevant phylodynamic expertise were present in the research labs where the epidemiological studies are performed and the data are collected. I would love to become replaceable in this sense, which would mean that many people know and understand the implications of the assumptions they are making and can set up analyses correctly and consciously. This work was published in June 2017 in Systematic Biology as an article titled “Taming the BEAST – A Community Teaching Material Resource for BEAST 2”, DOI: 10.1093/sysbio/syx060, of which I am a shared first author. The publisher’s version of the article follows.

Software for Systematics and Evolution

Syst. Biol. 67(1):170–174, 2018 © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] DOI:10.1093/sysbio/syx060 Advance Access publication June 29, 2017

Taming the BEAST—A Community Teaching Material Resource for BEAST 2

JOËLLE BARIDO-SOTTANI [1,2,†], VERONIKA BOŠKOVÁ [1,2,†], LOUIS DU PLESSIS [1,3,†], DENISE KÜHNERT [1,2,4,†], CARSTEN MAGNUS [1,2,†], VENELIN MITOV [1,2,†], NICOLA F. MÜLLER [1,2,†], JŪLIJA PEČERSKA [1,2,†], DAVID A. RASMUSSEN [1,2,†], CHI ZHANG [1,2,†], ALEXEI J. DRUMMOND [5,‡], TRACY A. HEATH [6,‡], OLIVER G. PYBUS [3,‡], TIMOTHY G. VAUGHAN [5,‡], AND TANJA STADLER [1,2,∗,§]

[1] Department of Biosystems Science and Engineering, ETH Zürich, Mattenstrasse 26, 4058 Basel, Switzerland
[2] Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Genopode, 1015 Lausanne, Switzerland
[3] Department of Zoology, University of Oxford, Peter Medawar Building, South Parks Road, Oxford OX1 3SY, UK
[4] Department of Environmental Sciences, ETH Zürich, Universitätsstrasse 16, 8092 Zürich, Switzerland
[5] Centre for Computational Evolution, University of Auckland, New Zealand
[6] Department of Ecology, Evolution, and Organismal Biology, Iowa State University, 2200 Osborn Dr., Ames, IA 50011, USA

∗ Correspondence to be sent to: Department of Biosystems Science and Engineering, ETH Zürich, Mattenstrasse 26, 4058 Basel, Switzerland; E-mail: [email protected].
† These authors contributed equally.
‡ These authors contributed equally.
§ Senior author.

Received 21 December 2016; reviews returned 20 June 2017; accepted 25 June 2017
Associate Editor: David Bryant

Abstract: Phylogenetics and phylodynamics are central topics in modern evolutionary biology. Phylogenetic methods reconstruct the evolutionary relationships among organisms, whereas phylodynamic approaches reveal the underlying diversification processes that lead to the observed relationships. These two fields have many practical applications in disciplines as diverse as epidemiology, developmental biology, palaeontology, ecology, and linguistics. The combination of increasingly large genetic data sets and increases in computing power is facilitating the development of more sophisticated phylogenetic and phylodynamic methods.
Big data sets allow us to answer complex questions. However, since the required analyses are highly specific to the particular data set and question, a black-box method is not sufficient anymore. Instead, biologists are required to be actively involved with modeling decisions during data analysis. The modular design of the Bayesian phylogenetic software package BEAST 2 enables, and in fact enforces, this involvement. At the same time, the modular design enables groups to develop new methods at a rapid rate. A thorough understanding of the models and algorithms used by inference software is a critical prerequisite for successful hypothesis formulation and assessment. In particular, there is a need for more readily available resources aimed at helping interested scientists equip themselves with the skills to confidently use cutting-edge phylogenetic analysis software. These resources will also benefit researchers who do not have access to similar courses or training at their home institutions. Here, we introduce the “Taming the Beast” (https://taming-the-beast.github.io/) resource, which was developed as part of a workshop series bearing the same name, to facilitate the usage of the Bayesian phylogenetic software package BEAST 2. [Bayesian inference; MCMC; phylodynamics; phylogenetics.]

BEAST 2 IN A NUTSHELL

BEAST 2 (Bouckaert et al. 2014) is an open source cross-platform software package for analysing genetic sequences in a Bayesian phylogenetic framework. It occupies the same niche, and thus incorporates many of the same models, as other popular Bayesian evolutionary analyses platforms, including BEAST (Drummond and Rambaut 2007) (which we refer to here as BEAST 1 in order to distinguish it from BEAST 2), MrBayes (Huelsenbeck and Ronquist 2001), and RevBayes (Höhna et al. 2016). Although BEAST 2 is a complete redesign of the BEAST 1 software package, it retains a similar user interface and many core model components, including relaxed molecular clock models (Drummond et al. 2006), Bayesian skyline models for nonparametric coalescent analyses (Drummond et al. 2005; Heled and Drummond 2008), multispecies coalescent inference with *BEAST (Drummond and Heled 2010), and phylogeographical models (Lemey et al. 2009; 2010). Like in BEAST 1, an analysis is set up using input XML files. For most standard analyses, these files can be easily created using a graphical user interface (BEAUti 2).

The key difference in design philosophy between BEAST 1 and BEAST 2 is a greater emphasis in the latter on extensibility, resulting in a modular program built around a set of core components. This allows third-party developers to implement new methods as packages that can be added without rebuilding or redeploying BEAST 2. Through such packages, BEAST 2 provides a growing collection of new models not available in BEAST 1, such as flexible birth–death tree-priors (Stadler et al. 2013; Gavryushkina et al. 2014; Kühnert et al. 2016) and structured coalescent models (Vaughan et al. 2014; De Maio et al. 2015), as well as updates to existing models, such as StarBEAST 2 (Ogilvie and Drummond 2016). A list of available models in BEAST 1 and BEAST 2 can be found at http://beast2.org/beast-features/. (Users should bear in mind that BEAST 2 is modular by design, and thus some third-party packages may not be listed.)

This modular design requires the BEAST 2 user to make active modeling choices, and it is no longer possible to simply perform a “default” analysis. This active involvement opens the door for analyses tailored specifically to particular data sets and questions, greatly increasing the power of the package. However, it also markedly increases the complexity and makes it easier to inadvertently introduce errors or use inappropriate models. This added complexity could also be daunting to novice users and may result in them preferring simpler, but less powerful, software packages. We will now briefly highlight the key steps required from the BEAST 2 user when running a data analysis.

At its core, BEAST 2 estimates rooted phylogenies ($\mathcal{T}$) from genetic sequencing data ($\mathcal{D}$), with branch lengths in units of calendar time (i.e., the phylogenies are time-trees). It concurrently estimates evolutionary parameters ($\theta$), such as the substitution rate, and parameters describing population dynamics ($\eta$), such as speciation/extinction or transmission/recovery rates. For inference, BEAST 2 uses a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution,

$$\Pr[\mathcal{T},\theta,\eta \mid \mathcal{D}] = \frac{\Pr[\mathcal{D} \mid \mathcal{T},\theta]\,\Pr[\mathcal{T} \mid \eta]\,\Pr[\theta]\,\Pr[\eta]}{\Pr[\mathcal{D}]}. \tag{1}$$

The output of an analysis is a log-file containing a sample of the states ($\mathcal{T},\theta,\eta$) visited by the MCMC algorithm. After a so-called burn-in phase, each value ($\mathcal{T},\theta,\eta$) is visited by the chain at a frequency proportional to its posterior probability, so the output of BEAST 2 (after eliminating the burn-in) is a set of samples from the posterior distribution. A recent book (Drummond and Bouckaert 2015) describes the general theory and design behind BEAST 2.

For the user to carry out a successful and correct analysis, several steps need to be performed carefully to analyze the data and answer the research question of interest. The researcher must specify a multileveled (i.e., hierarchical) model with several interacting components, including: (i) a suitable model describing the evolution of the sequence data on a time-tree, including the substitution and molecular-clock models ($\Pr[\mathcal{D} \mid \mathcal{T},\theta]$); (ii) a phylodynamic model describing the growth of the tree over time ($\Pr[\mathcal{T} \mid \eta]$); and (iii) sensible prior distributions for each of the parameters of the evolutionary models ($\Pr[\theta]$ and $\Pr[\eta]$).

In addition to the model components, the researcher must also specify and fine-tune MCMC operators that propose new states for the model parameters ($\mathcal{T},\theta,\eta$). By choosing appropriate proposal algorithms, an MCMC analysis is more likely to sample the posterior distribution efficiently. Finally, once the MCMC chain has sampled a sufficient number of states, the researcher must assess whether the chain has converged and recovered a meaningful signal from the data.

Consequently, the user is challenged with a myriad of choices on the road to a successful analysis. Although many potential pitfalls exist, a simple but solid understanding of the theory behind Bayesian phylogenetic inference can help guide new users through an analysis to reach sound conclusions.

“TAMING THE BEAST” FOR THE USER COMMUNITY

In June 2016, we organized a “Taming the BEAST” workshop in Engelberg, Switzerland, aimed at fostering interaction between BEAST 2 users and developers. The workshop was organized by graduate students and postdoctoral researchers in the Computational Evolution group at ETH Zürich (https://www.bsse.ethz.ch/cevo, with generous financial support from ETH Zürich) and was a mix of lectures by invited speakers (A.J.D., T.A.H., O.G.P., T.G.V., and T.S. were invited speakers.) and hands-on tutorials run by the organisers. (J.B.-S., V.B., L.d.P., D.K., C.M., V.M., N.F.M., J.P., D.A.R., and C.Z. organized the tutorial sessions.) Participants had the opportunity to learn how to use BEAST 2 with help from the developers and to discuss questions specific to their research with other experienced scientists. For the developers, such a workshop provides direct feedback from users on ease-of-use, identifying specific issues and discovering the needs and wishes of the community for future software and methods development.

The workshop was met with great enthusiasm from researchers already using or planning to use BEAST 2, ranging from students to established PIs. (Although originally envisioned for graduate students only, many postdoctoral researchers, some lecturers, and a few professors applied for the workshop as well. Due to the limited capacity and resources, out of 75 applications, we selected 36 participants from 14 countries and 28 universities.) The positive feedback from the participants (see Fig. 1), the overwhelming support from the community and the demand for further workshops have provided motivation to initiate a series of “Taming the BEAST” workshops. At the time of writing, a second successful edition of “Taming the BEAST” was run on Waiheke Island (New Zealand) in February 2017 and a third edition will take place in July 2017 in London. Further editions are planned for 2018 in Switzerland, and for 2019 and 2020 in locations that are yet to be determined. (We secured funding from ETH Zürich to support the workshop series in 2017–2020.) Each workshop is intended as a global event, allowing users and developers from around the world to meet and share knowledge.

[Figure 1 plot: boxplots of responses on a six-level scale (excellent/positively surprised, very good/very satisfied, good/satisfied, poor/unsatisfied, very poor/very unsatisfied, unacceptable/disappointed) for five questions: overall level of satisfaction, expectations met, helpfulness for future research, confidence with BEAST 2 (before), and confidence with BEAST 2 (after).]

FIGURE 1. Boxplot showing the feedback received from 35 respondents (out of 36 workshop participants) on 5 feedback questions. Of the 35 respondents, all but 3 indicated that they would definitely recommend the workshop to a colleague.

To ensure these resources are available to the community, we have set up a website (https://taming-the-beast.github.io/) with the same name as the workshop series to serve as a platform for collating a comprehensive and cohesive set of BEAST 2 tutorials (see Fig. 2). By providing a set of well-curated tutorials, “Taming the BEAST” offers researchers the resources necessary to learn how to perform analyses in BEAST 2. In addition to tutorials provided by the BEAST 2 developers, this resource page also contains all of the materials (lecture slides, tutorials, data, and example outputs) used during the first two “Taming the BEAST” workshops in Switzerland and New Zealand. These materials will be updated and extended for future editions of the workshop. Tutorials are released under a license that gives anyone the right to freely use (and modify) tutorials for courses or workshops, as long as appropriate credit is given and the updated material is licensed in the same fashion. (By default we use a Creative Commons Attribution 4.0 license; however, the exact license to be used is determined by the tutorial’s authors.) We hope that these open resources will encourage other research groups/universities to host and organize their own “Taming the BEAST” workshops. As a community resource, the “Taming the BEAST” website will maintain a list of workshops, and tutorial developers are available to provide support to organizers.

FIGURE 2. Structure of the Taming the BEAST web resource as hosted on GitHub. The diagram on the left shows three possibilities for tutorials available on the website. On the diagram, solid lines indicate ownership and dashed lines access. Tutorial 1 is owned by the taming-the-beast organization on GitHub, and does not have any external contributors. Tutorial 2 was created by contributor a, but ownership has been transferred to taming-the-beast. Tutorial 3 was created by contributor b, who has retained ownership. In all three cases, it is essential that at least one of the website administrators has access to the tutorial. The website itself is also hosted on GitHub as a project. When a user visits the website, tutorials appear as on the right of the figure. The left panel contains links to a printable PDF version of the tutorial, the data file (or files) used in the tutorial, example BEAST 2 XML files, example output files and a link to the GitHub repository of the tutorial. Recent changes to the tutorial are also listed.
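The sampling scheme summarised in Equation (1), together with the burn-in and convergence checks described under "BEAST 2 in a nutshell", can be illustrated on a toy problem. The sketch below is not BEAST 2 code and does not reflect how BEAST 2 is implemented. It assumes a one-parameter model (the mean of normally distributed data with known standard deviation and a normal prior) in place of a tree and its parameters, uses a plain random-walk Metropolis proposal, and estimates the effective sample size (ESS) with a simple truncated autocorrelation sum. All function names, data values, and tuning settings are hypothetical.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Unnormalised log posterior: log likelihood plus log prior (numerator of Bayes' rule)."""
    log_lik = sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior


def metropolis(data, n_steps, start, step_sd, seed):
    """Random-walk Metropolis sampler; returns the full chain, burn-in included."""
    rng = random.Random(seed)
    current, current_lp = start, log_posterior(start, data)
    chain = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_posterior(proposal, data)
        # The normalising constant Pr[D] cancels in the acceptance ratio.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain


def effective_sample_size(samples, max_lag=1000):
    """ESS = n / tau, summing autocorrelations until the first non-positive one."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]
    var = sum(c * c for c in centred) / n
    if var == 0.0:
        return 1.0  # a chain that never moved carries one sample's worth of information
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau


# Hypothetical toy data; the chain deliberately starts far from the posterior mode.
data = [4.8, 5.1, 5.3, 4.9, 5.4]
chain = metropolis(data, n_steps=20000, start=-10.0, step_sd=1.0, seed=1)
burn_in = 2000
posterior_sample = chain[burn_in:]  # discard the burn-in phase
estimate = sum(posterior_sample) / len(posterior_sample)
ess = effective_sample_size(posterior_sample)

# Closed-form posterior mean for this conjugate model, used only as a check.
precision = 1.0 / 10.0 ** 2 + len(data) / 1.0 ** 2
analytic_mean = sum(data) / precision
```

Because this toy model is conjugate, the sample mean after burn-in can be compared with the closed-form posterior mean. No such check exists for a real phylogenetic analysis, which is why diagnostics such as the ESS matter there.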

CONTRIBUTING TO TAMING THE BEAST

In keeping with the BEAST 2 design philosophy, we designed the website to have a modular, extensible architecture. Each tutorial is stored in its own GitHub (http://www.github.com) repository, where it is bundled with all of the supporting data and scripts needed to run the tutorial, as well as example output files. This makes it possible for anyone with a GitHub account to raise issues and suggest edits or extensions to tutorials. Similarly, it is also possible for external contributors to submit new tutorials to the website. We provide a template tutorial and comprehensive documentation to help potential contributors get started.

By providing a “Taming the BEAST” platform that allows issues to be raised and content to be edited, we hope that the community will play an active role in curating tutorials. We further envision these resources will continue to grow as the community contributes more tutorials. For instance, the developers of a new BEAST 2 package will be able to add a tutorial for their package to the “Taming the BEAST” site, where it will be accessible in a central location, along with other BEAST 2 tutorials, making it easier for users to become familiar with their package.

Because tutorials are stored in GitHub repositories that track change history, all contributors can receive proper credit for their work. Furthermore, authors of new tutorials can retain ownership of their tutorials after publication. In addition, GitHub tracks traffic to tutorials over time and makes it easy for users to interact with authors, giving authors a measure of their work’s impact within the community. Finally, because of the distributed nature of the website, it is robust to changes in any single repository, making it easy to update or add individual tutorials.

SUMMARY

The tutorials on the “Taming the BEAST” website allow users to learn about the entire BEAST 2 analysis pipeline, with most tutorials focusing on a particular model component or a single BEAST 2 package. The website provides immediate access to the materials that guide users in the application of a range of models to their own data. In addition, there are tutorials on postprocessing, interpreting results, as well as troubleshooting. We will ensure the maintenance of the website and incorporation of new tutorials through two to three responsible people from the Computational Evolution group at ETH Zürich as well as collaborating groups acting as website administrators. The administrators of the website can be reached via [email protected].

We hope that the “Taming the BEAST” platform will allow new BEAST 2 users to accelerate their learning process and to successfully “tame” the BEAST. At the same time, we hope that it will serve as a central repository of teaching materials that will allow BEAST 2 developers and users to exchange knowledge about how to effectively teach the use of BEAST 2. Finally, this platform will hopefully further encourage developers to share their own materials with the wider community.

ACKNOWLEDGMENTS

First and foremost we would like to express our immense gratitude to the community for the overwhelmingly positive response both before the first workshop (in the form of letters of support and interest) and after the workshop (in helping us turn it into a series of recurring workshops). We would also like to thank the BEAST 2 core developers for supporting our initiatives and helping us to run the workshop smoothly, in particular Walter Xie and Remco Bouckaert who tested tutorials and implemented last minute bug-fixes. We further acknowledge generous support from ETH Zürich through the Swiss University Conference (SUK) program. The website architecture is based on Trevor Bedford’s lab website. Many thanks to Trevor for making his code publicly available! O.G.P. wishes to thank Andrew Rambaut for his contributions to lecture slides. Further, we would like to thank the speakers of the second workshop, Simon Ho, David Bryant, Remco Bouckaert, Huw Ogilvie, and David Duchêne, as well as Carmella Lee for organizing the logistics of the second workshop. Finally, we would like to thank David Bryant and an anonymous reviewer for valuable comments on the article.

AUTHOR’S CONTRIBUTIONS

J.B.-S., V.B., L.d.P., V.M., and J.P. wrote and submitted the SUK application for starting the “Taming the BEAST” workshop series, with substantial support of C.M. and D.A.R. The first workshop was organized by the whole Computational Evolution group (led by J.B.-S., V.B., and L.d.P.). J.B.-S., V.B., L.d.P., D.K., C.M., V.M., N.F.M., J.P., D.A.R., C.Z., A.J.D., T.A.H., O.G.P., T.G.V., and T.S. wrote the tutorials and/or lecture slides for teaching. L.d.P. created the figures, set up the web resource and GitHub repositories and is the corresponding person regarding these online resources. L.d.P., J.B.-S., V.B., and T.S. wrote the article.

REFERENCES

Bouckaert R., Heled J., Kühnert D., Vaughan T., Wu C.-H., Xie D., Suchard M.A., Rambaut A., Drummond A.J. 2014. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 10(4):e1003537.
De Maio N., Wu C.-H., O’Reilly K.M., Wilson D. 2015. New routes to phylogeography: a Bayesian structured coalescent approximation. PLoS Genet. 11(8):e1005421.
Drummond A.J., Bouckaert R.R. 2015. Bayesian evolutionary analysis with BEAST. Cambridge, UK: Cambridge University Press.
Drummond A.J., Heled J. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27(3):570–580.
Drummond A.J., Ho S.Y.W., Phillips M.J., Rambaut A. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4(5):e88.

Drummond A.J., Rambaut A. 2007. BEAST: Bayesian evolutionary quantify population structure from genomic data. Mol. Biol. Evol. analysis by sampling trees. BMC Evol. Biol. 7(1):1. 33(8):2102–2116. Drummond A.J., Rambaut A., Shapiro B., Pybus, O.G. 2005. Bayesian Lemey P., Rambaut A., Drummond A.J., Suchard M.A. 2009. Bayesian coalescent inference of past population dynamics from molecular phylogeography finds its roots. PLOS Comput. Biol. 5(9):e1000520. sequences. Mol. Biol. Evol. 22(5):1185–1192. Lemey P., Rambaut A., Welch J.J., Suchard M.A. 2010. Phylogeography Gavryushkina A., Welch D., Stadler T., Drummond A.J. 2014. Bayesian takes a relaxed random walk in continuous space and time. Mol. inference of sampled ancestor trees for epidemiology and fossil Biol. Evol. 27(8):1877–1885. calibration. PLoS Comput. Biol. 10(12):e1003919. Ogilvie, H.A., Bouckaert, R.R., Drummond, A.J. 2017. StarBEAST2 Heled J., Drummond A.J. 2008. Bayesian inference of population size brings faster species tree inference and accurate estimates of history from multiple loci. BMC Evol. Biol. 8:289. substitution rates. Mol. Biol. Evol. doi: 10.1093/molbev/msx126. Höhna S., Landis M.J., Heath T.A., Boussau B., Lartillot N., Moore [Epub ahead of print]. B.R., Huelsenbeck J.P., Ronquist F. 2016. RevBayes: Bayesian Stadler T., Kühnert D., Bonhoeffer S., Drummond A.J. 2013. Birth– phylogenetic inference using graphical models and an interactive death skyline plot reveals temporal changes of epidemic spread in model-specification language. Syst. Biol. 65(4):726–736. HIV and hepatitis c virus (HCV). Proc. Natl. Acad. Sci. USA 110(1): Huelsenbeck J.P., Ronquist F. 2001. Mrbayes: Bayesian inference of 228–233. phylogenetic trees. Bioinformatics 17(8):754–755. Vaughan T.G., Kühnert D., Popinga A., Welch D., Drummond A.J. Kühnert D., Stadler T., Vaughan T.G., Drummond A.J. 2016. 2014. Efficient Bayesian inference under the structured coalescent. 
Phylodynamics with migration: a computational framework to Bioinformatics 30(16):2272–9. DISCUSSIONANDCONCLUSIONS 9

In this thesis I presented work focused on estimating the relative transmission fitness of drug-resistant tuberculosis (TB) strains, as well as more general modelling work. Much of this thesis emphasises the importance of whole genome sequencing (WGS) data, which should be gathered for all ongoing epidemics, and for Mycobacterium tuberculosis in particular. Sequence data are necessary for precise epidemiological analyses, and as long as such data are not widely available, model development is slowed down.

In Chapter 2 I presented an overview of the modelling efforts made for TB. A considerable number of models are being used to analyse TB epidemiological and evolutionary dynamics. While versatility is valuable, this variety points to a lack of consensus in modelling and to the limited understanding that we still have of TB dynamics on all scales. Moreover, multiple studies have shown that not only the differences between TB lineages have to be accounted for, but also the differences in the underlying human populations. Resistance modelling efforts have been scarcer than other modelling approaches, in part because of complications in data collection. While the general hypothesis on TB drug resistance is that it reduces the overall transmission fitness, a simulation study by Knight et al. (2015) showed that a realistic amount of variance in transmission fitness leads to a considerably higher prevalence of drug resistance over time. Thus, even though drug-resistant strains may be less fit for transmission, the fitness costs are highly variable, and this variability, rather than the mean fitness cost, seems to be the main predictor of success for drug-resistant strains. We need to account for these differences in relative fitness in order to effectively control the spread of resistance.

In Chapter 3 I describe the proof-of-concept approach for estimating drug resistance transmission fitness costs in TB.
I first describe a simulation study in which I verify the approach on simulated data, and then an analysis of a real-life dataset of multi-drug resistant TB (MDR-TB) sequences from Kinshasa. In the simulation study I conclude that we can get consistent estimates of drug resistance transmission fitness using a simple model to analyse transmission trees generated under a much more complex model. We use the complex model to simulate the kind of epidemic that we are most interested in studying at the present time: short-term epidemics of active cases in countries where consistent treatment strategies are in effect. We then apply the analysis approach to the dataset from Kinshasa, concluding that the lineage 4 MDR-TB strains in particular lose about 30% of their transmission fitness in association with pyrazinamide resistance. This is in accordance with previous hypotheses on pyrazinamide resistance, which is thought to confer a high fitness cost (Hertog, Sengstake, and Anthony, 2015). However, while plenty of studies treat pyrazinamide resistance as an obvious mechanism inducing a fitness cost (e.g. Casali et al., 2014; Chang, Yew, and Zhang, 2011; Stoffels et al., 2012), the analysis of sequences from Georgia, described in Chapter 4, shows a very different picture. Not only do the analyses suggest that pyrazinamide resistance in that setting does not confer a transmission fitness cost, but they also show a significant difference in estimates between lineages and even within the same lineage sampled in different locations.


It is highly worrying that in Georgia the strains resistant to pyrazinamide, a vital drug in both first- and second-line regimens, do not seem to lose any fitness. This also indicates that we need to be wary when reusing estimates from other studies, as the results may be highly lineage- and location-specific.

Chapter 5 presents work on estimating the time spans covered by different conventional TB-clustering methods. In particular, it shows that some of the clustering methods in fact cover time spans that extend far beyond the periods that we are currently able to sample. On the other hand, WGS data used with a conventional SNP-based cut-off seem to give us a good approximation of connected transmission clusters. This again reinforces the idea that WGS is an essential component of further progress in our understanding of TB dynamics.

Chapter 6 presented an analysis of the most prevalent drug resistance substitutions in HIV in Switzerland. The dataset is much richer than what is available for TB and revealed varying degrees of fitness impact conferred by different prevalent substitutions. In particular, one of the substitutions actually confers a fitness advantage over drug-sensitive strains, in accordance with results from previous studies.

In Chapter 7 I used the MTBD model to analyse a dataset of active and occult Hepatitis B virus (HBV) cases, accounting for the fact that the occult cases do not seem to transmit in the studied population. Occult infection is characterised by an absence of Hepatitis B surface antigen and low viral replication (Said, 2011). The occult cases and some of the active cases only had single gene sequences, while the other cases had full genomes available. Both types of data were included in the analysis setup. The Bayesian analysis shows that the occult HBV cases mainly cluster together with possible infection sources among neighbouring households with active cases.
This result agreed with trees obtained using a maximum likelihood tree reconstruction method that used the short gene sequences for all samples and did not account for differences in transmission.

Finally, in Chapter 8 I introduce the “Taming the Beast” workshop and the related self-study materials that are now available to the scientific community at large. Not only do I find that teaching and helping others progress in their scientific journey is extremely gratifying, but helping to run the multiple instances of the workshop also gave me a wonderful opportunity to meet many great scientists around the world. The workshop is a great way to share expertise and discover new datasets and new methodologies. It is also an excellent opportunity to foster collaboration, especially in fields such as TB modelling, where significant effort is still needed to improve our understanding and modelling capabilities.

After working on the general topic of TB, and of drug resistance fitness costs in particular, I am still sometimes surprised by how little many people know about the disease. Many of the people I talk to seem to be under the impression that TB has been eradicated and poses no danger to public health worldwide (or at least certainly not in first-world countries). This is, sadly, not the case. TB was declared a public health emergency by the WHO in 1993, and while we have made major progress in reducing the case and death rates worldwide in the years that followed, around 10 million people still fall ill with TB every year. Moreover, TB remains among the top 10 causes of death (WHO, 2018). Thus, there are still very real and major challenges in TB eradication on a global scale, which have to be addressed before we reach the WHO End TB targets for 2035. The primary challenge at the moment is still the limited access to universal healthcare in countries with the highest burdens of TB, e.g.
the Democratic Republic of the Congo, South Africa and Ethiopia. Moreover, the rates of TB and HIV co-infection are still high, the risk of developing TB being 20-fold higher for the 37 million people already living with HIV (WHO, 2018). HIV/TB co-infection poses a unique diagnostic and therapeutic challenge, as it has been shown that co-infection is much more poorly understood and managed than mono-infection. Our knowledge about the interaction of the two pathogens is therefore also still lacking and needs to be further developed to work out effective treatment and prevention methods (Pawlowski et al., 2012).

Medical doctors do their best to effectively treat and cure patients, while we as researchers aim to provide conclusive insight into the nature of diseases using a range of datasets and methods. WGS data has been an incredible additional data source, improving both the resolution of the analyses and the types of analyses that can be done. We now have the challenge and opportunity to develop phylodynamic tools to provide unique insight. While there are certainly some difficulties in communication between fields, and there is still a lot of work to be done on both sides to improve our understanding on all levels, I am positive that in time we will be able to integrate all the different levels of knowledge into a complete understanding.

There is also still plenty of ground to cover in the analyses. For example, the analyses described in this thesis do not take into account the possibility of a different rate of evolution in the latent stage of TB. For now we have only looked at short-term epidemics and at estimates of relative transmission fitness without explicitly focusing on tree dating, essentially averaging over different substitution rates through time. If we need to ensure precise dating of the phylogenetic trees, the potentially different rate of evolution should definitely be taken into account.
Current estimates suggest that the mutation rate during latency is approximately 30-fold lower than during the final two years before the onset of active disease (Colangeli et al., 2014). In general, our understanding would be greatly advanced by methods that would allow us to detect latent infection in patients, and possibly by methods that would allow us to obtain sequences from latent infections as an additional source of information. We are not currently capable of sequencing latent infections, which means that all of the available samples come from the active stage, while the latent stage for now remains a hidden state in the tree, which contributes to the uncertainty of estimates. At present the methods for detecting latent TB are lacking, and some of the methods even detect the old BCG vaccine as a possible infection (Colangeli et al., 2014). WGS for latent infections would allow better tracing and consequently much improved parameter estimation for the dynamics of TB.

Another area of debate in TB epidemiological analysis is cluster definition. While Chapter 5 illustrates the different methods of cluster definition and their time spans, indicating potential use cases, cluster definition remains another source of uncertainty that could potentially be integrated within the analysis model. An ideal analysis would not require external cluster definition and would be able to operate without any pre-set clustering patterns. This could be done with longer sampling periods and with more WGS data sampled through time, allowing us to directly observe the evolution on the time scale of sampling. We would then have to infer the change in substitution rates from the expected latent stage to activation over the tree, while simultaneously taking into account crucial characteristics such as drug resistance status. While some estimates for the length of latency are already available (e.g.
Eldholm et al., 2016, where the authors used inferred transmission pairs and simulations to estimate the length of latent infection), phylodynamic analyses such as those described here, but inferring the latent infection with a different substitution or transmission rate on the tree, would help us estimate the length of latency in different patients more conclusively. At the moment, however, the methods still need work on computational efficiency, as inferring phylogenetic trees with additional substitution rate changes would significantly slow down current implementations.
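
As a rough illustration of why a separate latent-stage rate matters for dating, the following minimal Python sketch computes the expected number of substitutions along a single lineage that spends part of its time in latency. It assumes the approximately 30-fold rate reduction quoted above; the function names, the active-stage rate of 0.5 substitutions per genome per year and the durations are hypothetical values chosen only for illustration, not estimates from this thesis.

```python
def expected_substitutions(latent_years, active_years, active_rate,
                           fold_reduction=30.0):
    """Expected substitutions per genome along one lineage.

    active_rate is in substitutions per genome per year (hypothetical input);
    the latent-stage rate is assumed to be active_rate / fold_reduction.
    """
    if latent_years < 0 or active_years < 0:
        raise ValueError("durations must be non-negative")
    if active_rate < 0 or fold_reduction <= 0:
        raise ValueError("rate must be non-negative and fold_reduction positive")
    latent_rate = active_rate / fold_reduction
    return latent_years * latent_rate + active_years * active_rate


def time_averaged_rate(latent_years, active_years, active_rate,
                       fold_reduction=30.0):
    """Single constant rate that yields the same expected substitutions."""
    total_years = latent_years + active_years
    if total_years <= 0:
        raise ValueError("total duration must be positive")
    total = expected_substitutions(latent_years, active_years,
                                   active_rate, fold_reduction)
    return total / total_years


if __name__ == "__main__":
    # Illustrative scenario: 5 years latent, 1 year active, 0.5 subs/genome/year.
    print(expected_substitutions(5.0, 1.0, 0.5))
    print(time_averaged_rate(5.0, 1.0, 0.5))
```

With these illustrative numbers, five years of latency followed by one year of active disease give roughly 0.58 expected substitutions, so a single constant rate fitted to the whole branch would be about one fifth of the active-stage rate. This is the averaging effect that would have to be modelled explicitly if precise tree dating were the goal.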

A lot of work still needs to be done to allow us to directly estimate MDR-TB fitness costs with respect to drug-sensitive TB strains. Better communication and interdisciplinary knowledge are key to uncovering many of the parameters of TB epidemiology, drug resistance costs and co-infection dynamics. I write this in the hope that the analyses presented here and the hard work on the phylodynamics workshop will be yet another stepping stone to further advances in this area.

BIBLIOGRAPHY

Aldridge, B. B., M. Fernandez-Suarez, D. Heller, V. Ambravaneswaran, D. Irimia, M. Toner, and S. M. Fortune (2012). “Asymmetry and aging of mycobacterial cells lead to variable growth and antibiotic susceptibility.” In: Science 335.6064, pp. 100–4. issn: 1095-9203 (Electronic) 0036-8075 (Linking). doi: 10.1126/science.1216166. url: https://www.ncbi.nlm.nih.gov/pubmed/22174129.
Allen, R. C., J. Engelstadter, S. Bonhoeffer, B. A. McDonald, and A. R. Hall (2017). “Reversing resistance: different routes and common themes across pathogens.” In: Proc Biol Sci 284.1863. issn: 1471-2954 (Electronic) 0962-8452 (Linking). doi: 10.1098/rspb.2017.1619. url: https://www.ncbi.nlm.nih.gov/pubmed/28954914.
Andersson, D. I. and D. Hughes (2010). “Antibiotic resistance and its cost: is it possible to reverse resistance?” In: Nat Rev Microbiol 8.4, pp. 260–71. issn: 1740-1534 (Electronic) 1740-1526 (Linking). doi: 10.1038/nrmicro2319. url: http://www.ncbi.nlm.nih.gov/pubmed/20208551.
Biek, R., O. G. Pybus, J. O. Lloyd-Smith, and X. Didelot (2015). “Measurably evolving pathogens in the genomic era.” In: Trends Ecol Evol 30.6, pp. 306–13. issn: 1872-8383 (Electronic) 0169-5347 (Linking). doi: 10.1016/j.tree.2015.03.009. url: https://www.ncbi.nlm.nih.gov/pubmed/25887947.
Blower, S. M., A. R. McLean, T. C. Porco, P. M. Small, P. C. Hopewell, M. A. Sanchez, and A. R. Moss (1995). “The intrinsic transmission dynamics of tuberculosis epidemics.” In: Nat Med 1.8, pp. 815–21. issn: 1078-8956 (Print) 1078-8956 (Linking). doi: 10.1038/nm0895-815. url: http://www.ncbi.nlm.nih.gov/pubmed/7585186.
Borrell, S. and S. Gagneux (2011). “Strain diversity, epistasis and the evolution of drug resistance in Mycobacterium tuberculosis.” In: Clin Microbiol Infect 17.6, pp. 815–20. issn: 1469-0691 (Electronic) 1198-743X (Linking). doi: 10.1111/j.1469-0691.2011.03556.x. url: http://www.ncbi.nlm.nih.gov/pubmed/21682802.
Borrell, S., Y. Teo, F. Giardina, E. M. Streicher, M. Klopper, J. Feldmann, B. Muller, T. C. Victor, and S. Gagneux (2013). “Epistasis between antibiotic resistance mutations drives the evolution of extensively drug-resistant tuberculosis.” In: Evol Med Public Health 2013.1, pp. 65–74. issn: 2050-6201 (Print) 2050-6201 (Linking). doi: 10.1093/emph/eot003. url: https://www.ncbi.nlm.nih.gov/pubmed/24481187.
Bouckaert, R., J. Heled, D. Kühnert, T. Vaughan, C. H. Wu, D. Xie, M. A. Suchard, A. Rambaut, and A. J. Drummond (2014). “BEAST 2: a software platform for Bayesian evolutionary analysis.” In: PLoS Comput Biol 10.4, e1003537. issn: 1553-7358 (Electronic) 1553-734X (Linking). doi: 10.1371/journal.pcbi.1003537. url: http://www.ncbi.nlm.nih.gov/pubmed/24722319.
Casali, N. et al. (2014). “Evolution and transmission of drug-resistant tuberculosis in a Russian population.” In: Nat Genet 46.3, pp. 279–86. issn: 1546-1718 (Electronic) 1061-4036 (Linking). doi: 10.1038/ng.2878. url: http://www.ncbi.nlm.nih.gov/pubmed/24464101.
Chang, K. C., W. W. Yew, and Y. Zhang (2011). “Pyrazinamide susceptibility testing in Mycobacterium tuberculosis: a systematic review with meta-analyses.” In: Antimicrob Agents Chemother 55.10, pp. 4499–505. issn: 1098-6596 (Electronic) 0066-4804 (Linking). doi: 10.1128/AAC.00630-11. url: https://www.ncbi.nlm.nih.gov/pubmed/21768515.
Cohen, K. A. et al. (2015). “Evolution of Extensively Drug-Resistant Tuberculosis over Four Decades: Whole Genome Sequencing and Dating Analysis of Mycobacterium tuberculosis Isolates from KwaZulu-Natal.” In: PLoS Med 12.9, e1001880. issn: 1549-1676 (Electronic) 1549-1277 (Linking). doi: 10.1371/journal.pmed.1001880. url: https://www.ncbi.nlm.nih.gov/pubmed/26418737.
Colangeli, R., V. L. Arcus, R. T. Cursons, A. Ruthe, N. Karalus, K. Coley, S. D. Manning, S. Kim, E. Marchiano, and D. Alland (2014). “Whole genome sequencing of Mycobacterium tuberculosis reveals slow growth and low mutation rates during latent infections in humans.” In: PLoS One 9.3, e91024. issn: 1932-6203 (Electronic) 1932-6203 (Linking). doi: 10.1371/journal.pone.0091024. url: https://www.ncbi.nlm.nih.gov/pubmed/24618815.
Cole, S. T. et al. (1998). “Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.” In: Nature 393.6685, pp. 537–44. issn: 0028-0836 (Print) 0028-0836 (Linking). doi: 10.1038/31159. url: https://www.ncbi.nlm.nih.gov/pubmed/9634230.
Colijn, C., T. Cohen, A. Ganesh, and M. Murray (2011). “Spontaneous emergence of multiple drug resistance in tuberculosis before and during therapy.” In: PLoS One 6.3, e18327. issn: 1932-6203 (Electronic) 1932-6203 (Linking). doi: 10.1371/journal.pone.0018327. url: http://www.ncbi.nlm.nih.gov/pubmed/21479171.
Driscoll, J. R. (2009). “Spoligotyping for molecular epidemiology of the Mycobacterium tuberculosis complex.” In: Methods Mol Biol 551, pp. 117–28. issn: 1064-3745 (Print) 1064-3745 (Linking). doi: 10.1007/978-1-60327-999-4_10. url: https://www.ncbi.nlm.nih.gov/pubmed/19521871.
Drummond, Alexei J., Oliver G. Pybus, Andrew Rambaut, Roald Forsberg, and Allen G. Rodrigo (2003). “Measurably evolving populations.” In: Trends Ecol Evol 18.9, pp. 481–488. issn: 01695347. doi: 10.1016/s0169-5347(03)00216-7. url: ://WOS:000185311800014.
Duchene, S., K. E. Holt, F. X. Weill, S. Le Hello, J. Hawkey, D. J. Edwards, M. Fourment, and E. C. Holmes (2016). “Genome-scale rates of evolutionary change in bacteria.” In: Microb Genom 2.11, e000094. issn: 2057-5858 (Print) 2057-5858 (Linking). doi: 10.1099/mgen.0.000094. url: https://www.ncbi.nlm.nih.gov/pubmed/28348834.
Eldholm, V., A. Rieux, J. Monteserin, J. M. Lopez, D. Palmero, B. Lopez, V. Ritacco, X. Didelot, and F. Balloux (2016). “Impact of HIV co-infection on the evolution and transmission of multidrug-resistant tuberculosis.” In: Elife 5. issn: 2050-084X (Electronic) 2050-084X (Linking). doi: 10.7554/eLife.16644. url: https://www.ncbi.nlm.nih.gov/pubmed/27502557.
Ford, C. B., R. R. Shah, M. K. Maeda, S. Gagneux, M. B. Murray, T. Cohen, J. C. Johnston, J. Gardy, M. Lipsitch, and S. M. Fortune (2013). “Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis.” In: Nat Genet 45.7, pp. 784–90. issn: 1546-1718 (Electronic) 1061-4036 (Linking). doi: 10.1038/ng.2656. url: https://www.ncbi.nlm.nih.gov/pubmed/23749189.
Gagneux, Sebastien and Peter M. Small (2007). “Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development.” In: The Lancet Infectious Diseases 7.5, pp. 328–337. issn: 14733099. doi: 10.1016/s1473-3099(07)70108-1. url: https://www.ncbi.nlm.nih.gov/pubmed/17448936.

Genewein, A., A. Telenti, C. Bernasconi, K. Schopfer, T. Bodmer, C. Mordasini, S. Weiss, A. M. Maurer, and H. L. Rieder (1993). “Molecular approach to identifying route of transmission of tuberculosis in the community.” In: The Lancet 342.8875, pp. 841–844. issn: 01406736. doi: 10.1016/0140-6736(93)92698-s.
Godfroid, M., T. Dagan, and A. Kupczok (2018). “Recombination Signal in Mycobacterium tuberculosis Stems from Reference-guided Assemblies and Alignment Artefacts.” In: Genome Biol Evol 10.8, pp. 1920–1926. issn: 1759-6653 (Electronic) 1759-6653 (Linking). doi: 10.1093/gbe/evy143. url: https://www.ncbi.nlm.nih.gov/pubmed/30010866.
Gomes, M. G. M., A. O. Franco, M. C. Gomes, and G. F. Medley (2004). “The reinfection threshold promotes variability in tuberculosis epidemiology and vaccine efficacy.” In: Proc Biol Sci 271.1539, pp. 617–23. issn: 0962-8452 (Print) 0962-8452 (Linking). doi: 10.1098/rspb.2003.2606. url: http://www.ncbi.nlm.nih.gov/pubmed/15156920.
Gordon, S. V. and T. Parish (2018). “Microbe Profile: Mycobacterium tuberculosis: Humanity’s deadly microbial foe.” In: Microbiology 164.4, pp. 437–439. issn: 1465-2080 (Electronic) 1350-0872 (Linking). doi: 10.1099/mic.0.000601. url: https://www.ncbi.nlm.nih.gov/pubmed/29465344.
Grenfell, B. T., O. G. Pybus, J. R. Gog, J. L. Wood, J. M. Daly, J. A. Mumford, and E. C. Holmes (2004). “Unifying the epidemiological and evolutionary dynamics of pathogens.” In: Science 303.5656, pp. 327–32. issn: 1095-9203 (Electronic) 0036-8075 (Linking). doi: 10.1126/science.1090727. url: https://www.ncbi.nlm.nih.gov/pubmed/14726583.
Guerra-Assuncao, J. A. et al. (2015). “Large-scale whole genome sequencing of M. tuberculosis provides insights into transmission in a high prevalence area.” In: Elife 4. issn: 2050-084X (Electronic) 2050-084X (Linking). doi: 10.7554/eLife.05166. url: http://www.ncbi.nlm.nih.gov/pubmed/25732036.
Hatherell, H. A., C. Colijn, H. R. Stagg, C. Jackson, J. R. Winter, and I. Abubakar (2016). “Interpreting whole genome sequencing for investigating tuberculosis transmission: a systematic review.” In: BMC Med 14, p. 21. issn: 1741-7015 (Electronic) 1741-7015 (Linking). doi: 10.1186/s12916-016-0566-x. url: http://www.ncbi.nlm.nih.gov/pubmed/27005433.
Hauser, A., A. Hofmann, K. Hanke, V. Bremer, B. Bartmeyer, C. Kuecherer, and N. Bannert (2017). “National molecular surveillance of recently acquired HIV infections in Germany, 2013 to 2014.” In: Euro Surveill 22.2. issn: 1560-7917 (Electronic) 1025-496X (Linking). doi: 10.2807/1560-7917.ES.2017.22.2.30436. url: https://www.ncbi.nlm.nih.gov/pubmed/28105988.
Hertog, A. L. den, S. Sengstake, and R. M. Anthony (2015). “Pyrazinamide resistance in Mycobacterium tuberculosis fails to bite?” In: Pathog Dis 73.6, ftv037. issn: 2049-632X (Electronic) 2049-632X (Linking). doi: 10.1093/femspd/ftv037. url: https://www.ncbi.nlm.nih.gov/pubmed/25994506.
Hu, K. Q. (2002). “Occult hepatitis B virus infection and its clinical implications.” In: Journal of Viral Hepatitis 9.4, pp. 243–257. issn: 1352-0504 (Print) 1365-2893 (Linking). doi: 10.1046/j.1365-2893.2002.00344.x.
Knight, G. M., C. Colijn, S. Shrestha, M. Fofana, F. Cobelens, R. G. White, D. W. Dowdy, and T. Cohen (2015). “The distribution of fitness costs of resistance-conferring mutations Is a key determinant for the future burden of drug-resistant tuberculosis: A model-based analysis.” In: Clin Infect Dis 61 Suppl 3, S147–54. issn: 1537-6591 (Electronic) 1058-4838 (Linking). doi: 10.1093/cid/civ579. url: https://www.ncbi.nlm.nih.gov/pubmed/26409276.

Kühnert, D., T. Stadler, T. G. Vaughan, and A. J. Drummond (2016). “Phylodynamics with Migration: A Computational Framework to Quantify Population Structure from Genomic Data.” In: Mol Biol Evol 33.8, pp. 2102–16. issn: 1537-1719 (Electronic) 0737-4038 (Linking). doi: 10.1093/molbev/msw064. url: https://www.ncbi.nlm.nih.gov/pubmed/27189573.
Kühnert, D., M. Coscolla, D. Brites, D. Stucki, J. Metcalfe, L. Fenner, S. Gagneux, and T. Stadler (2018). “Tuberculosis outbreak investigation using phylodynamic analysis.” In: Epidemics 25, pp. 47–53. issn: 1878-0067 (Electronic) 1878-0067 (Linking). doi: 10.1016/j.epidem.2018.05.004. url: https://www.ncbi.nlm.nih.gov/pubmed/29880306.
Pawlowski, A., M. Jansson, M. Skold, M. E. Rottenberg, and G. Kallenius (2012). “Tuberculosis and HIV co-infection.” In: PLoS Pathog 8.2, e1002464. issn: 1553-7374 (Electronic) 1553-7366 (Linking). doi: 10.1371/journal.ppat.1002464. url: https://www.ncbi.nlm.nih.gov/pubmed/22363214.
Roetzer, A. et al. (2013). “Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study.” In: PLoS Med 10.2, e1001387. issn: 1549-1676 (Electronic) 1549-1277 (Linking). doi: 10.1371/journal.pmed.1001387. url: https://www.ncbi.nlm.nih.gov/pubmed/23424287.
Said, Z. N. (2011). “An overview of occult hepatitis B virus infection.” In: World J Gastroenterol 17.15, pp. 1927–38. issn: 2219-2840 (Electronic) 1007-9327 (Linking). doi: 10.3748/wjg.v17.i15.1927. url: https://www.ncbi.nlm.nih.gov/pubmed/21528070.
Sepkowitz, K. A. (1996). “How contagious is tuberculosis?” In: Clin Infect Dis 23.5, pp. 954–62. issn: 1058-4838 (Print) 1058-4838 (Linking). doi: 10.1093/clinids/23.5.954. url: https://www.ncbi.nlm.nih.gov/pubmed/8922785.
Steenwinkel, J. E. de, M. T. ten Kate, G. J. de Knegt, K. Kremer, R. E. Aarnoutse, M. J. Boeree, H. A. Verbrugh, D. van Soolingen, and I. A. Bakker-Woudenberg (2012). “Drug susceptibility of Mycobacterium tuberculosis Beijing genotype and association with MDR TB.” In: Emerg Infect Dis 18.4, pp. 660–3. issn: 1080-6059 (Electronic) 1080-6040 (Linking). doi: 10.3201/eid1804.110912. url: https://www.ncbi.nlm.nih.gov/pubmed/22469099.
Stoffels, K., V. Mathys, M. Fauville-Dufaux, R. Wintjens, and P. Bifani (2012). “Systematic analysis of pyrazinamide-resistant spontaneous mutants and clinical isolates of Mycobacterium tuberculosis.” In: Antimicrob Agents Chemother 56.10, pp. 5186–93. issn: 1098-6596 (Electronic) 0066-4804 (Linking). doi: 10.1128/AAC.05385-11. url: https://www.ncbi.nlm.nih.gov/pubmed/22825123.
Supply, P. et al. (2006). “Proposal for standardization of optimized mycobacterial interspersed repetitive unit-variable-number tandem repeat typing of Mycobacterium tuberculosis.” In: J Clin Microbiol 44.12, pp. 4498–510. issn: 0095-1137 (Print) 0095-1137 (Linking). doi: 10.1128/JCM.01392-06. url: https://www.ncbi.nlm.nih.gov/pubmed/17005759.
Vergnaud, G. and C. Pourcel (2009). “Multiple locus variable number of tandem repeats analysis.” In: Methods Mol Biol 551, pp. 141–58. issn: 1064-3745 (Print) 1064-3745 (Linking). doi: 10.1007/978-1-60327-999-4_12. url: https://www.ncbi.nlm.nih.gov/pubmed/19521873.
WHO (2018). Global tuberculosis report 2018. Report.
WHO (2019a). Hepatitis B fact sheet. Web Page. url: https://www.who.int/en/news-room/fact-sheets/detail/hepatitis-b.
WHO (2019b). Progress report on HIV, viral hepatitis and sexually transmitted infections, 2019. Report.

Zainuddin, Z. F. and J. W. Dale (1990). “Does Mycobacterium tuberculosis have plasmids?” In: Tubercle 71.1, pp. 43–49. issn: 00413879. doi: 10.1016/0041-3879(90)90060-l. url: http://www.sciencedirect.com/science/article/pii/004138799090060L.

Jūlija Pečerska

Rebweg 4 Contact +41 78 668 01 69 8134 Adliswil Information [email protected] Switzerland

Education 2014-2019 PhD student

ETH Zürich, Department of Biosystems Science & Engineering; Basel, Switzerland. During my time as a PhD student in the cEvo group at D-BSSE I have learned to perform complex phylogenetic and phylodynamic analyses of genetic data. I have used Python as my main scripting tool to perform more routine tasks of data organisation as well as looked into and made fixes to the Java code of Beast2 – one of the main analytic tools used for such analyses, developed in part by our group. I am working on estimating the transmission fitness costs for multi-drug resistant tuberculosis. Supervision: Prof. Dr. Tanja Stadler 2012-2014 Master student

ETH Zürich, Computer Science Department; Zürich, Switzerland. GPA: 5.49 (from 1 [worst] to 6 [best], lowest passing grade: 4) During my Master degree studies at ETH I have taken courses to deepen my understanding of data analysis and, specifically, biological analysis. I have taken courses in computational biology, numerical optimisation and algorithms, and have also taken a few courses on machine learning. My thesis project was a Python-based aggregation tool which analysed and appropriately combined results from multiple tandem repeat detectors to create a descriptive and biologically meaningful set of tandem repeats on a given protein sequence. Thesis: Large-scale Prediction and Functional Analysis of Tandem Re- peats in Proteomes of Diverse Organisms. Computational Biochemistry Research Group Supervision: Dr. Maria Anisimova, Dr. Stefan Zoller Final grade: 6 (from 1 [worst] to 6 [best], lowest passing grade: 4) 2008-2012 Bachelor student

University of Latvia, Faculty of Computing; R¯ıga,Latvia. GPA: 9.127 (from 1 [worst] to 10 [best], lowest passing grade: 4) During my time in the University of Latvia I have taken a variety of courses in basic CS such as discrete maths, numerical methods, data structures and algorithms, data bases and operating systems. I have written a small Android organiser project together with a full project description for my qualification work and a Python-based regular expression syntax parser as an extension of LaTeX syntax for my Bachelor thesis. Thesis: Dynamic parsing using regular-expression-extended grammars. Supervision: Prof. Dr. Guntis Arnic¯ans Final grade: 9 (from 1 [worst] to 10 [best], lowest passing grade: 4)

2002-2008 Programming courses "Progmeistars"; Rīga, Latvia.
Basics of programming and algorithms.

1997-2008 Rīga Secondary School No. 40; Rīga, Latvia.
Class with a special focus on mathematics and languages.
GPA: 9.07 (from 1 [worst] to 10 [best], lowest passing grade: 4)
Final examination levels: English Language - A, Physics - A, Latvian Language and Literature - A, Maths - A.

Teaching

2015-2019 Teaching assistant
ETH Zürich, Department of Biosystems Science & Engineering; Basel, Switzerland.
I was a teaching assistant in the Molecular Evolution, Phylogenetics, Phylodynamics (MEPP, 2015/2016) and Computational Biology (CB, 2016/2017, 2017/2018 and 2018/2019) courses at ETH Zürich. As the course was created from scratch, I participated in designing it and developed the tutorials and homework assignments together with my supervisor and colleagues. I also taught the tutorials and supervised students' progress throughout the course, helping along the way. In the semesters of 2017/2018 I also participated in designing an automated homework grading system and prepared the R template code that served as the basis for each assignment, since we were teaching students who may have had no prior programming experience. The 2015/2016 edition of the course won the Golden Owl award for outstanding teaching.

2016-2019 Lecturer, tutor
Taming the Beast workshop
As part of the developer team of BEAST 2 I participated in setting up a community learning platform that allows researchers to acquire the skills needed to perform Bayesian phylogenetic and phylodynamic analyses on their data, and helped organise and kick off the Taming the Beast workshop series. During the preparation we organised all the logistics of the course and prepared a comprehensive set of tutorials that previously were either dispersed across different websites or did not exist at all. Thus far I have also taught and assisted the students in the 2016 and 2018 Swiss editions, the 2017 London edition and the 2019 Canada edition of the workshop. The workshop and the tutorials developed in preparation for it have resulted in a publication.

2008-2009 Lecturer
Mathematics, Informatics, Physics (M.I.Ph.) summer school;

Preiļi, Latvia. Together with a colleague I prepared the syllabus and led seminar classes on cryptography (2008) and genetics (2009) for gifted high school students aged 15 to 18.

Work

2013 Intern
ETH Zürich, Computer Science Department, Computational Biochemistry Research Group; Zürich, Switzerland.
I worked on a project on tandem repeat detection in a large body of proteins. The goal of this project was to create a reliable set of tandem repeats to be used for further analysis and testing of biological hypotheses. This project enabled me to improve my skills in performing computations on large amounts of data as well as in using biological databases to obtain the data I needed. It also served as the basis for my Master's thesis at ETH.

2012-2013 Research Assistant
ETH Zürich, Computer Science Department, Global Information Systems Group; Zürich, Switzerland.
I worked as an assistant in a post-doctoral research project on website adaptation for large and/or touch-enabled screens. My responsibilities included creating prototypes for the system and performing short subject studies when necessary.

2010-2012 Software Engineer/Senior Software Engineer
Accenture RDC; Rīga, Latvia.
I participated in multiple projects for mobile platforms such as Android, MeeGo and Maemo, first as Software Engineer, then as Senior Software Engineer. I was responsible for creating important functional UI elements using Java and Qt, as well as for creating the Python backend for a tool that performed simultaneous configuration of multiple Android devices. For a short period of time I also did manual testing and test process quality assurance for European customs software. My time in a big software engineering company allowed me to improve my communication and teamwork skills, as well as to take on real responsibility for the products we created.

Publications

Article
Pečerska J., Gygli S., Gagneux S. and Stadler T.
Pyrazinamide resistance relative transmission fitness comparison of multi-drug resistant tuberculosis in different settings.
In preparation.

Article
Pečerska J., Kühnert D., Meehan C. J., Coscolla M., de Jong B. C., Gagneux S. and Stadler T.
Estimating drug resistance reproductive fitness costs of multi-drug resistant tuberculosis
Epidemics, in review.

Article
Meehan C. J., Moris P., Kohl T. A., Pečerska J., Akter S., Merker M., Gehre F., Lempens P., Stadler T., Kaswa M. K., Kühnert D., Niemann S. and de Jong B. C.
The relationship between transmission time and clustering methods in Mycobacterium tuberculosis epidemiology
EBioMedicine, November 2018.

Article
Pinho-Nascimento C. A., Bratschi M., Höfer R., Soares C. C., Warryn L., Pečerska J., Paixão I. C. N. P., de Moraes M. T. B., Um Boock A., Niel C., Pluschke G. and Röltgen K.
Transmission of Hepatitis B and D virus infections in an African rural community
mSystems, September 2018.

Article
Kühnert D., Kouyos R., Shirreff G., Pečerska J., Scherrer A. U., Böni J., Yerly S., Klimkait T., Aubert V., Günthard H. F., Stadler T., Bonhoeffer S. and the Swiss HIV Cohort Study
Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics
PLoS Pathogens, February 2018.

Article
Barido-Sottani J., Bošková V., du Plessis L., Kühnert D., Magnus C., Mitov V., Müller N. F., Pečerska J., Rasmussen D. A., Zhang C., Drummond A., Heath T., Pybus O. G., Vaughan T. and Stadler T.
Taming the BEAST - A Community Teaching Material Resource for BEAST 2
Systematic Biology, January 2018.

Book chapter
Pečerska J., Wood J., Tanaka M. M. and Stadler T.
Mathematical Models for the Epidemiology and Evolution of Mycobacterium tuberculosis
In Strain Variation in the Mycobacterium tuberculosis Complex: Its Role in Biology, Epidemiology and Control, edited by Gagneux, Sebastien, Springer, November 2017.

Article
Anisimova M., Pečerska J. and Schaper E.
Statistical approaches to detecting and analyzing tandem repeats in genomic sequences
Frontiers in Bioengineering and Biotechnology, March 2015.

Article
Schaper E., Korsunsky A., Pečerska J., Messina A., Murri R., Stockinger H., Zoller S., Xenarios I. and Anisimova M.
TRAL: tandem repeat annotation library
Bioinformatics, September 2015.

Posters

Poster
Quantifying transmission fitness cost of TB drug resistance
Evolution, August 2018

Poster
Quantifying transmission fitness cost of MDR Tuberculosis
ESEB, July 2015

Skills & Interests

Languages: Russian (mother tongue), Latvian (fluent), English (fluent, IELTS score 8.0), German (intermediate, Goethe-exam level B1), French (beginner)
Programming Experience: Proficient in C/C++, Python, Java, JavaScript, R. Knowledge of Matlab, Pascal, Delphi, Prolog.
Other skills: Mockups; UML diagrams, flowcharts; HTML, CSS.
Other interests: Computer science and technology, epidemiology, genetics and epigenetics, sports, science fiction.
