Genomic Signatures of Pre-Resistance in Mycobacterium tuberculosis

Arturo Torres Ortiz, Imperial College, https://orcid.org/0000-0002-2492-0958
Jorge Coronel, Universidad Peruana Cayetano Heredia
Julia Rios Vidal, Ministerio de Salud, Lima, Peru
Cesar Bonilla, Ministerio de Salud, Lima, Peru
David Moore, London School of Hygiene and Tropical Medicine, https://orcid.org/0000-0002-3748-0154
Robert Gilman, Johns Hopkins Bloomberg School of Public Health
Francois Balloux, University College London, https://orcid.org/0000-0003-1978-7715
Onn Min Kon, Imperial College
Xavier Didelot, University of Warwick
Louis Grandjean (corresponding author), University College London

Article

Keywords: Tuberculosis, phylogenetics, drug resistance, AMR

Posted Date: May 25th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-364747/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Arturo Torres Ortiz¹, Jorge Coronel², Julia Rios Vidal³, Cesar Bonilla³,⁴, David AJ Moore⁵, Robert H Gilman⁶, Francois Balloux⁷, Onn Min Kon⁸, Xavier Didelot⁹, Louis Grandjean¹,¹⁰*

¹ Imperial College, Department of Infectious Diseases, London, UK
² Universidad Peruana Cayetano Heredia, Lima, Perú
³ Unidad Técnica de Tuberculosis MDR, Ministerio de Salud, Lima, Perú
⁴ Universidad Privada San Juan Bautista, Lima, Perú
⁵ London School of Hygiene and Tropical Medicine, London, UK
⁶ Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
⁷ UCL Computational Institute, London, UK
⁸ Respiratory Medicine, National Heart and Lung Institute, Imperial College London, UK
⁹ University of Warwick, School of Life Sciences and Department of Statistics, Warwick, UK
¹⁰ UCL Department of Infection, Institute of Child Health, London, UK

* Corresponding author: Institute of Child Health, 30 Guilford Street, London, WC1N 1EH


Abstract

Recent advances in bacterial whole-genome sequencing have resulted in the identification of a comprehensive catalogue of genomic signatures of antibiotic resistance in Mycobacterium tuberculosis. With a view to pre-empting the emergence of drug resistance, we hypothesized that pre-existing balanced polymorphisms in drug susceptible genotypes (“pre-resistance mutations”) could increase the risk of acquiring drug resistance in the future.

In order to identify a genomic signature of future drug resistance, we undertook whole-genome sequencing on 3135 culture positive isolates from different patients sampled over a 17-year period in Lima, Peru.

Reconstructing ancestral whole genomes on time-calibrated phylogenetic trees, we identified no single drug resistance in Peru predating 1940. Moving forward in evolutionary time through the phylogenetic tree from 1940, we apply a novel genome-wide survival analysis to determine the hazard of drug resistance acquisition at the level of lineage, mono-resistance state, and single-nucleotide polymorphism. We demonstrate that lineage 2 has a significantly higher incidence of drug resistance acquisition than lineage 4 (HR 3.36, 95% CI 2.10–5.38, p-value = 4.25 × 10⁻⁷) and estimate that the hazard of evolving rifampicin resistance following isoniazid resistance acquisition is 14 times that of genomes with a susceptible background (HR 14.45, 95% CI 8.46–15.50, p-value < 10⁻¹⁵). Our findings are validated in a separate publicly available dataset from Samara, Russia. After controlling for population structure, we also show that a deletion in a gene coding for the cell surface protein lppP predisposes to the acquisition of drug resistance in susceptible genotypes (HR 6.71, 95% CI 4.82–11.22, p-value = 1.17 × 10⁻⁹).

Prediction of future drug resistance in susceptible pathogens together with targeted expanded therapy has the potential to prevent drug resistance emergence in Mycobacterium tuberculosis and other pathogens. Prospective cohort studies of participants with and without these polymorphisms should be undertaken with a view to implementing personalized pathogen genomic therapy. This approach could be employed to preempt and prevent the emergence of drug resistance and other important traits in other organisms.

Introduction

Mycobacterium tuberculosis is estimated to have killed 1 billion people over the last 200 years [1] and remains one of the world’s most deadly pathogens [2]. Drug resistance in bacteria, particularly the Enterobacteriaceae and Mycobacterium tuberculosis, imposes an unsustainable burden on health programs worldwide, with some strains so extensively resistant that they are untreatable with existing antibiotic therapy [3]. Although recent advances in bacterial whole-genome sequencing have significantly improved the identification of drug resistance [4], post hoc approaches to diagnosis miss the opportunity to preempt the emergence of drug resistance and implement preventive measures prior to the acquisition and spread of antibiotic resistant disease.

An increased risk of drug resistance emergence is often attributed to inadequate implementation of control measures [5], but bacterial factors have also been proposed as potential contributors to drug resistance [6]. Evidence of differential drug resistance acquisition at the Mycobacterium tuberculosis sublineage level is conflicting. Epidemiological and in vitro studies have suggested that the Beijing family, belonging to lineage 2, is hyper-mutable [7] with a propensity to develop resistance at a higher frequency than other lineages [8–11], while others cite evidence to the contrary [12, 13]. Pre-existing resistance to one antibiotic (mono-resistance) is another factor that may influence the acquisition of multidrug-resistance [14]. Mono-resistance to isoniazid or rifampicin has been associated with increased rates of further development of multidrug-resistance [15, 16], but the relative risk of either remains unclear. Similarly, phylogenetic analyses suggest a stepwise progression towards multidrug-resistance, where mutations conferring isoniazid resistance tend to precede those linked to rifampicin resistance [17–20].

Phylogenetic trees have been increasingly used to study pathogen dynamics and evolutionary processes of a wide range of phenotypes of epidemiological interest, including virulence or drug resistance acquisition [21, 22]. A necessary focus on improving the molecular diagnosis of drug resistance has led to the generation of large strain collections of drug resistant pathogens. However, unrepresentative samples of this kind, enriched for drug-resistant isolates, limit the ability to characterize the dynamics of drug resistance from a diverse background of ancestral susceptible strains. Inadequate sampling without comprehensive population level coverage or sufficient temporal span compounds this problem, while the monomorphic nature of the tuberculosis genome makes constructing time-calibrated phylogenetic trees particularly challenging. As a consequence, a single mutation rate is often applied to the data, but this assumption inappropriately forces lineages and sub-lineages to conform to the same global mutation rate, thus limiting the inferences that can be made from the data.

Overcoming these issues, we present findings from samples collected over a 17-year time span with population level coverage in the hyperendemic suburbs of Lima, Peru. We apply a novel genome-wide survival analysis to a time-calibrated phylogeny of 3135 tuberculosis strains and show the existence of pre-resistance mutations among drug susceptible genotypes that increase the risk of future drug resistance emergence in tuberculosis. We demonstrate clear differences in the acquisition of drug resistance between lineages, on mono-resistant backgrounds, and at the level of nucleotide polymorphisms. Our findings are then tested and replicated in an independent publicly available data set of 1000 whole genomes collected in Samara, Russia, to demonstrate that they can be globally generalized.

Results

Population structure, genomic analysis, and patient demographics

A total of 3432 Mycobacterium tuberculosis genomes from Lima (Peru) were analyzed, of which 3135 passed genomic quality filters. Of these, 2037 were part of a population level study carried out in 2009, in which sputum samples were taken from all patients presenting with tuberculosis symptoms in the Lima areas of Callao and Lima South [23]. Drug resistance prevalence in this population level sample was consistent with epidemiological reports from Peru [2]: 1.5% (32/2037) of samples were rifampicin mono-resistant, 5% (105/2037) were isoniazid mono-resistant, and 13% (269/2037) were multidrug-resistant. The remaining samples were collected from cohort studies covering a 17-year period of research in the regions of Lima and Callao in order to achieve a sufficient temporal span in our sampling window (Supplementary Fig. 1).

The isolates were first aligned to the reference genome H37Rv, then lineages and sublineages were assigned using clade specific SNPs [24]. Lineage 4 (L4, Euro-American) consisted of 2807 samples, while lineage 2 (L2, Beijing) had 327 isolates (Table 1). There was a single representative of lineage 1 (Indo-Oceanic), which was used to root the phylogenetic tree. The remaining samples from the data set, which included 5 Mycobacterium caprae isolates, were not used in the downstream analysis. Lineage 4 had the highest diversity, comprising 1235 isolates from lineage 4.3 (LAM), 935 from lineage 4.1.2.1 (Haarlem), 271 from lineage 4.1.1 (X-Type), and 312 from lineages denoted as Type T, which encompasses lineages 4.5, 4.7, 4.8 and 4.9. Other minor sublineages included lineage 4.2.2 (TUR) and lineage 4.4. All isolates from lineage 2 were part of the Beijing sublineage (lineage 2.2), mainly from sublineage 2.2.1 (Modern Beijing), with only one representative of sublineage 2.2.2 (Asia Ancestral) [25] (Table 1).

The alignment of the isolates to the reference genome resulted in 64,586 SNPs, of which 18,022 were singletons (28%). Most SNPs were not widely distributed across the dataset, and only 8,088 variants had a frequency higher than 1%. A total of 59,789 SNPs (16,934 singletons, 28%) were identified for lineage 4, and 4,821 SNPs (1,370 singletons, 28%) for lineage 2. Similarly, we applied the same analysis to a publicly available data set from Samara, Russia, as a validation set where, unlike the Peruvian dataset, lineage 2 constitutes the main lineage. We identified a total of 28,414 SNPs, consistent with previous publications of this data [19].

Patient demographic metadata was available for 2220 samples, of which 88% were smear positive, 27% were previously treated for tuberculosis, and 2.8% were HIV positive, consistent with previous population level estimates in Peru [26]. The median age was 28 years (IQR 21–41). In our cohort, 86% of lineage 2 and 89% of lineage 4 were smear positive.

Phylogenetic analysis and drug resistance emergence

The maximum likelihood phylogeny constructed using these alignments grouped the isolates by lineage similarly to previously published global data sets [24] (Fig. 1a, Supplementary Fig. 2). To study the temporal dynamics of drug resistance acquisition at the population level, the maximum likelihood phylogeny was time-calibrated using the sampling dates of the isolates, which extended from 1999 to 2016. Dated phylogenies were built separately for lineage 2 and lineage 4 in order to avoid the confounding effect of the temporal and population structures [27]. Both datasets had a clear temporal signal (Supplementary Fig. 3), and thus model parameters could be confidently inferred from the data [28, 29]. We used the relaxed clock model implemented in BactDating [30], allowing the mutation rate to vary in each branch independently. We ran the MCMC until convergence of the chains was achieved, with an effective sample size (ESS) of at least 100 (Supplementary Fig. 4). The estimated rate for lineage 2 was 0.45 substitutions per genome per year (95% CI 0.32–0.57), while lineage 4 had a clock rate of 0.299 (95% CI 0.25–0.36) (Fig. 1b). The estimates of the molecular clock for both lineages were consistent with previous reports [31]. The time to the most recent common ancestor (tMRCA) for our samples was placed at 565 CE (95% CI 263–826) for lineage 4 (Fig. 1c), while lineage 2 had a tMRCA in 1325 CE (95% CI 1070–1499) (Fig. 1d).

Drug resistance was inferred for all isolates at the tips of the phylogenetic tree using well-established drug resistance associated SNPs [32]. The time of emergence of drug resistant mutations was inferred by reconstructing the ancestral sequences of the internal nodes in the phylogenetic tree. The time of emergence of a specific antibiotic resistance mutation was approximated to the inferred year of the internal node where such mutation first appeared. In lineage 4, the phylogenetic estimates of the emergence of drug resistance conferring mutations fell around the time of the known introductions of the corresponding drugs. In contrast, drug resistance in lineage 2 arose many years after the introduction of antituberculous drugs. This is consistent with the geographic spread of lineage 4 in Europe together with early widespread use of drugs in this region (Table 2, Fig. 2). For both lineage 2 and lineage 4, the earliest inferred occurrence of resistance was to isoniazid, by the Ser315Thr mutation in the gene katG, around 1957 (95% CI 1928–1978) for lineage 2 and 1942 (95% CI 1913–1960) for lineage 4, in line with the reported wide introduction of isoniazid in 1952 [33]. The rifamycins were first isolated in 1957 [33], and we estimate the date for the first acquisition of resistance to rifampicin due to the rpoB mutation Ser450Leu to be around 1951 (95% CI 1931–1971) for lineage 4 and 1974 (95% CI 1953–1987.8) for lineage 2. None of the drug resistant nodes reverted to susceptible along the branches of the two phylogenetic trees.
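The dating rule described above (a resistance mutation is assigned the inferred year of the node where it first appears) can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the `parent` map, the node years, and the per-node mutation calls are hypothetical inputs that would come from the dated phylogeny and the ancestral reconstruction.

```python
from typing import Dict, List, Optional, Tuple


def emergence_events(
    parent: Dict[str, Optional[str]],
    node_year: Dict[str, float],
    has_mutation: Dict[str, bool],
) -> List[Tuple[float, str]]:
    """Return sorted (year, node) pairs where a mutation is gained.

    A gain is recorded at a node that carries the mutation while its parent
    does not. The root is skipped because a gain above it cannot be observed.
    """
    events = []
    for node, par in parent.items():
        if par is None:
            continue
        if has_mutation[node] and not has_mutation[par]:
            events.append((node_year[node], node))
    return sorted(events)


# Hypothetical toy tree: root -> n1 -> (t1, t2) and root -> n2 -> (t3, t4).
parent = {"root": None, "n1": "root", "n2": "root",
          "t1": "n1", "t2": "n1", "t3": "n2", "t4": "n2"}
node_year = {"root": 1900.0, "n1": 1957.0, "n2": 1940.0,
             "t1": 2009.0, "t2": 2010.0, "t3": 2005.0, "t4": 2012.0}
has_mutation = {"root": False, "n1": True, "n2": False,
                "t1": True, "t2": True, "t3": False, "t4": True}

events = emergence_events(parent, node_year, has_mutation)
first_year, first_node = events[0]  # earliest inferred emergence
```

The length of `events` counts independent acquisitions of the same mutation, which is the quantity reported for homoplastic variants.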

Between-lineage differences in drug resistance acquisition

The risk of acquiring drug resistance was calculated as the Cox Proportional Hazard Ratio (HR) using the time between sensitive internal nodes and the first drug resistant node in the time-calibrated phylogenetic trees. Lineage 2 was characterized by a higher incidence of drug resistance acquisition when compared to lineage 4 in both the Peruvian dataset (HR 3.36, 95% CI 2.10–5.38, likelihood ratio test p-value = 4.25 × 10⁻⁷; Fig. 3a) and the Samara dataset (HR 4.82, 95% CI 3.74–6.21, likelihood ratio test p-value < 10⁻¹⁵; Fig. 3b). Similarly, the incidence of drug resistance acquisition was also higher in lineage 2 when compared to all sublineages of lineage 4 in the Peruvian dataset (all p-values for comparison < 0.05; Fig. 3c). To assess whether our data met the proportional hazards assumption, we tested the relationship between the Schoenfeld residuals and time. In all cases, a non-significant association between the Schoenfeld residuals and time supported the use of the proportional hazards model (Supplementary Fig. 5).
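To make the hazard ratio concrete, the sketch below fits a Cox model with a single binary covariate (for example, lineage 2 versus lineage 4) by Newton-Raphson on the partial likelihood, using the Breslow convention for ties. It is an illustration under stated assumptions, not the software used in the study: the function name and the toy durations are hypothetical, and each record stands for the time from a susceptible node to either the first resistant node (event = 1) or censoring (event = 0).

```python
import math
from typing import List, Tuple


def cox_binary(durations: List[float], events: List[int], group: List[int],
               max_iter: int = 50, tol: float = 1e-10) -> Tuple[float, float, float]:
    """Return (hazard ratio, lower 95% limit, upper 95% limit) for a 0/1 covariate.

    Assumes at least one risk set contains both groups; otherwise the
    information is zero and the estimate is undefined.
    """
    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        grad, info = 0.0, 0.0
        for t_i, d_i, x_i in zip(durations, events, group):
            if not d_i:
                continue  # censored records only contribute to risk sets
            s0 = s1 = 0.0
            for t_j, x_j in zip(durations, group):
                if t_j >= t_i:  # record j is still at risk at time t_i
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
            p = s1 / s0              # weighted share of group 1 in the risk set
            grad += x_i - p          # score contribution
            info += p * (1.0 - p)    # information for a binary covariate
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)       # Wald standard error of log(HR)
    return (math.exp(beta),
            math.exp(beta - 1.96 * se),
            math.exp(beta + 1.96 * se))


# Hypothetical toy data: group 1 tends to acquire resistance sooner.
durations = [2, 3, 4, 5, 8, 4, 6, 7, 9, 10, 12]
events = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
group = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
hr, ci_low, ci_high = cox_binary(durations, events, group)
```

The interval returned here is a Wald interval, whereas the p-values quoted in the text come from a likelihood ratio test, so the two need not agree exactly.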

Since previous treatment with antituberculous drugs has been associated with an increased risk of acquiring drug resistance [34], we tested the association between previous treatment and the different sublineages of our cohort. A total of 2236 samples contained metadata regarding previous treatment of tuberculosis. We found no significant association between Mycobacterium tuberculosis sublineages and previous treatment with antituberculous drugs in a logistic regression model (n = 2236, all p-values > 0.1), suggesting that the between-lineage differences in drug resistance acquisition observed in the survival analysis are not confounded by a differential distribution of antibiotics between sublineages. Similarly, we found no significant association between Mycobacterium tuberculosis sublineages and HIV and smear positivity in a logistic regression model (n = 2133, all p-values > 0.1).

In order to evaluate the robustness of our maximum likelihood phylogeny, we repeated both the dating and the survival analysis on 100 phylogenetic bootstrap replicates. In both lineage 2 and lineage 4, the Cox Proportional Hazard ratio was not significantly different between the maximum likelihood phylogeny and the bootstrap replicates (Supplementary Fig. 6a,b). Additionally, the Kaplan-Meier curve of the 100 replicates was similar to that of the maximum likelihood tree (Supplementary Fig. 6c,d).
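For readers unfamiliar with the Kaplan-Meier curves compared above, the following sketch shows the product-limit estimator on the same kind of records used for the Cox model (a duration plus an event flag). The function name and the toy input are hypothetical; the study's own curves were derived from the dated phylogenies and bootstrap replicates.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    curve = []
    survival = 1.0
    event_times = sorted({d for d, e in zip(durations, events) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        failed = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - failed / at_risk  # product-limit update
        curve.append((t, survival))
    return curve


# Hypothetical toy input: four records, one censored at time 2.
curve = kaplan_meier([1, 2, 2, 3], [1, 0, 1, 1])
```

Here "survival" means remaining drug susceptible, so a curve that drops faster corresponds to faster resistance acquisition.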

The risk of developing MDR TB from isoniazid mono-resistance

To determine the effect of mono-resistance on the acquisition of further multidrug-resistance, the hazard ratio of acquiring rifampicin resistance was calculated for isoniazid mono-resistant ancestral genotypes versus susceptible ancestral strains. Genotypes with an isoniazid mono-resistant background had 15 times the hazard of developing rifampicin-resistant tuberculosis relative to wild type susceptible strains (HR 15.12, 95% CI 10.54–21.69, likelihood ratio test p-value < 10⁻¹⁵; Fig. 4a). A larger hazard ratio was obtained in the Samara dataset (HR 37.28, 95% CI 18.81–73.88, likelihood ratio test p-value < 10⁻¹⁵; Fig. 4b), although the low prevalence of mono-resistant clades in the Samara set may bias this estimate.

Due to the infrequent occurrence of rifampicin mono-resistance prior to multidrug-resistance emergence, the risk of developing multidrug-resistance from a rifampicin mono-resistance background could not be reliably estimated.

Genomic signatures of drug resistance acquisition

Genome-wide survival analysis was performed using a Cox Proportional Hazard regression model to identify genetic variants in phylogenetic nodes inferred to be drug susceptible but associated with a higher risk of progression towards drug resistance. Resistant nodes were defined as those inferred to be resistant to any antibiotic in order to identify common pathways of increased risk of acquiring resistance regardless of the specific antibiotic. The association analysis was performed in lineage 4 and lineage 2 separately, and we further corrected for population structure using a kinship matrix, which reduced the genomic inflation factor (λ) to 1.21 (Supplementary Fig. 7).
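The genomic inflation factor quoted above is commonly computed as the median of the observed association statistics divided by the median expected under the null. The sketch below follows that common definition for 1-degree-of-freedom tests; whether the study used exactly this estimator is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist, median
from typing import Iterable

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values: Iterable[float]) -> float:
    """Median observed chi-square statistic over its expected null median.

    Each two-sided p-value is mapped to a 1-df chi-square statistic through
    chi2 = z**2 with z = Phi^-1(1 - p / 2). P-values must lie in (0, 1].
    """
    chi2 = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / _CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a well-calibrated test (lambda close to 1).
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(uniform_p)
```

Values well above 1 indicate that test statistics are systematically inflated, for example by uncorrected population structure.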

Three variants in drug susceptible ancestral genotypes were associated with a higher risk of acquiring drug resistance in lineage 4 after correction for population structure and multiple testing, two of which were in previously annotated genes (Fig. 5a). The variant with the lowest p-value corresponded to a 9bp deletion at location 2,604,157 in the locus lppP, which encodes a lipoprotein and has been predicted to be required for growth in macrophages [35]. This deletion had a frequency of 1.7% in the population, and it evolved 12 times independently along the phylogenetic tree. Genotypes with this variant had 6.7 times the hazard of acquiring drug resistance compared with those with an intact lppP (Fig. 5b, HR 6.71, 95% CI 4.82–11.22, p-value = 1.17 × 10⁻⁹). A synonymous polymorphism at position 2,626,011 in the gene esxO (HR 1.64, 95% CI 0.9–3.33, p-value = 3.23 × 10⁻⁶) with a frequency in the population of 5% had the second lowest p-value. The esxO gene is part of a family of protein secretion systems described to be critical for growth, pathogenesis, and mycobacterial–host interactions [36].

No significant associations were identified for lineage 2 after correcting for population structure, possibly due to the lower diversity of lineage 2 as well as the strong lineage effect on the phenotype. The analysis could not be replicated in the Samara dataset, as it was significantly smaller and hence lacked sufficient statistical power.

Discussion

This study represents the largest population level genomic analysis of Mycobacterium tuberculosis to date. Our 17-year sampling time frame provided a unique opportunity to study drug resistance acquisition dynamics and evolution. To our knowledge this is the first description and evaluation of pathogen pre-resistance (pre-existing balanced polymorphisms that predispose to the acquisition of future drug resistance).

Using a novel ancestral state genome-wide survival analysis to move in time through the phylogenetic tree, we show that Mycobacterium tuberculosis is predisposed to acquire drug resistance mutations at the lineage level, after mono-resistance, and at the level of nucleotide polymorphisms. Identifying pathogen genetic factors that predispose strains to evolve drug resistance could help prevent the acquisition and spread of resistance as well as treatment failure by expanding treatments to those strains most likely to become resistant in the future.

Previous studies of acquired drug resistance at the sublineage level in Mycobacterium tuberculosis have led to contradictory outcomes, relying on small sample sizes in fluctuation assays [7, 12] or on amalgamated sub-population level samples. Here we demonstrate conclusively that lineage 2 acquired resistance to antibiotics more rapidly than lineage 4. There were no significant differences observed in drug resistance acquisition between the sub-lineages of the most diverse lineage 4. Even though lineage 2 showed an increased risk in drug resistance acquisition, lineage 4 evolved resistance earlier than lineage 2 for almost all drugs analyzed, with the exception of streptomycin. This may be explained by the Euro-American distribution of lineage 4 and the earlier widespread implementation of antibiotics in these regions.

The identification and control of mono-resistant strains is a key component of tuberculosis public health infection prevention and control efforts. Mono-resistance is associated with worse clinical outcomes [37] and an increased probability of progressing to multidrug-resistance [15]. At the population level, our study quantifies this risk and shows that isoniazid mono-resistant strains have at least 15 times the hazard of developing multidrug-resistance relative to wild type susceptible strains. Despite the use of new therapeutics, multidrug-resistant tuberculosis continues to require polypharmacy with increased toxicity and longer treatment duration [2]. Globally, drug resistance surveillance is focused primarily on rifampicin, with the widely implemented GeneXpert MTB/RIF PCR based assay unable to detect isoniazid mono-resistance. Inadequate diagnosis of isoniazid mono-resistance will inevitably lead to inappropriate treatment and could fuel rapid evolution of multidrug-resistance, thus posing a significant threat to tuberculosis control.

We also identified three genomic loci associated with higher risk of future drug resistance acquisition. To remain polymorphic these variants must be under balancing selection and only be positively selected once exposed to drug therapy. Rather than causing resistance directly, these variants could promote resistance acquisition by compensating for the fitness costs of resistance in vivo [38].

The variant with the lowest p-value corresponded to a 9bp deletion in the gene lppP present at a frequency of 1.7% in the population and arising 12 times independently. Deletions in lipoproteins have been well characterized in the past [39], and lppP has been predicted to be required for growth in macrophages [35]. Lipoproteins can act as antigenic proteins [39], and thus deletions in the genes encoding them may alter the interaction between the bacilli and the macrophages, potentially conferring a selective advantage in the presence of drug and increasing the probability of acquiring drug resistance.

An additional variant was identified in the gene esxO, which encodes the ESAT-6-like protein EsxO. The gene esxO is part of a family of genes that code for immunogenic secreted proteins that play a role in mycobacterial growth, pathogenesis, and host-pathogen interactions [36]. Moreover, esxO has been associated with pathogenesis by inducing autophagy in infected macrophages [40]. Synonymous homoplastic variants in ESX genes have been previously identified [23], but their phenotypic effects are still unclear.

This study benefited from an unbiased population level coverage of both drug resistant and drug susceptible strains that enabled us to reliably correct for the founder effect and control for the influence of pre-existing population diversity. The large sampling size and time frame (a consequence of 17 years of continued research in the same location) allowed us to generate time-calibrated phylogenies without imposing a global mutation rate. This was a prerequisite for the downstream analyses and our novel GWAS survival analysis approach. We were also able to validate our sublineage and mono-resistance dependent hazards of acquired resistance in the smaller Samara dataset. However, the time scale and size of this publicly available data was insufficient to allow us to confirm the effect of the lppP deletion in a second independent dataset.

Although our phylogenetic analysis reveals trends of drug resistance acquisition over evolutionary time, prospective cohort studies are required to determine the effect of these mutations at the individual patient and household level. There was no difference observed in the proportion of patients receiving previous antituberculous treatment between the two lineages. This makes our findings unlikely to be influenced by differential exposure to drugs among lineages. There was no difference in sputum smear grade between lineages, suggesting that our findings are not a consequence of increased pathogenicity of lineage 2 in comparison to lineage 4. Significant immigration to Peru occurred well before the advent of antibiotics [41], which limits the influence of imported drug resistant strains. Nevertheless, it is possible that some resistance events may have arisen as a result of importation of resistant strains from countries with different drug selection pressures.

In summary, this population wide 17-year long epidemiological study of Mycobacterium tuberculosis genetics provides the first description and evaluation of pre-resistant balanced polymorphisms in susceptible genotypes that predispose to the acquisition of future drug resistance. Prediction of future drug resistance in susceptible pathogens together with targeted expanded therapy has the potential to prevent drug resistance emergence in Mycobacterium tuberculosis and other pathogens. Prospective cohort studies of participants with and without these polymorphisms should be undertaken to inform clinical trials of personalized pathogen genomic therapy. This novel ancestral state genome-wide survival analysis could also be employed to predict and prevent the emergence of resistance or indeed any important trait of interest in other organisms.

Materials and methods

Study design and sample selection

Samples were selected from previous projects taking place across the region of Lima. The first project consisted of a population level study carried out between 2008 and 2010 as part of the population level implementation of Microscopic Observation Drug Susceptibility (MODS) testing [42, 43]. Samples were collected from a total of 2,139 unselected patients with tuberculosis. Of these, 284 were analyzed in previous studies (PRJEB5280) [23], while 1855 were processed as part of this project.

A second set of 213 MDR-TB samples was obtained from a 3-year long household follow-up study conducted between 2010 and 2013 [34], of which 185 randomly selected samples underwent whole-genome sequencing (PRJEB5280) [23].

Additionally, 42 unpublished whole-genome sequences of samples collected from 1999 to 2007 in different regions of Lima were added to the study, as well as 575 samples that were collected between 2003 and 2013 as part of the CRyPTIC Consortium (PRJEB32234) [44], and 489 samples from the TANDEM Consortium (PRJEB23245) [45] taken between 2014 and 2016.

All samples without collection date, as well as reference clinical samples, were excluded from the final list. Drug susceptibility testing (DST) was performed either by MODS [43] or by the proportions method on agar [46].

To replicate our findings, all the analyses were repeated in a publicly available independent dataset from Samara, Russia (PRJEB2138), as the sampling was also representative of the population [19].

Whole-genome sequence analysis

Quality analysis of the raw reads was performed using FastQC [47]. A de novo assembly of the short reads was done with the SPAdes genome assembler v3.14.0 [48] across k-mers of size 21, 33, 45, 55, 65, 75, 81, 101, 111, and 121. The resulting assembly contigs were mapped to the well annotated H37Rv reference genome (GenBank: AL123456) using minimap2 [49] with the asm20 option. Single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) were identified with BCFtools mpileup and BCFtools call v1.9 [50] using the multiallelic calling algorithm, keeping the information about every single site in the genome in a VCF file. Lastly, indels were left-aligned and normalised using BCFtools norm. A consensus sequence was created from the VCF file. In order to determine the quality of the variants, the raw reads were mapped against the resulting consensus sequence using the mem algorithm implemented in BWA v0.7.17 [51], after which the alignments were sorted using SAMtools v1.9 [52] and filtered for possible PCR and optical duplicates using Picard v2.19.0 [53]. Local realignment around indels was performed using the GATK v3.8-1-0 ‘IndelRealigner’ module [54]. The mean coverage for each sample was calculated as the number of mapped bases (excluding soft-clipped bases) divided by the genome size. Samples with a mean coverage lower than 15x were excluded from subsequent analysis. SNPs and indels were detected as described in the previous step.
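The coverage rule at the end of this paragraph reduces to one division and one comparison. The sketch below assumes the count of mapped, non-soft-clipped bases has already been obtained from the alignment; the function names are hypothetical, and the H37Rv genome length is an assumed constant rather than a value stated in the text.

```python
H37RV_GENOME_SIZE = 4_411_532  # assumed length (bp) of the H37Rv reference
MIN_MEAN_COVERAGE = 15.0       # cutoff stated in the text (15x)


def mean_coverage(mapped_bases: int, genome_size: int = H37RV_GENOME_SIZE) -> float:
    """Mean depth: mapped bases (soft-clipped bases excluded) over genome size."""
    return mapped_bases / genome_size


def passes_coverage(mapped_bases: int, genome_size: int = H37RV_GENOME_SIZE) -> bool:
    """Keep a sample unless its mean coverage is lower than the cutoff."""
    return mean_coverage(mapped_bases, genome_size) >= MIN_MEAN_COVERAGE
```

A sample with exactly 15x mean coverage is kept under this reading, since the text excludes only samples below the cutoff.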

Variants that did not meet the quality criteria were filtered using a combination of BCFtools filter and custom scripts in Python v3.7.3 with the following cutoffs: minimum Phred-scaled quality score (QUAL) of 20; minimum mapping quality (MQ) of 20; minimum genotype quality (GQ) of 20; minimum read position bias (RPB), mapping quality bias (MQB), and strand bias (SP) of 0.001; minimum depth (DP) of 10 and a maximum of 5 times the mean coverage; minimum of reads supporting the alternate allele (AD) of 75% of the total depth in that position, with no less than two reads in the forward (ADF) and the reverse (ADR) strands. Additionally, SNPs within 2bp of an indel and indels within 3bp of another indel were removed, as both situations can be indicative of mapping artifacts. Positions that did not meet the quality criteria were annotated using the IUPAC ambiguity codes [55]. Samples with more than 25 high quality heterozygous calls were removed to avoid the inclusion of putative mixed infections. Finally, variants that overlapped 100bp intervals around known hypervariable regions, such as repetitive elements and transposases [56], were removed from the analysis, as these may affect the reliability of the alignment. Similarly, recombinant regions in genes coding for proline-glutamate (PE) or proline-proline-glutamate (PPE) [57], and SNPs implicated in drug resistance [32], were excluded in order to minimize homoplasies that could disrupt the tree topology. The resulting sequences were concatenated to generate a multiple sequence alignment. Sites with a proportion of ambiguous bases higher than 10% were excluded from the analysis. Finally, samples with a proportion of ambiguous sites in the alignment higher than 5% were excluded.
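The per-variant cutoffs listed above can be read as a single predicate. The sketch below is one possible reading, not the authors' custom script: it assumes each variant has already been parsed into a mapping keyed by the tag names used in the text, that `AD`, `ADF` and `ADR` hold alternate-allele read counts, and that the bias fields are compared to 0.001 exactly as worded (how the calling software scales those fields is not verified here).

```python
from typing import Mapping


def passes_quality(variant: Mapping[str, float], mean_cov: float) -> bool:
    """Apply the stated cutoffs to one parsed variant record (hypothetical layout)."""
    depth = variant["DP"]
    return (
        variant["QUAL"] >= 20
        and variant["MQ"] >= 20
        and variant["GQ"] >= 20
        # Bias tests: each value must be at least 0.001, as worded in the text.
        and min(variant["RPB"], variant["MQB"], variant["SP"]) >= 0.001
        # Depth between 10 reads and five times the sample's mean coverage.
        and 10 <= depth <= 5 * mean_cov
        # At least 75% of reads support the alternate allele.
        and variant["AD"] >= 0.75 * depth
        # At least two supporting reads on each strand.
        and variant["ADF"] >= 2
        and variant["ADR"] >= 2
    )


# Hypothetical record that satisfies every cutoff at 50x mean coverage.
good = {"QUAL": 50, "MQ": 60, "GQ": 99, "RPB": 0.5, "MQB": 0.5, "SP": 0.5,
        "DP": 40, "AD": 38, "ADF": 20, "ADR": 18}
keep = passes_quality(good, mean_cov=50.0)
```

Positions failing the predicate would then be written as IUPAC ambiguity codes rather than dropped, which is what allows the later per-site and per-sample ambiguity filters.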

Phylogenetic analyses

All the phylogenetic analyses were performed based on the alignment containing both lineage 4 and lineage 2 samples, as well as separately for each lineage.


A maximum likelihood phylogeny was inferred using RAxML-NG [58] with the GTR model, 20 starting trees (10 random and 10 parsimony), 100 bootstrap replicates, and a minimum branch length of $10^{-9}$. A randomly selected lineage 2 sample from our dataset was used as an outgroup for the lineage 4 phylogeny. Likewise, a random lineage 4 isolate was used to root the lineage 2 phylogeny. The tree containing both lineage 4 and lineage 2 samples was rooted using a lineage 1 isolate.


In order to construct a time-calibrated phylogeny, we tested whether there was a detectable amount of evolutionary change between samples collected at different times [28, 29]. This was done in lineage 4 and lineage 2 separately to prevent population structure from confounding the temporal signal [27]. Two different tests of the temporal signal were applied: the root-to-tip regression method and the date-randomization test [59]. For the former, BactDating [30] was used to perform a linear regression between the collection dates and their root-to-tip genetic distance in the maximum likelihood tree. Additionally, we carried out a date-randomization test, where evolutionary rates estimated by BactDating [30] were compared between the observed data set and 100 data sets obtained by permutation of sampling dates [60].
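The logic of the two temporal-signal checks can be illustrated with a small Python sketch. It is a simplified stand-in and not the BactDating procedure: the rate is taken here as the ordinary least-squares slope of root-to-tip distance on sampling date, whereas the study compares rates estimated by BactDating; all function names are hypothetical.

```python
import random


def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, root_date, r_squared): the slope, the x-intercept
    (date at which the fitted distance is zero) and the coefficient
    of determination.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    root_date = -intercept / rate if rate != 0 else float("nan")
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return rate, root_date, r_squared


def date_randomization(dates, distances, n_perm=100, seed=0):
    """Observed rate and the rates obtained after permuting sampling dates."""
    rng = random.Random(seed)
    observed = root_to_tip_regression(dates, distances)[0]
    shuffled = list(dates)
    null_rates = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_rates.append(root_to_tip_regression(shuffled, distances)[0])
    return observed, null_rates
```

A data set would be said to carry temporal signal when the observed rate lies clearly outside the spread of the permuted rates; the study applies the stricter criterion that the interval estimates do not overlap.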

BactDating [30] was used to time-calibrate the tree using the mixed model for $10^{7}$ iterations to achieve both convergence of the MCMC chains and an effective sample size of at least 100.


The subsequent phylogenetic analysis was performed using the R package ape [61]. Ancestral sequence reconstruction was carried out by maximum likelihood as implemented in phangorn [62], including gaps and ambiguity codes [55].

Time-to-event analysis

Time-to-event analysis was performed on the tree using the R package survival [63]. The time was measured for all pairs of nodes as the distance between the older and the younger node in the time-calibrated phylogeny. An observation was defined as censored if both nodes were drug sensitive. On the other hand, an event occurred if the older node was drug sensitive and the younger node was drug resistant. Only the first acquisition of resistance was considered. Observations taking place before 1940 were discarded. The survival curve and the Cox proportional hazard ratio were calculated. The entire pipeline was repeated for 100 phylogenetic bootstrap replicates.
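The event and censoring definitions above, together with the survival curve they feed into, can be sketched as follows. This is an illustrative pure-Python Kaplan-Meier estimator, not the R survival code used in the study; the function names are hypothetical and the Cox model is not reproduced here.

```python
def classify_pair(older_resistant, younger_resistant):
    """Event (True), censored (False) or unused (None) for one node pair."""
    if older_resistant:
        return None  # resistance already present: not a first acquisition
    return bool(younger_resistant)


def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of the probability of remaining susceptible.

    `durations` are times between the older and the younger node,
    `events` are True for acquisition of resistance and False for censoring.
    Returns a list of (time, survival) at each time with at least one event.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all observations sharing the same time.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve
```

For example, five observations with durations 5, 10, 10, 20 and 30 years, of which the first 10-year and the 30-year observations are censored, give survival estimates of 0.8, 0.6 and 0.3 at 5, 10 and 20 years.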

Genome-wide association of predisposition to drug resistance

Association analysis was performed using the variant sites alignment for the tips of the phylogenetic tree, as well as the reconstructed sequences of the internal nodes. The phenotype was defined as leading to resistance in the phylogenetic tree, and thus only drug susceptible nodes that immediately preceded the first resistant node of each branch were considered. Variants with a frequency in the population (tips of the tree) lower than 1% or a proportion of ambiguous bases higher than 5% were excluded. Furthermore, only those variants that were polymorphic at the node level were used. Genome-wide association was performed using the Cox proportional hazard model and the time between nodes. In order to correct for population structure, a genetic distance matrix was calculated using SNPs with a frequency in the population higher than 5%, and the eigenvectors were used as covariates in the Cox regression model. The p-values were corrected for multiple testing using a Bonferroni correction.
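The multiple-testing step can be made concrete with a short sketch. The function name and the family-wise level of 0.05 are assumptions, since the text does not state the level used; under that assumption, the threshold of $3.37 \times 10^{-5}$ reported in Figure 5 would correspond to roughly 1,480 tested variants.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for a list of p-values.

    Returns (threshold, adjusted, significant), where `threshold` is
    alpha divided by the number of tests, `adjusted` holds the p-values
    multiplied by the number of tests (capped at 1), and `significant`
    flags the tests whose raw p-value falls below the threshold.
    """
    m = len(p_values)
    threshold = alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p < threshold for p in p_values]
    return threshold, adjusted, significant
```

Comparing raw p-values with alpha divided by the number of tests is equivalent to comparing the adjusted p-values with alpha; the first form is the one drawn as a horizontal line on a Manhattan plot.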

Data availability

All raw sequencing data are available with accession numbers listed in Materials and Methods. Samples sequenced as part of this study have been submitted to the European Nucleotide Archive under accession PRJEB39837.

References

1. Paulson, T. : A mortal foe. Nature 502, S2–S3 (2013).

2. World Health Organization. Global tuberculosis report 2020 (2020).

3. Raviglione, M. XDR-TB: Entering the post-antibiotic era? International Journal of Tuberculosis and Lung Disease 10, 1185–1187 (2006).

4. Walker, T. M. et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. The Lancet Infectious Diseases 15, 1193–1202 (2015).

5. Gandhi, N. R. et al. Multidrug-resistant and extensively drug-resistant tuberculosis: a threat to global control of tuberculosis. The Lancet 375, 1830–1843 (2010).

6. Gygli, S. M., Borrell, S., Trauner, A. & Gagneux, S. Antimicrobial resistance in Mycobacterium tuberculosis: mechanistic and evolutionary perspectives. FEMS Microbiology Reviews 41, 354–373 (2017).

7. Ford, C. B. et al. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nature Genetics 45, 784–790 (2013).

8. Cox, H. S. et al. The Beijing genotype and drug resistant tuberculosis in the Aral Sea region of Central Asia. Respiratory Research 6, 134 (2005).

9. Drobniewski, F. Drug-Resistant Tuberculosis, Clinical Virulence, and the Dominance of the Beijing Strain Family in Russia. JAMA 293, 2726 (2005).

10. Taype, C. et al. Genetic diversity, population structure and drug resistance of Mycobacterium tuberculosis in Peru. Infection, Genetics and Evolution 12, 577–585 (2012).

11. Pang, Y. et al. Spoligotyping and Drug Resistance Analysis of Mycobacterium tuberculosis Strains from National Survey in China. PLoS ONE 7, e32976 (2012).

12. Werngren, J. & Hoffner, S. E. Drug-susceptible Mycobacterium tuberculosis Beijing genotype does not develop mutation-conferred resistance to rifampin at an elevated rate. Journal of Clinical Microbiology 41, 1520–1524 (2003).

13. Glynn, J. R., Whiteley, J., Bifani, P. J., Kremer, K. & van Soolingen, D. Worldwide Occurrence of Beijing/W Strains of Mycobacterium tuberculosis: A Systematic Review. Emerging Infectious Diseases 8, 843–849 (2002).

14. Borrell, S. et al. Epistasis between antibiotic resistance mutations drives the evolution of extensively drug-resistant tuberculosis. Evolution, Medicine, and Public Health 2013, 65–74 (2013).

15. Saunders, N. J. et al. Deep resequencing of serial sputum isolates of Mycobacterium tuberculosis during therapeutic failure due to poor compliance reveals stepwise mutation of key resistance genes on an otherwise stable genetic background. Journal of Infection 62, 212–217 (2011).

16. Menzies, D. et al. Effect of Duration and Intermittency of Rifampin on Tuberculosis Treatment Outcomes: A Systematic Review and Meta-Analysis. PLoS Medicine 6, e1000146 (2009).

17. Eldholm, V. et al. Four decades of transmission of a multidrug-resistant Mycobacterium tuberculosis outbreak strain. Nature Communications 6, 7119 (2015).

18. Cohen, K. A. et al. Evolution of Extensively Drug-Resistant Tuberculosis over Four Decades: Whole Genome Sequencing and Dating Analysis of Mycobacterium tuberculosis Isolates from KwaZulu-Natal. PLOS Medicine 12, e1001880 (2015).

19. Casali, N. et al. Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nature Genetics 46, 279–286 (2014).

20. Manson, A. L. et al. Genomic analysis of globally diverse Mycobacterium tuberculosis strains provides insights into the emergence and spread of multidrug resistance. Nature Genetics 49, 395–402 (2017).

21. Grenfell, B. T. Unifying the Epidemiological and Evolutionary Dynamics of Pathogens. Science 303, 327–332 (2004).

22. Galvani, A. P. Epidemiology meets evolutionary . Trends in Ecology & Evolution 18, 132–139 (2003).

23. Grandjean, L. et al. Convergent evolution and topologically disruptive polymorphisms among multidrug-resistant tuberculosis in Peru. PLOS ONE 12, e0189838 (2017).

24. Coll, F. et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nature Communications 5, 4812 (2014).

25. Ajawatanawong, P. et al. A novel Ancestral Beijing sublineage of Mycobacterium tuberculosis suggests the transition site to Modern Beijing sublineages. Scientific Reports 9, 1–12 (2019).

26. World Health Organization. Global tuberculosis report 2009 (2009).

27. Murray, G. G. R. et al. The effect of genetic structure on molecular dating and tests for temporal signal. Methods in Ecology and Evolution 7, 80–89 (2016).

28. Drummond, A. J., Pybus, O. G., Rambaut, A., Forsberg, R. & Rodrigo, A. G. Measurably evolving populations. Trends in Ecology & Evolution 18, 481–488 (2003).

29. Biek, R., Pybus, O. G., Lloyd-Smith, J. O. & Didelot, X. Measurably evolving pathogens in the genomic era. Trends in Ecology & Evolution 30, 306–313 (2015).

30. Didelot, X., Croucher, N. J., Bentley, S. D., Harris, S. R. & Wilson, D. J. Bayesian inference of ancestral dates on bacterial phylogenetic trees. Nucleic Acids Research 46, e134–e134 (2018).

31. Menardo, F., Duchêne, S., Brites, D. & Gagneux, S. The molecular clock of Mycobacterium tuberculosis. PLOS Pathogens 15, e1008067 (2019).

32. Coll, F. et al. Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome Medicine 7, 51 (2015).

33. Keshavjee, S. & Farmer, P. E. Tuberculosis, Drug Resistance, and the History of Modern Medicine. New England Journal of Medicine 367, 931–936 (2012).

34. Grandjean, L. et al. Transmission of Multidrug-Resistant and Drug-Susceptible Tuberculosis within Households: A Prospective Cohort Study. PLOS Medicine 12, e1001843 (2015).

35. Rengarajan, J., Bloom, B. R. & Rubin, E. J. From The Cover: Genome-wide requirements for Mycobacterium tuberculosis adaptation and survival in macrophages. Proceedings of the National Academy of Sciences 102, 8327–8332 (2005).

36. Tufariello, J. M. et al. Separable roles for Mycobacterium tuberculosis ESX-3 effectors in iron acquisition and virulence. Proceedings of the National Academy of Sciences 113, E348–E357 (2016).

37. Van der Heijden, Y. F. et al. Isoniazid-monoresistant tuberculosis is associated with poor treatment outcomes in Durban, South Africa. The International Journal of Tuberculosis and Lung Disease 21, 670–676 (2017).

38. Gagneux, S. et al. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences 103, 2869–2873 (2006).

39. Tsolaki, A. G. et al. Functional and evolutionary genomics of Mycobacterium tuberculosis: Insights from genomic deletions in 100 strains. Proceedings of the National Academy of Sciences 101, 4865–4870 (2004).

40. Mohanty, S. et al. Mycobacterium tuberculosis EsxO (Rv2346c) promotes bacillary survival by inducing oxidative stress mediated genomic instability in macrophages. Tuberculosis 96, 44–57 (2016).

41. Ritacco, V. et al. Mycobacterium tuberculosis strains of the Beijing genotype are rarely observed in tuberculosis patients in South America. Memorias do Instituto Oswaldo Cruz 103, 489–492 (2008).

42. Moore, D. A. et al. Microscopic-Observation Drug-Susceptibility Assay for the Diagnosis of TB. New England Journal of Medicine 355, 1539–1550 (2006).

43. Caviedes, L. et al. Rapid, Efficient Detection and Drug Susceptibility Testing of Mycobacterium tuberculosis in Sputum by Microscopic Observation of Broth Cultures. Journal of Clinical Microbiology 38, 1203–1208 (2000).

44. The CRyPTIC Consortium; 100000 Genomes Project. Prediction of Susceptibility to First-Line Tuberculosis Drugs by DNA Sequencing. New England Journal of Medicine 379, 1403–1415 (2018).

45. Ugarte-Gil, C. et al. Diabetes mellitus among pulmonary tuberculosis patients from 4 tuberculosis-endemic countries: The tandem study. Clinical Infectious Diseases 70, 780–788 (2020).

46. Clinical and Laboratory Standards Institute (CLSI). Susceptibility Testing of Mycobacteria, Nocardiae, and Other Aerobic Actinomycetes; Approved Standard—Second Edition tech. rep. (2011).

47. Andrews, S. & Babraham Bioinformatics. FastQC: A quality control tool for high throughput sequence data. Manual (2010).

48. Bankevich, A. et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of 19, 455–477 (2012).

49. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

50. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

51. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

52. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

53. Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools in Conference on Human Factors in Computing Systems - Proceedings (2016).

54. McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 1297–1303 (2010).

55. Cornish-Bowden, A. Nomenclature for incompletely specified bases in nucleic acid sequences: Recommendations 1984. Nucleic Acids Research 13, 3021–3030 (1985).

56. Cole, S. T. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544 (1998).

57. Phelan, J. E. et al. Recombination in pe/ppe genes contributes to genetic variation in Mycobacterium tuberculosis lineages. BMC Genomics 17, 151 (2016).

58. Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).

59. Rieux, A. & Balloux, F. Inferences from tip-calibrated phylogenies: a review and a practical guide. Molecular Ecology 25, 1911–1924 (2016).

60. Duchêne, S., Duchêne, D., Holmes, E. C. & Ho, S. Y. The Performance of the Date-Randomization Test in Phylogenetic Analyses of Time-Structured Virus Data. Molecular Biology and Evolution 32, 1895–1906 (2015).

61. Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289–290 (2004).

62. Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).

63. Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model (Springer, 2000).

Acknowledgements

LG was supported by the Wellcome Trust (201470/Z/16/Z), the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number 1R01AI146338 and by the GOSH/ICH Biomedical Research Centre. OMK was supported by the Imperial Biomedical Research Centre (NIHR Imperial BRC, grant P45058). We thank the CRyPTIC project and the Tandem project for making whole-genome data available in the public domain. All authors acknowledge UCL Computer Science Technical Support Group (TSG) and the UCL Department of Computer Science High Performance Computing Cluster.

Author Contributions

ATO, LG, XD, OMK, FB, JRV, RG, CB, and DM conceived and designed the study. JC performed and supervised tuberculosis cultures, DNA extraction, laboratory, and sequencing work. LG, XD, FB, and OMK jointly supervised the research. ATO, LG, FB, and XD performed and advised on computational analyses. ATO and LG wrote the manuscript with input from all co-authors. All authors read and approved the final manuscript.

Competing interests

The authors declare no competing financial interests.

Ethics declarations

Ethical approval was obtained for all individual studies from which these data were derived.

Figures

[Figure 1 image not reproduced: panels a-d. Panel b y-axis: substitution rate (substitutions/genome/year) for lineage 4 and lineage 2. Panels c and d x-axis: year.]

Figure 1: Phylogenetic analysis of 3134 Mycobacterium tuberculosis isolates from Lima, Peru. Colors represent different lineages and sublineages. (a) Maximum likelihood phylogeny. Scale in number of substitutions per genome. (b) Posterior density for the inferred substitution rate in substitutions per genome per year for lineage 4 (blue) and lineage 2 (red). (c) Time-calibrated phylogeny of lineage 4. (d) Time-calibrated phylogeny of lineage 2.

[Figure 2 image not reproduced: upper panel "Tree estimation of TB drug resistance emergence" (rifampicin, isoniazid, ethambutol, pyrazinamide; lineage 4 and lineage 2) and lower panel "Distribution of new TB compounds". X-axis: year, 1900 to 2020.]

Figure 2: Inferred posterior density distribution of the earliest occurrence of resistance to first line antituberculous drugs. Posterior density distribution inferred using a time-calibrated phylogeny for both lineage 4 and lineage 2. Arrows represent the approximate time of antibiotic distribution.

[Figure 3 image not reproduced: panels (a) Lima, Peru, (b) Samara, Russia and (c) Lima, Peru by sublineage. Each panel shows a hazard ratio plot above a Kaplan-Meier curve with numbers at risk. X-axis: time in years (a, c) or branch length (b). Y-axis: probability of remaining susceptible.]

Figure 3: Hazard ratio and Kaplan-Meier curve for different sublineages of Mycobacterium tuberculosis. (a-c) Top: Hazard ratio (HR). Bottom: Kaplan-Meier curve and numbers at risk. The y-axis represents the probability of remaining susceptible to any antibiotic, while the x-axis shows the time in years or the distance in branch length. (a) HR of lineage 2 compared to lineage 4 in the Peruvian dataset (HR 3.36, 95% CI 2.10 - 5.38, likelihood ratio test p-value = $4.25 \times 10^{-7}$) and the Kaplan-Meier curves for lineage 2 and lineage 4 (p < 0.0001). (b) Same metrics for the Samara dataset (HR 4.82, 95% CI 3.74 - 6.21, likelihood ratio test p-value < $10^{-15}$; Kaplan-Meier curve p < 0.0001). (c) HR between lineage 2 and the different sublineages of lineage 4 found in the Peruvian dataset (LAM9, LAM3, LAM11, Haarlem, X type and T type), using LAM3 as a reference. Statistical significance of the hazard ratio differences is presented next to the CI bars (*, p<0.05; **, p<0.01; ***, p<0.001).

[Figure 4 image not reproduced: panels (a) Lima, Peru and (b) Samara, Russia. Each panel shows a hazard ratio plot (sensitive versus isoniazid background) above a Kaplan-Meier curve with numbers at risk. X-axis: time in years (a) or branch length (b).]

Figure 4: Hazard ratio and Kaplan-Meier curve for rifampicin acquisition. (a-b) Top: Hazard ratio (HR). Bottom: Kaplan-Meier curve and numbers at risk. The y-axis represents the probability of remaining susceptible to rifampicin, while the x-axis shows the time in years or the distance in branch length. (a) Risk of acquiring rifampicin resistance from an already isoniazid mono-resistant background compared to a drug susceptible one (HR 15.12, 95% CI 10.54 - 21.69, likelihood ratio test p-value < $10^{-15}$) and the Kaplan-Meier curves for the different backgrounds (p < 0.0001). (b) Same metrics for the Samara dataset (HR 37.28, 95% CI 18.81 - 73.88, likelihood ratio test p-value < $10^{-15}$; Kaplan-Meier curve p < 0.0001). Statistical significance of the hazard ratio differences is presented next to the CI bars (*, p<0.05; **, p<0.01; ***, p<0.001).

[Figure 5 image not reproduced: panel a, Manhattan plot with -log10(p-value) against genome position (Mb), hits labelled lppP, esxO and intergenic, colored by hazard ratio; panel b, Kaplan-Meier curve with numbers at risk for WT and lppP c.2604157_2604165del, probability of remaining susceptible against time (years), p < 0.0001.]

Figure 5: Genome-wide association study (GWAS) results. (a) Manhattan plot for GWAS on increased risk of drug resistance acquisition in lineage 4. The red line represents the Bonferroni corrected p-value threshold of $3.37 \times 10^{-5}$. Labels show the genes where the significant hits are located. Colors indicate the hazard ratio, with a scale of blues representing hazard ratios lower than 1 and a scale of reds for hazard ratios higher than 1. (b) Kaplan-Meier curve and numbers at risk of a 9bp deletion in the gene lppP comparing the probability of remaining susceptible between those nodes without the deletion (blue) and those with it (red).

Tables

Table 1: Population structure.

Clade              Name            Number
Lineage 1          Indo-Oceanic         1
Lineage 2          East-Asian         327
Lineage 2.2.1      Beijing            319
Lineage 2.2.1.1    Beijing              7
Lineage 2.2.2      Beijing              1
Lineage 4          Euro-American     2807
Lineage 4.1.1      X Type             271
Lineage 4.1.2.1    Haarlem            935
Lineage 4.2.2      TUR                  3
Lineage 4.3.1      LAM                  5
Lineage 4.3.2      LAM3               328
Lineage 4.3.3      LAM9               676
Lineage 4.3.4      LAM11              226
Lineage 4.4        -                   51
Lineage 4.5        T Type               4
Lineage 4.7        T Type              31
Lineage 4.8        T Type              86
Lineage 4.9        T Type              12
Lineage 4 ^1       T Type             179

Lineages and sublineages defined using clade specific SNPs [24].
^1 Clade name assigned phylogenetically.

Table 2: First emergence of drug resistance conferring mutations in Lima, Peru.

Drug   Lineage     Gene   Year                         Mutation
RIF    Lineage 4   rpoB   1951.7 [1931.7 - 1970.7]     p.Ser450Leu
RIF    Lineage 2   rpoB   1974.4 [1953.2 - 1986.8]     p.Ser450Leu
INH    Lineage 4   katG   1941.6 [1913.6 - 1959.7]     p.Ser315Thr
INH    Lineage 2   katG   1957.5 [1928.3 - 1977.4]     p.Ser315Thr
ETH    Lineage 4   embB   1973.7 [1967.8 - 1980.8]     p.Gly406Ala
ETH    Lineage 2   embB   1983.1 [1967.0 - 1994.1]     p.Met306Val
PZA    Lineage 4   pncA   1962.4 [1942.6 - 1972.2]     p.His51Arg
PZA    Lineage 2   pncA   2002.1 [1997.0 - 2005.5]     c.-11A>C
STR    Lineage 4   rpsL   1974.5 [1955.9 - 1986.2]     p.Lys43Arg
STR    Lineage 2   rpsL   1958.1 [1934.6 - 1975.1]     p.Lys43Arg

Emergence of drug resistance conferring mutations for the 5 antibiotics historically used as first line drugs for the treatment of tuberculosis. RIF: rifampicin; INH: isoniazid; ETH: ethambutol; PZA: pyrazinamide; STR: streptomycin. Year presented as a point estimate with the highest posterior density interval in brackets.

Supplementary material

[Supplementary Figure 1 image not reproduced: bar chart of the number of samples per sampling year, 1999 to 2016, colored by project (CRyPTIC, Household, Other, Population, Tandem).]

Supplementary Figure 1: Sample cohort. Temporal distribution of the 3134 samples included in the study.

[Supplementary Figure 2 image not reproduced: panels a and b, phylogenies colored by lineage and sublineage.]

Supplementary Figure 2: Maximum likelihood phylogeny of the Samara (Russia) data set. Phylogeny of the Samara data set for (a) lineage 4 and (b) lineage 2.

[Supplementary Figure 3 image not reproduced: root-to-tip regression (left; root-to-tip distance against sampling date) and date randomization test (right; substitution rate) for lineage 2 (top) and lineage 4 (bottom). Regression annotations: lineage 2, rate = 1.95e-01, MRCA = 304.49, R2 = 1.46e-03; lineage 4, rate = 5.78e-01, MRCA = 1290.02, R2 = 1.01e-03.]

Supplementary Figure 3: Temporal signal analysis. The presence of measurable evolution (temporal signal) within lineage 2 (top) and lineage 4 (bottom) was tested by a root-to-tip regression (left) and by a date randomization test (right). In the root-to-tip regression, the distance from the tips to the root of the phylogenetic tree (y-axis) is plotted against the sampling dates (x-axis). The mutation rate and the time of the most recent common ancestor (MRCA) are calculated from the linear regression. The date randomization test is performed by comparing the substitution rate of the original dataset (black) with a set of randomized data sets obtained by permuting the sampling dates (grey). Error bars represent 90% confidence intervals. The date randomization test is passed if the estimates from the original data do not overlap with those from the randomized sets.

[Supplementary Figure 4 image not reproduced: MCMC trace plots for lineage 2 (left) and lineage 4 (right). Panels: posterior, likelihood, prior, date of root, clock rate, coalescent time unit and relaxation parameter. X-axis: sampled iterations.]

Supplementary Figure 4: BactDating MCMC chain convergence. BactDating trace output for lineage 2 and lineage 4 in three independent MCMC chains per dataset. Parameters shown are posterior probabilities, likelihood, prior probabilities, date of MRCA, substitution rates, coalescent time unit and relaxation parameters.

[Supplementary Figure 5 image not reproduced: panels a-e, scaled Schoenfeld residuals (Beta(t)) against time. Schoenfeld individual test p-values: (a) 0.634, (b) 0.663, (c) 0.297, (d) 0.936, (e) 0.744.]

Supplementary Figure 5: Scaled Schoenfeld residuals. (a-e) Schoenfeld residuals plotted against time. In all cases, the association between residuals and time is non-significant, which supports the proportional hazards assumption. Test and plots modified from the survminer R package. (a) Residuals for lineage 2 and lineage 4 in Peru. (b) Residuals for lineage 2 and several sublineages of lineage 4 in Peru. (c) Residuals for sensitive and isoniazid mono-resistant background in Peru. (d) Residuals for lineage 2 and lineage 4 in Samara. (e) Residuals for sensitive and isoniazid mono-resistant background in Samara.

[Supplementary Figure 6 image not reproduced: panels a and b, hazard ratio; panels c and d, probability of remaining susceptible against time (years) for the maximum likelihood tree and the bootstrap replicates of lineage 4 (L4) and lineage 2 (L2).]

Supplementary Figure 6: Hazard ratio and Kaplan-Meier curve of 100 phylogenetic bootstrap replicates. 100 phylogenetic bootstrap replicates were performed in parallel to the maximum likelihood phylogeny. Blue represents the lineage 4 phylogeny and red the lineage 2 phylogeny. No statistical differences are found between the results obtained using the maximum likelihood tree and those using the phylogenetic bootstrap replicates. (a, b) Hazard ratio for the maximum likelihood tree (dark point) and 100 phylogenetic bootstrap replicates (light colored points). (c, d) Kaplan-Meier curve for the maximum likelihood tree (dark solid lines) with the 95% CI (dark dotted lines), and 100 phylogenetic bootstrap replicates (light colored lines).

[Supplementary Figure 7 image not reproduced: QQ plot of observed against expected -log10(p) for raw and corrected p-values. Annotated genomic inflation factors: λ = 4.27 and λ = 1.21.]

Supplementary Figure 7: Population correction for genome-wide association study (GWAS). Quantile-quantile (QQ) plot for the raw p-values (grey) and the p-values corrected for population structure (black). The red line indicates the null hypothesis of uniformly distributed p-values. Lambda represents the genomic inflation factor.
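The genomic inflation factor mentioned in this caption is commonly computed as the median of the observed test statistics divided by the median expected under the null. The sketch below assumes that definition with a 1-degree-of-freedom chi-square statistic, which the text does not spell out; the function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Genomic inflation factor (lambda) from p-values in the interval (0, 1].

    Each p-value is converted to a 1-degree-of-freedom chi-square statistic
    (the square of the matching two-sided normal quantile) and the median is
    divided by the median of that distribution under the null (about 0.455).
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median
```

Uniformly distributed p-values give a lambda close to 1, while p-values skewed towards zero, as expected under uncorrected population structure, give values above 1.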
