A genomic and transcriptomic study of lineage-specific variation in

Graham David Rose

Thesis submitted for the degree of Doctor of Philosophy

2013

MRC National Institute for Medical Research

Declaration

I, Graham David Rose, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Signed………………………………………….Date……………………………………..

The thesis work was conducted from September 2009 to March 2013 at the MRC National Institute for Medical Research (NIMR), London, UK, under the supervision of Douglas Young (NIMR, London) and Sebastien Gagneux (Swiss Tropical and Public Health Institute, Switzerland).

Abstract

Human tuberculosis (TB) is caused by several closely related species of bacteria collectively known as the Mycobacterium tuberculosis complex (MTBC). In this thesis the identification and effect of lineage-specific genetic variation within the phylogenetic lineages of the MTBC were investigated using a combination of computational methods and high-throughput sequencing technology.

Genome sequencing has now identified an extensive repertoire of single nucleotide polymorphisms (SNPs) amongst clinical isolates of the MTBC. Comparative analysis focused on the detection of all lineage-specific SNPs, providing the first glimpse of the total SNP diversity that separates the main phylogenetic lineages from each other. Bioinformatic analysis focused on SNPs more likely to contribute to functional diversity, which predicted nearly half of all SNPs in the MTBC to have functional consequences, while SNPs within regulatory proteins were over-represented. To determine whether these and other lineage-specific SNPs lead to phenotypic diversity, genome datasets were integrated with RNA-sequencing to assess their impact on the comparative transcriptome profiles of strains belonging to two MTBC lineages. Analysing the transcriptomes in the light of the underlying genetic variation found clear correlations between genotype and transcriptional phenotype. These arose by three mechanisms. First, lineage-specific changes in amino acid sequence of transcriptional regulators were associated with alterations in their ability to control gene expression. Second, changes in nucleotide sequence were associated with alteration of promoter activity and generation of novel transcriptional start sites in intergenic regions and within coding sequences. Finally, genes showing lineage-specific patterns of differential expression not linked directly to primary mutations were characterised by a striking over-representation of toxin-antitoxin pairs.

Acknowledgements

This thesis would not have been possible without the efforts of my colleagues and friends. Firstly I would like to thank my PhD supervisors Sebastien Gagneux and Douglas Young for their support and guidance throughout my project, providing me with their invaluable depth of knowledge and resources. Of special note were the annual Gagneux group retreats in Charmey and Les Diablerets, which always provided a healthy mix of stimulating scientific discussions about my projects and great food, including of course the meringue et la crème double. I am grateful to my three thesis supervisors, Delmiro Fernandez-Reyes, Roger Buxton and Seb, who were a great help in contextualising my ideas and providing a focus. My thesis relied heavily on sequence data, and as such I thank Abdul Sesay and the rest of the High Throughput Sequencing group at NIMR for performing the Illumina sequencing. Next I would like to thank Iñaki Comas, who was always happy to answer my questions on evolutionary theory and phylogenomics, and provide more general daily support on all things computational. I also thank the other original member of the Gagneux group at NIMR, Sonia Borrell, particularly for her help in getting me up and running in the lab at the start, and then the current members of Douglas Young’s group, including Kristine Arnvig, for her guidance on the RNA side of my project, and Steve Coade, who was my Biosafety Containment Level 3 trainer for the first six months of my PhD. My time at NIMR would not have been as enjoyable without my colleagues and friends Christina Kahramanoglou and Teresa Cortés Méndez, and to Teresa, I am indebted to you for your support in keeping me focused and all things in perspective during the final few months. I apologise that, despite your efforts and the past efforts of the Spanish contingent of the group, my vocabulary is still quite limited in your language. One day!
Of course I am grateful to my parents, who provided me with their untiring support to undertake my studies throughout the years, and to my brother Phil for his advice and the countless Sunday lunches in Balham. Finally I am grateful to the Medical Research Council (MRC) for their funding, who supported not only my university costs and living expenses for the last three and a half years, but the research of many of my colleagues as well. Thank you.

Contents

Declaration
Abstract
Acknowledgements
List of Figures
List of Tables
Glossary

Chapter 1 Introduction
1.1 The genus Mycobacterium
1.1.1 Taxonomy
1.1.2 The Mycobacterium tuberculosis complex (MTBC)
1.1.3 TB disease in humans
1.1.4 Disease diversity
1.2 Genetic diversity in the MTBC
1.2.1 General features of the M. tuberculosis genome
1.2.2 Typing the MTBC
1.2.3 The phylogenetic lineages of the MTBC
1.2.4 Origin of the MTBC
1.2.5 Selective pressures acting within the MTBC
1.3 Phenotypic diversity
1.3.1 Laboratory strains
1.3.2 Clinical strain phenotype
1.4 Linking genotype to phenotype
1.4.1 In silico prediction of functional SNPs
1.4.2 Gene expression diversity
1.4.3 High throughput DNA sequencing technology
1.5 Thesis Outline

Chapter 2 Materials and Methods
2.1 General microbiological methods
2.1.1 Containment 3 laboratory
2.1.2 General chemicals and reagents
2.1.3 Bacterial culture and storage
2.1.4 Growth curves
2.2 Molecular biology techniques
2.2.1 Genomic DNA extraction
2.2.2 RNA Isolation and handling
2.2.3 Quantification of DNA and RNA by Nanodrop
2.2.4 Determination of DNA and RNA integrity by micro fluidics
2.2.5 Removal of DNA contamination from RNA samples
2.2.6 Polymerase chain reaction (PCR)
2.3 Materials
2.3.1 Mycobacterium tuberculosis strains
2.4 DNA-seq
2.5 RNA-seq
2.5.1 Strand specific RNA-seq libraries
2.5.2 TSS 5’ enriched RNA-seq libraries
2.6 Illumina sequencing DNA (genome) and cDNA (RNA-seq) libraries
2.7 Quantitative RT-PCR
2.7.1 Primer sequences
2.8 MTBC annotation datasets
2.8.1 Coding sequence annotations
2.8.2 Functional Categories
2.8.3 Essential M. tuberculosis genes
2.9 Bioinformatics software
2.9.1 Artemis
2.9.2 Quality control of raw RNA-sequencing data
2.9.3 Transcriptome mapping software
2.9.4 Calculation of mapped read frequencies per feature region
2.9.5 R
2.9.6 Perl scripts
2.9.7 Graph pad prism 5.0

Chapter 3 Lineage-specific SNPs
3.1 Introduction
3.1.1 Aims
3.2 Materials and Methods
3.2.1 Genome collection used in study
3.2.2 Genome sequencing
3.2.3 Mapping genome sequences
3.2.4 Phylogenetic analysis
3.2.5 Categorising SNPs
3.2.6 dN/dS calculation
3.3 Results
3.3.1 A globally representative 28-genome human-adapted MTBC phylogeny
3.3.2 Identification of all lineage-specific SNPs
3.3.3 Distribution of SNPs
3.3.4 Monomorphic population structure and homoplasic SNPs
3.3.5 Creation of pseudogenes
3.3.6 SNPs within genes associated with antibiotic resistance
3.3.7 Conservation and removal of lineage-specific nonsynonymous SNPs
3.4 Discussion
3.4.1 Strengths and limitations of this study
3.4.2 General characteristics of lineage-specific diversity
3.4.3 Insights into the evolution of M. tuberculosis lineages

Chapter 4 In silico prediction of functional Single Nucleotide Polymorphisms
4.1 Introduction
4.1.1 Aims
4.2 Materials and Methods
4.2.1 SIFT
4.2.2 Indels
4.2.3 Homology modelling
4.2.4 Change in protein stability
4.3 Results
4.3.1 Predicting functional SNPs within control set
4.3.2 Predicted functional nonsynonymous SNPs
4.3.3 Impact of nonsynonymous SNPs outside of the human adapted MTBC
4.3.4 Clustering of functional SNPs
4.3.5 Functional category analysis of functional SNPs
4.3.6 Functional impairment of Lineage 1 and 2 regulatory proteins
4.4 Discussion
4.4.1 Strengths and limitations of the study
4.4.2 Validation of the SIFT method
4.4.3 Half of lineage-specific SNPs are predicted to have functional consequences

Chapter 5 Screening the effect of lineage-specific variation by sequence-based transcriptional profiling
5.1 Introduction
5.1.1 Aims
5.2 Methods
5.2.1 Clinical isolates in study
5.2.2 Cluster analysis
5.2.3 Differential expression analysis
5.2.4 Transcriptional Start Site (TSS) calling
5.3 Results
5.3.1 Growth rate in vitro
5.3.2 RNA isolation and Illumina ready libraries
5.3.3 Transcriptome sequencing
5.3.4 Mapping reads to the H37Rv genome
5.3.5 Identifying strain specific gene deletions
5.3.6 Clustering of strains at the total sample level
5.3.7 Clustering of strains by antisense expression
5.3.8 Testing for differential expression in RNA-seq data
5.3.9 Lineage-specific gene expression
5.3.10 Enrichment of toxin-antitoxins
5.4 Discussion
5.4.1 Strengths and limitations of the study
5.4.2 Lineage-specific expression
5.4.3 Linking genotype to phenotypic at the transcriptional level

Chapter 6 Final discussion

References

Appendices A-G
Appendix A. genomeDeletions.pl
Appendix B. Lineage-specific SNPs
Appendix C. Lineage-specific SNPs within drug resistance associated genes
Appendix D. Nonsynonymous/synonymous SNP ratio
Appendix E. RNA-seq differential expression
Appendix F. Functional categories
Appendix G. Publications

List of Figures

Figure 1.1. Phylogenetic structure of the genus Mycobacterium
Figure 1.2. The most complete phylogeny of the human adapted MTBC
Figure 1.3. Distribution of the MTBC lineages globally
Figure 1.4. The number of MTBC genome sequences in the Short Read Archive
Figure 3.1. Neighbour-joining phylogeny for 28 human-adapted MTBC genomes
Figure 3.2. Within-lineage SNP diversity
Figure 3.3. Isolating lineage-specific SNPs from the phylogeny
Figure 3.4. Distribution of the lineage-specific SNPs across the genome
Figure 3.5. The average number of non-coding and coding lineage-specific SNPs
Figure 3.6. Distribution of lineage SNPs per gene
Figure 3.7. Homoplasic lineage SNPs
Figure 3.8. Change in protein length due to nonsense SNPs
Figure 3.9. Gene creation by nonsense SNPs
Figure 3.10. Lineage-specific SNPs within genes associated with drug resistance
Figure 3.11. The rate of nonsynonymous SNP accumulation by functional category
Figure 4.1. SIFT database phylogeny
Figure 4.2. SIFT predictions
Figure 4.3. Distribution of predicted functional SNPs per gene
Figure 4.4. Frequency distribution of predicted functional SNPs across genome
Figure 4.5. Functional category representation
Figure 4.6. Predicted loss of function of virS transcriptional regulator in Lineage 1
Figure 4.7. Spectrum of functional SNPs
Figure 5.1. Strains sequenced in RNA-seq study
Figure 5.2. In vitro growth curves
Figure 5.3. Quality control of RNA-seq samples by Bioanalyser
Figure 5.4. Distribution of quality scores for strain N0145
Figure 5.5. Circular plot of mapped RNA-seq data
Figure 5.6. Representation of transcriptome plot based on Artemis
Figure 5.7. Distribution of gene deletions in the six RNA-seq study strains
Figure 5.8. Distribution of gene deletions grouped by gene function category
Figure 5.9. Unsupervised hierarchical clustering of total gene expression
Figure 5.10. Relationship of genotypic to transcriptomic diversity
Figure 5.11. Correlation of SNP distance to gene expression
Figure 5.12. Unsupervised hierarchical clustering of total antisense expression
Figure 5.13. Venn diagram comparing differential expression methods
Figure 5.14. Heatmap of 112 differentially expressed genes
Figure 5.15. Differential expression of divergently regulated genes
Figure 5.16. Heat map of dosR regulon
Figure 5.17. Duplication of dosR region
Figure 5.18. DosR regulon and SNP-associated TSS
Figure 5.19. SNP-associated TSS leading to differential gene expression
Figure 5.20. SNP-associated TSS leading to differential antisense expression
Figure 5.21. Over-representation of differentially expressed toxin-antitoxins
Figure 5.22. Validation of select RNA-seq differentially expressed toxin-antitoxins
Figure 5.23. Rates of the types of nucleotide mutations across

List of Tables

Table 2.1. Primer sequences used in the qRT-PCR study
Table 3.1. Twenty eight strains used in this study
Table 3.2. Estimates of evolutionary divergence between strains
Table 3.3. Summary of lineage-specific SNPs
Table 3.4. Homoplasic nucleotide positions within the lineage branches
Table 3.5. Variable genomic positions within the lineages
Table 3.6. Nonsense SNPs
Table 3.7. Nonsense SNPs by lineage
Table 3.8. Nonsense SNPs grouped by functional category
Table 3.9. Mutations found in drug resistance studies associated with drug resistance
Table 3.10. The rate of nonsynonymous SNP accumulation across the lineages
Table 3.11. The rate of nonsynonymous SNP accumulation by functional category
Table 4.1. SIFT database of non-MTBC species
Table 4.2. Predicted tolerated and functional SNPs using SIFT
Table 4.3. Functional category representation
Table 4.4. Transcriptional regulators with predicted functional mutations
Table 4.5. Regulatory proteins with predicted functional mutations in Lineage 1 and 2
Table 5.1. Lineage 1 and 2 strain used in the RNA-seq study
Table 5.2. Additional strains used in growth curve experiment
Table 5.3. Additional strains used in qRT-PCR confirmation
Table 5.4. In vitro growth rates
Table 5.5. Details of exponential phase transcriptomes used in differential expression analysis
Table 5.6. Transcriptomes used in TSS mapping
Table 5.7. Differential expression associated with lineage-specific amino acid mutations SNPs
Table 5.8. Ten differentially expressed genes associated with a change in promoter sequences
Table 5.9. Nine differentially expressed antisense associated with introduction of SNP-associated TSS
Table 5.10. Ten differentially expressed toxin-antitoxins (TA)

Glossary

∆∆G change in Gibbs free energy
-10 Pribnow box
CCAL creative commons attribution license
cDNA complementary DNA
dt doubling time
DNA deoxyribonucleic acid
DNA-seq DNA-sequencing
g gram
GA Genome Analyser
Gb gigabase
HS HiSeq2000
HTH helix-turn-helix
indel insertion/deletion
LSP large sequence polymorphism
Mb megabase
mg milligram
ml millilitre
MLSA multilocus sequence analysis
mRNA messenger RNA
MTBC Mycobacterium tuberculosis complex
nt nucleotide
OD optical density
PCR polymerase chain reaction
PDB protein data bank
PE proline-glutamic acid
PPE proline-proline-glutamic acid
PGRS polymorphic glycine rich sequence
qRT-PCR quantitative realtime-PCR
RD region of difference
RNA ribonucleic acid
RNA-seq RNA-sequencing
RPKM reads per kilobase per million mapped reads
rRNA ribosomal RNA
sd standard deviation
SNP single nucleotide polymorphism
SEM standard error of the mean
sRNA small RNA
TA toxin-antitoxin
TSS transcriptional start site
µg microgram
µl microlitre
UTR untranslated region
VST variance stabilising transformation
HGT horizontal gene transfer
TbD1 M. tuberculosis specific deletion 1
HMM Hidden Markov model
VCF variant call format
GTF gene transfer format
χ² chi-square test


Chapter 1 Introduction

Tuberculosis (TB) is caused by several closely related species of bacteria collectively known as the Mycobacterium tuberculosis complex (MTBC) (Cole et al., 1998). The most infamous member of the MTBC is the human-adapted pathogen Mycobacterium tuberculosis, the etiologic agent of human TB along with Mycobacterium africanum, a phylogenetic variant limited to West Africa (de Jong et al., 2010). Together these species are regarded as the human-adapted MTBC members. Today, TB is second only to HIV/AIDS as a cause of adult death from a single infectious disease, and is itself the greatest cause of mortality in those infected with HIV (WHO, 2012). It is estimated that nine million new TB cases and over one million deaths from TB currently occur each year (WHO, 2012). In addition to active cases of TB, two billion people have a latent infection, effectively acting as a reservoir of active TB cases for several decades to come (Barry et al., 2009).

Historically, TB is an ancient disease (Donoghue et al., 2004). Early cultural references date back to classical Greek times (Daniel, 1997), when Hippocrates used the term “phthisis” to describe active TB in individuals (Coar, 1982). Ancient M. tuberculosis DNA has been isolated from mummies found in Egypt (Nerlich et al., 1997) and South America (Salo et al., 1994). More recently, molecular genetics and the advent of sequencing technologies have facilitated more rigorous dating of M. tuberculosis and other MTBC members; low estimates range from 15,000 to 20,000 years (Sreevatsan et al., 1997a), but more recently 70,000 years or more has been suggested (Hershberg et al., 2008). TB has therefore been a burden on humans for a long time, possibly since the migration of modern humans out of Africa (Hershberg et al., 2008). Recent analyses of MTBC evolution, largely driven by the advances in sequencing technology (Loman et al., 2012), have revealed a global picture of human MTBC strain variation, consisting of six major phylogenetic lineages that display strong geographic structure (Gagneux & Small, 2007; Hershberg et al., 2008) and a rare seventh lineage recently discovered in the Horn of Africa (Firdessa et al., 2013). This has called into question prior assumptions that variation in the MTBC was negligible and of no clinical significance (Musser et al., 2000; Sreevatsan et al., 1997a), whilst bringing to the forefront the identification and potential effects of genetic variation, and the future trajectory of the disease (Comas & Gagneux, 2009; Hershberg et al., 2008; Homolka et al., 2010). New opportunities now exist to study how the evolution of the MTBC has resulted in functional consequences in the lineages of the MTBC at the definitive resolution: the level of DNA and RNA. It is these opportunities that shall be explored in this thesis.

1.1 The genus Mycobacterium

A genus of Actinobacteria, Mycobacteria are distinctive rod-shaped bacteria that are characterised by high GC content and complex lipid-rich cell walls (Madigan et al., 2003). This physical property of the cell wall was exploited in 1882 by Koch, who stained M. tuberculosis with alkaline methylene blue and used a Bismarck brown stain for the surrounding tissue (Ellis & Zabrowarny, 1993). In the same year the Ziehl-Neelsen stain was developed, which used a similar process to identify acid-fast bacteria, and is still used today to identify mycobacteria (Parish & Stoker, 2001).

1.1.1 Taxonomy

A working taxonomy for Mycobacteria was established 50 years ago, with original classifications based on growth rate, pigmentation and clinical significance (Stahl & Urbance, 1990). A fundamental division can be made based on growth rate, splitting Mycobacteria into two major groups, fast and slow growers. The fast growers include mainly opportunistic or non-pathogenic mycobacteria, such as , which can be cultured from dilute inocula within a week. In contrast, the slow growing species can take several weeks to show visible growth from dilute inocula. This group includes M. tuberculosis, Mycobacterium bovis and , the causative agents of human TB, bovine TB and , respectively. Modern molecular biology techniques based on 16S rRNA have revealed the macro population structure of mycobacteria (Gutierrez et al., 2005; Stahl & Urbance, 1990). The phylogenetic structure of mycobacteria based on this method is shown in Figure 1.1. Of note is the position of the MTBC together with the smooth tubercle bacilli, which include Mycobacterium canetti; it is hypothesised that the MTBC originated from an ancestral pool of smooth tubercle-like bacilli (Gutierrez et al., 2005; Supply et al., 2013).

Figure 1.1. Phylogenetic structure of the genus Mycobacterium. The neighbor-joining tree is based on 16S sequences from seventeen smooth mycobacterial and MTBC strains. The blue triangle indicates the MTBC. Bootstrap support higher than 90% shown on nodes. Scale bar is pairwise distances after Jukes-Cantor correction. Adapted from Gutierrez et al. (2005). Image reproduced under the Creative Commons Attribution License (CCAL).
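The Jukes-Cantor correction named in the caption converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The following is a minimal sketch in Python, assuming two pre-aligned nucleotide sequences; the function names and the toy sequences are illustrative only and are not taken from Gutierrez et al. (2005) or from this thesis.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy example (hypothetical sequences): one difference in eight sites.
p = p_distance("ACGTACGT", "ACGTACGA")
d = jukes_cantor(p)
```

For closely related sequences such as those within the MTBC, p is very small and the corrected distance is almost identical to the raw proportion; the correction matters more for the deeper branches of the genus.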


1.1.2 The Mycobacterium tuberculosis complex (MTBC)

The MTBC is used as an umbrella term to group the closely related mycobacteria that cause TB (Cole et al., 1998). Early sequencing of mycobacteria from the MTBC showed that they share more than 99.9% sequence identity (Sreevatsan et al., 1997a), as demonstrated by the collapsed branches in Figure 1.1 for the MTBC members. However, despite this close relatedness, members of the MTBC display different phenotypic characteristics and mammalian host ranges; as described above, MTBC members M. tuberculosis and M. africanum are the primary cause of TB in humans.

The MTBC includes several other species and sub-species that are adapted to various hosts, including both wild and domestic animal species; these bacterial variants have been referred to as “ecotypes” (Smith et al., 2006b). Here an ecotype is defined as a set of strains using the same or similar ecological resources (Cohan, 2002). The host of M. bovis is largely cattle, which is of considerable agricultural importance due to the associated cost of bovine TB, estimated globally at $3 billion per year (Garnier et al., 2003). M. bovis can also cause TB in humans through the consumption of unpasteurised milk (de la Rua-Domenech, 2006; Grange, 2001). Fortunately, modern food practices have effectively stopped this transmission route, and person-to-person transmission of M. bovis is rare (Evans et al., 2007; Grange, 2001). Other animal adapted pathogens include Mycobacterium microti (infects voles), Mycobacterium caprae (infects sheep and goats) and Mycobacterium pinnipedii (infects seals and sea lions). An MTBC pathogen of Dassies, or Rock Hyrax, has been isolated in South Africa and named the Dassie bacillus (Parsons et al., 2008), whilst more recently an MTBC pathogen of banded mongooses has been identified in Botswana and named Mycobacterium mungi (Alexander et al., 2010). It is anticipated that MTBC members of other ecotypes will be identified in future studies.

A special member of the MTBC is M. canetti, a rare tubercle bacillus with an unusual smooth colony phenotype, unlike the classical rough appearance of other MTBC members (van Soolingen et al., 1997). M. canetti and the other smooth TB bacilli harbor greater genetic diversity than the rest of the MTBC, and are more distantly related to the remaining MTBC than any two other MTBC strains are to each other (Gutierrez et al., 2005). M. canetti is consequently a common choice as an outgroup in phylogenetic analysis (Bentley et al., 2012; Comas et al., 2010). Horizontal recombination events are another feature of the M. canetti genome (Supply et al., 2013), which is in stark contrast to the rest of the MTBC, where no significant signs of recombination are seen (Hirsh et al., 2004; Supply et al., 2003).

1.1.3 TB disease in humans

M. tuberculosis and M. africanum, which together make up the human adapted members of the MTBC, are the etiological agents of TB in humans. TB infection in humans broadly follows an established pattern of events. Briefly, infectious bacilli are spread through droplet nuclei that can remain aerosolised for several hours. Following inhalation of the droplets the bacteria are phagocytosed by the host’s alveolar macrophages, which are then thought to invade the subtending epithelial layer of the lung (Russell et al., 2010); the infectious dose is estimated to be as low as a single bacterium. A primary site of infection is established, known as the Ghon focus, where a localised inflammatory response leads to recruitment of mononuclear cells from the neighboring blood vessels, providing fresh host cells for the bacteria to infect. The subsequent lesion, or granuloma, is a defining pathogenic feature of TB disease. Initially consisting of a mass of macrophages, neutrophils and monocytes, the granulomas eventually become stratified with recruitment of lymphocytes and develop a centre that is rich in lipids. At this stage an equilibrium with the host immune system is established in most individuals, which can persist from weeks to decades and is known as latent TB infection. In this latent state the host is asymptomatic and noninfectious. It is estimated that 95% of human-adapted MTBC infections follow this route into latency, an estimate based on evidence of immunological sensitisation by mycobacterial proteins in the absence of clinical signs and symptoms of active TB (Barry et al., 2009). In individuals with active TB, arising either from disease progression, which occurs in about 5% of cases, or from the reactivation of a latent infection, which is estimated to occur in 10% of HIV-negative individuals over a lifetime, the granuloma centre fills with caseous debris including necrotic macrophages. This ultimately ruptures and releases thousands of infectious bacilli into the lungs and respiratory airways (Kaplan et al., 2003). A persistent productive cough develops, effectively aerosolising and spreading the bacilli to new hosts, and it is this late stage of active TB that contributes to tissue damage and pathogenesis. Bacilli can also escape into other tissues via the lymphatic system and bloodstream, and this is known as miliary or extrapulmonary TB. Rapid progression to active TB from an initial infection is more common in infants and immunocompromised persons, whilst reactivation of latent TB can be triggered by immunosuppression, of which the greatest identified cause is HIV infection (Ho et al., 1995).


1.1.4 Disease diversity

Although TB is clinically divided into active and latent forms, it is likely that this is a gross oversimplification, with TB infection following a continuous spectrum that ranges from sterilising immunity, through subclinical active disease, to active disease (Barry et al., 2009). Development of active disease is likely determined by multiple factors, including the host genotype, environmental factors, and bacterial genetics. On the human genetics side, SNPs have been identified that determine susceptibility of an individual to TB using genome-wide linkage analysis (Bellamy et al., 2000). In addition to environmental influences, strain variation in the MTBC is now also thought to play a role in the outcome of TB infection and disease (Coscolla & Gagneux, 2010). The ability of MTBC strains to elicit an immune response was recently explored by Portevin et al., who used a monocyte-derived macrophage model to study the innate immune response to twenty-eight diverse clinical MTBC strains (Portevin et al., 2011). It was shown that macrophages infected with different strains differed in the levels of cytokines and chemokines produced; infections by a group of strains that belong to the modern phylogenetic lineages produced lower levels of pro-inflammatory cytokines than infections by strains from the ancient lineages (classification of modern and ancient lineages is discussed in detail below in section 1.2.3). Moving into a clinical setting, it has been shown that over the course of two years household contacts exposed to strains from the modern lineages were more likely to develop active disease than contacts exposed to strains from the ancient lineages (de Jong et al., 2008). Taking these findings together, Gagneux hypothesised that modern strains have developed an evolutionary strategy of increased virulence and shorter latency, possibly through adaptation to the expanding human population sizes of the past few hundred years, which have provided more hosts for the MTBC pathogen (Gagneux, 2012).
In summary, it is likely that multiple factors play an important role in disease, with a complex interaction between the host, pathogen and environment (Comas & Gagneux, 2009). This study focuses on the pathogen side, and the following section introduces the genetic diversity and lineages of the MTBC.


1.2 Genetic diversity in the MTBC

1.2.1 General features of the M. tuberculosis genome

A seminal moment in mycobacterial research was the genome sequencing of the first strain of M. tuberculosis in 1998 (Cole et al., 1998). A canonical strain of TB research, M. tuberculosis H37Rv, was chosen in 1993 to be the first MTBC strain sequenced, and the genome was closed and finished over the next five years. The single circular chromosome was shown to be 4,411,532 bp in length and to contain just over 4,000 protein coding genes. The annotated genome provided new insights into the biology and metabolism of the pathogen, with identification of large protein families related to fatty acid and polyketide biosynthesis, regulation, drug efflux pumps and transporters, and PE_PGRS proteins. PE_PGRS are a large duplicated family unique to the MTBC.

The genome is rich in repetitive DNA, such as IS6110 insertion sequences, and in multigene families and duplicated housekeeping genes (Cole et al., 1998). Sixteen copies of the IS6110 sequence and six copies of the more stable element IS1081 were found to reside within the genome of H37Rv. Because the number of IS6110 elements varies between strains, these elements were utilised in a DNA fingerprinting protocol that quickly evolved into the first international gold standard for genotyping of the MTBC (van Embden et al., 1993). Typing of the MTBC in the context of strain diversity is discussed in the following section.

1.2.2 Typing the MTBC

Members of the MTBC are considered genetically monomorphic, with a high level of genomic sequence similarity and negligible horizontal gene transfer (Hirsh et al., 2004; Liu et al., 2006). As such, the MTBC displays a classic clonal population structure and evolves by descent (Achtman, 2008), so that mutations in a parental strain become defining markers for all of its progeny. As a result, many genotyping tools that are useful in other species do not transfer effectively to the MTBC (Achtman, 2008; Comas et al., 2009). The development of tools to measure genetic variation in the MTBC was the first step towards a robust framework, needed firstly to quantify the genetic variation between strains before secondary questions, such as the effect of strain variation on TB disease, could be asked. Before discussing the lineages of the MTBC it is first necessary to give a brief history of MTBC typing and of the evolution of tools to measure genetic diversity in a robust and definitive manner.

As introduced above, the early 1990s saw the establishment of IS6110 restriction fragment length polymorphism (RFLP) typing as the gold standard for MTBC typing (van Embden et al., 1993). The method is based on strain differences in IS6110 copy number, which ranges from 0 to about 25, as well as on the variability in the chromosomal positions of the insertion sequences. Large collections were subsequently typed and the first families of strains with a common genotype were uncovered in the MTBC (Van Soolingen, 2001). Some strains were found at a higher frequency and across a wider geographic area than others, suggesting differential success in terms of infection and geographical spread (Van Soolingen, 2001). Although non-sequence-based tools, including the above RFLP technique and other methods such as Pulsed-Field Gel Electrophoresis (PFGE), are useful for typing monomorphic bacteria at the fine scale, they have many drawbacks, including problems of reproducibility between laboratories (Achtman, 2008).

The development of sequence-based tools such as spoligotyping and MIRU-VNTR has largely replaced RFLP typing, and these are currently the official gold standards for epidemiological typing of the MTBC (Supply et al., 2001). Spoligotyping is the mycobacterial name given to the clustered regularly interspaced short palindromic repeats (CRISPR) typing method, which is based on counting unique spacer regions between a series of direct repeats in the M. tuberculosis genome (Grissa et al., 2008). The second method, MIRU-VNTR (mycobacterial interspersed repetitive units variable number tandem repeats), classifies strains by comparing strain-specific numbers of repeats of short DNA sequences at various genomic positions (Lindstedt, 2005). Databases have been built around the results of typing tens of thousands of patient isolates with these methods, such as SpolDB4 (Brudey et al., 2006) and MIRU-VNTRplus (Weniger et al., 2010). Although spoligotyping and MIRU-VNTR have been invaluable from an epidemiological view, they are not ideal for studying evolutionary questions as they are susceptible to convergent evolution. Convergent evolution describes the identification of the same genotype in two strains that is not due to descent, and this reduces the robustness of derived phylogenies (Comas et al., 2009). This scenario arises because of the limited number of loci on which the methods are based. In a study by Comas et al. it was found that phylogenies built using either method had low discriminatory power and were incongruent with those based on a recent SNP-based typing method (Comas et al., 2009). It was therefore argued that for evolutionary studies the MTBC should be typed using robust SNP or large sequence polymorphism (LSP) markers (Comas et al., 2009).
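The effect of typing only a few loci can be illustrated with a minimal Python sketch. It treats a MIRU-VNTR profile as a tuple of repeat counts, one per locus, and reports pairs of strains whose profiles are identical although an independent SNP-based method assigns them to different lineages. The strain names, the five-locus profiles and the lineage labels are hypothetical and were chosen only for illustration; they are not data from any of the studies cited above.

```python
# Hypothetical MIRU-VNTR profiles: repeat counts at five loci per strain.
# Strain names, counts and lineage labels are illustrative assumptions only.
PROFILES = {
    "strain_A": (2, 3, 5, 1, 4),
    "strain_B": (2, 3, 5, 1, 4),
    "strain_C": (2, 4, 5, 2, 4),
}
# Lineage labels as they might be assigned by an independent SNP-based method.
SNP_LINEAGE = {
    "strain_A": "Lineage 2",
    "strain_B": "Lineage 4",
    "strain_C": "Lineage 2",
}


def locus_differences(profile_1, profile_2):
    """Count the loci at which two profiles have different repeat numbers."""
    if len(profile_1) != len(profile_2):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_1, profile_2))


def convergent_pairs(profiles, lineages):
    """Return strain pairs with identical profiles but different lineages."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            identical = locus_differences(profiles[first], profiles[second]) == 0
            if identical and lineages[first] != lineages[second]:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    print(convergent_pairs(PROFILES, SNP_LINEAGE))
```

In this toy example strain_A and strain_B cannot be told apart by their five-locus profiles even though the assumed SNP-based labels place them in different lineages, which is the kind of incongruence that a small number of loci makes possible.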

Typing the MTBC by LSPs, or genomic deletions, exploits the absence of horizontal gene transfer in the MTBC, which makes each deletion event unique and therefore a robust and informative phylogenetic marker. Whilst LSPs have been used to resolve the main lineages of the MTBC (Gagneux et al., 2006a; Reed et al., 2009), deletions are less abundant than SNPs and the markers used were also largely based on deletions found in the reference strain H37Rv, making SNPs the best choice for sampling MTBC diversity. To date numerous studies have utilised SNP markers to classify strains and explore the evolutionary history of the MTBC (Baker et al., 2004; Comas et al., 2010; Gagneux & Small, 2007; Hershberg et al., 2008). However, SNP analyses can suffer from the same problems as previous studies based on LSPs, such as using SNPs chosen on prior information, which can introduce a discovery bias, or simply using a non-representative set of strains. In 2008, Hershberg et al. used de novo sequencing of multiple genes from 108 global MTBC strains to identify novel SNPs and constructed the most complete phylogenetic tree of the MTBC (Hershberg et al., 2008). Subsequent whole genome sequencing of a smaller set of strains in 2010 defined the MTBC lineages at the highest possible resolution, the single nucleotide level (Comas et al., 2010).

1.2.3 The phylogenetic lineages of the MTBC

The global population structure of the MTBC is defined by six main phylogenetic lineages, named Lineage 1 to 6 (Comas et al., 2010), although these have also been described by their geographic distribution and by other naming schemes in previous studies (Filliol et al., 2003; Gagneux et al., 2006a; Hershberg et al., 2008). The largest phylogeny of global MTBC diversity is shown in Figure 1.2. Lineages are coloured based on previous deletion analysis in a global set of strains (Gagneux et al., 2006a), and the same colouring scheme is continued throughout this thesis. The phylogeny is based on a multilocus sequence analysis (MLSA) of SNPs identified from the sequencing of 89 genes in 108 MTBC strains (Hershberg et al., 2008). The MLSA also included seven animal-adapted strains, which were all shown to cluster within one of the M. africanum lineages (Lineage 6). Of special note is the Beijing sub-lineage of Lineage 2, which is of interest because of its association with multidrug resistance and recent expansion (Borrell & Gagneux, 2009); this is discussed further in section 1.3.2. In addition to strains clustering into six main lineages, two major groupings were observed, the “ancient” and “modern” lineages (Figure 1.2). Lineage 1 and the two M. africanum lineages are referred to as ancient as they branched off from a common ancestor at an early stage of evolution, whilst the remaining three modern lineages (Lineages 2, 3 and 4) diverged at a later time point. Previous studies classified MTBC strains into two groups based on the presence of a single genomic deletion known as TbD1 (Brosch et al., 2002), but here it was demonstrated that this separation involves more than a single deletion (Hershberg et al., 2008). TbD1 lies on the relatively long branch prior to the separation of Lineages 2, 3 and 4 shown in Figure 1.2, which thus represents more genetic variation between the ancient and modern lineages than had been suggested by TbD1 alone. As mentioned previously, a rare seventh MTBC lineage was recently identified, which has a phylogenetic location between the ancient and modern lineages in Figure 1.2, although the Lineage 7 branch point is before TbD1 (Firdessa et al., 2013); Lineage 7 was published in March 2013 and is therefore not discussed further in this thesis.

Strains used in the MLSA study were derived from a global collection of 875 strains from 80 countries that were previously characterised by genome wide deletion analysis (Gagneux et al., 2006a), and represent the broadest sample of genetic and geographic MTBC diversity to date. In the study by Gagneux et al. and following analyses, it was found that the MTBC diversity is highly geographically structured (Gagneux et al., 2006a; Hershberg et al., 2008). This is shown in Figure 1.3, where for example Lineage 4 is the dominant lineage in terms of geographical spread across the continents of Europe, America and Africa, whilst Lineage 2 is predominantly found in East Asia.

[Figure 1.2 tree labels. Ancient lineages: Lineage 1 (The Philippines, Rim of Indian Ocean); M. africanum Lineage 5 (West Africa 1); M. africanum Lineage 6 (West Africa 2). Modern lineages: Lineage 3 (India, East Africa); Lineage 2 (East Asia, including Beijing); Lineage 4 (Europe, America, Africa).]

Figure 1.2. The most complete phylogeny of the human adapted MTBC. Maximum Parsimony phylogeny of MTBC built using 89 concatenated gene sequences in 108 strains. The branches are colored according to the main lineages defined previously based on LSP deletion analysis (Gagneux et al., 2006a). Although not part of this study, the animal strains were part of the previous MLSA study and shown here for reference. Adapted from Hershberg et al. (2008). Image reproduced under the Creative Commons Attribution License (CCAL).

Figure 1.3. Distribution of the MTBC lineages globally. The six lineages display a strong geographic structure, with each dot representing the dominant lineage in each of the 80 countries represented in the strain collection. Adapted from Gagneux et al. (2006a) and Hershberg et al. (2008). Image reproduced under the CCAL.

1.2.4 Origin of the MTBC

Early estimates dated the MTBC to 15,000-20,000 years ago, when it was hypothesised that animal domestication during the Neolithic transition was the cause of TB in humans (Sreevatsan et al., 1997a). More recent estimates place the MTBC at 70,000 or more years old, linked with early human migrations out of Africa (Hershberg et al., 2008). It is interesting that the continent that harbours the greatest MTBC genetic diversity is Africa, with all six lineages represented (Figure 1.3). Based on their MLSA data, Hershberg et al. postulated that the MTBC originated in Africa and accompanied the Out-of-Africa migrations of modern humans approximately 70,000 years ago (Hershberg et al., 2008). In this evolutionary model it is suggested that the two ancient M. africanum lineages (Lineage 5 and 6) remained in Africa, whilst the other lineages spread with human migrations into Eurasia, with the three modern MTBC lineages seeding Europe, India and China. Expansions in human population over the last few centuries then led to the rapid expansion of these modern lineages (Gagneux, 2012). In 2010, Comas et al. generated the first whole-genome global phylogeny of human-adapted MTBC (Comas et al., 2010). This phylogeny resolved the lineages at much greater resolution than previous analyses, and demonstrated that the two M. africanum lineages are the most basal. These two lineages are found exclusively in West Africa (de Jong et al., 2010), and whilst the reason for this is unknown, this evidence further supports the model that the MTBC originated in Africa (Gagneux, 2012; Hershberg et al., 2008).

1.2.5 Selective pressures acting within the MTBC

Genetic diversity is introduced and fixed into populations by the four primary evolutionary forces: mutation, natural selection, genetic drift and gene flow (Robinson et al., 2010a). Mutation is a stochastic process affecting DNA regardless of function, but only those mutations that ‘survive’ the processes of genetic drift and selection will be detected in the genome. Genetic drift is a change in allele frequency over time due to random sampling over the course of multiple generations. Importantly, it is dependent on effective population size; smaller populations are more strongly affected by genetic drift than larger populations. In contrast, natural selection is a non-random process determined by the differential survival of genetic variants within a population (Robinson et al., 2010a). Finally, gene flow in the form of horizontal gene transfer (HGT) or recombination can shuffle mutations and introduce new genetic information into populations. Importantly, while other mycobacterial species display gene flow, it has not been detectable in the MTBC (Hirsh et al., 2004; Supply et al., 2003), leaving the three former evolutionary forces acting within the MTBC. Mutation, selection and drift are intrinsically interdependent, and Hershberg et al. used the MLSA dataset to explore the evolutionary forces that might have shaped MTBC genetic diversity (Hershberg et al., 2008). Comparison of nonsynonymous SNPs (which cause an amino acid change) with synonymous SNPs (no amino acid change) can provide a measure of the selective pressures acting within a sequence. This is expressed as the dN/dS ratio, whereby the ratio of nonsynonymous SNPs to potential nonsynonymous sites (dN) is divided by the respective synonymous ratio (dS); a ratio near unity indicates the absence of selection, whilst the ratio increases under positive selection and decreases under purifying selection (Rocha et al., 2006). Positive selection describes the process of certain alleles increasing in frequency due to a greater fitness than others, whilst purifying selection purges deleterious alleles, likely generated by nonsynonymous SNPs, from the population. Applied to the MLSA it was found that 62% of the SNPs were nonsynonymous and 38% synonymous, corresponding to a dN/dS ratio of 0.57. To put this in context, the dN/dS ratio for M. canettii, the outlying member of the MTBC, was 0.18, and in two sequenced Mycobacterium avium strains the dN/dS was 0.17 (see phylogeny in Figure 1.1). Similar ratios were observed across all other Actinobacteria, hence the dN/dS seen in the MTBC is markedly high compared to other mycobacteria. It was concluded that purifying selection is strongly reduced in the MTBC.
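The dN/dS ratio described above can be sketched in a few lines of Python. This is a simplified, uncorrected calculation (it omits the correction for multiple substitutions used by standard methods), and the numbers of potential nonsynonymous and synonymous sites in the example are hypothetical values chosen for illustration, not figures from Hershberg et al. (2008).

```python
def dn_ds(nonsyn_snps, syn_snps, nonsyn_sites, syn_sites):
    """Uncorrected dN/dS from SNP counts and counts of potential sites."""
    if min(syn_snps, nonsyn_sites, syn_sites) <= 0:
        raise ValueError("synonymous SNPs and site counts must be positive")
    dn = nonsyn_snps / nonsyn_sites  # nonsynonymous SNPs per nonsynonymous site
    ds = syn_snps / syn_sites        # synonymous SNPs per synonymous site
    return dn / ds


if __name__ == "__main__":
    # 62 nonsynonymous and 38 synonymous SNPs per 100 SNPs, as in the text.
    # The site counts (7400 and 2600) are hypothetical.
    print(round(dn_ds(62, 38, 7400, 2600), 2))
```

With these assumed site counts the sketch returns approximately 0.57. It shows how a majority of nonsynonymous SNPs can still give a ratio below one, because a coding sequence contains more potential nonsynonymous sites than synonymous sites.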

The consequence of reduced purifying selection in the MTBC was examined at the level of conservation of amino acid positions in the 89 genes sequenced across the MTBC strains. Orthologs were found for 62 genes in mycobacteria distantly related to the MTBC strains, and using a multiple sequence alignment of these genes the amino acid positions were divided into either conserved or variable positions. This classified 64% of the amino acid positions in mycobacteria as conserved and 36% as variable. Mutations within conserved positions are more likely to have a functional effect than those at variable positions. Nonsynonymous changes in M. canettii predominantly fell into variable positions (72%), but the majority (58%) of amino acid mutations in the MTBC fell into conserved positions. This percentage was not dissimilar from that expected if purifying selection in the MTBC was no longer making a distinction among mutations in these two classes of sites (Hershberg et al., 2008).
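The classification of alignment positions can be sketched as follows. The criterion used here (a column is conserved only if every ungapped sequence carries the same residue) is an assumption made for illustration, since the exact rule applied in the original study is not stated above, and the short alignment is hypothetical.

```python
def classify_positions(alignment):
    """Label each alignment column as 'conserved' or 'variable'.

    Assumed rule: a column is conserved when exactly one residue type is
    observed once gap characters ('-') are ignored. Gap-only columns are
    therefore labelled variable.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    labels = []
    for column in zip(*alignment):
        residues = set(column) - {"-"}
        labels.append("conserved" if len(residues) == 1 else "variable")
    return labels


def fraction_in_conserved(labels, mutated_positions):
    """Fraction of mutated positions (0-based) that fall in conserved columns."""
    if not mutated_positions:
        raise ValueError("at least one mutated position is required")
    hits = sum(1 for pos in mutated_positions if labels[pos] == "conserved")
    return hits / len(mutated_positions)


if __name__ == "__main__":
    # Hypothetical alignment of three orthologous protein fragments.
    orthologs = ["MKTAL", "MKSAL", "MRTAL"]
    labels = classify_positions(orthologs)
    print(labels)
    print(fraction_in_conserved(labels, [0, 2, 3]))
```

Comparing the fraction returned for one set of mutations with the overall proportion of conserved columns gives a rough indication of whether those mutations avoid conserved positions.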

1.3 Phenotypic diversity

Whilst the outcome of human tuberculosis infection and the resulting disease is highly variable and has been attributed to many factors, including host and environmental variables, the impact of bacterial strain variation on the clinical outcome of human infection by the MTBC remains an open question. A number of studies have explored the phenotypic differences between specific strains. Many of the earlier studies were based on a small set of canonical laboratory reference strains, whilst later studies moved to the use of clinical strains, increasingly informed by the phylogenetic structure of the MTBC. The former studies are discussed in the next subsection, followed by a discussion of clinical strain phenotypes.

1.3.1 Laboratory strains

As introduced above, many early studies were based on a few characterised reference strains, namely the laboratory strains H37Rv, H37Ra and Erdman, and the vaccine strain M. bovis BCG (reviewed in Coscolla & Gagneux, 2010). In addition, two reference clinical strains, CDC1551 and HN878, isolated from TB outbreaks in Tennessee and Texas respectively, have also been used (Jones et al., 1999; Valway et al., 1998). In a phylogenetic context these strains are not representative of MTBC diversity, with H37Rv, H37Ra, Erdman and CDC1551 all from Lineage 4, whilst HN878 is part of the Beijing subgroup of Lineage 2 (Figure 1.2).

One clear difference in phenotype among the above laboratory and clinical reference strains is seen in infections with strain HN878. HN878 is consistently associated with a low inflammatory response and increased virulence in both in vitro macrophage studies and in vivo animal models compared to the other laboratory strains (Manca et al., 1999; Manca et al., 2001; Manca et al., 2005). In a mouse challenge study using several clinical strains, HN878 was found to be hypervirulent, causing unusually early death of infected immune-competent mice (Manca et al., 2001). The hypervirulence of HN878 was suggested to be due to the failure of this strain to stimulate the Th1-type immunity needed for control of M. tuberculosis infection (Manca et al., 2001).

All studies that utilise laboratory strains suffer from the same issue of strain adaptation to laboratory conditions. This phenomenon was exploited to create the laboratory strain H37Ra, an avirulent M. tuberculosis strain that was generated by culturing H37, the parental strain of H37Rv, on solid egg medium and selecting for resistance to lysis (Steenken, 1935). Adaptation can also affect clinical strains, but can be managed through minimal handling and passaging of cells, thereby limiting the number of generations and the potential for mutation. Adaptation can lead to changes in the virulence of the strain, such as the loss of phthiocerol dimycocerosate (PDIM) from strain H37Rv grown in vitro. PDIM is a wax-like compound and an important cell wall lipid associated with mycobacterial virulence (Domenech & Reed, 2009). The other laboratory strain, H37Ra, does not synthesise a number of cell surface antigens, including sulfolipid-1 and trehalose mycolates, as well as PDIM (Chesne-Seck et al., 2008). As H37Rv and other laboratory strains have been passaged for many decades outside of the human host (Ioerger et al., 2010), their relevance in studies of infection and virulence is debatable. This is further underscored by the genomic diversity seen among cultures of H37Rv, which has been grown in numerous laboratories throughout the world, effectively in an unintentional in vitro evolution experiment, resulting in their separation by multiple SNPs and frameshift insertions and deletions (indels) (Ioerger et al., 2010).

1.3.2 Clinical strain phenotype

Whilst there is currently little evidence of common phenotypic differences at the lineage level, multiple phenotypes have been identified in nearly forty studies investigating the virulence and immunological characteristics of clinical strains (Coscolla & Gagneux, 2010). One consistent phenotype is the lower induction of proinflammatory cytokines by the Beijing sub-lineage of Lineage 2 (Figure 1.2) compared to H37Rv and other strains. This group of strains is so described as they are endemic in many parts of East Asia, and account for the majority of cases of TB in these regions (Qian et al., 1999); they have also been described as the W-Beijing family of strains (Glynn et al., 2002). The Beijing group has subsequently become the focus of numerous studies owing to its recent spread in human populations (Cowley et al., 2008) and its association with multidrug resistance (Borrell & Gagneux, 2009). Whilst the characteristics that predispose this family of strains to such clinical outcomes have not been fully resolved, Reed et al. (2007) showed that Beijing strains accumulate large quantities of triglycerides in in vitro aerobic culture, and that this was linked to the constitutive over-expression of genes that are members of the DosR-controlled regulon. DosR is induced under conditions that are likely to occur during latent infection, such as exposure to nitric oxide and low oxygen tension, and is thought to contribute to bacterial persistence (Kumar et al., 2007). One consequence of this constitutive expression is the observed accumulation of large quantities of triglycerides during in vitro aerobic culture, in contrast to non-Beijing strains. The authors hypothesised that the triglycerides provide an adaptive advantage to the Beijing strain family by acting as an energy source during infection (Reed et al., 2007), which would represent the first example of an in vitro phenotypic characteristic shared at the MTBC strain sub-lineage level (Nicol & Wilkinson, 2008).

From a clinical perspective, early studies of MTBC strain variation found that strains from South India were less virulent and had increased susceptibility to oxidative stress compared to strains from Great Britain (Mitchison et al., 1960; Mitchison et al., 1963). Although these strains were not genotyped at the time, it can be speculated from current knowledge of MTBC phylogeography that this represents a divide between Lineage 1 (Indo-Oceanic) and Lineage 4 strains (Coscolla & Gagneux, 2010). Another example of differences between MTBC strains detected at the clinical level is Lineage 2, which has been associated with extrapulmonary (Kong et al., 2007) and meningeal TB (Caws et al., 2008) compared to strains from other lineages. Several studies have also associated Lineage 2 with HIV coinfection (Caws et al., 2006), but the experimental phenotype is not clear and the association has been contested by other studies that found no significant associations (de Jong et al., 2009). In summary, the extent to which clinical MTBC phenotypes are shared by strains belonging to broader phylogenetic lineages is largely unknown, but this may reflect the previous paucity of research in this area (Nicol & Wilkinson, 2008). In the context of increasing evidence that the amount of sequence variation in the MTBC has been underestimated, genetic diversity may have important phenotypic consequences, including an impact on areas such as drug and vaccine design (Gagneux & Small, 2007).

1.4 Linking genotype to phenotype

The first step towards understanding the influence of genetic diversity in the MTBC on TB infection is to understand the molecular mechanisms that link strain diversity to phenotype. This is a challenging area of research and there are few examples of such studies for the MTBC. The previously described study by Reed et al. linked the accumulation of triacylglycerides to the constitutive over-expression of the DosR regulon (Reed et al., 2007). This has recently been partially associated with a 350 kb genomic duplication that is present in some strains from Lineage 2 (Domenech et al., 2010). A second example is a link between the hypervirulence of some Lineage 2 strains and the production of the immune modulatory phenolic glycolipid (PGL). It was found that the laboratory strain H37Rv and other members of Lineage 4 do not produce PGL due to a seven base pair frameshift deletion in the pks1/15 gene cluster, which encodes a polyketide synthase involved in the production of PGL (Constant et al., 2002). If pks1/15 is disrupted in the Lineage 2 laboratory strain HN878, then the hypoinflammatory and hypervirulent phenotype is lost (Reed et al., 2004). However, this phenotype is more complex than simply the presence of an intact pks1/15. Insertion of an intact pks1/15 into the Lineage 4 H37Rv laboratory strain did not result in increased virulence (Sinsimer et al., 2008), demonstrating the importance of taking into account the lineage genetic background of the strain in question.

With advances in sequencing technology, the number of MTBC strains sequenced and the associated number of SNPs identified are rapidly increasing (Stucki & Gagneux, 2012). Figure 1.4 shows the number of MTBC genome sequences within the NCBI Short Read Archive (SRA), a repository for all next-generation genome sequencing data, which currently stands at 4,913 MTBC genome sequences. SNPs are the most common form of genetic variation in the MTBC, followed by insertions and deletions (indels), and a total of 9,037 SNPs were discovered by sequencing twenty-one clinical strains of MTBC (Comas et al., 2010). Whilst this presents an opportunity to understand the impact of such SNPs, there are also considerable challenges due to the sheer number of SNPs identified, which will only grow with the associated increase in comparative genome sequencing studies.

[Figure 1.4 (bar chart). x-axis: Year, 2008 to 2013; y-axis: Number of genomes in NCBI SRA. Bar labels recoverable from the figure, in ascending order: 0, 1, 355, 1799, 4675, 4913.]

Figure 1.4. The number of MTBC genome sequences in the NCBI Short Read Archive (SRA). The database was queried on 21-02-2013 using the search term Mycobacterium tuberculosis complex. The year 2013 is not complete and only representative of nearly the first two months of the year.

1.4.1 In silico prediction of functional SNPs

Whilst identifying SNPs in bacterial genomics studies is becoming relatively simple through whole genome sequencing using one of the second-generation technologies (Loman et al., 2012), understanding the effects of sequence variation has become a major effort in mutation research (Thusberg & Vihinen, 2009). Experimental study of the molecular effects of all MTBC SNPs identified in recent studies, such as those found in the above twenty-one genome study, is unfeasible. The development of computational methods to distinguish SNPs likely to have a functional effect from those that are neutral has therefore been a highly active field within bioinformatics, and a number of computational tools have been created for this purpose (Bao & Cui, 2005; Cingolani et al., 2012; Ng & Henikoff, 2006). From here on, the term functional SNP is used to refer to those SNPs that are expected to alter gene expression or function, and are therefore associated with a phenotype. Use of such methods to predict functional SNPs can help prioritise additional research on those SNPs more likely to affect protein function.

Methods that predict whether a SNP has a functional effect use either sequence or structural information, or a combination of both, to form the prediction. Such methods rely on the evidence that mutations which affect protein function tend to occur at evolutionarily conserved positions, or are buried in the interior of the protein structure (Ng & Henikoff, 2006). Predictions based on sequence information typically follow a common procedure, as implemented by Ng & Henikoff in their SIFT prediction algorithm (Ng & Henikoff, 2003). Firstly, an input sequence is used in a database search for homologous sequences. These are used to create a multiple sequence alignment, which identifies the evolutionarily conserved positions, and these are inferred to be important for function. A scoring method based on the frequency of each amino acid at each position, and on the severity of an amino acid change, is then applied to each position in the input sequence. The introduction of an amino acid that does not appear at a given position in the alignment can still be classified as neutral rather than functional, as predictions also use the physicochemical properties of the amino acids already present in the alignment. For example, if a position in an alignment contains the hydrophobic amino acids isoleucine, leucine and valine, then this position can likely only contain hydrophobic amino acids, and changes to other hydrophobic amino acids, such as methionine, will likely not have a functional effect (Ng & Henikoff, 2003; Ng & Henikoff, 2006).
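The final step of this procedure can be illustrated with a toy Python sketch. It is not the SIFT algorithm: it replaces SIFT's probability-based scoring with a simple rule, in which a substitution is called tolerated if the new amino acid is already observed in the alignment column, or if every observed residue and the new residue share one physicochemical class. The four-class grouping of amino acids below is a coarse assumption made for this example.

```python
# Coarse, assumed grouping of the 20 amino acids (one-letter codes).
AA_CLASSES = {
    "hydrophobic": set("AILMFVW"),
    "polar": set("STNQCYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def residue_class(amino_acid):
    """Return the name of the assumed physicochemical class of a residue."""
    for name, members in AA_CLASSES.items():
        if amino_acid in members:
            return name
    raise ValueError(f"unknown amino acid: {amino_acid!r}")


def predict_substitution(column, new_aa):
    """Classify a substitution at one alignment column.

    column: residues observed at this position in the homologous sequences
            (gap characters '-' are ignored).
    Returns 'tolerated' or 'functional'.
    """
    observed = {aa for aa in column if aa != "-"}
    if not observed:
        raise ValueError("column contains no residues")
    if new_aa in observed:
        return "tolerated"
    observed_classes = {residue_class(aa) for aa in observed}
    if len(observed_classes) == 1 and residue_class(new_aa) in observed_classes:
        return "tolerated"
    return "functional"


if __name__ == "__main__":
    # The example from the text: a column containing I, L and V.
    print(predict_substitution("ILV", "M"))
    print(predict_substitution("ILV", "D"))
```

For the column containing isoleucine, leucine and valine, a change to methionine is classed as tolerated and a change to aspartate as functional, mirroring the example above. A real predictor would weight residues by their frequency in the alignment rather than applying a yes or no rule.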


1.4.2 Gene expression diversity

After the genome sequence of the first M. tuberculosis strain was published in 1998 (Cole et al., 1998), and the extent of genetic diversity was beginning to be uncovered (Comas et al., 2010; Hershberg et al., 2008), the next logical step in understanding the consequences of such genetic diversity is to build upwards from the genomic information layer. Uncovering the complexity of phenotypic differences in the MTBC likely requires the integration of multiple layers of biological information (Comas & Gagneux, 2009), and moving from the DNA to RNA level to explore MTBC transcriptional diversity is discussed in the following section.

In the first systematic survey of variation in mRNA expression, Gao et al. compared the gene expression of ten clinical isolates of M. tuberculosis in addition to the reference strains H37Rv and H37Ra (Gao et al., 2005). All isolates were grown in vitro under exponential growth conditions. The authors found that 527 (15%) of the genes tested were variable amongst the isolates, highlighting for the first time strain-to-strain variability in expression under identical growth conditions. Combined with gene function information, it was found that genes involved in lipid metabolism were statistically over-represented amongst the variable genes; it was speculated that this could have implications for virulence, as lipids and lipid metabolism are thought to have an important role in host-pathogen interactions (Barry, 2001; Forrellad et al., 2012; Reed et al., 2004). A further 16% of genes were consistently expressed, and as might be expected this class was over-represented by genes from the information pathways class; this class consists of genes associated with replication, transcription and translation (Lew et al., 2011), which are consequently highly expressed in actively growing bacteria. Approximately two-thirds of the remaining genes in the study were equally split between low or undetectable and unexpressed classes. Many of these genes were classed as unknown hypotheticals, and so could represent incorrect annotation of coding regions, or alternatively discovery bias through the use of only one culture condition (Gao et al., 2005). Overall the study identified transcriptional variation amongst a set of clinical isolates, with implications for the choice of targets for drugs, vaccine development and diagnostic markers. The study predates the robust classification of the phylogenetic lineages of the MTBC (Gagneux et al., 2006a), which limits the use of the results in a phylogenetic context.


More recently, a study of transcriptional variation amongst clinical isolates of the MTBC has been undertaken within a phylogenetic framework using microarray technology (Homolka et al., 2010). The authors included fifteen clinical strains from four MTBC lineages (Lineages 1, 2, 4 and 6), plus the reference strains H37Rv and CDC1551, which are part of Lineage 4. Under in vitro exponential growth conditions the authors identified 364 genes (9.1% of all annotated genes) differentially expressed between strains of different lineages in at least one pairwise comparison. Several genotypic signals were identified, such as the dysregulation of the virS-mymA operon in Lineage 1, thought to be involved in maintenance of the cell wall structure (Singh et al., 2003), and over-expression of the dosR two component regulator in the Beijing strains, which controls the DosR regulon and is described in section 1.3.2. Analyses were extended to the transcriptional response of intracellular bacilli before and after infection of resting and activated murine macrophages. Apart from identifying the core universal induction or repression of 280 genes (7.0%) in all strains regardless of state compared to in vitro expression, a proportion of genes (293 genes; 7.3%) displayed significant genotypic patterns in response to the intracellular conditions in the macrophage (Homolka et al., 2010). This study currently represents the most comprehensive survey of transcriptional diversity in the human-adapted MTBC. The presence of genotypic signals implicates the effect of the underlying genotypic diversity, driven by large deletions, indels, and coding and noncoding SNPs, although this was not explored in the study.

In 2007, the global transcriptional differences between a strain of M. bovis and the reference strain H37Rv were investigated by microarray (Golby et al., 2007). This study provides a useful comparison between a human-adapted strain and M. bovis, which, whilst it can be sustained in humans, is regarded as a primary pathogen of wild and domesticated animals (as discussed in section 1.1.2). Under nutrient limited conditions and in steady state growth, it was found that 92 genes (2.3%) had 3-fold differential expression. Genes showing higher expression were equally split between the two strains. Focusing again on the major gene functional categories, a large proportion of differentially expressed genes encoded proteins involved in the cell wall, lipid metabolism, gene regulators, the PE/PPE protein family, and toxin–antitoxin (TA) gene pairs.

The growing understanding that regulatory processes are often mediated by RNA molecules, beyond the classical view of protein-based regulation, was combined with advances in sequencing technology to uncover the total transcriptome of M. tuberculosis by RNA-sequencing (RNA-seq) (Arnvig et al., 2011). The RNA-seq method is discussed in the following section (1.4.3). All RNA molecules from in vitro exponential and stationary phase cultures of M. tuberculosis strain H37Rv were sequenced, and it was found that more than a quarter of all sequence reads mapped to intergenic regions; this excluded the highly expressed ribosomal RNAs involved in protein synthesis. Accounting for the size of the intergenic regions based on the H37Rv genome size, this represented a 2-fold higher density of noncoding RNA expression compared to gene expression (mRNA transcription). The noncoding RNA included 5’ and 3’ untranslated regions (UTRs), antisense transcripts, and intergenic small RNA (sRNA) molecules. Although based on the reference strain H37Rv, the work provides an important benchmark for future studies of transcriptional diversity in MTBC strains, demonstrating the significant quantity of RNA expression that had not been detectable in previous microarray-based studies.

1.4.3 High-throughput DNA sequencing technology

Our awareness of greater levels of genetic diversity in the MTBC has been largely driven by technology changes in sequencing, and next-generation high-throughput DNA sequencing is likely to play an important role in improving our understanding of TB (Loman et al., 2012); whilst this technology is often described as next-generation sequencing, this term is likely to become less useful as the technology advances by further generations. As introduced earlier, in 2010 Comas et al. sequenced twenty-one representative clinical MTBC strains, and this was performed using Illumina sequencing by synthesis technology (Comas et al., 2010; Loman et al., 2012). This genome set has since become an ideal basis on which to perform later phylogenetic studies employing ever increasing numbers of MTBC strains (Bentley et al., 2012). This section briefly introduces the technology, focusing specifically on the methods used in this thesis, namely genome and RNA-sequencing using the Illumina sequencing platform.

Recent advances in DNA sequencing technologies have enabled the determination of nucleotide sequence at a greater data throughput, in a shorter amount of time and at lower cost than was previously possible using capillary-based Sanger sequencing (Shendure & Ji, 2008). Several novel approaches have been developed including 454 (pyrosequencing) and Illumina sequencing, previously known as Solexa sequencing. The Illumina system was established at NIMR in 2010, initially with an Illumina Genome Analyser IIx sequencer (GA), and later with the Illumina HiSeq2000 (HS); the HS sequencer was the result of technical developments and has five times greater data output than the older GA sequencer (Loman et al., 2012). The Illumina method involves sequencing millions of short reads, initially 36bp but more recently ~100bp, using a flowcell-based system for capturing DNA. The sequencing reactions take place in the flowcell, which is divided into eight lanes, and therefore up to eight different samples can be added. This limitation of sample number has been removed by recent multiplexing technology, which utilises sequence tags to track each sample and therefore increases the number of individual samples that can be added to each flowcell lane (Meyer & Kircher, 2010).

Briefly, there are three broad stages in the generation of sequence data: library preparation, amplification and sequencing. Libraries are initially constructed by one of several methods that generate a mixture of DNA fragments with ligated adaptor sequences up to several hundred bp in length. These are amplified using PCR primers attached to a flowcell, resulting in the physical clustering of the DNA templates across the flowcell, creating a lawn of sequence fragments (Shendure & Ji, 2008). This is followed by sequencing, consisting of multiple cycles of single base extensions using fluorescently labeled reversible terminator nucleotides and imaging to detect which base has been incorporated, thereby determining the base in the sequence (Bentley et al., 2008). At the end of each cycle the labeled nucleotide is cleaved and another round of terminators is added; the number of cycles therefore determines the length of the reads generated.

The Illumina sequencing platform generates considerable quantities of data per run, with each flowcell producing up to 6 billion reads, which translates into 600 gigabases (Gb) of sequence data. Apart from creating demands on storage capacity, with image data from each flowcell requiring 32 terabytes of temporary storage, a robust informatics pipeline is required to handle the downstream analysis (Bentley, 2010). There are two main analytical approaches to using the sequence data: one involves aligning to a reference sequence, also known as a mapped assembly, and the other is reference-free and therefore a de novo assembly. The short read data generated by the Illumina sequencers is most applicable to the former method, and is very useful in the discovery of SNPs and in phylogenetics.
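The relationship between read number, read length and data yield quoted above is simple arithmetic, sketched below. The function names are hypothetical, and the genome size of approximately 4.4 Mb and the figure of 10 million reads per sample in the coverage example are assumptions used only for illustration.

```python
def yield_in_gigabases(n_reads, read_length_bp):
    """Total sequence output in gigabases (1 Gb = 1e9 bases)."""
    return n_reads * read_length_bp / 1e9


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_length_bp / genome_size_bp


# 6 billion reads of 100 bp correspond to the 600 Gb quoted in the text.
print(yield_in_gigabases(6_000_000_000, 100))  # 600.0

# Assumed example: 10 million 100 bp reads mapped to a genome of ~4.4 Mb.
print(round(mean_coverage(10_000_000, 100, 4_400_000), 1))
```

Even a small fraction of one lane therefore gives deep coverage of a genome of this size, which is why multiplexing several strains per lane is practical.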


High-throughput sequencing has translated into numerous publications that provide new insight into the evolution and genomic diversity of bacteria (Comas et al., 2010; Holt et al., 2008; Qi et al., 2009). This technology is being applied to other disciplines, such as transcriptomics, where high-throughput sequencing of RNA transcripts (RNA-seq) is creating a powerful new approach to characterisation of the bacterial transcriptome (Perkins et al., 2009). For over ten years, microarray technology has allowed the simultaneous monitoring of expression levels of all annotated genes in cell populations (Schena et al., 1998). Whilst microarrays have been instrumental in our understanding of transcription, generating a wealth of publications and data, limitations in their applicability have begun to be reached (Mortazavi et al., 2008). Inherent issues such as the limited dynamic range for the detection of transcript levels, cross hybridisation and the need for normalisation provide some explanation for the explosion in use of second-generation technologies in the analysis of transcriptomes (Marguerat & Bähler, 2010). As well as surveying the total transcriptional landscape, adaptation of the library making process can facilitate Transcriptional Start Site (TSS) mapping, whereby the precise position of transcription initiation can be determined in a genome-wide manner (Filiatrault et al., 2011; Sharma et al., 2010b). This can provide greater understanding of the transcriptional output, and in the human pathogen Helicobacter pylori revealed a complex structure of TSS within operons and opposite to annotated genes (Sharma et al., 2010b).


1.5 Thesis Outline

In this thesis, the identification and effect of lineage-specific genetic variation within the phylogenetic lineages is investigated using computational methods and high-throughput sequencing technology. This is driven by the overarching hypothesis that fixation of mutations at evolutionarily conserved positions in the lineages of M. tuberculosis, either due to a relaxed selective constraint or positive selection, has resulted in functional consequences that separate the MTBC lineages. Chapter 3 begins with the construction of a representative 28-genome phylogeny using Illumina sequencing data. Comparative analysis focuses on the detection of all lineage-specific single nucleotide polymorphisms (SNPs), providing the first glimpse of the total SNP diversity that separates the main phylogenetic lineages from each other. The lineage-specific coding SNPs are used to investigate the evolutionary pressures acting within the lineages using population genetics measures and gene function categories. Chapter 4 applies in silico tools to the lineage-specific SNPs to predict those likely to have a functional effect. The focus is on the largest group of genetic variation, the nonsynonymous SNPs, and a significant over-representation of transcriptional regulators with predicted functional SNPs was detected. Chapter 5 moves from the DNA to RNA level using a transcriptomic approach. RNA-sequencing of multiple strains from two lineages was performed, and differential expression analysis used to define lineage-specific transcriptomes. Along with the differential expression of genes between the lineages, the experimental method used allowed novel expression of noncoding and antisense transcripts to be detected. In the context of previously identified lineage-specific SNPs, significant associations were found between the genomic and transcriptomic data, which were found to arise by three main mechanisms. These have the potential to alter the response of isolates to differing microenvironments and to modulate expression of ligands involved in innate immune recognition.


Chapter 2 Materials and Methods

The following chapter details all protocols used in this thesis. It begins with the basic laboratory methods used in the culture of Mycobacterium tuberculosis and the strains used. Genome and RNA sequencing are then outlined, alongside the bioinformatics analysis tools used to interpret the resulting data. Details of MTBC strains and specific bioinformatics analyses are given in results Chapters 3 to 5.

2.1 General microbiological methods

2.1.1 Containment 3 laboratory

All culturing of M. tuberculosis strains was performed in a Biosafety Level 3 laboratory, and work was undertaken within a Class II flow cabinet at a negative pressure of at least 160kPa.

2.1.2 General chemicals and reagents

Unless otherwise stated all laboratory chemicals were purchased from Sigma-Aldrich. Buffers were prepared as aqueous solutions using distilled water, and solutions were sterilised either by autoclaving or filtration (Millipore, 0.22μm) depending on the volume.


2.1.3 Bacterial culture and storage

Growth of M. tuberculosis strains used in this study was performed in liquid Middlebrook 7H9 growth media (Difco, Becton Dickinson). The 7H9 media was supplemented with 0.5% glycerol (Fisher Scientific), 10% Middlebrook ADC (Albumin, Dextrose, Catalase), and, to help prevent clumping of the cells during growth, 0.05% Tween-80. This is a standard rich nutrient medium used to culture M. tuberculosis (Atlas & Snyder, 2006). Cultures were grown in one litre roller bottles (Nalgene) in a rolling incubator at 37°C. For long-term storage all isolates were stored at -20°C in 2ml cryo tubes (Sigma-Aldrich), and supplemented with 10% glycerol to increase viable cell number during storage.

2.1.4 Growth curves

Growth curves of the bacterial strains used in this study were performed to determine the previously unknown growth rates of the clinical isolates, which is critical for the extraction of RNA from the correct growth phase for subsequent experiments. This would also provide important phenotypic data on potential differences in in vitro growth rates between the lineages.

Inoculation of 50ml conical screw cap Falcon tubes (Fisher Scientific) containing 10ml 7H9 medium was performed two days prior to the start of the growth curve experiment. On starting the experiment, a roller bottle with 100ml 7H9 was inoculated with the pre-culture so that the starting OD was 0.01 (the lower limit of detection by the spectrophotometer). Samples of 1ml were taken every 24hrs and the OD measured in a 1ml cuvette.

2.1.4.1 Optical density (OD) measurements

The optical density (OD) method was used to measure the growth of mycobacterial cultures in the above protocol. This is a rapid method that employs a spectrophotometer to measure the difference in light transmission at a certain wavelength before and after passing through a path length of a culture sample in a cuvette. Here an Amersham Bioscience spectrophotometer was used for all OD measurements. All readings were taken at a wavelength of 600nm (OD600), and sterile 7H9 used as a reference. Saturation of absorbance occurs at > 1 OD, therefore any readings above this were taken from a diluted sample (typically 1:10) and multiplied by the dilution factor afterwards.
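The dilution correction and the estimation of a growth rate from daily OD600 readings can be sketched as follows. This is a minimal illustration and not the analysis used in this thesis; the function names and the OD values are hypothetical, and the doubling time is estimated by an ordinary least-squares fit of ln(OD) against time over readings assumed to lie within exponential phase.

```python
import math


def corrected_od(reading, dilution_factor=1):
    """Scale a reading from a diluted sample back to the undiluted culture."""
    return reading * dilution_factor


def doubling_time_hours(times_h, ods):
    """Doubling time from the least-squares slope of ln(OD) against time."""
    logs = [math.log(od) for od in ods]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    variance = sum((t - mean_t) ** 2 for t in times_h)
    slope = covariance / variance  # specific growth rate per hour
    return math.log(2) / slope


# Hypothetical readings taken every 24 hrs from a starting OD of 0.01.
times_h = [0, 24, 48, 72]
ods = [0.01, 0.02, 0.04, 0.08]
print(round(doubling_time_hours(times_h, ods), 2))  # 24.0

# A 1:10 diluted sample reading 0.15 corresponds to an undiluted OD near 1.5.
print(corrected_od(0.15, dilution_factor=10))
```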

2.2 Molecular biology techniques

2.2.1 Genomic DNA extraction

Genomic DNA was extracted using the CTAB method described previously (van Soolingen et al., 1991). 20ml of culture with an OD of ~0.5 was transferred into a sterile 50ml conical tube and centrifuged at 3000xg for 10mins to pellet the bacteria. The supernatant was decanted and the pellet resuspended in 1ml lysis buffer. The suspension was transferred into a 2ml screw cap tube, and placed into a water bath at 90°C for 1hr. Following this step the crude cells and lysate were transferred to a containment 2 laboratory. The cells were pelleted at 13,000xg, the supernatant discarded, and the pellet resuspended in 400µl lysis buffer and 100µl of 10mg/ml lysozyme, gently mixed, and incubated at 37°C for 2hrs.

The cell lysis step consisted of the addition of 50µl 20% SDS and 25µl Proteinase K to the cell mix. The sample was incubated at 55°C for 40mins, and 250µl of 4M NaCl added and gently mixed. 160µl of preheated CTAB was added and the sample incubated for 10mins. To separate the DNA from protein contamination, 900µl chloroform-isoamyl alcohol (24:1) was added and the biphasic suspension vortexed, then centrifuged for 10mins at 13,000xg at 4°C to separate the phases. The upper phase containing the DNA was transferred to a clean 2ml Eppendorf tube. DNA was precipitated with 700µl cold isopropanol and mixed by gently inverting the tube. Following a 2hr or overnight precipitation, the sample was centrifuged at 13,000xg for 10mins at 4°C. The supernatant was decanted and the pellet air-dried. 1xTE buffer was added to dissolve the DNA, which was then stored at 4°C.

2.2.2 RNA Isolation and handling

Inoculation of 10ml 7H9 media in Falcon tubes from previously frozen bacterial stock was performed for each experiment to enable the rapid growth of pre-cultures before scaling up to larger growth volumes. Following approximately two days, and before the OD reached 0.8, this culture was used to inoculate a roller bottle containing up to 180ml 7H9 liquid media.

As determined by growth curve experiments (section 5.3.1), exponential phase cultures were harvested at an OD of between 0.4 and 0.8, whilst stationary phase cultures were harvested one week after the OD had reached 1.0. When ready, cultures were cooled rapidly by addition of ice directly into the culture and centrifuged at 12,000xg for 15mins at 4°C, and the supernatant was decanted. RNA was then isolated using the FastRNA Pro Blue kit (QBiogene/MP Bio) following the manufacturer’s instructions. Briefly, 1ml of RNApro solution was added to the pellet and the cells resuspended by pipetting, and 1ml transferred to a blue-cap tube containing Lysing Matrix B. The cell mix in the tube was homogenised in a FastPrep Ribolyser (QBiogene/MP Bio) for 40secs at a setting of 6.0, and centrifuged at 12,000xg for 5mins at 4°C. The upper phase was transferred to a fresh microcentrifuge tube and incubated for 5mins, then 300µl chloroform was added and the sample vortexed for 10secs and centrifuged at 12,000xg for 5mins. Following transfer of the upper phase to a fresh microcentrifuge tube, 500µl of cold ethanol was added and the tube inverted five times.

Following this step the RNA suspension was transferred to a containment level 2 laboratory and precipitated for at least 2hrs or alternatively overnight. After precipitation, the sample was centrifuged at 12,000xg for 15mins at 4°C, the supernatant removed and the pellet washed in 500μl of cold 75% ethanol (made with DEPC-H2O). The ethanol was aspirated and the pellet air-dried at room temperature for 5mins, then the RNA resuspended in 100μl of DEPC-H2O.

2.2.3 Quantification of DNA and RNA by Nanodrop

A Nanodrop spectrophotometer (version ND-1000) was used to determine the quantity of DNA and RNA following the above protocols. This requires 1μl of sample to be placed on to the Nanodrop pedestal. The Nanodrop then measures the absorption of the sample at a range of wavelengths (230-350nm). This correlates with the concentration of nucleic acid present, given in ng/μl. The Nanodrop also provides a measure of the quality of the DNA or RNA extraction. Nucleic acids and proteins have absorbance maxima at 260 and 280nm, respectively. A 260/280 ratio of ~1.8 is generally accepted as high quality for DNA, and a ratio of ~2.0 as high quality for RNA. If the ratios of DNA or RNA extractions were appreciably lower than these values, a repeated round of purification was performed to remove protein or other contamination that may be present in the sample.
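The calculations behind these readings can be sketched as below. The conversion factors are the standard values for a 1cm path length (an A260 of 1.0 corresponds to approximately 50ng/μl of double-stranded DNA or 40ng/μl of RNA). The function names and the tolerance used to decide what counts as appreciably lower are hypothetical choices made for illustration, not thresholds taken from this thesis.

```python
# Standard conversion factors for a 1 cm path length.
NG_PER_UL_PER_A260 = {"dsDNA": 50.0, "RNA": 40.0}

# Target 260/280 ratios quoted in the text.
TARGET_RATIO = {"dsDNA": 1.8, "RNA": 2.0}


def concentration_ng_per_ul(a260, kind):
    """Nucleic acid concentration estimated from absorbance at 260 nm."""
    return a260 * NG_PER_UL_PER_A260[kind]


def purity_ratio(a260, a280):
    """260/280 ratio; low values suggest protein contamination."""
    return a260 / a280


def needs_repurification(a260, a280, kind, tolerance=0.2):
    """Flag a sample whose ratio falls well below the target.

    The tolerance of 0.2 is an assumed cut-off for illustration only.
    """
    return purity_ratio(a260, a280) < TARGET_RATIO[kind] - tolerance


print(concentration_ng_per_ul(2.0, "RNA"))       # 80.0
print(needs_repurification(1.0, 0.5, "RNA"))     # False
print(needs_repurification(1.0, 0.7, "dsDNA"))   # True
```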

2.2.4 Determination of DNA and RNA integrity by micro fluidics

RNA and DNA concentrations were first measured using the Nanodrop, followed by quality control using the Agilent 2100 Bioanalyser. The Bioanalyser is a chip-based capillary electrophoresis machine for sizing, quantification and quality control of DNA, RNA, as well as proteins and cells. Depending on the sample type, the nucleic acid was measured using the Agilent DNA 1000 chip or Agilent RNA 6000 nano chip following the manufacturer’s instructions.

2.2.5 Removal of DNA contamination from RNA samples

Rigorous DNase treatment of all RNA samples was performed using the TURBO DNase free kit (Applied Biosystems). This procedure can remove > 200µg DNA per ml. Up to 5µg total RNA was treated in volumes of 50µl according to the manufacturer’s instructions. Briefly, 0.1 volume of 10X TURBO buffer and 1µl (2U) TURBO DNase were added to the 50µl total RNA aliquot and mixed well. This was incubated at 37°C for 20mins, followed by addition of a further 1µl (2U) TURBO DNase and a second 20min incubation. To terminate the reaction 0.2 volumes DNase Inactivation Reagent was added and incubated for 5mins at room temperature. The sample was then centrifuged at 13,000xg for 2mins and the supernatant, containing the DNase free RNA, transferred to a fresh microcentrifuge tube and stored at -20°C.

2.2.6 Polymerase chain reaction (PCR)

PCR was used to amplify specific regions of DNA. For general PCR amplification of template DNA, Supermix (Invitrogen) was used. Specific protocols, including DNA-seq, RNA-seq and qRT-PCR, used the manufacturer’s recommended reagents and are described in the following sections. All PCR reactions were performed in 0.2ml RNase- and DNase-free thin wall PCR tubes (Ambion) using an Applied Biosystems Veriti Thermal Cycler. As a negative control the same reaction was conducted in the absence of a DNA template.


2.3 Materials

2.3.1 Mycobacterium tuberculosis strains

At the start of this project, strain stocks sufficient for its entire duration were generated. Stocks were taken from a strain collection at NIMR derived from a global collection isolated in San Francisco (Gagneux et al., 2006a). Handling of stocks was kept to a minimum to minimise the effect of laboratory adaptation; strains were cultured for one week at NIMR to obtain sufficient stocks for this thesis. Stocks were frozen at OD 0.4-0.8 for use in subsequent exponential phase transcriptome sequencing experiments.

The specific MTBC strains used in this thesis are described in the respective results chapters (Chapters 3 and 4).

2.4 DNA-seq

Following extraction and quality control of DNA described in the above method, the Epicentre Nextera DNA kit was used to generate Illumina sequencing ready DNA libraries. Briefly, the Nextera method employs in vitro transposition to simultaneously fragment and tag DNA in a single-tube reaction, thereby facilitating the rapid generation of DNA libraries; accounting for all quality control procedures, library preparation can take less than two days. The manufacturer’s instructions were followed, and the High-Molecular-Weight Buffer (HMW) used, which generates fragments of 175-700bp and is recommended for paired-end sequencing. A limited PCR step was performed, consisting of a 72°C 3min extension step to denature the templates, followed by nine cycles of 95°C for 10secs, 62°C for 30secs and 72°C for 3mins. The amplified DNA fragments were subsequently purified using the Zymo column DNA Clean & Concentrator-5 kit.

Libraries for additional MTBC strains that were not part of this study were also generated using the above method at the same time, and therefore the Nextera barcoded adapters were used in the above PCR step. These can be used to add up to twelve unique barcodes to the Nextera libraries, enabling multiplexing of the libraries to reduce the sequencing cost.
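The principle of multiplexing can be illustrated with a minimal sketch in which each read is assigned back to its library of origin by its barcode. The barcode sequences, sample names and function name below are hypothetical, and only exact barcode matches are considered; in practice this step is handled by the sequencing software.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact match on the barcode.

    reads: iterable of (barcode_sequence, read_sequence) pairs.
    barcode_to_sample: mapping from barcode sequence to sample name.
    Reads with an unrecognised barcode are collected as 'undetermined'.
    """
    bins = defaultdict(list)
    for barcode, read in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(read)
    return dict(bins)


# Hypothetical barcodes and reads for two libraries sharing one lane.
barcodes = {"ACGTAC": "strain_A", "TGCATG": "strain_B"}
reads = [
    ("ACGTAC", "GGCCTTAA"),
    ("TGCATG", "TTAACCGG"),
    ("ACGTAC", "CCGGAATT"),
    ("NNNNNN", "AAAATTTT"),
]
print(demultiplex(reads, barcodes))
```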


2.5 RNA-seq

Following trialling of several methods to generate cDNA libraries ready for sequencing from the RNA extractions, two methods were chosen for the generation of transcriptomes in this thesis. The two methods are described below; one generates transcriptomes for differential expression analysis (2.5.1), whilst the other was used for transcriptional start site (TSS) mapping analysis (2.5.2).

2.5.1 Strand-specific RNA-seq libraries

The strand-specific protocol for transcriptome sequencing is largely based on the small RNA sample preparation protocol from Illumina (part # 1001375), but with exclusion of the polyA-tail and size selection methods in order to capture all RNA species. Total RNA from the above DNase treated RNA extraction was randomly fragmented, and specific 5’ and 3’ adapters attached to the ends of the RNA; the adapters are complementary to oligonucleotides immobilised on the glass surface of the Illumina flowcell. The protocol consists of six main steps: fragmentation, phosphatase treatment, PNK treatment, ligation of the adapters, reverse transcription and PCR amplification. These are followed by purification steps using Solid Phase Reversible Immobilisation (SPRI) beads.

Fragmentation: Initially 3-5µg of DNase treated RNA was fragmented following the described Illumina protocol with the 10X fragmentation reagent. The reaction was stopped with the stop solution and placed on ice, the volume increased to 100µl with RNase free water, and the RNA precipitated by adding 3 volumes of 100% ethanol, 0.1 volumes of sodium acetate (3M) (Ambion Cat # AM9740) and 0.05 volumes of glycogen. This was precipitated for at least 30mins at -20°C. The pellet was washed with 500µl of 70% ethanol, air-dried on ice and resuspended in 16µl of RNase free water in a 200µl PCR tube.

Phosphatase treatment: The sample was treated with 2µl Antarctic phosphatase with 10X Phosphatase buffer (NEB Cat # M0289S) and incubated for 30mins at 37°C, 5mins at 65°C and held at 4°C.

PNK treatment: To the previous PCR tube 2µl T4 Polynucleotide Kinase (PNK) (NEB Cat # M0201S), 17µl water, 5µl 10X PNK buffer, 5µl ATP (10mM) (Epicentre Cat # R109AT) and 1µl RNaseOUT (Invitrogen, part # 10777-019) were added, and the sample incubated for 60mins at 37°C and held at 4°C.


Phenol purification: The sample was transferred to a new 1.5ml microcentrifuge tube and the volume increased to 200µl by addition of RNase free water (Ambion, Cat # AM9920). 200µl acid phenol (Ambion Cat # AM9720) was added and the sample vortexed, and after centrifuging for 15mins at room temperature the upper phase was transferred to a new microcentrifuge tube. 3 volumes of cold 100% ethanol, 0.1 volumes of sodium acetate and 0.05 volumes of glycogen were added and the sample precipitated for 30mins or overnight. Following precipitation the sample was centrifuged for 25mins at 4°C, and the pellet washed in 70% ethanol and air-dried on ice. Once dry, 5µl RNase free water was added to the pellet.

Ligation of the adapters: Adapters were from the Illumina small RNA sample preparation kit with the v1.5 sRNA 3’ Adaptor (Illumina cat # FC-102-1009). Following the manufacturer’s instructions, the 3’ sRNA adaptor v1.5 and then the SRA 5’ Adapter were ligated to the 5µl RNA from the previous step.

Reverse transcription: 4µl of the 5’ and 3’ ligated RNA was mixed with 1µl diluted (1:5) SRA RT primer from the Illumina small RNA kit and heated at 70°C for 2mins. The standard SuperScript II Reverse Transcriptase kit with 100mM DTT and 5X first strand buffer (Invitrogen, part # 18064-014) was used to reverse transcribe the ligated RNA following the manufacturer’s instructions.

PCR Amplification: Using the Phusion DNA Polymerase kit (NEB part # M0530S) following the manufacturer’s instructions, 10µl of the product from the reverse transcription reaction was amplified in a thermal cycler using the following conditions: 30secs at 98°C, 17 cycles of: 10secs at 98°C, 30secs at 60°C, 30secs at 72°C, followed by 10mins at 72°C and then holding at 4°C.

Purification of libraries: The SPRI bead purification system (Agencourt AMPure from Beckman Coulter Genomics) was used to remove residual reagents from the previous steps to leave a purified DNA sample. The standard manufacturer’s instructions were used for two rounds of SPRI bead purification. The final supernatant was transferred to a fresh labelled RNase free tube, along with another 4µl aliquot for assessing library concentration and purity (using a Bioanalyser), and stored at -20°C.
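The ethanol precipitations in the fragmentation and phenol purification steps above specify reagent amounts as multiples of the sample volume. A minimal sketch of the corresponding arithmetic is given below; the function name and dictionary keys are hypothetical.

```python
def precipitation_volumes(sample_ul):
    """Reagent volumes (in µl) for an ethanol precipitation.

    Uses the ratios given in the protocol: 3 volumes of 100% ethanol,
    0.1 volumes of 3M sodium acetate and 0.05 volumes of glycogen.
    """
    return {
        "ethanol_ul": sample_ul * 3,
        "sodium_acetate_ul": sample_ul / 10,  # 0.1 volumes
        "glycogen_ul": sample_ul / 20,        # 0.05 volumes
    }


# The 100 µl sample from the fragmentation step.
print(precipitation_volumes(100))
# {'ethanol_ul': 300, 'sodium_acetate_ul': 10.0, 'glycogen_ul': 5.0}
```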


2.5.2 TSS 5’ enriched RNA-seq libraries

Terminator-5’-phosphate-dependent exonuclease (Epicentre Biotechnologies) was used to deplete processed RNAs in the samples used for TSS mapping analysis. Total RNA was sent to Vertis Biotechnologie AG (Freising, Germany) and Illumina ready libraries were constructed using the same protocol as above, but with the addition of the Terminator-5’-phosphate-dependent exonuclease step to remove all RNA transcripts without a 5’ triphosphate group. This step removes degraded mRNAs and rRNAs, thereby biasing sequencing towards the 5’ ends of primary transcripts and facilitating the mapping of transcriptional start sites (TSS).

2.6 Illumina sequencing DNA (genome) and cDNA (RNA-seq) libraries

The library sequencing stage was performed by the high-throughput sequencing (HTS) group at NIMR under the supervision of Abdul Sesay. Generated libraries were quality checked by Agilent DNA 1000 chip and quantified by Qubit (Invitrogen).

Briefly, sequencing libraries were denatured with sodium hydroxide and a 2nM dilution of the library loaded onto a single lane of an Illumina Genome Analyser IIx (GA) or HiSeq2000 (HS) flowcell. Cluster formation, primer hybridisation and single or paired-end sequencing were performed using proprietary reagents according to the manufacturer’s recommended protocol (Illumina).

2.7 Quantitative RT-PCR

To confirm differential expression identified by RNA-seq, qRT-PCR was carried out on a 7500 Fast Real-Time PCR System (Applied Biosystems) using Fast SYBR Green Master Mix (Applied Biosystems). To minimise problems arising from normalisation across plates, each 96-well plate followed a closed experimental design in which all clinical strain samples were included. RNA without RT (RT-) was analysed alongside cDNA (RT+). Standard curves were generated for each gene analysed, and the quantities of cDNA within the samples were calculated from cycle threshold values. Three biological replicates were tested, consisting of three qRT-PCR plates per gene tested. Data were averaged, adjusted for chromosomal DNA contamination (RT+ minus RT-) and normalised to the corresponding 16S rRNA values.


cDNA for quantitative RT-PCR was made with random primers and Superscript III according to manufacturer's instructions (Invitrogen). 2µg of DNase treated total RNA from each respective strain was used as the starting material. Three biological replicates per strain were used in this study.
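The quantification steps described in this section (standard curve, subtraction of the RT- signal, normalisation to 16S rRNA) can be sketched in Python. This is a minimal illustration rather than the analysis code used in this study: the function names, the log-linear form assumed for the standard curve, and all numerical values are assumptions made for the example.

```python
def fit_standard_curve(log10_quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept.

    Assumes a log-linear standard curve, which is the usual model for
    a dilution series (an assumption, not stated in the text).
    """
    n = len(ct_values)
    mean_x = sum(log10_quantities) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_quantities)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_quantities, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate a quantity from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


def normalised_expression(gene_rt_plus, gene_rt_minus,
                          ref_rt_plus, ref_rt_minus):
    """Subtract the RT- signal (DNA contamination) from the RT+ signal,
    then normalise the gene of interest to the reference gene (16S rRNA)."""
    gene = gene_rt_plus - gene_rt_minus
    ref = ref_rt_plus - ref_rt_minus
    return gene / ref


# Hypothetical dilution series: 10, 100, 1000 and 10000 copies.
slope, intercept = fit_standard_curve([1, 2, 3, 4], [37.0, 34.0, 31.0, 28.0])
quantity = quantity_from_ct(31.0, slope, intercept)
```

In this sketch the averaging over biological replicates would be applied to the values returned by `normalised_expression`.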

2.7.1 Primer sequences

Primers were designed using the Primer3 software (Rozen & Skaletsky, 2000), ordered from Sigma at 100µM concentration in 100µl aliquots, and stored at -20°C. Primers used in the RNA-seq study in Chapter 5 are shown in Table 2.1.

Table 2.1. Primer sequences used in the qRT-PCR study. Seven toxin-antitoxin genes were measured by qRT-PCR, and the 16S rRNA sequence was used in normalisation. In the primer column the suffix denotes the forward (F) and reverse (R) primers.

Gene      qRT-PCR primer    Sequence (5’ - 3’)
Rv2063    mazE7_F           TCCACGACGATTAGGGTTTC
Rv2063    mazE7_R           ACATCGAGATTCCCCGTTC
Rv2274A   mazE8_F           CGAACCAGAAACCCTTCCT
Rv2274A   mazE8_R           GACGACTCTGCTCCCAACTC
Rv2830c   vapB22_F          GATCGAGATCACCAAACACG
Rv2830c   vapB22_R          GGTGGTGAAGAGTTCGTCGT
Rv2758c   vapB21_F          GTATGCTCTCCGGGTGTGAC
Rv2758c   vapB21_R          TGTCGTGGTACCCAGTTCCT
Rv1398c   vapB10_F          GGACCTGCAGGCTATAAACG
Rv1398c   vapB10_R          GCAAGGTGCTGTTCACGAC
Rv1397c   vapC10_F          TGGACTTGGCGACTATCTGA
Rv1397c   vapC10_R          GGAAATGCCACACGTTGAG
Rv2527    vapC17_F          CGATATCGGCGAACTTGAAT
Rv2527    vapC17_R          CAGTGACGTTTGTTGGCTGT
16S       16S_F             AAGAAGCACCGGCCAACTAC
16S       16S_R             TCGCTCCTCAGCGTCAGTTA


2.8 MTBC annotation datasets

2.8.1 Coding sequence annotations

All gene annotations were based on the reference H37Rv genome sequence (Cole et al., 1998), using the most recent annotations from the Tuberculist database, release 24 (December 2011) (Lew et al., 2011). In total there are 4,015 protein-coding gene sequences, 13 pseudogenes, 45 tRNAs and 3 rRNAs.

2.8.2 Functional Categories

The genes can be classified based on the function of the encoded proteins. Using the Tuberculist database annotations there are ten functional categories, listed below (Lew et al., 2011):

1. virulence, detoxification and adaptation
2. lipid metabolism
3. information pathways
4. cell wall and cell processes
5. intermediary metabolism and respiration
6. unknown
7. regulatory proteins
8. conserved hypotheticals
9. insertion sequences and phages
10. PE/PPE

2.8.3 Essential M. tuberculosis genes

Definition of gene essentiality was based on experiments using transposon mutagenesis to generate single gene knockouts, followed by transposon site hybridization after growth on 7H11 agar or in mice (Sassetti et al., 2003; Sassetti & Rubin, 2003). On the basis of these studies a total of 760 genes fell into the category of essential genes and the remaining genes were classed as nonessential. This follows the same convention as Comas et al. (Comas et al., 2010).


2.9 Bioinformatics software

2.9.1 Artemis

The genome browsing and annotation tool, Artemis (Carver et al., 2008), from the Wellcome Trust Sanger Institute was used extensively throughout this work. Importantly, this tool enables new features to be overlaid onto published annotations, and the user plot function allows transcription data to be plotted against the genome.

2.9.2 Quality control of raw RNA-sequencing data

2.9.2.1 FastQC

Raw reads were first filtered to discard low quality reads, which speeds up mapping and increases the number of mapped reads. Raw fastq files deposited from the Illumina machine were inspected using FastQC version 0.9.3 (downloaded 20-6-11, Babraham Bioinformatics). FastQC provides a modular set of analyses in a GUI environment written in Java. The Phred quality scores across the read length (displayed as a box-and-whisker plot), the per-base N content, and over-represented Illumina primer sequences were used to determine whether a run had passed QC.

2.9.2.2 SolexaQA

Fastq files passing the initial QC were filtered using SolexaQA version 1.7 (Cox et al., 2010) (downloaded April 2011). SolexaQA is a Perl-based software package for quality analysis of Illumina data. The DynamicTrim.pl script within this package was used to remove poor quality bases from reads. Specifically, bases with Phred scores < 13 (corresponding to an error probability p > 0.05) were trimmed from the 5’ and 3’ ends of reads until all remaining bases met this threshold. The Perl scripts were run on a Linux server.
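The relationship between the Phred cutoff and the error probability quoted above follows from the definition Q = -10 log10(p). The sketch below shows this conversion, together with a simple trimming rule that keeps the longest contiguous run of bases at or above the cutoff. It is an illustration only and not the SolexaQA implementation; the function names and example quality values are hypothetical.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


def longest_segment_at_or_above(qualities, cutoff=13):
    """Return (start, end) of the longest contiguous run of bases whose
    quality is >= cutoff, as a half-open interval into the read."""
    best_start, best_end = 0, 0
    start = None
    for i, q in enumerate(qualities):
        if q >= cutoff:
            if start is None:
                start = i
            if i + 1 - start > best_end - best_start:
                best_start, best_end = start, i + 1
        else:
            start = None
    return best_start, best_end


# Hypothetical per-base qualities for one read.
read_qualities = [2, 30, 35, 12, 40, 40, 40, 5]
keep_start, keep_end = longest_segment_at_or_above(read_qualities)
```

A cutoff of 13 corresponds to an error probability just above 0.05, which is why bases below this score are treated as unreliable.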

Trimming of reads was performed with the command:

$ DynamicTrim.pl [in.fastq] -h 13


The resulting trimmed fastq file was then passed to the LengthSort.pl script, which removes reads that were of poor quality over a high percentage of their length and are therefore too short for mapping. The default parameter was used, removing reads < 25 bases:

$ perl LengthSort.pl [in.fastq] > [out.fastq]

2.9.3 Transcriptome mapping software

An analysis pipeline was created to manage the high throughput sequencing datasets generated by this study. Each file can contain about 150 million reads consisting of 10 Gigabases of sequence data. A reference based assembly was used for this study, and mapping was performed against the reference genome H37Rv using BWA (version 0.5.9) (Li & Durbin, 2009). The raw sequence data file in the fastq format (Cock et al., 2009) was mapped to the reference genome in fasta format using the following commands:

The reference sequence was indexed using bwa index:

$ bwa-0.5.9 index [in.fasta]

The reads in the fastq file were mapped to the indexed fasta using the following commands:

$ bwa-0.5.9 aln -I [in.reference.fasta] [in.fastq] > [out.fastq.sai]

$ bwa-0.5.9 samse [in.reference.fasta] [in.fastq.sai] [in.fastq] > [align.sam]

For later processing and storage, the mapped file in SAM format was converted to the binary BAM format using SAMtools (Li et al., 2009):

$ samtools view -bS [in.align.sam] > [out.align.bam]

The BAM file was sorted to further reduce storage size and indexed for viewing in Artemis:


$ samtools sort [in.align.bam] [out.align.sorted.bam]

$ samtools index [in.align.bam]

Basic mapping statistics after this stage were viewed using the SAMtools idxstats command.

$ samtools idxstats [in.align.bam]

Artemis plots were produced using the Unix paste command:

$ paste [genomeCoverageBed reverse strand.out] [genomeCoverageBed forward strand.out] > artemis.plot.out

2.9.4 Calculation of mapped read frequencies per feature region

Genome coverage of reads mapping to sense and antisense gene annotations and sRNAs was calculated using the BEDtools package (Quinlan & Hall, 2010). Specifically, the coverageBed and genomeCoverageBed utilities were used for extraction of gene regions and whole genome coverage plots respectively. BEDtools is based on four file formats widely used for HTS data: BED, GFF, VCF and SAM/BAM. Gene and intergenic annotations based on H37Rv were parsed into the BED (Browser Extensible Data) format using standard Linux command line tools. The BED format consists of one line per feature, each line containing a minimum of three tab-delimited fields: chr (chromosome name), chr start (start position), chr end (end position). Two of the optional fields were used in this study: name (feature name, e.g. gene name) and strand (either forward or reverse strand). These optional fields enable the calculation of the number of reads that map to either the coding (sense) or non-coding (antisense) strand of the gene in question.
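As an illustration of the BED layout described above, the following sketch converts a 1-based, inclusive gene annotation into a six-column BED line. It is not the parsing code used in this study. The BED specification uses 0-based, end-exclusive coordinates and places strand in the sixth column, so a placeholder score is written in the fifth column here. The function names, the example coordinates and the `is_sense` convention (reads reported on the same strand as the transcript, which depends on the library protocol) are assumptions.

```python
def gene_to_bed_line(chrom, start_1based, end_1based, name, strand, score=0):
    """Convert a 1-based inclusive annotation to a 6-column BED line.

    BED starts are 0-based and ends are exclusive, so only the start
    coordinate needs to be shifted by one.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid coordinates")
    fields = [chrom, str(start_1based - 1), str(end_1based),
              name, str(score), strand]
    return "\t".join(fields)


def is_sense(read_strand, gene_strand):
    """True if a read maps to the coding strand of the gene.

    Assumes reads are reported on the same strand as the transcript.
    """
    return read_strand == gene_strand


# Illustrative record; the coordinates are for demonstration only.
line = gene_to_bed_line("NC_000962", 1, 1524, "Rv0001", "+")
```

Writing one such line per gene and per intergenic region gives an annotation file of the kind supplied to coverageBed with the -b option.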

As described above, the coverageBed utility was used to identify the number of reads mapping to each annotated feature, such as a gene. The following command was used to count reads mapping to a specific strand in the BAM file, in this case the forward strand:

$ bamToBed -i [in.align.bam] | grep -w + | coverageBed -a stdin -b [annotations.bed] > [plus.strand.out]


The genomeCoverageBed utility provides a useful base-by-base output of read depth that can be imported into Artemis, and was also used in deletion analysis. The following command was used to report read depth at every position on the forward strand:

$ genomeCoverageBed -strand + -d -ibam [in.align.bam] -g [genome_length.bed] > [plus.strand.out]
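The per-base output produced with the -d option has three tab-separated columns: chromosome, 1-based position and read depth. As a sketch of how such output could feed a deletion analysis, the function below reports contiguous runs of low-depth positions. It is not the genomicDeletions.pl script described in Appendix A; the function name and the `min_length` and `max_depth` thresholds are hypothetical.

```python
def low_depth_runs(depth_lines, min_length=50, max_depth=0):
    """Find contiguous runs of positions with depth <= max_depth.

    depth_lines: iterable of 'chrom<TAB>position<TAB>depth' strings.
    Returns a list of (chrom, start, end) tuples with 1-based inclusive
    coordinates, keeping only runs of at least min_length positions.
    """
    runs = []
    current = None  # [chrom, start, end] of the run being extended

    def flush():
        # Store the current run if it is long enough.
        if current is not None and current[2] - current[1] + 1 >= min_length:
            runs.append(tuple(current))

    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, pos, depth = fields[0], int(fields[1]), int(fields[2])
        low = depth <= max_depth
        if (low and current is not None
                and current[0] == chrom and pos == current[2] + 1):
            current[2] = pos  # extend the current run
            continue
        flush()
        current = [chrom, pos, pos] if low else None
    flush()
    return runs


# Hypothetical depth profile: positions 3-6 and 8-9 have no coverage.
depths = [5, 5, 0, 0, 0, 0, 7, 0, 0, 9]
lines = ["chr\t%d\t%d" % (i + 1, d) for i, d in enumerate(depths)]
candidates = low_depth_runs(lines, min_length=3)
```

Runs found this way are only candidates: regions that are hard to map with short reads, such as repeats, can also show zero depth.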

2.9.5 R

R is an open source environment for statistical programming and analysis (R Development Core Team, 2008). The Bioconductor project, implemented in R, was used as it provides tools for the analysis and comprehension of high-throughput genomic data. Specific packages used are described in the Methods of Chapter 5 in relation to RNA-seq analysis.

2.9.6 Perl scripts

Ad hoc Perl scripts were written to aid in the parsing of flat file formats for use in programs such as Artemis and R. In addition to these, the Perl script genomicDeletions.pl was written to identify genomic deletions in genome sequencing data (Appendix A).

2.9.7 GraphPad Prism 5.0

GraphPad Prism 5.0c for OSX was used for the plotting and analysis of data. The software contains comprehensive statistical analysis and presentation tools.


Chapter 3 Lineage-specific SNPs

3.1 Introduction

Genetic variation within the M. tuberculosis complex (MTBC) is higher than previously recognised. From studies of Large Sequence Polymorphisms (LSPs), to targeted multi locus sequence analysis (MLSA), and finally whole genome sequencing (WGS), each method has provided a greater resolution of the genetic variation that exists between clinical isolates (Comas et al., 2010; Gagneux & Small, 2007; Hershberg et al., 2008). The most comprehensive set of phylogenetically representative strains sequenced using new high throughput sequencing (HTS) technology was published recently (Comas et al., 2010). For the first time all branches within the MTBC phylogenetic tree could be resolved, encompassing the six major MTBC phylogenetic lineages. Genome sequences of the twenty-one clinical strains sequenced in the previous study are publicly available, making this an ideal reference phylogeny on which to base further analyses. The genomes were sequenced at high depth (40 to 90-fold coverage) using the Illumina sequencing platform, making it possible to capture the most complete picture yet of MTBC nucleotide diversity.

Single Nucleotide Polymorphisms (SNPs) are the most common form of genetic variation in the MTBC and, driven by advances in sequencing technology, an extensive and ever-growing catalogue of SNPs amongst clinical isolates of M. tuberculosis has been identified (Comas et al., 2010; Stucki & Gagneux, 2012). As described in Chapter 1, analysis of SNPs in 89 genes from 99 human MTBC isolates provided strong evidence that human MTBC originated in Africa and accompanied the Out-of-Africa migrations of modern humans approximately 70,000 years ago (Hershberg et al., 2008). The six human MTBC lineages exhibit a strong global population structure (Gagneux et al., 2006a) and phenotypic diversity has been associated with the different MTBC lineages. This includes the ability to elicit an immune response in vivo (Portevin et al., 2011), and clinical associations with extrapulmonary tuberculosis (Kong et al., 2005; Kong et al., 2007). However, the role that MTBC genomic diversity plays in TB disease remains an open question, but one that can now be explored using a rational, data-driven approach (Coscolla & Gagneux, 2010).

Using available MTBC genome datasets, it is now possible to identify all SNPs that contribute to the background genetic variation of the six lineages. Due to the clonal population structure of the MTBC (Supply et al., 2003), the majority of this variation is expected to be exclusive to the lineage in question, and therefore absent from strains of all other lineages. This presents an opportunity to understand the nature of this lineage-specific variation, and is expected to provide insight into how the hypothesised reduced purifying selection in the MTBC has shaped the lineages (Hershberg et al., 2008).

3.1.1 Aims

The aim of the work presented in this chapter was to characterise whole genome variation within the MTBC at the lineage-specific level using M. tuberculosis and M. africanum clinical isolates. As the identification of lineage-specific SNPs is reliant on a representative phylogeny, the initial aim was to generate a robust phylogeny comprising strains sequenced using second-generation sequencing technology. Following generation of a robust phylogeny, specific aims of the analysis were to:

• identify lineage-specific SNPs from the main six lineages. These SNPs make up the basal branch of each lineage.
• gain insights into the evolution of the MTBC, focusing on the type and frequency of genetic changes within and across the phylogenetic lineages.
• measure the selective pressures on different gene function categories across the lineages.


3.2 Materials and Methods

3.2.1 Genome collection used in study

In total twenty-eight phylogenetically representative strains were used in this study. Twenty-seven were collected from previously published resources, either through deposited data in public databases or published studies (Comas et al., 2010). Accession numbers are as follows: SRP001137, SRA009341, SRA009367, SRA008875, SRA009637. An additional strain was sequenced as part of this study (Lineage 2 strain N0031). Data have been deposited in the EBI SRA under the accession number ERX192819. Details of the strains, country of isolation, and metrics from the mapping performed for this study are shown in Table 3.1.

3.2.2 Genome sequencing

Genomic DNA for N0031 was extracted using the CTAB method described [previously in Methods], and 2µg DNA was used for sequencing on the Illumina HiSeq platform. Sequencing libraries were constructed using the Epicentre Nextera DNA kit according to the manufacturer’s instructions. Paired-end 75 base read sequencing was performed in a single Illumina flowcell lane as part of a multiplexed run. In total 10.6 million reads were generated, corresponding to an average sequence depth of 180 reads.

3.2.3 Mapping genome sequences

MAQ (Li et al., 2008) was used to map the reads produced by the Illumina sequencer to the reference genome. The most recent common ancestor of the MTBC was used as the reference sequence as described previously (Comas et al., 2010). This sequence is based on the H37Rv genome (NC_000962) but substitutes H37Rv alleles with those of the reconstructed common ancestor of the strains. Standard MAQ parameters were used, removing SNPs with a Phred score <30, read depth of <5, and non-unique matches. A non-redundant list of variable positions called with high confidence in at least one strain was constructed and used to recover the base call in all other strains. SNPs and indels called within repetitive regions (genes annotated as PE/PPE/insertions/phages) were removed.
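The filtering thresholds and the construction of a non-redundant list of variable positions can be sketched as follows. This is an illustration of the logic only; the dictionary keys and strain labels are hypothetical and do not reflect the MAQ output format.

```python
def passes_filters(call, min_phred=30, min_depth=5):
    """Keep a SNP call only if it meets the quality and depth thresholds
    and comes from uniquely mapped reads."""
    return (call["phred"] >= min_phred
            and call["depth"] >= min_depth
            and call["unique"])


def non_redundant_positions(calls_by_strain):
    """Return the sorted set of positions with a high-confidence SNP
    call in at least one strain."""
    positions = set()
    for calls in calls_by_strain.values():
        for call in calls:
            if passes_filters(call):
                positions.add(call["pos"])
    return sorted(positions)


# Hypothetical calls for two strains.
calls_by_strain = {
    "strainA": [
        {"pos": 100, "phred": 40, "depth": 10, "unique": True},
        {"pos": 200, "phred": 20, "depth": 10, "unique": True},   # low quality
    ],
    "strainB": [
        {"pos": 100, "phred": 35, "depth": 8, "unique": True},
        {"pos": 300, "phred": 35, "depth": 4, "unique": True},    # low depth
        {"pos": 400, "phred": 50, "depth": 20, "unique": False},  # non-unique
        {"pos": 500, "phred": 60, "depth": 30, "unique": True},
    ],
}
variable_positions = non_redundant_positions(calls_by_strain)
```

The resulting position list would then be used to look up the base call in every strain, including strains in which the position was not confidently called as a SNP.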

3.2.4 Phylogenetic analysis

Phylogenetic analysis was based on filtered SNPs detected when each strain was compared against the most recent common ancestor of the sequences, as explained above (section 3.2.3). Concatenated SNPs from 13,086 variable genomic positions were used to infer the phylogenetic relationships between strains using the neighbour-joining method. Both coding and noncoding positions were included. The resulting tree was generated with MEGA (Tamura et al., 2011), using 1000 bootstrap replications for clade support and the observed number of substitutions as the measure of genetic distance. Where SNP calls were missing from individual strains, pairwise deletion was performed, so that missing data were ignored only in the specific comparison affected. As an outgroup, the distantly related M. canetti (strain K116) was used to root the tree. For presentation purposes the branch length of the M. canetti outgroup was reduced by only including SNP positions shared by the MTBC and M. canetti. Trees in Newick format were imported into FigTree v1.3.1, a graphical viewer of phylogenetic trees and a program for producing publication-ready figures. FigTree was downloaded from: http://tree.bio.ed.ac.uk/software/figtree/.
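The distance measure used for the tree (observed number of differences, with pairwise deletion of missing calls) can be illustrated with a short sketch. This is not the MEGA implementation; the function names, the characters treated as missing, and the example sequences are assumptions.

```python
def pairwise_snp_distance(seq_a, seq_b, missing="N-?"):
    """Count differing positions between two concatenated SNP strings.

    Positions where either strain has a missing call are skipped for
    this comparison only (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b)
               if a not in missing and b not in missing and a != b)


def distance_matrix(alignment):
    """Build a symmetric distance lookup keyed by (strain, strain)."""
    names = sorted(alignment)
    return {(x, y): pairwise_snp_distance(alignment[x], alignment[y])
            for x in names for y in names}


# Hypothetical concatenated SNP alleles for three strains.
alignment = {
    "strain1": "ACGTN",
    "strain2": "ACCTA",
    "strain3": "TCCTA",
}
matrix = distance_matrix(alignment)
```

Because deletion is pairwise, the missing call in strain1 is excluded from comparisons involving strain1 but still contributes to the comparison between strain2 and strain3.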

3.2.5 Categorising SNPs

SNPs were categorised as nonsynonymous (an amino acid change) or synonymous (no change) using snpEff (Cingolani et al., 2012). Source code was downloaded from: https://snpeff.svn.sourceforge.net/svnroot/snpeff/SnpEffect/trunk, and run as a local installation. As input snpEff takes two files: a database for the reference genome, and a SNP file in the Variant Call Format (VCF). It was necessary to generate a custom reference database based on the ancestral genome sequence of the MTBC. The database was built within snpEff using the package’s command line modules: the ancestral sequence in fasta format was parsed into the Gene Transfer Format version 2.2 (GTF 2.2), using the Tuberculist database gene annotations, version 22 (May 2011), to define regions encoding genes. Annotation of SNPs by functional category was based on the Tuberculist database. Genes are grouped into ten functional categories as described previously (section 2.8.2).

3.2.6 dN/dS calculation

dN/dS was calculated by dividing dN by dS. dN is calculated by dividing the number of nonsynonymous SNPs by the total number of potential nonsynonymous sites in coding sequences, and dS by dividing the number of synonymous SNPs by the total number of potential synonymous sites in coding sequences. Due to the low number of SNPs in the MTBC, dN/dS was not calculated per gene; instead, gene concatenates were generated based on two classifications: firstly, genes defined as essential or nonessential on the basis of transposon screens (Sassetti et al., 2003; Sassetti & Rubin, 2003), and secondly, the Tuberculist gene functional categories. For each concatenate, the Nei-Gojobori method as implemented in SNAP was used to define synonymous and nonsynonymous substitutions by pairwise comparison with the inferred ancestral genome (Korber, 2000).
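The ratio described above can be written out as a short worked example. The sketch below computes the uncorrected proportions exactly as defined in the text and, optionally, applies the Jukes-Cantor correction for multiple substitutions that forms part of the Nei-Gojobori method. It is not the SNAP implementation; the function names and the SNP and site counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(nonsyn_snps, nonsyn_sites, syn_snps, syn_sites, correct=False):
    """Return dN/dS for one gene concatenate.

    nonsyn_sites and syn_sites are the numbers of potential
    nonsynonymous and synonymous sites in the concatenate.
    """
    p_n = nonsyn_snps / nonsyn_sites
    p_s = syn_snps / syn_sites
    if correct:
        p_n, p_s = jukes_cantor(p_n), jukes_cantor(p_s)
    return p_n / p_s


# Hypothetical counts for one concatenate.
uncorrected = dn_ds(60, 3000.0, 30, 1000.0)
corrected = dn_ds(60, 3000.0, 30, 1000.0, correct=True)
```

With the low divergence typical of the MTBC the correction changes the ratio only slightly, since both proportions are far from saturation.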

Table 3.1. Twenty eight strains used in this study. Patient place of birth and strain isolation given. Depth of coverage and number of filtered SNPs relative to the reference H37Rv is shown.

Strain name   Alternative name(1)  Lineage     Patient place of birth  Country of isolation  Average mapped depth  Number of reads  Percent genome coverage(2)  Filtered SNPs  Study source of genome
MTB_95_0545   N0032                Lineage 1   Laos                    San Francisco         77.37                 7,621,946        99.75                       1,834          Comas et al. (2010)
MTB_T17       N0121                Lineage 1   The Philippines         San Francisco         72.59                 7,130,412        99.36                       1,867          Comas et al. (2010)
N0157         MTB_T92              Lineage 1   The Philippines         San Francisco         46.01                 5,068,053        98.85                       1,883          Comas et al. (2010)
MTB_K21       -                    Lineage 1   Zimbabwe                San Francisco         77.99                 7,112,888        99.29                       1,937          Comas et al. (2010)
MTB_K67       -                    Lineage 1   Comoro Islands          San Francisco         78.29                 7,097,284        98.95                       1,910          Comas et al. (2010)
MTB_K93       -                    Lineage 1   Tanzania                San Francisco         65.52                 6,017,391        99.22                       1,883          Comas et al. (2010)
N0070         EAS050               Lineage 1   Indonesia               San Francisco         55.07                 3,421,436        99.04                       1,290          Unpublished (Comas et al., 2013)
N0072         EAS053               Lineage 1   India                   San Francisco         59.49                 3,696,378        98.89                       1,934          Unpublished (Comas et al., 2013)
N0153         MTB_T83              Lineage 1   Vietnam                 San Francisco         61.56                 3,573,058        97.18                       1,854          Broad Institute (SRA009341)
MTB_00_1695   N0001                Lineage 2   Japan                   San Francisco         77.92                 7,394,236        99.02                       1,280          Comas et al. (2010)
N0031         MTB_94_M4241A        Lineage 2   China                   San Francisco         179.69                21,138,728       99.23                       1,229          This study (SRA TBC)
N0052         MTB_98_1833          Lineage 2   China                   San Francisco         64.49                 6,395,114        99.10                       1,279          Comas et al. (2010)
MTB_M4100A    N0110                Lineage 2   South Korea             San Francisco         40.47                 4,022,290        98.94                       1,276          Comas et al. (2010)
N0145         MTB_T67              Lineage 2   China                   San Francisco         78.77                 7,616,603        98.73                       1,293          Comas et al. (2010)
MTB_T85       N0155                Lineage 2   China                   San Francisco         61.65                 6,159,284        99.04                       1,305          Comas et al. (2010)
MTB_91_0079   N0022                Lineage 3   Ethiopia                San Francisco         74.03                 7,228,038        99.14                       1,271          Comas et al. (2010)
MTB_SG1       N0114                Lineage 3   India                   San Francisco         66.34                 3,850,822        99.25                       1,330          Broad Institute (SRA009637)
MTB_K49       -                    Lineage 3   Tanzania                San Francisco         75.52                 6,845,266        99.25                       1,263          Comas et al. (2010)
H37Rv         -                    Lineage 4   USA                     -                     Reference             -                -                           -              -
MTB_4783_04   -                    Lineage 4   Sierra-Leone            San Francisco         78.12                 7,466,814        98.78                       733            Comas et al. (2010)
MTB_GM_1503   -                    Lineage 4   The Gambia              San Francisco         82.26                 7,891,933        99.08                       791            Comas et al. (2010)
MTB_K37       -                    Lineage 4   Uganda                  San Francisco         59.86                 5,480,451        98.85                       661            Comas et al. (2010)
MTB_Erdman    -                    Lineage 4   -                       -                     32.69                 4,333,184        98.22                       862            Broad Institute (SRA008875)
MTB_KZN_K605  -                    Lineage 4   South Africa            South Africa          93.51                 11,458,643       99.52                       771            Broad Institute (SRA009637)
MAF_11821_03  -                    Lineage 5   Sierra-Leone            San Francisco         78.22                 7,491,737        99.02                       1,959          Comas et al. (2010)
MAF_5444_04   -                    Lineage 5   Ghana                   San Francisco         79.75                 7,578,690        98.92                       1,959          Comas et al. (2010)
MAF_4141_04   -                    Lineage 6   Sierra-Leone            San Francisco         72.62                 7,027,143        98.61                       2,045          Comas et al. (2010)
MAF_GM_0981   -                    Lineage 6   The Gambia              San Francisco         76.39                 7,350,873        99.00                       2,065          Comas et al. (2010)
MTB_K116      -                    M. canetti  Djibouti                Djibouti              93.01                 6,544,254        96.32                       1,018          Comas et al. (2010)

(1) Alternative strain name as used in previous MLSA and genome study is included to preserve the link to a new systematic naming convention used in the strain collection (Comas et al., 2010; Hershberg et al., 2008).
(2) Based on H37Rv reference genome.


3.3 Results

3.3.1 A globally representative 28-genome human-adapted MTBC phylogeny

To identify and extract all lineage-specific SNPs, a representative genome collection was built from previously published and newly sequenced M. tuberculosis strains (Table 3.1). This set of genomes formed the dataset used to identify the lineage-specific SNPs analysed in this study; a subset of these strains is also followed in Chapter 5 using a transcriptomic approach (RNA-sequencing). The majority of the strains used in this phylogeny were published by Comas et al. (2010), consisting of twenty-one genomes sequenced on the Illumina platform. As previously reported, these genomes have a mean 72-fold sequence depth, with 98.9% coverage of the reference genome (Comas et al., 2010). A further six genome sequences were downloaded from the European Nucleotide Archive (ENA), and the last strain, N0031, was sequenced as part of this study. Strain N0031 was included in the previous MLSA study and is therefore known to be a rare Lineage 2 strain that is ancestral to the Beijing sub-group (Hershberg et al., 2008). For this reason the strain was selected for sequencing to capture the greatest possible within-lineage diversity. All strains were sequenced using the Illumina platform with a minimum 32-fold average sequence depth (Table 3.1).

Using the H37Rv genome as reference, a mapping assembly was built for the twenty-eight strains using MAQ (Li et al., 2008). SNPs were filtered out if they had low associated Phred quality scores or read numbers, or if they fell within annotated repeat regions such as PE/PPE regions (see 3.2.3). Such regions are families of genes encoding proteins carrying Proline-Glutamic acid (PE) or Proline-Proline-Glutamic acid (PPE) motifs near the N-terminus (Cole et al., 1998), and are inherently difficult to map using short read technology such as Illumina. In total 39,764 SNPs were identified in the strains relative to the reference, and the number of filtered SNPs per strain is shown in Table 3.1. Many of these SNPs are present in more than one strain, leaving a high level of redundancy in the SNP lists. A non-redundant list of SNPs was constructed, highlighting 13,088 nucleotide positions that were variable across the 4.4Mb genome. Each of these positions therefore harbours a SNP in one or more of the 28 strains, and the set was subsequently used to derive a genome-wide phylogeny. A Neighbour-Joining phylogeny, constructed using MEGA5 (Tamura et al., 2011), is shown in Figure 3.1.

Strains group into six main phylogenetic lineages, with bootstrap values indicating strong statistical support (Figure 3.1). The phylogenetic structure and strain groupings are completely congruent with the most recent whole genome based phylogeny (Comas et al., 2010), and with previous MLSA and gene deletion based phylogenies (Comas et al., 2010; Gagneux et al., 2006a; Hershberg et al., 2008). The lineage colouring scheme used in previous studies is retained here and, where applicable, throughout the thesis (Comas et al., 2010; Hershberg et al., 2008). Naming of lineages from 1 to 6 follows the convention of Comas et al. (2010). Mycobacterium canetti (strain K116) was used to root the phylogenetic tree, as it is the closest known relative of the MTBC (Gutierrez et al., 2005). The number of SNPs has been artificially reduced for M. canetti in the phylogeny (Figure 3.1). This reduction was performed for presentation purposes because of the large number of singletons separating M. canetti from the MTBC strains used in this study. For example, there are 12,319 SNPs between the reconstructed most recent common ancestor of the MTBC (see section 3.2.3) and the M. canetti genome sequence, compared to the ~1,500 SNPs between any MTBC strain and the ancestral sequence.

[Figure 3.1: phylogenetic tree with bootstrap support values at nodes. Tip labels by lineage: Lineage 1 (MTB_N0157, MTB_T17, MTB_95_0545, MTB_N0153, MTB_K21, MTB_K67, MTB_K93, MTB_N0070, MTB_N0072); Lineage 2 (MTB_N0145, MTB_T85, MTB_00_1695, MTB_N0052, MTB_M4100A, MTB_N0031); Lineage 3 (MTB_91_0079, MTB_K49, MTB_SG1); Lineage 4 (MTB_erdman, MTB_GM_1503, MTB_KZN_605, MTB_H37Rv, MTB_4783_04, MTB_K37); Lineage 5 (MAF_11821_03, MAF_5444_04); Lineage 6 (MAF_4141_04, MAF_GM_0981); outgroup MCAN_K116. Scale bar: 200 SNPs.]

Figure 3.1. Neighbour-joining phylogeny based on 13,088 variable common nucleotide positions across 28 human-adapted MTBC genome sequences. Scale bar shows the number of SNPs. The six lineages are coloured as defined previously (Hershberg et al., 2008). The root has been truncated due to the large numbers of changes that separate M. canetti from the rest of the phylogeny. Node support after 1,000 bootstrap replications with all nodes > 75. M. canetti strain K116 was used as the phylogenetic outgroup.

Table 3.2. Estimates of evolutionary divergence between strains. Pairwise SNP distances in the 28 genome phylogeny. Strain names in the matrix are the same as in Figure 3.1.

[Table 3.2: pairwise SNP distance matrix between the strains of the phylogeny and the M. canetti outgroup (MCAN K116).]


Across- and within-lineage genetic diversity was next investigated using the phylogeny. A SNP distance matrix was constructed based on the number of base differences per pairwise strain comparison (Table 3.2). Across the phylogeny the average number of SNPs per pairwise comparison is 1544, which translates to an average of one SNP per 2.857 kb of sequence, based on the length of the H37Rv reference genome (4.411532 Mb). In contrast, comparisons with the phylogenetic outgroup M. canetti showed on average one SNP per 0.358 kb of sequence, a SNP density nearly 8 times higher than that within the MTBC.
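The SNP density figures quoted above follow from simple arithmetic on the values reported in the text, as the following worked sketch shows. The function and variable names are illustrative.

```python
GENOME_LENGTH_BP = 4_411_532  # H37Rv reference genome length (4.411532 Mb)


def kb_per_snp(n_snps, genome_length_bp=GENOME_LENGTH_BP):
    """Average spacing between SNPs, in kilobases."""
    return genome_length_bp / n_snps / 1000


# Average of 1544 SNPs per pairwise comparison within the MTBC.
mtbc_spacing_kb = kb_per_snp(1544)

# Ratio of the two reported spacings (MTBC versus M. canetti comparisons).
fold_difference = 2.857 / 0.358
```

Here `mtbc_spacing_kb` rounds to 2.857 kb and `fold_difference` to 7.98, consistent with the statement that SNP density is nearly 8 times higher in comparisons with M. canetti.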

Within-lineage variation was next measured by taking the average of all pairwise comparisons between strains of each lineage (Figure 3.2). Average within-lineage diversity ranged from 397 SNPs (sd=36) between Lineage 3 strains to 811 SNPs (sd=193) between Lineage 1 strains. Lineages 2 and 1 have the greatest within-lineage variation, with standard deviations of 222 and 193 SNPs respectively. This is nearly twice that of Lineage 4 (sd=104) and over five times the variation seen in Lineage 3 (sd=36). Lineage 1 also has the greatest number of genome sequences in the phylogeny (9 strain genomes). This might indicate a discovery bias, where an increasing number of genome sequences uncovers more within-lineage variation. Whilst this cannot be ruled out, there was no significant correlation between the number of strains per lineage and average within-lineage variation (Pearson r = 0.73, p = 0.10). Furthermore, the M. africanum lineages (Lineages 5 and 6) had the fewest representative strains sequenced at the time of this study, owing to the restricted number of strains available, but their diversity is still greater than that of Lineage 3, with Lineage 6 diversity comparable to all but Lineage 1. Overall it would appear that Lineages 1 and 2 have the greatest within-lineage SNP diversity.
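The reported significance level can be checked from the correlation coefficient alone, under the assumption that the correlation was computed over six data points (one per lineage), giving 4 degrees of freedom. The sketch below converts r to a t statistic and uses the closed-form cumulative distribution of Student's t for 4 degrees of freedom; the function names are illustrative and the closed form applies to this number of degrees of freedom only.

```python
import math


def pearson_t_statistic(r, n):
    """t statistic for testing a Pearson correlation r from n points."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def two_tailed_p_df4(t):
    """Two-tailed p-value for Student's t with exactly 4 degrees of
    freedom, using the closed-form cumulative distribution."""
    t = abs(t)
    u = 1 + t * t / 4
    cdf = 0.5 + 0.375 * (t / math.sqrt(u)) * (1 - t * t / (12 * u))
    return 2 * (1 - cdf)


# r = 0.73 as reported; n = 6 lineages is an assumption.
t_value = pearson_t_statistic(0.73, 6)
p_value = two_tailed_p_df4(t_value)
```

Under these assumptions the p-value rounds to 0.10, in agreement with the value reported in the text.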


[Figure 3.2 plot: y-axis, number of SNPs (0 to 1200); x-axis, Lineages 1, 5 and 6 (Ancient) and Lineages 2, 3 and 4 (Modern).]

Figure 3.2. Within-lineage SNP diversity. The number of SNPs per pairwise comparison of all strains per lineage. Lineages are ordered by Ancient and Modern groups. Error bars indicate mean and standard deviation (sd). There was not a significant correlation between the number of strains per lineage and average within-lineage variation (Pearson r = 0.73, p = 0.10).


3.3.2 Identification of all lineage-specific SNPs

Using the underlying connections from the derived whole genome phylogeny, it was possible for the first time to identify and extract all SNPs that are common to all strains from each of the six lineages. Due to the clonal nature of the MTBC (Supply et al., 2003), SNPs within these branches are largely exclusive to the respective lineage. All alleles on the derived phylogeny were traced throughout the tree and the nodes for each lineage branch were used to isolate all SNPs that contribute to this branch (Figure 3.3A). For example, the 163 SNPs between node 5 and 7 define Lineage 4 strains (red lineage), and in all but a few rare cases are exclusive to the lineage. The SNPs were subsequently defined as lineage-specific, and form the main dataset for the following analysis; SNPs found in more than one lineage branch represent homoplasic nucleotide positions and are described later in section 3.3.4.
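The idea of a lineage-specific SNP can be illustrated with a deliberately simplified sketch. It does not reproduce the ancestral state reconstruction on the phylogeny used in this study; instead it assumes a known ancestral allele per position and calls a SNP lineage-specific when the derived allele is carried by every strain of exactly one lineage and by no other strain. Shared branches such as the modern lineage branch would need the tree-based approach and are not captured here. All strain names, positions and alleles are hypothetical.

```python
# Simplified stand-in for the tree-based extraction: a SNP is called
# lineage-specific when all strains of one lineage, and only those strains,
# differ from the ancestral allele. All data below are HYPOTHETICAL.

lineage_of = {"s1": "L4", "s2": "L4", "s3": "L1", "s4": "L1"}
ancestral = {100: "A", 200: "C", 300: "G", 400: "T"}
alleles = {
    100: {"s1": "G", "s2": "G", "s3": "A", "s4": "A"},  # all of L4 only
    200: {"s1": "T", "s2": "C", "s3": "C", "s4": "C"},  # singleton, excluded
    300: {"s1": "A", "s2": "A", "s3": "A", "s4": "G"},  # spans two lineages
    400: {"s1": "T", "s2": "T", "s3": "C", "s4": "C"},  # all of L1 only
}


def lineage_specific_snps(alleles, ancestral, lineage_of):
    """Return {lineage: [positions]} for SNPs fixed in exactly one lineage."""
    result = {}
    for position, calls in alleles.items():
        derived = {s for s, base in calls.items() if base != ancestral[position]}
        if not derived:
            continue
        lineages = {lineage_of[s] for s in derived}
        if len(lineages) != 1:
            continue  # derived allele seen in more than one lineage
        lineage = next(iter(lineages))
        members = {s for s, name in lineage_of.items() if name == lineage}
        if derived == members:  # carried by every strain of the lineage
            result.setdefault(lineage, []).append(position)
    return result


specific = lineage_specific_snps(alleles, ancestral, lineage_of)
print(specific)
```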

In total 2,794 lineage-specific SNPs were identified (Figure 3.3B), and these are distributed throughout the genome, as shown in Figure 3.4 (the full list is shown in Appendix B). Lineage-specific SNP frequencies range from 124 (Lineage 2) to 698 (Lineage 5). The highest numbers of lineage-specific SNPs are in the two M. africanum lineages (Lineages 5 and 6). In addition to the six lineages, SNPs from the relatively long phylogenetic branch that is basal to the three modern lineages (Lineages 2, 3 and 4) have also been included in this study (Figure 3.3B). This branch defines the three modern lineages and consists of 319 SNPs. From here on this branch is called the modern lineage branch.


Figure 3.3. Isolating lineage-specific SNPs from the phylogeny. A. Ancestral states were reconstructed at each node of the tree to extract the SNPs belonging to the lineage branches, the so-called lineage-specific SNPs. For example, 163 SNPs between node 5 and 7 define Lineage 4 strains (red lineage). B. All SNPs identified from the lineage branches of the six lineages, including the Modern lineage branch (coloured in black), which defines the three modern lineages. Arrows show the number of lineage-specific coding and noncoding SNPs. Scale bar at bottom indicates number of SNPs.


Figure 3.4. Distribution of the lineage-specific SNPs across the genome. Genes on forward and reverse strands shown in outer rings as blue and red respectively. Mapped lineage-specific SNPs depicted in six inner rings, with the SNP colouring based on lineage phylogeny colours. From the innermost ring: Lineage 4, 3, 2, 1, 6, and 5. Genome structure and size based on H37Rv.


3.3.3 Distribution of SNPs

The most recent M. tuberculosis annotations at the time of this study (Tuberculist database release 24) were used to classify the lineage-specific SNPs as non-coding (intergenic SNPs) or coding. The average percentage of SNPs falling into these two regions across all the lineages is shown in Figure 3.5. It can be seen that the vast majority of SNPs (86.4%) fall within annotated coding regions. This is not unexpected, as the percentage of the M. tuberculosis genome annotated as coding is 91.3% (based on the H37Rv reference). However, after adjusting for the difference in sequence length between coding and non-coding regions, SNPs are not distributed equally between the two, with a nearly 2-fold higher SNP density in intergenic regions (1.0 SNPs per kb of intergenic sequence compared to 0.6 SNPs per kb of coding sequence) (χ² test, p < 0.0001). This may not be surprising, as SNPs in coding regions are more likely to be removed through purifying selection; the selective pressures acting on the coding regions are investigated later in the chapter (section 3.3.7). Coding SNPs can be further divided into those that cause a change in the amino acid encoded by the codon (a nonsynonymous SNP) and those that cause no change in the amino acid (a synonymous SNP). On average 55% of all SNPs are nonsynonymous, as shown in Figure 3.5. Table 3.3 shows the frequency of SNP types for each lineage. Although rare, nonsynonymous SNPs were also found to cause the introduction of a stop codon (1.3% of all SNPs), and these were found across all lineages (Table 3.3). Conversely, three nonsynonymous SNPs removed an existing stop codon, contributing approximately 0.1% of all lineage-specific SNPs.
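The length adjustment described above can be sketched as follows. The SNP counts are taken from the totals in Table 3.3 (379 intergenic SNPs out of 2,794) and the coding fraction from the text. The test shown is a chi-square goodness-of-fit against the proportions expected from sequence length alone; this is an assumption, since the exact form of the test used in the study is not stated, and variable names are illustrative.

```python
import math

GENOME_BP = 4_411_532      # H37Rv genome length
CODING_FRACTION = 0.913    # proportion of the genome annotated as coding
intergenic_snps = 379      # sum of the Intergenic row of Table 3.3
coding_snps = 2794 - intergenic_snps

coding_kb = GENOME_BP * CODING_FRACTION / 1000
intergenic_kb = GENOME_BP * (1 - CODING_FRACTION) / 1000

coding_density = coding_snps / coding_kb              # SNPs per kb coding
intergenic_density = intergenic_snps / intergenic_kb  # SNPs per kb intergenic

# Chi-square goodness-of-fit: are SNPs split in proportion to sequence length?
total = coding_snps + intergenic_snps
expected_coding = total * CODING_FRACTION
expected_intergenic = total - expected_coding
chi2 = ((coding_snps - expected_coding) ** 2 / expected_coding
        + (intergenic_snps - expected_intergenic) ** 2 / expected_intergenic)
# Upper-tail probability of a chi-square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"coding: {coding_density:.1f} SNPs/kb, intergenic: {intergenic_density:.1f} SNPs/kb")
print(f"chi-square = {chi2:.1f}, p = {p_value:.2e}")
```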

The direction of amino acid change was determined using the reconstructed ancestral sequence of the MTBC. This sequence is similar to the H37Rv genome in structure and has the same nucleotide length, but with H37Rv alleles substituted by those inferred from a reconstruction of the ancestral states using the derived phylogeny (section 3.2.4). Inference of the ancestral alleles is possible because the chromosome is effectively a single linkage group and all descendants share characteristics of the single ancestral cell (Comas et al., 2010). Using the ancestral sequence is therefore advantageous, as it enables the evolutionary direction of nucleotide change to be determined rather than inferring the change relative to the reference strain H37Rv, which can be problematic because H37Rv is a Lineage 4 strain.


[Figure 3.5 pie chart. Segments: Intergenic, Non-synonymous, Stop gain, Stop loss, Synonymous.]

Figure 3.5. The average number of lineage-specific SNPs broken down into non-coding and coding types. Coding SNPs are further subdivided into synonymous, nonsynonymous, and nonsynonymous SNPs that affect stop codons, either through the introduction of a stop codon in a coding sequence (stop gain) or the removal of an existing stop codon (stop loss).

Table 3.3. Summary of lineage-specific SNPs. The total includes the nonsynonymous SNPs indicated in the table that affect stop codons, either through the introduction of a stop codon in a coding sequence (stop gain) or the removal of an existing stop codon (stop loss).

SNP type        Lineage 1  Lineage 5  Lineage 6  Lineage 2  Lineage 3  Lineage 4  Modern lineage
Intergenic      59         90         86         16         53         18         57
Nonsynonymous   248        395        381        75         183        99         184
Stop gain       8          10         6          1          3          3          5
Stop loss       0          0          0          0          2          0          1
Synonymous      156        213        207        33         117        46         78
Total SNPs      463        698        674        124        353        163        319


There were 1,556 genes (38.7% of all annotated genes) with one or more lineage SNP. Three quarters (75.1%) of the genes with a lineage SNP harboured a single SNP (Figure 3.6A). The number of SNPs per gene ranged from 0 to a maximum of 8 and followed a Poisson distribution, suggesting that there is no clustering of SNPs at the gene-specific level (Figure 3.6B). The single gene with the highest frequency of SNPs (Rv2524c, fas) encodes a probable fatty acid synthase and has multiple SNPs present in Lineages 4, 5 and 6. Typical of lipid-associated genes in M. tuberculosis, fas is quite long at 9.21 kb, compared to the average M. tuberculosis gene length of 1.0 kb. This is likely the cause of the high number of SNPs, and plotting the nucleotide length of all genes with a lineage-specific SNP against SNP frequency revealed a positive correlation (Pearson r = 0.43, p < 0.0001), which is shown in Figure 3.6C.

[Figure 3.6 plots. A. Number of genes (y-axis) against number of SNPs per gene (x-axis, 0 to 8); bar labels 2464, 1000, 377, 109, 51, 11, 5, 2 and 1. B. Number of genes (log10 scale) against number of SNPs per gene. C. Gene length (nucleotides) against number of SNPs per gene.]

Figure 3.6 Distribution of lineage SNPs per gene. A. Frequency of SNPs per gene, with actual SNP numbers recorded at top of bars. B. Poisson model (shown in red) fitted to the data. The y-axis is plotted as a log10 scale to better show the SNP distribution. C. Correlation between the number of SNPs per gene and gene length.
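A Poisson model of the kind shown in Figure 3.6B can be fitted by setting the Poisson mean to the observed mean number of SNPs per gene. The sketch below is illustrative only. The gene counts are read from the bar labels of Figure 3.6A as they survive in this copy of the text, and their assignment to 0 to 8 SNPs per gene is an assumption (the counts for one or more SNPs do sum to the 1,556 genes quoted in the text). The fitting procedure actually used for the figure is not stated.

```python
import math

# Genes carrying k lineage-specific SNPs, k = 0..8.
# ASSUMPTION: counts read from the bar labels of Figure 3.6A.
observed = {0: 2464, 1: 1000, 2: 377, 3: 109, 4: 51, 5: 11, 6: 5, 7: 2, 8: 1}

n_genes = sum(observed.values())
n_snps = sum(k * count for k, count in observed.items())
lam = n_snps / n_genes  # maximum-likelihood estimate of the Poisson mean


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k events under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


expected = {k: n_genes * poisson_pmf(k, lam) for k in observed}

for k in sorted(observed):
    print(f"{k} SNPs: observed {observed[k]:5d}, expected {expected[k]:8.1f}")
```

Comparing the observed and expected columns, for example with a goodness-of-fit test, is one way to judge how well the Poisson assumption holds.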


3.3.4 Monomorphic population structure and homoplasic SNPs

The MTBC displays a highly clonal population structure (Supply et al., 2003; Hirsh et al., 2004). Consistent with this structure, a negligible degree of homoplasy was observed in the lineages. Of the 2,794 lineage-specific SNPs identified, four were homoplasic, corresponding to 0.14% of the lineage SNPs (Table 3.4). These SNPs have the same nucleotide change across two or more lineages, and three of the four cause synonymous changes.

As shown in Table 3.4, the first homoplasy (SNP 1), at genomic position 1480945, introduces a synonymous C to G mutation into codon 519 of Rv1319c, which encodes a possible adenylate cyclase (Cole et al., 1998). This mutation occurs in all Lineage 3 and Lineage 5 strains (Figure 3.7A), indicating convergent evolution of this nucleotide position between an ancient and a modern lineage. Interestingly, homoplasy 2 also occurs in Rv1319c, at position 1480948, three nucleotides from the first homoplasy. Furthermore, it occurs in the same lineages (Lineages 3 and 5), causing a synonymous C to T mutation in the preceding codon (codon 518). Inspection of the MAQ alignment files confirmed that this was not an artefact of poor sequencing over this region: the surrounding 100 bp region in strains from Lineages 3 and 5 was mapped with high confidence, shown by MAQ quality scores of 1.0 (Heng, 2008). If an insertion or deletion were present, this could cause erroneous SNPs to be called in close proximity, but it would also cause a loss in the associated MAQ quality scores for the region, and this was not found to be the case. Together this suggests that these two SNPs have been called with high confidence, and that the two homoplasies are likely true.

The third and fourth homoplasic SNPs occur in Rv2082, which encodes a conserved hypothetical protein. The two homoplasies are present in Lineages 1, 2 and 6, introducing synonymous (A94A) and nonsynonymous (T96A) SNPs (Figure 3.7B). Again these are in modern and ancient lineages, and located closely together, this time within four nucleotides of each other. The gene is a conserved hypothetical with no known function, but independent mutation of the same allele across three lineages might suggest biological relevance. Although not within the lineage branch, some strains from Lineage 4, including H37Rv, also have these two homoplasies as a sub-lineage homoplasy.


Table 3.4. Homoplasic nucleotide positions within the lineage branches. Independent mutation of the same nucleotide position occurring across the phylogenetic tree. SNP position based on the reference strain H37Rv genome coordinates.

Homoplasy  Gene     SNP position  Ancient allele  SNP allele  Mutation  Lineages  Gene product
1          Rv1319c  1480945       C               G           T519T     3, 5      adenylate cyclase
2          Rv1319c  1480948       C               T           E518E     3, 5      adenylate cyclase
3          Rv2082   2338990       C               G           A94A      1, 2, 6   hypothetical protein
4          Rv2082   2338994       A               G           T96A      1, 2, 6   hypothetical protein


Figure 3.7. Homoplasic lineage SNPs. A. Homoplasy 1 and 2 occur in Rv1319c in Lineages 3 and 5. B. Homoplasy 3 and 4 occur in Rv2082 in Lineages 1, 2 and 6.


In addition to the four homoplasic positions at the nucleotide level shown in Table 3.4, there was one intergenic SNP, at nucleotide position 2566768, that was mutated to an adenine in Lineage 4 but to a cytosine in Lineage 1 (Table 3.5). Finally, at the amino acid level, the residue at position 733 within Rv0339c harbours different nonsynonymous SNPs in Lineages 3 and 5. Rv0339c encodes a transcriptional regulatory protein, and the two SNPs result in a change to different amino acids in the two lineages (Table 3.5).

Table 3.5. Variable genomic positions within the lineages. Two nucleotide positions harbour different SNPs across the lineages.

SNP  Gene           SNP position  Ancient allele  SNP allele  Mutation    Lineage  Gene product
1    Rv2294-Rv2295  2566768       G               A           intergenic  4        hypothetical protein
     Rv2294-Rv2295  2566768       G               C           intergenic  1        hypothetical protein
2    Rv0339c        406251        A               G           D733G       3        transcriptional regulatory protein
     Rv0339c        406251        A               C           D733A       5        transcriptional regulatory protein


3.3.5 Creation of pseudogenes

In total, thirty-nine SNPs were found to affect stop codons. SNPs can either cause the premature introduction of a new stop codon at any point in the annotated coding sequence (a nonsense SNP), or more rarely remove an existing stop codon. Thirty-six of the SNPs cause the former type of nonsense mutation. As shown previously (section 3.3.3), the majority of SNPs occur in isolation within genes, and nonsense SNPs follow this distribution, thus leading to the potential generation of thirty-five pseudogenes in the respective lineages (Table 3.6A). The remaining three nonsynonymous SNPs have the reverse effect, causing the loss or removal of an existing stop codon (Table 3.6B).

Whilst all lineages have accumulated nonsense SNPs, the three ancient lineages have the greatest frequency, with two-thirds of the nonsense SNPs (24 out of 36). To test if this is due to the longer branch lengths of these lineages compared to the modern lineages, and so a reflection of the greater time that these lineages have had to accumulate nonsense mutations, the number of nonsense SNPs was compared to the total number of SNPs found in each respective lineage branch, shown in Table 3.7. Lineage 2 has the shortest branch length and one nonsense SNP, whilst Lineage 5 has the longest branch and the most nonsense SNPs. A significant correlation was found between branch length and the number of pseudogenes (Pearson r = 0.8477, p = 0.0160).

It can be seen from Table 3.6 that a large proportion of the SNPs affecting stop codons are within genes annotated as encoding hypothetical proteins (21 out of 39 SNPs). Using the formal functional gene categories defined by Tuberculist, it was tested whether these SNPs were distributed across all gene function categories. Whilst all categories were affected by one or more SNP, as expected the conserved hypothetical category contained the largest proportion (15 SNPs, 38.5%). Due to the low number of nonsense SNPs, it was not possible to stratify into functional groups by each lineage, but the distribution of nonsense SNPs was not significantly different for any of the functional categories using the ancient and modern lineage groupings (Table 3.8) (Mann-Whitney U test, p = 0.24).


Table 3.6 Nonsense SNPs. In total thirty-nine SNPs cause a change in the encoded stop codon. A. Introduction of a stop codon within the coding sequence. B. Removal of an existing stop codon. The stop codon is indicated by an asterisk (*) in column 3. Rows are ordered by gene.

A. Stop introduction

Gene     Name    Mutation  Lineage  Gene product
Rv0064           Q862*     5        hypothetical protein
Rv0134   ephF    W152*     1        epoxide hydrolase
Rv0146           Y94*      3        hypothetical protein
Rv0325           Q75*      4        hypothetical protein
Rv0329c          R141*     6        hypothetical protein
Rv0368c          S277*     5        hypothetical protein
Rv0402c  mmpL1   R376*     5        transmembrane transport protein
Rv0457c          W119*     1        peptidase
Rv0490   senX3   R410*     6        two component sensor histidine kinase
Rv0574c          Q149*     5        hypothetical protein
Rv0610c          Q305*     1        hypothetical protein
Rv0621           W355*     Modern   hypothetical protein
Rv0836c          W218*     4        hypothetical protein
Rv0906           Q183*     1        hypothetical protein
Rv1251c          E875*     3        hypothetical protein
Rv1504c          E200*     Modern   hypothetical protein
Rv1870c          L212*     Modern   hypothetical protein
Rv1912c  fadB5   G63*      3        oxidoreductase
Rv1965   yrbE3B  W11*      5        integral membrane protein
Rv2079           Q609*     2        hypothetical protein
Rv2132           Y60*      5        hypothetical protein
Rv2187   fadD15  Y81*      6        long-chain-fatty-acid-CoA ligase
Rv2187   fadD15  W43*      1        long-chain-fatty-acid-CoA ligase
Rv2299c  htpG    Q109*     6        heat shock protein 90
Rv2339   mmpL9   S917*     5        transmembrane transport protein
Rv2690c          R658*     Modern   hypothetical protein
Rv2788   sirR    Q131*     1        transcriptional repressor
Rv2797c          Q273*     5        hypothetical protein
Rv2818c          Q304*     6        hypothetical protein
Rv2850c          R515*     5        magnesium chelatase
Rv2994           W68*      1        integral membrane protein
Rv3079c          E120*     1        hypothetical protein
Rv3373   echA18  G214*     Modern   enoyl-CoA hydratase
Rv3416   whiB3   E71*      5        transcriptional regulatory protein
Rv3729           W369*     6        transferase
Rv3898c          Q111*     4        hypothetical protein

B. Stop removal

Gene     Name    Mutation  Lineage  Gene product
Rv0257           *23R      Modern   hypothetical protein
Rv1641   infC    *202S     3        translation initiation factor IF-3
Rv1921c  lppF    *424G     3        lipoprotein


Table 3.7. Nonsense SNPs by lineage. Thirty-six lineage-specific nonsynonymous SNPs result in the introduction of a stop codon within the coding sequence (nonsense SNP). The number of nonsense SNPs is correlated to branch length.

Lineage  Nonsense  Branch length (SNPs)
1        8         463
5        10        698
6        6         674
2        1         124
3        3         353
4        3         163
Modern   5         319
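The relationship between branch length and the number of nonsense SNPs can be checked directly from the values in Table 3.7. The sketch below computes a Pearson correlation coefficient and the corresponding t statistic in plain Python. It is an illustration rather than the original analysis, and the coefficient it returns (about 0.85) is close to, but not identical with, the reported value, which may reflect a slightly different input.

```python
import math

# Values from Table 3.7, in table order: Lineages 1, 5, 6, 2, 3, 4, Modern.
nonsense_snps = [8, 10, 6, 1, 3, 3, 5]
branch_length = [463, 698, 674, 124, 353, 163, 319]


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(branch_length, nonsense_snps)
df = len(branch_length) - 2
# t statistic for testing r against zero, with n - 2 degrees of freedom
t_stat = r * math.sqrt(df / (1 - r ** 2))

print(f"r = {r:.3f}, t = {t_stat:.2f} on {df} degrees of freedom")
```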

Table 3.8 Nonsense SNPs grouped by functional category. Nonsense SNPs separated by functional category of the affected gene, and into modern (Lineages 2, 3 and 4) and ancient groups (Lineages 1, 5 and 6). Rows are ordered by descending total number of SNPs per functional category.

Functional category                       Total  Modern  Ancient
conserved hypotheticals                   15     7       8
cell wall and cell processes              7      3       4
lipid metabolism                          5      3       2
intermediary metabolism and respiration   3      0       3
regulatory proteins                       3      0       3
virulence, detoxification, adaptation     3      0       3
unknown                                   2      1       1
information pathways                      1      1       0


3.3.5.1 Nonsense and stop codon removal SNPs in essential genes

The thirty-eight genes harbouring the thirty-nine nonsense and stop codon removal SNPs were next grouped by gene essentiality. These groups are based on the genome-wide analyses of mutants that were unable to grow in vitro on Middlebrook 7H11 agar or in the spleens of intravenously infected mice (Sassetti et al., 2003; Sassetti & Rubin, 2003).

Strikingly, all but two of the genes harbouring a SNP involved in the creation or removal of a stop codon were nonessential. There were 36 SNPs in nonessential genes compared to 2 in essential genes, out of a genome-wide total of 2,986 nonessential and 760 essential genes (χ² test, p = 0.0362). Given that a nonsense SNP within an essential gene would very likely cause a loss of function of the encoded protein that leads to cell death, this result is perhaps unsurprising. One of the two exceptions is in Lineage 6, an M. africanum lineage. Here a SNP at codon 410 of senX3 (Rv0490) replaces an arginine residue with a stop codon. senX3 encodes a predicted secreted two component sensor histidine kinase (Malen et al., 2007). Whilst this has the potential to severely affect the function of the encoded protein, the precise position of the SNP within the gene determines the length of the protein truncation, and so its likely severity. The amino acid length of SenX3 is 410, which places the new stop codon directly adjacent to the existing ancestral stop codon and results in the loss of only one amino acid residue from the protein C-terminus. Such a short truncation is likely to have little or no effect on gene function, which may explain why the SNP is allowed to persist in the lineage.

A similar scenario exists in the second essential gene harbouring a SNP that affects a stop codon. infC (Rv1641) encodes the translation initiation factor IF-3, one of the three initiation factors in bacteria (Malys & McCarthy, 2011). IF-3 binds to the 30S ribosomal subunit and shifts the equilibrium between 70S ribosomes and their 50S and 30S subunits by promoting dissociation of 30S from 50S, thereby allowing subsequent binding of mRNA (Liveris et al., 1993); it is therefore required for the initiation of protein biosynthesis in bacteria. Lineage 3 strains carry a nonsynonymous SNP that removes the existing stop codon at codon position 202 and introduces a serine residue (Table 3.6B). This could lead to translation of infC continuing into the following intergenic region and a potential fusion to the next encoded gene, rpmI. However, 27 nucleotides downstream from the removed stop codon is another in-frame stop codon at position 1852903. Therefore infC in Lineage 3 is 27 nucleotides longer, and the protein 9 amino acids longer, than in the rest of the MTBC. Again, this is unlikely to be harmful to the cell.

3.3.5.2 Length of protein truncation

The majority of SNPs that affect a stop codon cause the introduction of a termination codon within the coding sequence (36 nonsense SNPs). Whilst this has the potential to severely affect the function of the encoded protein, it has been demonstrated previously (section 3.3.5.1) that the position of the SNP within the gene should also be taken into account. Comparing the full-length ancestral protein sequence to the truncated protein revealed that truncations were distributed throughout the protein length (Figure 3.8A). There was only one example of more than one nonsense SNP within a gene: fadD15 carries two nonsense SNPs, one in Lineage 1 (W43*) and one in Lineage 6 (Y81*) (Table 3.6A), each of which would cause >85% loss of the protein length. The most extreme truncation, in yrbE3B (Rv1965), will lead to a protein 96.3% shorter in length than the ancestral protein. Although yrbE3B encodes a protein of unknown function, it is highly similar to other membrane proteins, and forms part of one of the mammalian cell entry operons in M. tuberculosis (Mce3) (Cole et al., 1998). Overall, 14 SNPs (38.9% of all nonsense SNPs) cause the deletion of >50% of the ancestral amino acid sequence; such a deletion might be expected to have severe effects on the function of the gene product.
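The percentage of a protein lost to a nonsense SNP follows directly from the codon position of the new stop and the full protein length. The sketch below is a minimal illustration; the function name and the mutation-string parsing are assumptions introduced here, and the only real values used are those given in the text for senX3 (R410* in a 410-residue protein).

```python
import re


def truncation_percent(mutation: str, protein_length: int) -> float:
    """Percentage of residues lost to a nonsense mutation written like 'R410*'.

    The residue at the mutated codon and everything after it is lost.
    """
    match = re.fullmatch(r"[A-Z](\d+)\*", mutation)
    if match is None:
        raise ValueError(f"not a nonsense mutation: {mutation!r}")
    stop_codon = int(match.group(1))
    if not 1 <= stop_codon <= protein_length:
        raise ValueError("stop codon lies outside the protein")
    residues_lost = protein_length - stop_codon + 1
    return 100.0 * residues_lost / protein_length


# senX3: R410* in a 410-residue protein removes a single residue.
senx3_loss = truncation_percent("R410*", 410)
print(f"senX3 R410*: {senx3_loss:.2f}% of the protein truncated")

# HYPOTHETICAL example: an early stop in a 100-residue protein.
early_loss = truncation_percent("W11*", 100)
print(f"hypothetical W11* in 100 residues: {early_loss:.0f}% truncated")
```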

It can be seen in Figure 3.8A that nine of the nonsense SNPs cause <1% of the protein to be truncated. Apart from senX3, which has one amino acid truncated and is described in the above section (3.3.5.1), the remaining eight genes affected by nonsense SNPs have 0% deletions. This is an artefact of basing the length of truncations on the annotations of H37Rv, a Lineage 4 strain. The analysis is therefore identifying proteins with premature stop codons introduced either in Lineage 4 or in the more basal Modern lineage branch, which have then been integrated into the H37Rv annotations. Interestingly, in four cases the nonsense SNPs have created two open reading frames that have been annotated as separate genes in H37Rv: these genes are Rv0325-Rv0326, Rv1504c-Rv1503c, Rv3373-Rv3374 and Rv3898c-Rv3897c. Whilst these are effectively new genes in the respective lineages, all but one are annotated as encoding hypothetical proteins. The single exception is echA18 (Rv3373) and echA18.1 (Rv3374), which encode probable enoyl-CoA hydratases but were previously a single open reading frame (Figure 3.9).


The three SNPs that remove existing stop codons lead to proteins that are 104.3-563.6% greater in amino acid length compared to existing annotations (Figure 3.8B). This is based on the next in frame stop codon from the 3’ end of the annotated gene. infC has the smallest increase in length, and was described previously (section 3.3.5.1). The remaining two genes, lppF and Rv0257, increase by 110 (110.2% increase) and 104 (563.6%) amino acids.

[Figure 3.8 plots. A. Percentage of protein truncated (x-axis, 0 to 100) for each gene, listed in order: Rv3898c, echA18, Rv2690c, Rv1870c, Rv1504c, Rv0836c, Rv0621, Rv0325, senX3, mmpL9, Rv2079, Rv0064, Rv2850c, Rv2818c, Rv0610c, Rv2132, Rv1251c, whiB3, Rv0368c, Rv0329c, sirR, ephF, Rv0906, Rv2797c, Rv3729, Rv3079c, mmpL1, Rv0574c, Rv0146, fadB5, Rv0457c, htpG, Rv2994, fadD15, fadD15, yrbE3B. B. Percentage increase in protein length (x-axis, 0 to 600, broken axis) for infC, lppF and Rv0257.]

Figure 3.8. Change in protein length due to nonsense SNPs. A. Distribution of protein truncations due to thirty-six nonsense SNPs causing premature stop codon introductions. Truncations expressed as percentage change based on H37Rv annotations. Note fadD15 shown twice due to two SNPs that introduce stop codons. Black bars indicate the deletion; grey bars are remaining protein. B. Percentage increase in protein length from three SNPs that remove existing stop codons. Striped bars indicate new protein sequence.


Figure 3.9. Gene creation by nonsense SNPs. echA18 (Rv3373) and echA18.1 (Rv3374) form a contiguous open reading frame in the ancient sequence, but introduction of a nonsense SNP in the modern branch led to the annotation of two genes in the reference H37Rv, and in all other modern lineage strains.


3.3.6 SNPs within genes associated with antibiotic resistance

Many drug resistance-conferring mutations have been identified in the MTBC and are held in the publicly available TBDReaMDB database (Sandgren et al., 2009). Identification of such mutations has been important in the development of molecular genotypic based assays for drug resistance (Boehme et al., 2011; Hillemann et al., 2007). However, as shown in this study, many SNPs in the MTBC are phylogenetic markers for the lineage, and so it is important to understand the underlying phylogeny to distinguish SNPs within drug resistant genes that are unlikely to be the cause of drug resistance but instead phylogenetic markers.

Using the above database, the lineage-specific SNPs were screened to identify SNPs within genes associated with drug resistance. In total, forty-six coding SNPs were identified, of which thirty-two were nonsynonymous and fourteen synonymous. Lineage-specific SNPs were found in genes associated with resistance to six of the nine antibiotics used in the treatment of tuberculosis: ethambutol (SNPs in 9 out of 13 associated genes), ethionamide (2 of 3), fluoroquinolones (2 of 2), isoniazid (11 of 23), rifampicin (1 of 2) and streptomycin (1 of 3) (Figure 3.10). A further two intergenic SNPs were in potential regulatory regions (<100 bp from the translational start site) of the genes ahpC and rpoB, which are associated with isoniazid and rifampicin resistance respectively (Ramaswamy & Musser, 1998; Sherman et al., 1996). Whilst more SNPs in drug resistance associated genes were found in the two M. africanum lineages (11 SNPs each), all lineages harboured at least one example (see Appendix C for details).

[Figure 3.10 plot: y-axis, number of SNPs (0 to 20); x-axis, Isoniazid, Rifampicin, Ethambutol, Ethionamide, Streptomycin, Fluoroquinolones.]

Figure 3.10. Lineage-specific SNPs within genes associated with drug resistance. In total 46 SNPs were identified.


Whilst one of the genome sequences used to construct the whole genome phylogeny is from an extensively drug resistant (XDR) strain (Lineage 4 strain KZN 605), the design of this study excludes SNPs present in only one strain (singleton SNPs), and so none of the lineage-specific SNPs is directly involved in causing drug resistance. Interestingly, nine of the forty-six lineage-specific SNPs identified above (19.6%) were found within the TBDReaMDB database, as shown in Table 3.9. It is therefore likely that these lineage-specific SNPs have been incorrectly associated with drug resistance.

It can be seen at the top of Table 3.9 that a cysteine to tyrosine mutation (C110Y) within embR (Rv1267c) was found in a study by Srivastava et al. (2009). This SNP is present within all strains from Lineage 1. In that study, three genes implicated in ethambutol resistance (embB, embC and embR) were sequenced in 44 ethambutol resistant clinical strains isolated in India (Srivastava et al., 2009). The C110Y mutation was found within one of the study strains, which also had two mutations in embC (G288W and V303G). The C110Y mutation therefore identifies this strain as likely belonging to Lineage 1. Lineage 1 is not prevalent in the country from which the strains were isolated (Gagneux et al., 2006a; Gagneux & Small, 2007), which might account for there being only one instance of the SNP among the 44 strains in the study. Interestingly, Lineage 1 strains also harbour two more lineage-specific SNPs within genes involved in ethambutol resistance, a P913S mutation in embA (Rv3794) and an N394D mutation in embC (Rv3793), but these were not found in the study. However, embA was not sequenced, and the primers used to sequence embC did not extend beyond the 5’ 308 bp region of embC that has sequence homology to the resistance-determining region (ERDR), and so missed the Lineage 1 SNP that is in the middle of the gene (Sreevatsan et al., 1997b; Srivastava et al., 2009). It is therefore not possible for the C110Y SNP to be involved directly in drug resistance to ethambutol.

The above study and others have identified the most common mutation reported in embC as that at codon 270 (I270T) (Srivastava et al., 2006; Srivastava et al., 2009). However, in this study the mutation was found to be a modern lineage SNP, and so is present within Lineages 2, 3 and 4. This would make the mutation highly prevalent in the study areas where the strains were isolated (Srivastava et al., 2006; Srivastava et al., 2009). The mutation is typically reported as the conversion of an existing threonine residue, but most studies treat the reference strain H37Rv as carrying the ancient allele, and therefore the direction of change is reported incorrectly; this agrees with the findings of Koser et al. (2011).


Table 3.9. Putative mutations found in drug resistance studies incorrectly associated with drug resistance. All SNPs are lineage-specific, and therefore phylogenetic markers of the respective lineages.

Lineage  Gene     Name  Mutation  Drug resistance  Primary reference
1        Rv1267c  embR  C110Y     ethambutol       Srivastava et al., 2009
3        Rv3264c  manB  D152N     ethambutol       Ramaswamy et al., 2000
Modern   Rv3793   embC  I270T     ethambutol       Sreevatsan et al., 1997b; Srivastava et al., 2009
1        Rv3793   embC  N394D     ethambutol       Ramaswamy et al., 2000
3        Rv3793   embC  R738Q     ethambutol       Ramaswamy et al., 2000
1        Rv3794   embA  P913S     ethambutol       Ramaswamy et al., 2000
Modern   Rv3795   embB  A378E     ethambutol       Srivastava et al., 2006
4        Rv1908c  katG  L463R     isoniazid        Heym et al., 1995
3        Rv2242         M323T     isoniazid        Ramaswamy et al., 2003


3.3.7 Conservation and removal of lineage-specific nonsynonymous SNPs

In the following section the extent to which nonsynonymous SNPs are removed from the lineages was analysed. The commonly used method to detect selection by measuring the proportion of nonsynonymous nucleotide changes (dN) to synonymous nucleotide changes (dS) was applied to the lineage-specific SNPs (see 3.2.6). A dN/dS >1 indicates positive selection, <1 indicates purifying selection and a ratio at or close to 1 is regarded as neutral, or a balance of the two former selective forces. The rate of nonsynonymous SNP accumulation was first compared across the six lineages. The relatively low number of SNPs within the MTBC made calculation of dN/dS for individual genes of questionable value and impossible for the 2,459 (61.2%) genes with no lineage-specific SNPs. As an alternative approach the dN/dS ratio was calculated using gene concatenates based firstly on all genes, then gene essentiality and functional categories.

The mean dN/dS for the lineages was 0.67 (ranging from 0.54 to 0.79), corresponding to nearly two thirds (64.8%) of coding SNPs causing a change to the encoded amino acid (Table 3.10). This finding is consistent with the average dN/dS based on all SNPs identified in 21 MTBC genome sequences (dN/dS=0.59), and on the sequencing of 89 genes from 108 MTBC strains (dN/dS=0.57) (Comas et al., 2010; Hershberg et al., 2008). If the lineages are grouped into the ancient and modern categories, the mean dN/dS was 0.61 and 0.72 respectively; whilst a higher rate of nonsynonymous SNP accumulation was found in the modern lineages, the difference between the two is not significant (Mann-Whitney U test, p = 0.2118). High dN/dS ratios are often considered to indicate a reduction in purifying selection (He et al., 2010; Hershberg et al., 2008; Holt et al., 2008), which would suggest here that all lineages are experiencing the same weak purifying selection. Alternatively, signals of weak purifying selection may be due to the close relatedness of the MTBC strains. Rocha et al. (2006) have shown that dN/dS is often higher when the organisms compared are closely related. dN/dS therefore becomes dependent on time, because of a lag in the removal of deleterious nonsynonymous mutations by purifying selection, which elevates dN/dS.

To test how the frequencies of nonsynonymous SNPs vary over different timescales, the ratio of nonsynonymous to synonymous SNPs was compared in different branches of the phylogenetic tree. No significant difference was found in the SNP ratio from the lineage branches compared to the external branches, which include SNPs from the twenty-eight extant strains used in the phylogeny (Mann Whitney U test, p = 0.1033). The mean lineage branch ratio was 1.9, whilst that of the external branches was 1.7 (Appendix D), suggesting that nonsynonymous SNP accumulation in the MTBC is a consistent feature irrespective of time.

Table 3.10. The rate of nonsynonymous SNP accumulation across the lineages. The dN/dS ratio was used, which measures the accumulation of nonsynonymous SNPs against the background rate of synonymous SNPs.

                              Lineage 5   Lineage 6   Lineage 1   Lineage 4   Lineage 2   Lineage 3      Modern
Nonsynonymous SNP                   385         374         238          96          74         182         172
Synonymous SNP                      213         206         156          46          33         117          78
Nonsynonymous positions (N)     2968425     2968425     2968425     2968425     2968425     2968425     2968425
Synonymous positions (S)        1052024     1052024     1052024     1052024     1052024     1052024     1052024
dN rate                        0.000130    0.000126    0.000081    0.000032    0.000025    0.000061    0.000058
dS rate                        0.000202    0.000196    0.000148    0.000044    0.000031    0.000111    0.000074
dN/dS                              0.64        0.64        0.54        0.74        0.79        0.55        0.78
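As an illustration of how the values in Table 3.10 relate to one another, the following minimal sketch recomputes dN, dS and dN/dS from the SNP counts and the numbers of potential positions given in the table. It assumes that each rate is a simple proportion (SNPs divided by potential positions) with no correction for multiple substitutions at the same site; the method actually used is described in section 3.2.6 and may differ in detail. The function and variable names are illustrative only.

```python
# Counts per lineage branch, copied from Table 3.10:
# (nonsynonymous SNPs, synonymous SNPs)
SNP_COUNTS = {
    "Lineage 5": (385, 213),
    "Lineage 6": (374, 206),
    "Lineage 1": (238, 156),
    "Lineage 4": (96, 46),
    "Lineage 2": (74, 33),
    "Lineage 3": (182, 117),
    "Modern": (172, 78),
}

# Potential nonsynonymous (N) and synonymous (S) positions in the
# all-gene concatenate, also from Table 3.10.
N_SITES = 2968425
S_SITES = 1052024


def dn_ds(nonsyn, syn, n_sites=N_SITES, s_sites=S_SITES):
    """Return (dN, dS, dN/dS) as uncorrected per-site proportions.

    Assumption: no multiple-hit correction is applied, which is only
    reasonable because the observed rates are very small.
    """
    dn = nonsyn / n_sites
    ds = syn / s_sites
    return dn, ds, dn / ds


ratios = {name: dn_ds(n, s)[2] for name, (n, s) in SNP_COUNTS.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: dN/dS = {ratio:.2f}")
print(f"Mean over all seven branches: {mean_ratio:.2f}")
```

Under this assumption the seven ratios round to the dN/dS values in the bottom row of Table 3.10, and their mean rounds to the 0.67 quoted in the text.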

3.3.7.1 Nonsynonymous SNPs within essential genes

The previous method was based on total sequence concatenates, which is quite a blunt method for detecting selection, likely averaging both purifying and potential positive selection in the sequences. Further concatenates were generated based on biologically relevant categories. Firstly, genes were grouped by those shown to be essential for growth by transposon mutagenesis (Sassetti et al., 2003; Sassetti & Rubin, 2003). Based on the findings in other bacteria and evolutionary theory, fewer nonsynonymous SNPs would be expected to accumulate within genes that are essential to the cell (Jordan et al., 2002). There were 335 (14.0%) nonsynonymous SNPs and 212 (8.9%) synonymous SNPs within essential genes, leaving the remaining 1215 (50.8%) nonsynonymous and 630 (26.3%) synonymous SNPs within nonessential genes. Adjusting for differences in the nucleotide length of the two categories using the number of potential nonsynonymous SNP positions, it was found that significantly fewer nonsynonymous SNPs were within essential genes (χ² test, p=0.0011). Likewise, the average dN/dS for essential genes was lower than for nonessential genes (0.56 and 0.68 respectively), indicating that essential genes are more conserved than nonessential genes.

3.3.7.2 Nonsynonymous SNPs within functional gene categories

Gene concatenates were next generated for all gene functional categories based on the Tuberculist database. Seven categories were tested: 1. information pathways, 2. intermediary metabolism and respiration, 3. lipid metabolism, 4. cell wall and cell processes, 5. conserved hypotheticals, 6. virulence, detoxification and adaptation and 7. regulatory proteins (Lew et al., 2011). In Figure 3.11A, the dN/dS ratios across these categories are shown. A nonparametric one-way analysis of variance of the dN/dS for each lineage and functional category found an uneven distribution (Kruskal-Wallis test, p=0.0084). Following multiple testing correction, the dN/dS of the information pathways and regulatory protein categories was found to be significantly different (Dunn's Multiple Comparison Test, p<0.05). The information pathways class might be expected to have the lowest number of nonsynonymous SNPs due to the critical function of these genes within the cell, such as in DNA replication and repair. This was supported by comparison of the percentage of essential genes per functional category to the dN/dS ratio, which found a significant negative correlation (Spearman r = -0.8929, p = 0.0123) (Figure 3.11B).

Whilst there was evidence of gene function categories varying in the strength of purifying selection, only genes within the regulatory category showed strong signs of positive selection in multiple lineages (mean dN/dS = 1.16) (Table 3.11). Stratifying the regulatory functional category by lineage, the dN/dS was >1 in Lineages 3, 4, 5 and 6. Focusing on this category, 84 regulatory proteins harboured 132 lineage-specific SNPs (101 nonsynonymous and 31 synonymous). This corresponds to a nonsynonymous to synonymous ratio of 3.3, compared to the mean ratio of 1.9 found across all functional categories. Potential positive selection (dN/dS >1) was also seen in the intermediary metabolism and respiration category for Lineage 2 only (dN/dS=1.49), and in lipid metabolism for Lineage 2 (dN/dS=1.13) and the modern lineage branch (dN/dS=1.39) (Figure 3.11A).


[Figure 3.11. Panel A: dN/dS (y-axis) for each lineage, grouped by functional category (x-axis). Panel B: percentage of essential genes in each functional category (y-axis, 0 to 100) plotted against dN/dS (x-axis, 0.0 to 1.5), with the information pathways and regulatory proteins categories labelled.]

Figure 3.11. The rate of nonsynonymous SNP accumulation by functional category. A. Lineage dN/dS by functional category. Lineages are coloured as previously and bars represent mean dN/dS. The dN/dS of information pathways is significantly lower than that of regulatory proteins (Kruskal-Wallis test with Dunn’s post-hoc test, p<0.05). B. Correlation between essential genes per functional category (as percentages) and dN/dS. Spearman r = -0.8929, p = 0.0123.


Table 3.11. The rate of nonsynonymous SNP accumulation in each functional category. The nonsynonymous/synonymous ratio and the dN/dS ratio are shown. N = all possible nonsynonymous positions, S = all possible synonymous positions.

Functional category                        Nonsynonymous   Synonymous   Nonsynonymous/        N        S   dN/dS
                                                    SNPs         SNPs      synonymous
information pathways                                  96           73             1.3   202427    70831    0.46
lipid metabolism                                     168           99             1.7   294422   102505    0.59
intermediary metabolism and respiration              394          237             1.7   765073   268496    0.58
cell wall and cell processes                         377          197             1.9   595063   214706    0.69
conserved hypotheticals                              344          175             2.0   594619   209012    0.69
virulence, detoxification, adaptation                 64           29             2.2   106294    38498    0.80
regulatory proteins                                  101           31             3.3   123975    44208    1.16


3.4 Discussion

3.4.1 Strengths and limitations of this study

This study used recently published MTBC genomes sequenced by high-throughput sequencing technology to identify for the first time all SNPs that contribute to the background genetic variation within the six lineages of the MTBC. At the time of this study about thirty globally representative strains from all of the lineages had been sequenced and the genomes made publicly available. It is likely that a discovery bias exists within this small genome set, as illustrated in Figure 3.2, where the lineages with the most genome sequences (Lineages 1, 2 and 4) had the greatest within-lineage diversity. Lineages 5 and 6 only had two genome sequences available to use in this study. However, this study was designed to capture variation within the internal basal branches of each lineage through exploitation of the clonal population structure of the MTBC, and this should circumvent any discovery bias. Theoretically, as backward mutations are rare in the MTBC, genome sequences from two strains belonging to the same lineage would capture all lineage-specific SNPs for the respective lineage, and additional genomes will only serve to reduce the branch length and so the number of lineage-specific SNPs. Finally, twenty-one of the genomes used to construct the genome phylogeny were selected from a wider collection of 875 strains characterised previously by the analysis of deletions across the genome (Comas et al., 2010; Gagneux et al., 2006a; Hershberg et al., 2008). Therefore, whilst it is expected that future studies will sequence ever-greater numbers of MTBC strains, the lineage-specific SNPs identified in this study are expected to be robust.

Removal of SNPs found within repetitive regions, such as in phages and the PE and PPE gene families, will likely have resulted in the loss of potentially important variation within the MTBC lineages. The pe genes are characterised by the presence of a proline-glutamic acid (PE) motif, whilst ppe genes contain a proline-proline-glutamic acid (PPE) motif; both families are highly variable in size and their C-terminal regions are extensively repetitive (Cole et al., 1998). Excluded regions total ~10% of the coding genome, and recently it has been shown that the large pe and ppe families harbour about a 3-fold higher frequency of nonsynonymous SNPs compared to non-pe/ppe genes (McEvoy et al., 2012), which would suggest that a pool of lineage-specific variation might have been missed in this study. It was necessary to remove SNPs identified in these regions due to inherent difficulties encountered in sequencing through repetitive regions using second generation short read technologies, such as the Illumina platform used to sequence the strains in this study. SNPs were detected in these regions in the lineage branches, but they would need to be confirmed by methods beyond the scope of this study. This is a common disadvantage of current short read sequencing technology (Loman et al., 2012), and developments in sequencing technology with longer read lengths will likely remove this limitation (Branton et al., 2008).

3.4.2 General characteristics of lineage-specific diversity

Prior to identification of the lineage-specific SNPs, a 28-genome phylogeny was built using a non-redundant set of variable nucleotide positions derived from the genome sequences. The phylogeny was largely derived from the genome sequences published previously (Comas et al., 2010), supplemented by other recently published genome sequences available in the EBI SRA. An additional strain (N0031), known to be a rare Lineage 2 strain based on a previous MLSA study, was sequenced for this project to widen diversity in this lineage (Hershberg et al., 2008). The topology of the resulting phylogeny was highly congruent with other MTBC phylogenies based on SNPs and other markers, such as deletions, further highlighting the clonal population structure of the MTBC (Comas et al., 2010; Gagneux et al., 2006a).

In total 2,794 lineage-specific SNPs were identified, with each lineage differing by an average of 400 SNPs. The ancient lineages (Lineages 1, 5 and 6) harboured the most lineage-specific SNPs, which is likely a reflection of the greater time that these lineages have had to accumulate mutations. On average, two-thirds of all coding SNPs were nonsynonymous and therefore cause a change in the encoded amino acid. This is a feature of the MTBC, and has been previously identified at the genome level (Fleischmann et al., 2002; Hershberg et al., 2008). Nonsynonymous SNPs are more likely than synonymous SNPs to have a functional effect, which raises the possibility that this variation will have functional consequences in the respective MTBC lineages.

The ability to isolate the total background SNP variation that contributes to the diversity of all strains from a particular lineage (lineage-specific SNPs) was fundamental to this study. This was only possible due to the negligible level of recombination seen in the MTBC (Liu et al., 2006), and because back mutations are rarely observed (Casali et al., 2012). Therefore a SNP in the parental strain becomes a defining marker for the rest of the progeny. Homoplasic nucleotide positions, at which a SNP cannot be explained without convergence when mapped onto the tree, have previously been reported to be rare in the MTBC and are typically found only in cases of drug resistance or compensatory mutations (Casali et al., 2012; Comas et al., 2011). Similar examples have been found in other bacterial studies, such as the sequencing of MRSA strains, where the authors found few homoplasic SNPs, and those that were identified corresponded to mutations conferring antibiotic resistance (Harris et al., 2010). In this study there were only four cases of homoplasic SNPs (0.14% of all lineage-specific SNPs), in which lineage-specific SNPs with the same nucleotide change were present in more than one lineage (Table 3.4). Independent fixation of SNPs across multiple lineages could represent signals of selective pressure acting on these positions, and this interpretation was strengthened by the distribution of the four SNPs, which cluster within two genes and lie within a few nucleotides of each other. Whilst these may have biological significance, the respective genes are not associated with drug resistance. Further work would be needed to confirm these SNPs and to understand whether they have a biological function.

The lineage-specific SNPs can also be exploited in SNP typing assays to genotype strains, either at the lineage or at any sub-lineage level (Bergval et al., 2012; Kahla et al., 2011; Stucki et al., 2012). SNP typing is suggested to be the new gold standard of phylogenetic classification of MTBC (Comas et al., 2009), and the majority of the SNPs identified in the lineage branches in this study, excluding the above homoplasies, would be applicable to such typing assays. At the epidemiological level, genotyping of strains has also been driven by the need for rapid tests to identify drug resistant strains. Resistance to the first-line TB drugs rifampicin and isoniazid (multidrug resistance or MDR-TB), and now also to some second-line drugs (extensively drug resistant tuberculosis or XDR-TB), has led to a growth in molecular genotypic drug susceptibility testing, such as the Genotype MTBDRplus (Hain Lifescience) and Xpert MTB/RIF (Cepheid) (Boehme et al., 2011; Hillemann et al., 2007; McNerney et al., 2012). Several SNPs were identified within drug resistance-associated genes that are not associated with drug resistance, but act as evolutionary markers (Table 3.9). Some mutations previously reported as highly prevalent within drug resistant strains have been shown here to be lineage-specific markers. Other studies have also questioned some associations of SNPs with drug resistance. A significant association between a SNP within Rv2629 and rifampicin resistance was found in a study of over 100 rifampicin resistant strains (Wang et al., 2007), but this SNP was subsequently shown to be a phylogenetic marker of Lineage 2, specifically of the Beijing group of strains (Homolka et al., 2009). Similar reassessments have been made of inhA SNPs associated with isoniazid resistance and embC SNPs associated with ethambutol resistance, which are instead phylogenetic markers and unlikely to be the cause of drug resistance (Projahn et al., 2011; Ramaswamy et al., 2000). From the perspectives of typing strains for evolutionary analysis, and linking genotype to phenotype to identify potential molecular causes of drug resistance, it is clear that an understanding of the underlying phylogenetic structure of the MTBC is critical.

Whilst several lineage-specific SNPs within genes associated with drug resistance are unlikely to be direct causes of resistance, some could play an indirect role in modulating the fitness cost of drug resistance mutations. It has been shown that strains from different lineages but with identical rifampicin resistance mutations show different levels of fitness cost (Gagneux et al., 2006b). In a wider context, the Beijing family of strains within Lineage 2 is often associated with drug resistance (Borrell & Gagneux, 2009; Parwati et al., 2010). It has been suggested therefore that strain genetic background plays a role in the spread of drug resistant strains (Muller et al., 2013), although the actual molecular mechanisms of this are currently unknown. Pre-existing mutations in genes associated with drug resistance, such as the lineage-specific SNPs found in this study, may increase the tolerance of the cell to future drug resistance mutations through higher baseline fitness, or through epistatic interactions between the genetic background of the strain and drug resistance mutations (Muller et al., 2013).

3.4.3 Insights into the evolution of M. tuberculosis lineages

It has been hypothesised that, due to historical human migrations and serial transmission bottlenecks resulting from the low infectious dose of tuberculosis, the MTBC has a small effective population size (Hershberg et al., 2008). This phenomenon can lead to increased random genetic drift relative to natural selection, limiting the removal of potentially functional mutations (Smith et al., 2006a). As discussed above, about two-thirds of all coding SNPs cause a change in the encoded amino acid; however, nonsynonymous SNPs that introduce a stop codon or change an existing one would very likely cause a loss of function. Although rare (1.3% of all lineage-specific SNPs), thirty-five lineage-specific pseudogenes were identified due to the introduction of stop codons in the lineage branches. These genes may have been allowed to lose their function either due to the genome-wide loss of selective constraint in the MTBC, or potentially because selection was relaxed during adaptation to a new niche in the respective lineages. The former hypothesis is more likely, however, as neither the frequency of pseudogene creation nor the functional category of the affected genes differed between lineages. Furthermore, most genes were conserved hypotheticals and all but one were nonessential for growth; the exception was senX3, but the nonsense SNP in Lineage 6 resulted in a modest loss of one amino acid, which is unlikely to affect function. The annotated H37Rv genome sequence contains thirteen pseudogenes (Lew et al., 2011), and it is likely that all of these pseudogenes are the result of random drift and will eventually be removed by deletions, leaving a more tightly packed and eventually more reduced genome.

With so little variation in the MTBC it is not currently possible to measure selection in each gene, although future whole genome studies employing hundreds to thousands of MTBC genomes may enable this. One approach to analysing selection in DNA sequence data is to use the dN/dS ratio, which provides a measure of the accumulation of nonsynonymous SNPs against the background of synonymous SNPs, which are assumed to be silent. The dN/dS measure has been applied to many bacterial species to understand their evolutionary histories, including Salmonella typhi (Roumagnac et al., 2006), Clostridium difficile (He et al., 2010) and previously the MTBC (Hershberg et al., 2008). However, the method was originally developed for the analysis of genetic sequences from divergent species (Kimura, 1977), and it has recently been suggested that it is inappropriate for the analysis of variation within a population (Kryazhimskiy & Plotkin, 2008). The problem with such comparisons is the potentially short timescales involved, whereby slightly deleterious mutations that have yet to be removed by selection cannot be separated from substitutions that are fixed in the population (Rocha et al., 2006); this has been shown to lead to high dN/dS values for closely related bacteria, often approaching 1 (Rocha et al., 2006). If this is the case in the MTBC, the external branches of the phylogeny, which include SNPs from the extant strains, might be expected to harbour proportionally more nonsynonymous SNPs than the lineage branches that were the focus of this study, because nonsynonymous mutations would be expected to decrease over time as they are purged by purifying selection. In this study, the ratio of nonsynonymous to synonymous SNPs did not differ between the external tips of the tree and the lineage branches (a ratio of 1.9 in the lineage branches and 1.7 in the external branches). This is in agreement with other studies (Hershberg et al., 2008), and together these results show that nonsynonymous SNPs are not more intensely purged than synonymous SNPs over time, which would suggest that the high dN/dS is not due to the close relatedness of the strains.

Previous studies of MTBC variation found genome-wide dN/dS values of 0.57 (Hershberg et al., 2008) and 0.60 (Comas et al., 2010). These suggest strongly reduced purifying selection acting within the MTBC. It has been suggested that this reduced selection is due to the small effective population size of the MTBC, which is a consequence of the clonality of the MTBC and serial population bottlenecks during transmission of TB (Hershberg et al., 2008; Smith et al., 2006a). The mean dN/dS for the lineage branches found in this study was 0.67, with no significant difference in the overall dN/dS per lineage. The lack of significant differences between the lineages suggests that the hypothesised reduction in purifying selection is a general feature across the lineages. When all genes were categorised by essentiality, however, the effects of purifying selection could still be detected in the MTBC, with significantly fewer nonsynonymous SNPs in essential genes. Furthermore, when genes were split by annotated function, the category most critical to the cell had the lowest dN/dS. This information pathways category consists of genes involved in critical cellular functions, including the transcriptional and translational machinery. At the other end of the spectrum, the regulatory gene category had the greatest accumulation of nonsynonymous SNPs; four of the lineages (Lineages 3, 4, 5 and 6) had dN/dS ratios >1, indicating potential positive selection within this class, with three to five times more nonsynonymous than synonymous SNPs.

High frequencies of nonsynonymous SNPs in regulatory genes have been detected previously. In 2011, Schürch et al. sequenced several isolates from the Beijing family of the MTBC, a subgroup of Lineage 2, and found overrepresentation of nonsynonymous SNPs in the regulatory and associated signal transduction pathways (Schürch et al., 2011). As previously discussed, in this study gene concatenates were used, which has the disadvantage of averaging the selective forces acting on the sequences and thereby providing only a summary of the pressure acting on them; it is not possible to identify individual genes potentially under positive selection. Furthermore, when the frequency of SNPs clustering within genes was analysed, no gene harboured a rate that deviated from the expected Poisson distribution. This suggests that specific genes in the regulatory category are not highly variable, but that the whole category is accumulating the greatest ratio of nonsynonymous SNPs, which in turn may affect the regulatory networks of the respective lineages. Overall, this has shown that the loss of selective constraint is a common feature of all lineages, and functional genetic diversity is anticipated, specifically due to the high number of amino acid changing SNPs.
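The test for SNP clustering mentioned above can be illustrated with a minimal sketch. It assumes that SNPs fall uniformly across coding sequence, so that the number expected in a gene is its length multiplied by a background rate, and it asks how surprising the observed count would be under a Poisson distribution with that mean. The background rate is taken loosely from figures in this chapter (the four coding SNP counts of section 3.3.7.1 divided by N + S from Table 3.10); the three genes, their lengths and counts, and the Bonferroni correction are hypothetical and are not the analysis performed in this study.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf_below_k = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )
    return max(0.0, 1.0 - cdf_below_k)


# Background rate (assumption): coding SNPs from section 3.3.7.1 divided by
# the potential coding positions (N + S) from Table 3.10.
CODING_SNPS = 335 + 212 + 1215 + 630
CODING_SITES = 2968425 + 1052024
SNPS_PER_SITE = CODING_SNPS / CODING_SITES

# Hypothetical genes: name -> (length in bp, observed lineage-specific SNPs)
GENES = {"geneA": (1500, 1), "geneB": (3000, 4), "geneC": (900, 0)}

ALPHA = 0.05
threshold = ALPHA / len(GENES)  # Bonferroni correction over the genes tested

results = {}
for name, (length_bp, observed) in GENES.items():
    expected = length_bp * SNPS_PER_SITE
    p_value = poisson_upper_tail(observed, expected)
    results[name] = (expected, p_value, p_value < threshold)
    print(f"{name}: expected {expected:.2f}, observed {observed}, "
          f"p = {p_value:.3f}, enriched = {p_value < threshold}")
```

A gene would be flagged as a candidate SNP cluster only if its upper-tail probability fell below the corrected threshold; none of the hypothetical genes above does.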


Chapter 4 In silico prediction of functional Single Nucleotide Polymorphisms

4.1 Introduction

Current knowledge on the effect of genetic variation in the M. tuberculosis Complex (MTBC) is limited, but it has been suggested that much of the genetic variation in the MTBC will have functional consequences due to a reduction in purifying selection (Hershberg et al., 2008). This concept was further investigated by Hershberg et al. through comparison of the rates of nonsynonymous SNPs, and therefore amino acid changes, within conserved amino acid positions between the MTBC and M. canettii (Hershberg et al., 2008). Positions were classified as conserved based on the gene sequences of all other mycobacterial species. Reduced selection would be detected by a difference in the number of amino acid changes falling in conserved and variable sites between M. canettii and the MTBC. This was found to be the case, with nonsynonymous SNPs falling in conserved amino acid positions 27% of the time in M. canettii, but at just over double that frequency (58%) in the MTBC.

While underscoring the reduced selective constraint in the MTBC, this also raises the possibility that much of the genetic variation could have a functional impact. Nonsynonymous SNPs have the potential to affect gene expression or the function of the encoded protein, which can have a range of phenotypic consequences for the cell. Most nonsynonymous SNPs are deleterious and eventually removed through the process of purifying selection (Balbi & Feil, 2007), but as demonstrated in this and other studies, the capacity to remove such SNPs is diminished in the MTBC due to low levels of purifying selection. This raises the question of how many and which nonsynonymous SNPs actually have a functional consequence. Based on an extrapolation of the aforementioned MLSA dataset, the actual number of functional SNPs was estimated. Specifically, the decreased number of nonsynonymous SNPs falling in conserved positions in M. canettii was used to estimate the number of nonsynonymous SNPs that would have been removed in the MTBC if purifying selection was similar to that of M. canettii, or any other Actinobacteria. It was suggested that about 40% of the amino acid changes in the MTBC would result in functional consequences, and if the small gene set was unbiased, genome-wide this translates to about 300 functional SNPs per average pairwise comparison of MTBC strains; strains that diverged more recently would have few functional SNPs, whilst the most divergent strain comparisons would have up to 500 functional SNPs (Hershberg et al., 2008).

Whilst the study represented the most complete analysis of genetic diversity at the time, the MLSA approach assays variation within a small sample of the genome. Whole genome sequencing datasets enable this hypothesis to be tested without risk of potential gene selection bias, and critically all of the predicted functional SNPs can be identified for the first time. The focus here is on the nonsynonymous SNPs identified in Chapter 3. This is the dominant SNP type identified in the MTBC, and is more amenable to in silico prediction methods because it causes an amino acid change, the effect of which can be measured by the methods described below.

The main body of research into predicting the effects of nonsynonymous SNPs has been undertaken in eukaryotic systems, specifically in human genetics studies (Ng & Henikoff, 2006). SNPs constitute about 90% of human protein sequence variability (Collins et al., 1998), and the importance of nonsynonymous SNPs in humans is illustrated by the database containing disease-causing variants, the Human Gene Mutation Database (HGMD) (Stenson et al., 2012). In this database, nonsynonymous SNPs make up about half of the genetic variants that are known to cause disease (Stenson et al., 2012). In silico methods fall into two main groups, based either on sequence or on structural information, and some hybrid methods now exist using a mix of the two approaches (Thusberg & Vihinen, 2009). The overarching basis of all amino acid substitution based predictions is the evidence that mutations which affect protein function tend to occur at evolutionarily conserved positions, suggesting that predictions could be based on sequence homology (Miller & Kumar, 2001). It was also found that disease-causing mutations had common structural features that distinguish them from neutral SNPs, suggesting that structural features could also be used in predictions (Sunyaev et al., 2000; Wang & Moult, 2001). In 2001, Wang & Moult used the human SNPdb database to model disease-causing mutations onto their respective wild-type protein structures and found that 83% of disease-causing mutations affect protein stability. These key studies spawned the development of algorithms to differentiate between functional and neutral SNPs. Some are based on sequence homology, such as SIFT (Ng & Henikoff, 2003) and PANTHER (Thomas et al., 2003), whilst others use structural features, such as TopoSNP (Stitziel et al., 2004). As described, some combine many predictive features, and one example is the prediction method PolyPhen (Ramensky et al., 2002).

4.1.1 Aims

The work presented in this chapter is a comprehensive genome-wide prediction and characterisation of MTBC lineage-specific nonsynonymous SNPs. The specific aims were to:

• computationally predict functional nonsynonymous SNPs.
• gain insight into the impact of functional SNPs across the lineages.
• generate a focused SNP set that can be followed in experimental systems.


4.2 Materials and Methods

4.2.1 SIFT

Prediction of nonsynonymous SNPs likely to affect protein function was performed using the Sorting Intolerant From Tolerant (SIFT) algorithm (Ng & Henikoff, 2003). SIFT version 4.0.2 (downloaded February 2010) was installed as a stand-alone version on a Linux server. A custom bash routine was written to analyse all SNPs in several batches.

The SIFT prediction is based on sequence conservation and the type of amino acid change. Briefly, SIFT looks for homologs of the gene of interest in other bacteria and 1) scores the conservation of the positions where mutations are found, and 2) weights this score by the nature of the amino acid change. These measures are incorporated into a normalised probability score, with scores ≤ 0.05 indicating a functional SNP prediction. The classification threshold was previously optimised for performance on a data set comprising 55 LacI-related sequences, including paralogs (Ng & Henikoff, 2001). Furthermore, if the sequence alignment over the SNP position had a depth of <3 sequences, the prediction was excluded.

A further conservation measure was also used to prevent predictions being made from sequences that are too conserved, which would contaminate the multiple sequence alignment and bias SIFT towards predicting more functional SNPs. The recommended conservation score threshold of <3.5 was used, thereby filtering out those genes and associated predictions above this threshold. As a bacterial database from which to generate the protein sequence alignment, all publicly available mycobacterial genome sequences outside of the M. tuberculosis complex (MTBC) were used. Predictions were therefore based on mycobacterial homologs, but not on species that are evolutionarily too close to the query sequences, which could again contaminate the alignment with sequences likely to harbour the SNP allele to be tested. The resulting non-MTBC database consisted of thirteen complete mycobacterial genomes, shown in Figure 4.1 and Table 4.1.
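The three filters described above (score cut-off, alignment depth and conservation threshold) can be summarised in a short sketch. It is not the custom bash routine used in this study and does not parse real SIFT output; the record type, field names and example values are hypothetical, and treating a conservation value of exactly 3.5 as filtered is an assumption.

```python
from typing import NamedTuple, Optional

SCORE_CUTOFF = 0.05        # scores <= 0.05 are predicted functional
MIN_ALIGNMENT_DEPTH = 3    # predictions with fewer aligned sequences are excluded
MAX_CONSERVATION = 3.5     # predictions are kept only below this value


class SiftCall(NamedTuple):
    """Hypothetical summary of one SIFT prediction (not SIFT's own format)."""
    snp_id: str
    score: float
    conservation: float    # conservation score of the alignment
    depth: int             # sequences aligned over the SNP position


def classify(call: SiftCall) -> Optional[str]:
    """Return 'functional', 'tolerated', or None if the prediction is excluded."""
    if call.depth < MIN_ALIGNMENT_DEPTH:
        return None        # too few homologs cover the position
    if call.conservation >= MAX_CONSERVATION:
        return None        # alignment too conserved to be informative
    return "functional" if call.score <= SCORE_CUTOFF else "tolerated"


# Hypothetical example calls
calls = [
    SiftCall("snp_1", 0.01, 3.1, 10),
    SiftCall("snp_2", 0.20, 3.1, 10),
    SiftCall("snp_3", 0.01, 3.8, 10),
    SiftCall("snp_4", 0.01, 3.1, 2),
]
labels = {call.snp_id: classify(call) for call in calls}
print(labels)
```

Excluded predictions are returned as None rather than as "tolerated", so that they can be reported separately from SNPs that SIFT assessed and scored as neutral.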

[Figure 4.1: Neighbour-Joining tree with bootstrap support values on the branches. Tips, in the order drawn: M. leprae (TN), M. leprae (Br4923), M. ulcerans (AGY99), M. marinum (M), M. avium subsp. paratuberculosis (K10), M. avium (104), M. abscessus (ATCC 19977), M. smegmatis (MC2 155), M. sp. JLS, M. sp. MCS, M. sp. KMS, M. gilvum (PYR-GCK), M. vanbaalenii (PYR-1) and the outgroup Nocardia farcinica (IFM 10152). Scale bar: 50.]

Figure 4.1. SIFT database phylogeny. BLAST database constructed for SIFT. Neighbour-Joining phylogeny based on concatenated 16S rRNA and rpoB nucleotide sequences from the thirteen available mycobacterial genomes. Node support after 1000 bootstrap repetitions is shown on the branches. Scale bar indicates number of SNPs. The tree is rooted using the outgroup Nocardia farcinica. The MTBC was not included to prevent contamination of the predictions by closely related sequences; if present the MTBC would diverge from M. leprae.


Table 4.1. SIFT database of non-MTBC species. Thirteen complete whole genome sequences were published at time of this study. Genomes downloaded from NCBI.

Genome                                  Description
M. leprae TN                            Causative agent of human leprosy. Leads to permanent damage to the skin, nerves, limbs and eyes if left untreated
M. leprae Br4923                        As above
M. ulcerans AGY99                       An emerging pathogen that causes Buruli ulcer
M. marinum M                            Causes a tuberculosis-like disease in cold-blooded animals, and a peripheral granulomatous disease in humans
M. avium subsp. paratuberculosis K10    Causes tuberculosis in birds and disseminated infections in immunocompromised humans
M. avium 104                            See above
M. abscessus ATCC 19977                 Environmental bacterium that causes lung, wound, and skin infections
M. smegmatis str. MC2 155               Generally non-pathogenic, capable of causing soft tissue lesions
M. sp. JLS                              A pyrene-degrading bacterium isolated from the soil
M. sp. MCS                              As above
M. sp. KMS                              As above
M. gilvum PYR-GCK                       As above
M. vanbaalenii PYR-1                    Capable of degrading a variety of aromatic hydrocarbons

4.2.2 Indels

Short indels (ranging from 1 to about 20 nt) were identified in Lineage 1 and 2 genome strains using the indelpe module in MAQ (Li et al., 2008). All Lineage 1 and 2 genomes used in Chapter 3 were included in this analysis. The tab-delimited output file includes the start position and the indel type (inserted or deleted nucleotides). From this file, frameshift mutations were identified as indels whose length is not divisible by three, the codon length. Indels are inherently difficult to identify in short-read data, and so only a targeted analysis of two lineages was performed.
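
The frameshift criterion described above can be sketched as follows. This is a minimal illustration and not a parser for the MAQ indelpe output: the record layout (position, type, sequence) and the function name are assumptions made for the example. The first two records mirror the indels reported in Table 4.4; the third is invented to show an in-frame case.

```python
def is_frameshift(indel_seq: str) -> bool:
    """Return True if the indel length is not a multiple of the codon length (3)."""
    return len(indel_seq) % 3 != 0

# Hypothetical records: (start position, indel type, inserted/deleted nucleotides).
indels = [
    (4305063, "insertion", "T"),    # as listed in Table 4.4
    (1151486, "deletion", "AC"),    # as listed in Table 4.4
    (1000000, "deletion", "GCA"),   # invented in-frame deletion, for contrast
]

# Keep only the indels that shift the reading frame.
frameshifts = [rec for rec in indels if is_frameshift(rec[2])]
print(frameshifts)
```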

4.2.3 Homology modelling

Prediction of protein structure was performed using Protein Homology/analogy Recognition Engine V 2.0 (Phyre2) (Kelley & Sternberg, 2009). Phyre2 is available at: http://www.sbg.bio.ic.ac.uk/phyre2. The Phyre2 server has been described in detail previously (Bennett-Lovsey et al., 2008; Kelley & Sternberg, 2009; Mao et al., 2012). Briefly, nine ancestral (wild-type) regulatory protein-coding sequences were submitted to the Phyre2 server. A non-redundant fold library is constructed based on known protein sequences mined from the Structural Classification of Proteins (SCOP) database and Protein Data Bank (PDB). The query protein sequence is scanned against a non-redundant sequence database, and a profile Hidden Markov model (HMM) is generated. PSI-BLAST is used to collect close and remote sequence homologues, an alignment is constructed and secondary structure is predicted. The profile HMM and the secondary structure are then used to scan the fold library. This alignment process returns a score on which all alignments are ranked, and an E-value is generated. The top twenty scoring matches are then used to generate full 3-D models of each sequence, which are reported to the user. For each regulatory protein, the highest confidence model (>99%) with the greatest coverage was used in the subsequent analysis. Whilst it was possible to generate a homology model for all regulators, for four proteins the structure did not cover the SNP region and so was not used in later analysis.
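
The model selection rule at the end of this section can be sketched as below. This is one reading of the text, not the procedure actually scripted for the thesis: among models with confidence above 99%, the one with the greatest coverage is taken, and it is discarded if it does not span the SNP residue. The `Model` fields, the template names and the confidence values are hypothetical; the 214-334 residue range is the VirS model range given in Figure 4.6.

```python
from typing import List, NamedTuple, Optional

class Model(NamedTuple):
    template: str      # hypothetical template identifier
    confidence: float  # model confidence (%)
    start: int         # first modelled residue
    end: int           # last modelled residue

def select_model(models: List[Model], snp_residue: int,
                 min_confidence: float = 99.0) -> Optional[Model]:
    """Pick the widest-coverage model above the confidence cut-off.
    Return None if no model qualifies or the chosen model misses the SNP."""
    confident = [m for m in models if m.confidence > min_confidence]
    if not confident:
        return None
    # Rank by number of residues covered, breaking ties on confidence.
    best = max(confident, key=lambda m: (m.end - m.start + 1, m.confidence))
    return best if best.start <= snp_residue <= best.end else None

# Hypothetical candidate models for a regulator with a SNP at residue 316.
candidates = [
    Model("template_a", 99.9, 214, 334),
    Model("template_b", 99.5, 250, 300),
    Model("template_c", 80.0, 1, 340),   # wide, but below the confidence cut-off
]
chosen = select_model(candidates, snp_residue=316)
print(chosen)
```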

4.2.4 Change in protein stability

Prediction of SNPs that destabilise the protein structure was made using CUPSAT (Parthiban et al., 2006). The CUPSAT server is available at: http://cupsat.tu-bs.de/. CUPSAT predicts the change in free energy of protein unfolding between wild-type and mutant proteins (ΔΔG) using structural environment specific atom potentials and torsion angle potentials. The prediction is based on existing PDB protein structures, or user-supplied structures. The output consists of information about the mutation site, its structural features (solvent accessibility, secondary structure and torsion angles), and comprehensive information about changes in protein stability for the nineteen possible substitutions of a specific amino acid (Parthiban et al., 2006). A mutation is categorised as destabilising if protein stability is lost (negative ΔΔG) or stabilising if protein stability increases (positive ΔΔG). Changes in stability with a magnitude of < 0.5 ΔΔG are not considered significant, and are classified as neutral mutations.


4.3 Results

4.3.1 Predicting functional SNPs within control set

The Sorting Intolerant From Tolerant (SIFT) algorithm was first tested on a set of SNPs that are highly likely to affect protein function in the MTBC. Drug resistance in the MTBC is largely caused by SNPs (Ramaswamy & Musser, 1998; Riska et al., 2000), and many of these drug resistance-conferring mutations have been identified and are housed in the TBDream database (Sandgren et al., 2009) (database downloaded on 07-06-10). In total a non-redundant set of 87 SNPs was extracted, consisting of SNPs from the following genes: ahpC, kasA and katG (isoniazid resistance), embB (ethambutol resistance), gyrA and gyrB (fluoroquinolone resistance), pncA (pyrazinamide resistance) and rpoB (rifampicin resistance).

In addition to the drug resistance-conferring SNPs, a literature search of experimentally determined functional SNPs in the MTBC was conducted to supplement the test set. SNPs from two additional genes, pykA and mmaA3, were included from this search (Behr et al., 2000; Keating et al., 2005). One of the early signs of variation amongst the MTBC was the variation in carbon utilisation (Goldman, 1963; Winder & Brennan, 1966). A characteristic of M. bovis was the inability to grow on glycerol as a sole carbon source, unlike M. tuberculosis, instead requiring the addition of pyruvate to the growth medium in vitro (Wayne, 1994). A mutation within pykA in M. bovis, encoding pyruvate kinase, was found to render this enzyme inactive and thereby disrupt the use of carbohydrates as an energy source (Keating et al., 2005). The nonsynonymous SNP (E220D) is also found in strains of M. africanum and M. microti (which infects voles), and these cultures are also supplemented with pyruvate (Keating et al., 2005; Wayne, 1994). The second nonsynonymous SNP (G98D) in mmaA3 is present within most strains of M. bovis BCG, such as BCG-Pasteur (Behr et al., 2000). A defining characteristic of mycobacteria is their capacity to synthesise mycolic acids, and it had been known that some BCG strains could not synthesise methoxymycolates, one type of mycolic acid (Minnikin et al., 1983). The G98D mutation was subsequently found to be responsible for this difference (Behr et al., 2000; Yuan et al., 1998).

SIFT was applied to the test SNP set and the results filtered as described (section 4.2.1), removing regions covered by <3 homologs and alignments with too little sequence variation to form a reliable prediction. In total 63 SNP predictions were made for the control set: 48 (78.7%) of the drug resistance-associated SNPs were predicted to impact protein function, leaving the remaining 13 SNPs (21.3%) predicted to be tolerated. The two pykA and mmaA3 SNPs were also predicted functional, both receiving the lowest SIFT score of 0.00. Together, nearly 80% of the SNP set was predicted functional, which may suggest a false negative error rate of 20%. It should be stressed, however, that not all of the SNPs within the drug resistance set are experimentally confirmed to cause drug resistance; some are instead only associated with it. Additionally, promoter mutations can also cause drug resistance, such as the inhA promoter mutations that cause isoniazid resistance (Musser et al., 1996); non-coding SNPs inherently cannot be tested in this type of analysis.
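
The filtering and calling rules can be sketched as a small function. This is an illustrative sketch rather than the pipeline used in this study: the thresholds are taken from the text above and from the caption of Table 4.2 (score ≤ 0.05, conservation below 3.5, at least three homologs), while the function and parameter names are hypothetical.

```python
from typing import Optional

SIFT_CUTOFF = 0.05       # score <= 0.05 is predicted functional
MAX_CONSERVATION = 3.5   # conservation scores at or above this are filtered
MIN_HOMOLOGS = 3         # positions covered by fewer homologs are filtered

def classify_snp(score: float, conservation: float, n_homologs: int) -> Optional[str]:
    """Return 'functional', 'tolerated', or None when the prediction is filtered out."""
    if n_homologs < MIN_HOMOLOGS or conservation >= MAX_CONSERVATION:
        return None
    return "functional" if score <= SIFT_CUTOFF else "tolerated"

# Hypothetical inputs covering each outcome.
print(classify_snp(0.00, 3.0, 10))  # passes filters, low score
print(classify_snp(0.30, 3.0, 10))  # passes filters, high score
print(classify_snp(0.00, 3.6, 10))  # too little sequence diversity
print(classify_snp(0.00, 3.0, 2))   # too few homologs
```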

4.3.2 Predicted functional nonsynonymous SNPs

All lineage-specific nonsynonymous SNPs identified in Chapter 3 were entered into the dataset for this study (N=1550 SNPs). Predictions could be made for 1339 (86.4%) of the SNPs. Removal of predictions based on genes that were highly conserved reduced this set by 37.8% (506 SNPs), leaving 833 SNP predictions. SNPs within genes that harboured little sequence diversity were not included, as such predictions would be biased, potentially increasing the number of functional mutation calls and thereby the false positive error rate (Ng & Henikoff, 2003).

In total, 371 nonsynonymous SNPs were predicted to affect gene function (Table 4.2). The ancient lineages (Lineages 1, 5 and 6) were found to harbour nearly double the number of predicted functional SNPs found in the modern lineages (246 vs 125 functional SNPs respectively). However, the three ancient lineages also have the longest branch lengths, as shown in Chapter 3. To account for any influence of branch length, the number of functional and tolerated SNPs was expressed as percentages (Figure 4.2). The percentage of SNPs predicted functional, for which predictions could be made, ranged from 40.9-48.4% across the lineages, with a mean of 44.5%. There was no significant difference between the frequency of predicted functional and tolerated SNPs across the lineages (Mann-Whitney, p = 0.4817). Additionally, no difference was observed between the ancient and modern lineages in the proportion of functional SNPs, with means of 44.7% and 44.1% predicted functional SNPs respectively.
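
The normalisation by branch can be sketched directly from the counts in Table 4.2. This is a minimal illustration of the percentage calculation only; the dictionary layout and function name are assumptions, and the sketch does not attempt to reproduce the statistical tests reported above.

```python
# (tolerated, functional) counts per branch, copied from Table 4.2
counts = {
    "L1": (79, 74), "L2": (25, 18), "L3": (52, 44), "L4": (33, 23),
    "L5": (111, 89), "L6": (118, 83), "Modern branch": (44, 40),
}

def percent_functional(tolerated: int, functional: int) -> float:
    """Functional predictions as a percentage of all predictions on a branch."""
    return 100.0 * functional / (tolerated + functional)

for branch, (tol, func) in counts.items():
    print(f"{branch}: {percent_functional(tol, func):.1f}% functional")

# Pooled percentage across all branches.
total_tol = sum(t for t, _ in counts.values())
total_func = sum(f for _, f in counts.values())
overall = percent_functional(total_tol, total_func)
print(f"Pooled: {overall:.1f}% functional")
```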

As a further control, all genes with predicted functional SNPs were categorised as essential or nonessential on the basis of transposon mutagenesis screens (Sassetti et al., 2003; Sassetti & Rubin, 2003). Using these two categories, 54 (14.6%) of the functional predictions were in essential genes. Because a SNP that impairs an essential gene is unlikely to be tolerated, this would suggest a 14.6% false positive error rate for SIFT predictions, which is close to the previously described false positive error rate for the SIFT algorithm (~20%) (Ng & Henikoff, 2003).


Table 4.2. Predicted tolerated and functional SNPs using SIFT. SNPs with a SIFT score ≤ 0.05 are predicted functional; genes with conservation scores ≥ 3.5 were filtered out.

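| Lineage | Tolerated | Functional | Total predictions |
|---|---|---|---|
| L1 | 79 | 74 | 153 |
| L2 | 25 | 18 | 43 |
| L3 | 52 | 44 | 96 |
| L4 | 33 | 23 | 56 |
| L5 | 111 | 89 | 200 |
| L6 | 118 | 83 | 201 |
| Modern branch | 44 | 40 | 84 |
| Total | 462 | 371 | 833 |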

[Figure 4.2: stacked bar chart. y-axis: Number of SNPs (%), 0-100. x-axis: Lineage 1, Lineage 5, Lineage 6, Lineage 2, Lineage 3, Lineage 4, Modern lineage. Legend: Tolerated SNP, Functional SNP.]

Figure 4.2. SIFT predictions. To account for differences in lineage branch lengths, the percentage of SNPs predicted as being functional and tolerated is shown. Horizontal dashed line indicates the average percentage of predicted functional SNPs (44.5%).


4.3.3 Impact of nonsynonymous SNPs outside of the human adapted MTBC

To test whether the high percentage of predicted functional SNPs is restricted to the MTBC or is a common phenomenon in mycobacteria, all SNPs were identified between the reconstructed ancestor of the MTBC sequences and M. canettii, the closely related outgroup of the MTBC. Out of a total of 12,319 coding SNPs, 4,245 (34.5%) were nonsynonymous. Compared to the percentage of nonsynonymous SNPs found within the lineage branches of the MTBC (64.8%), M. canettii has nearly half the proportion of nonsynonymous SNPs. These nonsynonymous SNPs were then screened for potential functional impact using SIFT. Out of a total of 2,416 possible predictions, 522 (21.6%) were predicted functional, significantly fewer than in the MTBC (chi-square, p<0.0001). This would suggest that, in contrast to the MTBC, the majority of changes in M. canettii are functionally neutral.
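
The comparison of the two proportions can be illustrated with a 2x2 chi-square test. This is a sketch under stated assumptions: the exact test configuration used in the thesis is not given, so a Pearson chi-square without continuity correction is assumed, and the MTBC row uses the 371 functional out of 833 predictions from Table 4.2. The function name is hypothetical.

```python
import math

def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].
    Returns (statistic, p-value) with one degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Rows: MTBC lineage branches, outgroup branch; columns: functional, tolerated.
mtbc = (371, 833 - 371)
outgroup = (522, 2416 - 522)
stat, p = chi_square_2x2(*mtbc, *outgroup)
print(f"chi-square = {stat:.1f}, p = {p:.3g}")
```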

4.3.4 Clustering of functional SNPs

There was little evidence that functional SNPs cluster within specific genes, a pattern that could indicate adaptive selection. The majority of genes did not harbour a predicted functional SNP (3701 genes, 92.1%), whilst those that did harboured 1-5 SNPs per gene, as shown in Figure 4.3A. The frequency of SNPs mainly followed the distribution expected under the Poisson model fitted to the data; however, there were a few exceptions: Rv2079, fadD15 (Rv2187) and Rv0465c. These three genes, which deviate from the expected number of SNPs, each had 4-5 SNPs (Figure 4.3B).
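
The comparison against a Poisson expectation can be sketched from the gene counts shown in Figure 4.3A. This is a minimal illustration, assuming the Poisson mean is estimated as the average number of functional SNPs per gene (the maximum-likelihood estimate); how the model was actually fitted is not stated. It also treats all genes as equal in length, which is a simplification.

```python
import math

# Observed number of genes carrying k predicted functional SNPs (Figure 4.3A)
observed = {0: 3701, 1: 278, 2: 35, 3: 3, 4: 1, 5: 2}

n_genes = sum(observed.values())
n_snps = sum(k * n for k, n in observed.items())
lam = n_snps / n_genes  # assumed estimate of the Poisson mean (SNPs per gene)

def expected_genes(k: int) -> float:
    """Expected number of genes carrying exactly k SNPs under Poisson(lam)."""
    return n_genes * math.exp(-lam) * lam ** k / math.factorial(k)

for k, obs in observed.items():
    print(f"{k} SNPs: observed {obs}, expected {expected_genes(k):.3f}")
```

Under these assumptions the expected number of genes with four or five SNPs is well below one, which is why the three genes named above stand out.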

All three genes are above the average gene length of 1003nt, ranging from 1425-2514nt, which could account for the increased number of predicted functional SNPs. However, out of the fifteen nonsynonymous SNPs found within the three genes, only one was not predicted to be functional, which would not be expected based on the genome-wide distribution of predicted functional and tolerated SNPs (chi-square, p=0.0002). Therefore, whilst these are relatively long genes, this does not account for the skewed number of predicted functional nonsynonymous SNPs.

Not much is known about Rv2079, which has four predicted functional SNPs. It is a conserved hypothetical gene of unknown function, and SNPs are found in four lineages (1, 2, 5 and 6); in Lineage 2 a nonsense SNP introduces a stop codon. Combined with evidence that this gene is nonessential for growth based on transposon screens (Sassetti et al., 2003; Sassetti & Rubin, 2003), it is possible that functional mutations are accumulating because Rv2079 is either incorrectly annotated as a gene or, in the case of Lineage 2, has become a pseudogene. The other outliers were fadD15 and Rv0465c, which contain five predicted functional SNPs each. The genes belong to different functional categories, lipid metabolism and regulatory proteins, respectively. As with Rv2079, the fadD15 functional SNPs are spread across multiple lineages (1, 3, 4 and 6), and one SNP is also present in the modern lineage branch. Therefore all the modern lineages have one or two functional SNPs in fadD15. Furthermore, in Lineages 1 and 5, the two SNPs are nonsense and result in the introduction of stop codons in the lineages. The precise function of fadD15 is again not known, but it encodes a fatty-acid-CoA synthetase and is likely involved in lipid degradation (Cole et al., 1998).

The other gene with five predicted functional SNPs, Rv0465c, is a probable transcriptional regulator (Cole et al., 1998). It shares high sequence identity with the RamB protein from Corynebacterium glutamicum, which is in the same phylum as M. tuberculosis. As well as binding to its own promoter to autoregulate expression, RamB controls isocitrate lyase (icl1), which is part of the glyoxylate cycle (Micklinghoff et al., 2009). Although not annotated in the most current release of the Tuberculist database (Release 26, December 2012), it has been given the gene name ramB by Micklinghoff et al. (2009), and this has been adopted in the following sections. Characteristic of regulators, the mycobacterial ramB has a DNA binding domain in the N-terminus of the 465 amino acid protein, which includes the helix-turn-helix (HTH) domain from amino acid residues 21 to 40, as annotated in the PROSITE database. One of the two predicted functional SNPs in Lineage 6 is located within the HTH domain (N36D), which might be expected to directly affect the capacity of the protein to bind DNA. All other functional SNPs, found in Lineages 1, 4 and 5, are located throughout the first half of the protein, leaving only Lineages 2 and 3 with a likely functioning ramB.

Next, the distribution of the predicted functional SNPs across the genome was calculated, as shown in Figure 4.4. Functional SNPs were located across the genome, and appear to follow the same distribution profile as the nonsynonymous SNP frequencies identified in Chapter 3. On average, there is one functional SNP per 10.9 kb of coding sequence.


[Figure 4.3: A. Bar chart of the number of genes (y-axis: Predicted functional SNPs) against Number of SNPs in gene (x-axis). Bar labels: 0 SNPs = 3701, 1 = 278, 2 = 35, 3 = 3, 4 = 1, 5 = 2. B. The same data with the y-axis on a log10 scale.]

Figure 4.3. Distribution of predicted functional SNPs per gene. A. SNPs per gene range from 0-5, with the actual number of genes shown at the top of each bar. Line indicates predicted values under a Poisson distribution fitted to the data. B. y-axis plotted on a log10 scale to highlight deviation from the expected number at high SNP numbers per gene.


[Figure 4.4: frequency plot. x-axis: Genome position (Mb), 0-4. y-axes: Predicted functional SNPs and Nonsynonymous SNPs.]

Figure 4.4. Frequency distribution of predicted functional SNPs across genome. SNPs were placed into bins of 0.1 Mb. Right y-axis predicted functional SNPs, left y-axis nonsynonymous SNPs.
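
The binning used for this figure can be sketched as follows. This is a minimal illustration, not the plotting code used for the thesis: the function name is hypothetical, and the example positions are a handful of SNP and indel coordinates from Table 4.4 rather than the full set of predicted functional SNPs.

```python
from collections import Counter

BIN_SIZE = 100_000  # 0.1 Mb

def bin_positions(positions):
    """Count SNPs per 0.1 Mb bin; keys are bin indices (bin i covers i*0.1 to (i+1)*0.1 Mb)."""
    counts = Counter(pos // BIN_SIZE for pos in positions)
    return dict(sorted(counts.items()))

# Example genome coordinates taken from Table 4.4
positions = [2096430, 3447480, 3536008, 555945, 1157771, 1151486, 1097023]
bins = bin_positions(positions)
for idx, n in bins.items():
    print(f"{idx * 0.1:.1f}-{(idx + 1) * 0.1:.1f} Mb: {n}")
```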


4.3.5 Functional category analysis of functional SNPs

To determine whether the predicted functional SNPs are within specific gene categories or are instead evenly distributed, the genes with predicted functional SNPs were grouped by the Tuberculist functional categories (Lew et al., 2011). The percentage of functional SNPs within each of the eight functional categories was compared to the percentage representation of the respective category genome-wide, and is shown in Figure 4.5. In this way, the unequal distribution of genes within specific categories was normalised and the functional SNP distribution expressed as a ratio. Ratios >1 represent functional categories over-represented with functional SNPs, whereas ratios <1 indicate under-representation. Categories significantly over-represented with functional SNPs were lipid metabolism (2.4-fold) and regulatory proteins (1.6-fold) (chi-square, false discovery rate adjusted p < 0.05). Interestingly, information pathways was the most under-represented category, with 2.0-fold fewer predicted functional SNPs than would have been expected (chi-square, false discovery rate adjusted p=0.04) (Table 4.3). Genes within the conserved hypotheticals category were also significantly under-represented.
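
The representation ratio can be sketched from the counts in Table 4.3. Two points are assumptions made for this sketch: the genome-wide denominators are taken as the column totals of Table 4.3, and under-representation is reported as a negative inverse ratio, a convention inferred from the negative values in that table. With these assumptions the sketch gives, for example, 2.4 for lipid metabolism and -2.0 for information pathways.

```python
# (gene number, functional SNPs) per category, copied from Table 4.3
categories = {
    "information pathways": (242, 12),
    "conserved hypotheticals": (1031, 63),
    "virulence, detoxification, adaptation": (238, 17),
    "intermediary metabolism and respiration": (936, 88),
    "cell wall and cell processes": (773, 91),
    "regulatory proteins": (198, 31),
    "unknown": (16, 3),
    "lipid metabolism": (271, 66),
}

total_genes = sum(g for g, _ in categories.values())
total_snps = sum(s for _, s in categories.values())

def representation(genes: int, snps: int) -> float:
    """Observed/expected ratio as a signed fold change:
    the ratio itself when >= 1, the negative inverse when < 1."""
    ratio = (snps / total_snps) / (genes / total_genes)
    return ratio if ratio >= 1 else -1 / ratio

for name, (genes, snps) in categories.items():
    print(f"{name}: {representation(genes, snps):+.1f}")
```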

[Figure 4.5: horizontal bar chart. x-axis: Functional category representation (-3 to 3). Categories, top to bottom: lipid metabolism *; unknown; regulatory proteins *; cell wall and cell processes; intermediary metabolism and respiration; virulence, detoxification, adaptation; conserved hypotheticals *; information pathways *.]

Figure 4.5. Functional category representation. Values on the x-axis are ratios, representing the deviation from the expected number of predicted functional SNPs per category. Ratios > 1 indicate overrepresentation, < 1 underrepresentation, and ~1 indicates that the number of predicted functional SNPs is on par with the expected number. Categories are based on Tuberculist annotations. * indicates p < 0.05 by individual chi-square test followed by multiple test correction (False Discovery Rate method) (Benjamini & Hochberg, 1995).


Table 4.3. Functional category representation. The number of predicted functional SNPs within genes from each respective category. Representation of category expressed as ratios. Independent chi-square tests performed for all categories, followed by multiple test correction (False Discovery Rate method) (Benjamini & Hochberg, 1995).

| Functional category | Gene number | Functional SNPs | Representation | chi-square (adjusted p-value) |
|---|---|---|---|---|
| information pathways | 242 | 12 | -2.0 | 0.04 |
| conserved hypotheticals | 1031 | 63 | -1.6 | <0.01 |
| virulence, detoxification, adaptation | 238 | 17 | -1.4 | 0.27 |
| intermediary metabolism and respiration | 936 | 88 | -1.1 | 0.55 |
| cell wall and cell processes | 773 | 91 | 1.2 | 0.18 |
| regulatory proteins | 198 | 31 | 1.6 | 0.04 |
| unknown | 16 | 3 | 1.9 | 0.55 |
| lipid metabolism | 271 | 66 | 2.4 | <0.01 |
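
The multiple test correction named in the caption can be sketched as below. This is a generic implementation of the Benjamini-Hochberg step-up adjustment, offered as an illustration; the raw p-values in the example are hypothetical and are not the unadjusted values underlying Table 4.3, which are not reported.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four independent chi-square tests
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```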

An alternative method to account for the number of functional SNPs per category was also applied. This was based on the number of functional SNPs per potential nonsynonymous SNP position in each functional category. Using this method, it was again found that the information pathways category had accumulated the fewest functional SNPs (12 functional SNPs out of 202,427 potential nonsynonymous positions, 0.006%). The lipid metabolism and regulatory protein categories had accumulated the most functional SNPs, with 0.02% and 0.03% of all potential nonsynonymous positions harbouring a functional SNP, respectively. In summary, this method highlights the same over- and under-represented gene categories as found previously.

Stratification of the predicted functional SNPs by lineage within the functional categories found no significant difference (Kruskal-Wallis one-way analysis of variance, p=0.99). This would suggest that whilst there is a significant difference in representation of functional SNPs within the above four gene categories, it is not driven by specific lineages but is instead a phenomenon across the MTBC lineages.


4.3.6 Functional impairment of Lineage 1 and 2 regulatory proteins

It has been shown that two functional categories, regulatory proteins and lipid metabolism, have accumulated a greater number of predicted functional SNPs than expected. The following section focuses on the over-represented regulatory category, and specifically on the predicted functional mutations within Lineages 1 and 2, which are the focus of the transcriptomic study in Chapter 5. This provides an opportunity to combine additional predictive information, such as structural features, with the previous sequence-based predictions, whilst also providing a reduced SNP set to initially guide the transcriptome analysis.

Eleven genes within the two lineages harbour lineage-specific SNPs predicted by SIFT analysis as likely to impair protein function, and a further gene harbours a nonsense mutation (Table 4.4). Targeted analysis of insertion and deletion (indel) mutations in the lineage branches identified a further two genes with indels that cause frameshifts (Table 4.4). The frameshift insertion in Rv3830c in Lineage 2 removes the existing stop codon, likely causing read-through and fusion with the downstream gene Rv3829c. Similarly, the two-base frameshift deletion within Rv1028c (kdpD) at chromosome position 1151486 introduces a stop codon at codon position 235, resulting in a 625 amino acid (72.8%) truncation of the ancestral protein. kdpD is a two-component transcriptional sensor and controls the expression of the kdpABC operon, which in Escherichia coli is involved in potassium transport at low potassium concentrations (Walderhaug et al., 1992). A third indel was found within mce1R (Rv0165c), at chromosome position 194305. However, the same two-nucleotide insertion (CC) was found across all Lineage 1 and 2 strains, and so was removed from the analysis, as this likely represents a two-base deletion that is specific to the H37Rv sequence used in the reference-based mapping.


Table 4.4. Transcriptional regulators with predicted functional SNPs and indels. Eleven genes harbour SNPs predicted functional by SIFT analysis. One SNP causes a nonsense mutation (stop gain). Two indels cause frameshift mutations. n/a: not possible to predict with SIFT.

| Gene | Regulator type | SNP | Mutation | Lineage | SIFT score |
|---|---|---|---|---|---|
| Rv1846c BlaI | penicillinase repressor | T 2096430 G | L57R | 1 | 0.05 |
| Rv3082c VirS | AraC | T 3447480 G | L316R | 1 | 0.01 |
| Rv3167c | TetR | C 3536008 A | P17Q | 1 | 0.02 |
| Rv0465c RamB | HTH-XRE | A 555945 G | Q121R | 1 | 0.02 |
| Rv1032c TcrS | 2-component sensor | C 1157771 G | S62C | 1 | 0.01 |
| Rv3736 | AraC | G 4187063 A | G144R | 1 | 0.01 |
| Rv0844c NarL | 2-component regulator | G 940602 C | G169R | 2 | 0.00 |
| Rv0377 | LysR | G 455325 C | R302P | 2 | 0.00 |
| Rv0275 | TetR | T 331588 C | L24S | Modern | 0.00 |
| Rv0981 MprA | 2-component regulator | A 1097023 G | S70G | Modern | 0.04 |
| Rv2359 Zur | Fur | G 2641840 A | R64H | Modern | 0.02 |
| Rv2788 SirR | Fe-dependent repressor | C 3097349 | Q131X | 1 | n/a |
| Rv3830c | TetR | insertion: 4305063 T | S208 frameshift | 2 | n/a |
| Rv1028c KdpD | 2-component sensor | deletion: 1151486 AC | H67 frameshift | 1 | n/a |

4.3.6.1 Change in protein stability

The sequence-based predictions of functional impairment of transcriptional regulators were refined through incorporation of structure-based information. The location of each SNP was placed in the context of protein domain information, such as identification of SNPs within the functionally important DNA binding helix-turn-helix (HTH) domain. Protein domain annotations were extracted from the Pfam database (Punta et al., 2012). These were then complemented with predictions of the protein stability (ΔG) of wild-type and mutant protein structures, enabling the change in protein stability (ΔΔG) to be calculated. Compromised protein folding and decreased stability of the protein product are major pathogenic consequences of nonsynonymous SNPs, affecting the ability of the protein to function (Wang & Moult, 2001; Yue et al., 2005).

To calculate ΔΔG it is necessary to have protein structures for each of the regulators. Only two of the eleven regulators with predicted functional SNPs have had their protein structures resolved and made publicly available in the Protein Data Bank (PDB) (Burley, 2013); these are BlaI (PDB ID: 2G9W) and NarL (3EUL) (Sala et al., 2009; Schnell et al., 2008). For the remaining nine regulators, homology modelling was performed using the Phyre2 server (Kelley & Sternberg, 2009) as described in Methods (section 4.2.3). Following this it was still not possible to construct protein models for four of the regulators, either due to the low quality of the model or because the SNP position was not covered. The remaining seven regulators were entered into the analysis.

The CUPSAT server was used to predict ΔΔG (Parthiban et al., 2006). Protein stability changes are categorised as destabilising (-ΔΔG), neutral (0 ΔΔG) or stabilising (+ΔΔG). Changes in stability with a magnitude of < 0.5 ΔΔG are not considered significant (see section 4.2.4). Five of the regulator SNPs were predicted to cause a loss of protein stability, one protein structure increased in stability following the SNP, and one predicted energy change was too small to classify as either stabilising or destabilising, and so is likely neutral (Table 4.5). Combined with the protein domain information, all five of the destabilising SNPs were located within the HTH DNA binding domains, and likely affect the regulatory function of the protein: Rv0275, Rv0844c (narL), Rv1846c (blaI), Rv3082c (virS) and Rv3167c. These were classified as having “high predictive scores” and form a reduced set of transcriptional regulators predicted to be functionally impaired (Table 4.5). For example, a SNP in Lineage 1 strains introduces an arginine residue into a conserved position of the virS HTH domain, which is predicted to destabilise the structure and cause a loss of function (Figure 4.6).
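
The stability categories can be sketched with the ΔΔG values reported in Table 4.5. This is a minimal illustration of the thresholding rule only, assuming the 0.5 cut-off is applied to the magnitude of ΔΔG in kcal/mol; the function name and dictionary layout are hypothetical, and CUPSAT itself is not called.

```python
from collections import Counter

NEUTRAL_BAND = 0.5  # kcal/mol; assumed to apply to the magnitude of ddG

def classify_ddg(ddg: float) -> str:
    """Label a predicted change in stability as destabilising, stabilising or neutral."""
    if abs(ddg) < NEUTRAL_BAND:
        return "neutral"
    return "destabilising" if ddg < 0 else "stabilising"

# ddG values (kcal/mol) copied from Table 4.5
ddg_values = {
    "Rv0275": -3.18, "Rv0844c (NarL)": -4.66, "Rv1846c (BlaI)": -8.72,
    "Rv3082c (VirS)": -2.03, "Rv3167c": -1.21,
    "Rv0981 (MprA)": 2.83, "Rv2359 (Zur)": 0.47,
}

calls = {gene: classify_ddg(value) for gene, value in ddg_values.items()}
summary = Counter(calls.values())
print(summary)
```

Applied to the seven modelled regulators, this gives five destabilising, one stabilising and one neutral call, in line with the counts described above.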


Table 4.5. Regulatory proteins with predicted functional SNPs and indels in Lineages 1 and 2. Sequence based predictions of functional SNPs are combined with Pfam protein domain information and prediction of changes in protein stability (ΔΔG). n/a: unable to calculate ΔΔG as the mutation is an indel or nonsense SNP, unkn: unable to generate a protein structure using homology modelling.

| Predictive score | Gene | Mutation | Lineage | Domain | Protein stability (ΔΔG; kcal/mol) |
|---|---|---|---|---|---|
| high | Rv0275 | L24S | Modern | helix-turn-helix | -3.18 |
| high | Rv0844c NarL | G169R | 2 | helix-turn-helix | -4.66 |
| high | Rv1028c KdpD | H67 frameshift | 1 | 2-component sensor | n/a |
| high | Rv1846c BlaI | L57R | 1 | helix-turn-helix | -8.72 |
| high | Rv2788 SirR | Q131X | 1 | Fe-dependent repressor | n/a |
| high | Rv3082c VirS | L316R | 1 | helix-turn-helix | -2.03 |
| high | Rv3167c | P17Q | 1 | helix-turn-helix | -1.21 |
| high | Rv3830c | S208 frameshift fusion | 2 | low complexity | n/a |
| low | Rv0465c RamB | Q121R | 1 | low complexity | unkn |
| low | Rv0377 | R302P | 2 | low complexity | unkn |
| low | Rv0981 MprA | S70G | Modern | cheY | 2.83 |
| low | Rv1032c TcrS | S62C | 1 | low complexity | unkn |
| low | Rv2359 Zur | R64H | Modern | helix-turn-helix | 0.47 |
| low | Rv3736 | G144R | 1 | arabinose-binding | unkn |


[Figure 4.6: homology model of VirS with the L316R position marked, shown alongside the sequence alignment below. Residue markers on the alignment: 281, 310, 320.]

QUERY LIERERRAQA ARYLAQPGLY LSQIAVLLGY SEQSALNRSC RRWFGMTPRQ YRAYGGVSGR *
mmi:MMAR_3320 VVDDVRREVT ERYLRDSDMT LTHLARQLGY AEQSVLSRSC QRWFGASPAS LRAXXXXXXX X
mmi:MMAR_5276 LIDEVRKETA DRYLRTTAMS LSHLARELGY AEQSVLTRSC KRWFGIGPAA YRAXXXXXXX X
mul:MUL_4350 LIDEVRKETA DRYLRTTAMS LSHLARELGY AEQSVLTRSC KRWFGIGPAA YRAXXXXXXX X
mab:MAB_3997c LVDQIRREAA ERLLSDTDLS LDHLSRQLGY AEQSVFTRSC KRWFGTTPSA YRSXXXXXXX X
mgi:Mflv_5495 LVDQTRRDTA QRLLLDTALS LDQLACPLXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X
mab:MAB_3623 LLDTIRLDLA DHLVTSDRHS LTEISEMLAF SSPSNFSRWF RGHRAMSPRT WRXXXXXXXX X
mmc:Mmcs_3216 LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mkm:Mkms_3278 LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mjl:Mjls_3227 LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mgi:Mflv_4594 LRQSCLRESA MMLLITRSMS ASQIATELGY GDLANFSHAF KRWTGRSPSE YRXXXXXXXX X
mab:MAB_0715c IRDAALRTEA IKSLEDGSES LNDLSVRLGF SELSAFTRAF RRWTGASPAQ YRXXXXXXXX X
mab:MAB_2050 LRQSFLQERA ILRILDRSVS VSEIAAELGY ADLTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mmi:MMAR_3156 LRQAFLRERA MLQLLDRSLS VSEIATDLGY SDLANFSHAF KRWTGRSPSE FRXXXXXXXX X

Figure 4.6. Predicted loss of function of virS transcriptional regulator in Lineage 1. Homology model of wild-type VirS protein, covering amino acid residues 214-334. Arrow indicates Lineage 1 SNP at amino acid position 316 within the HTH domain. CUPSAT analysis of the ancestral and mutant protein predicts a destabilisation of the structure (ΔΔG = -2.03 kcal/mol). Sequence conservation of the region used in the SIFT prediction is shown on the right-hand side, with the ancestral MTBC sequence at the top of the sequence alignment. Standard one-letter amino acid nomenclature is used; X indicates a gap in the alignment.


4.4 Discussion

4.4.1 Strengths and limitations of the study

The overall aim of this study was to computationally measure the impact of SNPs in the MTBC, focusing specifically on SNPs that contribute to lineage-specific variation identified in Chapter 3. Over 1,500 nonsynonymous SNPs were identified in the lineage branches of the MTBC, and the phenotypic effects of these are unknown. Unlike other bacterial species, the majority of SNPs in the MTBC (over two-thirds) are nonsynonymous, and this SNP set was the focus of this computational study. Such SNPs are more tractable to computational prediction methods than synonymous and intergenic SNPs, as the impact of the amino acid substitutions can be measured using the properties of the amino acid, such as residue volume change, as well as the evolutionary conservation of the specific nucleotide position based on multiple sequence alignments. This is reflected in the development of computational prediction methods based mainly on nonsynonymous SNPs (Ng & Henikoff, 2006). However, noncoding SNPs can clearly also have an impact on gene function, such as the mutation of regulatory regions found in M. tuberculosis drug resistance (Müller et al., 2011; Riska et al., 2000). More recently it has also been suggested that synonymous SNPs are less silent than previously assumed (Plotkin & Kudla, 2011). Despite not having an effect on the resulting protein sequence, synonymous SNPs, and therefore synonymous codon changes, have shaped gene expression through the phenomenon of codon-usage bias (Plotkin & Kudla, 2011). Differential use of synonymous codons can affect RNA processing, protein translation and protein folding (Plotkin & Kudla, 2011); industrial applications have exploited this to increase gene expression over 1000-fold through introduction of synonymous SNP changes (Gustafsson et al., 2004). Furthermore, in human-based studies, a synonymous mutation has also been shown to change the substrate specificity of the multidrug-resistance protein 1 (MDR1), although the precise mechanism is not yet understood (Kimchi-Sarfaty et al., 2007; Komar, 2007). Together this demonstrates the potential functional importance of all SNP types, and it is likely that future study of M. tuberculosis genomic variation will attribute instances of functional variation not only to nonsynonymous SNPs but to the latter two SNP types as well.

Whilst experimental methods exist to characterise the functional effect of SNPs, such as site-directed mutagenesis, studying the molecular effects of mutations in the MTBC is time-consuming, laborious and unfeasible at this scale. Computational methods can therefore provide useful and reliable information about the effects of amino acid substitutions at an initial stage. There are two main methods to predict the functional effect of coding nonsynonymous SNPs. The first relies on mapping the SNP to the three-dimensional protein structure, and the second takes a sequence-based approach, assessing the nature of the position and the introduced amino acid type. At the time of writing, there were protein structures for 259 (6.4%) of all annotated M. tuberculosis proteins in the Protein Data Bank (Burley, 2013). This number has not increased significantly in the interim period, and currently 314 genes have associated protein structures (December, 2012) (Burley, 2013). To ensure that this was a comprehensive study of the effects of lineage-specific nonsynonymous SNPs, it was decided to use the second prediction method, based on sequence homology, thus maximising the number of SNP predictions. The method chosen was the Sorting Intolerant From Tolerant (SIFT) algorithm (Ng & Henikoff, 2003). Although SIFT relies solely on amino acid sequence to make the prediction, it has been shown to perform similarly to methods based on different evolutionary and structural features, and critically can be applied to many more of the lineage-specific SNPs (Saunders & Baker, 2002; Sunyaev et al., 2001). It has been suggested that a combination of the two main prediction methods (sequence-based and structure-based) will likely improve the accuracy of predictions (Bao & Cui, 2005; Thusberg & Vihinen, 2009), but the chosen method was viewed as an acceptable trade-off. More in-depth structural work can be applied at a later targeted stage, as was done in this study for the genes within the regulatory protein category. However, even at this stage, four of the eleven (36.4%) regulatory proteins with nonsynonymous SNPs could not be entered into structure-based predictions, owing to the lack of structural information; of the remaining proteins, only two had experimentally determined structures, requiring intensive homology modelling to increase the size of the structural dataset.

Moving from SNPs, short insertions and deletions (indels) also have potential functional consequences, particularly indels of a length not divisible by three, which lead to a change in the reading frame. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying these lag behind methods for calling SNPs in terms of sensitivity and specificity (Albers et al., 2011). For this reason, it was decided not to include a genome-wide analysis of indels, but to focus instead on a few potential indels in genes involved in regulatory function. Indels are also rarer than SNPs in the MTBC, and for these reasons the identification of SNPs has received the greatest attention in such studies so far; SNPs are effectively the lower-hanging fruit. It is likely that these issues will be resolved and that indels will receive more attention as newer algorithms to detect them are developed (Albers et al., 2011), and as potentially longer reads from third-generation sequencing technologies are utilised.
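The reading-frame criterion for indels described above can be illustrated with a minimal Python sketch. The function name and the allele representation (reference and alternative allele strings, as in a typical variant call) are assumptions made for illustration and are not taken from the thesis pipeline; the check is only meaningful for indels that fall within a coding sequence.

```python
def causes_frameshift(ref_allele: str, alt_allele: str) -> bool:
    """Return True if an indel within a coding sequence shifts the reading frame.

    Hypothetical helper: an indel whose net length change is not a
    multiple of three alters the downstream codon frame.
    """
    length_change = abs(len(alt_allele) - len(ref_allele))
    return length_change % 3 != 0


# A 1 bp insertion shifts the frame; a 3 bp deletion does not.
print(causes_frameshift("A", "AT"))    # single base inserted
print(causes_frameshift("ACGT", "A"))  # three bases deleted
```

A length change of zero (a SNP) is reported as not frame-shifting, which is consistent with the separate treatment of SNPs and indels in the text.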

4.4.2 Validation of the SIFT method

For the first time it was possible to identify all potential functional SNPs in the lineages of MTBC. As described in Chapter 3, these SNPs represent the background variation that contributes to the underlying lineage genetic diversity. Identification of SNPs more likely to contribute to functional diversity focuses later analyses on predicted phenotypically important SNPs, and on a broader scale tests the hypothesis that a high proportion of SNPs within the MTBC will be functional, likely due to reduced purifying selection acting within MTBC (Hershberg et al., 2008).

The SIFT algorithm was first run on a test SNP set that would be expected to be enriched for functional SNPs, and so act as a positive control for the performance of the method. The set was based on SNPs associated with drug resistance from the current release of the TBDReaMDB database (Sandgren et al., 2009). It was found that 79.4% of SNPs were predicted functional by SIFT, leaving 20.6% of SNPs associated with drug resistance predicted to be functionally neutral. This potential false negative error rate of ~20% is close to that previously described by the authors of SIFT (Ng & Henikoff, 2001; Ng & Henikoff, 2003). However, it is important to note that the majority of SNPs in the positive control set are putative mutations found in drug resistant clinical M. tuberculosis isolates, and so may be incidentally associated with, rather than causally involved in, drug resistance (Sandgren et al., 2009). This will likely mean that the control set contains some SNPs that are not functional, and so it is not a completely robust test of the SIFT algorithm. As an alternative test, it was found that significantly fewer predicted functional SNPs fell within the genes previously characterised as being essential for growth, and that the proportion of functional SNPs that did fall within the group of essential genes (14.6%) was again close to the expected false positive error rate of SIFT. Together this provides confidence in the later SNP predictions.

4.4.3 Half of lineage-specific SNPs are predicted to have functional consequences

Applying SIFT to all lineage-specific SNPs, it was possible to make predictions for >85% of the set, and strikingly just under half were predicted to have a functional effect. The mean percentage of functional SNPs across all lineages was 44.5%, and no significant difference was found between the individual lineages, or when grouping lineages into ancient and modern categories. This prediction is very close to the estimate made by Hershberg et al. (2008), who estimated that ~40% of the SNPs within the MTBC are functional by extrapolating from the SNPs found within a set of 89 genes sequenced in 99 human M. tuberculosis isolates. In contrast to the high proportion of functional SNPs in the MTBC, when all SNPs between an M. canettii strain, the closely related outlier from the MTBC, and the reconstructed M. tuberculosis ancestor were identified, only 21.6% of the nonsynonymous SNPs were predicted to be functional, which is less than half of the proportion seen in the MTBC. This suggests that the hypothesised reduction in purifying selection acting within the MTBC is generating substantial diversity. Interestingly, a similar phenomenon has been observed in humans, where recent demographic expansions have led to the accumulation of low frequency genetic variants associated with strong functional effects (Keinan & Clark, 2012; Tennessen et al., 2012). Considering the tight link between the MTBC and its human host, it is interesting to speculate that these human expansions might have had a similar effect on the genetic diversity of the MTBC (Hershberg et al., 2008).

Although purifying selection is likely reduced in the MTBC, it was still possible to detect signals of this force through increased removal of predicted functional SNPs within genes classed as essential for growth compared to nonessential genes, and also through clustering of SNPs beyond the expected distribution. When grouped by functional category, genes encoding proteins involved in the information pathways category accumulated significantly fewer predicted functional SNPs than expected. Conversely, functional SNPs were over-represented in genes encoding proteins that perform regulatory functions and in those involved in lipid metabolism. Interestingly, it was also found that the transcriptional regulator ramB had accumulated more functional SNPs than expected, spanning four of the lineages. Within the regulatory protein category, the focus was placed on Lineage 1 and 2 SNPs; these two lineages form the basis of the transcriptomic study in Chapter 5, and so a focused analysis was performed through integration of additional mutational and structural information to identify regulators likely to be functionally impaired for the subsequent study. It was found that several SNPs lie within the HTH DNA binding domain of the regulatory proteins, such as a Lineage 1 SNP in virS. VirS regulates its own transcription and is also a positive regulator of an adjacent divergently expressed MymA locus, which has experimentally been shown to be involved in virulence in guinea pigs (Singh et al., 2003; Singh et al., 2005). Together with several frameshift mutations arising from short indels, it is hypothesised that specific lineages have functionally impaired regulators and that this has the potential to give rise to phenotypic diversity. The effects of such SNPs should be detectable at the transcriptional level, and part of the following chapter (Chapter 5) explores this hypothesis.

In summary, this study has identified a set of nonsynonymous SNPs likely to have functional consequences in the MTBC. However, it is not possible to determine from the SIFT predictions how these mutations affect protein function. There are four possible evolutionary fates for a SNP: the mutation is beneficial; it causes a severe fitness cost and so is lost from the population; it is functionally neutral; or it is neither beneficial nor excessively harmful, but slightly deleterious (Balbi & Feil, 2007). Slightly deleterious SNPs are the largest class, and in Escherichia coli it has been estimated that for every beneficial mutation there are 10^5 slightly deleterious mutations (Kibota & Lynch, 1996). As seen in Figure 4.7, it can be anticipated that many of the predicted functional SNPs identified in this study will fall within this slightly deleterious category, whilst the proportion of SNPs that have a greater impact or are “more” functional is unknown, but is likely determined by a combination of selective and stochastic forces, such as the level of purifying selection acting within the organism.


[Figure 4.7 schematic. x-axis: increasing severity of SNP; y-axis: increasing number of SNPs. Regions labelled: tolerated SNPs (~60%); functional SNPs (~40%); “more” functional SNPs (?%); harmful, cell death.]

Figure 4.7. Spectrum of functional SNPs. The consequences of nonsynonymous SNPs range from tolerated/neutral to functional, and at the extreme result in cell death, in which case the SNPs are not observed in the bacterial population. In the MTBC ~40% of SNPs were predicted functional in this study, but their severity is unknown.


Chapter 5 Screening the effect of lineage-specific variation by sequence-based transcriptional profiling

5.1 Introduction

M. tuberculosis infection is defined by a typically protracted period of asymptomatic infection followed by progression to active disease in a minority of individuals. Throughout these stages of infection, M. tuberculosis is exposed to a range of microenvironments, including acidic pH, reactive oxygen species, and nutrient starvation (Barry et al., 2009). Genome sequencing of the M. tuberculosis reference strain H37Rv by Cole et al. revealed a complex network of transcriptional regulation, including thirteen sigma factors, eleven two-component regulators, eleven serine-threonine protein kinases and over one hundred predicted transcription factors (Cole et al., 1998). At the initiation of this study, the extent of transcriptional variation between clinical isolates from the six main lineages was unknown, and the contribution of the underlying genetic diversity to such variation was an open question.

In 2007, a microarray-based study comparing H37Rv and the animal-adapted M. bovis growing under steady state conditions revealed that the human and bovine pathogens showed differential expression of ninety-two genes, which encoded a range of functions, including cell wall and secreted proteins, transcriptional regulators, PE/PPE proteins, lipid metabolism and toxin–antitoxin pairs (Golby et al., 2007). It is now known that there are on average ~1500 SNPs separating any MTBC strain (section 3.3.1), which raises the likelihood that human-adapted MTBC strains will also display a similar quantity of differential expression. Shortly after identification of the six main human-adapted MTBC lineages, a microarray-based study in 2010 surveyed for the first time differences in gene expression amongst clinical isolates of the MTBC (Homolka et al., 2010). The study was based on a total of fifteen MTBC clinical isolates from Lineage 1, the Beijing group of Lineage 2, two sub-lineages from Lineage 4 and Lineage 6. The study found specific transcriptional patterns in vitro and during intracellular growth based on the ancient and modern lineage groupings, demonstrating that strains from defined phylogenetic groups display similar gene expression, which suggests the importance of understanding the underlying genetic background. The strains used in the study were not genome sequenced, which limited its scope, and it was not possible to relate the expression differences to specific genetic variation.

The previous chapters would not have been possible without the availability of whole genome sequences, and such data are now crucial to experiments linking genotype to phenotype. Previous transcriptomic studies have relied on microarray-based methods, but recent advances in DNA sequencing technologies have enabled the determination of RNA expression through sequencing of cDNA prepared by reverse transcription of total cellular RNA (RNA-seq), which provides dynamic ranges several orders of magnitude greater than other technologies, whilst offering the greatest possible resolution. The first sequence-based transcriptome of M. tuberculosis strain H37Rv was published in 2011 by Arnvig et al., and whilst this was not a clinical isolate, it demonstrated the power of RNA-seq to capture the complete transcriptional landscape of M. tuberculosis (Arnvig et al., 2011).

5.1.1 Aims

The aims of this chapter were to survey the transcriptome profiles of M. tuberculosis clinical isolates from Lineages 1 and 2, and to understand the effects of lineage-specific variation identified in the previous Chapters. Specific aims were to:

• characterise M. tuberculosis transcriptomes using a sequence based approach
• capture lineage-specific transcription profiles in the transcriptome sets
• explore the functional impact of lineage-specific SNPs identified in Chapters 3 and 4


5.2 Methods

5.2.1 Clinical isolates in study

5.2.1.1 Strains sequenced using RNA-seq

Strains are from a collection of M. tuberculosis isolates from foreign-born tuberculosis patients in San Francisco, who contracted the infection in their country of origin (Gagneux et al., 2006a). All strains are drug susceptible and have been typed in previous studies (Table 5.1) (Gagneux et al., 2006a; Hershberg et al., 2008). Three strains were selected from each of Lineages 1 and 2 to represent the genetic diversity in the lineages. Figure 5.1 shows the previously described MTBC phylogeny based on MLSA, with the strains used in the RNA-seq study highlighted (Hershberg et al., 2008). From Lineage 1, two strains are from the large Rim of Indian Ocean subgroup (strains N0072 and N0153) and one is a representative of the Philippines subgroup (strain N0157). From Lineage 2, two Beijing strains were selected (strains N0145 and N0052) together with a less common non-Beijing strain (N0031). Figure 5.1 uses the original naming schema, but from this point on the later adopted ‘N’ number strain naming will be used. To preserve the two naming conventions both have been used in Table 5.1. All strains have been genome sequenced in previous studies or as part of this thesis.

5.2.1.2 Additional growth curve experiment strains

The determination of growth rates for the RNA-seq study strains was supplemented by the clinical isolates shown in Table 5.2. In total six strains from Lineage 1 and 2 were included to explore potential lineage-specific differences in exponential phase growth rate. The reference laboratory strain H37Rv was also included.


Table 5.1. Lineage 1 and 2 strains used in the RNA-seq study. All strains were previously genome sequenced except strain N0031, which was sequenced for this thesis in Chapter 3. This study refers to the strain names used in the Gagneux group, but original strain names used by Hershberg et al. (2008) are shown for reference. In addition to lineage, the region of difference (RD), which has been historically used to type the strains, is indicated. Geographic distribution and prevalence of lineage based on previous classifications (Coscolla & Gagneux, 2010).

Strain   MLSA strain name   Lineage   RD lineage   Lineage geographic distribution   Patient origin
N0153    T83                1         RD239        Rim of Indian Ocean               Vietnam
N0072    EAS053             1         RD239        Rim of Indian Ocean               India
N0157    T92                1         RD239        The Philippines                   The Philippines
N0145    T67                2         RD105        Beijing                           China
N0052    98_1833            2         RD105        Beijing                           China
N0031    94_M4241A          2         RD105        Non-Beijing                       China

Table 5.2. Additional strains used in growth curve experiment. Three additional strains from Lineage 1 and Lineage 2 were included in the growth curve experiments in combination with the previously described six RNA-seq study strains. All are clinical strains and isolated as part of the San Francisco strain collection (Gagneux et al., 2006a). Genome column indicates genome sequencing status of strain.

Strain   Strain ID   Lineage   RD lineage   Patient origin     Genome
N0043    96_4329     1         RD239        Burma              Y
N0075    EAS080      1         RD239        Vietnam            N
N0121    T17         1         RD239        The Philippines    Y
N0041    96_2104     2         RD105        Vietnam            N
N0053    98_1863     2         RD105        China              Y
N0140    T47         2         RD105        Macau              N


5.2.1.3 Additional qRT-PCR strains

The confirmation of select lineage-specific expression of genes by qRT-PCR used all previous RNA-seq strains and the addition of four Lineage 1 and 2 strains. These are shown below in Table 5.3.

Table 5.3. Additional strains used in qRT-PCR confirmation. Two strains from Lineage 1 and Lineage 2 were included in the RNA-seq confirmation. All are clinical strains and isolated as part of the San Francisco strain collection (Gagneux et al., 2006a). Genome column indicates genome sequencing status of strain. One strain is currently not genome sequenced but this was not required for the aims of the qRT-PCR study.

Strain   Strain ID   Lineage   RD lineage   Patient origin     Genome
N0043    96_4329     1         RD239        Burma              Y
N0121    T17         1         RD239        The Philippines    Y
N0041    96_2104     2         RD105        Vietnam            N
N0053    98_1863     2         RD105        China              Y


Figure 5.1. Strains sequenced in RNA-seq study. Circles indicate the six Lineage 1 and 2 strains used in the RNA-seq study. Phylogenetic tree of MTBC adapted from (Hershberg et al., 2008). Image reproduced under the Creative Commons Attribution License (CCAL).


5.2.2 Cluster analysis

Hierarchical cluster analysis of the transcriptomes was performed using the hclust function in R with the complete linkage method. Spearman distances were calculated from the dissimilarity matrix of pairwise correlations of total gene expression (N=4,015 genes), expressed as Reads Per Kilobase per Million mapped reads (RPKM). Clade support was assessed with 1000 bootstrap replications using the R function pvclust. Comparison of the total gene expression per strain to SNP distance was performed with normalised read counts that were transformed using the variance stabilising transformation (VST), as implemented in the DESeq package (Anders & Huber, 2010). VST is a monotonic function, and is calculated for each sample such that variance in the count data becomes independent of the mean.
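As an illustration of the quantities entering the cluster analysis, the following Python sketch computes RPKM for a single gene and a Spearman-based dissimilarity between two expression profiles. It is not the R code used in this study: the function names are hypothetical, and taking the distance as one minus the Spearman correlation is an assumption, since the exact transformation applied to the correlations is not stated above.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for one gene."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_distance(profile_a, profile_b):
    """Assumed dissimilarity: 1 - Spearman rank correlation."""
    return 1.0 - pearson(average_ranks(profile_a), average_ranks(profile_b))
```

A matrix of such pairwise distances between strains is the kind of input that a complete linkage clustering routine would take.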

5.2.3 Differential expression analysis

Statistical testing for the main differential expression analysis was performed using DESeq (Anders & Huber, 2010). DESeq is a method based on the negative binomial distribution and implemented in the R statistical environment. Raw reads were first normalised using DESeq to adjust for differences in library sizes. Reads from technical replicates were combined and treated as one sample. Gene deletions at either strain or lineage level were first removed from the analysis (N=223 genes); deletions were identified based on genome coverage using the respective strain's genome, with a threshold of <90% gene coverage to define a deletion. Normalised expression of features (annotated genes, antisense or sRNAs) that overlapped with strains from different lineages due to strain specific expression were filtered and removed, with 1,606 features entered into the analysis. For the purpose of testing for lineage-specific differential expression in DESeq, strains from the same lineage were treated as biological replicates, and the mean expression from the two lineages compared. Significant differential expression was defined as p<0.05 (p-value adjusted for multiple testing using the Benjamini-Hochberg method).
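The Benjamini-Hochberg adjustment applied to the p-values can be sketched in a few lines of Python. This is a generic illustration of the procedure rather than the DESeq implementation, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Features with an adjusted p-value below 0.05 would be called significant.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in adjusted]
```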


5.2.4 Transcriptional Start Site (TSS) calling

Custom Perl scripts were written for TSS calling. Briefly, the increment in reads from one genome position to the next consecutive base was calculated for all genomic positions, with an increment significantly above the average background coverage defined as a candidate TSS. TSS peak height was considered as representative of the level of expression of the TSS. To build a genome-wide TSS map for M. tuberculosis, automated annotation of the putative TSS detected was performed according to genomic distribution, similar to previous TSS analysis using RNA-seq data (Sharma et al., 2010b).
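The increment-based calling step can be outlined as follows. This Python sketch is not the Perl script used in this study: the function name is hypothetical, and the criterion for "significantly above background" is replaced here by an assumed fixed multiple of the mean coverage, because the actual threshold is not given above.

```python
def call_candidate_tss(coverage, fold_over_background=5.0):
    """Return (position, increment) pairs for candidate TSS on one strand.

    coverage: per-base read depth, indexed by 0-based genome position.
    fold_over_background: assumed cut-off, as a multiple of mean coverage.
    """
    if len(coverage) < 2:
        return []
    background = sum(coverage) / len(coverage)
    threshold = fold_over_background * background
    candidates = []
    for pos in range(1, len(coverage)):
        # Rise in read depth from the previous base to this base.
        increment = coverage[pos] - coverage[pos - 1]
        if increment > threshold:
            candidates.append((pos, increment))
    return candidates


# A sharp rise in depth at position 4 is reported with its peak height.
depth = [0, 0, 0, 0, 50, 52, 51, 3, 2, 0]
print(call_candidate_tss(depth, fold_over_background=2.0))
```

The increment returned alongside each position corresponds to the peak height that the text treats as a proxy for TSS expression level.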


5.3 Results

5.3.1 Growth rate in vitro

It was critical to isolate the transcriptomes of all study strains from the same physiological state, ensuring that differential transcription is not simply a reflection of the stage of growth. RNA was harvested at two growth phases in this study, mid-exponential and stationary; these were defined as an Optical Density (OD600) of 0.4 to 0.6 and one week after an OD of 1.0, respectively. A difficulty of working with clinical strains compared to well-used reference strains is that the growth rates, which are required to standardise the RNA extraction process, are largely unknown.

Three representative strains from each of Lineages 1 and 2 were selected for the RNA-seq study (section 5.2.1.1), and the growth of the six strains was monitored over a 14-day period. In a defined 7H9 medium (section 2.1.3), culture density (OD600) was measured daily from the initial inoculation (day 0). From frozen stocks, strains were grown in 10 ml 7H9 for two days prior to transfer into the roller bottles used for the growth curves and all RNA extractions. At day 0, a calculated volume was transferred from the pre-culture to start all growth curves at OD 0.01. This experiment was also used to identify any lineage level differences between growth rates in vitro, and three additional strains from each of Lineages 1 and 2 were included to increase the sample size and so the statistical power of the test. Additional clinical isolates are described in section 5.2.1.2. The H37Rv laboratory strain was also included as a reference.
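The "calculated volume" transferred at day 0 follows from a simple dilution relationship (C1 × V1 = C2 × V2). The Python sketch below illustrates the arithmetic only; the function name is hypothetical, and the pre-culture OD and final culture volume in the example are illustrative values, not measurements from this study.

```python
def inoculum_volume_ml(preculture_od, target_od, final_volume_ml):
    """Volume of pre-culture needed to reach target_od in final_volume_ml."""
    if preculture_od <= target_od:
        raise ValueError("pre-culture must be denser than the target OD")
    return target_od * final_volume_ml / preculture_od


# Illustrative values: pre-culture at OD 0.5, 100 ml culture started at OD 0.01.
volume = inoculum_volume_ml(preculture_od=0.5, target_od=0.01, final_volume_ml=100)
```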


[Figure 5.2 plots. Panel A: optical density (OD600, log scale) against days from inoculum (0 to 14) for strains N0121, N0043, N0072, N0153, N0157, N0075, N0145, N0053, N0041, N0031, N0052, N0140 and H37Rv. Panel B: optical density (OD600, log scale) against days from inoculum (0 to 14) for Lineage 1, Lineage 2 and H37Rv.]

Figure 5.2. In vitro growth curves. A. Growth of twelve strains from Lineage 1 and 2, plus H37Rv. B. Strains pooled by lineage. Error bars are the standard error of the mean (SEM). All strains were grown in three independent experiments, and under the same conditions. Strains are coloured using previously defined lineage colouring.


Growth rates of the clinical strains did vary, with a trend for Lineage 2 strains to continue in late-exponential phase for longer than Lineage 1 strains (Figure 5.2A). This was reflected in higher OD600 readings for Lineage 2, with the Lineage 2 strain N0145 reaching an OD600 of ~10, the highest of all strains. The reference strain H37Rv is in the middle of the range of growth rates. By day 9 to 10 all strains had entered stationary phase. Figure 5.2B plots all strains from the same lineage as replicates, confirming the observation that Lineage 2 strains do continue in late-exponential phase for comparatively longer. However, mid-exponential growth is similar for all strains irrespective of lineage. As pre-cultures were used, all strains were in exponential growth at day zero, the start of the growth curve. Between days three and four, strains leave mid-logarithmic and enter late-logarithmic growth. For these experiments, mid-logarithmic growth was defined as OD ≤ 0.6.

Strain specific doubling times are shown in Table 5.4. Exponential doubling times range from 13.8 ± 0.2 hrs (strain N0043) to 24.2 ± 0.6 hrs (strain N0075). This shows that the doubling times of the clinical strains can differ by up to 10 hrs, which is important when synchronising RNA extraction experiments. Whilst there is some variability in the specific growth rates of the strains, this was not significant at the lineage level. The mean lineage exponential doubling time for Lineage 1 was 18.2 ± 1.8 hrs and for Lineage 2 was 16.4 ± 0.5 hrs (two-tailed Student's t-test, p=0.35).
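For reference, the Python sketch below shows how an exponential doubling time can be derived from two OD readings, and how the lineage means and SEMs quoted above follow from the strain values in Table 5.4. The two-point formula is a standard simplification and is an assumption here; the fitting procedure actually used for the growth curves is not described in this section, and the function name is hypothetical.

```python
import math
import statistics


def doubling_time_hours(od_start, od_end, elapsed_hours):
    """Doubling time assuming pure exponential growth between two readings."""
    return elapsed_hours * math.log(2) / math.log(od_end / od_start)


def standard_error(values):
    """Standard error of the mean (sample standard deviation / sqrt(n))."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Strain doubling times (hrs) taken from Table 5.4.
lineage_1 = [13.8, 16.1, 24.2, 16.0, 23.2, 15.9]
lineage_2 = [16.2, 18.1, 16.2, 16.6, 16.9, 14.4]

for name, values in (("Lineage 1", lineage_1), ("Lineage 2", lineage_2)):
    print(name, round(statistics.mean(values), 1), round(standard_error(values), 1))
```

With these inputs the lineage means and SEMs agree with the 18.2 ± 1.8 hrs and 16.4 ± 0.5 hrs reported in the text.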


Table 5.4. In vitro growth rates. Doubling times in hours are shown for exponential phase growth with the SEM. All strains were grown in at least three independent experiments, under conditions detailed in 2.1.4. Lineage mean doubling time also shown. The laboratory strain H37Rv was used as a reference. Asterisks (*) identify strains used in RNA-seq study.

Strain     Lineage   Doubling time (hrs)   Error (SEM)   Lineage mean doubling time (hrs)
N0043      1         13.8                  0.2           18.2
N0072 *    1         16.1                  0.8
N0075      1         24.2                  0.6
N0121      1         16.0                  0.4
N0153 *    1         23.2                  1.9
N0157 *    1         15.9                  0.4
N0031 *    2         16.2                  0.1           16.4
N0041      2         18.1                  2.0
N0052 *    2         16.2                  2.1
N0053      2         16.6                  2.1
N0140      2         16.9                  1.9
N0145 *    2         14.4                  0.6
H37Rv      4         18.0                  1.7           18.0


5.3.2 RNA isolation and Illumina ready libraries

Following extraction of RNA, sample concentration was measured by Nanodrop and RNA quality by Agilent 2100 BioAnalyser. The BioAnalyser is a nanofluidics device that performs size fractionation and quantification of DNA, RNA, or protein samples. Only high quality samples, based on the electropherogram and expressed as an RNA Integrity Number (RIN) ≥8, were used in later analysis (Figure 5.3A). Samples were rigorously DNase treated to remove potential DNA contamination from the RNA extraction, and entered into the Illumina library construction stage. Two main cDNA library types were constructed. The first was based on a modified Illumina strand-specific protocol for sequencing all RNA species (section 2.5.1). In total nine high quality cDNA libraries were constructed, including three technical replicates. Only libraries with concentrations >10 µg/ml and with the expected size distribution of adapter cDNA fragments were sequenced (Figure 5.3B). The second main library method depleted processed RNAs in the samples, and was used in Transcriptional Start Site (TSS) mapping. RNA was sent to Vertis Biotechnologie AG and four libraries were constructed (section 2.5.2).

Figure 5.3. Quality control of RNA-seq samples by Bioanalyser. Migration time of sample shown in seconds on x-axis, and fluorescence units (FU) on y-axis. A. Integrity of total RNA following RNA isolation and DNase treatment for strain N0052. The two largest peaks are rRNA 16S and 23S, and the area under the peaks used as a metric of RNA quality (RIN). B. Quality of strain N0052 Illumina strand-specific RNA-seq library. Lower and upper size markers are 15 bp and 1500 bp. Distribution of cDNA library fragments expected to be 180-200 bp, corresponding to 60-80 seconds.


5.3.3 Transcriptome sequencing

All libraries were sequenced at the NIMR by the High Throughput Sequencing staff, managed by Abdul Sesay. Single-end read sequencing was performed on Illumina Genome Analyser (GA) and HiSeq (HS) sequencers, using a single flow cell lane per library. The mean number of raw reads generated per run was 93.2 million (ranging from 30.1 to 186.8 million). Full details of the transcriptome data are shown in Tables 5.5 and 5.6.

5.3.3.1 RNA-seq data quality control

It was first necessary to discard low quality reads from the transcriptomes to increase the quality of the subsequent reference-based mapping. Poor quality read bases were trimmed using the SolexaQA package (Cox et al., 2010), trimming bases with an error probability p > 0.05, and removing reads < 25 bases in length. For the nine RNA-seq transcriptomes used in the following differential expression analysis (Table 5.5), a mean of 14.4 million (ranging from 1.6 to 52.5 million) reads were discarded by this step. Therefore, on average 15% of the raw reads were removed due to poor quality. It can be seen in Figure 5.4A that the mean RNA-seq base quality for strain N0145 decreases throughout the read length, and that after 85 cycles (read positions beyond 85 bp) mean Phred scores fall below 10, corresponding to a base call accuracy below 90%. Trimming removed the poor quality 3’ tails of these reads, as well as reads that were low quality throughout (Figure 5.4B).
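The relationship between Phred scores and base call error, and the effect of the two trimming thresholds, can be illustrated with a short Python sketch. This is not the SolexaQA code: the function names are hypothetical, and keeping the longest contiguous run of bases that pass the error threshold is an assumed approximation of the trimming behaviour.

```python
def phred_to_error_probability(phred_score):
    """Probability that a base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


def trim_read(qualities, max_error=0.05, min_length=25):
    """Return (start, end) of the retained segment, or None if discarded.

    Keeps the longest contiguous run of bases with error probability
    <= max_error; reads whose best run is shorter than min_length are dropped.
    """
    best_start, best_length = 0, 0
    run_start = None
    # A trailing None acts as a sentinel that closes the final run.
    for i, q in enumerate(list(qualities) + [None]):
        passes = q is not None and phred_to_error_probability(q) <= max_error
        if passes and run_start is None:
            run_start = i
        elif not passes and run_start is not None:
            if i - run_start > best_length:
                best_start, best_length = run_start, i - run_start
            run_start = None
    if best_length < min_length:
        return None
    return best_start, best_start + best_length


# A Phred score of 10 corresponds to a 10% error rate (90% accuracy).
error_at_q10 = phred_to_error_probability(10)
# A read with a poor-quality 3' tail keeps only its first 40 bases.
segment = trim_read([30] * 40 + [8] * 10)
```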

[Figure 5.4 plots. Panels A and B, both titled N0145_HS: read quality (Phred score, 0 to 40) against position in read (bp, 5 to 100).]

Figure 5.4. Distribution of quality scores throughout RNA-seq read length for strain N0145. The x-axis is the position within the read length (bp), y-axis is read quality based on Phred scores. A. Raw reads pre-trimming. B. Post-trimming read quality.


Table 5.5. Details of exponential phase transcriptomes used in differential expression analysis. The nine transcriptomes were constructed using the same Illumina RNA-seq method (section 2.5.1). Sample ID was used to track the sample through the sequencing pipeline. Two Illumina machines were used to sequence the technical replicates; HiSeq2000 (HS) and the Genome Analyser GAIIx (GA). Fold coverage was calculated as the amount of sequence data mapped, excluding rRNAs (16S, 23S and 5S subunits), divided by the genome size of H37Rv, which was used for the reference- based assembly of the transcriptomes (4411532 bp). A. Six strains sequenced on the HS platform. B. Technical replicates for strains N0145, N0031 and N0153.

A.

Sample                     1           2           3            4           5           6
Sample ID                  GR1         GR2         GR16         GR5         GR6         GR15
Machine                    HS          HS          HS           HS          HS          HS
Strain                     N0145       N0031       N0052        N0072       N0157       N0153
Lineage                    L2          L2          L2           L1          L1          L1
Growth phase               EXP         EXP         EXP          EXP         EXP         EXP
Read Length                SE 101bp    SE 101bp    SE 60bp      SE 101bp    SE 101bp    SE 60bp
Mapped reads (Million)     47.476144   41.065818   152.323131   44.946409   39.208057   148.026386
Unmapped reads (Million)   30.552533   21.408397   20.090865    13.226052   9.695711    20.899070
rRNA reads                 44.530529   38.050982   142.691893   41.303614   36.843853   138.358711
Non-rRNA (Mb)              218.2       219.5       327.5        268.8       169.3       473.7
Fold coverage              49.5        49.8        74.2         60.9        38.4        107.4

B.

Sample                     7 (t)       8 (t)       9 (t)
Sample ID                  GR1         GR2         GR13
Machine                    GA          GA          HS
Strain                     N0145       N0031       N0153
Lineage                    L2          L2          L1
Growth phase               EXP         EXP         EXP
Read Length                SE 75bp     SE 75bp     SE 101bp
Mapped reads (Million)     22.670235   21.866688   37.070182
Unmapped reads (Million)   8.067401    6.538578    26.419337
rRNA reads                 21.800412   20.914537   34.574916
Non-rRNA (Mb)              58.0        63.5        69.9
Fold coverage              13.2        14.4        15.8
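The fold coverage values in Table 5.5 follow from the definition given in the table legend: non-rRNA mapped sequence divided by the H37Rv genome size (4411532 bp). A minimal Python sketch of this calculation is shown below; the function and constant names are hypothetical.

```python
H37RV_GENOME_SIZE_BP = 4_411_532  # genome size stated in the Table 5.5 legend


def fold_coverage(non_rrna_mapped_mb, genome_size_bp=H37RV_GENOME_SIZE_BP):
    """Fold coverage from non-rRNA mapped sequence given in megabases."""
    return non_rrna_mapped_mb * 1_000_000 / genome_size_bp


# Non-rRNA mapped sequence (Mb) for samples 1 to 6 in Table 5.5A.
for mapped_mb in (218.2, 219.5, 327.5, 268.8, 169.3, 473.7):
    print(mapped_mb, round(fold_coverage(mapped_mb), 1))
```

Because the tabulated megabase values are themselves rounded, a recomputed fold coverage can differ from the tabulated value in the last decimal place for some samples.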


Table 5.6. Transcriptomes used in TSS mapping. Exponential and stationary phase transcriptomes of strains N0153 and N0145 were generated; these were chosen as representative Lineage 1 and 2 strains. cDNA libraries ready for sequencing were constructed by Vertis Biotechnologie AG and sequenced at NIMR.

Sample                     13         14          15          16
ID                         GR1        GR2         145_s4      153_s4
Machine                    HS         HS          GA          GA
Strain                     N0145      N0153       N0145       N0153
Lineage                    L2         L1          L2          L1
Growth phase               EXP        EXP         STAT        STAT
Read Length                SE 50bp    SE 76bp     SE 76bp     SE 76bp
Mapped reads (Million)     7.966042   24.649725   18.745600   21.574243
Unmapped reads (Million)   0.726967   1.839590    0.834070    1.863064
rRNA reads                 2.200098   3.144389    11.548229   10.708733
Non-rRNA (Mb)              288.3      1290.3      431.8       651.9
Fold coverage              65.4       292.0       97.9        147.8


5.3.4 Mapping reads to the H37Rv genome

A reference-based mapping assembly was performed using the reference genome H37Rv and the BWA aligner (Figure 5.5) (section 2.9.3) (Li & Durbin, 2009). The dominance of the rRNA (16S, 23S and 5S) reads is seen in the RNA-seq plots at genomic position ~1.5Mb. The average mapped read number was 61.0 million (ranging from 21.9 to 152.3 million) for the nine exponential phase transcriptomes, and reads were visualised using Artemis software (Figure 5.5). Importantly, no sequence data mapped to region of difference 3 (RD3) in Lineage 1 strain N0153, as shown in Figure 5.6. RD3 contains the mobile prophage ϕRv1 (Hendrix et al., 1999), and is variably deleted in clinical strains of M. tuberculosis (Parsons et al., 2002), including strain N0153.

Figure 5.5. Circular plot of mapped RNA-seq data. Moving from outer to inner circles are the annotated CDS on the forward (blue) and reverse (red) strands. The inner circles are the mapped reads for strain N0153 and N0145. Reads map to forward (blue) and reverse strands (red). Read coverage per base position sampled by 5000 bp windows and are log2 scaled.

[Figure 5.6: Artemis-style read depth plots for Lineage 1 strains N0153, N0072 and N0157 across genes Rv1571, Rv1572c, Rv1573, Rv1574, Rv1575 and Rv1576c; depth axis 0 to 30.]

Figure 5.6. Representation of transcriptome plot based on Artemis. The y-axis shows sequence depth, and all plots are scaled to a depth of 30 bases (scale bar on bottom plot). Reads on the forward strand are in blue, and reads on the reverse strand in red. Part of RD3, which is variably deleted in the MTBC, is shown for Lineage 1 strains N0157, N0072 and N0153.

5.3.5 Identifying strain specific gene deletions

It is known from previous studies that the number of genes within clinical strains is variable due to gene deletions at the lineage and strain-specific level (Tsolaki et al., 2004). One region that includes sixteen gene deletions relative to the reference strain H37Rv was shown previously in Figure 5.6. A microarray-based study identified 224 gene deletions (5.6% of all annotated genes) in a survey of one hundred clinical isolates (Tsolaki et al., 2004). This has important ramifications for the following differential expression analysis: deleted genes need to be removed so that the absence of a gene is not misidentified as a change in expression. A Perl script was written to identify gene deletions based on the genome coverage depths for the respective strains (Appendix A), allowing the removal of only those genes deleted in the six strains. This also presented an opportunity to investigate the nature of the gene deletions within the strains. In total, genome-wide scanning identified 223 genes that were deleted in one or more strains. A gene was classed as deleted when <90% of its annotated bases were covered (Figure 5.7).
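The coverage criterion described above can be illustrated with a short Python sketch. This is not the Perl script of Appendix A; the function names, the toy depth vector, the gene coordinates and the minimum depth of one read per base are all assumptions made for illustration. Only the <90% base coverage threshold is taken from the text.

```python
from typing import Dict, List, Tuple


def covered_fraction(depth: List[int], start: int, end: int, min_depth: int = 1) -> float:
    """Fraction of bases in the 1-based inclusive interval [start, end]
    whose read depth is at least min_depth (min_depth=1 is an assumption)."""
    window = depth[start - 1:end]
    if not window:
        return 0.0
    return sum(1 for d in window if d >= min_depth) / len(window)


def call_deletions(depth: List[int],
                   genes: Dict[str, Tuple[int, int]],
                   threshold: float = 0.90) -> Dict[str, float]:
    """Return {gene: covered fraction} for genes covered over less than
    `threshold` of their annotated length."""
    calls = {}
    for name, (start, end) in genes.items():
        frac = covered_fraction(depth, start, end)
        if frac < threshold:
            calls[name] = frac
    return calls


# Hypothetical 200 bp genome: positions 101-140 have no mapped reads.
depth = [12] * 100 + [0] * 40 + [9] * 60
# Hypothetical gene coordinates (1-based, inclusive).
genes = {"geneA": (1, 100), "geneB": (101, 200), "geneC": (136, 200)}

deleted = call_deletions(depth, genes)
for name, frac in deleted.items():
    print(f"{name}: {frac:.1%} of bases covered, classed as deleted")
```

In this toy case geneC loses only five of its 65 bases and so stays above the threshold, which mirrors how a partial deletion close to the cut-off can fall on either side of the classification.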

[Figure 5.7: heatmap of gene deletions. Columns: the six study strains; rows: deleted genes grouped by functional category, with the ϕRv1 and ϕRv2 prophage blocks marked; colour scale: percentage of gene deleted (0, 10, 100).]

Figure 5.7. Distribution of gene deletions in the six RNA-seq study strains. In total 223 genes were classed as deletions. Strains are hierarchically clustered based on deletions, and genes (rows) are clustered and grouped by existing gene functional categories. Deletions found within genes annotated in the PE/PPE functional category were excluded from the analysis.


Hierarchical clustering based on the identified gene deletions clustered the six strains by lineage and sub-lineage, following the known genome phylogeny. This is expected due to the clonality of the MTBC. Grouping the gene deletions first by functional category and then by genomic position showed that the large blocks of deletions within the insertion sequences and phages category are largely made up of deletions of the prophages. There are two prophages in the H37Rv genome, designated ϕRv1 (Rv1572c-Rv1588c) and ϕRv2 (Rv2646-Rv2659c) (Cole et al., 1998; Hendrix et al., 1999). The first prophage, ϕRv1, is absent from all Lineage 2 strains and from N0153 in Lineage 1. This is an example of convergent evolution, whereby ϕRv1 is absent in more than one lineage, although it is not known whether the phage was deleted in these strains or was never inserted originally. Strains N0153 and N0072 also lack the second prophage (ϕRv2). Whilst deletions were distributed throughout all gene categories, several categories were disproportionately represented relative to the genome-wide number of genes within the category (Figure 5.8). Using a χ² test followed by multiple testing correction, a significant over-representation of gene deletions within the insertion sequences and phages category was identified (p=0.0009). Expressed as a ratio of the deleted genes within the category versus the genome-wide number, insertion sequences and phages were 11-fold over-represented. Under-represented groups were intermediary metabolism and respiration (p=0.004) and genes involved in information pathways (p=0.04), which were 2-fold and 5-fold under-represented respectively.
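A per-category χ² test followed by false discovery rate correction can be sketched as below. The exact contingency table used in this study is not stated, so the 2x2 layout (category versus all other genes, deleted versus non-deleted), the absence of a continuity correction, and the Benjamini-Hochberg procedure are assumptions; the category names and counts are hypothetical and are not the Tuberculist totals.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]].
    All row and column totals must be non-zero."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per category: (deleted genes, all genes in category).
categories = {"cat_A": (30, 120), "cat_B": (5, 900), "cat_C": (15, 480)}
total_deleted = sum(deleted for deleted, _ in categories.values())
total_genes = sum(total for _, total in categories.values())

names, raw_p = [], []
for name, (deleted, total) in categories.items():
    a, b = deleted, total - deleted              # inside the category
    c = total_deleted - deleted                  # deleted, other categories
    d = (total_genes - total) - c                # non-deleted, other categories
    names.append(name)
    raw_p.append(chi2_2x2(a, b, c, d)[1])

results = dict(zip(names, benjamini_hochberg(raw_p)))
```

The fold over- or under-representation quoted in the text would then be the ratio of the deleted fraction within a category to the deleted fraction genome-wide, computed from the same counts.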

As expected, significantly more genes classed as nonessential for growth based on previous classifications were present in the deleted gene set (Sassetti et al., 2003; Sassetti & Rubin, 2003) (χ², p<0.0001). However, ten deleted genes (6.4% of all deletions) were defined as being essential. Four of the genes are annotated as conserved hypotheticals, and the remaining six are involved in cell wall and cell processes (Rv0383c, Rv1974), lipid metabolism (fadD30, desA3), intermediary metabolism and respiration (Rv1524), and information pathways (infB). Four of these deletions were identified in the previously described microarray-based study of one hundred clinical MTBC isolates (Tsolaki et al., 2004). The infB deletion has not been previously identified, and was surprising as the encoded InfB protein is an essential initiation factor of the protein synthesis machinery (Boelens & Gualerzi, 2002). However, the deletion only just met the threshold used to define a gene deletion in this study, with 11.1% of the gene deleted in strain N0153, and <6% in strain N0145.


[Figure 5.8: horizontal bar chart of deleted and non-deleted genes per functional category; x-axis: Number of genes (0 to 1200). Deleted gene counts shown on the bars:
insertion seqs and phages: 62 (***)
conserved hypotheticals: 28
regulatory proteins: 8
unknown: 2
intermediary metabolism and respiration: 18 (**)
cell wall and cell processes: 24
information pathways: 2 (*)
lipid metabolism: 6
virulence, detoxification and adaptation: 7]

Figure 5.8. Distribution of gene deletions grouped by gene function category. For each category the number of deleted genes (black) and non-deleted genes (white) is shown. Actual deleted gene numbers are shown on top of bars. Gene categories with a statistically significant departure from the expected number of deletions are identified by asterisk (*). Categories were tested using a χ² test followed by multiple testing correction (False discovery rate method). * p <0.05, ** p<0.01, *** p<0.001. All functional categories except the PE/PPE category were included in the analysis.


5.3.6 Clustering of strains at the total sample level

The transcriptomic data was first clustered at the total sample level rather than at the level of individual genes. This provides a meaningful analysis of overall expression patterns in the samples, allowing the stratification of strains based only on expression. This provided the first broad analysis of how closely related strains belonging to the same lineage were in terms of transcription, and how the genetic diversity between Lineage 1 and 2 is reflected in functional expression. Clustering was based first on gene expression, or the messenger RNA (mRNA), and then antisense expression, which is transcription that is complementary to the mRNA, and so from the noncoding strand of DNA.

5.3.6.1 Clustering of strains by gene expression

Gene expression was analysed for all annotated genes, excluding those identified as deleted in section 5.3.5. To enable the comparison of different expression levels, data was normalised as reads per kilobase per million reads (RPKM). The RPKM measure was designed to reflect the molar concentration of a transcript in the starting sample by normalising for RNA length and the total number of reads in the transcriptome data set (Mortazavi et al., 2008). Pairwise Spearman correlations of RPKM-normalised gene expression were calculated for all samples, converted into dissimilarities, and the distances between the samples clustered using hierarchical clustering. The resulting dendrogram is shown in Figure 5.9. Branches show bootstrap confidence following 1000 bootstrap replicates, indicating high statistical support. Strains from the same lineage were more closely related than those from a different lineage. The three transcriptomes sequenced as technical replicates, strains N0153 (HS 2), N0145 (GA) and N0031 (GA), were highly related to their respective replicate transcriptome. Technical replicates were from the same source of total RNA for each respective strain, but were from separate cDNA library constructions. In the case of Lineage 2 strains N0145 and N0031, cDNA was sequenced as part of an earlier run on an Illumina Genome Analyser (indicated by the GA suffix), as opposed to later Illumina HiSeq runs (indicated by the HS suffix) for all other strains.
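The normalisation and dissimilarity steps can be sketched in Python as follows. The RPKM formula follows its definition (reads per kilobase of feature per million mapped reads); the conversion of a Spearman correlation into a dissimilarity as 1 - r is an assumption, since the text does not state which conversion was used, and the hierarchical clustering and bootstrap steps are not reproduced here. All function names are illustrative.

```python
import math


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def average_ranks(values):
    """Ranks starting at 1, with tied values given the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_dissimilarity(x, y):
    """1 - Spearman r (assumed conversion), computed on average ranks."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))


# Hypothetical RPKM profiles for two samples over five genes.
sample_1 = [rpkm(c, 1200, 40_000_000) for c in (480, 12, 9600, 150, 0)]
sample_2 = [rpkm(c, 1200, 25_000_000) for c in (310, 20, 5100, 60, 3)]
distance = spearman_dissimilarity(sample_1, sample_2)
```

Because the Spearman correlation depends only on ranks, differences in sequencing depth between the two samples do not by themselves change the resulting dissimilarity.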

In addition to the sequenced clinical strains, three transcriptomes of the reference strain H37Rv were included in clustering analysis. The transcriptomes were previously published by Arnvig et al. (2011). Strains were grown in the same growth media as this study and RNA extracted from exponential phase growth. The method to generate cDNA libraries was not the same as this study, but cDNA was sequenced using the Illumina platform. The three transcriptomes were from three biological replicates, which have similar gene expression, shown in Figure 5.9. Interestingly, whilst the three H37Rv strains tightly cluster together, they are clearly distinct from all Lineage 1 and 2 strains, suggesting the laboratory strain is an outlier relative to the clinical strains with respect to its transcriptional profile.

Comparing back to the underlying genotype, transcriptome diversity parallels genome diversity for the clinical strains (Figure 5.10), whilst the H37Rv transcriptomes do not fit within the expected topology. Based on the genome phylogeny, H37Rv would be expected to cluster alongside Lineage 2 strains and form part of the modern lineages; instead it is clearly outside of Lineage 2 as well as Lineage 1. Whilst the clinical strains did cluster by lineage, the parallel to genotype broke down at the sub-lineage level: for example, Lineage 2 Beijing strain N0145 clustered with N0031, despite being genetically closer to N0052, the other Beijing strain in the study.

[Figure 5.9: dendrogram of the three H37Rv transcriptomes and the Lineage 1 and 2 strain transcriptomes, including technical replicates; bootstrap support values shown on branches; scale bar at top.]

Figure 5.9. Unsupervised hierarchical clustering of total gene expression. Reads normalised as RPKMs for all annotated genes not previously identified as gene deletions. Strain replicates are also shown: strains N0153 (HS 2), N0145 (GA) and N0031 (GA). Node support after 1000 bootstrap replications is shown on branches. Exponential phase H37Rv transcriptomes are shown as reference. Top scale bar indicates Spearman correlation. Branches coloured using previous classification.


Figure 5.10. Relationship of genotypic to transcriptomic diversity. Left hand side image shows the 28-genome MTBC phylogeny constructed in Chapter 3. Right hand side image shows the unsupervised hierarchical clustering of gene expression. The transcriptome diversity parallels genome diversity for the clinical strains.


5.3.6.2 Relationship of SNP distance to gene expression

Gene expression was next compared to SNP distance to explore the effect of genetic diversity at the total sample level. Normalised reads were transformed using a variance-stabilising transformation (VST) so that they were approximately homoscedastic and suitable as input to the distance calculation (Anders & Huber, 2010). All genes therefore have roughly equal influence on the distance, independent of expression strength, which prevents a few strongly expressed genes from dominating. All strains used in this transcriptome study have been previously genome sequenced, and the total number of SNPs that separate these strains was calculated previously (Chapter 3). Figure 5.11 shows the correlation of gene expression distance to SNP distance, using Lineage 2 strain N0145 as the reference. A clear positive correlation between gene expression distance and SNP distance was observed (Spearman r=0.93, p=0.02). The reference strain N0145 itself was not included in the correlation calculation. Using each of the other strains as the reference identified the same significant correlation in the other five comparisons (Spearman r=0.93 to 0.99, p<0.05).
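The distance calculation can be sketched as follows, assuming the expression values have already been variance-stabilised (the transformation itself is provided by DESeq and is not reimplemented here). The strain labels, expression values and SNP counts below are hypothetical placeholders, and the least-squares fit only mirrors the kind of regression line drawn in Figure 5.11.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ols_fit(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical variance-stabilised values for four genes per strain.
vst = {
    "reference": [8.1, 5.0, 6.3, 7.2],
    "strain_a": [8.0, 5.2, 6.1, 7.3],
    "strain_b": [8.9, 4.1, 6.9, 6.0],
    "strain_c": [9.6, 3.5, 7.8, 5.1],
}
# Hypothetical SNP counts separating each strain from the reference.
snp_distance = {"strain_a": 300, "strain_b": 1700, "strain_c": 1900}

strains = sorted(snp_distance)
x = [snp_distance[s] for s in strains]
y = [euclidean(vst[s], vst["reference"]) for s in strains]
slope, intercept = ols_fit(x, y)
```

Every gene contributes one squared difference to the distance, which is why a transformation that equalises variance across expression levels matters before the distances are computed.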

[Figure 5.11: scatter plot titled "N0145 VSD expression distance vs snp distance". x-axis: SNP distance (0 to 2500); y-axis: Genome-wide expression distance (VST) (0 to 80); points: N0031, N0052, N0153, N0157, N0072.]

Figure 5.11. Correlation of SNP distance to gene expression. The x-axis is the number of SNPs of all strains relative to N0145. The y-axis is the Euclidean distance between each transcriptome relative to strain N0145; distances were calculated from the variance-stabilising transformation of the count data. The line shows the Linear Regression slope (Slope = 0.024 ± 0.005). Spearman r = 0.93. Strains coloured by lineage.


5.3.7 Clustering of strains by antisense expression

A similar sample-level analysis was also performed for the antisense transcriptomes. The hierarchical clustering of total antisense expression per transcriptome sample is shown in Figure 5.12. As seen previously for gene expression, strains from the same lineage clustered closer together based on transcriptional expression than those from the other lineage. However, within-lineage comparisons again did not follow the finer sub-lineage structure of the genome phylogeny. Lineage 1 strains N0153 and N0072 did not cluster together, despite being genetically closer to each other than to the third Lineage 1 strain N0157 based on SNP distance.

Interestingly, the two technical replicates for Lineage 2, N0145 (GA) and N0031 (GA), clustered with their respective replicate transcriptomes based on gene expression (Figure 5.9), but not based on antisense expression (Figure 5.12). The two replicates are linked by both having been sequenced on the Illumina Genome Analyser (GA), which might affect the level of antisense detected in these transcriptomes. The two replicates had the lowest number of mapped reads (ranging 21.9-22.6 million reads) compared to the mean of 78.8 million reads for the six non-technical replicates. This may suggest that less abundant and rare antisense transcripts were not detected in these technical replicates due to the low sequencing depth, and that this shared feature is what was detected by the clustering.

[Figure 5.12: dendrogram of total antisense expression for the Lineage 1 and 2 strain transcriptomes, including technical replicates; bootstrap support values shown on branches; scale bar at top.]

Figure 5.12. Unsupervised hierarchical clustering of total antisense expression. Reads normalised as RPKMs for all annotated genes not previously identified as gene deletions. Strain replicates are also shown: strains N0153 (HS 2), N0145 (GA) and N0031 (GA). Node support after 1000 bootstrap replications is shown on branches. Top scale bar indicates Spearman correlation. Branches coloured using previous classification.


5.3.8 Testing for differential expression in RNA-seq data

Measurement of transcription by sequencing is a recent development and currently there is no clear consensus on a standard method to test for differential expression from RNA-seq data (Dillies et al., 2012). Normalisation is necessary to ensure that expression levels are comparable across samples (different cDNA libraries) and also across annotated features, to enable valid inferences about the differential expression of features within or across samples (Robinson & Oshlack, 2010). Importantly, normalisation must account for the fact that read counts arising from a transcript are proportional to the length of the transcript and the total depth of the sample. In section 5.3.6 the RPKM method was used to normalise the data at the total sample level, but Robinson and Oshlack argue that RPKM may not be appropriate for normalisation between libraries of different biological conditions (Robinson & Oshlack, 2010). Central to their argument is that the total number of reads in a sequencing experiment is limited, and this sequencing real estate is competed for by highly expressed genes, leaving less available for the remaining genes. Thus if one sample contains highly expressed genes that are not expressed in other samples, this sampling artifact can skew the differential analysis, giving rise to higher false positive rates and less power to detect true differences. Scaling the libraries with RPKM will not solve the problem, because it assumes that the unknown total RNA output is the same for all libraries.

There are now a number of methods which make the more realistic assumption that the RNA output of a core set of genes is similar between samples, and this is used to create a scaling factor for the samples; several R Bioconductor packages implement this, such as DESeq (Anders & Huber, 2010), baySeq (Hardcastle & Kelly, 2010) and edgeR (Robinson et al., 2010b). All three methods were tested on the transcriptome set. Raw count data for each annotated feature in the technical transcriptome replicates were combined, generating six RNA-seq samples to use in the analysis. To test for differential expression at the lineage level, strains from the same lineage were treated as biological replicates and tested with the three methods above. A statistical cut-off of p<0.05 (false discovery rate corrected) was used to identify statistically significant differential expression. Figure 5.13 shows the number of genes identified as differentially expressed. There was considerable overlap in the genes identified as displaying lineage-specific gene expression, although the number of statistically significant genes ranged from 76 genes using baySeq to 336 genes identified with edgeR; edgeR identified all of the same genes as DESeq (112 genes). For this study all differential expression analysis was performed using DESeq, which was chosen as the best compromise between sensitivity and the number of differentially expressed genes identified.
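The idea of a scaling factor derived from a core set of genes can be illustrated with the median-of-ratios approach described by Anders & Huber (2010). The Python sketch below is a simplified re-implementation for illustration only, not the DESeq code used in this study; the sample names and counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g]
                           for g in usable))
        for s in samples
    }


# Hypothetical counts: sample B is sequenced twice as deeply as sample A and
# additionally contains one very highly expressed gene (last position).
counts = {"A": [10, 20, 30, 40], "B": [20, 40, 60, 4000]}
factors = size_factors(counts)

depth_ratio = sum(counts["B"]) / sum(counts["A"])   # total-count scaling
median_ratio = factors["B"] / factors["A"]          # median-of-ratios scaling
```

In this toy case the total-count ratio is dominated by the single highly expressed gene, whereas the median-based factor recovers the two-fold difference in depth shared by the other genes, which is the situation Robinson and Oshlack describe.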

Figure 5.13. Venn diagram comparing edgeR, DESeq and baySeq differential expression methods. The number of genes identified as differentially expressed using three methods for identifying significantly different gene expression between Lineage 1 and 2 strains. Significance defined as p <0.05 following multiple testing correction.

5.3.9 Lineage-specific gene expression

A total of 112 genes were identified as having a lineage-specific pattern of differential gene expression (based on a statistical cut-off of p<0.05); 88 (78.6%) were more highly expressed in Lineage 1, and 24 (21.4%) more highly expressed in Lineage 2 strains (Figure 5.14). The complete list is shown in Appendix E. Differentially expressed genes were present in all Tuberculist functional categories. Twenty-six of the genes were identified as differentially expressed in previous microarray comparisons of ancient versus modern lineages or M. tuberculosis H37Rv versus M. bovis (a subgroup within Lineage 6) (Golby et al., 2007; Homolka et al., 2010). The greatest significant fold change in gene expression was for galK (Rv0620), which is involved in galactose metabolism and was 39-fold higher in Lineage 1 strains. Antisense expression of Rv0842, a conserved integral membrane protein of unknown function, was 197-fold higher in Lineage 2 strains. Differential expression of galK was not detected in the previous microarray analysis, and it was not possible to measure antisense expression with that experimental design (Homolka et al., 2010).

[Figure 5.14: heatmap. Columns: the six study strains; rows grouped by functional category: conserved hypotheticals; intermediary metabolism and respiration; regulatory proteins; lipid metabolism; virulence, detoxification and adaptation; cell wall and cell processes; unknown; PE/PPE; information pathways. Colour scale: row min to row max.]

Figure 5.14. Heatmap of 112 differentially expressed genes. Expression is based on normalised reads using the DESeq scaling factor method. Colouring is relative, based on the minimum and maximum expression for each gene (row), moving from lower expression (blue) to higher expression (red); scale at bottom of heat map. Strains are hierarchically clustered using Spearman’s rank correlation. Genes (rows) are grouped by Tuberculist functional categories.


5.3.9.1 Transcriptional regulators

Eight transcriptional regulators were identified in Chapter 4 to harbour lineage-specific SNPs or indel mutations predicted to impair the function of the encoded regulatory protein. Three of the regulatory proteins with predicted functional SNPs, Rv0275c, virS (Rv3082c) and Rv3167c were identified in the set of lineage-specific differentially expressed genes. These are shown in Table 5.7.

VirS has previously been shown to act as an inhibitor of its own transcription and as a positive regulator of the adjacent divergently-expressed MymA locus (Rv3083-3085, Rv3086-3089) (Singh et al., 2003). Consistent with its predicted functional impairment by substitution of arginine for leucine within the helix-turn-helix (HTH) DNA-binding domain, virS expression was 17-fold higher in Lineage 1 than Lineage 2, but with no effect on expression of MymA. Targets of transcriptional regulators Rv0275c and Rv3167c are unknown, but the proximity of transcriptional start sites (TSS), identified by additional 5’-enriched RNA-seq transcriptomes (section 2.5.2), suggested that binding of the regulators to upstream sequences would repress transcription of the adjacent divergent genes Rv0276 and Rv3168 (Figure 5.15). Expression of Rv0276 followed Rv0275c in being 10-fold higher in Lineage 2, although it fell outside of the statistical cut-off (p=0.08), whilst Rv3168 expression was 5-fold higher in Lineage 1 (p=0.12).

Table 5.7. Differential expression associated with lineage-specific amino acid mutations. SNPs: Three out of the eight transcriptional regulators with predicted functional lineage-specific SNPs were differentially expressed at the lineage level. Fold change relative to Lineage 1. Modern lineage represents SNPs in Lineage 2, 3, and 4. Frameshift indels: a single base insertion in the regulator Rv3830c was predicted to impair function. Expression of the adjacent genes, Rv3829c and Rv3831, was higher in the predicted intact regulator.

Gene              Function                    Fold change   Mutation           SNP lineage
Predicted functional SNPs
Rv0275c           transcriptional regulator   0.4           L24S               Modern
Rv3082c (virS)    transcriptional regulator   12.0          L316R              1
Rv3167c           transcriptional regulator   3.7           P17Q               1
Frameshift indels
Rv3829c           phytoene dehydrogenase      0.05          inactive Rv3830c   2
Rv3831            hypothetical                0.1           inactive Rv3830c   2

[Figure 5.15: TSS-mapping read depth traces for strains N0145 and N0153. A. Rv0275c and Rv0276 region. B. Rv3167c and Rv3168 region. Maximum read depth per trace is indicated at the right of each panel.]

Figure 5.15. TSS mapping for differential expression of divergently regulated genes. Reads mapping to forward strand in blue, and reads corresponding to reverse strand in red. Scale bar indicating maximum read depth at right of trace. For two of the strains (Lineage 1 N0153 and Lineage 2 N0145), additional RNA-seq was performed after a 5’ phosphate-dependent exonuclease digestion step to facilitate mapping of transcriptional start sites (TSS). A. Differential expression of Rv0276 due to predicted impaired Rv0275c regulator. Overlapping TSS suggest that Rv0275c acts as a repressor of Rv0276. B. Differential expression of Rv3168 due to the predicted impaired Rv3167c regulator. Again TSS show some overlap.

In addition to mutations introduced by SNPs, a frameshift insertion mutation in Lineage 2 was predicted to inactivate Rv3830c due to the resulting fusion with Rv3829c (Table 5.7). Although no significant change was observed in expression of Rv3830c itself, a 14-fold and 21-fold increase in expression of Rv3829c and Rv3831 in Lineage 2 suggested that the functional protein may act as a repressor of the two flanking genes, and that this regulation is lost in the case of the mutant allele.

It was not possible to identify a lineage-specific transcriptional signature for the remaining four regulators, NarL, BlaI, SirR and KdpD, which were also predicted to be functionally impaired (Chapter 4). This may be due to incorrect predictions, or alternatively culture conditions other than routine exponential growth may be required to uncover defects in associated regulatory responses.


5.3.9.2 Expression of the DosR regulon in strains from the Beijing family

It has previously been reported that genes belonging to the DosR regulon are expressed during exponential growth in strains belonging to the Beijing family (Homolka et al., 2010; Reed et al., 2007), but this elevation was not found at the lineage level in this analysis. Only Rv1733c met statistical criteria for up-regulation in Lineage 2 (p=0.023), but an enhanced DosR response was clearly seen in strains N0145 and N0052 (Figure 5.16). The outlying strain, N0031, belongs to a basal branch of Lineage 2 that diversified prior to expansion of the major Beijing branches represented by N0145 and N0052.

A 350kb genomic duplication that includes the DosR operon has been identified in the Beijing strains and has been suggested to contribute to constitutive expression of the DosR regulon (Domenech et al., 2010; Weiner et al., 2012). This duplication is present in N0145 and N0031, but absent from N0052, and therefore cannot account for the observed differential pattern of DosR expression in our study (Figure 5.17).

Figure 5.16. Heat map of dosR regulon. Normalised expression is represented as fold change relative to the mean (colour scale from 0.1 to 10, shown at bottom). Asterisk (*): Rv0571c and Rv0572c are deleted in strain N0157. Hierarchical clustering of strains based on the dosR regulon separates Beijing strains N0145 and N0052 from the remaining strains. N=48 genes plus the small RNA MTS1338. Black indicates no expression.

[Figure 5.17: genome read depth tracks (0 to 255) for Lineage 2 strains N0145, N0052 and N0031 and Lineage 1 strains N0153, N0072 and N0157; x-axis: Genome position (Mb), 0 to 4; the position of dosR is marked.]

Figure 5.17. Duplication of dosR region. Genome read depth across the genome for the six RNA-seq strains. Genome depth maximum height cut-off at 255 bases. Arrows show the position of the dosR operon within the duplicated region in Lineage 2 strains N0145 (a Beijing strain) and N0031 (non-Beijing strain). The previously published genomic duplication that includes the dosR operon extends from 3.5 to 3.8Mb (Domenech et al., 2010), although in strain N0031 the duplication is shorter in length.

Amino acid changing mutations that might alter the function of DosR or related regulatory components were not found. However, a Beijing-specific synonymous SNP (C 3500149 T) was identified within Rv3134c, which encodes a Universal Stress protein (USP) domain, and is itself a member of the DosR regulon (Gerasimova et al., 2011). Rv3134c is immediately upstream of dosR and the SNP generates a TAnnnT -10 consensus motif that is characteristic of actinomycete promoters (where n represents any base) (Figure 5.18A). The classical prokaryotic promoter structure that is recognized by σ70 sigma factors has been defined based on studies of numerous E. coli promoters (Hawley & McClure, 1983), and similar sequences have been identified in other bacteria (Newton-Foot & Gey van Pittius, 2012). The promoter sequence determines the level of expression of a gene and is recognised as the DNA sequence between 10 and 35 bases upstream of the TSS. The TSS is usually a purine base (A or G), and the -10 sequence, also known as the Pribnow box, is a highly conserved hexamer centred about 10 bp upstream of the TSS. As described above, in actinomycetes this sequence consists of highly conserved T and A residues at the first and second positions respectively, and in all cases a T residue in the final position; some variability is found in the three central positions of the motif (Newton-Foot & Gey van Pittius, 2012). The -10 motif has been found to be associated with ~73% of all TSS mapped in M. tuberculosis H37Rv (T. Cortes, unpublished; Newton-Foot & Gey van Pittius, 2012; Zheng et al., 2011).

In Figure 5.18A it can be seen that the Beijing-specific SNP is located seven nucleotides upstream of a novel TSS, and the TSS is expressed in both exponential and stationary phase samples of Beijing strain N0145 (Figure 5.18B). The new TSS is distinct from the standard Rv3134c intergenic TSS associated with growth-phase induction of the DosR regulon and from secondary promoters identified within the Rv3134c gene of M. tuberculosis H37Rv (Bagchi et al., 2005). The resulting transcript is clearly seen in the total RNA profiles and runs through dosR in the two Beijing strains (Figure 5.18A).

A second Beijing-specific SNP (C 3509626 A) similarly generates a TAnnnT consensus motif and associated TSS for the two-component sensor protein encoded by Rv3143. Increased expression was evident in total transcriptome profiles from the two Beijing strains, but in this case downstream targets of the regulator are unknown.
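Whether a SNP creates a TAnnnT -10 motif can be checked with a simple pattern match, sketched below in Python. The function names are illustrative. The two derived hexamers are those listed in Table 5.8 for umaA and mgtA; the ancestral hexamers are inferred here by reverting the upper-case base to the reference allele (C), which assumes the motif is written on the same strand as the H37Rv coordinates.

```python
import re

# T and A at positions 1 and 2, any three bases, T at position 6.
MINUS10 = re.compile(r"TA[ACGT]{3}T")


def has_minus10(hexamer):
    """True if the six-base sequence matches the TAnnnT consensus."""
    return MINUS10.fullmatch(hexamer.upper()) is not None


def snp_creates_minus10(ancestral, derived):
    """True if the derived hexamer matches TAnnnT and the ancestral does not."""
    return has_minus10(derived) and not has_minus10(ancestral)


def find_minus10(sequence):
    """0-based start positions of all (possibly overlapping) TAnnnT matches."""
    return [m.start() for m in re.finditer(r"(?=TA[ACGT]{3}T)", sequence.upper())]


# (ancestral hexamer inferred from the reference allele, derived hexamer from Table 5.8)
examples = {
    "umaA C 560664 T": ("tacaaC", "tacaaT"),
    "mgtA C 649345 T": ("tatgcC", "tatgcT"),
}
created = {name: snp_creates_minus10(anc, der) for name, (anc, der) in examples.items()}
```

A fuller analysis would also check that a mapped TSS lies a suitable distance downstream of the motif, as was done using the 5’-enriched RNA-seq data.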

[Figure 5.18: A. RNA-seq read depth over dosS, dosR and Rv3134c for Lineage 2 strains N0145, N0052 and N0031 (scale 0 to 1500), with the SNP C 3500149 T and the -10 box marked on the sequence 3'-GccgagtTcacatgtacgcggttggccgacgggacaagctgc-5' (coordinate 3500142 labelled). B. TSS-mapping read depth over Rv3134c for strains N0145 and N0153 in stationary and exponential phase (panel scales 0 to 3000 and 0 to 16000).]

Figure 5.18. DosR regulon and SNP-associated TSS. A. Mapped RNA-seq reads in Lineage 2 strains over the DosR region. Reads mapping to forward strand in blue, and reads corresponding to reverse strand in red; plots are shown at an identical scale with scale bar indicating maximum read depth included in the bottom panel. The C/T SNP in Beijing strains is indicated with asterisk (*), and the new TSS 7 nucleotides from the created -10 box is highlighted. Numbering is based on the M. tuberculosis H37Rv genome. B. RNA-seq TSS mapping for Beijing strain N0145 (Lineage 2) and N0153 (Lineage 1) grown in exponential and stationary phase conditions; TSS shown with arrows. The Beijing-specific TSS within Rv3134c is expressed in exponential and stationary phase.


5.3.9.3 SNP-associated TSS

The influence of SNP-associated TSS in generating transcriptional diversity was next explored in the rest of the gene set with lineage-specific transcriptional profiles. Alignment of lineage-specific SNPs with a total transcriptome map of M. tuberculosis (strain H37Rv, T. Cortes unpublished) identified ninety-four instances (1.2% of 7601 TSS) in which a SNP fell within the 30-nucleotide region upstream of a TSS. The frequency was markedly higher amongst the 168 differentially expressed genes and antisense transcripts identified in this study, with 23 of the respective TSS harbouring one or more SNPs in this upstream region (χ², p<0.0001). In ten cases, lineage-specific SNPs generated a new TAnnnT consensus motif linked to a new TSS (Table 5.8).
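The overlap of SNPs with the 30-nucleotide region upstream of each TSS can be sketched as a strand-aware interval lookup. In the Python example below, the function names are illustrative and the window is assumed to exclude the TSS base itself. The first TSS uses the position implied in the text for the Beijing-specific TSS within Rv3134c (reverse strand, seven nucleotides from SNP 3500149); the remaining coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right


def upstream_window(tss_pos, strand, width=30):
    """1-based inclusive genomic interval covering `width` nt upstream of a TSS.
    Upstream means lower coordinates on '+' and higher coordinates on '-'."""
    if strand == "+":
        return tss_pos - width, tss_pos - 1
    if strand == "-":
        return tss_pos + 1, tss_pos + width
    raise ValueError("strand must be '+' or '-'")


def snps_upstream(tss_list, snp_positions, width=30):
    """Map each (position, strand) TSS to the SNP positions in its upstream window.
    TSS without any SNP in the window are omitted."""
    snps = sorted(snp_positions)
    hits = {}
    for tss_pos, strand in tss_list:
        lo, hi = upstream_window(tss_pos, strand, width)
        i, j = bisect_left(snps, lo), bisect_right(snps, hi)
        if j > i:
            hits[(tss_pos, strand)] = snps[i:j]
    return hits


tss_list = [
    (3500142, "-"),   # implied by the text: SNP 3500149 lies 7 nt upstream
    (1000, "+"),      # hypothetical
    (2000, "-"),      # hypothetical
    (3000, "+"),      # hypothetical
]
snp_positions = [975, 1000, 2010, 2031, 5000, 3500149]  # all but the last are hypothetical

hits = snps_upstream(tss_list, snp_positions)
fraction_with_snp = len(hits) / len(tss_list)
```

The proportion of TSS with at least one upstream SNP, computed this way for the differentially expressed set and for all mapped TSS, gives the two frequencies compared in the text.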

Table 5.8. Ten differentially expressed genes associated with change in promoter sequence. Fold change relative to Lineage 1, with >1 higher Lineage 1 expression, <1 higher Lineage 2 expression. Modern lineage includes Lineages 2, 3 and 4. The final mutation column shows the nucleotide change and genomic position as based on H37Rv coordinates, and in brackets the -10 motif created (SNP in bold upper). Sequences read in the 5’ to 3’ direction.

Differentially expressed gene   Function                    Fold change   Lineage with SNP   Mutation
Rv0469 umaA                     mycolic acid modification   2.2           1                  C 560664 T (tacaaT)
Rv0557 mgtA                     mannosyltransferase         3.2           1                  C 649345 T (tatgcT)
Rv0724A -                       methyltransferase           3.6           1                  C 817696 T (tattcT)
Rv1781c malQ                    glucanotransferase          2.2           1                  T 2017560 A (tAcggt)
Rv2051c ppm1                    ppm synthase                2.2           1                  C 2309356 T (taccaT) & T467I
Rv2765 -                        dienelactone hydrolase      7.4           1                  C 3074830 T (tactaT)
Rv3366 spoU                     tRNA methyltransferase      0.2           2                  G 3778011 A (taccAg); G 3778012 T (taccaT)
Rv3679 & Rv3680 -               anion transport ATPase      0.1/0.2       Modern             C 4119246 T (tatgaT)
Rv3812 PE_PGRS62                hypothetical                2.5           1                  C 4276306 T (Taatgt)


For three of the differentially-expressed genes (malQ, Rv3680, PE_PGRS62) the new TSS was located within 542 nucleotides of the predicted translational start, either within an intergenic region or the upstream gene. In Figure 5.19A a SNP within Lineage 1 strains creates a -10 sequence, and a resulting novel Lineage 1 TSS is seen; malQ expression is approximately 2-fold higher in Lineage 1. The remaining six new TSS (umaA, mgtA, Rv0724A, ppm1, Rv2765, spoU) were located within the differentially-expressed gene itself and, if translated, would give rise to truncated protein products. Two SNPs in spoU replace guanine nucleotides to generate the TSS motif. Rv2051c encodes the bifunctional protein Ppm1. As shown in Figure 5.19B, the novel internal transcript is initiated in the middle of the gene and includes the C-terminal polyprenyl phosphomannose synthase domain. The SNP also introduces a T467I mutation at the amino acid level, which was predicted by previous SIFT analysis in Chapter 4 to impair the function of the N-terminal apolipoprotein N-acyltransferase domain. A second internal TSS was present in all strains (Lineage 1 and 2) at position 2309159, suggesting that the option of dissociating the two activities is not unique to Lineage 1.

SNPs that alter residues outside of the -10 motif may also influence promoter activity. For example, a G 4092921 T SNP was associated with a 100-fold increase in reads mapping to a TSS upstream of PE_PGRS60. This mutation changes an existing -10 TAnnnT motif to an “extended -10” TGnTAnnnT consensus (Newton-Foot & Gey van Pittius, 2012). Interestingly, this change is similar to that generated by a SNP that drives increased promoter activity and inhA expression in isoniazid-resistant strains of M. tuberculosis (Ramaswamy & Musser, 1998; Ramaswamy et al., 2003).
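The motif logic described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in this study: the function names and the toy sequence windows are assumptions, and only the hexamers themselves follow Table 5.8. The sketch tests whether substituting the derived allele into a short window creates a TAnnnT or TGnTAnnnT match that overlaps the SNP and was absent from the ancestral sequence.

```python
import re

# -10 consensus (TAnnnT) and "extended -10" consensus (TGnTAnnnT), as named in the text.
# Lookaheads are used so that overlapping matches are all reported.
MINUS10 = re.compile(r"(?=(TA[ACGT]{3}T))")
EXTENDED_MINUS10 = re.compile(r"(?=(TG[ACGT]TA[ACGT]{3}T))")


def motif_hits(seq, pattern, snp_index):
    """Return start offsets of motif matches that overlap snp_index (0-based)."""
    hits = []
    for match in pattern.finditer(seq.upper()):
        start = match.start(1)
        end = start + len(match.group(1))
        if start <= snp_index < end:
            hits.append(start)
    return hits


def snp_creates_motif(window, snp_index, derived_base, pattern=MINUS10):
    """True if the derived allele creates a motif that is absent from the ancestral window."""
    ancestral = window.upper()
    derived = ancestral[:snp_index] + derived_base.upper() + ancestral[snp_index + 1:]
    before = set(motif_hits(ancestral, pattern, snp_index))
    after = set(motif_hits(derived, pattern, snp_index))
    return bool(after - before)


# Toy windows: the flanking bases are invented for illustration.
print(snp_creates_motif("GGTACAACGG", 7, "T"))  # umaA-like: tacaaC -> tacaaT, prints True
print(snp_creates_motif("GGTTCGGTCC", 3, "A"))  # malQ-like: tTcggt -> tAcggt, prints True
# A G to T change upstream of an existing TAnnnT that yields TGnTAnnnT.
print(snp_creates_motif("CGGCTACGGTC", 1, "T", EXTENDED_MINUS10))  # prints True
```

Restricting hits to those overlapping the SNP position avoids counting motifs that already exist elsewhere in the window; a genome-wide scan would additionally need real coordinates and strand handling.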

Figure 5.19. SNP-associated TSS leading to differential gene expression. A. Lineage 1 SNP (T 2017560 A) is associated with a new TSS and 2.2-fold increased expression of malQ in the respective strains. B. Internal coding TSS: A Lineage 1 SNP (C 2309356 T) within ppm1 is associated with a new TSS and 2.2-fold up regulation of ppm1 transcription. The nonsynonymous SNP is predicted to impair lipoprotein N-acyltransferase activity. A second internal TSS present in all strains is also indicated in the TSS mapping inset.

5.3.9.4 Differential antisense expression

A parallel analysis of antisense transcription identified similar conservation by lineage, with a differential expression pattern for 56 genes; 23 were higher in Lineage 1, and 33 in Lineage 2 (Appendix E). Antisense RNAs are transcripts encoded on the strand that is complementary to protein-coding genes. The transcripts are generated either from internal TSS, or from overlapping 3’ untranslated regions (UTRs) in convergent gene pairs, which have been identified previously in the transcriptome of H37Rv (Arnvig et al., 2011). Three of the differentially expressed 3’ UTR antisense transcripts (pcaA, Rv1898 and ribD) were associated with SNPs that create a new TAnnnT-linked forward TSS in the adjacent divergent gene (Table 5.9). In the case of pcaA, shown in Figure 5.20A, a 2-fold increase in umaA gene expression and a 4-fold increase in pcaA antisense expression were detected. For a further six antisense transcripts (Rv0552, Rv0842, Rv0874c, deaD, Rv2672 and fadE20), introduction of a TAnnnT motif on the reverse strand was associated with a new TSS arising within the gene itself. In the case of deaD, a Lineage 2 C to T SNP (a modern branch SNP) creates a new motif and TSS on both the forward and reverse strands of DNA, and is associated with a 41-fold increase in antisense transcription in Lineage 2 (Figure 5.20B).
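A single SNP can create a TAnnnT motif on either strand, or on both, because a C to T change on the forward strand is read as a G to A change on the reverse strand. The following minimal sketch illustrates the check by comparing motif counts before and after the substitution on the window and on its reverse complement. The function names and the toy windows are assumptions for illustration; the first window is constructed so that both strands gain a motif and is not the real deaD sequence.

```python
import re

MINUS10 = re.compile(r"(?=TA[ACGT]{3}T)")  # lookahead so overlapping motifs are counted
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_motifs(seq):
    """Count TAnnnT occurrences on the strand given."""
    return len(MINUS10.findall(seq.upper()))


def strands_gaining_motif(window, snp_index, derived_base):
    """List the strands on which the derived allele adds a TAnnnT motif."""
    ancestral = window.upper()
    derived = ancestral[:snp_index] + derived_base.upper() + ancestral[snp_index + 1:]
    gained = []
    if count_motifs(derived) > count_motifs(ancestral):
        gained.append("forward")
    if count_motifs(reverse_complement(derived)) > count_motifs(reverse_complement(ancestral)):
        gained.append("reverse")
    return gained


# Toy window built so that a C to T change creates TATCAT on the forward strand
# and TATGAT on the reverse strand (flanking bases are invented).
print(strands_gaining_motif("GTATCACAG", 6, "T"))   # prints ['forward', 'reverse']
# Toy window in which only the forward strand gains a motif.
print(strands_gaining_motif("GGTACAACGG", 7, "T"))  # prints ['forward']
```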

Table 5.9. Nine differentially expressed antisense transcripts associated with the introduction of SNP-associated TSS. Mutation column shows SNP lineage (1, 2 or Modern). Where appropriate, predicted functional amino acid changes are shown and the sequence of new -10 and extended motifs with the SNP allele indicated in uppercase. Nucleotide positions are based on the H37Rv genome. Sequences read in the 5’ to 3’ direction.

Gene            Function                   Fold change  Lineage with SNP  Mutation
Rv0470c pcaA    mycolic acid modification  4.1          1                 C 560664 T (tacaaT)
Rv0552          hydrolase                  8.6          1                 C 643483 T (tacacT)
Rv0842          membrane protein           0.01         Modern            C 938246 T (taggcT)
Rv0874c         hypothetical               0.2          2                 C 972980 T (taggcT)
Rv1253 deaD     RNA helicase               0.02         Modern            C 1400396 T (tatcaT)
Rv1898          hypothetical               0.1          Modern            C 2145878 T (tacccT)
Rv2671/72 ribD  riboflavin biosynthesis    82.2/8.2     1                 C 2987918 T (tacacT)
Rv2724c fadE20  acyl-CoA dehydrogenase     0.1          2                 C 3036826 T (tagcaT)

Figure 5.20. SNP-associated TSS leading to differential antisense expression. A. A SNP-associated TSS in 3’ region of umaA in Lineage 1 strains is associated with higher umaA gene expression (2.2-fold) and pcaA antisense expression (4.1-fold). B. A SNP within deaD in all Lineage 2 strains is associated with a new TSS and 41.2-fold increase in antisense transcription. The SNP also creates a -10 consensus on the forward strand; this is associated with a new TSS but has no significant impact on the level of sense transcription.

Although not a lineage-specific example, a highly expressed antisense transcript within ino1, an essential gene involved in virulence (Movahedzadeh et al., 2004), was detected in Lineage 1 strain N0157 and is also associated with a new TAnnnT motif, created by a C 50557 T SNP. Interestingly, this is a homoplasic SNP; such SNPs are rare in M. tuberculosis (Comas et al., 2009; Schürch et al., 2011). The SNP is also present in a sub-branch of Lineage 4 that includes strain H37Rv, which likewise expresses the antisense transcript (Arnvig & Young, 2012).

5.3.10 Enrichment of toxin-antitoxins

It was not possible to identify direct SNP associations for the remainder of the genes showing lineage-specific patterns of differential expression. It is likely that their differential expression reflects downstream consequences of primary mutations. Analysis of the panel of differentially expressed genes according to functional category identified a 2-fold over-representation of proteins involved in virulence, detoxification and adaptation. This was found to be driven by ten toxin-antitoxin (TA) genes, and a separate classification of all TA as an independent category revealed a 2.9-fold over-representation in the differentially expressed set compared to the genome representation (χ2, p=0.03) (Figure 5.21). The full table is shown in Appendix F. Six of the TA were chosen and the pattern of differential gene expression seen by RNA-seq was confirmed by quantitative RT-PCR (Figure 5.22). Additional strains were included in this analysis to widen the lineage set (section 5.2.1.3). vapB10 fell outside the RNA-seq statistical significance cut-off (p=0.06), but was also shown to be differentially expressed by qRT-PCR.
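The over-representation measure used here is a ratio of two proportions: the share of a category among differentially expressed genes divided by its share in the genome. A minimal sketch of that calculation, together with a Pearson χ2 test on a 2x2 table, is given below. The function names are assumptions, the counts are hypothetical placeholders rather than the counts from this study, and the exact test layout used for the reported p-value is not specified in the text, so the sketch is not expected to reproduce it.

```python
import math


def fold_representation(hits_in_set, set_size, hits_in_genome, genome_size):
    """Proportion of a category in the gene set divided by its proportion genome-wide."""
    return (hits_in_set / set_size) / (hits_in_genome / genome_size)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for one degree of freedom, using
    P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with 1 d.f.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical placeholder counts, NOT the counts from this study:
# rows = in / not in the differentially expressed set, columns = TA gene / other gene.
in_set_ta, in_set_other = 10, 90
rest_ta, rest_other = 40, 860

fold = fold_representation(in_set_ta, in_set_ta + in_set_other,
                           in_set_ta + rest_ta,
                           in_set_ta + in_set_other + rest_ta + rest_other)
stat, p = chi_square_2x2(in_set_ta, in_set_other, rest_ta, rest_other)
print(f"fold representation = {fold:.2f}, chi-square = {stat:.2f}, p = {p:.4f}")
```

With small expected counts an exact test (for example Fisher's exact test) would usually be preferred over the χ2 approximation.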

[Figure 5.21 plot. Functional categories shown: toxin-antitoxins; unknown; virulence, detoxification, adaptation; cell wall and cell processes; lipid metabolism; conserved hypotheticals; regulatory proteins; PE/PPE; intermediary metabolism and respiration; information pathways. x-axis: functional category representation (ratio).]

Figure 5.21. Over-representation of differentially expressed toxin-antitoxins. Ratio of significant differential gene expression grouped by functional category, compared to the genome-wide representation of the category. Values on the x-axis represents the difference as fold-change, positive fold-change indicates over-representation of a particular function category, negative values under-representation, whereas a fold-change of one indicates no difference. As a separate toxin-antitoxin category, there were 2.9-fold more toxin-antitoxins than expected (χ2, p=0.03).

Figure 5.22. Validation of select RNA-seq differentially expressed toxin-antitoxins (solid bars) by qRT-PCR (striped bars). Fold change relative to Lineage 1 expression on y-axis (log10 scale), and bars coloured by lineage with higher expression. Error bars for qRT-PCR indicate the standard deviation of three biological replicates.

Table 5.10. Ten differentially expressed toxin-antitoxins (TA). Mutation column shows SNP lineage (1, 2 or Modern). Where appropriate, predicted functional amino acid changes are shown and the sequence of new -10 and extended motifs with the SNP allele indicated in uppercase. Nucleotide positions are based on H37Rv genome.

Gene            Function   Fold change  Lineage with SNP  Mutation
Rv1103c mazE3   antitoxin  2.2          -                 -
Rv1397c vapC10  toxin      0.1          Modern            G103D
Rv2063 mazE7    antitoxin  15.7         -                 -
Rv2063A mazF7   toxin      4.9          1                 R101P
Rv2274A mazE8   antitoxin  3.2          -                 -
Rv2526 vapB17   antitoxin  0.2          -                 -
Rv2527 vapC17   toxin      0.1          -                 -
Rv2596 vapC40   toxin      2.2          -                 -
Rv2758c vapB21  antitoxin  2.5          -                 -
Rv2830c vapB22  antitoxin  2.5          Modern            G 3137237 A

Transcription of TA modules is generally repressed by binding of the cognate toxin-antitoxin complex to the promoter region, and activated when the antitoxin is degraded in response to signals associated with environmental stress (Buts et al., 2005). Differential expression could result from mutations that affect stability or repressor activity of the toxin-antitoxin complex, mutations that alter promoter sequences, or mutations that alter proteolytic activity in the cell. Two differentially expressed toxins have nonsynonymous lineage-specific SNPs: VapC10 (Lineage 2, G103D) and MazF7 (Lineage 1, R101P) (Table 5.10), but the SIFT algorithm was unable to predict functional consequences for these mutations. All TA pairs with detectable transcripts were expressed from a single major TSS. In two cases the TSS was located within the annotated coding sequence, suggesting that the translational start sites are annotated incorrectly. In the majority of cases (31 out of 51 expressed TA pairs; 60.8%), the TA pairs were encoded by leaderless mRNAs. A single TSS-associated SNP was identified, with position -1 of the vapB22 (Rv2830c) TSS switched from G to A in Lineage 2 strains, accompanied by a decrease in expression.

Given the lack of direct SNP associations, it could be concluded that differential expression of TA genes reflects general differences in regulatory networks between the two lineages. A series of genes that are preferentially expressed in Lineage 1 strains have previously been implicated in the H37Rv response to acid stress and cell wall damage, including ahpC and ahpD, fabD and lpqS (Fisher et al., 2002). Up-regulation of these genes may be associated with the stress-related sigma factor sigB (Rv2710), which has 2-fold higher expression in Lineage 1, but falls outside the statistical cut-off (p=0.06).

5.4 Discussion

5.4.1 Strengths and limitations of the study

The aim of this study was to identify the lineage-specific expression profiles of Lineage 1 and 2 and to relate these back to the underlying genotype of the respective lineages. For the first time the total RNA expression of clinical MTBC strains was uncovered using a sequence-based approach. The RNA-seq data generated has intrinsic advantages over previous transcriptional analysis methods that rely on hybridisation of targeted oligonucleotides to specific loci (qRT-PCR), hybridisation of cDNA to multiple probes (microarray) or labeled probes binding to RNA (Northern blotting) (Croucher & Thomson, 2010). Firstly, RNA-seq is not biased, as it does not rely on probes designed from prior knowledge of the sequence; all transcripts are therefore studied, including mRNA, antisense and non-coding transcription. Secondly, as the method is sequence based, the resolution is more precise than hybridisation, effectively sampling all positions within the transcripts, and non-specific hybridisation is not an issue (Kane et al., 2000). Finally, the dynamic range of RNA-seq is effectively unlimited, and defined by the amount of sequence coverage that can be generated in the experiment, whereas the detection of fluorescence or radioactivity can become saturated using microarrays. Ultimately, the transcriptome data generated in this study is more discriminatory at high and low expression levels, and provides an unbiased view of transcription in the MTBC strains.

Whilst one of the advantages of the RNA-seq method is the sampling of all RNA species, this can also become a drawback through dominance of the transcriptome data by highly expressed transcripts, such as ribosomal RNA. In this study, about 90% of the total sequence data was attributed to rRNA, effectively saturating the dataset by out-competing all other transcripts for sequence data. Exclusion of such transcripts is more difficult than with microarray experiments, where rRNA probes can simply be omitted from the chip design. Several methods exist to remove abundant transcripts, including the use of terminator exonucleases that specifically degrade transcripts with a 5’-monophosphate group (Sharma et al., 2010a), or hybridization of magnetic beads linked to oligonucleotides complementary to rRNAs (Camarena et al., 2010; Yoder-Himes et al., 2009). Although such methods are attractive, their significantly increased cost, the potential for sample degradation and introduced bias (Croucher et al., 2009; Yi et al., 2011), and the availability of the high sequence output from the Illumina HiSeq2000 sequencer at NIMR rendered these options unnecessary for the differential expression analysis performed in this study. However, the terminator exonuclease method was used in the transcriptional start site (TSS) mapping analysis, which effectively biased the sequence coverage to the 5’ end of transcripts, thus facilitating the accurate mapping of TSS. As larger studies seek to sequence more strains, it may become necessary to use a depletion step to enable multiplexing of cDNA from multiple strains into a single Illumina flowcell lane, for example with the Epicentre ScriptSeq v2 preparation kit (Cat. No. RSBC10948) released in 2012, which allows up to twelve indexed cDNA libraries to be pooled together into one lane, thereby decreasing the cost of sequencing and providing a rapid increase in potential experiment size.

The experimental design of this study was to identify the lineage-specific expression profiles of two MTBC lineages. The RNA-seq data was therefore mapped to a common reference genome sequence using M. tuberculosis genome annotations based on H37Rv; in this case the sequence was the reconstructed ancestor of the MTBC determined from the phylogeny in Chapter 3, and the annotations were based on Tuberculist annotations (Lew et al., 2011). This is advantageous as the set of genes is common to the dataset, allowing comparison of expression levels across all strains. However, a disadvantage of a reference-based mapping process is that the expression of any novel transcripts present in the samples goes undetected. About a quarter (23.3%) of the filtered high quality transcriptome data did not map to the reference genome, which could suggest that some highly expressed transcripts are not detected in this analysis. The mapping algorithm (BWA) (Li & Durbin, 2009) and parameters used could accommodate gaps of up to three mismatches, and therefore larger indels may account for some of the unmapped sequences. Future non-reference based de novo mapping of the sequences has the potential to identify novel transcripts not present in the reference strain H37Rv, although such analyses are computationally very expensive and would be more effective using paired-end reads instead of the single-end reads generated in this study (Schulz et al., 2012).

5.4.2 Lineage-specific expression

Clustering analysis of the transcriptome samples identified significant correlation in transcription between strains of the same lineage based on both sense and antisense expression, suggesting that the underlying lineage-specific variation is functional and results in differential transcription. This was strengthened by the positive correlation between the number of diverging SNPs and gene expression distance (Figure 5.11). At the gene level, differential analysis identified a total of 112 genes with significant lineage-specific patterns of expression. A quarter of the genes (26 out of 112 genes) were identified as differentially expressed in previous microarray comparisons of ancient versus modern lineages or M. tuberculosis H37Rv versus Mycobacterium bovis (within Lineage 6), suggesting that the RNA-seq method is concordant with other gene expression methods (Golby et al., 2007; Homolka et al., 2010). Furthermore, qRT-PCR analysis of a select number of differentially expressed genes identified the same direction of fold change as the RNA-seq data, despite the addition of strains not used in the RNA-seq study; this strengthens the case that there is no selection bias in the strains used, and that the lineage-specific patterns of expression are a general phenomenon of the respective lineage.

A parallel analysis of antisense transcription identified similar conservation by lineage, with a differential expression pattern for 56 genes. Pervasive expression of antisense transcripts has been recognised as a common feature of bacterial transcriptomes (Lasa et al., 2011; Raghavan et al., 2012). Comparison of upstream sequences in Escherichia coli and Salmonella typhimurium suggests that selective pressure for conservation of antisense promoters is lower than in the case of sense promoters (Raghavan et al., 2012). Parallel sequencing of the above species by Raghavan et al. identified only eight highly expressed antisense transcripts common to both species, out of the approximately one hundred antisense transcripts from orthologous gene pairs found between the two species. This could have been due to a species-specific function of the antisense transcripts, but no evidence of conservation was found within strains of E. coli either (Raghavan et al., 2012). In contrast, this study found a broadly similar pattern of sense and antisense diversity in the MTBC lineage comparison, which could reflect the reduced purifying selection and increased genetic drift within the MTBC (Hershberg et al., 2008). Currently, the biological significance of antisense transcripts is unknown, and of the thousands of proposed antisense transcripts in E. coli only a few have been functionally characterised (Fozo et al., 2008; Kawano et al., 2007). The conservation of antisense transcripts in a lineage-specific pattern in the MTBC is interesting and suggests a functional role. It is possible that double-stranded RNA molecules differ from single-stranded mRNAs in their efficiency of translation and susceptibility to degradation, which could add another layer of regulation (Thomason & Storz, 2010) and should not be ignored in future studies of MTBC diversity.

5.4.3 Linking genotype to phenotypic consequences at the transcriptional level

Bioinformatic analyses in Chapters 3 and 4 suggested a high percentage of nonsynonymous SNPs identified across the MTBC were likely to impair protein function. In this study, three mechanisms by which transcriptome diversity is generated were identified and these are discussed in the following sections.

5.4.3.1 Transcriptional regulators

Focusing on Lineages 1 and 2, functional impairment of eight transcriptional regulators was predicted in Chapter 4. Transcriptional profiling provided confirmatory evidence in four of these cases: virS, Rv0275c, Rv3167c, and Rv3830c. Increased transcription was observed for three regulatory proteins with mutations affecting the helix-turn-helix motif, consistent with a loss of autorepression. Elevated expression of virS in Lineage 1 recapitulates the results of a previous microarray comparison of modern and ancient lineages (Homolka et al., 2010), with the absence of activation of the associated MymA regulon providing further indication that the mutant VirS lacks functional activity. Differential expression of virS has also been observed in the comparison of M. tuberculosis and M. bovis transcriptomes, with 10-fold higher virS expression in M. bovis (Golby et al., 2007). Interestingly, another virS lineage-specific SNP was found at amino acid residue 322, six amino acids away from the above Lineage 1 SNP; it defines all animal-adapted MTBC strains and leads to an amino acid change (R322C) that is also predicted to be functional by SIFT. Experimental deletion of virS in M. tuberculosis H37Rv resulted in pleiotropic cell wall defects and reduced growth in the spleen of guinea pigs (Singh et al., 2005), raising the possibility that this mutation may reduce the virulence of Lineage 1 strains. Transcription of Rv0275c and Rv3167c was similarly upregulated in strains carrying the mutant allele. Neither of these proteins has been characterised, but RNA-seq profiles were consistent with the functional proteins acting as autorepressors and inhibitors of adjacent genes. Predicted inactivation of Rv3830c by a frameshift mutation causing fusion to an adjacent protein did not result in a significant change in expression, but flanking genes (phytoene dehydrogenase Rv3829c, and Rv3131 with unknown function) were markedly upregulated in Lineage 2.

Whilst no transcriptional phenotype was detected in this study for the remaining four transcriptional regulators, analysis of the response to specific stimuli, rather than exponential phase culture alone, may uncover functional defects. For example, the BlaI regulator is activated in the presence of beta-lactams (Sala et al., 2009), and therefore the predicted impairment of BlaI in Lineage 1 may only be identified under these conditions. Similarly, low potassium may uncover functional defects of KdpD in Lineage 1 strains; KdpD is a sensor protein of the Kdp potassium transport system (Steyn et al., 2003; Walderhaug et al., 1992).

5.4.3.2 SNP-associated TSS

In addition to amino acid changes in regulatory proteins, genes with lineage-specific patterns of differential expression were characterised by a high frequency of SNPs associated with transcriptional start sites (TSS). A striking observation was that SNPs generating a -10 consensus motif (TAnnnT) were frequently associated with the emergence of a new TSS. SNP-created TAnnnT motifs could account for 19 of the 168 (11%) lineage-specific differentially expressed genes and antisense, and also for exponential phase expression of the DosR regulon in the Beijing family. SNPs falling outside of the -10 motif may also affect promoter activity. Creation of an “extended” -10 consensus (TGnTAnnnT) resulted in enhanced expression, and changes at the -1 position were associated with higher TSS activity.

In addition to their effect on expression of downstream genes, as in the case of Rv3134c/DosR for example, TSS arising within coding regions may also play a role in generating functionally active truncated proteins. Ppm1 (Rv2051c) is a bifunctional enzyme, fusing an N-terminal apolipoprotein N-acyltransferase with a polyprenyl phosphomannose synthase; these are encoded by separate genes in other mycobacteria (Gurcha et al., 2002). Combination of the two activities in a single polypeptide is likely to assist in coordination of the final steps in the post-translational modification of glycosylated lipoproteins: the N-acyltransferase completes the tri-acyl lipid tail, and polyprenyl mannose provides the sugar donor for glycosylation. An internal TSS provides the option of separating the two activities, freeing the polyprenyl phosphomannose synthase to participate in other glycosylation pathways. The presence of a conserved internal TSS suggests that this option is retained by all members of the MTBC, with additional flexibility in Lineage 1 provided by a SNP that is associated with a new TSS and predicted impairment of N-acyltransferase activity. It has been proposed that changes in the mannosylation of cell surface components have an important impact on recognition of mycobacteria by receptors on innate immune cells (Torrelles & Schlesinger, 2010), and redistribution of mannose between lipoglycans and lipoproteins represents an attractive hypothesis to account for the differential inflammatory response to Lineage 1 and Lineage 2 strains (Portevin et al., 2011). Enhanced Lineage 1 transcription of mgtA (Rv0557, previously also referred to as “PimB”) could also contribute to differences in macrophage phenotype (Torrelles et al., 2009).

New TSS associated with SNP-generated TAnnnT motifs were also observed at a similar frequency in antisense orientation. The biological significance of antisense transcripts is unknown; it is possible that double-stranded RNA molecules differ from single-stranded mRNAs in their efficiency of translation and susceptibility to degradation. Identification of a Lineage 1 SNP associated with a new TSS in umaA that generates antisense to the adjacent pcaA raises the intriguing possibility of a mechanism for co-ordinated regulation of the two genes. Both proteins are involved in modification of mycolic acids, and lineage-specific differential expression could again contribute to variation in innate immune reactivity (Rao et al., 2006; Barkan et al., 2012).

More generally, this study has uncovered a potentially important mechanism of generating transcriptional diversity through SNP-associated TSS. Mutation generates the variation on which selection acts, but mutation is not a completely stochastic process, and several biases exist (Hershberg & Petrov, 2010). It has been shown that mutation is AT-biased in clonal organisms including M. tuberculosis, and is dominated by nucleotide transitions from C or G to T or A (Hershberg & Petrov, 2010). This was also found to be the case for the lineage-specific SNPs identified in Chapter 3 (Figure 5.23), with a mean of 64.5% of all SNPs resulting in a G to A or C to T transition. Because these transitions introduce the T and A bases required by the TAnnnT motif, this suggests the potential for many other SNP-associated TSS within the MTBC. Such sites should be an initial focus of subsequent transcriptome studies, along with the predicted functional mutations at the amino acid level described earlier.
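A mutation spectrum of this kind is a simple tally of ancestral-to-derived base changes. The sketch below shows one way it could be computed; the function names and the toy SNP list are hypothetical and are not data from this study. The final two lines only check the arithmetic of the quoted mean against the three per-class percentages given in the Figure 5.23 caption.

```python
from collections import Counter

# The two AT-directed transitions discussed in the text.
AT_DIRECTED_TRANSITIONS = {("G", "A"), ("C", "T")}


def spectrum(snps):
    """Percentage of each (ancestral, derived) base change in a list of SNPs."""
    counts = Counter((anc.upper(), der.upper()) for anc, der in snps)
    total = sum(counts.values())
    return {change: 100.0 * n / total for change, n in counts.items()}


def percent_gc_to_at_transitions(snps):
    """Percentage of SNPs that are G to A or C to T transitions."""
    pct = spectrum(snps)
    return sum(pct.get(change, 0.0) for change in AT_DIRECTED_TRANSITIONS)


# Hypothetical toy SNP list of (ancestral, derived) alleles.
toy_snps = [("C", "T"), ("G", "A"), ("C", "T"), ("A", "G"), ("G", "C")]
print(percent_gc_to_at_transitions(toy_snps))  # 3 of 5 SNPs, prints 60.0

# Unweighted mean of the nonsynonymous, synonymous and intergenic percentages
# reported in the Figure 5.23 caption.
class_values = [56.7, 76.2, 60.7]
print(round(sum(class_values) / len(class_values), 1))  # prints 64.5
```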

[Figure 5.23 panels A to C: bar charts showing the percentage of nonsynonymous, synonymous and intergenic SNPs (y-axis) for each mutation type (x-axis: G/C, C/G, G/T, C/A, A/T, T/A, A/C, T/G, G/A, C/T, A/G, T/C).]

Figure 5.23. Rates of the types of nucleotide mutations across A. nonsynonymous, B. synonymous and C. intergenic regions. Lineage-specific SNPs result in G/C to A/T transitions 56.7% for all nonsynonymous SNPs, 76.2% for all synonymous SNPs and 60.7% for all intergenic SNPs.

5.4.3.3 Landscaping of toxin antitoxins

For the remaining differentially expressed genes no direct genotypic link was identified, and it is presumed that they reflect secondary adaptive responses. The most striking feature was the over-representation of toxin-antitoxin (TA) gene pairs, which contributed ten percent of the total set of differentially expressed genes. Differential expression of TAs is also a feature of previous microarray studies comparing M. bovis with M. tuberculosis (Golby et al., 2007), and “ancient” with “modern” strains (Homolka et al., 2010). TA systems were originally identified by their role in plasmid maintenance, but they are now recognised as a common feature of bacterial genomes (Pandey & Gerdes, 2005). With 62 TA pairs in the current Tuberculist database (Lew et al., 2011), M. tuberculosis has more TAs than any other intracellular bacterium (Makarova et al., 2009; Pandey & Gerdes, 2005). The toxin component is typically an endonuclease, with activity directed towards ribosome-associated mRNAs, rRNAs and tmRNA, resulting in blockage of translation. An attractive hypothesis is that the role of TAs in M. tuberculosis is to drive the bacteria into reversible growth arrest in unfavourable environments, by responding to changes in antitoxin stability and proteolytic activities. Based on this model, the differential expression of TA genes is interpreted as a read-out of lineage differences in environmental sensing. Comparison of the overall TA transcription response suggests that the core lineage pattern is overlaid by strain-specific responses, and it can be envisaged that variability in the combined proteolytic and transcriptional regulatory network could readily generate heterogeneity within clonal populations.

Chapter 6 Final discussion

In this thesis, M. tuberculosis, the principal etiologic agent of tuberculosis in humans, was investigated at the population level using genomic and transcriptomic approaches made possible through use of new DNA sequencing technologies. Prior to this study, it had been hypothesised that a high percentage of genetic diversity in the MTBC will be functional due to a low frequency of purifying selection (Hershberg et al., 2008). The MTBC is known to exist as six major lineages, and the overarching aim of this study was to explore the nature of the genetic diversity at the lineage level and identify the extent to which this has translated into functional diversity at the transcriptional level. The results of these studies, their impact, and avenues for future work are discussed in the following section.

The potential to further our understanding of MTBC diversity was underscored by the defining study in 2010 by Comas et al., which provided the first representative genome-wide phylogeny of global genetic diversity at single nucleotide resolution (Comas et al., 2010). Twenty-one isolates were selected from a global collection of strains, creating a robust genomic framework on which to base future analyses. In Chapter 3, the clonal population structure of the MTBC was exploited to reveal for the first time all lineage-specific SNPs, which were captured using an expanded 28-genome phylogeny. This was only possible due to the absence of horizontal gene transfer and recombination in the MTBC, with the result that the MTBC evolves by descent. This property was underscored by the extremely low level of homoplasic SNPs, with only 0.14% of all lineage-specific SNPs present in more than one lineage. SNPs are the most abundant form of genetic diversity in the MTBC and as such this variation is anticipated to contribute significantly to the genetic background of the lineages. Accounting for potential discovery bias in the set of genomes used, the 2,794 SNP set is robust and is not expected to change significantly in future studies. From a mechanistic point of view, the SNPs identified in this study are directly applicable to SNP-based typing assays, as has been demonstrated recently (Stucki & Gagneux, 2012). Previously, deletions identified in the M. tuberculosis genome have proved useful targets for typing (Kong et al., 2006), but efforts are moving towards SNP typing as genome sequencing costs decrease (Comas et al., 2009). In addition to typing newly isolated strains, knowledge of the underlying background genetic variation is important for distinguishing phylogenetically informative SNPs from those associated with drug resistance, as demonstrated by the presence of lineage-specific SNPs identified in this study that were also present in the database housing the largest collection of mutations causally linked to drug resistance (Sandgren et al., 2009).

From an evolutionary perspective, the hypothesised reduced selective constraint in the MTBC might be assumed to create a situation whereby nonsynonymous SNPs accumulate in genes irrespective of biological function (Hershberg et al., 2008). The degree of purifying selection was first tested in the lineage-specific set using the dN/dS measure, and similarly low levels of purifying selection were found in all lineages. As a validation of these results, the dN/dS ratios were congruent with those found in different MTBC SNP datasets that focused on either a restricted number of genes (Hershberg et al., 2008) or more generally on all identified SNPs (Comas et al., 2010). Due to the genome-wide nature of this study, it was possible to focus on gene functional categories and ask whether purifying selection is uniform across them, and so whether all categories are experiencing the same random genetic drift. This was not the case: there was a gradient in the removal of nonsynonymous SNPs, with the information pathways category harbouring the fewest SNPs, whilst a significant accumulation of amino acid changing SNPs was observed in the regulatory category. Interestingly, it has previously been found that genes involved in essential functions experience a greater level of purifying selection (Comas et al., 2010). This was also observed in this study and is likely the influencing factor in the observed result; the information pathways category has the highest proportion of essential genes whilst the regulatory category has the lowest. It has also been reported that genome sequencing of strains from the Beijing group of Lineage 2 found an overrepresentation of nonsynonymous SNPs in regulatory genes (Schürch et al., 2011). Together this suggests firstly that, whilst low purifying selection is acting across all lineages and gene categories, removal of potentially deleterious SNPs is still detectable, and secondly that the enrichment of nonsynonymous SNPs in genes with a regulatory function could result in alterations in the response to environmental signals between the lineages.
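
As a rough illustration of how a per-category dN/dS comparison works, the following sketch computes an uncorrected ratio: nonsynonymous changes per nonsynonymous site divided by synonymous changes per synonymous site. It is a simplification that omits the multiple-substitution corrections used by standard methods, and it is not the calculation performed in Chapter 3. The function name `dn_ds`, the category labels and all counts are hypothetical.

```python
def dn_ds(nonsyn_snps, syn_snps, nonsyn_sites, syn_sites):
    """Uncorrected dN/dS from SNP counts and numbers of available sites."""
    if syn_snps <= 0 or nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("need at least one synonymous SNP and positive site counts")
    dn = nonsyn_snps / nonsyn_sites  # nonsynonymous changes per nonsynonymous site
    ds = syn_snps / syn_sites        # synonymous changes per synonymous site
    return dn / ds


# Hypothetical counts: (nonsynonymous SNPs, synonymous SNPs,
#                       nonsynonymous sites, synonymous sites)
categories = {
    "information pathways": (60, 40, 150_000, 50_000),
    "regulatory proteins": (90, 30, 150_000, 50_000),
}
ratios = {name: dn_ds(*counts) for name, counts in categories.items()}

# A lower ratio indicates more efficient removal of nonsynonymous changes.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: dN/dS = {ratio:.2f}")
```

With these invented numbers the information pathways category shows the lower ratio, mirroring the direction of the gradient described above; the magnitudes carry no meaning.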

Whilst the lineage-specific SNP set is an important pool of genetic diversity, the identification of nearly three thousand SNPs is difficult to manage from a phenotypic point of view. To generate a focused SNP set for later phenotypic analysis, and to further understand the genome-wide effect of the high nonsynonymous SNP frequency observed in Chapter 3, a predictive computational approach was undertaken in Chapter 4. Based on evolutionary information it was found that nearly half of all nonsynonymous SNPs introduce an amino acid change at positions conserved in all other mycobacteria, and are therefore likely to have a functional effect. This confirms a previous expectation of a high number of functional SNPs based on a restricted MLSA dataset (Hershberg et al., 2008), and strengthens the observation that this is a phenomenon specific to the MTBC and not mycobacteria-wide; the same method applied to the MTBC outlier, M. canettii, found half the level of predicted functional SNPs. Together this suggests a significant potential for functional diversity in the MTBC due to nonsynonymous SNPs. The MTBC is thought to have originated in Africa, and its association with humans over a long time frame has likely resulted in interactions between human genetic diversity and MTBC variation (Gagneux, 2012). Interestingly, a similar phenomenon to that found in this study has also been observed in humans, where recent demographic expansions have distorted basic principles of population genetics and led to the accumulation of low frequency genetic variants associated with strong functional effects (Keinan & Clark, 2012; Tennessen et al., 2012).
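
The conservation argument can be made concrete with a deliberately simplified sketch. It is a stand-in for the idea only and not the predictive tool applied in Chapter 4: a nonsynonymous SNP is flagged as likely functional when the affected alignment column is identical across all other mycobacterial orthologues and the variant residue differs from it. The function names and the example columns are hypothetical.

```python
def is_conserved(ortholog_residues):
    """True if every orthologue has the same amino acid at this position (no gaps)."""
    residues = set(ortholog_residues)
    return len(residues) == 1 and "-" not in residues


def predicted_functional(ortholog_residues, variant_residue):
    """Flag a substitution that alters a position conserved in all orthologues."""
    return is_conserved(ortholog_residues) and variant_residue not in ortholog_residues


# Hypothetical alignment columns, one character per orthologous species.
print(predicted_functional("LLLLL", "P"))  # conserved leucine replaced: True
print(predicted_functional("LLILL", "P"))  # column already variable: False
print(predicted_functional("LL-LL", "P"))  # gap in one orthologue: False
```

Real predictors weight substitutions by physicochemical similarity and by the depth of the alignment, so this binary rule should be read only as the limiting case of that reasoning.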

In light of the current slew of genome sequencing studies and the corresponding explosion in growth of databases, including dbSNP and DGV on the human genetics side (Iafrate et al., 2004; Sherry et al., 1999) and TubercuList, TBDB and PATRIC on the MTBC side (Lew et al., 2011; Gillespie et al., 2011; Reddy et al., 2009), it can be envisaged that the field is rapidly on course to catalogue the majority of genetic variation. The activity of this field is demonstrated by a simple PubMed search for “whole genome sequencing” and “SNPs”, which identified 501 research article and review hits over the course of this thesis (2009 to 2013). It is therefore reasonable to state that we now have a good understanding of what the genetic differences are in the MTBC at the lineage level. As a side note, the MTBC field is struggling to keep up with the growth in genome sequencing projects in terms of database curation and access, and is in need of a new online resource to house and integrate recently identified genetic variation (Stucki & Gagneux, 2012). Keeping with the human genetics theme, it is estimated that 90% of sequence variants in humans are SNPs (Collins et al., 1998), with each person thought to be heterozygous for 24,000-40,000 nonsynonymous SNPs (Cargill et al., 1999), whilst this study found an average pairwise difference of ~1000 nonsynonymous SNPs between any two MTBC strains. However, there is a much less complete picture of what these variants do. In response to this, the computational approaches used in Chapter 4 of this study were largely developed to facilitate human genetics research, which needs to filter potentially deleterious SNPs from those that are neutral. Ultimately it is anticipated that genomics will translate into real world clinical settings, informing diagnostics and treatment in personalised medicine (Evans & Relling, 1999; Laing et al., 2011). It was interesting to see in Chapter 5 that some of the functional genetic variation was due to nucleotide level changes that were not nonsynonymous and therefore not the focus of the computational tools. Nonsynonymous SNPs are classically thought of as having a higher potential to affect function than synonymous SNPs, which are usually regarded as neutral. Here synonymous mutations were shown to give rise to novel TSS; the SNP predicted to be involved in constitutive expression of DosR in Beijing strains is synonymous. This stresses the importance of appreciating diversity outside of the classical focus on nonsynonymous SNPs, which are the target of most computational resources for predicting SNP effects (Mooney, 2005).

One of the most exciting aspects of this thesis has been to combine multidisciplinary methods to strengthen and further the understanding of MTBC diversity. Bioinformatic analyses in Chapters 3 and 4 suggested that a high percentage of nonsynonymous SNPs identified across the MTBC likely impair protein function. Chapter 5 explored the potential effects of genetic variation within the total transcriptomes of clinical MTBC isolates. Prior to this study, there were no examples of an integrated MTBC genome and transcriptome analysis, and whilst one recent microarray-based study used a rational approach to selecting strains from different lineages, the underlying genotype was unknown (Homolka et al., 2010). The aims of the chapter were firstly to survey the transcriptome profiles of M. tuberculosis clinical isolates from Lineages 1 and 2 using a sequence-based approach, and secondly to understand the effects of the identified lineage-specific variation. The importance of this study was therefore to establish direct links between genetic differences observed amongst clinical isolates of the MTBC and phenotypic consequences at the level of transcription. Transitioning from large-scale whole genome sequencing of strains to transcriptome analysis using high throughput sequencing is one of the next frontiers in the understanding of MTBC diversity, and it can be anticipated that as throughput continues to increase, thanks to improvements in sequencing technology, and costs decrease from economies of scale, transcriptome sequencing will be feasible for many more clinical strains.

The work undertaken in this thesis is positioned at the interface of genomic and transcriptional systems. Genomic diversity that is specific to each of the MTBC lineages was identified and the effects of this variation screened at the next biological level, transcription. Analysis was guided by predictions of potential functional mutations using an in silico approach. One example of a predicted functional SNP within a regulatory protein with a detectable phenotype at the transcriptional level was virS. The hypothesised functional defect in virS in Lineage 1 strains correlates with another virS SNP found in M. bovis strains and with evidence of a similar transcriptional phenotype observed in a previous microarray study (Golby et al., 2007). Verification of the predicted defective virS in Lineage 1 strains and M. bovis is under investigation at NIMR, with purified recombinant virS protein currently undergoing DNase footprinting to ascertain the virS binding site, in addition to in vitro transcription assays. A second example under further investigation is the cause of constitutive DosR expression in the Beijing sub-family of Lineage 2. It was hypothesised in this thesis that this is due to a synonymous SNP within all Beijing strains that was seen in the RNA-seq data to introduce a new transcriptional start site (TSS). Following verification that the SNP is the cause of the new TSS and the associated increase in DosR transcription, it would then be necessary to follow this to the level of translation through measurement of protein abundance. In a wider context, the relevance of increased DosR to virulence in mouse models and ultimately to epidemiology in humans is not clear (Bartek et al., 2009; Boon & Dick, 2012). The understanding of tuberculosis at all biological levels is currently an active field of research, with large collaborations utilising a systems biology approach in the United States (TB Systems Biology - Stanford University and the Broad Institute) and Europe (SysteMTb). While these projects are largely based on the reference MTBC strain H37Rv, it is anticipated that use of clinical strains, such as those used in this study, will provide important biological insight into the impact of MTBC genetic variation. At NIMR the approach used in this thesis is also being applied at the proteomic and metabolomic levels.

In conclusion, this thesis has for the first time captured the genetic diversity that separates the MTBC lineages, and demonstrates that such diversity generates transcriptional diversity between the two MTBC lineages focused on in this study; it is highly likely that similar mechanisms occur in the other lineages. This underpins the importance of the holistic scientific approach that was undertaken in this thesis and is in contrast to the gene-centric focus of reductionism. This study's strength comes from the power to analyse all SNPs across the genome, uncovering examples of functional SNPs in a data-driven approach, and from the potential pool of additional functional SNPs predicted across all functional categories. To understand MTBC diversity, genomic data should not be interpreted in isolation, but instead integrated with other biological systems, as suggested for DosR above. An example of the importance of not treating mutations in isolation is the phenomenon of epistasis, whereby the phenotypic effect of one mutation differs depending on the presence of another mutation (Lehner, 2011). A role for epistasis in M. tuberculosis has recently been reported for the evolution of drug resistant strains (Borrell et al., 2013), but epistasis has been implicated in many other biological processes, including pathway organization, mutational load, and genomic complexity (Breen et al., 2012). Therefore, the lineage-specific SNPs identified in this thesis provide a framework on which further studies of the effects of MTBC genomic diversity can be based; firstly as an approach to interrogating genome datasets, secondly in demonstrating a mechanistic way of generating diversity, and finally as a resource to the TB community.

Finally, whilst genetic diversity has been uncovered in this thesis, it remains to be shown whether this has biological consequences during infection. Both lineages are highly successful pathogens with proven ability to maintain transmission cycles over tens of thousands of years, and it is likely that phenotypic diversity will reflect adaptation to different circumstances rather than loss or gain of ability to cause disease. The differences detected here suggest that strains from the two lineages may present alternative ligand repertoires to host cells, and respond differently to environmental changes generated by the host immune response. This in turn may confer varying degrees of fitness in different epidemiological settings. Understanding the message layer between a cell and its genome, through studies such as those undertaken in this thesis, will help connect genotype and phenotype; such studies are needed, along with integration of other biological systems, to provide a full understanding of the nature and phenotypic consequence of MTBC diversity in relation to human TB disease. It is also important to note that this thesis focused on the common underlying genetic differences between the MTBC lineages, reflecting events occurring 40,000 to 60,000 years ago (Hershberg et al., 2008). It has been hypothesised that the MTBC and humans have been co-evolving and are thus shaped by this longstanding association (Gagneux, 2012); it is therefore interesting to speculate that focusing on different evolutionary timescales, such as the last two hundred years, might reveal selective pressures in the MTBC associated with the great expansion in human population numbers over this period of time. As well as providing an opportunity to discern the ongoing evolution of the MTBC population, such timescales could highlight the response to pressures associated with HIV and drug-resistance and ultimately help design better tools and effective control strategies for one of the world’s oldest human diseases.

References

Achtman, M. (2008). Evolution, population structure, and phylogeography of genetically monomorphic bacterial pathogens. Annu Rev Microbiol 62, 53-70.

Albers, C. A., Lunter, G., MacArthur, D. G., McVean, G., Ouwehand, W. H. & Durbin, R. (2011). Dindel: accurate indel calls from short-read data. Genome Res 21, 961-973.

Alexander, K. A., Laver, P. N., Michel, A. L., Williams, M., van Helden, P. D., Warren, R. M. & Gey van Pittius, N. C. (2010). Novel Mycobacterium tuberculosis complex pathogen, M. mungi. Emerg Infect Dis 16, 1296-1299.

Anders, S. & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol 11, R106.

Arnvig, K. & Young, D. (2012). Non-coding RNA and its potential role in Mycobacterium tuberculosis pathogenesis. RNA Biol 9.

Arnvig, K. B., Comas, I., Thomson, N. R., Houghton, J., Boshoff, H. I., Croucher, N. J., Rose, G., Perkins, T. T., Parkhill, J., Dougan, G. & Young, D. B. (2011). Sequence-based analysis uncovers an abundance of non-coding RNA in the total transcriptome of Mycobacterium tuberculosis. PLoS Pathog 7, e1002342.

Atlas, R. M. & Snyder, J. W. (2006). Handbook of media for clinical microbiology: CRC.

Bagchi, G., Chauhan, S., Sharma, D. & Tyagi, J. S. (2005). Transcription and autoregulation of the Rv3134c-devR-devS operon of Mycobacterium tuberculosis. Microbiology 151, 4045-4053.

Baker, L., Brown, T., Maiden, M. C. & Drobniewski, F. (2004). Silent nucleotide polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis 10, 1568-1577.

Balbi, K. J. & Feil, E. J. (2007). The rise and fall of deleterious mutation. Res Microbiol 158, 779-786.

Bao, L. & Cui, Y. (2005). Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 21, 2185-2190.

Barkan, D., Hedhli, D., Yan, H. G., Huygen, K. & Glickman, M. S. (2012). Mycobacterium tuberculosis lacking all mycolic acid cyclopropanation is viable but highly attenuated and hyperinflammatory in mice. Infect Immun 80, 1958-1968.

Barry, C. E., Boshoff, H. I., Dartois, V., Dick, T., Ehrt, S., Flynn, J., Schnappinger, D., Wilkinson, R. J. & Young, D. B. (2009). The spectrum of latent tuberculosis: rethinking the biology and intervention strategies. Nat Rev Microbiol 7, 845-855.

Barry, C. E., 3rd (2001). Interpreting cell wall 'virulence factors' of Mycobacterium tuberculosis. Trends Microbiol 9, 237-241.

Bartek, I. L., Rutherford, R., Gruppo, V., Morton, R. A., Morris, R. P., Klein, M. R., Visconti, K. C., Ryan, G. J., Schoolnik, G. K., Lenaerts, A. & Voskuil, M. I. (2009). The DosR regulon of M. tuberculosis and antibacterial tolerance. Tuberculosis (Edinb) 89, 310-316.

Behr, M. A., Schroeder, B. G., Brinkman, J. N., Slayden, R. A. & Barry, C. E. (2000). A point mutation in the mma3 gene is responsible for impaired methoxymycolic acid production in Mycobacterium bovis BCG strains obtained after 1927. J Bacteriol 182, 3394-3399.

Bellamy, R., Beyers, N., McAdam, K. P., Ruwende, C., Gie, R., Samaai, P., Bester, D., Meyer, M., Corrah, T., Collin, M., Camidge, D. R., Wilkinson, D., Hoal-Van Helden, E., Whittle, H. C., Amos, W., van Helden, P. & Hill, A. V. (2000). Genetic susceptibility to tuberculosis in Africans: a genome-wide scan. Proc Natl Acad Sci U S A 97, 8005-8009.

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57, 289-300.

Bennett-Lovsey, R. M., Herbert, A. D., Sternberg, M. J. & Kelley, L. A. (2008). Exploring the extremes of sequence/structure space with ensemble fold recognition in the program Phyre. Proteins 70, 611-625.

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., Hall, K. P., Evers, D. J., Barnes, C. L., Bignell, H. R., Boutell, J. M., Bryant, J., Carter, R. J., Keira Cheetham, R., Cox, A. J., Ellis, D. J., Flatbush, M. R., Gormley, N. A., Humphray, S. J., Irving, L. J., Karbelashvili, M. S., Kirk, S. M., Li, H., Liu, X., Maisinger, K. S., Murray, L. J., Obradovic, B., Ost, T., Parkinson, M. L., Pratt, M. R., Rasolonjatovo, I. M., Reed, M. T., Rigatti, R., Rodighiero, C., Ross, M. T., Sabot, A., Sankar, S. V., Scally, A., Schroth, G. P., Smith, M. E., Smith, V. P., Spiridou, A., Torrance, P. E., Tzonev, S. S., Vermaas, E. H., Walter, K., Wu, X., Zhang, L., Alam, M. D., Anastasi, C., Aniebo, I. C., Bailey, D. M., Bancarz, I. R., Banerjee, S., Barbour, S. G., Baybayan, P. A., Benoit, V. A., Benson, K. F., Bevis, C., Black, P. J., Boodhun, A., Brennan, J. S., Bridgham, J. A., Brown, R. C., Brown, A. A., Buermann, D. H., Bundu, A. A., Burrows, J. C., Carter, N. P., Castillo, N., Chiara, E. C. M., Chang, S., Neil Cooley, R., Crake, N. R., Dada, O. O., Diakoumakos, K. D., Dominguez-Fernandez, B., Earnshaw, D. J., Egbujor, U. C., Elmore, D. W., Etchin, S. S., Ewan, M. R., Fedurco, M., Fraser, L. J., Fuentes Fajardo, K. V., Scott Furey, W., George, D., Gietzen, K. J., Goddard, C. P., Golda, G. S., Granieri, P. A., Green, D. E., Gustafson, D. L., Hansen, N. F., Harnish, K., Haudenschild, C. D., Heyer, N. I., Hims, M. M., Ho, J. T., Horgan, A. M., Hoschler, K., Hurwitz, S., Ivanov, D. V., Johnson, M. Q., James, T., Huw Jones, T. A., Kang, G. D., Kerelska, T. H., Kersey, A. D., Khrebtukova, I., Kindwall, A. P., Kingsbury, Z., Kokko-Gonzales, P. I., Kumar, A., Laurent, M. A.,
Lawley, C. T., Lee, S. E., Lee, X., Liao, A. K., Loch, J. A., Lok, M., Luo, S., Mammen, R. M., Martin, J. W., McCauley, P. G., McNitt, P., Mehta, P., Moon, K. W., Mullens, J. W., Newington, T., Ning, Z., Ling Ng, B., Novo, S. M., O'Neill, M. J., Osborne, M. A., Osnowski, A., Ostadan, O., Paraschos, L. L., Pickering, L., Pike, A. C., Chris Pinkard, D., Pliskin, D. P., Podhasky, J., Quijano, V. J., Raczy, C., Rae, V. H., Rawlings, S. R., Chiva Rodriguez, A., Roe, P. M., Rogers, J., Rogert Bacigalupo, M. C., Romanov, N., Romieu, A., Roth, R. K., Rourke, N. J., Ruediger, S. T., Rusman, E., Sanches-Kuiper, R. M., Schenker, M. R., Seoane, J. M., Shaw, R. J., Shiver, M. K., Short, S. W., Sizto, N. L., Sluis, J. P., Smith, M. A., Ernest Sohna Sohna, J., Spence, E. J., Stevens, K., Sutton, N., Szajkowski, L., Tregidgo, C. L., Turcatti, G., Vandevondele, S., Verhovsky, Y., Virk, S. M., Wakelin, S., Walcott, G. C., Wang, J., Worsley, G. J., Yan, J., Yau, L., Zuerlein, M., Mullikin, J. C., Hurles, M. E., McCooke, N. J., West, J. S., Oaks, F. L., Lundberg, P. L., Klenerman, D., Durbin, R. & Smith, A. J. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59.

Bentley, S. (2010). Taming the next-gen beast. Nat Rev Microbiol 8, 161.

Bentley, S. D., Comas, I., Bryant, J. M., Walker, D., Smith, N. H., Harris, S. R., Thurston, S., Gagneux, S., Wood, J., Antonio, M., Quail, M. A., Gehre, F., Adegbola, R. A., Parkhill, J. & de Jong, B. C. (2012). The genome of Mycobacterium africanum West African 2 reveals a lineage-specific locus and genome erosion common to the M. tuberculosis complex. PLoS Negl Trop Dis 6, e1552.

Bergval, I., Sengstake, S., Brankova, N., Levterova, V., Abadia, E., Tadumaze, N., Bablishvili, N., Akhalaia, M., Tuin, K., Schuitema, A., Panaiotov, S., Bachiyska, E., Kantardjiev, T., de Zwaan, R., Schurch, A., van Soolingen, D., van 't Hoog, A., Cobelens, F., Aspindzelashvili, R., Sola, C., Klatser, P. & Anthony, R. (2012). Combined species identification, genotyping, and drug resistance detection of Mycobacterium tuberculosis cultures by MLPA on a bead-based array. PLoS One 7, e43240.

Boehme, C. C., Nicol, M. P., Nabeta, P., Michael, J. S., Gotuzzo, E., Tahirli, R., Gler, M. T., Blakemore, R., Worodria, W., Gray, C., Huang, L., Caceres, T., Mehdiyev, R., Raymond, L., Whitelaw, A., Sagadevan, K., Alexander, H., Albert, H., Cobelens, F., Cox, H., Alland, D. & Perkins, M. D. (2011). Feasibility, diagnostic accuracy, and effectiveness of decentralised use of the Xpert MTB/RIF test for diagnosis of tuberculosis and multidrug resistance: a multicentre implementation study. Lancet 377, 1495-1505.

Boelens, R. & Gualerzi, C. O. (2002). Structure and function of bacterial initiation factors. Current Protein and Peptide Science 3, 107-119.

Boon, C. & Dick, T. (2012). How Mycobacterium tuberculosis goes to sleep: the dormancy survival regulator DosR a decade later. Future Microbiol 7, 513-518.

Borrell, S. & Gagneux, S. (2009). Infectiousness, reproductive fitness and evolution of drug-resistant Mycobacterium tuberculosis. Int J Tuberc Lung Dis 13, 1456-1466.

Borrell, S., Teo, Y., Giardina, F., Streicher, E. M., Klopper, M., Feldmann, J., Muller, B., Victor, T. C. & Gagneux, S. (2013). Epistasis between antibiotic resistance mutations drives the evolution of extensively drug-resistant tuberculosis. EMPH, 65-74.

Branton, D., Deamer, D. W., Marziali, A., Bayley, H., Benner, S. A., Butler, T., Di Ventra, M., Garaj, S., Hibbs, A., Huang, X., Jovanovich, S. B., Krstic, P. S., Lindsay, S., Ling, X. S., Mastrangelo, C. H., Meller, A., Oliver, J. S., Pershin, Y. V., Ramsey, J. M., Riehn, R., Soni, G. V., Tabard-Cossa, V., Wanunu, M., Wiggin, M. & Schloss, J. A. (2008). The potential and challenges of nanopore sequencing. Nat Biotechnol 26, 1146-1153.

Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. (2012). Epistasis as the primary factor in molecular evolution. Nature 490, 535-538.

Brosch, R., Gordon, S. V., Marmiesse, M., Brodin, P., Buchrieser, C., Eiglmeier, K., Garnier, T., Gutierrez, C., Hewinson, G., Kremer, K., Parsons, L. M., Pym, A. S., Samper, S., van Soolingen, D. & Cole, S. T. (2002). A new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A 99, 3684-3689.

Brudey, K., Driscoll, J. R., Rigouts, L., Prodinger, W. M., Gori, A., Al-Hajoj, S. A., Allix, C., Aristimuno, L., Arora, J., Baumanis, V., Binder, L., Cafrune, P., Cataldi, A., Cheong, S., Diel, R., Ellermeier, C., Evans, J. T., Fauville-Dufaux, M., Ferdinand, S., Garcia de Viedma, D., Garzelli, C., Gazzola, L., Gomes, H. M., Guttierez, M. C., Hawkey, P. M., van Helden, P. D., Kadival, G. V., Kreiswirth, B. N., Kremer, K., Kubin, M., Kulkarni, S. P., Liens, B., Lillebaek, T., Ho, M. L., Martin, C., Mokrousov, I., Narvskaia, O., Ngeow, Y. F., Naumann, L., Niemann, S., Parwati, I., Rahim, Z., Rasolofo-Razanamparany, V., Rasolonavalona, T., Rossetti, M. L., Rusch-Gerdes, S., Sajduda, A., Samper, S., Shemyakin, I. G., Singh, U. B., Somoskovi, A., Skuce, R. A., van Soolingen, D., Streicher, E. M., Suffys, P. N., Tortoli, E., Tracevska, T., Vincent, V., Victor, T. C., Warren, R. M., Yap, S. F., Zaman, K., Portaels, F., Rastogi, N. & Sola, C. (2006). Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol 6, 23.

Burley, S. K. (2013). PDB40: The Protein Data Bank celebrates its 40th birthday. Biopolymers 99, 165-169.

Buts, L., Lah, J., Dao-Thi, M. H., Wyns, L. & Loris, R. (2005). Toxin-antitoxin modules as bacterial metabolic stress managers. Trends Biochem Sci 30, 672-679.

Camarena, L., Bruno, V., Euskirchen, G., Poggio, S. & Snyder, M. (2010). Molecular mechanisms of ethanol-induced pathogenesis revealed by RNA-sequencing. PLoS Pathog 6, e1000834.

Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C. R., Lim, E. P., Kalyanaraman, N., Nemesh, J., Ziaugra, L., Friedland, L., Rolfe, A., Warrington, J., Lipshutz, R., Daley, G. Q. & Lander, E. S. (1999). Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 22, 231-238.

Carver, T., Berriman, M., Tivey, A., Patel, C., Böhme, U., Barrell, B. G., Parkhill, J. & Rajandream, M. A. (2008). Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24, 2672-2676.

Casali, N., Nikolayevskyy, V., Balabanova, Y., Ignatyeva, O., Kontsevaya, I., Harris, S. R., Bentley, S. D., Parkhill, J., Nejentsev, S., Hoffner, S. E., Horstmann, R. D., Brown, T. & Drobniewski, F. (2012). Microevolution of extensively drug-resistant tuberculosis in Russia. Genome Res 22, 735-745.

Caws, M., Thwaites, G., Stepniewska, K., Nguyen, T. N., Nguyen, T. H., Nguyen, T. P., Mai, N. T., Phan, M. D., Tran, H. L., Tran, T. H., van Soolingen, D., Kremer, K., Nguyen, V. V., Nguyen, T. C. & Farrar, J. (2006). Beijing genotype of Mycobacterium tuberculosis is significantly associated with human immunodeficiency virus infection and multidrug resistance in cases of tuberculous meningitis. J Clin Microbiol 44, 3934-3939.

Caws, M., Thwaites, G., Dunstan, S., Hawn, T. R., Lan, N. T., Thuong, N. T., Stepniewska, K., Huyen, M. N., Bang, N. D., Loc, T. H., Gagneux, S., van Soolingen, D., Kremer, K., van der Sande, M., Small, P., Anh, P. T., Chinh, N. T., Quy, H. T., Duyen, N. T., Tho, D. Q., Hieu, N. T., Torok, E., Hien, T. T., Dung, N. H., Nhu, N. T., Duy, P. M., van Vinh Chau, N. & Farrar, J. (2008). The influence of host and bacterial genotype on the development of disseminated disease with Mycobacterium tuberculosis. PLoS Pathog 4, e1000034.

Chesne-Seck, M. L., Barilone, N., Boudou, F., Gonzalo Asensio, J., Kolattukudy, P. E., Martin, C., Cole, S. T., Gicquel, B., Gopaul, D. N. & Jackson, M. (2008). A point mutation in the two-component regulator PhoP-PhoR accounts for the absence of polyketide-derived acyltrehaloses but not that of phthiocerol dimycocerosates in Mycobacterium tuberculosis H37Ra. J Bacteriol 190, 1329-1334.

Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X. & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92.

Coar, T. (1982). The aphorisms of Hippocrates with a Translation into Latin, and English. Birmingham: Gryphon Editions.

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. (2009). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38, 1767-1771.

Cohan, F. M. (2002). What are bacterial species? Annu Rev Microbiol 56, 457-487.

Cole, S. T., Brosch, R., Parkhill, J., Garnier, T., Churcher, C., Harris, D., Gordon, S. V., Eiglmeier, K., Gas, S., Barry, C. E., Tekaia, F., Badcock, K., Basham, D., Brown, D., Chillingworth, T., Connor, R., Davies, R., Devlin, K., Feltwell, T., Gentles, S., Hamlin, N., Holroyd, S., Hornsby, T., Jagels, K., Krogh, A., McLean, J., Moule, S., Murphy, L., Oliver, K., Osborne, J., Quail, M. A., Rajandream, M. A., Rogers, J., Rutter, S., Seeger, K., Skelton, J., Squares, R., Squares, S., Sulston, J. E., Taylor, K., Whitehead, S. & Barrell, B. G. (1998). Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537-544.

Collins, F. S., Brooks, L. D. & Chakravarti, A. (1998). A DNA polymorphism discovery resource for research on human genetic variation. Genome Res 8, 1229-1231.

Comas, I. & Gagneux, S. (2009). The past and future of tuberculosis research. PLoS Pathog 5, e1000600.

Comas, I., Homolka, S., Niemann, S. & Gagneux, S. (2009). Genotyping of genetically monomorphic bacteria: DNA sequencing in Mycobacterium tuberculosis highlights the limitations of current methodologies. PLoS One 4, e7815.

Comas, I., Chakravartti, J., Small, P., Galagan, J., Niemann, S., Kremer, K., Ernst, J. & Gagneux, S. (2010). Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved. Nat Genet 42, 498-503.

Comas, I., Borrell, S., Roetzer, A., Rose, G., Malla, B., Kato-Maeda, M., Galagan, J., Niemann, S. & Gagneux, S. (2011). Whole-genome sequencing of rifampicin-resistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA polymerase genes. Nat Genet 44, 106-110.

Constant, P., Perez, E., Malaga, W., Laneelle, M. A., Saurel, O., Daffe, M. & Guilhot, C. (2002). Role of the pks15/1 gene in the biosynthesis of phenolglycolipids in the Mycobacterium tuberculosis complex. Evidence that all strains synthesize glycosylated p-hydroxybenzoic methyl esters and that strains devoid of phenolglycolipids harbor a frameshift mutation in the pks15/1 gene. J Biol Chem 277, 38148-38158.

Coscolla, M. & Gagneux, S. (2010). Does M. tuberculosis genomic diversity explain disease diversity? Drug Discov Today Dis Mech 7, e43-e59.

Cowley, D., Govender, D., February, B., Wolfe, M., Steyn, L., Evans, J., Wilkinson, R. J. & Nicol, M. P. (2008). Recent and rapid emergence of W-Beijing strains of Mycobacterium tuberculosis in Cape Town, South Africa. Clin Infect Dis 47, 1252-1259.

Cox, M. P., Peterson, D. A. & Biggs, P. J. (2010). SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11, 485.

Croucher, N. J., Fookes, M. C., Perkins, T. T., Turner, D. J., Marguerat, S. B., Keane, T., Quail, M. A., He, M., Assefa, S., Bahler, J., Kingsley, R. A., Parkhill, J., Bentley, S. D., Dougan, G. & Thomson, N. R. (2009). A simple method for directional transcriptome sequencing using Illumina technology. Nucleic Acids Res 37, e148.

Croucher, N. J. & Thomson, N. R. (2010). Studying bacterial transcriptomes using RNA-seq. Curr Opin Microbiol 13, 619-624.

Daniel, T. M. (1997). Captain of death: the story of tuberculosis. Rochester, NY: University of Rochester Press. de Jong, B. C., Hill, P. C., Aiken, A., Awine, T., Antonio, M., Adetifa, I. M., Jackson-Sillah, D. J., Fox, A., Deriemer, K., Gagneux, S., Borgdorff, M. W., McAdam, K. P., Corrah, T., Small, P. M. & Adegbola, R. A. (2008). Progression to active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis lineage in The Gambia. J Infect Dis 198, 1037-1043. de Jong, B. C., Antonio, M., Awine, T., Ogungbemi, K., de Jong, Y. P., Gagneux, S., DeRiemer, K., Zozio, T., Rastogi, N., Borgdorff, M., Hill, P. C. & Adegbola, R. A. (2009). Use of spoligotyping and large sequence polymorphisms to study the population structure of the Mycobacterium tuberculosis complex in a cohort study of consecutive smear-positive tuberculosis cases in The Gambia. J Clin Microbiol 47, 994-1001.

de Jong, B. C., Antonio, M. & Gagneux, S. (2010). Mycobacterium africanum--review of an important cause of human tuberculosis in West Africa. PLoS Negl Trop Dis 4, e744.

de la Rua-Domenech, R. (2006). Human Mycobacterium bovis infection in the United Kingdom: Incidence, risks, control measures and review of the zoonotic aspects of bovine tuberculosis. Tuberculosis (Edinb) 86, 77-109.

Dillies, M. A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M. & Jaffrezic, F. (2012). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform.

Domenech, P. & Reed, M. B. (2009). Rapid and spontaneous loss of phthiocerol dimycocerosate (PDIM) from Mycobacterium tuberculosis grown in vitro: implications for virulence studies. Microbiology 155, 3532-3543.

Domenech, P., Kolly, G. S., Leon-Solis, L., Fallow, A. & Reed, M. B. (2010). Massive gene duplication event among clinical isolates of the Mycobacterium tuberculosis W/Beijing family. J Bacteriol 192, 4562-4570.

Donoghue, H. D., Spigelman, M., Greenblatt, C. L., Lev-Maor, G., Bar-Gal, G. K., Matheson, C., Vernon, K., Nerlich, A. G. & Zink, A. R. (2004). Tuberculosis: from prehistory to Robert Koch, as revealed by ancient DNA. Lancet Infect Dis 4, 584-592.

Ellis, R. C. & Zabrowarny, L. A. (1993). Safer staining method for acid fast bacilli. J Clin Pathol 46, 559-560.

Evans, J. T., Smith, E. G., Banerjee, A., Smith, R. M., Dale, J., Innes, J. A., Hunt, D., Tweddell, A., Wood, A., Anderson, C., Hewinson, R. G., Smith, N. H., Hawkey, P. M. & Sonnenberg, P. (2007). Cluster of human tuberculosis caused by Mycobacterium bovis: evidence for person-to-person transmission in the UK. Lancet 369, 1270-1276.

Evans, W. E. & Relling, M. V. (1999). Pharmacogenomics: translating functional genomics into rational therapeutics. Science 286, 487-491.

Filiatrault, M. J., Stodghill, P. V., Myers, C. R., Bronstein, P. A., Butcher, B. G., Lam, H., Grills, G., Schweitzer, P., Wang, W., Schneider, D. J. & Cartinhour, S. W. (2011). Genome-wide identification of transcriptional start sites in the plant pathogen Pseudomonas syringae pv. tomato str. DC3000. PLoS One 6, e29335.

Filliol, I., Driscoll, J. R., van Soolingen, D., Kreiswirth, B. N., Kremer, K., Valetudie, G., Dang, D. A., Barlow, R., Banerjee, D., Bifani, P. J., Brudey, K., Cataldi, A., Cooksey, R. C., Cousins, D. V., Dale, J. W., Dellagostin, O. A., Drobniewski, F., Engelmann, G., Ferdinand, S., Gascoyne-Binzi, D., Gordon, M., Gutierrez, M. C., Haas, W. H., Heersma, H., Kassa-Kelembho, E., Ho, M. L., Makristathis, A., Mammina, C., Martin, G., Mostrom, P., Mokrousov, I., Narbonne, V., Narvskaya, O., Nastasi, A., Niobe-Eyangoh, S. N., Pape, J. W., Rasolofo-Razanamparany, V., Ridell, M., Rossetti, M. L., Stauffer, F., Suffys, P. N., Takiff, H., Texier-Maugein, J., Vincent, V., de Waard, J. H., Sola, C. & Rastogi, N. (2003). Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study. J Clin Microbiol 41, 1963-1970.

Firdessa, R., Berg, S., Hailu, E., Schelling, E., Gumi, B., Erenso, G., Gadisa, E., Kiros, T., Habtamu, M., Hussein, J., Zinsstag, J., Robertson, B. D., Ameni, G., Lohan, A., Loftus, B., Comas, I., Gagneux, S., Tschopp, R., Yamuah, L., Hewinson, G., Gordon, S. V., Young, D. B. & Aseffa, A. (2013). Mycobacterial lineages causing pulmonary and extrapulmonary tuberculosis, Ethiopia. Emerg Infect Dis 19, 460-463.

Fisher, M. A., Plikaytis, B. B. & Shinnick, T. M. (2002). Microarray analysis of the Mycobacterium tuberculosis transcriptional response to the acidic conditions found in phagosomes. J Bacteriol 184, 4025-4032.

Fleischmann, R. D., Alland, D., Eisen, J. A., Carpenter, L., White, O., Peterson, J., DeBoy, R., Dodson, R., Gwinn, M., Haft, D., Hickey, E., Kolonay, J. F., Nelson, W. C., Umayam, L. A., Ermolaeva, M., Salzberg, S. L., Delcher, A., Utterback, T., Weidman, J., Khouri, H., Gill, J., Mikula, A., Bishai, W., Jacobs, W. R., Jr., Venter, J. C. & Fraser, C. M. (2002). Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol 184, 5479-5490.

Forrellad, M. A., Klepp, L. I., Gioffre, A., Sabio, Y. G. J., Morbidoni, H. R., Santangelo, M. D., Cataldi, A. A. & Bigi, F. (2012). Virulence factors of the Mycobacterium tuberculosis complex. Virulence 4.

Fozo, E. M., Hemm, M. R. & Storz, G. (2008). Small toxic proteins and the antisense RNAs that repress them. Microbiol Mol Biol Rev 72, 579-589.

Gagneux, S., DeRiemer, K., Van, T., Kato-Maeda, M., de Jong, B. C., Narayanan, S., Nicol, M., Niemann, S., Kremer, K., Gutierrez, M. C., Hilty, M., Hopewell, P. C. & Small, P. M. (2006a). Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A 103, 2869-2873.

Gagneux, S., Long, C. D., Small, P. M., Van, T., Schoolnik, G. K. & Bohannan, B. J. M. (2006b). The competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science 312, 1944-1946.

Gagneux, S. & Small, P. M. (2007). Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development. Lancet Infect Dis 7, 328-337.

Gagneux, S. (2012). Host-pathogen coevolution in human tuberculosis. Philos Trans R Soc Lond B Biol Sci 367, 850-859.

Gao, Q., Kripke, K. E., Saldanha, A. J., Yan, W., Holmes, S. & Small, P. M. (2005). Gene expression diversity among Mycobacterium tuberculosis clinical isolates. Microbiology 151, 5-14.

Garnier, T., Eiglmeier, K., Camus, J. C., Medina, N., Mansoor, H., Pryor, M., Duthoy, S., Grondin, S., Lacroix, C., Monsempe, C., Simon, S., Harris, B., Atkin, R., Doggett, J., Mayes, R., Keating, L., Wheeler, P. R., Parkhill, J., Barrell, B. G., Cole, S. T., Gordon, S. V. & Hewinson, R. G. (2003). The complete genome sequence of Mycobacterium bovis. Proc Natl Acad Sci U S A 100, 7877-7882.

Gerasimova, A., Kazakov, A. E., Arkin, A. P., Dubchak, I. & Gelfand, M. S. (2011). Comparative genomics of the dormancy regulons in mycobacteria. J Bacteriol 193, 3446-3452.

Gillespie, J. J., Wattam, A. R., Cammer, S. A., Gabbard, J. L., Shukla, M. P., Dalay, O., Driscoll, T., Hix, D., Mane, S. P., Mao, C., Nordberg, E. K., Scott, M., Schulman, J. R., Snyder, E. E., Sullivan, D. E., Wang, C., Warren, A., Williams, K. P., Xue, T., Yoo, H. S., Zhang, C., Zhang, Y., Will, R., Kenyon, R. W. & Sobral, B. W. (2011). PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect Immun 79, 4286-4298.

Glynn, J. R., Whiteley, J., Bifani, P. J., Kremer, K. & van Soolingen, D. (2002). Worldwide occurrence of Beijing/W strains of Mycobacterium tuberculosis: a systematic review. Emerg Infect Dis 8, 843-849.

Golby, P., Hatch, K. A., Bacon, J., Cooney, R., Riley, P., Allnutt, J., Hinds, J., Nunez, J., Marsh, P. D., Hewinson, R. G. & Gordon, S. V. (2007). Comparative transcriptomics reveals key gene expression differences between the human and bovine pathogens of the Mycobacterium tuberculosis complex. Microbiology 153, 3323-3336.

Goldman, D. S. (1963). Enzyme Systems in the Mycobacteria. Xv. Initial Steps in the Metabolism of Glycerol. J Bacteriol 86, 30-37.

Grange, J. M. (2001). Mycobacterium bovis infection in human beings. Tuberculosis (Edinb) 81, 71-77.

Grissa, I., Vergnaud, G. & Pourcel, C. (2008). CRISPRcompar: a website to compare clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 36, W145-148.

Gurcha, S. S., Baulard, A. R., Kremer, L., Locht, C., Moody, D. B., Muhlecker, W., Costello, C. E., Crick, D. C., Brennan, P. J. & Besra, G. S. (2002). Ppm1, a novel polyprenol monophosphomannose synthase from Mycobacterium tuberculosis. Biochem J 365, 441-450.

Gustafsson, C., Govindarajan, S. & Minshull, J. (2004). Codon bias and heterologous protein expression. Trends Biotechnol 22, 346-353.

Gutierrez, M. C., Brisse, S., Brosch, R., Fabre, M., Omaïs, B., Marmiesse, M., Supply, P. & Vincent, V. (2005). Ancient origin and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS Pathog 1, e5.

Hardcastle, T. J. & Kelly, K. A. (2010). baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11, 422.

Harris, S. R., Feil, E. J., Holden, M. T., Quail, M. A., Nickerson, E. K., Chantratita, N., Gardete, S., Tavares, A., Day, N., Lindsay, J. A., Edgeworth, J. D., de Lencastre, H., Parkhill, J., Peacock, S. J. & Bentley, S. D. (2010). Evolution of MRSA during hospital transmission and intercontinental spread. Science 327, 469-474.

Hawley, D. K. & McClure, W. R. (1983). Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Res 11, 2237-2255.

He, M., Sebaihia, M., Lawley, T. D., Stabler, R. A., Dawson, L. F., Martin, M. J., Holt, K. E., Seth-Smith, H. M., Quail, M. A., Rance, R., Brooks, K., Churcher, C., Harris, D., Bentley, S. D., Burrows, C., Clark, L., Corton, C., Murray, V., Rose, G., Thurston, S., van Tonder, A., Walker, D., Wren, B. W., Dougan, G. & Parkhill, J. (2010). Evolutionary dynamics of Clostridium difficile over short and long time scales. Proc Natl Acad Sci U S A.

Hendrix, R. W., Smith, M. C., Burns, R. N., Ford, M. E. & Hatfull, G. F. (1999). Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci U S A 96, 2192-2197.

Heng, L. (2008). MAQ: Mapping and Assembly with Qualities.

Hershberg, R., Lipatov, M., Small, P. M., Sheffer, H., Niemann, S., Homolka, S., Roach, J. C., Kremer, K., Petrov, D. A., Feldman, M. W. & Gagneux, S. (2008). High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol 6, e311.

Hershberg, R. & Petrov, D. A. (2010). Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet 6.

Heym, B., Alzari, P. M., Honore, N. & Cole, S. T. (1995). Missense mutations in the catalase-peroxidase gene, katG, are associated with isoniazid resistance in Mycobacterium tuberculosis. Mol Microbiol 15, 235-245.

Hillemann, D., Rusch-Gerdes, S. & Richter, E. (2007). Evaluation of the GenoType MTBDRplus assay for rifampin and isoniazid susceptibility testing of Mycobacterium tuberculosis strains and clinical specimens. J Clin Microbiol 45, 2635-2640.

Hirsh, A. E., Tsolaki, A. G., DeRiemer, K., Feldman, M. W. & Small, P. M. (2004). Stable association between strains of Mycobacterium tuberculosis and their human host populations. Proc Natl Acad Sci U S A 101, 4871-4876.

Ho, D. D., Neumann, A. U., Perelson, A. S., Chen, W., Leonard, J. M. & Markowitz, M. (1995). Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature 373, 123-126.

Holt, K. E., Parkhill, J., Mazzoni, C. J., Roumagnac, P., Weill, F. X., Goodhead, I., Rance, R., Baker, S., Maskell, D. J., Wain, J., Dolecek, C., Achtman, M. & Dougan, G. (2008). High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nat Genet 40, 987-993.

Homolka, S., Köser, C., Archer, J., Rüsch-Gerdes, S. & Niemann, S. (2009). Single-nucleotide polymorphisms in Rv2629 are specific for Mycobacterium tuberculosis genotypes Beijing and Ghana but not associated with rifampin resistance. J Clin Microbiol 47, 223-226.

Homolka, S., Niemann, S., Russell, D. G. & Rohde, K. H. (2010). Functional genetic diversity among Mycobacterium tuberculosis complex clinical isolates: delineation of conserved core and lineage-specific transcriptomes during intracellular survival. PLoS Pathog 6, e1000988.

Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. W. & Lee, C. (2004). Detection of large-scale variation in the human genome. Nat Genet 36, 949-951.

Ioerger, T. R., Feng, Y., Ganesula, K., Chen, X., Dobos, K. M., Fortune, S., Jacobs, W. R., Mizrahi, V., Parish, T., Rubin, E., Sassetti, C. & Sacchettini, J. C. (2010). Variation among genome sequences of H37Rv strains of Mycobacterium tuberculosis from multiple laboratories. J Bacteriol 192, 3645-3653.

Jones, T. F., Craig, A. S., Valway, S. E., Woodley, C. L. & Schaffner, W. (1999). Transmission of tuberculosis in a jail. Ann Intern Med 131, 557-563.

Jordan, I. K., Rogozin, I. B., Wolf, Y. I. & Koonin, E. V. (2002). Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 12, 962-968.

Kahla, I. B., Henry, M., Boukadida, J. & Drancourt, M. (2011). Pyrosequencing assay for rapid identification of Mycobacterium tuberculosis complex species. BMC Res Notes 4, 423.

Kane, M. D., Jatkoe, T. A., Stumpf, C. R., Lu, J., Thomas, J. D. & Madore, S. J. (2000). Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res 28, 4552-4557.

Kaplan, G., Post, F. A., Moreira, A. L., Wainwright, H., Kreiswirth, B. N., Tanverdi, M., Mathema, B., Ramaswamy, S. V., Walther, G., Steyn, L. M., Barry, C. E., 3rd & Bekker, L. G. (2003). Mycobacterium tuberculosis growth at the cavity surface: a microenvironment with failed immunity. Infect Immun 71, 7099-7108.

Kawano, M., Aravind, L. & Storz, G. (2007). An antisense RNA controls synthesis of an SOS-induced toxin evolved from an antitoxin. Mol Microbiol 64, 738-754.

Keating, L. A., Wheeler, P. R., Mansoor, H., Inwald, J. K., Dale, J., Hewinson, R. G. & Gordon, S. V. (2005). The pyruvate requirement of some members of the Mycobacterium tuberculosis complex is due to an inactive pyruvate kinase: implications for in vivo growth. Mol Microbiol 56, 163-174.

Keinan, A. & Clark, A. G. (2012). Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336, 740-743.

Kelley, L. A. & Sternberg, M. J. (2009). Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc 4, 363-371.

Kibota, T. T. & Lynch, M. (1996). Estimate of the genomic mutation rate deleterious to overall fitness in E. coli. Nature 381, 694-696.

Kimchi-Sarfaty, C., Oh, J. M., Kim, I. W., Sauna, Z. E., Calcagno, A. M., Ambudkar, S. V. & Gottesman, M. M. (2007). A "silent" polymorphism in the MDR1 gene changes substrate specificity. Science 315, 525-528.

Kimura, M. (1977). Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267, 275-276.

Komar, A. A. (2007). Silent SNPs: impact on gene function and phenotype. Pharmacogenomics 8, 1075-1080.

Kong, Y., Cave, M. D., Yang, D., Zhang, L., Marrs, C. F., Foxman, B., Bates, J. H., Wilson, F., Mukasa, L. N. & Yang, Z. H. (2005). Distribution of insertion- and deletion-associated genetic polymorphisms among four Mycobacterium tuberculosis phospholipase C genes and associations with extrathoracic tuberculosis: a population-based study. J Clin Microbiol 43, 6048-6053.

Kong, Y., Cave, M. D., Zhang, L., Foxman, B., Marrs, C. F., Bates, J. H. & Yang, Z. H. (2006). Population-based study of deletions in five different genomic regions of Mycobacterium tuberculosis and possible clinical relevance of the deletions. J Clin Microbiol 44, 3940-3946.

Kong, Y., Cave, M. D., Zhang, L., Foxman, B., Marrs, C. F., Bates, J. H. & Yang, Z. H. (2007). Association between Mycobacterium tuberculosis Beijing/W lineage strain infection and extrathoracic tuberculosis: Insights from epidemiologic and clinical characterization of the three principal genetic groups of M. tuberculosis clinical isolates. J Clin Microbiol 45, 409-414.

Korber, B. (2000). HIV Signature and Sequence Variation Analysis. In Computational Analysis of HIV Molecular Sequences, pp. 55-72. Edited by A. G. Rodrigo & G. H. Learn: Kluwer Academic Publishers, Dordrecht, Netherlands.

Koser, C. U., Summers, D. K. & Archer, J. A. (2011). Thr270Ile in embC (Rv3793) is not a marker for ethambutol resistance in the Mycobacterium tuberculosis complex. Antimicrob Agents Chemother 55, 1825.

Kryazhimskiy, S. & Plotkin, J. B. (2008). The population genetics of dN/dS. PLoS Genet 4, e1000304.

Kumar, A., Toledo, J. C., Patel, R. P., Lancaster, J. R., Jr. & Steyn, A. J. (2007). Mycobacterium tuberculosis DosS is a redox sensor and DosT is a hypoxia sensor. Proc Natl Acad Sci U S A 104, 11568-11573.

Laing, R. E., Hess, P., Shen, Y., Wang, J. & Hu, S. X. (2011). The role and impact of SNPs in pharmacogenomics and personalized medicine. Curr Drug Metab 12, 460-486.

Lasa, I., Toledo-Arana, A., Dobin, A., Villanueva, M., de los Mozos, I. R., Vergara-Irigaray, M., Segura, V., Fagegaltier, D., Penades, J. R., Valle, J., Solano, C. & Gingeras, T. R. (2011). Genome-wide antisense transcription drives mRNA processing in bacteria. Proc Natl Acad Sci U S A 108, 20172-20177.

Lehner, B. (2011). Molecular mechanisms of epistasis within and between genes. Trends Genet 27, 323-331.

Lew, J. M., Kapopoulou, A., Jones, L. M. & Cole, S. T. (2011). TubercuList--10 years after. Tuberculosis (Edinb) 91, 1-7.

Li, H., Ruan, J. & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18, 1851-1858.

Li, H. & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. & Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.

Lindstedt, B. A. (2005). Multiple-locus variable number tandem repeats analysis for genetic fingerprinting of pathogenic bacteria. Electrophoresis 26, 2567-2582.

Liu, X., Gutacker, M. M., Musser, J. M. & Fu, Y. X. (2006). Evidence for recombination in Mycobacterium tuberculosis. J Bacteriol 188, 8169-8177.

Liveris, D., Schwartz, J. J., Geertman, R. & Schwartz, I. (1993). Molecular cloning and sequencing of infC, the gene encoding translation initiation factor IF3, from four enterobacterial species. FEMS Microbiol Lett 112, 211-216.

Loman, N., Constantinidou, C., Chan, J. Z., Halachev, M., Sergeant, M., Penn, C., Robinson, E. & Pallen, M. (2012). High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol.

Madigan, M. T., Martinko, J. M. & Parker, J. (2003). Brock Biology of Microorganisms, 10th ed. Pearson Education.

Makarova, K. S., Wolf, Y. I. & Koonin, E. V. (2009). Comprehensive comparative-genomic analysis of type 2 toxin-antitoxin systems and related mobile stress response systems in prokaryotes. Biol Direct 4, 19.

Malen, H., Berven, F. S., Fladmark, K. E. & Wiker, H. G. (2007). Comprehensive analysis of exported proteins from Mycobacterium tuberculosis H37Rv. Proteomics 7, 1702-1718.

Malys, N. & McCarthy, J. E. (2011). Translation initiation: variations in the mechanism can be anticipated. Cell Mol Life Sci 68, 991-1003.

Manca, C., Tsenova, L., Barry, C. E., 3rd, Bergtold, A., Freeman, S., Haslett, P. A., Musser, J. M., Freedman, V. H. & Kaplan, G. (1999). Mycobacterium tuberculosis CDC1551 induces a more vigorous host response in vivo and in vitro, but is not more virulent than other clinical isolates. J Immunol 162, 6740-6746.

Manca, C., Tsenova, L., Bergtold, A., Freeman, S., Tovey, M., Musser, J. M., Barry, C. E., 3rd, Freedman, V. H. & Kaplan, G. (2001). Virulence of a Mycobacterium tuberculosis clinical isolate in mice is determined by failure to induce Th1 type immunity and is associated with induction of IFN-alpha/beta. Proc Natl Acad Sci U S A 98, 5752-5757.

Manca, C., Tsenova, L., Freeman, S., Barczak, A. K., Tovey, M., Murray, P. J., Barry, C. & Kaplan, G. (2005). Hypervirulent M. tuberculosis W/Beijing strains upregulate type I IFNs and increase expression of negative regulators of the Jak-Stat pathway. J Interferon Cytokine Res 25, 694-701.

Mao, C., Shukla, M., Larrouy-Maumus, G., Dix, F. L., Kelley, L. A., Sternberg, M. J., Sobral, B. W. & de Carvalho, L. P. (2012). Functional assignment of Mycobacterium tuberculosis proteome revealed by genome-scale fold-recognition. Tuberculosis (Edinb).

Marguerat, S. & Bähler, J. (2010). RNA-seq: from technology to biology. Cell Mol Life Sci.

McEvoy, C. R., Cloete, R., Muller, B., Schurch, A. C., van Helden, P. D., Gagneux, S., Warren, R. M. & Gey van Pittius, N. C. (2012). Comparative analysis of Mycobacterium tuberculosis pe and ppe genes reveals high sequence variation and an apparent absence of selective constraints. PLoS One 7, e30593.

McNerney, R., Maeurer, M., Abubakar, I., Marais, B., McHugh, T. D., Ford, N., Weyer, K., Lawn, S., Grobusch, M. P., Memish, Z., Squire, S. B., Pantaleo, G., Chakaya, J., Casenghi, M., Migliori, G. B., Mwaba, P., Zijenah, L., Hoelscher, M., Cox, H., Swaminathan, S., Kim, P. S., Schito, M., Harari, A., Bates, M., Schwank, S., O'Grady, J., Pletschette, M., Ditui, L., Atun, R. & Zumla, A. (2012). Tuberculosis diagnostics and biomarkers: needs, challenges, recent advances, and opportunities. J Infect Dis 205 Suppl 2, S147-158.

Meyer, M. & Kircher, M. (2010). Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010, pdb.prot5448.

Micklinghoff, J. C., Breitinger, K. J., Schmidt, M., Geffers, R., Eikmanns, B. J. & Bange, F. C. (2009). Role of the transcriptional regulator RamB (Rv0465c) in the control of the glyoxylate cycle in Mycobacterium tuberculosis. J Bacteriol 191, 7260-7269.

Miller, M. P. & Kumar, S. (2001). Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet 10, 2319-2328.

Minnikin, D. E., Minnikin, S. M., Dobson, G., Goodfellow, M., Portaels, F., van den Breen, L. & Sesardic, D. (1983). Mycolic acid patterns of four vaccine strains of Mycobacterium bovis BCG. J Gen Microbiol 129, 889-891.

Mitchison, D. A., Wallace, J. G., Bhatia, A. L., Selkon, J. B., Subbaiah, T. V. & Lancaster, M. C. (1960). A comparison of the virulence in guinea-pigs of South Indian and British tubercle bacilli. Tubercle 41, 1-22.

Mitchison, D. A., Selkon, J. B. & Lloyd, J. (1963). Virulence in the Guinea-Pig, Susceptibility to Hydrogen Peroxide, and Catalase Activity of Isoniazid-Sensitive Tubercle Bacilli from South Indian and British Patients. J Pathol Bacteriol 86, 377-386.

Mooney, S. (2005). Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis. Brief Bioinform 6, 44-56.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-628.

Movahedzadeh, F., Smith, D. A., Norman, R. A., Dinadayala, P., Murray-Rust, J., Russell, D. G., Kendall, S. L., Rison, S. C., McAlister, M. S., Bancroft, G. J., McDonald, N. Q., Daffe, M., Av-Gay, Y. & Stoker, N. G. (2004). The Mycobacterium tuberculosis ino1 gene is essential for growth and virulence. Mol Microbiol 51, 1003-1014.

Muller, B., Borrell, S., Rose, G. & Gagneux, S. (2013). The heterogeneous evolution of multidrug-resistant Mycobacterium tuberculosis. Trends Genet.

Müller, B., Streicher, E. M., Hoek, K. G., Tait, M., Trollip, A., Bosman, M. E., Coetzee, G. J., Chabula-Nxiweni, E. M., Hoosain, E., Gey van Pittius, N. C., Victor, T. C., van Helden, P. D. & Warren, R. M. (2011). inhA promoter mutations: a gateway to extensively drug-resistant tuberculosis in South Africa? Int J Tuberc Lung Dis 15, 344-351.

Musser, J. M., Kapur, V., Williams, D. L., Kreiswirth, B. N., van Soolingen, D. & van Embden, J. D. (1996). Characterization of the catalase-peroxidase gene (katG) and inhA locus in isoniazid-resistant and -susceptible strains of Mycobacterium tuberculosis by automated DNA sequencing: restricted array of mutations associated with drug resistance. J Infect Dis 173, 196-202.

Musser, J. M., Amin, A. & Ramaswamy, S. (2000). Negligible genetic diversity of Mycobacterium tuberculosis host immune system protein targets: evidence of limited selective pressure. Genetics 155, 7-16.

Nerlich, A. G., Haas, C. J., Zink, A., Szeimies, U. & Hagedorn, H. G. (1997). Molecular evidence for tuberculosis in an ancient Egyptian mummy. Lancet 350, 1404.

Newton-Foot, M. & Gey van Pittius, N. C. (2012). The complex architecture of mycobacterial promoters. Tuberculosis (Edinb).

Ng, P. C. & Henikoff, S. (2001). Predicting deleterious amino acid substitutions. Genome Res 11, 863-874.

Ng, P. C. & Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31, 3812-3814.

Ng, P. C. & Henikoff, S. (2006). Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7, 61-80.

Nicol, M. P. & Wilkinson, R. J. (2008). The clinical consequences of strain diversity in Mycobacterium tuberculosis. Trans R Soc Trop Med Hyg 102, 955-965.

Pandey, D. P. & Gerdes, K. (2005). Toxin-antitoxin loci are highly abundant in free-living but lost from host-associated prokaryotes. Nucleic Acids Res 33, 966-976.

Parish, T. & Stoker, N. G. (2001). Mycobacterium tuberculosis protocols. Totowa, NJ: Humana Press.

Parsons, L. M., Brosch, R., Cole, S. T., Somoskovi, A., Loder, A., Bretzel, G., Van Soolingen, D., Hale, Y. M. & Salfinger, M. (2002). Rapid and simple approach for identification of Mycobacterium tuberculosis complex isolates by PCR-based genomic deletion analysis. J Clin Microbiol 40, 2339-2345.

Parsons, S., Smith, S. G., Martins, Q., Horsnell, W. G., Gous, T. A., Streicher, E. M., Warren, R. M., van Helden, P. D. & Gey van Pittius, N. C. (2008). Pulmonary infection due to the dassie bacillus (Mycobacterium tuberculosis complex sp.) in a free-living dassie (rock hyrax-Procavia capensis) from South Africa. Tuberculosis (Edinb) 88, 80-83.

Parthiban, V., Gromiha, M. M. & Schomburg, D. (2006). CUPSAT: prediction of protein stability upon point mutations. Nucleic Acids Res 34, W239-242.

Parwati, I., van Crevel, R. & van Soolingen, D. (2010). Possible underlying mechanisms for successful emergence of the Mycobacterium tuberculosis Beijing genotype strains. Lancet Infect Dis 10, 103-111.

Perkins, T. T., Kingsley, R. A., Fookes, M. C., Gardner, P. P., James, K. D., Yu, L., Assefa, S. A., He, M., Croucher, N. J., Pickard, D. J., Maskell, D. J., Parkhill, J., Choudhary, J., Thomson, N. R. & Dougan, G. (2009). A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet 5, e1000569.

Plotkin, J. B. & Kudla, G. (2011). Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet 12, 32-42.

Portevin, D., Gagneux, S., Comas, I. & Young, D. (2011). Human macrophage responses to clinical isolates from the Mycobacterium tuberculosis complex discriminate between ancient and modern lineages. PLoS Pathog 7, e1001307.

Projahn, M., Koser, C., Homolka, S., Summers, D., Archer, J. & Niemann, S. (2011). Polymorphisms in Isoniazid and Prothionamide Resistance Genes of the Mycobacterium tuberculosis Complex. Antimicrob Agents Chemother 55, 4408-4411.

Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L., Eddy, S. R., Bateman, A. & Finn, R. D. (2012). The Pfam protein families database. Nucleic Acids Res 40, D290-301.

Qi, W., Kaser, M., Roltgen, K., Yeboah-Manu, D. & Pluschke, G. (2009). Genomic diversity and evolution of Mycobacterium ulcerans revealed by next-generation sequencing. PLoS Pathog 5, e1000580.

Qian, L., Van Embden, J. D., Van Der Zanden, A. G., Weltevreden, E. F., Duanmu, H. & Douglas, J. T. (1999). Retrospective analysis of the Beijing family of Mycobacterium tuberculosis in preserved lung tissues. J Clin Microbiol 37, 471-474.

Quinlan, A. R. & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842.

Raghavan, R., Sloan, D. B. & Ochman, H. (2012). Antisense transcription is pervasive but rarely conserved in enteric bacteria. MBio 3.

Ramaswamy, S. & Musser, J. M. (1998). Molecular genetic basis of antimicrobial agent resistance in Mycobacterium tuberculosis: 1998 update. Tuber Lung Dis 79, 3-29.

Ramaswamy, S. V., Amin, A. G., Goksel, S., Stager, C. E., Dou, S. J., El Sahly, H., Moghazeh, S. L., Kreiswirth, B. N. & Musser, J. M. (2000). Molecular genetic analysis of nucleotide polymorphisms associated with ethambutol resistance in human isolates of Mycobacterium tuberculosis. Antimicrob Agents Chemother 44, 326-336.

Ramaswamy, S. V., Reich, R., Dou, S. J., Jasperse, L., Pan, X., Wanger, A., Quitugua, T. & Graviss, E. A. (2003). Single nucleotide polymorphisms in genes associated with isoniazid resistance in Mycobacterium tuberculosis. Antimicrob Agents Chemother 47, 1241-1250.

Ramensky, V., Bork, P. & Sunyaev, S. (2002). Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30, 3894.

Rao, V., Gao, F., Chen, B., Jacobs, W. R., Jr. & Glickman, M. S. (2006). Trans-cyclopropanation of mycolic acids on trehalose dimycolate suppresses Mycobacterium tuberculosis-induced inflammation and virulence. J Clin Invest 116, 1660-1667.

Reddy, T. B., Riley, R., Wymore, F., Montgomery, P., DeCaprio, D., Engels, R., Gellesch, M., Hubble, J., Jen, D., Jin, H., Koehrsen, M., Larson, L., Mao, M., Nitzberg, M., Sisk, P., Stolte, C., Weiner, B., White, J., Zachariah, Z. K., Sherlock, G., Galagan, J. E., Ball, C. A. & Schoolnik, G. K. (2009). TB database: an integrated platform for tuberculosis research. Nucleic Acids Res 37, D499-508.

Reed, M. B., Domenech, P., Manca, C., Su, H., Barczak, A. K., Kreiswirth, B. N., Kaplan, G. & Barry, C. E., 3rd (2004). A glycolipid of hypervirulent tuberculosis strains that inhibits the innate immune response. Nature 431, 84-87.

Reed, M. B., Gagneux, S., Deriemer, K., Small, P. M. & Barry, C. E., 3rd (2007). The W-Beijing lineage of Mycobacterium tuberculosis overproduces triglycerides and has the DosR dormancy regulon constitutively upregulated. J Bacteriol 189, 2583-2589.

Reed, M. B., Pichler, V. K., McIntosh, F., Mattia, A., Fallow, A., Masala, S., Domenech, P., Zwerling, A., Thibert, L., Menzies, D., Schwartzman, K. & Behr, M. A. (2009). Major Mycobacterium tuberculosis lineages associate with patient country of origin. J Clin Microbiol 47, 1119-1128.

Riska, P. F., Jacobs, W. R., Jr. & Alland, D. (2000). Molecular determinants of drug resistance in tuberculosis. Int J Tuberc Lung Dis 4, S4-10.

Robinson, D. A., Falush, D. & Feil, E. J. (2010a). Bacterial population genetics in infectious disease. Hoboken, N.J.: Wiley-Blackwell.

Robinson, M. D., McCarthy, D. J. & Smyth, G. K. (2010b). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.

Robinson, M. D. & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11, R25.

Rocha, E. P., Smith, J. M., Hurst, L. D., Holden, M. T., Cooper, J. E., Smith, N. H. & Feil, E. J. (2006). Comparisons of dN/dS are time dependent for closely related bacterial genomes. J Theor Biol 239, 226-235.

Roumagnac, P., Weill, F. X., Dolecek, C., Baker, S., Brisse, S., Chinh, N. T., Le, T. A., Acosta, C. J., Farrar, J., Dougan, G. & Achtman, M. (2006). Evolutionary history of Salmonella typhi. Science 314, 1301-1304.

Rozen, S. & Skaletsky, H. (2000). Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132, 365-386.

Russell, D. G., Barry, C. E. & Flynn, J. L. (2010). Tuberculosis: what we don't know can, and does, hurt us. Science 328, 852-856.

Sala, C., Haouz, A., Saul, F., Miras, I., Rosenkrands, I., Alzari, P. & Cole, S. T. (2009). Genome-wide regulon and crystal structure of BlaI (Rv1846c) from Mycobacterium tuberculosis. Mol Microbiol 71, 1102-1116.

Salo, W. L., Aufderheide, A. C., Buikstra, J. & Holcomb, T. A. (1994). Identification of Mycobacterium tuberculosis DNA in a pre-Columbian Peruvian mummy. Proc Natl Acad Sci U S A 91, 2091-2094.

Sandgren, A., Strong, M., Muthukrishnan, P., Weiner, B. K., Church, G. M. & Murray, M. B. (2009). Tuberculosis drug resistance mutation database. PLoS Med 6, e2.


Sassetti, C. M., Boyd, D. H. & Rubin, E. J. (2003). Genes required for mycobacterial growth defined by high density mutagenesis. Mol Microbiol 48, 77-84.

Sassetti, C. M. & Rubin, E. J. (2003). Genetic requirements for mycobacterial survival during infection. Proc Natl Acad Sci U S A 100, 12989-12994.

Saunders, C. T. & Baker, D. (2002). Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J Mol Biol 322, 891-901.

Schena, M., Heller, R. A., Theriault, T. P., Konrad, K., Lachenmeier, E. & Davis, R. W. (1998). Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol 16, 301-306.

Schnell, R., Agren, D. & Schneider, G. (2008). 1.9 Å structure of the signal receiver domain of the putative response regulator NarL from Mycobacterium tuberculosis. Acta Crystallogr Sect F Struct Biol Cryst Commun 64, 1096-1100.

Schulz, M. H., Zerbino, D. R., Vingron, M. & Birney, E. (2012). Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28, 1086-1092.

Schürch, A. C., Kremer, K., Warren, R. M., Hung, N. V., Zhao, Y., Wan, K., Boeree, M. J., Siezen, R. J., Smith, N. H. & van Soolingen, D. (2011). Mutations in the regulatory network underlie the recent clonal expansion of a dominant subclone of the Mycobacterium tuberculosis Beijing genotype. Infect Genet Evol 11, 587-597.

Sharma, C. M., Hoffmann, S., Darfeuille, F., Reignier, J., Findeiss, S., Sittka, A., Chabas, S., Reiche, K., Hackermüller, J., Reinhardt, R., Stadler, P. F. & Vogel, J. (2010a). The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 464, 250-255.

Sharma, C. M., Hoffmann, S., Darfeuille, F., Reignier, J., Findeiss, S., Sittka, A., Chabas, S., Reiche, K., Hackermuller, J., Reinhardt, R., Stadler, P. F. & Vogel, J. (2010b). The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 464, 250-255.


Shendure, J. & Ji, H. (2008). Next-generation DNA sequencing. Nat Biotechnol 26, 1135-1145.

Sherman, D. R., Mdluli, K., Hickey, M. J., Arain, T. M., Morris, S. L., Barry, C. E., 3rd & Stover, C. K. (1996). Compensatory ahpC gene expression in isoniazid-resistant Mycobacterium tuberculosis. Science 272, 1641-1643.

Sherry, S. T., Ward, M. & Sirotkin, K. (1999). dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res 9, 677-679.

Singh, A., Jain, S., Gupta, S., Das, T. & Tyagi, A. K. (2003). mymA operon of Mycobacterium tuberculosis: its regulation and importance in the cell envelope. FEMS Microbiol Lett 227, 53-63.

Singh, A., Gupta, R., Vishwakarma, R. A., N, P. C., Ramanathan, V. D. & Tyagi, A. K. (2005). Requirement of the mymA operon for appropriate cell wall ultrastructure and persistence of Mycobacterium tuberculosis in the spleens of guinea pigs. J Bacteriol 187, 4173-4186.

Sinsimer, D., Huet, G., Manca, C., Tsenova, L., Koo, M. S., Kurepina, N., Kana, B., Mathema, B., Marras, S. A., Kreiswirth, B. N., Guilhot, C. & Kaplan, G. (2008). The phenolic glycolipid of Mycobacterium tuberculosis differentially modulates the early host cytokine response but does not in itself confer hypervirulence. Infect Immun 76, 3027-3036.

Smith, N. H., Gordon, S. V., de la Rua-Domenech, R., Clifton-Hadley, R. S. & Hewinson, R. G. (2006a). Bottlenecks and broomsticks: the molecular evolution of Mycobacterium bovis. Nat Rev Microbiol 4, 670-681.

Smith, N. H., Kremer, K., Inwald, J., Dale, J., Driscoll, J. R., Gordon, S. V., van Soolingen, D., Hewinson, R. G. & Smith, J. M. (2006b). Ecotypes of the Mycobacterium tuberculosis complex. J Theor Biol 239, 220-225.

Sreevatsan, S., Pan, X., Stockbauer, K. E., Connell, N. D., Kreiswirth, B. N., Whittam, T. S. & Musser, J. M. (1997a). Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A 94, 9869-9874.

Sreevatsan, S., Stockbauer, K. E., Pan, X., Kreiswirth, B. N., Moghazeh, S. L., Jacobs, W. R., Jr., Telenti, A. & Musser, J. M. (1997b). Ethambutol resistance in Mycobacterium tuberculosis: critical role of embB mutations. Antimicrob Agents Chemother 41, 1677-1681.

Srivastava, S., Garg, A., Ayyagari, A., Nyati, K. K., Dhole, T. N. & Dwivedi, S. K. (2006). Nucleotide polymorphism associated with ethambutol resistance in clinical isolates of Mycobacterium tuberculosis. Curr Microbiol 53, 401-405.

Srivastava, S., Ayyagari, A., Dhole, T. N., Nyati, K. K. & Dwivedi, S. K. (2009). emb nucleotide polymorphisms and the role of embB306 mutations in Mycobacterium tuberculosis resistance to ethambutol. Int J Med Microbiol 299, 269-280.

Stahl, D. A. & Urbance, J. W. (1990). The division between fast- and slow-growing species corresponds to natural relationships among the mycobacteria. J Bacteriol 172, 116-124.

Steenken, W. (1935). Lysis of tubercle bacilli in vitro. Proc Soc Exptl Biol Med 33, 253-255.

Stenson, P. D., Ball, E. V., Mort, M., Phillips, A. D., Shaw, K. & Cooper, D. N. (2012). The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinformatics Chapter 1, Unit1 13.

Steyn, A. J. C., Joseph, J. & Bloom, B. R. (2003). Interaction of the sensor module of Mycobacterium tuberculosis H37Rv KdpD with members of the Lpr family. Mol Microbiol 47, 1075-1089.

Stitziel, N. O., Binkowski, T. A., Tseng, Y. Y., Kasif, S. & Liang, J. (2004). topoSNP: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association. Nucleic Acids Res 32, D520-522.


Stucki, D. & Gagneux, S. (2012). Single nucleotide polymorphisms in Mycobacterium tuberculosis and the need for a curated database. Tuberculosis (Edinb).

Stucki, D., Malla, B., Hostettler, S., Huna, T., Feldmann, J., Yeboah-Manu, D., Borrell, S., Fenner, L., Comas, I., Coscolla, M. & Gagneux, S. (2012). Two new rapid SNP-typing methods for classifying Mycobacterium tuberculosis complex into the main phylogenetic lineages. PLoS One 7, e41253.

Sunyaev, S., Ramensky, V. & Bork, P. (2000). Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16, 198-200.

Sunyaev, S., Ramensky, V., Koch, I., Lathe, W., Kondrashov, A. S. & Bork, P. (2001). Prediction of deleterious human alleles. Hum Mol Genet 10, 591-597.

Supply, P., Lesjean, S., Savine, E., Kremer, K., van Soolingen, D. & Locht, C. (2001). Automated high-throughput genotyping for study of global epidemiology of Mycobacterium tuberculosis based on mycobacterial interspersed repetitive units. J Clin Microbiol 39, 3563-3571.

Supply, P., Warren, R. M., Banuls, A. L., Lesjean, S., Van Der Spuy, G. D., Lewis, L. A., Tibayrenc, M., Van Helden, P. D. & Locht, C. (2003). Linkage disequilibrium between minisatellite loci supports clonal evolution of Mycobacterium tuberculosis in a high tuberculosis incidence area. Mol Microbiol 47, 529-538.

Supply, P., Marceau, M., Mangenot, S., Roche, D., Rouanet, C., Khanna, V., Majlessi, L., Criscuolo, A., Tap, J., Pawlik, A., Fiette, L., Orgeur, M., Fabre, M., Parmentier, C., Frigui, W., Simeone, R., Boritsch, E. C., Debrie, A. S., Willery, E., Walker, D., Quail, M. A., Ma, L., Bouchier, C., Salvignol, G., Sayes, F., Cascioferro, A., Seemann, T., Barbe, V., Locht, C., Gutierrez, M. C., Leclerc, C., Bentley, S. D., Stinear, T. P., Brisse, S., Medigue, C., Parkhill, J., Cruveiller, S. & Brosch, R. (2013). Genomic analysis of smooth tubercle bacilli provides insights into ancestry and pathoadaptation of Mycobacterium tuberculosis. Nat Genet 45, 172-179.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. & Kumar, S. (2011). MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28, 2731-2739.


Team_RDC (2008). R Development Core Team. Vienna: R Foundation for Statistical Computing.

Tennessen, J. A., Bigham, A. W., O'Connor, T. D., Fu, W., Kenny, E. E., Gravel, S., McGee, S., Do, R., Liu, X., Jun, G., Kang, H. M., Jordan, D., Leal, S. M., Gabriel, S., Rieder, M. J., Abecasis, G., Altshuler, D., Nickerson, D. A., Boerwinkle, E., Sunyaev, S., Bustamante, C. D., Bamshad, M. J. & Akey, J. M. (2012). Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64-69.

Thomas, P. D., Campbell, M. J., Kejariwal, A., Mi, H., Karlak, B., Daverman, R., Diemer, K., Muruganujan, A. & Narechania, A. (2003). PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13, 2129-2141.

Thomason, M. & Storz, G. (2010). Bacterial antisense RNAs: how many are there, and what are they doing? Annu Rev Genet 44, 167-188.

Thusberg, J. & Vihinen, M. (2009). Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat 30, 703-714.

Torrelles, J. B., DesJardin, L. E., MacNeil, J., Kaufman, T. M., Kutzbach, B., Knaup, R., McCarthy, T. R., Gurcha, S. S., Besra, G. S., Clegg, S. & Schlesinger, L. S. (2009). Inactivation of Mycobacterium tuberculosis mannosyltransferase pimB reduces the cell wall lipoarabinomannan and lipomannan content and increases the rate of bacterial-induced human macrophage cell death. Glycobiology 19, 743-755.

Torrelles, J. B. & Schlesinger, L. S. (2010). Diversity in Mycobacterium tuberculosis mannosylated cell wall determinants impacts adaptation to the host. Tuberculosis (Edinb) 90, 84-93.

Tsolaki, A. G., Hirsh, A. E., DeRiemer, K., Enciso, J. A., Wong, M. Z., Hannan, M., Goguet de la Salmoniere, Y. O., Aman, K., Kato-Maeda, M. & Small, P. M. (2004). Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from genomic deletions in 100 strains. Proc Natl Acad Sci U S A 101, 4865-4870.

Valway, S. E., Sanchez, M. P., Shinnick, T. F., Orme, I., Agerton, T., Hoy, D., Jones, J. S., Westmoreland, H. & Onorato, I. M. (1998). An outbreak involving extensive transmission of a virulent strain of Mycobacterium tuberculosis. N Engl J Med 338, 633-639.

van Embden, J. D., Cave, M. D., Crawford, J. T., Dale, J. W., Eisenach, K. D., Gicquel, B., Hermans, P., Martin, C., McAdam, R., Shinnick, T. M. et al. (1993). Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: recommendations for a standardized methodology. J Clin Microbiol 31, 406-409.

van Soolingen, D., Hermans, P. W., de Haas, P. E., Soll, D. R. & van Embden, J. D. (1991). Occurrence and stability of insertion sequences in Mycobacterium tuberculosis complex strains: evaluation of an insertion sequence-dependent DNA polymorphism as a tool in the epidemiology of tuberculosis. J Clin Microbiol 29, 2578-2586.

van Soolingen, D., Hoogenboezem, T., de Haas, P. E., Hermans, P. W., Koedam, M. A., Teppema, K. S., Brennan, P. J., Besra, G. S., Portaels, F., Top, J., Schouls, L. M. & van Embden, J. D. (1997). A novel pathogenic taxon of the Mycobacterium tuberculosis complex, Canetti: characterization of an exceptional isolate from Africa. Int J Syst Bacteriol 47, 1236-1245.

Van Soolingen, D. (2001). Molecular epidemiology of tuberculosis and other mycobacterial infections: main methodologies and achievements. J Intern Med 249, 1-26.

Walderhaug, M. O., Polarek, J. W., Voelkner, P., Daniel, J. M., Hesse, J. E., Altendorf, K. & Epstein, W. (1992). KdpD and KdpE, proteins that control expression of the kdpABC operon, are members of the two-component sensor-effector class of regulators. J Bacteriol 174, 2152-2159.

Wang, Q., Yue, J., Zhang, L., Xu, Y., Chen, J., Zhang, M., Zhu, B. & Wang, H. (2007). A newly identified 191A/C mutation in the Rv2629 gene that was significantly associated with rifampin resistance in Mycobacterium tuberculosis. J Proteome Res 6, 4564-4571.

Wang, Z. & Moult, J. (2001). SNPs, protein structure, and disease. Hum Mutat 17, 263-270.

Wayne, L. G. (1994). Tuberculosis: Pathogenesis, Protection, and Control. Washington, D.C: American Society for Microbiology Press.

Weiner, B., Gomez, J., Victor, T. C., Warren, R. M., Sloutsky, A., Plikaytis, B. B., Posey, J. E., van Helden, P. D., Gey van Pittius, N. C., Koehrsen, M., Sisk, P., Stolte, C., White, J., Gagneux, S., Birren, B., Hung, D., Murray, M. & Galagan, J. (2012). Independent large scale duplications in multiple M. tuberculosis lineages overlapping the same genomic region. PLoS One 7, e26038.

Weniger, T., Krawczyk, J., Supply, P., Niemann, S. & Harmsen, D. (2010). MIRU-VNTRplus: a web tool for polyphasic genotyping of Mycobacterium tuberculosis complex bacteria. Nucleic Acids Res 38, W326-331.

WHO (2012). WHO 2012 Global tuberculosis control: surveillance, planning, financing. Geneva.

Winder, F. G. & Brennan, P. J. (1966). Initial steps in the metabolism of glycerol by Mycobacterium tuberculosis. J Bacteriol 92, 1846-1847.

Yi, H., Cho, Y. J., Won, S., Lee, J. E., Jin Yu, H., Kim, S., Schroth, G. P., Luo, S. & Chun, J. (2011). Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Res.

Yoder-Himes, D. R., Chain, P. S., Zhu, Y., Wurtzel, O., Rubin, E. M., Tiedje, J. M. & Sorek, R. (2009). Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc Natl Acad Sci U S A 106, 3976-3981.

Yuan, Y., Zhu, Y., Crane, D. D. & Barry, C. E., 3rd (1998). The effect of oxygenated mycolic acid composition on cell wall function and macrophage growth in Mycobacterium tuberculosis. Mol Microbiol 29, 1449-1458.


Yue, P., Li, Z. & Moult, J. (2005). Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol 353, 459-473.

Zheng, X., Hu, G., She, Z. & Zhu, H. (2011). Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes. BMC Genomics 12, 361.


Appendix: A-G


Appendix A

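Perl script (genomeDeletions.pl) to identify large deletions within genome sequencing data. The script takes as input a per-base genome coverage file in Artemis format, a gene annotation file, and a percentage deletion threshold. Genes in which the percentage of zero-coverage positions reaches the threshold are reported as deleted.

genomeDeletions.pl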

#!/usr/bin/perl
use strict;
use warnings;

#################################################################
# Find deletions using Artemis per base coverage file           #
#                                                               #
# usage: perl findGeneDeletions.pl [artemis coverage file]      #
#        [annotation file] [% threshold]                        #
#                                                               #
# Percentage cutoff set by command argument 3                   #
#                                                               #
# Graham Rose 05.2011                                           #
#                                                               #
#################################################################

################# Arguments from commandline ####################
if (@ARGV != 3) {
    print "\nusage: perl findGeneDeletions.pl [artemis coverage file] [H37Rv annotation file] [% deletion threshold eg: 50]\n\n";
    exit;
}

my ($coverageFile, $annotationFile, $threshold) = @ARGV;

# Read the per-base coverage values (one genome position per line)
open(my $coverage_fh, '<', $coverageFile) or die "Can't open $coverageFile: $!\n";
my @genomeCoverage = <$coverage_fh>;
close($coverage_fh);

# Read the gene annotations
open(my $annotation_fh, '<', $annotationFile) or die "Can't open $annotationFile: $!\n";
my @annotations = <$annotation_fh>;
close($annotation_fh);

########################## Main logic ###########################
foreach my $line_in_annotations (@annotations) {
    chomp($line_in_annotations);

    # Annotation columns: start, end, direction, gene name
    next unless $line_in_annotations =~ /(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
    my $geneStart = $1 - 1;    # catch +1 error (0-based array index)
    my $geneEnd   = $2;
    # direction ($3) is not used
    my $geneName  = $4;

    my $geneLength = $geneEnd - $geneStart;
    next if $geneLength <= 0;  # skip malformed coordinates

    # Count positions within the gene that have zero coverage
    my $zeros = 0;
    for my $position ($geneStart .. $geneEnd - 1) {
        $zeros++ if $genomeCoverage[$position] == 0;
    }

    my $percentZero = ($zeros / $geneLength) * 100;
    my $rounded = sprintf "%.2f", $percentZero;

    ############################ Output #########################
    if ($rounded >= $threshold) {
        print "Deletion: $geneName (% deleted: $rounded)\n";
    }
}
print "\ncomplete\n\n";


Appendix B

Lineage-specific SNPs

All lineage-specific MTBC SNPs, ordered by genomic position. Alleles are given relative to the coding strand; for SNPs in intergenic regions, alleles are given relative to the forward strand. The ancestral allele is based on the reconstructed most recent common ancestor of the MTBC. The Mutation column shows the codon position and amino acid change if the SNP is nonsynonymous.

Lineage  Genomic position  Ancestral allele  Derived allele  Mutation type  Gene  Mutation
modern  2532  C  T  synonymous  Rv0002  L161L
5  3192  A  G  nonsynonymous  Rv0002  N381D
3  3446  C  T  nonsynonymous  Rv0003  A56V
5  3452  T  C  nonsynonymous  Rv0003  L58S
1  6112  G  C  nonsynonymous  Rv0005  M330I
1  8452  C  T  nonsynonymous  Rv0006  A384V
6  8493  C  T  nonsynonymous  Rv0006  L398F
modern  9143  C  T  synonymous  Rv0006  I614I
5  9566  C  T  synonymous  Rv0006  Y755Y
2  11820  C  G  intergenic  -  -
3  12204  C  T  synonymous  Rv0008c  L36L
1  13298  C  G  nonsynonymous  Rv0010c  I87M
modern  13460  C  T  synonymous  Rv0010c  D33D
6  13482  C  T  nonsynonymous  Rv0010c  A26V
5  13579  C  T  intergenic  -  -
modern  14401  G  A  nonsynonymous  Rv0012  E105K
2  14861  G  T  nonsynonymous  Rv0012  G258V
4  15117  G  C  nonsynonymous  Rv0013  M68I
5  16720  G  C  nonsynonymous  Rv0014c  V251L
modern  21819  G  T  nonsynonymous  Rv0018c  A455S
1  22961  A  C  nonsynonymous  Rv0018c  D74A
modern  23174  T  G  nonsynonymous  Rv0018c  L3R
6  24780  A  G  nonsynonymous  Rv0020c  D222G
5  25386  G  A  nonsynonymous  Rv0020c  G20D
6  26053  G  C  nonsynonymous  Rv0021c  A277P
1  26347  G  C  nonsynonymous  Rv0021c  D179H
5  26783  G  A  synonymous  Rv0021c  A33A
3  26957  C  G  intergenic  -  -
modern  27469  G  A  intergenic  -  -
3  27487  G  A  intergenic  -  -
5  27947  C  T  nonsynonymous  Rv0023  A118V
1  27996  T  C  synonymous  Rv0023  Y134Y
5  31807  C  T  synonymous  Rv0028  R98R
5  32776  G  T  synonymous  Rv0029  T240T
6  33137  C  G  nonsynonymous  Rv0029  L361V
2  36008  G  C  nonsynonymous  Rv0032  D572H
6  36304  T  G  synonymous  Rv0032  A670A
modern  36538  C  T  synonymous  Rv0032  S748S
2  37305  C  G  nonsynonymous  Rv0035  S16W
2  39158  G  C  synonymous  Rv0036c  R224R
modern  39758  C  T  synonymous  Rv0036c  H24H
5  39786  G  C  nonsynonymous  Rv0036c  S15T
5  40177  G  A  synonymous  Rv0037c  L342L
6  41241  G  A  intergenic  -  -
4  42281  T  G  nonsynonymous  Rv0039c  F24C
modern  43945  G  A  synonymous  Rv0041  V128V
5  46297  T  C  synonymous  Rv0041  Y912Y
5  47877  A  C  nonsynonymous  Rv0043c  D75A
5  50059  A  T  nonsynonymous  Rv0046c  I356F

6  51113  C  T  synonymous  Rv0046c  H4H
5  51892  A  G  nonsynonymous  Rv0048c  Y269C
3  53422  G  A  intergenic  -  -
3  54842  G  T  nonsynonymous  Rv0050  A394S
3  56001  G  A  synonymous  Rv0051  Q102Q
5  58875  C  T  nonsynonymous  Rv0054  T97I
5  60059  C  T  nonsynonymous  Rv0057  T55M
5  60300  G  A  synonymous  Rv0057  V135V
6  62367  G  A  nonsynonymous  Rv0058  V658I
2  63146  G  T  intergenic  -  -
1  64028  C  T  synonymous  Rv0060  P40P
3  65083  G  A  synonymous  Rv0061  G31G
1  65159  G  A  nonsynonymous  Rv0061  A57T
1  65663  C  G  nonsynonymous  Rv0062  H38D
3  66632  C  T  nonsynonymous  Rv0062  P361S
1  66892  C  G  intergenic  -  -
3  67012  C  T  synonymous  Rv0063  T30T
1  68174  T  C  nonsynonymous  Rv0063  S418P
3  69984  C  A  synonymous  Rv0064  A455A
4  70267  T  G  nonsynonymous  Rv0064  F550V
5  71203  C  G  nonsynonymous  Rv0064  Q862E
5  71203  C  T  stopgain  Rv0064  Q862X
3  72549  G  A  nonsynonymous  Rv0066c  G655S
5  73148  C  T  nonsynonymous  Rv0066c  A455V
6  74161  G  C  nonsynonymous  Rv0066c  K117N
1  74737  A  C  nonsynonymous  Rv0067c  E154D
5  75313  A  C  nonsynonymous  Rv0068  T5P
6  76147  C  G  nonsynonymous  Rv0068  Q283E
5  77327  T  G  nonsynonymous  Rv0069c  I99S
6  78103  A  G  nonsynonymous  Rv0070c  E265G
1  79479  T  C  intergenic  -  -
6  84238  T  G  synonymous  Rv0075  G81G
5  86587  C  A  synonymous  Rv0078  I20I
2  87468  G  A  nonsynonymous  Rv0078A  E112K
5  87499  C  T  synonymous  Rv0078A  F101F
5  87973  C  T  intergenic  -  -
6  89113  T  G  nonsynonymous  Rv0080  V31G
modern  89200  T  G  nonsynonymous  Rv0080  V60G
5  89474  G  A  synonymous  Rv0080  T151T
6  89535  C  T  intergenic  -  -
1  89871  C  T  synonymous  Rv0081  D99D
6  91001  C  T  nonsynonymous  Rv0083  P201L
6  92016  T  G  nonsynonymous  Rv0083  D539E
5  95867  G  T  nonsynonymous  Rv0087  A152S
modern  97696  C  T  intergenic  -  -
1  98966  G  C  nonsynonymous  Rv0090  A163P
6  100589  A  G  nonsynonymous  Rv0092  T3A
modern  103600  C  T  nonsynonymous  Rv0093c  R22C
6  104712  T  C  intergenic  -  -
6  111651  G  A  nonsynonymous  Rv0101  D551N
5  113059  C  A  nonsynonymous  Rv0101  P1020H
1  115499  T  G  synonymous  Rv0101  R1833R
6  116901  T  G  nonsynonymous  Rv0101  C2301G

3  117389  C  T  synonymous  Rv0101  T2463T
6  121248  C  T  nonsynonymous  Rv0103c  P309L
3  123198  T  C  synonymous  Rv0104  P294P
4  123520  C  T  nonsynonymous  Rv0104  H402Y
3  123745  G  A  nonsynonymous  Rv0104  G477R
6  123842  C  T  intergenic  -  -
modern  126803  C  T  nonsynonymous  Rv0107c  P1247S
3  129576  T  G  synonymous  Rv0107c  A322A
5  131232  G  T  intergenic  -  -
6  134014  C  T  nonsynonymous  Rv0111  A22V
5  134555  C  A  synonymous  Rv0111  G202G
6  135398  G  A  nonsynonymous  Rv0111  M483I
5  137085  A  G  nonsynonymous  Rv0112  D266G
6  137185  G  A  synonymous  Rv0112  Q299Q
1  137233  C  T  synonymous  Rv0112  A315A
modern  139756  C  T  intergenic  -  -
1  139954  A  C  intergenic  -  -
6  140644  G  A  nonsynonymous  Rv0116c  G127R
5  140875  G  A  nonsynonymous  Rv0116c  V50M
6  141261  C  T  nonsynonymous  Rv0117  A21V
5  141516  C  T  nonsynonymous  Rv0117  S106F
4  143207  G  A  nonsynonymous  Rv0118c  G224S
6  144345  C  A  synonymous  Rv0119  R99R
1  144564  C  T  synonymous  Rv0119  I172I
6  144570  C  G  nonsynonymous  Rv0119  F174L
1  146788  C  T  synonymous  Rv0120c  V328V
1  146872  G  A  synonymous  Rv0120c  Q300Q
5  146893  G  A  synonymous  Rv0120c  L293L
2  147262  C  A  nonsynonymous  Rv0120c  D170E
6  147650  A  G  nonsynonymous  Rv0120c  E41G
5  148187  C  T  synonymous  Rv0121c  T52T
1  154191  A  G  intergenic  -  -
5  155478  C  A  nonsynonymous  Rv0127  A416E
3  157129  G  A  nonsynonymous  Rv0129c  G158S
1  160976  C  G  nonsynonymous  Rv0133  N36K
1  162226  G  A  stopgain  Rv0134  W152X
6  162622  G  A  synonymous  Rv0134  V284V
6  162948  C  T  nonsynonymous  Rv0135c  T101I
5  163148  C  T  synonymous  Rv0135c  R34R
5  164936  G  A  nonsynonymous  Rv0137c  D109N
1  167986  C  T  synonymous  Rv0142  D92D
6  168529  G  A  synonymous  Rv0142  P273P
1  168787  G  A  nonsynonymous  Rv0143c  V466I
6  170083  T  C  nonsynonymous  Rv0143c  F34L
3  170671  G  A  nonsynonymous  Rv0144  A130T
3  172492  C  G  stopgain  Rv0146  Y94X
3  181090  C  T  intergenic  -  -
6  183575  C  T  intergenic  -  -
5  188856  C  T  intergenic  -  -
2  190816  A  C  synonymous  Rv0161  S70S
6  191470  G  A  synonymous  Rv0161  L288L
6  195315  T  C  nonsynonymous  Rv0166  F108S
modern  195360  C  T  nonsynonymous  Rv0166  A123V

1  196874  C  T  nonsynonymous  Rv0167  T5I
6  198313  C  A  nonsynonymous  Rv0168  D218E
3  198401  G  T  nonsynonymous  Rv0168  G248C
4  199470  G  T  nonsynonymous  Rv0169  A313S
5  199734  C  G  nonsynonymous  Rv0169  P401A
6  201567  G  C  nonsynonymous  Rv0171  E212D
5  202229  C  G  nonsynonymous  Rv0171  A433G
6  203639  G  A  synonymous  Rv0172  L388L
5  204315  A  G  nonsynonymous  Rv0173  K84R
4  206481  G  C  synonymous  Rv0174  P417P
4  206484  T  G  synonymous  Rv0174  G418G
3  207079  G  C  nonsynonymous  Rv0175  R89P
5  208299  C  T  nonsynonymous  Rv0176  P283L
modern  208318  C  T  synonymous  Rv0176  I289I
modern  208320  G  A  nonsynonymous  Rv0176  S290N
modern  208321  C  G  nonsynonymous  Rv0176  S290R
modern  208321  C  T  synonymous  Rv0176  S290S
6  208403  C  T  nonsynonymous  Rv0176  P318S
6  210442  G  A  nonsynonymous  Rv0179c  R124H
3  210624  G  A  synonymous  Rv0179c  L63L
1  211993  G  T  nonsynonymous  Rv0180c  K86N
1  215238  C  T  synonymous  Rv0184  D90D
4  217201  C  T  synonymous  Rv0186  N311N
3  218599  T  C  intergenic  -  -
modern  223752  G  C  nonsynonymous  Rv0192A  G49A
3  224338  C  T  nonsynonymous  Rv0192  P259S
5  224414  A  G  intergenic  -  -
6  225416  T  C  nonsynonymous  Rv0193c  W386R
modern  225668  A  G  nonsynonymous  Rv0193c  S302G
1  226676  G  A  intergenic  -  -
5  227468  C  T  synonymous  Rv0194  V197V
5  229448  G  C  nonsynonymous  Rv0194  L857F
2  230170  C  T  nonsynonymous  Rv0194  P1098L
5  230197  C  T  nonsynonymous  Rv0194  T1107I
modern  233358  C  A  synonymous  Rv0197  V376V
modern  233364  C  G  nonsynonymous  Rv0197  S378R
5  233377  C  T  nonsynonymous  Rv0197  H383Y
4  234493  C  G  nonsynonymous  Rv0197  L755V
5  240032  G  T  nonsynonymous  Rv0202c  A421S
modern  243598  G  A  nonsynonymous  Rv0205  R72H
1  244550  T  C  synonymous  Rv0206c  R923R
6  245921  C  G  nonsynonymous  Rv0206c  D466E
6  246169  T  A  nonsynonymous  Rv0206c  F384I
5  248946  C  T  intergenic  -  -
4  249522  C  T  nonsynonymous  Rv0209  A162V
5  251176  C  A  nonsynonymous  Rv0210  L353M
4  251575  A  G  nonsynonymous  Rv0210  T486A
modern  251669  C  T  intergenic  -  -
5  253046  A  C  nonsynonymous  Rv0211  K422T
6  254508  G  C  nonsynonymous  Rv0212c  G45R
1  254903  G  T  nonsynonymous  Rv0213c  G350C
3  255373  A  G  nonsynonymous  Rv0213c  D193G
5  256001  G  T  intergenic  -  -

3  257071  C  T  synonymous  Rv0214  Y336Y
5  258470  T  C  synonymous  Rv0215c  G129G
5  258561  T  G  nonsynonymous  Rv0215c  L99R
modern  260282  A  C  nonsynonymous  Rv0217c  T184P
6  260610  G  A  synonymous  Rv0217c  L74L
1  263149  A  C  nonsynonymous  Rv0220  E113A
2  264129  G  A  nonsynonymous  Rv0221  M21I
1  264298  C  A  nonsynonymous  Rv0221  Q78K
1  264984  C  G  synonymous  Rv0221  T306T
3  264992  T  C  nonsynonymous  Rv0221  L309P
3  266405  G  A  nonsynonymous  Rv0223c  G454S
6  272306  G  T  synonymous  Rv0227c  A178A
1  272678  G  A  synonymous  Rv0227c  A54A
6  273558  C  G  synonymous  Rv0228  S168S
3  274463  G  T  nonsynonymous  Rv0229c  R175L
5  274584  T  C  nonsynonymous  Rv0229c  S135P
6  275367  C  A  nonsynonymous  Rv0230c  D199E
3  276539  G  C  nonsynonymous  Rv0231  G161A
5  277865  G  C  intergenic  -  -
5  281405  C  A  nonsynonymous  Rv0235c  P404T
6  282537  C  A  synonymous  Rv0235c  I26I
2  282892  G  A  synonymous  Rv0236c  T1320T
5  285653  G  A  nonsynonymous  Rv0236c  R400Q
3  289253  C  T  synonymous  Rv0239  D50D
1  290374  A  G  nonsynonymous  Rv0241c  H94R
5  291830  A  C  nonsynonymous  Rv0242c  D67A
5  295645  C  T  intergenic  -  -
3  301341  G  T  synonymous  Rv0249c  P105P
6  301687  C  T  intergenic  -  -
5  306201  A  G  nonsynonymous  Rv0254c  H50R
5  309681  C  G  intergenic  -  -
modern  309765  T  C  stoplost  Rv0257  X23R
6  310129  T  C  intergenic  -  -
6  310132  G  A  intergenic  -  -
6  315069  G  T  nonsynonymous  Rv0263c  R233L
modern  320180  C  T  synonymous  Rv0266c  I325I
6  323306  A  G  intergenic  -  -
3  324812  C  T  synonymous  Rv0270  D82D
4  325505  C  T  synonymous  Rv0270  V313V
1  326002  A  G  nonsynonymous  Rv0270  D479G
6  327312  A  G  nonsynonymous  Rv0271c  K384E
3  328569  G  C  intergenic  -  -
5  331309  T  C  nonsynonymous  Rv0275c  L117P
modern  331588  T  C  nonsynonymous  Rv0275c  L24S
modern  333212  C  G  intergenic  -  -
modern  333292  G  A  intergenic  -  -
6  333394  G  C  synonymous  Rv0277A  S8S
3  339230  G  C  intergenic  -  -
4  342146  C  A  nonsynonymous  Rv0282  A6E
3  342873  C  T  synonymous  Rv0282  V248V
modern  343281  C  G  synonymous  Rv0282  A384A
5  344258  G  A  synonymous  Rv0283  V79V
1  344288  C  G  synonymous  Rv0283  S89S

6  344957  A  G  nonsynonymous  Rv0283  I312M
5  345317  G  C  synonymous  Rv0283  V432V
6  352058  G  T  nonsynonymous  Rv0288  A71S
5  352646  G  A  synonymous  Rv0289  P166P
3  353197  C  T  nonsynonymous  Rv0290  R39C
2  353309  G  A  nonsynonymous  Rv0290  S76N
2  353365  G  A  nonsynonymous  Rv0290  A95T
5  357538  A  G  nonsynonymous  Rv0293c  H176R
1  357582  C  T  synonymous  Rv0293c  H161H
3  358473  G  C  nonsynonymous  Rv0294  W101C
6  360585  G  A  nonsynonymous  Rv0296c  E191K
3  363563  G  A  nonsynonymous  Rv0299  A30T
5  376237  G  A  synonymous  Rv0306  L108L
6  378032  G  A  synonymous  Rv0309  A34A
1  378357  T  G  nonsynonymous  Rv0309  S143A
6  378404  G  A  synonymous  Rv0309  P158P
1  378939  C  G  synonymous  Rv0310c  R70R
5  379687  C  T  synonymous  Rv0311  V172V
modern  381030  G  A  nonsynonymous  Rv0312  G159S
6  382243  C  T  nonsynonymous  Rv0312  P563L
6  382489  C  T  intergenic  -  -
4  392261  C  T  stopgain  Rv0325  Q75X
1  393941  G  A  nonsynonymous  Rv0327c  M35I
6  394900  C  T  stopgain  Rv0329c  R141X
1  396750  G  A  nonsynonymous  Rv0331  V184I
3  402836  C  T  synonymous  Rv0337c  Y109Y
6  402881  C  T  synonymous  Rv0337c  D94D
6  405274  C  A  nonsynonymous  Rv0338c  L190I
6  405854  C  G  intergenic  -  -
1  406274  A  G  synonymous  Rv0339c  A725A
5  408006  C  A  nonsynonymous  Rv0339c  P148Q
5  408935  C  T  nonsynonymous  Rv0340  A101V
5  409079  G  A  nonsynonymous  Rv0340  G149E
modern  412280  G  T  nonsynonymous  Rv0342  Q481H
5  414876  C  G  synonymous  Rv0344c  G22G
1  420405  C  T  nonsynonymous  Rv0350  R191C
3  422678  G  A  nonsynonymous  Rv0352  R76H
1  424250  A  G  intergenic  -  -
modern  438271  A  G  nonsynonymous  Rv0359  T252A
3  438470  G  A  synonymous  Rv0360c  K90K
1  439711  T  C  intergenic  -  -
3  440365  G  A  synonymous  Rv0362  S165S
3  440878  C  T  synonymous  Rv0362  T336T
5  441062  G  T  nonsynonymous  Rv0362  A398S
6  441891  A  G  nonsynonymous  Rv0363c  I137V
5  442468  G  A  nonsynonymous  Rv0364  G25D
5  443471  G  A  nonsynonymous  Rv0365c  A243T
6  443897  C  T  synonymous  Rv0365c  L101L
5  445696  C  A  stopgain  Rv0368c  S277X
6  445737  G  A  synonymous  Rv0368c  P263P
4  445780  A  G  nonsynonymous  Rv0368c  H249R
modern  447442  A  G  nonsynonymous  Rv0370c  E201G
1  447525  C  G  nonsynonymous  Rv0370c  I173M

3  447642  C  G  synonymous  Rv0370c  L134L
1  452288  C  T  intergenic  -  -
6  452730  A  G  nonsynonymous  Rv0375c  D142G
5  454418  T  G  intergenic  -  -
3  455024  G  A  nonsynonymous  Rv0377  V202I
modern  455325  G  C  nonsynonymous  Rv0377  R302P
modern  455329  C  T  synonymous  Rv0377  G303G
1  456511  C  T  synonymous  Rv0380c  T103T
5  456731  G  A  nonsynonymous  Rv0380c  R30H
6  457372  C  A  nonsynonymous  Rv0381c  P151Q
5  458116  G  A  nonsynonymous  Rv0382c  D89N
5  464480  G  A  nonsynonymous  Rv0386  R357Q
5  464958  C  T  synonymous  Rv0386  G516G
5  466175  A  C  nonsynonymous  Rv0386  E922A
modern  468357  G  A  nonsynonymous  Rv0389  G8E
1  469042  C  T  synonymous  Rv0389  N236N
1  471666  T  C  nonsynonymous  Rv0392c  M325T
6  476582  G  T  intergenic  -  -
2  477234  A  C  nonsynonymous  Rv0398c  E29D
6  477634  A  G  nonsynonymous  Rv0399c  E308G
5  477988  A  C  nonsynonymous  Rv0399c  H190P
5  479350  A  G  nonsynonymous  Rv0400c  D135G
5  480239  C  G  intergenic  -  -
5  482106  C  T  stopgain  Rv0402c  R376X
3  484504  C  G  synonymous  Rv0404  A176A
2  484596  C  T  nonsynonymous  Rv0404  P207L
1  485230  C  T  synonymous  Rv0404  H418H
3  485561  A  C  nonsynonymous  Rv0404  I529L
1  485785  T  G  nonsynonymous  Rv0405  L19V
5  487463  A  G  nonsynonymous  Rv0405  D578G
5  489024  C  T  synonymous  Rv0405  S1098S
6  489514  G  A  nonsynonymous  Rv0405  G1262R
6  490398  C  T  nonsynonymous  Rv0406c  T103M
6  491668  A  G  nonsynonymous  Rv0407  K296E
4  491742  C  T  synonymous  Rv0407  F320F
4  492150  C  G  nonsynonymous  Rv0408  A122G
1  492655  G  A  synonymous  Rv0408  A290A
6  494915  C  A  nonsynonymous  Rv0409  D355E
6  495108  C  A  nonsynonymous  Rv0410c  T736K
5  495322  G  A  nonsynonymous  Rv0410c  A665T
1  495473  G  A  synonymous  Rv0410c  S614S
2  497491  C  T  synonymous  Rv0411c  D270D
4  498531  C  T  synonymous  Rv0412c  A363A
5  500223  G  T  nonsynonymous  Rv0413  V171L
5  501517  G  A  nonsynonymous  Rv0415  E124K
4  505974  A  G  nonsynonymous  Rv0419  T297A
6  507989  G  A  synonymous  Rv0422c  L189L
5  512659  G  A  nonsynonymous  Rv0425c  S888N
6  514098  G  A  synonymous  Rv0425c  A408A
modern  514657  C  A  nonsynonymous  Rv0425c  A222E
4  517358  G  A  nonsynonymous  Rv0428c  G149D
3  517389  G  T  nonsynonymous  Rv0428c  V139F
modern  517411  C  T  synonymous  Rv0428c  R131R

6  518166  C  A  synonymous  Rv0429c  R77R
modern  519185  G  T  nonsynonymous  Rv0431  G38V
6  519331  A  G  nonsynonymous  Rv0431  T87A
5  519872  C  T  synonymous  Rv0432  F91F
modern  522081  G  A  nonsynonymous  Rv0434  A190T
3  523654  G  A  nonsynonymous  Rv0435c  A294T
6  524802  C  T  nonsynonymous  Rv0436c  P197S
5  525205  C  T  synonymous  Rv0436c  R62R
5  525540  A  C  synonymous  Rv0437c  A181A
6  526255  C  T  nonsynonymous  Rv0438c  P369L
6  526406  C  T  nonsynonymous  Rv0438c  P319S
modern  527316  C  G  nonsynonymous  Rv0438c  I15M
modern  528354  A  C  intergenic  -  -
6  529147  C  G  synonymous  Rv0440  T180T
6  532927  G  C  intergenic  -  -
5  534205  G  A  synonymous  Rv0445c  R64R
modern  534427  A  T  intergenic  -  -
5  536070  G  A  synonymous  Rv0447c  R146R
6  538762  G  A  synonymous  Rv0450c  L910L
5  539019  C  T  nonsynonymous  Rv0450c  H825Y
2  542014  C  G  intergenic  -  -
4  546357  C  T  synonymous  Rv0456c  T149T
6  547013  G  A  intergenic  -  -
3  548326  A  G  nonsynonymous  Rv0457c  T428A
1  549251  G  A  stopgain  Rv0457c  W119X
5  554297  C  T  synonymous  Rv0463  D94D
5  554493  G  T  synonymous  Rv0464c  A131A
5  555621  A  C  nonsynonymous  Rv0465c  K229T
1  555945  A  G  nonsynonymous  Rv0465c  Q121R
4  555991  C  T  nonsynonymous  Rv0465c  R106C
1  556035  C  A  nonsynonymous  Rv0465c  P91Q
6  556089  T  C  nonsynonymous  Rv0465c  V73A
6  556201  A  G  nonsynonymous  Rv0465c  N36D
5  558750  C  T  synonymous  Rv0467  T408T
1  560664  C  T  synonymous  Rv0469  N259N
1  560666  A  G  nonsynonymous  Rv0469  K260R
6  560857  G  C  synonymous  Rv0470c  L285L
3  562064  G  C  nonsynonymous  Rv0470A  W77C
3  562066  T  C  nonsynonymous  Rv0470A  W77R
5  562322  A  C  nonsynonymous  Rv0471c  E131A
5  563965  C  T  synonymous  Rv0473  G134G
1  564723  C  T  nonsynonymous  Rv0473  T387M
modern  565404  T  G  synonymous  Rv0474  R128R
1  568693  C  A  nonsynonymous  Rv0479c  A92E
5  569841  C  T  intergenic  -  -
5  573190  C  T  synonymous  Rv0484c  G204G
modern  573384  C  A  nonsynonymous  Rv0484c  L140I
3  579284  T  C  intergenic  -  -
1  580336  C  A  synonymous  Rv0490  R330R
6  580576  C  T  stopgain  Rv0490  R410X
4  584171  G  A  nonsynonymous  Rv0493c  G174S
modern  584511  A  C  synonymous  Rv0493c  T60T
5  588160  G  T  nonsynonymous  Rv0497  A262S


5 588733 C G nonsynonymous Rv0498 P137A 5 589808 A C nonsynonymous Rv0499 D209A 6 590622 C G synonymous Rv0500 G180G modern 590763 C G nonsynonymous Rv0500 D227E 5 591470 G A intergenic - - 1 591965 G A synonymous Rv0501 A104A 6 592494 G A nonsynonymous Rv0501 G281R 1 595501 C T nonsynonymous Rv0505c A362V 6 598244 C T nonsynonymous Rv0507 T349I 6 598723 G T nonsynonymous Rv0507 D509Y 6 599363 C G nonsynonymous Rv0507 S722C 4 599868 G A synonymous Rv0507 R890R 5 601315 G A nonsynonymous Rv0509 R292H 5 609003 C G nonsynonymous Rv0517 I86M 5 610787 G A synonymous Rv0518 E200E 6 611373 A G synonymous Rv0519c Q234Q 3 611977 G A nonsynonymous Rv0519c G33D 5 613957 G T nonsynonymous Rv0522 R307L 1 615938 G A synonymous Rv0524 E368E 1 621390 G A nonsynonymous Rv0530 V162I 5 622361 G A synonymous Rv0531 R11R 1 627350 C G nonsynonymous Rv0536 P35A 2 627485 G A nonsynonymous Rv0536 V80I 2 628864 A G nonsynonymous Rv0537c T290A 6 628906 C T synonymous Rv0537c L276L 3 629714 C T synonymous Rv0537c D6D 6 630018 A G intergenic - - 6 631296 G A synonymous Rv0538 P419P 3 635139 C G synonymous Rv0542c G122G 5 635775 G A nonsynonymous Rv0543c R34H 6 640129 C T intergenic - - 2 640954 A G intergenic - - 1 643483 G A nonsynonymous Rv0552 G199S 6 644439 C T synonymous Rv0552 A517A 5 644472 T C synonymous Rv0552 G528G 5 646048 C G synonymous Rv0554 P194P 1 646531 A T synonymous Rv0555 T78T 5 648756 C T nonsynonymous Rv0557 T74M 4 648856 C T synonymous Rv0557 G107G 1 649345 C T synonymous Rv0557 A270A 5 649446 G A nonsynonymous Rv0557 G304D 3 652950 T C synonymous Rv0562 R60R 6 654603 A G nonsynonymous Rv0563 E242G 6 655382 A G nonsynonymous Rv0564c T190A 1 655707 C T synonymous Rv0564c V81V 4 655986 G T intergenic - - 6 656432 G A nonsynonymous Rv0565c G347S 2 657142 G A nonsynonymous Rv0565c R110H modern 657578 C T synonymous Rv0566c D154D 6 658923 C G nonsynonymous Rv0567 F201L 6 659019 C T synonymous Rv0567 D233D 4 659341 C T intergenic - - 5 660153 T G nonsynonymous Rv0568 L235R


4 662911 C T synonymous Rv0570 A539A 5 666713 T G nonsynonymous Rv0573c L177R 5 667950 C T stopgain Rv0574c Q149X 6 669225 A G nonsynonymous Rv0575c Y174C 6 669231 A C nonsynonymous Rv0575c E172A 5 669406 G A nonsynonymous Rv0575c E114K 4 670545 A G nonsynonymous Rv0576 H233R 1 678440 G A synonymous Rv0583c T212T 5 678934 G T nonsynonymous Rv0583c A48S modern 684290 G A intergenic - - 3 684376 T C intergenic - - 1 685955 G A nonsynonymous Rv0588 A10T 3 686123 C A nonsynonymous Rv0588 L66M 5 686146 C T synonymous Rv0588 G73G 6 687074 C T nonsynonymous Rv0589 P85L 6 688260 G A nonsynonymous Rv0590 V77I modern 690248 A C nonsynonymous Rv0591 N397T 4 690450 C A synonymous Rv0591 A464A 2 696917 G T intergenic - - 5 697196 C T synonymous Rv0598c D124D modern 700776 C T nonsynonymous Rv0604 P180S 1 704997 C T stopgain Rv0610c Q305X modern 705602 G A nonsynonymous Rv0610c S103N 6 705988 G C nonsynonymous Rv0611c E119D 2 707334 A G nonsynonymous Rv0613c T728A 3 708056 A G nonsynonymous Rv0613c H487R 5 708263 T C nonsynonymous Rv0613c L418S 6 709150 C T synonymous Rv0613c D122D 4 713310 C T nonsynonymous Rv0620 R199C 6 713802 C T nonsynonymous Rv0620 R363C modern 715266 G A stopgain Rv0621 W355X 6 717062 C T intergenic - - 5 717558 C G synonymous Rv0625c G112G modern 717588 C T synonymous Rv0625c V102V 1 720863 C A synonymous Rv0629c A290A 1 722852 C T nonsynonymous Rv0630c P721L 1 726498 G T synonymous Rv0631c L603L 5 728532 A C intergenic - - 6 729114 C T synonymous Rv0632c G55G 3 729685 G A nonsynonymous Rv0633c R161Q 1 730087 C T nonsynonymous Rv0633c A27V 5 731750 G C nonsynonymous Rv0634B L13F 1 734116 T C nonsynonymous Rv0638 M127T 5 735135 G A nonsynonymous Rv0640 M38I 6 735252 C T synonymous Rv0640 A77A 6 736919 C A nonsynonymous Rv0642c F95L 6 738820 G T nonsynonymous Rv0644c R114L 5 738899 G A nonsynonymous Rv0644c V88I 5 740038 C G synonymous Rv0645c A50A 6 742633 G A intergenic - - 5 745835 C G synonymous Rv0648 G1039G 1 749968 T C intergenic - - 1 752046 C T 
nonsynonymous Rv0655 A177V


3 753174 G T nonsynonymous Rv0656c W65L 1 753668 C T intergenic - - 1 754387 C T nonsynonymous Rv0658c T8I 6 754754 C T synonymous Rv0659c R80R 2 757139 C A nonsynonymous Rv0663 R335S 4 757182 G A nonsynonymous Rv0663 G349D 3 759746 C T intergenic - - 6 760969 C T nonsynonymous Rv0667 S388L 6 761723 A C nonsynonymous Rv0667 E639D 3 762434 T G synonymous Rv0667 G876G 4 763031 C T synonymous Rv0667 A1075A 1 763884 C T nonsynonymous Rv0668 A172V 1 763886 C A synonymous Rv0668 R173R 3 767339 A G intergenic - - 1 767609 A G intergenic - - 5 769406 G A synonymous Rv0669c L64L 6 772596 T C synonymous Rv0672 R371R 5 773021 G A nonsynonymous Rv0672 R513Q 4 776100 T C nonsynonymous Rv0676c I794T 6 781075 G C nonsynonymous Rv0681 A119P 1 786137 A C intergenic - - 3 788615 G C nonsynonymous Rv0688 G226R 1 797597 C T nonsynonymous Rv0697 T222M 2 798355 G C nonsynonymous Rv0697 A475P 3 798779 T C intergenic - - 3 798934 C A synonymous Rv0698 R34R 1 800357 C A intergenic - - 5 801959 C T synonymous Rv0702 R166R 6 807012 G T nonsynonymous Rv0711 W226C 3 807405 C T synonymous Rv0711 S357S 1 810287 C G synonymous Rv0713 T114T 1 811492 C G synonymous Rv0714 V40V 2 811753 C T synonymous Rv0715 H4H 1 812502 C T synonymous Rv0716 V148V modern 815236 C T nonsynonymous Rv0723 T16I modern 815851 G A synonymous Rv0724 R63R 6 816732 G A nonsynonymous Rv0724 S357N 6 816862 G T synonymous Rv0724 V400V 5 817489 T C synonymous Rv0724 T609T 1 817696 C T nonsynonymous Rv0725c P250L 1 819213 A G nonsynonymous Rv0726c E143G 3 820734 T C nonsynonymous Rv0728c V248A 4 820752 A G nonsynonymous Rv0728c H242R modern 821907 C T nonsynonymous Rv0729 P134L 1 829719 G A intergenic - - modern 832246 G A synonymous Rv0740 Q157Q 3 834857 G A nonsynonymous Rv0744c M30I 6 841095 T G nonsynonymous Rv0748 V50G 5 841139 C T synonymous Rv0748 L65L 1 841494 C G synonymous Rv0749 L89L 1 841495 A G nonsynonymous Rv0749 M90V 6 841629 G T synonymous Rv0749 S134S 6 843751 G T nonsynonymous Rv0752c A222S


4 847995 C T intergenic - - 5 850047 C T intergenic - - 1 850985 T C nonsynonymous Rv0756c I161T 6 851104 G A synonymous Rv0756c A121A 6 851562 G A intergenic - - 3 857643 G C nonsynonymous Rv0764c G132A 3 858464 T C nonsynonymous Rv0765c V134A 5 861279 A C nonsynonymous Rv0768 E123A 6 862664 C T synonymous Rv0769 L85L 5 863975 G T synonymous Rv0770 R240R 1 865761 C T synonymous Rv0772 H392H 6 866448 G A synonymous Rv0773c T314T 6 867745 A G nonsynonymous Rv0774c T203A 5 869036 C T nonsynonymous Rv0776c T243I modern 871271 C A nonsynonymous Rv0777 P422T 5 872863 A C nonsynonymous Rv0779c T144P 5 883072 C T nonsynonymous Rv0788 R105C 1 885689 C A nonsynonymous Rv0791c T51K 6 886178 C A nonsynonymous Rv0792c L157I 5 892659 C T synonymous Rv0799c T205T modern 894888 C T synonymous Rv0801 S86S 1 895082 T C nonsynonymous Rv0802c F183L 1 895120 G A nonsynonymous Rv0802c R170H 3 896979 T C nonsynonymous Rv0803 V387A 6 900065 C A synonymous Rv0806c T422T 6 901327 C G nonsynonymous Rv0806c P2A 5 901358 G A intergenic - - 5 904367 G C nonsynonymous Rv0809 D215H 5 905344 G T nonsynonymous Rv0811c G333C 3 906742 T C nonsynonymous Rv0812 V107A 5 907906 G A nonsynonymous Rv0813c R38H 5 908033 G C intergenic - - 1 910015 G T nonsynonymous Rv0816c A7S 5 910282 C A synonymous Rv0817c S187S modern 911261 C T nonsynonymous Rv0818 P97L modern 913274 G C synonymous Rv0820 S183S modern 916046 C A nonsynonymous Rv0822c P89H 6 916350 G A intergenic - - 6 916714 C A synonymous Rv0823c L311L 6 917259 G T nonsynonymous Rv0823c G130C 5 919007 T G nonsynonymous Rv0825c C183G 6 919382 T C nonsynonymous Rv0825c F58L 6 919384 A G nonsynonymous Rv0825c Y57C 3 919551 G A synonymous Rv0825c V1V 6 920333 C T nonsynonymous Rv0826 P234S 5 921429 C T nonsynonymous Rv0828c P62L 4 931123 C T synonymous Rv0835 Y57Y 6 932252 C A intergenic - - 4 932280 G A stopgain Rv0836c W218X modern 933699 A C synonymous Rv0837c G111G 4 934230 G C intergenic - - 4 934611 T G intergenic - - 6 937614 C A nonsynonymous 
Rv0841 H8N

modern 938246 G A synonymous Rv0842 L45L 2 940602 G C nonsynonymous Rv0844c G169R 6 941054 C T nonsynonymous Rv0844c P18L 4 941845 A C nonsynonymous Rv0845 E219A 5 941849 A C nonsynonymous Rv0845 E220D 3 942616 C A intergenic - - 6 944725 C T nonsynonymous Rv0847 A128V 3 945238 G T nonsynonymous Rv0848 A101S modern 948294 G A nonsynonymous Rv0851c G59S 3 950116 G C nonsynonymous Rv0853c A335P modern 951142 A C intergenic - - 1 952597 C T synonymous Rv0855 I322I 5 954131 G A nonsynonymous Rv0858c V264M 5 955631 C T synonymous Rv0859 G185G 3 957306 C T synonymous Rv0860 A338A 6 959369 G A synonymous Rv0861c S261S 4 960367 C T nonsynonymous Rv0862c S749L 6 962133 C A nonsynonymous Rv0862c D160E 5 964400 C A nonsynonymous Rv0867c T379K 3 964969 A G synonymous Rv0867c A189A 5 965648 G T intergenic - - 6 972484 C T intergenic - - 2 972980 G A nonsynonymous Rv0874c G243S 6 975915 G C nonsynonymous Rv0876c R8P 1 976043 G T intergenic - - 6 982363 C A synonymous Rv0884c G64G 1 987601 C T synonymous Rv0888 G123G 6 988043 T C nonsynonymous Rv0888 Y271H 5 991740 A G nonsynonymous Rv0890c Q286R 3 991939 C T nonsynonymous Rv0890c R220C 6 994678 C G nonsynonymous Rv0892 L276V 1 996219 C T nonsynonymous Rv0893c A26V 3 996263 C T synonymous Rv0893c T11T 6 996284 G A synonymous Rv0893c E4E 6 1000732 A G nonsynonymous Rv0896 T421A 3 1002172 C T nonsynonymous Rv0897c R82W 1 1002342 A G nonsynonymous Rv0897c Y25C 5 1004177 A C nonsynonymous Rv0901 E74A 3 1007198 T C nonsynonymous Rv0904c L328P 6 1007708 G A nonsynonymous Rv0904c G158E 3 1008460 C T nonsynonymous Rv0905 S85F 1 1009490 C T stopgain Rv0906 Q183X 3 1009500 C T nonsynonymous Rv0906 P186L 6 1009957 G A synonymous Rv0906 L338L 3 1012815 C G nonsynonymous Rv0908 A362G 6 1013635 C T synonymous Rv0908 V635V modern 1014815 G T nonsynonymous Rv0909 Q45H 2 1022003 A C intergenic - - 1 1022613 G A nonsynonymous Rv0917 R176Q 3 1023911 C G intergenic - - 4 1024346 G A nonsynonymous Rv0918 G46S 5
1025135 T C nonsynonymous Rv0919 L151P 3 1029586 C T nonsynonymous Rv0923c P331L


1 1029997 T C nonsynonymous Rv0923c V194A 1 1032524 A G nonsynonymous Rv0925c D37G 6 1034238 T C synonymous Rv0927c L132L 3 1034381 C T nonsynonymous Rv0927c A84V 5 1038813 C T nonsynonymous Rv0931c H368Y modern 1040706 G T nonsynonymous Rv0932c A115S 1 1043136 C T nonsynonymous Rv0934 T341I 6 1043169 T C nonsynonymous Rv0934 V352A 1 1048102 G A nonsynonymous Rv0938 R656H 6 1049460 A T nonsynonymous Rv0939 D350V 5 1050523 G T intergenic - - 5 1053653 G A synonymous Rv0943c R28R 6 1054136 C T synonymous Rv0944 T124T 4 1054784 G C nonsynonymous Rv0945 G180R 5 1058309 A G nonsynonymous Rv0949 D17G 1 1061386 A G nonsynonymous Rv0950c E90G modern 1063765 A G nonsynonymous Rv0952 K209R 1 1063922 C T synonymous Rv0952 G261G 1 1066038 A G synonymous Rv0954 *304* 6 1069146 C T synonymous Rv0957 G314G 3 1071349 G C nonsynonymous Rv0959 G32A 1 1072342 A G nonsynonymous Rv0959 D363G 1 1075169 A G intergenic - - 5 1077102 G A synonymous Rv0965c A32A 1 1077754 G A nonsynonymous Rv0966c A28T 4 1080192 A G nonsynonymous Rv0969 N484D 1 1083755 T C nonsynonymous Rv0973c S666P 1 1086648 G T nonsynonymous Rv0974c R233L 6 1095053 G T nonsynonymous Rv0979A K56N modern 1097023 A G nonsynonymous Rv0981 S70G 5 1097633 C T synonymous Rv0982 A42A 4 1098523 A T nonsynonymous Rv0982 H339L 2 1102468 C A synonymous Rv0986 G222G 5 1104499 A G nonsynonymous Rv0987 D653G 4 1104690 G T nonsynonymous Rv0987 V717F modern 1105284 A G nonsynonymous Rv0988 I57V 5 1105557 G T nonsynonymous Rv0988 A148S 3 1106099 C T synonymous Rv0988 I328I 5 1107024 A G nonsynonymous Rv0989c D120G 6 1107897 C T nonsynonymous Rv0990c A68V 4 1107940 G T nonsynonymous Rv0990c A54S modern 1109163 C G nonsynonymous Rv0992c I3M 3 1110721 T C synonymous Rv0994 R151R modern 1110956 G T nonsynonymous Rv0994 G230C 1 1111518 T C nonsynonymous Rv0994 V417A 3 1111852 G T nonsynonymous Rv0995 D81Y modern 1113290 C G nonsynonymous Rv0996 Q303E 5 1114129 T G intergenic - - 6 1117308 C T synonymous Rv1001 L42L 1 1117405 C T nonsynonymous 
Rv1001 T74I 6 1118270 A G synonymous Rv1001 V362V 6 1119597 G C nonsynonymous Rv1002c V115L 1 1119739 G A synonymous Rv1002c L67L


6 1122175 C G intergenic - - modern 1123597 C T nonsynonymous Rv1005c S1L modern 1131300 A G nonsynonymous Rv1012 N58S 3 1139089 G C synonymous Rv1020 T41T 1 1139222 G A nonsynonymous Rv1020 A86T 5 1139497 C T synonymous Rv1020 T177T 5 1141069 G T synonymous Rv1020 A701A modern 1143832 C A nonsynonymous Rv1022 P33T 3 1144409 T G nonsynonymous Rv1022 I225S 4 1144585 G A nonsynonymous Rv1023 G8R 5 1145442 C G synonymous Rv1023 G293G 4 1148259 G A intergenic - - 6 1149547 G A nonsynonymous Rv1028c A714T 6 1150490 G C synonymous Rv1028c S399S 6 1150803 G A nonsynonymous Rv1028c G295D 1 1151490 C G nonsynonymous Rv1028c T66R 3 1152805 T A nonsynonymous Rv1029 L265Q 3 1152863 A G synonymous Rv1029 Q284Q 3 1153388 C T synonymous Rv1029 N459N 6 1153920 C T nonsynonymous Rv1030 T66I 6 1154634 T C nonsynonymous Rv1030 V304A 1 1155700 C T synonymous Rv1030 I659I 5 1155819 T C nonsynonymous Rv1030 V699A 5 1156224 C T synonymous Rv1031 G124G 6 1156704 A G nonsynonymous Rv1032c T418A 1 1157771 C G nonsynonymous Rv1032c S62C 3 1164571 A G intergenic - - modern 1165521 T A intergenic - - 2 1168776 A C nonsynonymous Rv1046c R151S 2 1175343 C T intergenic - - 6 1177815 A G nonsynonymous Rv1056 D63G 1 1184826 G T nonsynonymous Rv1061 R271L 5 1186287 C A synonymous Rv1063c A179A 5 1190588 G A intergenic - - 5 1192641 C A nonsynonymous Rv1069c H545N 1 1192830 C T nonsynonymous Rv1069c R482W 5 1196194 C T intergenic - - 6 1197169 G C intergenic - - 1 1199019 A G nonsynonymous Rv1074c T119A 2 1201581 A C nonsynonymous Rv1076 Q272P 5 1203264 C T intergenic - - modern 1203824 C T nonsynonymous Rv1078 T171I 4 1211369 C A nonsynonymous Rv1086 R259S 5 1213925 C T intergenic - - 5 1215581 A C nonsynonymous Rv1089A H22P 6 1220180 C T nonsynonymous Rv1092c R3W modern 1220570 T G intergenic - - 6 1224174 C T nonsynonymous Rv1095 A393V 6 1225198 C T nonsynonymous Rv1096 P272S 6 1226021 C A nonsynonymous Rv1097c Q42K 1 1228116 T C nonsynonymous Rv1099c I156T 4 1230778 T C nonsynonymous Rv1102c I65T 
5 1232089 C T intergenic - -


6 1233275 C T synonymous Rv1106c L228L modern 1235446 G A synonymous Rv1108c S5S 3 1236433 A C nonsynonymous Rv1110 E83D 1 1237403 C T nonsynonymous Rv1111c P264S 6 1238483 G A nonsynonymous Rv1112 V77I 5 1239649 C T nonsynonymous Rv1114 R14C 5 1240578 C T nonsynonymous Rv1115 P131L 3 1240744 C A nonsynonymous Rv1115 N186K 3 1241572 A G intergenic - - 5 1242007 G A synonymous Rv1118c R275R 5 1242416 A G nonsynonymous Rv1118c Q139R 5 1243724 T C synonymous Rv1121 G6G 6 1245781 C T nonsynonymous Rv1122 T218I 6 1247306 C G synonymous Rv1124 V60V 3 1247391 A C nonsynonymous Rv1124 S89R 4 1248382 G A nonsynonymous Rv1125 G101S 4 1248936 C G synonymous Rv1125 P285P 5 1250131 T C nonsynonymous Rv1127c V425A 4 1250340 C T synonymous Rv1127c A355A modern 1250357 C A nonsynonymous Rv1127c P350T 5 1251071 T G nonsynonymous Rv1127c S112A 6 1253028 C T intergenic - - 4 1254562 G A nonsynonymous Rv1130 G3D 5 1255685 C T synonymous Rv1130 F377F 6 1256012 G T synonymous Rv1130 R486R 5 1256176 A G synonymous Rv1131 K15K 5 1256895 A C nonsynonymous Rv1131 E255A 5 1257823 G A nonsynonymous Rv1132 A167T 1 1262230 C T intergenic - - 3 1265828 C T synonymous Rv1138c L221L 6 1265913 T C synonymous Rv1138c T192T 3 1271187 C G nonsynonymous Rv1144 T11S 6 1275025 T A synonymous Rv1147 G42G 6 1275084 C T nonsynonymous Rv1147 P62L 5 1275333 A G nonsynonymous Rv1147 H145R 6 1279184 G A nonsynonymous Rv1151c R145Q 6 1281685 C G nonsynonymous Rv1155 A86G 4 1281771 C T nonsynonymous Rv1155 P115S 3 1281984 G A intergenic - - 5 1283821 T C synonymous Rv1157c N117N 6 1283851 G A synonymous Rv1157c S107S 5 1284479 C G synonymous Rv1158c T128T 5 1284931 C T intergenic - - 6 1286582 A G intergenic - - modern 1287112 C T intergenic - - 1 1287160 A C intergenic - - 3 1287372 G C synonymous Rv1161 L15L 5 1288251 C T synonymous Rv1161 C308C 5 1288630 A C nonsynonymous Rv1161 S435R 5 1305657 G A intergenic - - 6 1306281 T G nonsynonymous Rv1175c I649S 3 1308317 A G nonsynonymous Rv1176c H159R modern 1310316 
T C nonsynonymous Rv1178 V318A


5 1313128 C A synonymous Rv1179c R58R 5 1313131 C A nonsynonymous Rv1179c R57S 6 1313726 T C nonsynonymous Rv1180 V1A 6 1314261 G A synonymous Rv1180 S179S 1 1314617 C T nonsynonymous Rv1180 A298V 5 1316651 A G nonsynonymous Rv1181 Q473R 2 1317655 C T synonymous Rv1181 L808L 6 1318990 C T nonsynonymous Rv1181 L1253F 1 1320508 G A synonymous Rv1182 V158V 1 1320614 C G nonsynonymous Rv1182 L194V 3 1325650 G A intergenic - - 2 1329234 C T synonymous Rv1186c D24D 5 1331789 C T nonsynonymous Rv1188 R257C 3 1340784 T C synonymous Rv1197 G42G modern 1341040 A C nonsynonymous Rv1198 D12A 1 1344857 G A nonsynonymous Rv1201c G105S 3 1345016 G A nonsynonymous Rv1201c A52T 6 1347173 G A synonymous Rv1204c L484L 6 1347264 C T nonsynonymous Rv1204c T454M 1 1348520 G T synonymous Rv1204c L35L 1 1348521 T C nonsynonymous Rv1204c L35P 4 1351172 G A intergenic - - 5 1352566 C T synonymous Rv1208 H141H 6 1355937 C T synonymous Rv1213 F34F 1 1356648 C T synonymous Rv1213 D271D 5 1358934 A C nonsynonymous Rv1215c S171R modern 1358940 T G nonsynonymous Rv1215c S169A 5 1359908 T G nonsynonymous Rv1216c V80G 5 1360604 G A nonsynonymous Rv1217c V400I 6 1366736 T C nonsynonymous Rv1223 F288L 6 1367208 G A nonsynonymous Rv1223 S445N 4 1367484 G T nonsynonymous Rv1224 G8W 6 1368133 G T nonsynonymous Rv1225c A197S modern 1368947 C T nonsynonymous Rv1226c A450V 1 1369389 G A nonsynonymous Rv1226c G303S 1 1369735 C T synonymous Rv1226c Y187Y 3 1371470 G A nonsynonymous Rv1228 R184H 6 1372002 G A synonymous Rv1229c L316L 5 1372975 C G nonsynonymous Rv1230c A408G 6 1373576 T C nonsynonymous Rv1230c W208R 1 1374578 G A nonsynonymous Rv1231c R96H 1 1374639 A T nonsynonymous Rv1231c N76Y 6 1375349 G T nonsynonymous Rv1232c G274W 5 1377185 C G synonymous Rv1234 L70L 3 1377568 A G synonymous Rv1235 V15V 5 1383185 A C intergenic - - 5 1383970 A C nonsynonymous Rv1240 D253A 6 1384188 C A nonsynonymous Rv1240 L326I 6 1384255 G T intergenic - - 1 1387211 G A nonsynonymous Rv1244 A119T 6 1387580 C A 
nonsynonymous Rv1244 Q242K 6 1388517 G A nonsynonymous Rv1245c V38I 1 1389866 G A nonsynonymous Rv1248c G1063S


6 1390089 G A synonymous Rv1248c A988A 4 1390763 A G nonsynonymous Rv1248c M764V 6 1391728 G A nonsynonymous Rv1248c R442H modern 1395010 G A nonsynonymous Rv1250 G278R 3 1396618 G T stopgain Rv1251c E875X modern 1397201 C T synonymous Rv1251c N680N 1 1397215 G C nonsynonymous Rv1251c G676R 5 1397633 G A synonymous Rv1251c K536K modern 1400396 G A nonsynonymous Rv1253 V143M 3 1401033 C T nonsynonymous Rv1253 S355L 5 1403266 G C nonsynonymous Rv1255c A41P 6 1404738 C T synonymous Rv1257c L449L 6 1406685 T C nonsynonymous Rv1258c V219A 5 1407273 A T nonsynonymous Rv1258c D23V modern 1410062 G C nonsynonymous Rv1262c R104P 5 1413242 G T intergenic - - 6 1414870 C T nonsynonymous Rv1266c T324I 6 1416633 C G nonsynonymous Rv1267c L239V 1 1417019 G A nonsynonymous Rv1267c C110Y modern 1417554 C G intergenic - - 1 1417793 C T synonymous Rv1268c T188T 5 1419373 A G nonsynonymous Rv1270c T126A 5 1422079 G C nonsynonymous Rv1272c R76P 3 1422666 G A nonsynonymous Rv1273c G462E 3 1422667 G A nonsynonymous Rv1273c G462R modern 1424699 C T nonsynonymous Rv1274 P168L 1 1426928 C T synonymous Rv1277 F255F 5 1436284 T C synonymous Rv1283c G278G 6 1438981 C T synonymous Rv1286 L25L 3 1440090 C T nonsynonymous Rv1286 T395I 3 1441545 G A synonymous Rv1288 A66A 6 1442734 C G intergenic - - 1 1443354 C T synonymous Rv1289 A196A 1 1445977 C T intergenic - - 3 1450316 C T synonymous Rv1294 A314A 6 1452717 C T nonsynonymous Rv1296 P241S 6 1453680 C T synonymous Rv1297 T159T 6 1454811 C T synonymous Rv1297 N536N modern 1458144 A C nonsynonymous Rv1301 Q196P 6 1461251 G T synonymous Rv1305 A69A 5 1463143 C T nonsynonymous Rv1307 S434L 5 1465542 A C nonsynonymous Rv1309 Y220S 6 1467735 G A synonymous Rv1312 G16G 3 1478357 C A synonymous Rv1317c P254P 4 1479085 G A nonsynonymous Rv1317c V12I 1 1480972 A G synonymous Rv1319c E510E 5 1481038 G A synonymous Rv1319c L488L 3 1481563 G A synonymous Rv1319c E313E 5 1482978 G A nonsynonymous Rv1320c A414T 6 1486647 G C nonsynonymous Rv1323 Q262H 5 
1487674 G A nonsynonymous Rv1324 A172T 6 1490140 G A nonsynonymous Rv1326c A725T 4 1490905 C T nonsynonymous Rv1326c P470S

modern 1490911 T C nonsynonymous Rv1326c S468P modern 1492194 G A nonsynonymous Rv1326c G40D 3 1495836 G A nonsynonymous Rv1328 V425M modern 1496964 T G nonsynonymous Rv1328 S801A 5 1501448 G T nonsynonymous Rv1332 V175L modern 1505194 C T synonymous Rv1338 G40G 5 1505806 C T synonymous Rv1338 P244P modern 1505973 A G synonymous Rv1339 R19R 6 1507308 G C nonsynonymous Rv1340 R185P 5 1507920 G A synonymous Rv1341 V116V 6 1508682 A C nonsynonymous Rv1343c E81A 5 1509093 C G synonymous Rv1344 L42L 3 1513189 C T nonsynonymous Rv1348 A48V 1 1514010 G T nonsynonymous Rv1348 V322F 6 1515003 G A nonsynonymous Rv1348 A653T 5 1518271 G A nonsynonymous Rv1351 R14K 5 1518280 C T nonsynonymous Rv1351 S17F 5 1518681 C G intergenic - - 1 1521526 A G synonymous Rv1354c E117E 6 1521892 A G nonsynonymous Rv1355c H714R 1 1522862 G A nonsynonymous Rv1355c V391I modern 1523175 G T synonymous Rv1355c A286A modern 1523791 C T nonsynonymous Rv1355c P81L 1 1525160 G A intergenic - - 6 1529346 C T nonsynonymous Rv1358 A912V 1 1534548 C T synonymous Rv1362c I21I 3 1534551 G A synonymous Rv1362c E20E 1 1535643 C T intergenic - - 6 1536183 C T nonsynonymous Rv1364c R488C 6 1537926 G A nonsynonymous Rv1365c E82K 4 1540141 G A nonsynonymous Rv1367c V169I 3 1540484 G C synonymous Rv1367c L54L 4 1544255 T C synonymous Rv1371 R299R 5 1545472 C T synonymous Rv1372 N216N 5 1545720 A G nonsynonymous Rv1372 D299G 4 1546703 T C nonsynonymous Rv1373 L231P modern 1548087 A G nonsynonymous Rv1375 R86G 6 1548149 G A synonymous Rv1375 P106P 6 1549854 G C nonsynonymous Rv1376 R236P 5 1550945 A C synonymous Rv1377c G91G 5 1555432 C T synonymous Rv1381 T415T 5 1556030 C T synonymous Rv1383 G20G 2 1556787 T C nonsynonymous Rv1383 S273P modern 1559562 A C nonsynonymous Rv1384 E821A 6 1560088 G T synonymous Rv1384 L996L 1 1560912 G A synonymous Rv1385 E156E 1 1563686 G T intergenic - - 1 1568178 G A synonymous Rv1393c P470P 5 1568891 C G nonsynonymous Rv1393c L233V modern
1574206 G A nonsynonymous Rv1397c G103D 1 1575793 T C nonsynonymous Rv1399c V6A 2 1577241 G A synonymous Rv1401 P104P 5 1578212 A C synonymous Rv1402 A200A


1 1580181 C T nonsynonymous Rv1403c P81L 1 1581377 G A nonsynonymous Rv1405c G198D 6 1581727 C T synonymous Rv1405c P81P modern 1584379 C A nonsynonymous Rv1407 L427I 5 1585032 C T synonymous Rv1408 V178V modern 1585283 C A nonsynonymous Rv1409 N30K modern 1585404 A G nonsynonymous Rv1409 T71A 6 1585900 C T nonsynonymous Rv1409 P236L 1 1589739 C T synonymous Rv1413 G118G 1 1590555 C T synonymous Rv1415 T53T 5 1591039 G A nonsynonymous Rv1415 A215T 5 1593536 T C nonsynonymous Rv1419 L11P 6 1593652 G A nonsynonymous Rv1419 A50T 5 1593762 C A synonymous Rv1419 P86P 5 1593762 C T synonymous Rv1419 P86P 6 1598404 C A nonsynonymous Rv1423 L167I 3 1599557 G C nonsynonymous Rv1424c R33T modern 1600685 T C nonsynonymous Rv1425 L343P 1 1601528 G A nonsynonymous Rv1426c C265Y 5 1601722 G A synonymous Rv1426c P200P 5 1603637 A G nonsynonymous Rv1427c N98D 1 1604290 T A nonsynonymous Rv1428c V157D 5 1605016 C T synonymous Rv1429 L47L 5 1605569 T C nonsynonymous Rv1429 V231A 4 1608276 C A nonsynonymous Rv1431 T65N 5 1611024 C T synonymous Rv1432 T392T modern 1611283 C T intergenic - - 3 1614143 G A synonymous Rv1436 L279L 1 1616831 A G intergenic - - 1 1617833 C T intergenic - - 6 1620178 A C nonsynonymous Rv1442 M130L 6 1620179 T C nonsynonymous Rv1442 M130T 1 1625259 T G nonsynonymous Rv1446c I36S 6 1629058 C T nonsynonymous Rv1449c S381L modern 1639418 G A synonymous Rv1453 L346L modern 1639643 C T synonymous Rv1453 R421R 5 1639699 T G nonsynonymous Rv1454c V321G modern 1640442 A G synonymous Rv1454c G73G 2 1643864 A G nonsynonymous Rv1458c T133A 1 1644250 C G nonsynonymous Rv1458c A4G 6 1646982 C T nonsynonymous Rv1460 A266V 1 1647830 G A nonsynonymous Rv1461 G281D 5 1648089 G T synonymous Rv1461 L367L 5 1648224 C T synonymous Rv1461 H412H 5 1649265 G A synonymous Rv1461 L759L 3 1650406 C A nonsynonymous Rv1462 T294N 2 1651308 A G nonsynonymous Rv1463 E198G 5 1658030 C T synonymous Rv1469 I356I 1 1658535 G T nonsynonymous Rv1469 V525F 6 1659318 C T synonymous Rv1470 G113G 1 
1659902 T C nonsynonymous Rv1472 V47A modern 1659994 T C nonsynonymous Rv1472 S78P 6 1662719 C G synonymous Rv1474c A162A


6 1666671 G A synonymous Rv1476 V156V 6 1666796 C A intergenic - - 5 1669296 G A nonsynonymous Rv1479 G5D 5 1672707 C T nonsynonymous Rv1482c T198M 5 1673338 G A intergenic - - 6 1674434 T C nonsynonymous Rv1484 V78A 6 1676742 C A nonsynonymous Rv1486c H48N 3 1676880 T C nonsynonymous Rv1486c W2R 5 1678639 C G nonsynonymous Rv1489 P30A 2 1678706 A C nonsynonymous Rv1489 K52T 6 1679008 A G nonsynonymous Rv1489A K23E 3 1681147 G A intergenic - - 5 1686414 C T synonymous Rv1494 D48D 6 1686538 C T nonsynonymous Rv1494 R90W 5 1686737 C T synonymous Rv1495 P56P 4 1688300 C T synonymous Rv1497 F120F 6 1690345 T C nonsynonymous Rv1498A M1T 6 1690970 G T nonsynonymous Rv1500 D41Y 1 1691520 T G nonsynonymous Rv1500 L224R 6 1692061 C G nonsynonymous Rv1501 R58G 6 1694199 G A nonsynonymous Rv1503c E116K modern 1694547 G T stopgain Rv1504c E200X modern 1695674 T C synonymous Rv1505c Y91Y 2 1695796 G A nonsynonymous Rv1505c A51T 6 1696132 G A synonymous Rv1506c A104A 5 1696941 C A nonsynonymous Rv1507c A161E 5 1696942 G A nonsynonymous Rv1507c A161T 6 1699448 G T nonsynonymous Rv1508c W149C 4 1699849 T C nonsynonymous Rv1508c S16P 6 1701212 A C intergenic - - 6 1702423 G A nonsynonymous Rv1510 A377T 6 1702803 G A intergenic - - 5 1703786 A C nonsynonymous Rv1511 Q238P modern 1706746 T G nonsynonymous Rv1515c F261V 3 1708792 T C intergenic - - 5 1710085 G C nonsynonymous Rv1518 V148L 3 1710767 T A nonsynonymous Rv1519 L12H 5 1711221 C T nonsynonymous Rv1520 T65I 5 1711619 G A nonsynonymous Rv1520 A198T 6 1711637 G T nonsynonymous Rv1520 A204S 2 1711670 C T nonsynonymous Rv1520 R215C 6 1712522 G T nonsynonymous Rv1521 S74I 3 1713923 A G nonsynonymous Rv1521 H541R 6 1714678 A G nonsynonymous Rv1522c T979A 4 1716472 C T nonsynonymous Rv1522c P381S 5 1717141 G C nonsynonymous Rv1522c A158P 1 1718444 T C synonymous Rv1523 R264R 1 1721213 T G nonsynonymous Rv1526c I283S 1 1721490 G C nonsynonymous Rv1526c A191P 5 1721987 G A nonsynonymous Rv1526c R25Q 2 1722228 T G nonsynonymous 
Rv1527c L2061R 6 1725768 C T nonsynonymous Rv1527c T881I modern 1726816 A C nonsynonymous Rv1527c T532P


6 1728324 G T nonsynonymous Rv1527c G29V 1 1728615 A C intergenic - - 5 1731312 G A intergenic - - 1 1731563 A T nonsynonymous Rv1530 H64L 5 1733563 A G intergenic - - 5 1733618 C T synonymous Rv1533 T3T 1 1735903 A G intergenic - - 5 1735926 T G intergenic - - 1 1739390 A G nonsynonymous Rv1536 T958A 1 1745446 T A nonsynonymous Rv1543 L128H 5 1746404 T C nonsynonymous Rv1544 L104S 3 1748439 C T nonsynonymous Rv1547 T249I 1 1754299 A C nonsynonymous Rv1550 E195A 5 1754786 G A synonymous Rv1550 G357G 1 1754983 C T nonsynonymous Rv1550 A423V 5 1757727 G A nonsynonymous Rv1552 G16D 1 1758790 C T synonymous Rv1552 P370P modern 1760923 C T synonymous Rv1555 T124T modern 1761789 C G nonsynonymous Rv1557 L16V 3 1763482 G A nonsynonymous Rv1559 G19R 3 1764225 C T synonymous Rv1559 A266A 6 1774116 A C nonsynonymous Rv1566c T169P 6 1775066 G A nonsynonymous Rv1567c V27I 6 1775167 G A intergenic - - 5 1776614 C T nonsynonymous Rv1568 A408V 6 1776663 G A synonymous Rv1568 S424S 6 1778197 G A synonymous Rv1570 L113L 3 1789933 G A intergenic - - 1 1791823 T C nonsynonymous Rv1591 L85S modern 1797577 G A nonsynonymous Rv1596 D64N 5 1799921 C A synonymous Rv1599 G113G 3 1802047 C G nonsynonymous Rv1601 Q5E 6 1805289 C A nonsynonymous Rv1605 P146Q 6 1805670 G A synonymous Rv1606 K6K 5 1810169 T G intergenic - - 6 1810253 C T nonsynonymous Rv1611 T5I 5 1811243 A C nonsynonymous Rv1612 E39D 6 1811964 C A synonymous Rv1612 R280R 3 1812448 C T synonymous Rv1613 Y30Y 6 1816587 C G synonymous Rv1617 V133V 3 1818286 C T synonymous Rv1618 G224G 6 1824107 G C nonsynonymous Rv1622c M98I modern 1826054 G A nonsynonymous Rv1624c A178T 1 1826343 G A synonymous Rv1624c R81R 5 1826577 C A nonsynonymous Rv1624c H3Q 1 1826624 C T nonsynonymous Rv1625c A441V 3 1827468 G T nonsynonymous Rv1625c V160L 1 1827553 G C nonsynonymous Rv1625c L131F 6 1827946 G A intergenic - - 6 1828389 C T synonymous Rv1626 R70R 6 1828506 G C nonsynonymous Rv1626 M109I 1 1829576 G A synonymous Rv1627c E166E 6 1829836 C T 
nonsynonymous Rv1627c P80S


2 1831220 A C nonsynonymous Rv1629 T186P 2 1831226 A G nonsynonymous Rv1629 R188G 2 1831288 C T synonymous Rv1629 P208P 3 1832509 C G synonymous Rv1629 T615T 1 1832642 G T nonsynonymous Rv1629 G660C 1 1832643 G C nonsynonymous Rv1629 G660A 3 1833025 C T synonymous Rv1629 D787D 2 1834177 A C synonymous Rv1630 R212R 3 1836417 G T synonymous Rv1632c G138G 6 1837168 G T nonsynonymous Rv1633 A32S modern 1839260 G T synonymous Rv1634 L31L 1 1839329 G A synonymous Rv1634 R54R 4 1839759 C G nonsynonymous Rv1634 R198G 6 1840543 C T nonsynonymous Rv1634 S459F 6 1841538 G A synonymous Rv1635c A235A 5 1846092 C T synonymous Rv1638 Y784Y 6 1846552 G T nonsynonymous Rv1638 A938S 5 1846854 C T synonymous Rv1638A Y40Y 1 1848147 G T synonymous Rv1639c V104V 5 1848963 G T nonsynonymous Rv1640c D1025Y 1 1849191 A G nonsynonymous Rv1640c I949V 4 1849609 G A synonymous Rv1640c R809R 5 1849814 T C nonsynonymous Rv1640c V741A 3 1852877 A C stoplost Rv1641 X202S 5 1853565 G A nonsynonymous Rv1643 A128T 1 1853974 C T synonymous Rv1644 G123G 6 1855260 G A synonymous Rv1645c S65S 4 1859559 A C nonsynonymous Rv1649 D276A 1 1859989 C T nonsynonymous Rv1650 R78W 5 1862099 C A nonsynonymous Rv1650 T781N 6 1867707 C T synonymous Rv1653 P359P 6 1870031 A C nonsynonymous Rv1656 Q37P 5 1870194 G C nonsynonymous Rv1656 L91F modern 1872959 G A synonymous Rv1659 R107R 3 1873700 G A synonymous Rv1659 L354L 1 1873954 G T nonsynonymous Rv1659 R439L 5 1874985 A C nonsynonymous Rv1660 T276P 6 1875585 C T synonymous Rv1661 T94T 1 1876739 T G nonsynonymous Rv1661 V479G 2 1877744 A C nonsynonymous Rv1661 E814A 6 1882180 C T synonymous Rv1662 A159A 6 1885481 C G nonsynonymous Rv1662 R1260G 6 1886077 G A synonymous Rv1662 A1458A 1 1886263 C G nonsynonymous Rv1662 H1520Q 1 1887284 G A nonsynonymous Rv1663 R258Q modern 1889073 C G nonsynonymous Rv1664 P350A 5 1890948 T G nonsynonymous Rv1664 L975V 2 1897608 G A synonymous Rv1672c L200L 1 1897646 C G nonsynonymous Rv1672c P188A 6 1900021 C T intergenic - - modern 
1900800 C T nonsynonymous Rv1675c A59V 1 1902156 G A nonsynonymous Rv1677 G137R 6 1903173 C T synonymous Rv1678 I259I

modern 1906336 G A intergenic - - 1 1907177 C T nonsynonymous Rv1682 R259W 6 1907794 G A synonymous Rv1683 V67V 3 1908598 G A synonymous Rv1683 R335R 5 1909456 C T synonymous Rv1683 A621A 5 1911301 G A synonymous Rv1685c A33A 6 1912024 G T nonsynonymous Rv1686c V20F 6 1912582 C T synonymous Rv1687c A113A 6 1912617 G A nonsynonymous Rv1687c D102N 6 1914570 A C synonymous Rv1689 A323A modern 1920120 T G synonymous Rv1696 P146P 5 1923633 C T nonsynonymous Rv1698 H297Y 1 1923985 G A nonsynonymous Rv1699 V53I modern 1924959 C T synonymous Rv1699 G377G 3 1925136 G A synonymous Rv1699 V436V 3 1926029 T C nonsynonymous Rv1700 Y150H modern 1931470 A G intergenic - - modern 1935695 G T nonsynonymous Rv1707 A272S modern 1936525 A G nonsynonymous Rv1708 T56A 6 1937727 G T nonsynonymous Rv1709 A139S 5 1939007 A C nonsynonymous Rv1711 Q57P 1 1940307 G T nonsynonymous Rv1713 W7L 5 1942121 C T nonsynonymous Rv1714 A90V 5 1946975 C G intergenic - - 3 1946999 T G intergenic - - 5 1951438 C T nonsynonymous Rv1725c P105L 6 1951764 G A intergenic - - modern 1952160 C T synonymous Rv1726 T103T 3 1952743 G A nonsynonymous Rv1726 V298M 5 1955339 A C synonymous Rv1729c A77A 6 1955686 G A intergenic - - 2 1955941 C G nonsynonymous Rv1730c D435E 1 1956930 A C nonsynonymous Rv1730c T106P 6 1958423 C T synonymous Rv1731 I249I 5 1961730 C T nonsynonymous Rv1735c A20V 1 1963383 C T synonymous Rv1736c Y268Y modern 1963957 G A nonsynonymous Rv1736c G77D 5 1964984 G A synonymous Rv1737c R129R 6 1965434 A G intergenic - - 2 1967543 G T nonsynonymous Rv1739c G32V 6 1968116 A C nonsynonymous Rv1741 K67T 5 1968284 C T nonsynonymous Rv1742 R38W 6 1970407 G T synonymous Rv1743 P468P 6 1970432 G A nonsynonymous Rv1743 A477T 4 1971725 G C synonymous Rv1745c R89R 5 1971965 A C synonymous Rv1745c P9P 3 1972901 C T nonsynonymous Rv1746 A255V 6 1975960 G T synonymous Rv1747 S777S 3 1977646 C T synonymous Rv1749c A80A 1 1978807 C A nonsynonymous Rv1750c A254D 1 1979026 G A
nonsynonymous Rv1750c S181N 2 1980652 G T synonymous Rv1751 P344P 5 1989054 A T nonsynonymous Rv1758 I5F

5 1989057 G A nonsynonymous Rv1758 G6R 1 1989370 C T nonsynonymous Rv1758 P110L 5 1989553 G C nonsynonymous Rv1758 G171A 6 1992683 C T intergenic - - 3 1993561 T C nonsynonymous Rv1760 C137R 6 1993683 C T synonymous Rv1760 V177V 5 1997460 G A synonymous Rv1765c R352R 3 2003252 C T synonymous Rv1769 A209A 5 2005152 G A synonymous Rv1770 E425E 5 2005607 T G nonsynonymous Rv1771 S149R 6 2005758 G C nonsynonymous Rv1771 E200Q 6 2006954 C T intergenic - - 5 2007015 C G intergenic - - 5 2008140 A C synonymous Rv1774 P103P 4 2010614 A G intergenic - - 3 2010880 T G synonymous Rv1777 G75G 3 2011568 G C nonsynonymous Rv1777 E305Q 1 2013943 C G nonsynonymous Rv1779c D179E modern 2017291 G A synonymous Rv1781c R62R 1 2017560 A T intergenic - - 5 2017860 C T nonsynonymous Rv1782 R41C 6 2017861 G A nonsynonymous Rv1782 R41H 5 2018883 G C nonsynonymous Rv1782 G382R modern 2019236 G T synonymous Rv1782 P499P 1 2023211 G T nonsynonymous Rv1784 V860F 6 2025032 C T intergenic - - 6 2032898 C T nonsynonymous Rv1795 P220L 6 2033021 G A nonsynonymous Rv1795 G261D 5 2033307 T C synonymous Rv1795 A356A 3 2034676 C T synonymous Rv1796 I316I 1 2035937 G A nonsynonymous Rv1797 R152H 5 2047454 G A intergenic - - 6 2049907 G A intergenic - - 6 2053439 C T intergenic - - 6 2053762 C T nonsynonymous Rv1811 A107V 3 2056184 C T intergenic - - 5 2060377 C T synonymous Rv1817 Y261Y 6 2060557 C T synonymous Rv1817 F321F 5 2060606 C T nonsynonymous Rv1817 H338Y modern 2062922 G A nonsynonymous Rv1819c V603I 6 2063121 A T synonymous Rv1819c P536P 1 2066471 T C synonymous Rv1821 G5G 5 2069546 T C nonsynonymous Rv1822 V156A 6 2071192 C T nonsynonymous Rv1825 R53W 6 2071410 C T synonymous Rv1825 D125D modern 2072190 C A nonsynonymous Rv1826 A80E 2 2072313 C A nonsynonymous Rv1826 T121K 5 2080594 C T synonymous Rv1834 R255R 1 2083124 A G nonsynonymous Rv1836c S505G 5 2087453 T C nonsynonymous Rv1838c I67T 3 2087652 G A nonsynonymous Rv1838c V1M 1 2090306 C T nonsynonymous Rv1841c P138L 5 2090366 T C 
nonsynonymous Rv1841c M118T

6 2090776 T C nonsynonymous Rv1842c L437S 1 2092391 C A synonymous Rv1843c I436I 1 2092970 G A synonymous Rv1843c L243L modern 2093715 C T intergenic - - 5 2095234 G T nonsynonymous Rv1845c R312L 1 2096094 G A synonymous Rv1845c T25T 1 2096430 T G nonsynonymous Rv1846c L57R modern 2097144 C G nonsynonymous Rv1847 L90V 5 2099402 T G nonsynonymous Rv1850 V481G 5 2099631 C T synonymous Rv1850 G557G 5 2101921 G A synonymous Rv1854c S374S 5 2103112 C A intergenic - - 1 2104779 G A synonymous Rv1856c G15G 5 2107050 G A synonymous Rv1859 V159V 6 2107511 C T nonsynonymous Rv1859 T313M 6 2108374 C T synonymous Rv1860 D213D 4 2108890 C A intergenic - - 5 2108980 A T intergenic - - modern 2110365 C A synonymous Rv1862 T274T 6 2115064 C T nonsynonymous Rv1866 A642V 6 2115776 C T nonsynonymous Rv1867 P5S modern 2120796 T A stopgain Rv1870c L212X 6 2122380 C G nonsynonymous Rv1872c L258V 5 2122443 G T nonsynonymous Rv1872c A237S modern 2122625 C T nonsynonymous Rv1872c A176V 4 2122976 C G nonsynonymous Rv1872c A59G 5 2123146 G A synonymous Rv1872c A2A 6 2124926 T C intergenic - - 5 2125054 C T intergenic - - 6 2125863 C T intergenic - - 6 2127646 C T synonymous Rv1877 T581T modern 2128372 G A synonymous Rv1878 T117T modern 2129281 A C synonymous Rv1878 A420A modern 2130529 G A intergenic - - 5 2130784 G A synonymous Rv1880c P358P 1 2132062 G C nonsynonymous Rv1881c G90R 5 2132077 A G nonsynonymous Rv1881c I85V 5 2136642 C T nonsynonymous Rv1887 L129F 1 2138767 C G nonsynonymous Rv1889c A84G 5 2140748 G C synonymous Rv1894c A374A 3 2142250 C T intergenic - - 6 2143839 A G nonsynonymous Rv1896c E203G modern 2145878 C T nonsynonymous Rv1899c P123L 6 2150754 A G nonsynonymous Rv1903 T131A modern 2151678 A C nonsynonymous Rv1905c T240P 3 2151780 C A nonsynonymous Rv1905c Q206K 3 2153184 T G intergenic - - 4 2154724 T G nonsynonymous Rv1908c L463R 6 2155503 C T synonymous Rv1908c T203T modern 2156868 A C synonymous Rv1910c G144G 4 2158109 G A nonsynonymous Rv1912c G328D 5 2158190 T G 
nonsynonymous Rv1912c I301S 3 2158905 G T stopgain Rv1912c G63X

6 2159337 G T synonymous Rv1913 A49A 6 2167564 G T intergenic - - 6 2172012 G C nonsynonymous Rv1920 M130I 2 2172380 A C nonsynonymous Rv1920 E253A 3 2172526 T G stoplost Rv1921c X424G 5 2173728 C A nonsynonymous Rv1921c A23D 5 2176006 C A synonymous Rv1923 G278G 5 2176648 T G nonsynonymous Rv1924c W95G modern 2177073 C T intergenic - - 1 2177968 G T synonymous Rv1925 A294A 5 2178941 G A nonsynonymous Rv1925 V619M 6 2181541 C A synonymous Rv1929c P122P 1 2185358 C T synonymous Rv1934c T277T 2 2185674 T C nonsynonymous Rv1934c I172T 3 2186127 C T nonsynonymous Rv1934c A21V 3 2186236 G A synonymous Rv1935c P308P 5 2186371 C T synonymous Rv1935c A263A 5 2186421 A C nonsynonymous Rv1935c T247P 5 2195637 T G nonsynonymous Rv1944c Y100D 6 2195922 A G nonsynonymous Rv1944c T5A 6 2195923 T G nonsynonymous Rv1944c D4E 4 2199052 C G nonsynonymous Rv1948c R5G 6 2199061 A C nonsynonymous Rv1948c T2P modern 2199416 C T nonsynonymous Rv1949c L207F 5 2206970 G A intergenic - - 5 2208538 G A stopgain Rv1965 W11X 4 2209465 A G nonsynonymous Rv1966 T47A 5 2210198 A G nonsynonymous Rv1966 Y291C 3 2216345 C A synonymous Rv1971 P363P 1 2216370 G C nonsynonymous Rv1971 G372R modern 2218012 G A synonymous Rv1974 A118A 1 2218488 G C nonsynonymous Rv1975 S146T 3 2220947 A G nonsynonymous Rv1978 M14V 6 2221313 G A nonsynonymous Rv1978 A136T modern 2222308 G A nonsynonymous Rv1979c G286D 5 2223902 T A nonsynonymous Rv1980c I43N 6 2225175 T G nonsynonymous Rv1981c L5R 4 2229801 C G synonymous Rv1985c P34P modern 2231486 G A intergenic - - 1 2237497 C G nonsynonymous Rv1993c L27V 1 2238930 A C intergenic - - 1 2239055 C T nonsynonymous Rv1996 P18S 1 2240062 C G intergenic - - 1 2241646 C T synonymous Rv1997 A496A modern 2241742 G A synonymous Rv1997 P528P 1 2242808 G A nonsynonymous Rv1997 A884T modern 2243034 C A nonsynonymous Rv1998c R230S 6 2244343 C A synonymous Rv1999c R266R 3 2245916 T G synonymous Rv2000 A236A 5 2246459 C A synonymous Rv2000 A417A modern 2246960 C G synonymous Rv2001 
V43V 1 2249035 T C nonsynonymous Rv2003c M129T 1 2255942 G T nonsynonymous Rv2006 R1314L

4 2260100 T C intergenic - - 6 2261693 G C intergenic - - 6 2265993 A C nonsynonymous Rv2019 Q2P 6 2266051 C T synonymous Rv2019 G21G 2 2267015 A G synonymous Rv2021c A32A 6 2267976 C A synonymous Rv2023c R45R 2 2268627 C G nonsynonymous Rv2023A Q34E 5 2268887 C G nonsynonymous Rv2024c L452V 6 2269376 C T nonsynonymous Rv2024c R289C 6 2274463 T G nonsynonymous Rv2027c L16V 3 2275764 C T nonsynonymous Rv2029c L221F 6 2275771 A G synonymous Rv2029c A218A 6 2276918 T C synonymous Rv2030c H523H 5 2278426 C A nonsynonymous Rv2030c R21S 6 2281289 G T intergenic - - modern 2282376 T C nonsynonymous Rv2036 V93A modern 2282377 T C synonymous Rv2036 V93V 5 2283293 C G synonymous Rv2037c T143T 3 2284456 A C nonsynonymous Rv2038c E114A 3 2285558 C T synonymous Rv2039c C28C 2 2288085 C G synonymous Rv2042c A199A 5 2290062 G A nonsynonymous Rv2045c A387T 5 2291331 C A synonymous Rv2046 G21G 5 2291331 C T synonymous Rv2046 G21G 2 2294007 A G nonsynonymous Rv2047c T174A 3 2296876 G C nonsynonymous Rv2048c G3371R 1 2297766 C T nonsynonymous Rv2048c A3074V modern 2301089 G A synonymous Rv2048c L1966L 5 2304017 G A synonymous Rv2048c L990L 3 2306472 A T nonsynonymous Rv2048c Y172F 3 2306472 A G nonsynonymous Rv2048c Y172C 3 2309203 C T nonsynonymous Rv2051c A518V 1 2309356 C T nonsynonymous Rv2051c T467I 5 2313815 G A nonsynonymous Rv2054 A231T 1 2321358 G C nonsynonymous Rv2063A R101P 5 2323291 C T synonymous Rv2066 A39A 6 2323880 T C synonymous Rv2066 L236L 3 2325320 A G nonsynonymous Rv2067c D184G 1 2327904 A C nonsynonymous Rv2070c I108L 5 2328420 G A synonymous Rv2071c T186T 4 2328543 G A nonsynonymous Rv2071c M145I 6 2328627 G A synonymous Rv2071c A117A 5 2328820 A G nonsynonymous Rv2071c D53G 6 2329466 G A synonymous Rv2072c P227P 1 2331255 C T nonsynonymous Rv2074 A88V 4 2331620 G T synonymous Rv2075c G420G 4 2331789 A C nonsynonymous Rv2075c Q364P 3 2333215 G A nonsynonymous Rv2076c C25Y 3 2335080 T C synonymous Rv2078 L8L 1 2335500 C T nonsynonymous Rv2079 A49V 5 2335650 G 
A nonsynonymous Rv2079 G99D 6 2336985 G C nonsynonymous Rv2079 G544A 2 2337179 C T stopgain Rv2079 Q609X

6 2337373 T C nonsynonymous Rv2080 V23A 5 2338773 G A nonsynonymous Rv2082 R22Q 2 2338810 T C synonymous Rv2082 R34R 2 2338811 A G nonsynonymous Rv2082 K35E 5 2338961 G A nonsynonymous Rv2082 V85I 3 2339240 G A nonsynonymous Rv2082 G178S 4 2339255 G A nonsynonymous Rv2082 A183T 5 2339605 A G synonymous Rv2082 P299P modern 2341030 A G nonsynonymous Rv2083 T54A 3 2345085 A C synonymous Rv2088 A225A 3 2346929 G A synonymous Rv2089c L132L 6 2347616 C A nonsynonymous Rv2090 L82I 1 2348482 G T nonsynonymous Rv2090 E370D 3 2348708 C T synonymous Rv2091c V195V 6 2349116 G A synonymous Rv2091c P59P 5 2349418 T C synonymous Rv2092c A879A modern 2350186 C T synonymous Rv2092c R623R 5 2350534 A G synonymous Rv2092c E507E 6 2351522 C T nonsynonymous Rv2092c T178M 3 2353385 C G nonsynonymous Rv2095c Q311E 5 2361174 C T nonsynonymous Rv2101 A312V 4 2369186 G C nonsynonymous Rv2109c G182R 6 2370902 G A intergenic - - 6 2372951 G C nonsynonymous Rv2113 G108R 1 2374442 G A synonymous Rv2114 S203S modern 2376425 G A intergenic - - modern 2379743 G C intergenic - - 6 2379997 T C nonsynonymous Rv2121c S222P 5 2382645 C T synonymous Rv2124c G1141G 5 2385408 G T nonsynonymous Rv2124c E220D 5 2388205 A G intergenic - - 4 2388641 A G nonsynonymous Rv2127 D9G 6 2389698 C A nonsynonymous Rv2127 N361K modern 2390299 G A intergenic - - 5 2393590 C G stopgain Rv2132 Y60X 1 2397760 C G nonsynonymous Rv2138 A144G 2 2399734 G A nonsynonymous Rv2139 G339S 1 2400031 C T nonsynonymous Rv2140c R100C 3 2402765 G A intergenic - - 5 2408524 C T nonsynonymous Rv2150c S334L 4 2413246 T G nonsynonymous Rv2153c L36V 5 2414989 G C nonsynonymous Rv2155c G469A 6 2415351 G T synonymous Rv2155c V348V 5 2419044 T C nonsynonymous Rv2158c V522A modern 2419142 C T synonymous Rv2158c V489V A63V- 4 2421816 C T nonsynonymous Rv2160c Rv2160A 5 2422502 G A synonymous Rv2161c L212L modern 2424864 G A intergenic - - 1 2425097 T G nonsynonymous Rv2163c V664G 4 2425471 G A synonymous Rv2163c R539R 6 2427828 C G synonymous 
Rv2164c G137G 6 2428953 G A nonsynonymous Rv2165c G106D

5 2432185 A G intergenic - - 3 2434749 C T intergenic - - 6 2435582 G A nonsynonymous Rv2173 A246T modern 2437259 T G nonsynonymous Rv2174 S451A 6 2437837 C T nonsynonymous Rv2175c P17L 6 2438094 T G nonsynonymous Rv2176 S52A 3 2440935 C G synonymous Rv2178c S262S 6 2443508 G T nonsynonymous Rv2181 L69F 5 2445414 C A intergenic - - 6 2447150 C A synonymous Rv2185c P117P 6 2447426 C A synonymous Rv2185c I25I modern 2447539 G A intergenic - - 1 2448288 G A stopgain Rv2187 W43X 6 2448402 C A stopgain Rv2187 Y81X 4 2448458 T C nonsynonymous Rv2187 I100T modern 2449295 A G nonsynonymous Rv2187 E379G 3 2449826 C G nonsynonymous Rv2187 S556W 1 2450045 C T nonsynonymous Rv2188c T369M 1 2451081 G C nonsynonymous Rv2188c E24Q 3 2452452 C T nonsynonymous Rv2190c T274M 5 2452657 G T nonsynonymous Rv2190c V206F 6 2453933 C A nonsynonymous Rv2191 R39S 6 2458234 A C nonsynonymous Rv2194 K228Q 6 2461545 G T nonsynonymous Rv2197c A202S 5 2463455 G A synonymous Rv2199c R66R 5 2465721 T C nonsynonymous Rv2201 V242A 1 2470485 G T intergenic - - 4 2470591 C A intergenic - - 3 2472029 T G nonsynonymous Rv2207 W207G modern 2472956 C T nonsynonymous Rv2208 S155L 6 2474271 G A nonsynonymous Rv2209 G291E 5 2477562 C T synonymous Rv2212 L125L 3 2477984 G A synonymous Rv2212 S265S 6 2478180 C A nonsynonymous Rv2212 L331I 3 2478619 G A synonymous Rv2213 L94L 6 2478967 C G nonsynonymous Rv2213 F210L 3 2480809 C G nonsynonymous Rv2214c A298G 5 2485956 G A synonymous Rv2218 E228E 5 2488898 C A nonsynonymous Rv2220 D428E 5 2489855 G A synonymous Rv2221c T833T 5 2490116 A C synonymous Rv2221c A746A 5 2493513 A G nonsynonymous Rv2222c Q77R 3 2494430 G C nonsynonymous Rv2223c G324R modern 2495500 G A synonymous Rv2224c L508L 5 2498200 G T nonsynonymous Rv2225 M153I 5 2500610 G A intergenic - - 6 2500697 C T intergenic - - 1 2500892 C T intergenic - - modern 2501148 C T nonsynonymous Rv2227 A73V 6 2501401 G T synonymous Rv2227 P157P 6 2501668 G A synonymous Rv2228c L357L 5 2503257 C T synonymous 
Rv2229c I72I 5 2503491 A C nonsynonymous Rv2230c E373A

1 2503549 G T nonsynonymous Rv2230c A354S 3 2504177 G A synonymous Rv2230c E144E 1 2508395 G A synonymous Rv2235 P253P 5 2508857 C T synonymous Rv2236c A173A 1 2509181 G C synonymous Rv2236c V65V 6 2509362 C T nonsynonymous Rv2236c T5I 2 2510350 C G intergenic - - modern 2511712 A C nonsynonymous Rv2240c K259T 5 2512359 C T synonymous Rv2240c A43A 6 2514867 G A nonsynonymous Rv2241 A777T 3 2516271 T C nonsynonymous Rv2242 M323T 5 2516804 A C synonymous Rv2243 A6A modern 2518132 T C synonymous Rv2245 T6T 6 2520466 G A synonymous Rv2246 A357A 2 2521428 A G nonsynonymous Rv2247 D229G 6 2522284 G A intergenic - - 5 2522650 T C synonymous Rv2248 R97R 6 2522878 C A synonymous Rv2248 L173L 5 2525534 C T nonsynonymous Rv2250A R45C 6 2526709 T C nonsynonymous Rv2251 M382T 1 2528931 C T synonymous Rv2254c N15N 5 2530101 A T nonsynonymous Rv2257c D241V 1 2530434 C A nonsynonymous Rv2257c P130H 5 2531033 C G nonsynonymous Rv2258c P289A 5 2531035 T C nonsynonymous Rv2258c V288A 5 2532788 A G nonsynonymous Rv2259 T182A 3 2536312 C T synonymous Rv2263 R224R 1 2538793 G A nonsynonymous Rv2265 G32S 3 2540554 T C nonsynonymous Rv2266 S151P 3 2541477 C G intergenic - - 3 2542543 C T nonsynonymous Rv2267c R90C 2 2543395 A G synonymous Rv2268c E294E 6 2544466 G A synonymous Rv2269c R52R 1 2544979 T A nonsynonymous Rv2270 H94Q 1 2547274 A G nonsynonymous Rv2275 D131G 2 2548700 C T nonsynonymous Rv2276 H318Y 5 2549057 C A intergenic - - 6 2550019 G T nonsynonymous Rv2277c R4L 1 2553682 C T synonymous Rv2281 Y170Y 5 2561261 T C nonsynonymous Rv2287 V520A modern 2562783 C T nonsynonymous Rv2290 A62V 5 2562933 T C nonsynonymous Rv2290 I112T modern 2563958 C A nonsynonymous Rv2291 A262E 6 2566596 G C intergenic - - 1 2569593 A G nonsynonymous Rv2298 H171R 6 2571678 C T stopgain Rv2299c Q109X 5 2572854 A T nonsynonymous Rv2300c K52M 1 2573434 G T nonsynonymous Rv2301 W140C 1 2574598 G A nonsynonymous Rv2303c S141N 5 2574950 A G nonsynonymous Rv2303c I24V 6 2576251 G A nonsynonymous Rv2305 
G148D 5 2576863 G C nonsynonymous Rv2305 S352T modern 2577246 G A nonsynonymous Rv2306A V47I

6 2577994 C T nonsynonymous Rv2307c R235W modern 2581109 A C synonymous Rv2308 R231R 1 2582324 G A intergenic - - 2 2586076 G C synonymous Rv2314c R405R 6 2590122 G A synonymous Rv2317 V142V 3 2591172 G A nonsynonymous Rv2318 A219T 6 2592510 C T nonsynonymous Rv2319c R73C 6 2593621 C A nonsynonymous Rv2320c A178E 6 2596056 G A nonsynonymous Rv2323c V72I 5 2598899 C A nonsynonymous Rv2326c T350N 1 2602575 C T nonsynonymous Rv2329c P296L 5 2603523 G C intergenic - - modern 2605293 T G nonsynonymous Rv2332 D62E modern 2608488 C T intergenic - - 5 2609302 G C nonsynonymous Rv2334 K169N 1 2611704 A C synonymous Rv2336 R290R 6 2614882 G A nonsynonymous Rv2339 A64T 3 2615413 A G nonsynonymous Rv2339 T241A 5 2615969 C A nonsynonymous Rv2339 A426E 5 2616527 G T nonsynonymous Rv2339 R612L 5 2617442 C A stopgain Rv2339 S917X 4 2619271 C T intergenic - - 1 2622508 G A synonymous Rv2344c L415L 5 2622927 C G nonsynonymous Rv2344c P276A 6 2623603 A G synonymous Rv2344c A50A 6 2623917 T C nonsynonymous Rv2345 S33P 6 2624945 C A nonsynonymous Rv2345 D375E 3 2624986 T G nonsynonymous Rv2345 V389G 4 2625924 G A synonymous Rv2346c A83A 5 2626018 A G nonsynonymous Rv2346c E52G 3 2626095 G C synonymous Rv2346c A26A modern 2626108 C G nonsynonymous Rv2346c A22G modern 2626189 A C intergenic - - modern 2626191 T C intergenic - - 3 2626513 A T nonsynonymous Rv2347c T3S 3 2626514 A C synonymous Rv2347c A2A 3 2626600 G A intergenic - - modern 2631641 C A synonymous Rv2351c P145P 3 2632362 T C intergenic - - 6 2632373 C A intergenic - - modern 2632500 G A intergenic - - 1 2637088 C T intergenic - - 6 2641813 C T nonsynonymous Rv2359 A55V 6 2641828 T C nonsynonymous Rv2359 V60A modern 2641840 G A nonsynonymous Rv2359 R64H 1 2643653 C T synonymous Rv2362c G202G 5 2645780 G A synonymous Rv2364c L298L 3 2652254 C G nonsynonymous Rv2372c A191G 1 2652908 G C nonsynonymous Rv2373c E360D 5 2656136 T A intergenic - - 3 2656635 C T nonsynonymous Rv2378c P357S 1 2660319 C G nonsynonymous Rv2379c D589E 1 
2660319 C T synonymous Rv2379c D589D

3 2661039 C T synonymous Rv2379c I349I 1 2663210 C T synonymous Rv2380c A1302A 5 2663463 G T nonsynonymous Rv2380c R1218L 6 2672906 C T nonsynonymous Rv2383c L978F 2 2673818 G C nonsynonymous Rv2383c V674L 5 2682158 C T nonsynonymous Rv2388c L329F 5 2683729 A G nonsynonymous Rv2390c S180G 1 2688225 G C nonsynonymous Rv2394 M72I 3 2688700 C T nonsynonymous Rv2394 P231S 6 2688726 T C synonymous Rv2394 A239A 6 2689193 G A nonsynonymous Rv2394 R395Q 3 2690160 A G nonsynonymous Rv2395 N30S modern 2691713 C T nonsynonymous Rv2395 P548S 6 2692608 A G intergenic - - 1 2696977 C G synonymous Rv2400c A246A 6 2697218 G T nonsynonymous Rv2400c R166L 3 2700222 T C nonsynonymous Rv2402 L565P 5 2701940 C T nonsynonymous Rv2404c P437S 5 2702166 C A synonymous Rv2404c R361R 5 2702403 T C synonymous Rv2404c L282L 5 2702612 G T nonsynonymous Rv2404c G213C 6 2703018 C A synonymous Rv2404c G77G 5 2703964 G A intergenic - - 1 2704291 C T synonymous Rv2406c I49I modern 2704884 A T nonsynonymous Rv2407 H63L 6 2705145 C A nonsynonymous Rv2407 T150K modern 2709795 C T synonymous Rv2411c A57A 6 2710422 G A nonsynonymous Rv2413c A294T 2 2711722 A C synonymous Rv2414c P385P 3 2712328 G A synonymous Rv2414c P183P modern 2712913 T G nonsynonymous Rv2415c L291R 6 2719057 G A intergenic - - 3 2720069 G A nonsynonymous Rv2423 G158E 3 2720444 C T nonsynonymous Rv2423 S283F 4 2723506 G A synonymous Rv2426c L226L 5 2724331 A G nonsynonymous Rv2427c I383V 1 2726051 C T nonsynonymous Rv2427A L13F 3 2726105 G A intergenic - - 1 2727037 A T nonsynonymous Rv2429 M78L 6 2730360 C T synonymous Rv2434c V67V 5 2730711 C T nonsynonymous Rv2435c A680V 1 2731741 C T synonymous Rv2435c L337L 6 2733100 G T intergenic - - 2 2734482 A T nonsynonymous Rv2437 Y36F 3 2738221 C T synonymous Rv2439c I9I 6 2739242 T C nonsynonymous Rv2440c S149P 4 2740693 C T intergenic - - 2 2741209 G A synonymous Rv2443 L167L 5 2741269 C A synonymous Rv2443 G187G 5 2744225 T C synonymous Rv2444c L254L 1 2745739 G A intergenic - - 1 
2745839 C T synonymous Rv2446c A100A 5 2748366 G T nonsynonymous Rv2448c E620D

6 2751300 C T synonymous Rv2449c T91T 5 2752132 T C synonymous Rv2450c L17L 5 2753821 C G nonsynonymous Rv2454c A309G 3 2753869 T C nonsynonymous Rv2454c V293A 1 2755112 C T synonymous Rv2455c I531I 6 2757464 G T synonymous Rv2456c L243L 1 2759534 C G intergenic - - 1 2764206 T C synonymous Rv2462c D362D modern 2764939 T C nonsynonymous Rv2462c L118P 6 2770011 T G synonymous Rv2467 A342A 3 2771383 A G nonsynonymous Rv2467 S800G 6 2772741 G T nonsynonymous Rv2469c A99S 6 2772760 C G synonymous Rv2469c S92S 6 2772954 A G nonsynonymous Rv2469c S28G 1 2773955 G C nonsynonymous Rv2471 S131T 3 2782498 C T synonymous Rv2477c D515D 5 2784162 C A nonsynonymous Rv2478c D149E 6 2789237 A G nonsynonymous Rv2482c D16G 2 2789798 C A synonymous Rv2483c R409R modern 2790458 G T nonsynonymous Rv2483c G189C 4 2791098 A G nonsynonymous Rv2484c D466G 1 2791257 T A nonsynonymous Rv2484c I413N modern 2791475 C T synonymous Rv2484c A340A 3 2798595 G A synonymous Rv2488c A762A 1 2799493 T G nonsynonymous Rv2488c I463S 4 2807486 A C nonsynonymous Rv2492 D70A 5 2808296 G A nonsynonymous Rv2493 E72K 1 2809895 C T synonymous Rv2495c L15L 5 2810816 A G nonsynonymous Rv2496c E56G 5 2811013 G A nonsynonymous Rv2497c E362K 6 2813515 C T synonymous Rv2499c L72L 6 2817056 G A synonymous Rv2502c A473A 1 2817158 G T nonsynonymous Rv2502c M439I 3 2817747 C T nonsynonymous Rv2502c A243V 5 2819093 C G nonsynonymous Rv2503c A12G 6 2819183 C T nonsynonymous Rv2504c R230W 5 2820743 T C nonsynonymous Rv2505c V285A 5 2822701 A C synonymous Rv2507 P88P 6 2823743 C A nonsynonymous Rv2508c A284E 1 2824432 C T synonymous Rv2508c T54T 4 2825466 A G synonymous Rv2509 K263K 1 2828104 C T intergenic - - 6 2831046 C T nonsynonymous Rv2514c A98V 2 2833329 C T nonsynonymous Rv2516c A62V 6 2835261 G A nonsynonymous Rv2518c M25I 5 2839648 G A nonsynonymous Rv2523c V95I 4 2841022 C T nonsynonymous Rv2524c R2771C 5 2843482 C T nonsynonymous Rv2524c P1951S 5 2844125 G T synonymous Rv2524c P1736P 6 2844335 G T synonymous 
Rv2524c T1666T 4 2847281 C T synonymous Rv2524c D684D 6 2847318 G C nonsynonymous Rv2524c G672A 6 2847737 C T synonymous Rv2524c I532I

6 2848800 C T nonsynonymous Rv2524c A178V 6 2851746 A G intergenic - - 5 2852798 C T intergenic - - 5 2854669 T C synonymous Rv2530c D6D 6 2854864 C T nonsynonymous Rv2530A A15V 5 2855231 T C nonsynonymous Rv2531c F851L 6 2855422 T C nonsynonymous Rv2531c I787T 5 2855959 C T nonsynonymous Rv2531c P608L modern 2858669 A C nonsynonymous Rv2533c D19A 5 2859147 G A synonymous Rv2534c K48K 3 2867254 A G nonsynonymous Rv2544 H44R 2 2867298 C A nonsynonymous Rv2544 H59N 2 2867347 A G nonsynonymous Rv2544 Q75R 2 2867401 A C nonsynonymous Rv2544 N93T 2 2867756 T C synonymous Rv2544 I211I 5 2868769 G A nonsynonymous Rv2547 G55D 1 2869242 T C intergenic - - modern 2870386 T C intergenic - - 6 2871717 G A nonsynonymous Rv2552c R100Q 5 2874162 C T nonsynonymous Rv2555c A775V 5 2875717 A C nonsynonymous Rv2555c I257L 6 2875808 G C nonsynonymous Rv2555c K226N modern 2878980 A G nonsynonymous Rv2559c D317G 1 2881244 A G intergenic - - 3 2881337 C T intergenic - - 3 2881569 A G nonsynonymous Rv2561 E54G 5 2881938 G C nonsynonymous Rv2562 G61R 6 2886400 G A nonsynonymous Rv2566 G10S 4 2886570 G A synonymous Rv2566 E66E 6 2886640 C G nonsynonymous Rv2566 L90V 6 2887964 C T nonsynonymous Rv2566 A531V 6 2891366 G A synonymous Rv2567 A524A 6 2892917 C T synonymous Rv2568c L185L 6 2894322 C G nonsynonymous Rv2569c D29E 6 2894458 C T intergenic - - 1 2894594 G A nonsynonymous Rv2570 G28D 1 2894642 A G nonsynonymous Rv2570 E44G 3 2895473 T C nonsynonymous Rv2571c F163S 6 2896260 C T nonsynonymous Rv2572c A515V 1 2897528 C T synonymous Rv2572c A92A 3 2897660 A G synonymous Rv2572c A48A modern 2897871 G A intergenic - - 1 2899890 G T nonsynonymous Rv2575 Q184H 6 2900967 G A nonsynonymous Rv2577 G17E 3 2903050 C T nonsynonymous Rv2578c T161I 6 2904550 G A intergenic - - 6 2904864 T C nonsynonymous Rv2580c M410T 5 2910483 C A nonsynonymous Rv2584c L140I 3 2910852 G C nonsynonymous Rv2584c A17P 6 2912815 G A synonymous Rv2586c G399G 5 2919947 T C nonsynonymous Rv2590 F693L 5 2921513 C G 
intergenic - - 5 2921541 A C intergenic - -

3 2925462 T A intergenic - - 1 2925683 G A synonymous Rv2595 L64L 4 2925962 C T nonsynonymous Rv2596 R77C 5 2926445 G A nonsynonymous Rv2597 A31T 1 2926882 C T synonymous Rv2597 R176R 5 2927086 G A nonsynonymous Rv2598 R34Q modern 2927511 T G nonsynonymous Rv2599 I12S 6 2927864 G A nonsynonymous Rv2599 G130S 5 2934398 C G synonymous Rv2607 T67T 6 2939177 T C synonymous Rv2611c Y262Y 1 2940608 G A nonsynonymous Rv2612c S2N 3 2941179 C T synonymous Rv2613c R6R 6 2945042 C G intergenic - - 5 2945389 G A synonymous Rv2616 A20A 2 2948230 C T nonsynonymous Rv2621c A110V 3 2948524 A T nonsynonymous Rv2621c E12V 1 2948650 T C synonymous Rv2622 R5R 3 2949251 G A nonsynonymous Rv2622 V206I modern 2953307 T G intergenic - - 6 2954318 C T nonsynonymous Rv2627c T144M modern 2955233 T C nonsynonymous Rv2628 L59S 5 2955343 G A nonsynonymous Rv2628 A96T 5 2958044 G T nonsynonymous Rv2631 G158V 3 2958693 G A synonymous Rv2631 A374A 3 2959257 A T intergenic - - 3 2959265 A T intergenic - - modern 2959324 G A intergenic - - 3 2964594 G A nonsynonymous Rv2638 A64T 5 2964876 G C intergenic - - 6 2968468 G C intergenic - - 2 2969197 A G intergenic - - modern 2970017 C G intergenic - - modern 2970019 A G intergenic - - 6 2972107 C T intergenic - - 5 2976579 C T intergenic - - modern 2980970 G T nonsynonymous Rv2660c C74F 5 2981030 C G nonsynonymous Rv2660c P54R 5 2981688 A G synonymous Rv2662 A69A 6 2984105 G A nonsynonymous Rv2667 M70I modern 2985216 A G nonsynonymous Rv2668 K162E 1 2987918 G A synonymous Rv2672 R79R 5 2988374 C T synonymous Rv2672 A231A 4 2988630 G C nonsynonymous Rv2672 D317H 5 2991646 G T synonymous Rv2675c V97V 1 2992564 C T nonsynonymous Rv2676c S22L 1 2993523 A G nonsynonymous Rv2677c Q157R 4 2994187 G A synonymous Rv2678c L292L 5 2998287 G C nonsynonymous Rv2682c G561A modern 3000362 C T nonsynonymous Rv2683 S84L 5 3001754 G T nonsynonymous Rv2684 A381S 4 3003115 G C nonsynonymous Rv2685 G378A 5 3004427 G T nonsynonymous Rv2687c V108L modern 3006898 T C 
synonymous Rv2689c Y55Y

modern 3007238 C T stopgain Rv2690c R658X 5 3009738 A C nonsynonymous Rv2691 Q132P 5 3009759 G C nonsynonymous Rv2691 W139S 3 3010014 G A nonsynonymous Rv2691 G224E 4 3010420 G A nonsynonymous Rv2692 V133I 2 3010993 G A nonsynonymous Rv2693c G126R 6 3011566 C A nonsynonymous Rv2694c R68S 3 3011837 A G intergenic - - 6 3011903 G A intergenic - - 6 3014016 G A synonymous Rv2697c L44L 1 3015379 G A synonymous Rv2700 P59P 5 3015639 A G nonsynonymous Rv2700 Q146R 5 3015834 T C nonsynonymous Rv2700 I211T 5 3016149 C G nonsynonymous Rv2701c A196G 6 3016608 A C nonsynonymous Rv2701c D43A 6 3022369 G T intergenic - - 2 3024021 C A synonymous Rv2711 R153R 1 3025431 C T intergenic - - 3 3027548 C T nonsynonymous Rv2714 P162S modern 3027606 G C nonsynonymous Rv2714 W181S 4 3027798 C T nonsynonymous Rv2714 A245V 6 3029177 C T synonymous Rv2716 A2A 6 3029360 C T synonymous Rv2716 T63T 4 3031168 C T nonsynonymous Rv2719c H124Y 1 3031285 T A nonsynonymous Rv2719c L85M 2 3033189 C T synonymous Rv2721c P477P 5 3035033 C T nonsynonymous Rv2723 S42L 2 3036826 G A nonsynonymous Rv2724c V156M 5 3037048 G A nonsynonymous Rv2724c E82K 5 3037196 G A synonymous Rv2724c A32A 6 3037234 C T nonsynonymous Rv2724c R20C 5 3039020 G A nonsynonymous Rv2726c V261I 6 3039842 T A nonsynonymous Rv2727c W310R 3 3040344 G A synonymous Rv2727c E142E 3 3043700 C T synonymous Rv2731 A225A 6 3043960 C G nonsynonymous Rv2731 A312G 6 3049728 A C nonsynonymous Rv2737c Q566P 5 3050362 G T nonsynonymous Rv2737c D355Y 5 3051911 C T synonymous Rv2738c G34G 5 3052223 T G nonsynonymous Rv2739c W323G 5 3056742 C A nonsynonymous Rv2743c T164K 6 3057309 C T nonsynonymous Rv2744c P252L 6 3057375 G A nonsynonymous Rv2744c R230Q 3 3059791 G C intergenic - - 6 3068710 G T nonsynonymous Rv2756c K458N 5 3068778 G A nonsynonymous Rv2756c E436K 3 3069566 G A nonsynonymous Rv2756c G173D modern 3069805 G C synonymous Rv2756c V93V 3 3072285 T G nonsynonymous Rv2761c L119R 1 3074830 C T 
synonymous Rv2765 Y65Y 3 3076172 G A nonsynonymous Rv2766c D67N 6 3085752 C T synonymous Rv2778c R144R 1 3086261 A G nonsynonymous Rv2779c Q165R

6 3086728 G A nonsynonymous Rv2779c M9I 5 3086788 C T intergenic - - 5 3087187 C G nonsynonymous Rv2780 A123G 5 3087190 A C nonsynonymous Rv2780 D124A modern 3088625 C T synonymous Rv2781c V120V 3 3089299 G C nonsynonymous Rv2782c G355R 5 3096576 G A nonsynonymous Rv2787 R489Q 1 3097349 C T stopgain Rv2788 Q131X 5 3098714 A G synonymous Rv2789c S75S 6 3103497 G T nonsynonymous Rv2794c W148C 4 3104189 C T synonymous Rv2795c C241C modern 3105144 T G nonsynonymous Rv2796c F159C 5 3106231 G A synonymous Rv2797c A359A 5 3106491 C T stopgain Rv2797c Q273X 5 3108299 T G nonsynonymous Rv2799 S178A 3 3111280 C T nonsynonymous Rv2802c L182F 2 3111476 G A synonymous Rv2802c S116S 3 3112700 G A nonsynonymous Rv2804c G132D D4G- 4 3112877 A G nonsynonymous Rv2805 Rv2804c 1 3114814 G A intergenic - - 5 3115108 G A synonymous Rv2808 E21E 5 3116253 C A nonsynonymous Rv2811 Q39K modern 3118449 G A nonsynonymous Rv2813 V76I 6 3119277 G C intergenic - - modern 3119513 T C intergenic - - modern 3119737 G C intergenic - - modern 3119740 T G intergenic - - modern 3119741 T A intergenic - - 6 3119769 A G intergenic - - 1 3120212 G A intergenic - - 5 3121880 G A intergenic - - 5 3122621 C G intergenic - - modern 3122949 C T intergenic - - 1 3122954 C T intergenic - - 3 3123247 T A intergenic - - 6 3123291 C T intergenic - - 5 3125087 T G nonsynonymous Rv2818c I353S 6 3125180 G C nonsynonymous Rv2818c R322P 6 3125235 C T stopgain Rv2818c Q304X 6 3127466 A C synonymous Rv2820c R269R 5 3128667 C T synonymous Rv2821c L99L 5 3130150 C G nonsynonymous Rv2823c R542G 1 3132956 C A nonsynonymous Rv2825c A195E 1 3132975 G C nonsynonymous Rv2825c V189L 4 3133054 C G nonsynonymous Rv2825c C162W 4 3133055 G C nonsynonymous Rv2825c C162S 6 3134839 C T synonymous Rv2827c R215R 1 3135852 C A nonsynonymous Rv2828c A161E 1 3135950 G C synonymous Rv2828c S128S 6 3137026 T G nonsynonymous Rv2830c S67A modern 3137237 C T intergenic - - 1 3137681 G A synonymous Rv2831 V137V

6 3146953 A G synonymous Rv2839c E307E 6 3147243 C T nonsynonymous Rv2839c P211S 6 3148174 T C nonsynonymous Rv2840c Y29H 6 3148356 G A intergenic - - 5 3148511 A C nonsynonymous Rv2841c E306D 6 3149678 G A nonsynonymous Rv2842c R100H 5 3151813 C T nonsynonymous Rv2845c R380C 6 3152837 G C synonymous Rv2845c A38A 6 3155164 C T synonymous Rv2847c S236S 5 3157540 A G nonsynonymous Rv2849c Q202R 5 3158512 C T stopgain Rv2850c R515X 5 3159164 C T synonymous Rv2850c D297D 5 3165203 A G intergenic - - 3 3165807 G A synonymous Rv2855 V201V 3 3168492 G A intergenic - - 1 3169993 T C nonsynonymous Rv2858c L244P 6 3172564 C A nonsynonymous Rv2860c T146N 6 3173645 G C nonsynonymous Rv2861c G125R 3 3174013 C T nonsynonymous Rv2861c P2L 4 3174496 C T nonsynonymous Rv2862c R50C 5 3175460 G A synonymous Rv2864c L602L 4 3180988 T G nonsynonymous Rv2869c F259V 5 3184670 T C intergenic - - 6 3187539 C T synonymous Rv2875 V170V 5 3187718 G A nonsynonymous Rv2876 S19N 6 3187792 T C nonsynonymous Rv2876 W44R 5 3188332 C T synonymous Rv2877c V180V 1 3188428 G A synonymous Rv2877c P148P modern 3188769 C T nonsynonymous Rv2877c H35Y 4 3189242 C T synonymous Rv2878c A52A modern 3189580 C T intergenic - - 1 3190342 C T nonsynonymous Rv2880c P113S 5 3193575 C T synonymous Rv2884 A61A 1 3197917 C T synonymous Rv2888c G123G 5 3199103 C T nonsynonymous Rv2889c A2V modern 3200304 G A synonymous Rv2891 V13V 3 3200478 G A synonymous Rv2891 L71L 6 3202515 G A synonymous Rv2893 A32A 1 3202629 C T synonymous Rv2893 H70H 3 3202731 A C synonymous Rv2893 G104G 5 3205077 T C synonymous Rv2895c D52D 3 3208600 C T synonymous Rv2899c Y269Y 3 3212723 G T synonymous Rv2902c A78A 6 3213255 G A synonymous Rv2903c K200K 5 3214120 T C nonsynonymous Rv2904c V45A 1 3214481 G T intergenic - - 3 3219500 A T nonsynonymous Rv2912c K121M 1 3229692 C T nonsynonymous Rv2918c A330V 1 3233605 C T synonymous Rv2921c L179L 3 3236442 C T synonymous Rv2922c H455H 3 3236497 T A nonsynonymous Rv2922c V437E 6 3236716 T G 
nonsynonymous Rv2922c L364R 5 3238190 C A nonsynonymous Rv2923c L104M

1 3241244 C T nonsynonymous Rv2927c T239I
5 3242131 G A intergenic - -
6 3243312 C G intergenic - -
6 3244091 C A nonsynonymous Rv2930 A132E
3 3244113 T C synonymous Rv2930 P139P
modern 3244414 G A nonsynonymous Rv2930 V240M
1 3247089 G A nonsynonymous Rv2931 G549S
3 3247298 C A synonymous Rv2931 G618G
3 3247319 C T synonymous Rv2931 G625G
3 3247340 G A synonymous Rv2931 V632V
6 3247579 G A nonsynonymous Rv2931 R712Q
1 3254758 A C synonymous Rv2932 P1229P
modern 3254880 T G nonsynonymous Rv2932 L1270R
6 3255169 A G synonymous Rv2932 A1366A
5 3265806 A G nonsynonymous Rv2934 I1187V
4 3266030 G A synonymous Rv2934 S1261S
modern 3271037 G A nonsynonymous Rv2935 D1101N
3 3273107 C A synonymous Rv2936 A298A
modern 3273138 G C nonsynonymous Rv2936 D309H
1 3274545 G A synonymous Rv2938 L158L
5 3275857 G C synonymous Rv2939 L303L
2 3276703 A C nonsynonymous Rv2940c T2005P
6 3277599 C G nonsynonymous Rv2940c A1706G
1 3280132 C T nonsynonymous Rv2940c R862W
6 3281634 C T nonsynonymous Rv2940c A361V
6 3283592 G C synonymous Rv2941 G86G
2 3284855 C T synonymous Rv2941 I507I
6 3286107 C A synonymous Rv2942 V346V
6 3286566 C T synonymous Rv2942 A499A
5 3286789 G A nonsynonymous Rv2942 G574S
1 3293423 T C synonymous Rv2946c D977D
1 3293601 A G nonsynonymous Rv2946c Q918R
1 3295124 G A synonymous Rv2946c R410R
modern 3296721 C G nonsynonymous Rv2947c R374G
5 3296934 G T nonsynonymous Rv2947c A303S
5 3296935 G T synonymous Rv2947c L302L
5 3297989 A G nonsynonymous Rv2948c I656V
5 3298691 A C synonymous Rv2948c R422R
1 3299413 C T nonsynonymous Rv2948c A181V
5 3300479 T G nonsynonymous Rv2949c V31G
3 3302589 A C intergenic - -
modern 3302683 T C intergenic - -
2 3304966 G A nonsynonymous Rv2952 G176R
1 3306169 T C synonymous Rv2953 T297T
1 3306175 G C synonymous Rv2953 A299A
3 3306441 G A nonsynonymous Rv2953 R388Q
1 3306594 T G intergenic - -
1 3308446 A G nonsynonymous Rv2955c T34A
5 3309071 G A nonsynonymous Rv2956 G135D
5 3309916 C A nonsynonymous Rv2957 F149L
modern 3311119 G C synonymous Rv2958c V294V
3 3312620 G T nonsynonymous Rv2959c E73D
1 3312942 G A intergenic - -


4 3314412 C T synonymous Rv2962c A237A
6 3317795 G C intergenic - -
5 3320271 G A nonsynonymous Rv2967c A926T
5 3325336 C T nonsynonymous Rv2970c A123V
1 3326150 A G nonsynonymous Rv2971 D17G
4 3326554 C A nonsynonymous Rv2971 H152N
1 3328495 C G synonymous Rv2973c R484R
1 3336528 T A intergenic - -
5 3345427 T C nonsynonymous Rv2988c S217P
6 3346980 G A nonsynonymous Rv2990c G247E
6 3348258 G A nonsynonymous Rv2991 A93T
3 3348536 C T intergenic - -
3 3349917 C T nonsynonymous Rv2992c H121Y
5 3351172 G A intergenic - -
1 3351472 G A stopgain Rv2994 W68X
6 3353082 C G synonymous Rv2995c T129T
3 3355949 A C nonsynonymous Rv2997 D284A
6 3356517 T C synonymous Rv2997 P473P
5 3356624 G C intergenic - -
3 3357464 C G intergenic - -
6 3363185 A T intergenic - -
5 3363584 G A synonymous Rv3004 T79T
1 3365841 C T nonsynonymous Rv3007c P204S
1 3366420 G C nonsynonymous Rv3007c G11R
modern 3369869 T G intergenic - -
modern 3371260 G A nonsynonymous Rv3011c V59I
1 3378828 G T nonsynonymous Rv3019c W58C
6 3383287 A G nonsynonymous Rv3024c K201R
3 3385218 G A nonsynonymous Rv3026c G287E
6 3389840 C A nonsynonymous Rv3030 T247N
5 3391138 G A nonsynonymous Rv3031 S406N
6 3393311 C T intergenic - -
3 3393640 C T synonymous Rv3033 N87N
5 3395038 G A intergenic - -
3 3395654 G A synonymous Rv3035 L92L
6 3395847 G T nonsynonymous Rv3035 V157F
5 3399945 A C nonsynonymous Rv3039c E80A
6 3400476 G T nonsynonymous Rv3040c A195S
1 3401850 C T synonymous Rv3041c D23D
1 3406798 G A nonsynonymous Rv3045 A172T
1 3407028 C T synonymous Rv3045 N248N
3 3413785 G A nonsynonymous Rv3051c V128I
modern 3415332 G A intergenic - -
5 3416432 G A nonsynonymous Rv3055 V118I
5 3416630 G A nonsynonymous Rv3055 G184S
4 3420825 G A nonsynonymous Rv3059 G445D
1 3424462 G A synonymous Rv3061c L322L
6 3425523 T C intergenic - -
6 3425952 C A synonymous Rv3062 G123G
5 3426279 C A nonsynonymous Rv3062 S232R
1 3427632 C T synonymous Rv3063 A130A
3 3428897 G C nonsynonymous Rv3063 R552P
6 3429605 G A intergenic - -


1 3431407 G T intergenic - -
5 3431529 C T synonymous Rv3067 P34P
5 3433326 G A synonymous Rv3068c L99L
3 3437007 G C nonsynonymous Rv3074 A77P
6 3438386 C T synonymous Rv3075c I196I
modern 3440542 G A synonymous Rv3077 G334G
1 3442240 G T stopgain Rv3079c E120X
5 3445777 G C nonsynonymous Rv3080c R71P
1 3447480 T G nonsynonymous Rv3082c L316R
1 3448714 G C nonsynonymous Rv3083 D71H
1 3453382 C T nonsynonymous Rv3087 A153V
4 3454263 G C nonsynonymous Rv3087 V447L
5 3457858 C A intergenic - -
3 3459081 G C nonsynonymous Rv3090 A291P
1 3460765 G C synonymous Rv3091 P550P
6 3463724 G T synonymous Rv3094c A56A
modern 3464629 G A nonsynonymous Rv3096 G28E
4 3467465 C G synonymous Rv3098c A66A
1 3474597 C A synonymous Rv3106 V197V
modern 3475159 G A nonsynonymous Rv3106 D385N
5 3478253 C A nonsynonymous Rv3109 A202E
5 3478767 G A intergenic - -
1 3479561 G A nonsynonymous Rv3111 D131N
6 3479798 G A synonymous Rv3112 G33G
4 3480789 C T nonsynonymous Rv3114 P11S
2 3487108 C T synonymous Rv3121 A200A
3 3488122 G A nonsynonymous Rv3122 G12R
6 3488556 G T nonsynonymous Rv3122 R156S
6 3488687 G A nonsynonymous Rv3123 R40Q
6 3489340 C G intergenic - -
3 3489665 C T nonsynonymous Rv3124 P54S
6 3493823 T C nonsynonymous Rv3128c L120S
5 3496002 A C nonsynonymous Rv3130c E122A
1 3497586 G C synonymous Rv3132c A560A
modern 3498418 C T nonsynonymous Rv3132c T283I
5 3499247 G A nonsynonymous Rv3132c V7I
3 3499497 C G nonsynonymous Rv3133c A140G
5 3503284 A G intergenic - -
5 3504184 T A intergenic - -
5 3504410 C A nonsynonymous Rv3138 H72Q
modern 3505005 C A nonsynonymous Rv3138 P271T
5 3506470 G A nonsynonymous Rv3139 A370T
6 3508970 A G synonymous Rv3141 E292E
modern 3509091 C G intergenic - -
3 3509231 C T nonsynonymous Rv3142c R106C
3 3509301 C G synonymous Rv3142c V82V
5 3511335 T G intergenic - -
3 3515467 C T nonsynonymous Rv3150 P19L
modern 3521044 G A nonsynonymous Rv3153 A180T
1 3526986 A G synonymous Rv3158 A399A
1 3530145 G T nonsynonymous Rv3161c D332Y
4 3530955 C G nonsynonymous Rv3161c L62V
1 3533759 C T synonymous Rv3164c P49P


1 3536008 C A nonsynonymous Rv3167c P17Q
3 3539353 G A synonymous Rv3170 Q283Q
2 3542049 G A intergenic - -
6 3546678 G A nonsynonymous Rv3178 A81T
5 3552581 T G intergenic - -
6 3554217 C T intergenic - -
6 3554298 A G nonsynonymous Rv3188 M1V
modern 3555699 C T nonsynonymous Rv3190c A330V
3 3557253 C T intergenic - -
5 3558733 C G intergenic - -
modern 3560645 A G nonsynonymous Rv3193c Q843R
3 3562338 C A nonsynonymous Rv3193c L279M
6 3564897 G A nonsynonymous Rv3195 M178I
5 3566143 C G nonsynonymous Rv3196 A119G
6 3568004 C A synonymous Rv3197 R327R
6 3570843 C T synonymous Rv3198c S123S
6 3571742 G C nonsynonymous Rv3199c A268P
modern 3571834 C A nonsynonymous Rv3199c P237Q
5 3573080 A G nonsynonymous Rv3200c H197R
modern 3573636 G A nonsynonymous Rv3200c D12N
2 3577497 G A nonsynonymous Rv3202c E902K
modern 3577958 C T nonsynonymous Rv3202c A748V
5 3580275 G C intergenic - -
6 3586551 C A nonsynonymous Rv3209 P93Q
5 3591082 C T synonymous Rv3213c N137N
6 3591488 C T nonsynonymous Rv3213c T2I
5 3591661 C T nonsynonymous Rv3214 H6Y
5 3592709 C T synonymous Rv3215 G152G
5 3596354 A G nonsynonymous Rv3220c D394G
6 3596407 G A synonymous Rv3220c V376V
2 3597249 G C nonsynonymous Rv3220c G96R
1 3597682 G A nonsynonymous Rv3221c V29I
6 3599099 G T nonsynonymous Rv3223c E151D
6 3603178 C T nonsynonymous Rv3226c H49Y
5 3603523 G A synonymous Rv3227 A49A
6 3617228 G C nonsynonymous Rv3239c G126R
6 3622618 G C nonsynonymous Rv3243c L158F
modern 3624486 G A nonsynonymous Rv3244c G142D
5 3630235 C T synonymous Rv3249c R51R
6 3635117 T C nonsynonymous Rv3255c V384A
4 3638093 T C nonsynonymous Rv3257c L206S
3 3645524 G A nonsynonymous Rv3264c D152N
6 3646030 A C synonymous Rv3265c S285S
6 3646033 T G nonsynonymous Rv3265c S284R
modern 3647041 C T nonsynonymous Rv3266c P257S
1 3647591 T C synonymous Rv3266c N73N
5 3648267 G T nonsynonymous Rv3267 G128V
5 3651849 T G nonsynonymous Rv3270 S442A
3 3656206 C A nonsynonymous Rv3273 R524S
5 3656289 C G nonsynonymous Rv3273 D551E
5 3658226 C T nonsynonymous Rv3275c A138V
modern 3658266 C G nonsynonymous Rv3275c R125G
1 3661802 G A nonsynonymous Rv3279c G71S


4 3670040 A G synonymous Rv3289c A124A
1 3670118 C A synonymous Rv3289c G98G
1 3671532 C G nonsynonymous Rv3290c A88G
1 3671843 A C intergenic - -
1 3672105 C G nonsynonymous Rv3291c H65D
6 3673210 G T nonsynonymous Rv3292 G295C
2 3674157 T A nonsynonymous Rv3293 C186S
5 3674194 G C nonsynonymous Rv3293 G198A
1 3678091 G A synonymous Rv3296 P439P
3 3678094 C T synonymous Rv3296 A440A
2 3678249 A C nonsynonymous Rv3296 K492T
3 3679764 G A nonsynonymous Rv3296 S997N
modern 3679949 G A nonsynonymous Rv3296 E1059K
6 3681349 C A synonymous Rv3297 T10T
4 3681548 C A synonymous Rv3297 R77R
1 3683237 C A synonymous Rv3299c A909A
1 3683715 A C nonsynonymous Rv3299c Q750P
5 3684169 G T nonsynonymous Rv3299c D599Y
5 3685487 G A synonymous Rv3299c L159L
3 3685510 A G nonsynonymous Rv3299c S152G
1 3687372 T C nonsynonymous Rv3301c L69S
6 3688648 C T synonymous Rv3302c R265R
4 3690016 C T nonsynonymous Rv3303c S308L
6 3693548 G A nonsynonymous Rv3306c A148T
4 3693681 G T synonymous Rv3306c A103A
modern 3696181 G A nonsynonymous Rv3308 G440S
3 3697585 C T nonsynonymous Rv3310 L130F
modern 3697708 G A nonsynonymous Rv3310 A171T
5 3701552 G A synonymous Rv3313c E211E
6 3702543 C T nonsynonymous Rv3314c R309W
1 3704261 C A nonsynonymous Rv3316 L54I
6 3705098 G A synonymous Rv3318 V33V
5 3705776 C T synonymous Rv3318 R259R
6 3706343 C T synonymous Rv3318 N448N
5 3708317 G A intergenic - -
5 3708768 C T synonymous Rv3322c A95A
3 3714639 G A nonsynonymous Rv3329 R83H
3 3715775 C T intergenic - -
3 3729342 C T nonsynonymous Rv3342 A240V
modern 3743549 G A intergenic - -
1 3753414 T G intergenic - -
modern 3753415 G C intergenic - -
6 3755443 G A intergenic - -
6 3767368 T G nonsynonymous Rv3351c V258G
1 3770325 G A nonsynonymous Rv3356c A109T
3 3771009 T G synonymous Rv3357 S79S
modern 3771628 C A synonymous Rv3359 I95I
modern 3772616 A G intergenic - -
1 3774506 C T synonymous Rv3364c G123G
2 3775409 A G nonsynonymous Rv3365c Q698R
2 3775441 T G nonsynonymous Rv3365c S687R
6 3775639 C A synonymous Rv3365c G621G
6 3776265 C A synonymous Rv3365c R413R


2 3778011 G A nonsynonymous Rv3366 R92Q
2 3778012 G T synonymous Rv3366 R92R
5 3778148 G A nonsynonymous Rv3366 A138T
2 3778396 C T intergenic - -
5 3780715 A G nonsynonymous Rv3368c T89A
modern 3785946 G A nonsynonymous Rv3371 G339R
modern 3786033 G A nonsynonymous Rv3371 V368I
3 3787466 G C nonsynonymous Rv3372 E385Q
modern 3788365 G T stopgain Rv3373 G214X
6 3789077 C T nonsynonymous Rv3375 P153S
5 3790652 C T nonsynonymous Rv3376 T166M
5 3790693 G T nonsynonymous Rv3376 G180C
6 3792262 G T nonsynonymous Rv3377c G31V
1 3793634 T C nonsynonymous Rv3379c C412R
3 3799512 G A nonsynonymous Rv3384c A42T
modern 3808103 G A nonsynonymous Rv3392c S112N
6 3811327 G A nonsynonymous Rv3395c V104M
modern 3811629 T C nonsynonymous Rv3395c V3A
2 3811672 G T intergenic - -
5 3812009 C A nonsynonymous Rv3395A D97E
3 3821503 T G nonsynonymous Rv3402c L130R
5 3822047 C T intergenic - -
6 3823494 G T nonsynonymous Rv3403c A124S
6 3829664 C T nonsynonymous Rv3410c H83Y
2 3830349 G A nonsynonymous Rv3411c A391T
1 3830566 C T synonymous Rv3411c S318S
4 3830695 C T synonymous Rv3411c A275A
5 3835102 G T stopgain Rv3416 E71X
5 3836728 T C nonsynonymous Rv3417c V55A
5 3840932 C T synonymous Rv3423c N163N
6 3841662 T C intergenic - -
6 3841663 C T intergenic - -
5 3850261 G T intergenic - -
1 3851084 T C nonsynonymous Rv3432c I224T
4 3851887 T G nonsynonymous Rv3433c S443A
4 3851888 G A synonymous Rv3433c A442A
1 3854899 T C intergenic - -
5 3858894 C T nonsynonymous Rv3439c L257F
3 3860216 C T synonymous Rv3441c A385A
1 3861914 G A nonsynonymous Rv3442c A13T
6 3862148 C T synonymous Rv3443c P81P
1 3864041 C T nonsynonymous Rv3446c A164V
5 3864816 G A synonymous Rv3447c S1141S
5 3865398 C A synonymous Rv3447c V947V
5 3866953 T C nonsynonymous Rv3447c V429A
5 3867027 G A synonymous Rv3447c T404T
1 3868738 C A synonymous Rv3448 I129I
3 3869355 T C nonsynonymous Rv3448 I335T
5 3870238 T G nonsynonymous Rv3449 S163A
4 3871246 G A synonymous Rv3450c G417G
5 3872171 C T nonsynonymous Rv3450c A109V
6 3872797 A G nonsynonymous Rv3451 I61V
3 3874745 G C intergenic - -


5 3875633 A G nonsynonymous Rv3454 Y271C
5 3876305 C G nonsynonymous Rv3455c T214S
1 3876953 C T synonymous Rv3456c V160V
3 3880175 C A intergenic - -
5 3882025 G A synonymous Rv3464 L63L
6 3883278 A G nonsynonymous Rv3465 T149A
1 3883467 A G intergenic - -
6 3893290 C T nonsynonymous Rv3476c A144V
4 3893480 T C nonsynonymous Rv3476c F81L
4 3895727 A C intergenic - -
6 3898522 G A synonymous Rv3479 S901S
modern 3898869 A C nonsynonymous Rv3479 N1017T
1 3899654 G T nonsynonymous Rv3480c R250L
5 3902782 G C nonsynonymous Rv3483c A11P
3 3908062 C T intergenic - -
4 3909235 C G nonsynonymous Rv3490 L334V
5 3909589 G A nonsynonymous Rv3490 G452R
1 3913737 C T synonymous Rv3495c A266A
3 3918649 A G synonymous Rv3499c G184G
5 3919261 G A nonsynonymous Rv3500c A268T
1 3920109 C A synonymous Rv3501c I251I
1 3921094 G A nonsynonymous Rv3502c R316H
modern 3922836 C T synonymous Rv3504 D122D
5 3923267 A C nonsynonymous Rv3504 E266A
6 3939062 C G nonsynonymous Rv3510c L66V
5 3939405 G T intergenic - -
5 3945304 C G nonsynonymous Rv3513c Q149E
5 3951036 C G nonsynonymous Rv3515c Q479E
5 3953660 A C nonsynonymous Rv3517 Q77P
3 3954222 G C nonsynonymous Rv3517 E264D
5 3956535 G A synonymous Rv3520c V278V
5 3957514 C T intergenic - -
1 3958007 C T nonsynonymous Rv3521 H163Y
5 3964234 A T intergenic - -
6 3968375 C G synonymous Rv3531c S190S
6 3968618 C T synonymous Rv3531c T109T
2 3970594 C G intergenic - -
1 3974142 C T nonsynonymous Rv3535c P120L
2 3979990 G C nonsynonymous Rv3540c V224L
6 3980437 T G nonsynonymous Rv3540c S75A
6 3981329 C T nonsynonymous Rv3542c H218Y
4 3984321 T C synonymous Rv3545c H375H
1 3984926 C T synonymous Rv3545c L174L
5 3986987 G T synonymous Rv3547 L48L
5 3987180 G A nonsynonymous Rv3547 D113N
1 3989107 C T intergenic - -
modern 3989914 T G nonsynonymous Rv3551 S7A
3 3990093 C T synonymous Rv3551 V66V
2 3993058 G A nonsynonymous Rv3554 G125E
3 3994101 A G nonsynonymous Rv3554 I473V
modern 3994898 C T nonsynonymous Rv3555c R268W
6 3995060 G T nonsynonymous Rv3555c V214F
6 3999774 C T nonsynonymous Rv3559c A221V


5 3999805 A G nonsynonymous Rv3559c T211A
2 4001622 T C intergenic - -
6 4001813 G A synonymous Rv3561 V59V
6 4002847 C A nonsynonymous Rv3561 T404N
1 4003645 C T nonsynonymous Rv3562 A162V
6 4004907 A T nonsynonymous Rv3563 E206V
4 4005114 C G nonsynonymous Rv3563 S275W
5 4006943 G A synonymous Rv3565 V248V
3 4007272 G A nonsynonymous Rv3565 R358Q
6 4007432 G A nonsynonymous Rv3566c E251K
5 4008252 G A nonsynonymous Rv3566A G61D
4 4008747 C T nonsynonymous Rv3567c T179I
5 4008863 A G synonymous Rv3567c S140S
5 4011992 C T synonymous Rv3570c N93N
modern 4012219 G A nonsynonymous Rv3570c D18N
5 4012274 C T intergenic - -
3 4012286 C T intergenic - -
1 4013076 G A synonymous Rv3571 G220G
1 4014431 G A synonymous Rv3573c P594P
1 4019103 G T intergenic - -
6 4021757 C A synonymous Rv3579c R213R
1 4022652 C T synonymous Rv3580c S384S
1 4024079 A G nonsynonymous Rv3581c Q90R
5 4026414 G A intergenic - -
1 4026800 G C synonymous Rv3585 A119A
4 4028752 G A nonsynonymous Rv3586 A288T
6 4031202 C A nonsynonymous Rv3589 A237D
5 4033260 C T intergenic - -
3 4033711 T C nonsynonymous Rv3591c V111A
6 4035242 C A nonsynonymous Rv3593 F297L
1 4040517 T C nonsynonymous Rv3596c V63A
5 4040824 G A intergenic - -
2 4041581 G A nonsynonymous Rv3598c A454T
5 4041899 A C nonsynonymous Rv3598c I348L
3 4044872 C T synonymous Rv3602c G113G
5 4045844 C T nonsynonymous Rv3603c T92I
6 4046218 G A intergenic - -
6 4051853 T C nonsynonymous Rv3610c V344A
3 4052349 A C nonsynonymous Rv3610c K179Q
modern 4054637 C T synonymous Rv3614c G20G
4 4056416 A C intergenic - -
2 4056693 G T intergenic - -
3 4057036 A G intergenic - -
1 4058711 T C nonsynonymous Rv3618 L5S
3 4059186 G T synonymous Rv3618 A163A
5 4060201 C T nonsynonymous Rv3619c S23L
5 4062582 C G nonsynonymous Rv3623 A19G
6 4064918 G A nonsynonymous Rv3626c G329D
6 4067044 C T nonsynonymous Rv3627c A81V
modern 4067152 T C nonsynonymous Rv3627c V45A
5 4067386 C T intergenic - -
6 4069598 T C nonsynonymous Rv3630 S142P
1 4069797 C T nonsynonymous Rv3630 S208L


5 4074437 G C synonymous Rv3635 P268P
6 4074919 C T nonsynonymous Rv3635 P429L
5 4075626 C T intergenic - -
6 4078292 A C nonsynonymous Rv3639c K53N
5 4080434 G A nonsynonymous Rv3641c D43N
1 4081987 C G synonymous Rv3644c A245A
1 4081996 C G synonymous Rv3644c P242P
1 4083360 T G nonsynonymous Rv3645 I185S
6 4084010 G A nonsynonymous Rv3645 G402S
1 4085200 C G nonsynonymous Rv3646c D686E
5 4086604 C T synonymous Rv3646c Y218Y
6 4086697 C A synonymous Rv3646c A187A
3 4087495 A G intergenic - -
6 4087670 G A synonymous Rv3647c V173V
3 4087880 C T synonymous Rv3647c P103P
4 4089058 C T nonsynonymous Rv3649 P93L
6 4090238 C T synonymous Rv3649 G486G
3 4090453 C G nonsynonymous Rv3649 A558G
3 4092376 T C nonsynonymous Rv3651 I179T
1 4092921 G T intergenic - -
4 4095295 G A synonymous Rv3655c E2E
1 4096190 C T synonymous Rv3658c A250A
3 4096636 G A nonsynonymous Rv3658c G102S
3 4097569 C A synonymous Rv3659c A142A
6 4098514 A G nonsynonymous Rv3660c D212G
6 4099060 C T nonsynonymous Rv3660c P30L
1 4101018 C T intergenic - -
modern 4105757 A G nonsynonymous Rv3666c Q443R
6 4106075 G A nonsynonymous Rv3666c R337H
1 4106154 G A nonsynonymous Rv3666c E311K
4 4107074 G A nonsynonymous Rv3666c R4Q
5 4109151 G A nonsynonymous Rv3667 D454N
6 4110626 C T intergenic - -
4 4112429 G A nonsynonymous Rv3671c V363I
5 4113005 G A nonsynonymous Rv3671c A171T
1 4115029 C T synonymous Rv3673c D43D
3 4115952 C T intergenic - -
modern 4116682 C T synonymous Rv3676 L69L
5 4117097 A C nonsynonymous Rv3676 E207A
modern 4117161 A C intergenic - -
5 4119114 C G synonymous Rv3679 G113G
modern 4119246 C T synonymous Rv3679 D157D
5 4120451 G T synonymous Rv3680 A219A
5 4123685 C A nonsynonymous Rv3682 D590E
5 4123724 G A synonymous Rv3682 A603A
5 4124983 C T synonymous Rv3683 R189R
1 4126087 T G nonsynonymous Rv3684 S217A
6 4127009 C A intergenic - -
6 4128152 G A nonsynonymous Rv3685c V192I
5 4128879 T C nonsynonymous Rv3686c W69R
6 4130604 G A nonsynonymous Rv3689 S83N
modern 4130711 C G nonsynonymous Rv3689 L119V
3 4132509 G A intergenic - -


2 4133316 A T nonsynonymous Rv3691 T267S
5 4133466 T C synonymous Rv3691 L317L
1 4133907 G A nonsynonymous Rv3692 R131H
5 4134341 G A nonsynonymous Rv3692 V276I
5 4134401 G A nonsynonymous Rv3692 A296T
5 4137136 G A intergenic - -
1 4137190 T C intergenic - -
modern 4138377 C T nonsynonymous Rv3696c A460V
3 4138622 C T synonymous Rv3696c R378R
5 4139131 G A nonsynonymous Rv3696c E209K
6 4141285 C T nonsynonymous Rv3698 R265C
3 4142192 G A nonsynonymous Rv3699 G50E
1 4142689 G A nonsynonymous Rv3699 E216K
5 4144371 A G synonymous Rv3701c V182V
4 4145737 C T synonymous Rv3703c Y385Y
6 4148162 G T nonsynonymous Rv3704c A9S
5 4153687 C T intergenic - -
1 4155266 C G synonymous Rv3710 G469G
modern 4156239 C T nonsynonymous Rv3711c A164V
4 4156503 A G nonsynonymous Rv3711c D76G
6 4158032 A C nonsynonymous Rv3712 K351T
modern 4158361 G A synonymous Rv3713 P45P
1 4160536 C T synonymous Rv3716c G126G
5 4161854 G T synonymous Rv3718c V135V
1 4163558 G C nonsynonymous Rv3719 S418T
3 4166290 C T nonsynonymous Rv3721c H148Y
2 4167656 T C nonsynonymous Rv3722c M158T
1 4169719 C T synonymous Rv3724B A38A
6 4172173 A C nonsynonymous Rv3726 R251S
6 4173849 C T nonsynonymous Rv3727 R299C
2 4174131 G C nonsynonymous Rv3727 G393R
4 4179089 T C nonsynonymous Rv3729 S269P
6 4179391 G A stopgain Rv3729 W369X
2 4179832 G C nonsynonymous Rv3729 Q516H
2 4182387 G C synonymous Rv3731 V210V
6 4183288 G A nonsynonymous Rv3732 G119S
6 4183602 C T synonymous Rv3732 A223A
6 4186050 G T intergenic - -
6 4186864 C T synonymous Rv3736 I77I
1 4187063 G A nonsynonymous Rv3736 G144R
5 4189191 G T nonsynonymous Rv3737 R498L
modern 4190532 C A intergenic - -
1 4190596 A C intergenic - -
5 4190639 G A intergenic - -
6 4192341 C T synonymous Rv3741c R171R
5 4192797 A G synonymous Rv3741c Q19Q
6 4193449 G C nonsynonymous Rv3743c G642A
6 4193641 T C nonsynonymous Rv3743c V578A
6 4195182 C T synonymous Rv3743c C64C
3 4195390 A T intergenic - -
3 4195799 C A synonymous Rv3744 G120G
6 4195897 C T synonymous Rv3745c L68L
modern 4197189 G A intergenic - -

modern 4200220 C T nonsynonymous Rv3753c T1M
5 4200686 C T nonsynonymous Rv3754 T89M
1 4201105 G A nonsynonymous Rv3754 D229N
6 4201352 C T synonymous Rv3755c T179T
5 4201535 G A synonymous Rv3755c V118V
5 4201728 C T nonsynonymous Rv3755c P54L
modern 4202383 G T synonymous Rv3756c V77V
5 4205237 G A nonsynonymous Rv3759c G46E
2 4205325 T C nonsynonymous Rv3759c W17R
5 4208331 G T synonymous Rv3762c A182A
modern 4210876 G A nonsynonymous Rv3764c R45H
6 4212174 T C intergenic - -
5 4214206 G C nonsynonymous Rv3768 R46P
6 4215001 C A intergenic - -
4 4215484 G C nonsynonymous Rv3770c A98P
4 4217557 G A nonsynonymous Rv3772 A142T
6 4219820 G C nonsynonymous Rv3775 A46P
6 4220427 C T nonsynonymous Rv3775 P248L
3 4222131 C T nonsynonymous Rv3776 A348V
6 4225401 C T synonymous Rv3779 T139T
3 4226275 C T nonsynonymous Rv3779 P431S
6 4228997 C T synonymous Rv3782 C217C
5 4230965 G T nonsynonymous Rv3784 G237V
5 4231528 A G nonsynonymous Rv3785 Y70C
modern 4232327 G A synonymous Rv3785 R336R
5 4233541 G A synonymous Rv3786c T19T
6 4236150 C T synonymous Rv3790 G124G
6 4236891 C A synonymous Rv3790 P371P
5 4239843 A C nonsynonymous Rv3792 K638Q
modern 4240671 T C nonsynonymous Rv3793 I270T
1 4241042 A G nonsynonymous Rv3793 N394D
6 4241843 C A nonsynonymous Rv3793 L661I
3 4242075 G A nonsynonymous Rv3793 R738Q
6 4244379 C T nonsynonymous Rv3794 P383S
5 4244635 T C nonsynonymous Rv3794 V468A
5 4245147 C T nonsynonymous Rv3794 P639S
1 4245969 C T nonsynonymous Rv3794 P913S
6 4246864 C T synonymous Rv3795 V117V
modern 4247646 C A nonsynonymous Rv3795 A378E
4 4251297 C G synonymous Rv3797 G71G
1 4254347 T A intergenic - -
2 4254431 C T synonymous Rv3799c D506D
3 4258447 C T synonymous Rv3800c R900R
modern 4261499 T C nonsynonymous Rv3801c V523A
5 4262256 A G nonsynonymous Rv3801c I271V
modern 4263279 G A intergenic - -
3 4266647 T C synonymous Rv3804c V4V
2 4267647 A G nonsynonymous Rv3805c D397G
6 4269351 C T synonymous Rv3806c A161A
modern 4269387 C A nonsynonymous Rv3806c D149E
6 4269522 C T synonymous Rv3806c T104T
modern 4269606 C T synonymous Rv3806c R76R
6 4270171 C A nonsynonymous Rv3807c A56E


5 4271348 C T nonsynonymous Rv3808c A311V
6 4271498 T C nonsynonymous Rv3808c L261P
6 4272211 G A synonymous Rv3808c V23V
5 4275241 C A synonymous Rv3811 L148L
1 4275935 A G nonsynonymous Rv3811 M380V
1 4276306 C T synonymous Rv3811 G503G
1 4280441 G T synonymous Rv3815c P116P
1 4281143 G C synonymous Rv3816c V143V
1 4281272 C T synonymous Rv3816c Y100Y
2 4284429 C T nonsynonymous Rv3820c P466L
3 4286826 A G nonsynonymous Rv3822 K36E
4 4287164 G A synonymous Rv3822 G148G
modern 4287361 T C nonsynonymous Rv3822 V214A
6 4289216 G A nonsynonymous Rv3823c G772S
modern 4289953 C T nonsynonymous Rv3823c A526V
2 4290135 G A synonymous Rv3823c L465L
modern 4290564 A C synonymous Rv3823c A322A
modern 4290827 C G nonsynonymous Rv3823c R235G
6 4292095 G T synonymous Rv3824c A360A
5 4292317 C T synonymous Rv3824c F286F
3 4292941 C T synonymous Rv3824c H78H
5 4293133 G C nonsynonymous Rv3824c W14C
5 4296229 A G nonsynonymous Rv3825c D1126G
1 4296381 C A synonymous Rv3825c T1075T
2 4301075 G C nonsynonymous Rv3826 E422Q
1 4303407 T G nonsynonymous Rv3829c S534R
6 4303554 C A synonymous Rv3829c G485G
6 4303675 T C nonsynonymous Rv3829c L445P
5 4304824 C G nonsynonymous Rv3829c P62R
3 4305243 C T nonsynonymous Rv3830c P148L
5 4306059 C T synonymous Rv3831 A101A
1 4306339 C A nonsynonymous Rv3832c A158E
1 4307344 C G nonsynonymous Rv3833 L160V
2 4308395 C T synonymous Rv3834c L174L
6 4308991 G A intergenic - -
4 4313128 T C nonsynonymous Rv3839 S122P
5 4313357 G A nonsynonymous Rv3839 R198H
1 4314843 C T synonymous Rv3842c D240D
6 4316322 G A nonsynonymous Rv3843c R92H
modern 4316566 C G nonsynonymous Rv3843c R11G
5 4317750 C T intergenic - -
modern 4318425 G C intergenic - -
5 4319352 T C synonymous Rv3845 A24A
2 4319985 G C intergenic - -
6 4320299 G A intergenic - -
5 4322042 G A nonsynonymous Rv3847 A169T
modern 4323006 A G synonymous Rv3848 A227A
6 4326465 A G nonsynonymous Rv3854c I337V
5 4326928 C T synonymous Rv3854c G182G
5 4327103 G A nonsynonymous Rv3854c G124D
3 4328492 C T nonsynonymous Rv3856c A306V
5 4328644 T C synonymous Rv3856c R255R
4 4329782 A G intergenic - -


6 4330238 G A nonsynonymous Rv3858c R423H
6 4333284 A G nonsynonymous Rv3859c N933S
modern 4334425 G C nonsynonymous Rv3859c E553Q
modern 4336597 A C intergenic - -
3 4336991 C T nonsynonymous Rv3860 A72V
1 4337574 A G synonymous Rv3860 E266E
1 4338603 G A intergenic - -
5 4339610 G A synonymous Rv3863 L254L
6 4339880 C T synonymous Rv3863 A344A
5 4340964 A C nonsynonymous Rv3864 D232A
5 4340966 C A nonsynonymous Rv3864 L233I
5 4340999 G T nonsynonymous Rv3864 E244L
5 4341000 A T nonsynonymous Rv3864 E244V
5 4343653 G A nonsynonymous Rv3868 G114S
3 4343784 G A synonymous Rv3868 K157K
1 4344058 T C nonsynonymous Rv3868 S249P
5 4345036 A C intergenic - -
3 4345548 G A nonsynonymous Rv3869 M170I
6 4346843 C A synonymous Rv3870 T121T
5 4347337 C A nonsynonymous Rv3870 A286D
5 4357657 C T nonsynonymous Rv3879c A709V
1 4357773 C T synonymous Rv3879c T670T
2 4357804 A C nonsynonymous Rv3879c E660A
1 4357946 G A nonsynonymous Rv3879c G613S
6 4358866 C T nonsynonymous Rv3879c P306L
6 4359202 C T nonsynonymous Rv3879c S194F
5 4361250 T C nonsynonymous Rv3881c Y226H
6 4362384 G C nonsynonymous Rv3882c G346A
2 4362568 A C synonymous Rv3882c R285R
6 4364323 C T synonymous Rv3883c D145D
6 4365212 A G nonsynonymous Rv3884c N543D
modern 4367649 C T synonymous Rv3885c Y291Y
1 4369499 C T synonymous Rv3886c P224P
5 4371331 G T nonsynonymous Rv3887c K118N
6 4372275 G A synonymous Rv3888c G144G
4 4372353 G C synonymous Rv3888c R118R
modern 4374228 G A nonsynonymous Rv3891c A49T
modern 4377033 G C synonymous Rv3894c S1140S
2 4378504 A G nonsynonymous Rv3894c D650G
6 4378608 C T synonymous Rv3894c Y615Y
6 4382296 A G nonsynonymous Rv3896c I186V
5 4382553 C T nonsynonymous Rv3896c A100V
1 4383442 A G nonsynonymous Rv3897c I67V
4 4383655 C T stopgain Rv3898c Q111X
4 4384007 G C intergenic - -
modern 4385187 C T nonsynonymous Rv3899c P65S
6 4386257 C G nonsynonymous Rv3900c P18A
6 4386625 G A nonsynonymous Rv3901c V64M
5 4386746 A G nonsynonymous Rv3901c I23M
5 4387392 C T synonymous Rv3902c F168F
5 4387423 C T nonsynonymous Rv3902c S158F
5 4388976 C T nonsynonymous Rv3903c P486L
1 4390380 T C nonsynonymous Rv3903c V18A


6 4393838 C T synonymous Rv3908 L130L
modern 4394210 C G nonsynonymous Rv3909 R7G
modern 4395387 G A nonsynonymous Rv3909 S399N
3 4396495 C A synonymous Rv3909 G768G
5 4397110 G A nonsynonymous Rv3910 V172M
5 4397374 G A nonsynonymous Rv3910 A260T
5 4397763 C A synonymous Rv3910 P389P
6 4398223 C T nonsynonymous Rv3910 R543W
1 4398732 G A synonymous Rv3910 L712L
5 4399422 G A synonymous Rv3910 A942A
6 4400663 C G nonsynonymous Rv3911 R160G
5 4400947 C G nonsynonymous Rv3912 D26E
1 4401400 A G synonymous Rv3912 P177P
3 4401509 C T nonsynonymous Rv3912 R214W
4 4407588 G A synonymous Rv3919c A205A
1 4407873 G T synonymous Rv3919c V110V
6 4408570 C T nonsynonymous Rv3920c R110W
modern 4408920 G A intergenic - -
4 4408923 T C intergenic - -
3 4409954 C G nonsynonymous Rv3921c A39G
1 4410386 C A synonymous Rv3922c R10R
1 4411016 G A intergenic - -

264 Appendix C: Lineage-specific SNPs within genes associated with drug resistance

Appendix C

Lineage-specific SNPs located within genes associated with M. tuberculosis drug resistance, as identified in the TBDreaMDB database. The last column (DR) indicates the associated drug resistance: EMB, ethambutol; ETH, ethionamide; FLQ, fluoroquinolones; INH, isoniazid; RIF, rifampicin; SM, streptomycin.

Lineage  Genomic position  Ancestral allele  Derived allele  Mutation type  Gene  Mutation  DR
5  408935  C  T  nonsynonymous  Rv0340  A101V  EMB
5  409079  G  A  nonsynonymous  Rv0340  G149E  EMB
6  1416633  G  C  nonsynonymous  embR  L239V  EMB
1  1417019  C  T  nonsynonymous  embR  C110Y  EMB
5  1507920  G  A  synonymous  Rv1341  V116V  EMB
3  3489665  C  T  nonsynonymous  moaR1  P54S  EMB
3  3645524  C  T  nonsynonymous  manB  D152N  EMB
1  3647591  A  G  synonymous  rmlD  N73N  EMB
modern  3647041  G  A  nonsynonymous  rmlD  P257S  EMB
modern  4240671  T  C  nonsynonymous  embC  I270T  EMB
1  4241042  A  G  nonsynonymous  embC  N394D  EMB
6  4241843  C  A  nonsynonymous  embC  L661I  EMB
3  4242075  G  A  nonsynonymous  embC  R738Q  EMB
6  4244379  C  T  nonsynonymous  embA  P383S  EMB
5  4244635  T  C  nonsynonymous  embA  V468A  EMB
5  4245147  C  T  nonsynonymous  embA  P639S  EMB
1  4245969  C  T  nonsynonymous  embA  P913S  EMB
6  4246864  C  T  synonymous  embB  V117V  EMB
modern  4247646  C  A  nonsynonymous  embB  A378E  EMB
6  1674434  T  C  nonsynonymous  inhA  V78A  ETH
5  4326928  G  A  synonymous  ethA  G182G  ETH
6  4326465  T  C  nonsynonymous  ethA  I337V  ETH
5  4327103  C  T  nonsynonymous  ethA  G124D  ETH
1  6112  G  C  nonsynonymous  gyrB  M330I  FLQ
modern  9143  C  T  synonymous  gyrA  I614I  FLQ
5  9566  C  T  synonymous  gyrA  Y755Y  FLQ
1  8452  C  T  nonsynonymous  gyrA  A384V  FLQ
6  8493  C  T  nonsynonymous  gyrA  L398F  FLQ
3  157129  C  T  nonsynonymous  fbpC  G158S  INH
modern  412280  G  T  nonsynonymous  iniA  Q481H  INH
5  2101921  C  T  synonymous  ndh  S374S  INH
6  2155503  G  A  synonymous  katG  T203T  INH
4  2154724  A  C  nonsynonymous  katG  L463R  INH
3  2516271  T  C  nonsynonymous  Rv2242  M323T  INH
5  2516804  A  C  synonymous  fabD  A6A  INH
modern  2518132  T  C  synonymous  kasA  T6T  INH
2  2521428  A  G  nonsynonymous  accD6  D229G  INH
1  2726051  G  A  nonsynonymous  oxyR'  L13F  INH
5  3506470  G  A  nonsynonymous  fadE24  A370T  INH
6  4007432  C  T  nonsynonymous  nat  E251K  INH
3  2726105  G  A  intergenic  #N/A  -  oxyR/ahpC upstream
3  762434  T  G  synonymous  rpoB  G876G  RIF
4  763031  C  T  synonymous  rpoB  A1075A  RIF
6  760969  C  T  nonsynonymous  rpoB  S388L  RIF
6  761723  A  C  nonsynonymous  rpoB  E639D  RIF
3  759746  C  T  intergenic  #N/A  -  rpoB upstream
4  4407588  C  T  synonymous  gid  A205A  SM
1  4407873  C  A  synonymous  gid  V110V  SM
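The Mutation column in Appendices B and C gives the ancestral amino acid, the codon number and the derived amino acid (for example L463R), and the mutation type follows from comparing the two residues. The sketch below illustrates that reading of the notation only. The function name `classify_mutation` is hypothetical, and treating `-` as intergenic and `X` as a stop codon are assumptions based on how the tables appear to use them (W369X is taken from Appendix B).

```python
import re

# Hypothetical helper: pattern for strings such as "A101V"
# (ancestral residue, codon number, derived residue).
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def classify_mutation(mutation: str) -> str:
    """Return the mutation type implied by a string such as 'L463R'."""
    if mutation == "-":
        # Assumption: '-' marks SNPs with no amino-acid change recorded,
        # which the tables list as intergenic.
        return "intergenic"
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation string: {mutation!r}")
    ancestral, _codon, derived = match.groups()
    if derived == "X":
        # Assumption: 'X' denotes a premature stop codon (stopgain).
        return "stopgain"
    if ancestral == derived:
        return "synonymous"
    return "nonsynonymous"


for example in ["L463R", "G182G", "W369X", "-"]:
    print(example, classify_mutation(example))
```

A check of this kind could be used to confirm that the Mutation type and Mutation columns of a row agree with each other.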

266 Appendix D: Nonsynonymous/synonymous SNP ratio

Appendix D

Nonsynonymous to synonymous SNP ratios for the 28 genomes used in the study. A. Ratio based on lineage branch SNPs. B. Ratio based on all SNPs within each of the 28 strains used in the study. This ratio therefore includes the singleton SNPs present in the extant strains.

A. Lineage-specific SNPs (internal branches)

Lineage  Nonsynonymous SNP  Synonymous SNP  Nonsynonymous / Synonymous
1  238  156  1.5
5  385  213  1.8
6  374  206  1.8
2  74  33  2.2
3  182  117  1.6
4  96  46  2.1
Modern branch  172  78  2.2
Average  -  -  1.9
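The ratios in table A can be reproduced from the SNP counts. The sketch below is illustrative only. It assumes the first count in each row is the nonsynonymous total and the second the synonymous total (the ordering that matches the printed ratios), and that the average is the mean of the per-branch ratios.

```python
# Counts copied from table A: (nonsynonymous, synonymous) per lineage branch.
counts = {
    "1": (238, 156),
    "5": (385, 213),
    "6": (374, 206),
    "2": (74, 33),
    "3": (182, 117),
    "4": (96, 46),
    "Modern branch": (172, 78),
}

# Nonsynonymous / synonymous ratio for each branch.
ratios = {name: nonsyn / syn for name, (nonsyn, syn) in counts.items()}

# Assumption: the table's "Average" is the mean of the per-branch ratios.
average = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}")
print(f"Average: {average:.1f}")
```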


B. SNPs in external branches

Strain  Lineage  Nonsynonymous SNP  Synonymous SNP  Nonsynonymous / Synonymous
MTB_95_0545  1  473  296  1.7
MTB_K21  1  525  325  1.7
MTB_K67  1  495  313  1.6
MTB_K93  1  486  303  1.7
MTB_T17  1  464  317  1.4
MTB_T83  1  464  307  1.5
MTB_T92  1  506  314  1.7
MTB_N0070  1  489  331  1.4
MTB_N0072  1  491  341  1.4
MAF_11821_03  5  546  321  1.5
MAF_5444_04  5  523  329  1.2
MAF_4141_04  6  590  353  1.5
MAF_GM_0981  6  605  353  1.6
MTB_00_1695  2  527  259  2.0
MTB_98_1833  2  518  271  1.9
MTB_M4100A  2  506  279  1.8
MTB_T67  2  530  273  1.9
MTB_T85  2  534  273  1.9
MTB_N0031  2  514  256  2.0
MTB_91_0079  3  492  295  1.7
MTB_K49  3  490  282  1.9
MTB_SG1  3  525  301  1.9
MTB_4783_04  4  475  262  1.8
MTB_erdman  4  445  229  1.9
MTB_GM_1503  4  472  282  1.6
MTB_H37Rv  4  504  283  1.7
MTB_K37  4  440  242  1.8
MTB_KZN_605  4  461  249  1.8
Average  -  -  -  1.7

268 Appendix E: RNA-seq differential expression

Appendix E

Differentially expressed genes, antisense transcripts and sRNAs between Lineages 1 and 2. A. Sense transcription (gene expression). B. Antisense transcription. C. sRNAs.
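In table A, the fold change values are consistent with the Lineage 1 expression value divided by the Lineage 2 value, so values above 1 indicate higher expression in Lineage 1 and values below 1 higher expression in Lineage 2. The sketch below illustrates this reading with two rows from table A. The function name `fold_change` is hypothetical, and the log2 transformation is added only as a common convenience for comparing up- and down-regulation symmetrically, not as part of the original analysis.

```python
import math


def fold_change(lineage1: float, lineage2: float) -> float:
    """Ratio of Lineage 1 to Lineage 2 expression (assumed direction)."""
    return lineage1 / lineage2


# (Lineage 1, Lineage 2) expression values copied from table A.
rows = {
    "Rv0027": (76.2, 28.5),
    "Rv0250c": (322.7, 780.7),
}

for gene, (l1, l2) in rows.items():
    fc = fold_change(l1, l2)
    # log2 > 0: higher in Lineage 1; log2 < 0: higher in Lineage 2.
    print(gene, round(fc, 1), round(math.log2(fc), 2))
```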

A. Sense transcription (N=112)

Gene  Lineage 1  Lineage 2  Fold change  p-value  Functional category
Rv0027  76.2  28.5  2.7  1.23E-02  conserved hypotheticals
Rv0028  58.1  23.5  2.5  1.51E-02  conserved hypotheticals
Rv0082  126.5  35.7  3.5  9.27E-03  intermediary metabolism and respiration
Rv0130 htdZ  121.0  48.1  2.5  1.11E-02  intermediary metabolism and respiration
Rv0157A  144.8  66.1  2.2  1.85E-02  conserved hypotheticals
Rv0193c  206.1  78.7  2.6  2.28E-02  conserved hypotheticals
Rv0250c  322.7  780.7  0.4  3.71E-02  conserved hypotheticals
Rv0275c  131.3  359.4  0.4  3.67E-02  regulatory proteins
Rv0469 umaA  2628.7  1217.0  2.2  3.75E-02  lipid metabolism
Rv0553 menC  361.9  57.8  6.3  1.79E-03  intermediary metabolism and respiration
Rv0554 bpoC  542.4  239.8  2.3  1.88E-02  virulence, detoxification, adaptation
Rv0557 mgtA  410.6  127.6  3.2  3.14E-05  lipid metabolism
Rv0619 galTb  66.8  9.7  6.9  5.12E-07  intermediary metabolism and respiration
Rv0620 galK  37.7  1.0  39.3  5.12E-07  intermediary metabolism and respiration
Rv0653c  156.2  53.2  2.9  1.34E-02  regulatory proteins
Rv0686  159.6  477.5  0.3  5.00E-04  cell wall and cell processes
Rv0724A  210.0  58.6  3.6  4.84E-03  conserved hypotheticals
Rv0783c emrB  544.3  244.1  2.2  3.58E-02  cell wall and cell processes
Rv0847 lpqS  236.7  46.7  5.1  1.00E-02  cell wall and cell processes
Rv0877  991.9  431.9  2.3  4.03E-02  conserved hypotheticals
Rv0890c  1871.3  872.9  2.1  1.65E-02  regulatory proteins
Rv1044  79.3  27.5  2.9  7.62E-03  conserved hypotheticals
Rv1075c  755.3  298.4  2.5  1.85E-02  cell wall and cell processes
Rv1103c mazE3  126.3  57.7  2.2  3.57E-02  virulence, detoxification, adaptation
Rv1233c  5340.6  1249.2  4.3  3.08E-02  cell wall and cell processes
Rv1397c vapC10  86.0  862.7  0.1  6.10E-04  virulence, detoxification, adaptation
Rv1433  1469.9  414.0  3.6  4.09E-04  cell wall and cell processes
Rv1440 secG  990.7  376.2  2.6  1.99E-02  cell wall and cell processes

269 Appendix E: RNA-seq differential expression

Rv1503c 462.9 55.1 8.4 3.35E-04 conserved hypotheticals Rv1504c 355.2 30.8 11.5 6.04E-15 conserved hypotheticals Rv1505c 1726.9 137.5 12.6 2.53E-19 conserved hypotheticals Rv1506c 475.6 205.2 2.3 4.06E-03 unknown Rv1508c 1550.0 731.3 2.1 4.45E-02 cell wall and cell processes Rv1530 adh 162.8 71.4 2.3 3.65E-02 intermediary metabolism and respiration Rv1541c lprI 283.8 135.2 2.1 2.20E-02 cell wall and cell processes Rv1551 plsB1 401.3 90.9 4.4 1.25E-06 lipid metabolism Rv1592c 3281.5 701.5 4.7 1.12E-06 conserved hypotheticals Rv1661 pks7 389.2 957.0 0.4 2.07E-03 lipid metabolism Rv1699 pyrG 1516.8 790.4 1.9 3.12E-02 intermediary metabolism and respiration Rv1733c 24.0 60.9 0.4 2.28E-02 cell wall and cell processes Rv1749c 767.2 404.9 1.9 2.84E-02 cell wall and cell processes Rv1778c 1019.2 404.1 2.5 2.14E-03 conserved hypotheticals Rv1781c malQ 211.9 94.3 2.2 3.08E-02 intermediary metabolism and respiration Rv1895 275.9 74.4 3.7 7.17E-05 intermediary metabolism and respiration Rv1912c fadB5 969.7 451.6 2.1 1.65E-02 lipid metabolism Rv1918c PPE35 536.3 1278.3 0.4 1.51E-02 PE/PPE Rv1925 fadD31 728.0 2531.8 0.3 3.80E-05 lipid metabolism Rv1926c mpt63 12021.3 3363.3 3.6 4.29E-02 cell wall and cell processes Rv1929c 1032.0 430.0 2.4 1.11E-02 conserved hypotheticals Rv1979c 250.3 540.7 0.5 1.28E-02 cell wall and cell processes Rv1980c mpt64 914.6 2468.6 0.4 1.28E-02 cell wall and cell processes Rv1981c nrdF1 4224.9 971.0 4.4 1.10E-07 information pathways Rv2051c ppm1 1963.9 901.7 2.2 3.97E-02 cell wall and cell processes Rv2063 mazE7 1210.6 77.3 15.7 2.38E-03 virulence, detoxification, adaptation Rv2063A mazF7 153.6 31.1 4.9 8.05E-07 virulence, detoxification, adaptation Rv2080 lppJ 358.4 64.7 5.5 5.36E-09 cell wall and cell processes Rv2090 224.9 92.5 2.4 2.40E-02 information pathways Rv2144c 4074.8 1755.5 2.3 1.56E-02 cell wall and cell processes Rv2161c 62.6 802.9 0.1 1.50E-13 intermediary metabolism and respiration Rv2189c 51.9 159.3 0.3 6.10E-04 conserved 
hypotheticals Rv2211c gcvT 2460.5 1131.1 2.2 3.61E-02 intermediary metabolism and respiration Rv2243 fabD 1747.5 756.7 2.3 3.44E-03 lipid metabolism Rv2274A mazE8 62.8 19.4 3.2 8.15E-03 virulence, detoxification, adaptation Rv2331 182.0 59.7 3.0 4.86E-03 conserved hypotheticals Rv2428 ahpC 1089.1 326.6 3.3 6.31E-05 virulence, detoxification, adaptation Rv2429 ahpD 307.3 129.9 2.4 1.49E-02 virulence, detoxification, adaptation Rv2478c 171.7 54.6 3.1 3.58E-02 conserved hypotheticals Rv2497c bkdA 2822.7 1525.3 1.9 3.58E-02 intermediary metabolism and respiration Rv2518c ldtB 1508.5 383.2 3.9 2.99E-03 cell wall and cell processes

270 Appendix E: RNA-seq differential expression

Rv2525c 1076.0 520.7 2.1 1.77E-02 conserved hypotheticals Rv2526 vapB17 233.2 1505.4 0.2 3.34E-09 virulence, detoxification, adaptation Rv2527 vapC17 60.2 553.3 0.1 3.02E-10 virulence, detoxification, adaptation Rv2528c mrr 113.5 49.2 2.3 1.37E-02 information pathways Rv2573 50.5 14.5 3.5 2.48E-03 conserved hypotheticals Rv2596 vapC40 215.1 96.2 2.2 4.57E-02 virulence, detoxification, adaptation Rv2697c dut 2014.8 642.7 3.1 6.63E-04 intermediary metabolism and respiration Rv2707 1353.7 592.0 2.3 1.02E-02 conserved hypotheticals Rv2719c 358.6 130.3 2.8 3.14E-04 cell wall and cell processes Rv2729c 292.2 125.2 2.3 3.58E-02 cell wall and cell processes Rv2758c vapB21 599.5 242.1 2.5 1.37E-02 virulence, detoxification, adaptation Rv2765 413.4 55.8 7.4 2.21E-11 intermediary metabolism and respiration Rv2809 763.7 361.3 2.1 2.96E-02 conserved hypotheticals Rv2830c vapB22 290.5 101.5 2.9 1.81E-03 virulence, detoxification, adaptation Rv2843 215.4 88.0 2.4 2.71E-02 cell wall and cell processes Rv2870c dxr 1241.2 419.3 3.0 2.63E-03 intermediary metabolism and respiration Rv2938 drrC 708.9 264.5 2.7 2.99E-03 cell wall and cell processes Rv2952 1325.5 666.5 2.0 3.08E-02 intermediary metabolism and respiration Rv3082c virS 780.3 45.8 17.0 4.09E-20 regulatory proteins Rv3167c 92.0 24.8 3.7 5.19E-05 regulatory proteins Rv3168 1695.7 343.5 4.9 1.25E-06 conserved hypotheticals Rv3196A 134.0 342.0 0.4 3.75E-02 conserved hypotheticals Rv3198c uvrD2 510.1 1224.9 0.4 1.44E-02 information pathways Rv3233c 159.0 747.3 0.2 1.27E-06 lipid metabolism Rv3242c 19.2 54.7 0.4 2.81E-02 conserved hypotheticals Rv3350c PPE56 818.8 318.3 2.6 2.66E-02 PE/PPE Rv3366 spoU 42.5 201.8 0.2 4.35E-06 information pathways Rv3389c htdY 1481.1 504.3 2.9 2.07E-03 intermediary metabolism and respiration Rv3415c 307.7 89.3 3.4 4.15E-04 conserved hypotheticals Rv3435c 2667.0 826.7 3.2 4.09E-04 cell wall and cell processes Rv3446c 153.0 13.7 11.2 3.58E-06 conserved hypotheticals Rv3500c yrbE4B 1711.8 469.8 3.6 
5.19E-05 virulence, detoxification, adaptation Rv3540c ltp2 1111.1 355.3 3.1 6.08E-03 lipid metabolism Rv3652 PE_PGRS 1200.2 74.1 16.2 1.27E-16 PE/PPE 60 Rv3679 413.8 4344.3 0.1 2.55E-04 cell wall and cell processes Rv3680 570.5 2840.3 0.2 1.25E-06 cell wall and cell processes Rv3695 58.3 268.2 0.2 1.60E-05 cell wall and cell processes Rv3741c 35.7 10.3 3.5 1.85E-02 intermediary metabolism and respiration Rv3742c 90.7 23.8 3.8 1.05E-02 intermediary metabolism and respiration

271 Appendix E: RNA-seq differential expression

Rv3810 pirG 1826.3 879.2 2.1 2.61E-02 cell wall and cell processes Rv3812 PE_PGRS 219.3 86.9 2.5 6.39E-03 PE/PPE 62 Rv3829c 409.9 8412.8 0.05 6.32E-05 intermediary metabolism and respiration Rv3831 52.4 739.0 0.1 1.02E-02 conserved hypotheticals

B. Antisense transcription (N=56)

Gene | Name | Lineage 1 | Lineage 2 | Fold change | p-value | Functional category
Rv0213c | | 1389.4 | 539.0 | 2.6 | 3.64E-02 | intermediary metabolism and respiration
Rv0345 | | 241.8 | 787.6 | 0.3 | 3.82E-02 | conserved hypotheticals
Rv0354c | PPE7 | 168.2 | 600.4 | 0.3 | 1.04E-02 | PE/PPE
Rv0423c | thiC | 43.2 | 7.7 | 5.6 | 1.98E-02 | intermediary metabolism and respiration
Rv0440 | groEL2 | 154.3 | 14.0 | 11.0 | 2.59E-07 | virulence, detoxification, adaptation
Rv0470c | pcaA | 844.3 | 205.3 | 4.1 | 8.88E-04 | lipid metabolism
Rv0482 | murB | 19.2 | 67.9 | 0.3 | 4.10E-02 | cell wall and cell processes
Rv0524 | hemL | 31.2 | 123.2 | 0.3 | 3.95E-02 | intermediary metabolism and respiration
Rv0552 | | 744.1 | 86.1 | 8.6 | 8.11E-08 | conserved hypotheticals
Rv0557 | mgtA | 161.8 | 23.9 | 6.8 | 2.19E-04 | lipid metabolism
Rv0635 | hadA | 36.2 | 3.1 | 11.8 | 3.70E-02 | intermediary metabolism and respiration
Rv0682 | rpsL | 14.4 | 0.0 | inf | 2.29E-02 | information pathways
Rv0689c | | 56.2 | 243.1 | 0.2 | 1.81E-02 | conserved hypotheticals
Rv0842 | | 3.6 | 528.6 | 0.01 | 6.42E-06 | cell wall and cell processes
Rv0870c | | 209.3 | 32.2 | 6.5 | 1.52E-04 | cell wall and cell processes
Rv0874c | | 120.3 | 590.0 | 0.2 | 6.52E-04 | conserved hypotheticals
Rv0970 | | 30.3 | 109.0 | 0.3 | 3.82E-02 | cell wall and cell processes
Rv1087A | | 68.1 | 285.9 | 0.2 | 2.21E-02 | cell wall and cell processes
Rv1093 | glyA1 | 119.7 | 8.2 | 14.7 | 5.96E-03 | intermediary metabolism and respiration
Rv1253 | deaD | 15.1 | 624.0 | 0.02 | 2.67E-16 | information pathways
Rv1453 | | 73.4 | 222.7 | 0.3 | 4.10E-02 | regulatory proteins
Rv1477 | ripA | 243.7 | 87.4 | 2.8 | 4.77E-02 | virulence, detoxification, adaptation
Rv1505c | | 31.1 | 121.8 | 0.3 | 1.98E-02 | conserved hypotheticals
Rv1567c | | 109.9 | 353.7 | 0.3 | 2.30E-02 | cell wall and cell processes
Rv1700 | | 161.4 | 3.2 | 49.7 | 2.50E-02 | information pathways
Rv1898 | | 10.6 | 138.7 | 0.1 | 1.09E-02 | conserved hypotheticals
Rv1900c | lipJ | 27.9 | 99.7 | 0.3 | 4.62E-02 | intermediary metabolism and respiration
Rv1926c | mpt63 | 48.9 | 362.4 | 0.1 | 3.06E-05 | cell wall and cell processes
Rv1982c | vapC36 | 474.8 | 92.7 | 5.1 | 3.68E-05 | virulence, detoxification, adaptation
Rv2038c | | 68.8 | 13.1 | 5.2 | 2.29E-02 | cell wall and cell processes
Rv2228c | | 38.6 | 9.6 | 4.0 | 3.95E-02 | information pathways
Rv2247 | accD6 | 18.7 | 2.8 | 6.8 | 4.92E-02 | lipid metabolism
Rv2397c | cysA1 | 54.3 | 210.9 | 0.3 | 1.36E-02 | cell wall and cell processes
Rv2413c | | 119.1 | 406.0 | 0.3 | 4.10E-02 | conserved hypotheticals
Rv2528c | mrr | 74.9 | 1055.9 | 0.1 | 1.17E-09 | information pathways
Rv2671 | ribD | 242.3 | 2.9 | 82.2 | 2.67E-16 | intermediary metabolism and respiration
Rv2672 | | 365.0 | 44.4 | 8.2 | 1.01E-07 | intermediary metabolism and respiration
Rv2724c | fadE20 | 13.3 | 95.5 | 0.1 | 9.45E-05 | lipid metabolism
Rv2831 | echA16 | 283.7 | 8.2 | 34.5 | 3.86E-04 | lipid metabolism
Rv2995c | leuB | 26.5 | 95.3 | 0.3 | 3.70E-02 | intermediary metabolism and respiration
Rv3078 | hab | 1.8 | 18.9 | 0.1 | 2.76E-02 | intermediary metabolism and respiration
Rv3143 | | 26.4 | 139.3 | 0.2 | 5.46E-03 | regulatory proteins
Rv3196A | | 7.1 | 39.2 | 0.2 | 4.73E-02 | conserved hypotheticals
Rv3209 | | 84.0 | 317.2 | 0.3 | 2.50E-02 | conserved hypotheticals
Rv3216 | | 92.6 | 436.9 | 0.2 | 4.10E-02 | intermediary metabolism and respiration
Rv3235 | | 417.0 | 1729.1 | 0.2 | 5.96E-03 | conserved hypotheticals
Rv3254 | | 63.0 | 235.8 | 0.3 | 3.42E-02 | conserved hypotheticals
Rv3290c | lat | 51.8 | 188.7 | 0.3 | 3.70E-02 | intermediary metabolism and respiration
Rv3587c | | 115.4 | 321.9 | 0.4 | 4.45E-02 | cell wall and cell processes
Rv3652_m | | 280.5 | 11.0 | 25.6 | 9.54E-05 | sRNA pr
Rv3673c | | 23.8 | 2.2 | 11.0 | 1.98E-02 | intermediary metabolism and respiration
Rv3708c | asd | 108.0 | 1.7 | 64.3 | 2.29E-02 | intermediary metabolism and respiration
Rv3797 | fadE35 | 39.9 | 139.8 | 0.3 | 3.95E-02 | lipid metabolism
Rv3830c | | 46.1 | 223.6 | 0.2 | 1.04E-02 | regulatory proteins
Rv3832c | | 115.7 | 2393.6 | 0.05 | 1.10E-02 | conserved hypotheticals
Rv3842c | glpQ1 | 2852.1 | 883.9 | 3.2 | 3.03E-02 | intermediary metabolism and respiration

C. sRNA transcription (N=3)

sRNA | Lineage 1 | Lineage 2 | Fold change | p-value | Functional category
MTS0900 | 326.8 | 2678.2 | 0.1 | 3.89E-02 | NA
MTS1338 | 71.5 | 911.3 | 0.1 | 1.10E-02 | NA
MTS2458 | 108.5 | 364.2 | 0.3 | 3.70E-02 | NA
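The fold-change values in the Appendix E tables appear to be the ratio of the Lineage 1 value to the Lineage 2 value, with "inf" reported where the Lineage 2 value is zero. This reading is an inference from the tabulated numbers, not something the caption states. A minimal Python sketch under that assumption follows; the function name `fold_change` is hypothetical. Because the tabulated values are rounded, recomputed ratios can differ slightly from the printed ones (Rv0620 gives 37.7 here against 39.3 in the table).

```python
import math


def fold_change(lineage1: float, lineage2: float) -> float:
    """Ratio of Lineage 1 to Lineage 2 expression (assumed definition).

    Returns inf when Lineage 2 is zero and Lineage 1 is positive,
    mirroring the 'inf' entry in the antisense table.
    """
    if lineage2 == 0:
        return math.inf if lineage1 > 0 else math.nan
    return lineage1 / lineage2


# A few rows copied from the tables: (gene, Lineage 1, Lineage 2)
rows = [
    ("Rv0027", 76.2, 28.5),    # sense table
    ("Rv1397c", 86.0, 862.7),  # sense table, higher in Lineage 2
    ("Rv0682", 14.4, 0.0),     # antisense table, Lineage 2 value of zero
]

for gene, l1, l2 in rows:
    print(gene, round(fold_change(l1, l2), 1))
```

Under this definition, values above 1 indicate higher expression in Lineage 1 and values below 1 indicate higher expression in Lineage 2.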


Appendix F

Functional category representation for differentially expressed genes. Toxin-antitoxins were found to be significantly over-represented.

Functional class | Number of genes annotated in genome | Differentially expressed genes | Differentially expressed genes (%) | Representation (fold change from expected) | χ2 adjusted p-value
information pathways | 243 | 5 | 4.5 | 0.7 | 0.97
intermediary metabolism and respiration | 925 | 20 | 17.9 | 0.8 | 0.47
PE/PPE | 168 | 4 | 3.6 | 0.9 | 0.97
regulatory proteins | 198 | 5 | 4.5 | 0.9 | 0.97
conserved hypotheticals | 1042 | 27 | 24.1 | 0.9 | 0.97
lipid metabolism | 271 | 9 | 8.0 | 1.2 | 0.47
cell wall and cell processes | 773 | 27 | 24.1 | 1.3 | 0.97
virulence, detoxification, adaptation | 112 | 4 | 3.6 | 1.3 | 0.97
unknown | 16 | 1 | 0.9 | 2.2 | 0.97
toxin-antitoxins | 124 | 10 | 8.9 | 2.9 | 0.03
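The percentage and representation columns of the Appendix F table can be approximated from the two count columns. The Python sketch below assumes that representation is the observed number of differentially expressed genes divided by the number expected from the class's share of the genome, and that the genome total is the sum of the listed classes. Both are assumptions, and `summarise` is a hypothetical helper name. With these assumptions toxin-antitoxins come out at about 2.8-fold rather than the tabulated 2.9, which suggests the thesis used a somewhat different genome total. The χ2 test and its p-value adjustment are not reproduced here.

```python
# (genes annotated in genome, differentially expressed genes) per functional
# class, copied from the Appendix F table.
counts = {
    "information pathways": (243, 5),
    "intermediary metabolism and respiration": (925, 20),
    "PE/PPE": (168, 4),
    "regulatory proteins": (198, 5),
    "conserved hypotheticals": (1042, 27),
    "lipid metabolism": (271, 9),
    "cell wall and cell processes": (773, 27),
    "virulence, detoxification, adaptation": (112, 4),
    "unknown": (16, 1),
    "toxin-antitoxins": (124, 10),
}

# ASSUMPTION: totals are the column sums; the thesis may have used another
# genome total, so recomputed representation values can differ slightly.
genome_total = sum(genome for genome, _ in counts.values())
de_total = sum(de for _, de in counts.values())


def summarise(genome_count: int, de_count: int) -> tuple[float, float]:
    """Return (percent of DE genes, fold representation versus expected)."""
    percent = 100.0 * de_count / de_total
    expected = de_total * genome_count / genome_total
    return percent, de_count / expected


for name, (genome, de) in counts.items():
    pct, rep = summarise(genome, de)
    print(f"{name}: {pct:.1f}% of DE genes, {rep:.1f}x expected")
```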


Appendix G

List of Publications

Rose, G., Cortes, T., Comas, I., Coscolla, M., Gagneux, S. & Young, D. B. (2013). Mapping genotype-phenotype diversity amongst clinical isolates of Mycobacterium tuberculosis by sequence based profiling. Under review.

Cortes, T., Schubert, O., Rose, G., Arnvig, K. B., Comas, I., Aebersold, R. & Young, D. B. (2013). Genome-wide mapping of transcriptional start sites defines an extensive leaderless transcriptome in Mycobacterium tuberculosis. Under review.

Kato-Maeda, M., Ho, C., Passarelli, B., Banaei, N., Grinsdale, J., Flores, L., Anderson, J., Murray, M., Rose, G., Kawamura, L. M., Pourmand, N., Tariq, M. A., Gagneux, S. & Hopewell, P. C. (2013). Use of Whole Genome Sequencing to Determine the Microevolution of Mycobacterium tuberculosis during an Outbreak. PLoS One 8(3) e58235.

Muller, B., Borrell, S., Rose, G. & Gagneux, S. (2012). The heterogeneous evolution of multidrug-resistant Mycobacterium tuberculosis. Trends Genet. 29(3) 160-9.

Comas, I., Borrell, S., Roetzer, A., Rose, G., Malla, B., Kato-Maeda, M., Galagan, J., Niemann, S. & Gagneux, S. (2011). Whole-genome sequencing of rifampicin-resistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA polymerase genes. Nat Genet 44(1) 106-110.

Arnvig, K. B., Comas, I., Thomson, N. R., Houghton, J., Boshoff, H. I., Croucher, N. J., Rose, G., Perkins, T. T., Parkhill, J., Dougan, G. & Young, D. B. (2011). Sequence-based analysis uncovers an abundance of non-coding RNA in the total transcriptome of Mycobacterium tuberculosis. PLoS Pathog 7(11) e1002342.
