A genomic and transcriptomic study of lineage-specific variation in Mycobacterium tuberculosis
Graham David Rose
Thesis submitted for the degree of Doctor of Philosophy
2013
MRC National Institute for Medical Research
Declaration
I, Graham David Rose, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.
Signed………………………………………….Date……………………………………..
The thesis work was conducted from September 2009 to March 2013 at the MRC National Institute for Medical Research (NIMR), London, UK, under the supervision of Douglas Young (NIMR, London) and Sebastien Gagneux (Swiss Tropical and Public Health Institute, Switzerland).
Abstract
Human tuberculosis (TB) is caused by several closely related species of bacteria collectively known as the Mycobacterium tuberculosis complex (MTBC). In this thesis the identification and effect of lineage-specific genetic variation within the phylogenetic lineages of the MTBC were investigated using a combination of computational methods and high-throughput sequencing technology.
Genome sequencing has now identified an extensive repertoire of single nucleotide polymorphisms (SNPs) amongst clinical isolates of the MTBC. Comparative analysis focused on the detection of all lineage-specific SNPs, providing the first glimpse of the total SNP diversity that separates the main phylogenetic lineages from each other. Bioinformatic analysis focused on SNPs more likely to contribute to functional diversity, which predicted nearly half of all SNPs in the MTBC to have functional consequences, while SNPs within regulatory proteins were over-represented. To determine whether these and other lineage-specific SNPs lead to phenotypic diversity, genome datasets were integrated with RNA-sequencing to assess their impact on the comparative transcriptome profiles of strains belonging to two MTBC lineages. Analysing the transcriptomes in the light of the underlying genetic variation found clear correlations between genotype and transcriptional phenotype. These arose by three mechanisms. First, lineage-specific changes in amino acid sequence of transcriptional regulators were associated with alterations in their ability to control gene expression. Second, changes in nucleotide sequence were associated with alteration of promoter activity and generation of novel transcriptional start sites in intergenic regions and within coding sequences. Finally, genes showing lineage-specific patterns of differential expression not linked directly to primary mutations were characterised by a striking over-representation of toxin-antitoxin pairs.
Acknowledgements
This thesis would not have been possible without the efforts of my colleagues and friends. Firstly, I would like to thank my PhD supervisors Sebastien Gagneux and Douglas Young for their support and guidance throughout my project, and for sharing their invaluable depth of knowledge and resources. Of special note were the annual Gagneux group retreats in Charmey and Les Diablerets, which always provided a healthy mix of stimulating scientific discussions about my projects and great food, including of course the meringue et la crème double. I am grateful to my three thesis supervisors, Delmiro Fernandez-Reyes, Roger Buxton and Seb, who were a great help in contextualising my ideas and providing a focus. My thesis relied heavily on sequence data, and as such I thank Abdul Sesay and the rest of the High Throughput Sequencing group at NIMR for performing the Illumina sequencing. Next I would like to thank Iñaki Comas, who was always happy to answer my questions on evolutionary theory and phylogenomics, and to provide more general daily support on all things computational. I also thank the other original member of the Gagneux group at NIMR, Sonia Borrell, particularly for her help in getting me up and running in the lab at the start, and then the current members of Douglas Young’s group, including Kristine Arnvig, for her guidance on the RNA side of my project, and Steve Coade, who was my Biosafety Containment Level 3 trainer for the first six months of my PhD. My time at NIMR would not have been as enjoyable without my colleagues and friends Christina Kahramanoglou and Teresa Cortés Méndez, and to Teresa, I am indebted to you for your support in keeping me focused and all things in perspective during the final few months. I apologise that, despite your efforts and the past efforts of the Spanish contingent of the group, my vocabulary is still quite limited in your language. One day!
Of course I am grateful to my parents, who provided me with their untiring support to undertake my studies throughout the years, and to my brother Phil for his advice and the countless Sunday lunches in Balham. Finally I am grateful to the Medical Research Council (MRC) for their funding, who supported not only my university costs and living expenses for the last three and a half years, but the research of many of my colleagues as well. Thank you.
Contents

Declaration
Abstract
Acknowledgements
List of Figures
List of Tables
Glossary

Chapter 1 Introduction
1.1 The genus Mycobacterium
1.1.1 Taxonomy
1.1.2 The Mycobacterium tuberculosis complex (MTBC)
1.1.3 TB disease in humans
1.1.4 Disease diversity
1.2 Genetic diversity in the MTBC
1.2.1 General features of the M. tuberculosis genome
1.2.2 Typing the MTBC
1.2.3 The phylogenetic lineages of the MTBC
1.2.4 Origin of the MTBC
1.2.5 Selective pressures acting within the MTBC
1.3 Phenotypic diversity
1.3.1 Laboratory strains
1.3.2 Clinical strain phenotype
1.4 Linking genotype to phenotype
1.4.1 In silico prediction of functional SNPs
1.4.2 Gene expression diversity
1.4.3 High throughput DNA sequencing technology
1.5 Thesis Outline

Chapter 2 Materials and Methods
2.1 General microbiological methods
2.1.1 Containment 3 laboratory
2.1.2 General chemicals and reagents
2.1.3 Bacterial culture and storage
2.1.4 Growth curves
2.2 Molecular biology techniques
2.2.1 Genomic DNA extraction
2.2.2 RNA Isolation and handling
2.2.3 Quantification of DNA and RNA by Nanodrop
2.2.4 Determination of DNA and RNA integrity by microfluidics
2.2.5 Removal of DNA contamination from RNA samples
2.2.6 Polymerase chain reaction (PCR)
2.3 Materials
2.3.1 Mycobacterium tuberculosis strains
2.4 DNA-seq
2.5 RNA-seq
2.5.1 Strand specific RNA-seq libraries
2.5.2 TSS 5’ enriched RNA-seq libraries
2.6 Illumina sequencing DNA (genome) and cDNA (RNA-seq) libraries
2.7 Quantitative RT-PCR
2.7.1 Primer sequences
2.8 MTBC annotation datasets
2.8.1 Coding sequence annotations
2.8.2 Functional Categories
2.8.3 Essential M. tuberculosis genes
2.9 Bioinformatics software
2.9.1 Artemis
2.9.2 Quality control of raw RNA-sequencing data
2.9.3 Transcriptome mapping software
2.9.4 Calculation of mapped read frequencies per feature region
2.9.5 R
2.9.6 Perl scripts
2.9.7 GraphPad Prism 5.0

Chapter 3 Lineage-specific SNPs
3.1 Introduction
3.1.1 Aims
3.2 Materials and Methods
3.2.1 Genome collection used in study
3.2.2 Genome sequencing
3.2.3 Mapping genome sequences
3.2.4 Phylogenetic analysis
3.2.5 Categorising SNPs
3.2.6 dN/dS calculation
3.3 Results
3.3.1 A globally representative 28-genome human-adapted MTBC phylogeny
3.3.2 Identification of all lineage-specific SNPs
3.3.3 Distribution of SNPs
3.3.4 Monomorphic population structure and homoplasic SNPs
3.3.5 Creation of pseudogenes
3.3.6 SNPs within genes associated with antibiotic resistance
3.3.7 Conservation and removal of lineage-specific nonsynonymous SNPs
3.4 Discussion
3.4.1 Strengths and limitations of this study
3.4.2 General characteristics of lineage-specific diversity
3.4.3 Insights into the evolution of M. tuberculosis lineages

Chapter 4 In silico prediction of functional Single Nucleotide Polymorphisms
4.1 Introduction
4.1.1 Aims
4.2 Materials and Methods
4.2.1 SIFT
4.2.2 Indels
4.2.3 Homology modelling
4.2.4 Change in protein stability
4.3 Results
4.3.1 Predicting functional SNPs within control set
4.3.2 Predicted functional nonsynonymous SNPs
4.3.3 Impact of nonsynonymous SNPs outside of the human adapted MTBC
4.3.4 Clustering of functional SNPs
4.3.5 Functional category analysis of functional SNPs
4.3.6 Functional impairment of Lineage 1 and 2 regulatory proteins
4.4 Discussion
4.4.1 Strengths and limitations of the study
4.4.2 Validation of the SIFT method
4.4.3 Half of lineage-specific SNPs are predicted to have functional consequences

Chapter 5 Screening the effect of lineage-specific variation by sequence-based transcriptional profiling
5.1 Introduction
5.1.1 Aims
5.2 Methods
5.2.1 Clinical isolates in study
5.2.2 Cluster analysis
5.2.3 Differential expression analysis
5.2.4 Transcriptional Start Site (TSS) calling
5.3 Results
5.3.1 Growth rate in vitro
5.3.2 RNA isolation and Illumina ready libraries
5.3.3 Transcriptome sequencing
5.3.4 Mapping reads to the H37Rv genome
5.3.5 Identifying strain specific gene deletions
5.3.6 Clustering of strains at the total sample level
5.3.7 Clustering of strains by antisense expression
5.3.8 Testing for differential expression in RNA-seq data
5.3.9 Lineage-specific gene expression
5.3.10 Enrichment of toxin-antitoxins
5.4 Discussion
5.4.1 Strengths and limitations of the study
5.4.2 Lineage-specific expression
5.4.3 Linking genotype to phenotype at the transcriptional level

Chapter 6 Final discussion

References

Appendices A-G
Appendix A. genomeDeletions.pl
Appendix B. Lineage-specific SNPs
Appendix C. Lineage-specific SNPs within drug resistance associated genes
Appendix D. Nonsynonymous/synonymous SNP ratio
Appendix E. RNA-seq differential expression
Appendix F. Functional categories
Appendix G. Publications
List of Figures

Figure 1.1. Phylogenetic structure of the genus Mycobacterium
Figure 1.2. The most complete phylogeny of the human adapted MTBC
Figure 1.3. Distribution of the MTBC lineages globally
Figure 1.4. The number of MTBC genome sequences in the Short Read Archive
Figure 3.1. Neighbour-joining phylogeny for 28 human-adapted MTBC genomes
Figure 3.2. Within-lineage SNP diversity
Figure 3.3. Isolating lineage-specific SNPs from the phylogeny
Figure 3.4. Distribution of the lineage-specific SNPs across the genome
Figure 3.5. The average number of non-coding and coding lineage-specific SNPs
Figure 3.6. Distribution of lineage SNPs per gene
Figure 3.7. Homoplasic lineage SNPs
Figure 3.8. Change in protein length due to nonsense SNPs
Figure 3.9. Gene creation by nonsense SNPs
Figure 3.10. Lineage-specific SNPs within genes associated with drug resistance
Figure 3.11. The rate of nonsynonymous SNP accumulation by functional category
Figure 4.1. SIFT database phylogeny
Figure 4.2. SIFT predictions
Figure 4.3. Distribution of predicted functional SNPs per gene
Figure 4.4. Frequency distribution of predicted functional SNPs across genome
Figure 4.5. Functional category representation
Figure 4.6. Predicted loss of function of virS transcriptional regulator in Lineage 1
Figure 4.7. Spectrum of functional SNPs
Figure 5.1. Strains sequenced in RNA-seq study
Figure 5.2. In vitro growth curves
Figure 5.3. Quality control of RNA-seq samples by Bioanalyser
Figure 5.4. Distribution of quality scores for strain N0145
Figure 5.5. Circular plot of mapped RNA-seq data
Figure 5.6. Representation of transcriptome plot based on Artemis
Figure 5.7. Distribution of gene deletions in the six RNA-seq study strains
Figure 5.8. Distribution of gene deletions grouped by gene function category
Figure 5.9. Unsupervised hierarchical clustering of total gene expression
Figure 5.10. Relationship of genotypic to transcriptomic diversity
Figure 5.11. Correlation of SNP distance to gene expression
Figure 5.12. Unsupervised hierarchical clustering of total antisense expression
Figure 5.13. Venn diagram comparing differential expression methods
Figure 5.14. Heatmap of 112 differentially expressed genes
Figure 5.15. Differential expression of divergently regulated genes
Figure 5.16. Heat map of dosR regulon
Figure 5.17. Duplication of dosR region
Figure 5.18. DosR regulon and SNP-associated TSS
Figure 5.19. SNP-associated TSS leading to differential gene expression
Figure 5.20. SNP-associated TSS leading to differential antisense expression
Figure 5.21. Over-representation of differentially expressed toxin-antitoxins
Figure 5.22. Validation of select RNA-seq differentially expressed toxin-antitoxins
Figure 5.23. Rates of the types of nucleotide mutations across
List of Tables

Table 2.1. Primer sequences used in the qRT-PCR study
Table 3.1. Twenty eight strains used in this study
Table 3.2. Estimates of evolutionary divergence between strains
Table 3.3. Summary of lineage-specific SNPs
Table 3.4. Homoplasic nucleotide positions within the lineage branches
Table 3.5. Variable genomic positions within the lineages
Table 3.6. Nonsense SNPs
Table 3.7. Nonsense SNPs by lineage
Table 3.8. Nonsense SNPs grouped by functional category
Table 3.9. Mutations found in drug resistance studies associated with drug resistance
Table 3.10. The rate of nonsynonymous SNP accumulation across the lineages
Table 3.11. The rate of nonsynonymous SNP accumulation by functional category
Table 4.1. SIFT database of non-MTBC species
Table 4.2. Predicted tolerated and functional SNPs using SIFT
Table 4.3. Functional category representation
Table 4.4. Transcriptional regulators with predicted functional mutations
Table 4.5. Regulatory proteins with predicted functional mutations in Lineage 1 and 2
Table 5.1. Lineage 1 and 2 strain used in the RNA-seq study
Table 5.2. Additional strains used in growth curve experiment
Table 5.3. Additional strains used in qRT-PCR confirmation
Table 5.4. In vitro growth rates
Table 5.5. Details of exponential phase transcriptomes used in differential expression analysis
Table 5.6. Transcriptomes used in TSS mapping
Table 5.7. Differential expression associated with lineage-specific amino acid mutations SNPs
Table 5.8. Ten differentially expressed genes associated with a change in promoter sequences
Table 5.9. Nine differentially expressed antisense associated with introduction of SNP-associated TSS
Table 5.10. Ten differentially expressed toxin-antitoxins (TA)
Glossary

∆∆G  change in Gibbs free energy
-10  Pribnow box
CCAL  creative commons attribution license
cDNA  complementary DNA
dt  doubling time
DNA  deoxyribonucleic acid
DNA-seq  DNA-sequencing
g  gram
GA  Genome Analyser
Gb  gigabase
HS  HiSeq2000
HTH  helix-turn-helix
indel  insertion/deletion
LSP  large sequence polymorphism
Mb  megabase
mg  milligram
ml  millilitre
MLSA  multilocus sequence analysis
mRNA  messenger RNA
MTBC  Mycobacterium tuberculosis complex
nt  nucleotide
OD  optical density
PCR  polymerase chain reaction
PDB  protein data bank
PE  proline-glutamic acid
PPE  proline-proline-glutamic acid
PGRS  polymorphic glycine rich sequence
qRT-PCR  quantitative realtime-PCR
RD  region of difference
RNA  ribonucleic acid
RNA-seq  RNA-sequencing
RPKM  reads per kilobase per million mapped reads
rRNA  ribosomal RNA
sd  standard deviation
SNP  single nucleotide polymorphism
SEM  standard error of the mean
sRNA  small RNA
TA  toxin-antitoxin
TSS  transcriptional start site
µg  microgram
µl  microlitre
UTR  untranslated region
VST  variance stabilising transformation
HGT  horizontal gene transfer
TbD1  M. tuberculosis specific deletion 1
HMM  Hidden Markov model
VCF  variant call format
GTF  gene transfer format
χ²  chi-square test
Chapter 1 Introduction
Tuberculosis (TB) is caused by several closely related species of bacteria collectively known as the Mycobacterium tuberculosis complex (MTBC) (Cole et al., 1998). The most infamous member of the MTBC is the human-adapted pathogen Mycobacterium tuberculosis, the etiological agent of human TB along with Mycobacterium africanum, a phylogenetic variant limited to West Africa (de Jong et al., 2010). Together these species are regarded as the human-adapted members of the MTBC. Today, TB causes more adult deaths than any single infectious disease other than HIV/AIDS, and TB is itself the greatest cause of mortality in those infected with HIV (WHO, 2012). It is estimated that nine million new TB cases and over one million deaths from TB currently occur each year (WHO, 2012). In addition to active cases of TB, two billion people have a latent infection, effectively acting as a reservoir of active TB cases for several decades to come (Barry et al., 2009).
Historically TB is an ancient disease (Donoghue et al., 2004). Early cultural references date back to classical Greek times (Daniel, 1997), when Hippocrates used the term “phthisis” to describe active TB in individuals (Coar, 1982). Ancient M. tuberculosis DNA has been isolated from mummies found in Egypt (Nerlich et al., 1997) and South America (Salo et al., 1994). More recently, molecular genetics and the advent of sequencing technologies have facilitated more rigorous dating of M. tuberculosis and other MTBC members; low estimates range from 15,000 to 20,000 years (Sreevatsan et al., 1997a), but more recently 70,000 years or more has been suggested (Hershberg et al., 2008). TB has therefore been a burden on humans for a long time, possibly since the migration of modern humans out of Africa (Hershberg et al., 2008). Recent analyses of MTBC evolution, largely driven by the advances in sequencing technology (Loman et al., 2012), have revealed a global picture of human MTBC strain variation, consisting of six major phylogenetic lineages that display strong geographic structure (Gagneux & Small, 2007; Hershberg et al., 2008) and a rare seventh lineage recently discovered in the Horn of Africa (Firdessa et al., 2013). This has called into question prior assumptions that variation in the MTBC was negligible and of no clinical significance (Musser et al., 2000; Sreevatsan et al., 1997a), whilst bringing to the forefront the identification and potential effects of genetic variation, and the future trajectory of the disease (Comas & Gagneux, 2009; Hershberg et al., 2008; Homolka et al., 2010). New opportunities now exist to study how the evolution of the MTBC has resulted in functional consequences in the lineages of the MTBC at the definitive resolution: the level of DNA and RNA. It is these opportunities that shall be explored in this thesis.
1.1 The genus Mycobacterium
A genus of Actinobacteria, Mycobacteria are distinctive rod-shaped bacteria that are characterised by high GC content, and complex lipid-rich cell walls (Madigan et al., 2003). This physical property of the cell wall was exploited in 1882 by Koch, who stained M. tuberculosis with alkaline methylene blue and a Bismarck brown stain for surrounding tissue (Ellis & Zabrowarny, 1993). In the same year the Ziehl-Neelsen stain was developed, which used a similar process to identify acid-fast bacteria, and is still used today to identify mycobacteria (Parish & Stoker, 2001).
1.1.1 Taxonomy
A working taxonomy for Mycobacteria was established 50 years ago, with original classifications based on growth rate, pigmentation and clinical significance (Stahl & Urbance, 1990). A fundamental division can be made based on growth rate, splitting Mycobacteria into two major groups, fast and slow growers. The fast growers include mainly opportunistic or non-pathogenic mycobacteria, such as Mycobacterium smegmatis, which can be cultured from dilute inocula within a week. In contrast, the slow growing species can take several weeks to show visible growth from dilute inocula. This group includes M. tuberculosis, Mycobacterium bovis and Mycobacterium leprae, the causative agents of human TB, bovine TB and leprosy, respectively. Modern molecular biology techniques based on 16S rRNA have revealed the macro population structure of mycobacteria (Gutierrez et al., 2005; Stahl & Urbance, 1990). The phylogenetic structure of mycobacteria based on this method is shown in Figure 1.1, and of note is the position of the MTBC together with the smooth tubercle bacilli, which include Mycobacterium canettii; it is hypothesised that the MTBC originated from an ancestral pool of smooth tubercle-like bacilli (Gutierrez et al., 2005; Supply et al., 2013).
Figure 1.1. Phylogenetic structure of the genus Mycobacterium. The neighbor-joining tree is based on 16S sequences from seventeen smooth mycobacterial and MTBC strains. The blue triangle indicates the MTBC. Bootstrap support higher than 90% is shown on nodes. The scale bar represents pairwise distances after Jukes-Cantor correction. Adapted from Gutierrez et al. (2005). Image reproduced under the Creative Commons Attribution License (CCAL).
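The Jukes-Cantor correction named in the caption converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3), allowing for multiple substitutions at the same site. A minimal sketch of that standard formula is given below; the function name and the example value are illustrative only and are not taken from Gutierrez et al. (2005).

```python
import math


def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed proportion p of
    differing sites, using the Jukes-Cantor model (equal base frequencies
    and equal substitution rates are assumed by the model)."""
    if not 0.0 <= p < 0.75:
        # The correction is undefined once p reaches the saturation
        # value of 0.75 expected for unrelated random sequences.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Illustrative value only: 10% observed differences between two sequences.
corrected = jukes_cantor_distance(0.10)
```

For very small p, as within the MTBC, the corrected distance is almost identical to p; the correction only becomes substantial for more divergent sequences.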
1.1.2 The Mycobacterium tuberculosis complex (MTBC)
The MTBC is used as an umbrella term to group the closely related mycobacteria that cause TB (Cole et al., 1998). Early sequencing of mycobacteria from the MTBC showed that they share more than 99.9% sequence identity (Sreevatsan et al., 1997a), as demonstrated by the collapsed branches in Figure 1.1 for the MTBC members. However, despite this close relatedness, members of the MTBC display different phenotypic characteristics and mammalian host ranges; as described above, MTBC members M. tuberculosis and M. africanum are the primary causes of TB in humans.
The MTBC includes several other species and sub-species that are adapted to various hosts, including both wild and domestic animal species; these bacterial variants have been referred to as “ecotypes” (Smith et al., 2006b). Here an ecotype is defined as a set of strains using the same or similar ecological resources (Cohan, 2002). The host of M. bovis is largely cattle, and the pathogen is of considerable agricultural significance due to the associated cost of bovine TB, estimated globally at $3 billion per year (Garnier et al., 2003). M. bovis can also cause TB in humans through the consumption of unpasteurised milk (de la Rua-Domenech, 2006; Grange, 2001). Fortunately, modern food practices have effectively stopped this transmission route, and person-to-person transmission of M. bovis is rare (Evans et al., 2007; Grange, 2001). Other animal adapted pathogens include Mycobacterium microti (infects voles), Mycobacterium caprae (infects sheep and goats) and Mycobacterium pinnipedii (infects seals and sea lions). An MTBC pathogen of Dassies, or Rock Hyrax, has been isolated in South Africa and named the Dassie bacillus (Parsons et al., 2008), whilst more recently an MTBC pathogen of banded mongooses, named Mycobacterium mungi, has been identified in Botswana (Alexander et al., 2010). It is anticipated that MTBC members of other ecotypes will be identified in future studies.
A special member of the MTBC is M. canettii, a rare tubercle bacillus with an unusual smooth colony phenotype, unlike the classical rough appearance of other MTBC members (van Soolingen et al., 1997). M. canettii and the other smooth TB bacilli harbor greater genetic diversity compared with the rest of the MTBC, and are more distantly related to the remaining MTBC than any two other MTBC strains are to each other (Gutierrez et al., 2005). M. canettii is consequently a common choice as an outgroup in phylogenetic analysis (Bentley et al., 2012; Comas et al., 2010). Horizontal recombination events are another feature of the M. canettii genome (Supply et al., 2013), which is in stark contrast to the rest of the MTBC where no significant signs of recombination are seen (Hirsh et al., 2004; Supply et al., 2003).
1.1.3 TB disease in humans
M. tuberculosis and M. africanum, which together make up the human adapted members of the MTBC, are the etiological agents of TB in humans. TB infection in humans broadly follows an established pattern of events. Briefly, infectious bacilli are spread through droplet nuclei that can remain aerosolised for several hours. Following inhalation of the droplets the bacteria are phagocytosed by the host’s alveolar macrophages, which are then thought to invade the subtending epithelial layer of the lung (Russell et al., 2010); the infectious dose is estimated to be as low as a single bacterium. A primary site of infection is established, known as the Ghon focus, where a localised inflammatory response leads to recruitment of mononuclear cells from the neighboring blood vessels, providing fresh host cells for the bacteria to infect. The subsequent lesion, or granuloma, is a defining pathogenic feature of TB disease. Initially consisting of a mass of macrophages, neutrophils and monocytes, the granulomas eventually become stratified with recruitment of lymphocytes and develop a centre that is rich in lipids. At this stage an equilibrium with the host immune system is established in most individuals, which can persist from weeks to decades and is known as latent TB infection. In this latent state the host is asymptomatic and noninfectious. It is estimated that 95% of human-adapted MTBC infections follow this route into latency, an estimate based on evidence of immunological sensitisation by mycobacterial proteins in the absence of clinical signs and symptoms of active TB (Barry et al., 2009). In individuals with active TB, arising either from disease progression, which occurs in about 5% of cases, or from the reactivation of a latent infection, estimated to occur in 10% over a lifetime in HIV-negative individuals, the granuloma centre fills with caseous debris including necrotic macrophages.
This ultimately ruptures and releases thousands of infectious bacilli into the lungs and respiratory airways (Kaplan et al., 2003). A persistent productive cough develops, effectively aerosolising and spreading the bacilli to new hosts, and it is this late stage of active TB that contributes to tissue damage and pathogenesis. Bacilli can also escape into other tissues via the lymphatic system and bloodstream, and this is known as miliary or extrapulmonary TB. Rapid progression to active TB from an initial infection is more common in infants or immunocompromised persons, whilst reactivation of latent TB can be triggered by immunosuppression, of which the greatest identified cause is HIV infection (Ho et al., 1995).
1.1.4 Disease diversity
Although TB is clinically divided into active and latent forms, it is likely that this is a gross oversimplification, with TB infection following a continuous spectrum that ranges from sterilising immunity, through subclinical active disease, to active disease (Barry et al., 2009). Development of active disease is likely determined by multiple factors, including the host genotype, environmental factors, and bacterial genetics. On the human genetics side, SNPs have been identified that determine susceptibility of an individual to TB using genome-wide linkage analysis (Bellamy et al., 2000). In addition to environmental influences, strain variation in the MTBC is now also thought to play a role in the outcome of TB infection and disease (Coscolla & Gagneux, 2010). The ability of MTBC strains to elicit an immune response was recently explored by Portevin et al., who used a monocyte-derived macrophage model to study the innate immune response to twenty-eight diverse clinical MTBC strains (Portevin et al., 2011). It was shown that macrophages infected with different strains differed in the levels of cytokines and chemokines produced; infections by a group of strains that belong to the modern phylogenetic lineages produced lower levels of pro-inflammatory cytokines compared with strains from the ancient lineages (classification of modern and ancient lineages is discussed in detail below in section 1.2.3). Moving into a clinical setting, it has been shown that over the course of two years household contacts exposed to strains from the modern lineages were more likely to develop active disease than those exposed to strains from the ancient lineages (de Jong et al., 2008). Taking these findings together, Gagneux hypothesised that modern strains have developed an evolutionary strategy of increased virulence and shorter latency, possibly through adaptation to expanding human population sizes over the past few hundred years, which have provided more hosts for the MTBC pathogen (Gagneux, 2012).
In summary, it is likely that multiple factors play an important role in disease, with a complex interaction between the host, pathogen and environment (Comas & Gagneux, 2009). This study focuses on the pathogen side, and the following section introduces the genetic diversity and lineages of the MTBC.
1.2 Genetic diversity in the MTBC
1.2.1 General features of the M. tuberculosis genome
A seminal moment in mycobacterial research was the genome sequencing of the first strain of M. tuberculosis in 1998 (Cole et al., 1998). A canonical strain of TB research, M. tuberculosis H37Rv, was chosen in 1993 to be the first MTBC strain sequenced, and the genome was closed and finished over the next five years. It was shown that the single circular chromosome was 4,411,532 bp in length and contained just over 4,000 protein-coding genes. The annotated genome offered new insights into the biology and metabolism of the pathogen, with identification of large protein families related to fatty acid and polyketide biosynthesis, regulation, drug efflux pumps and transporters, and PE_PGRS proteins, a large duplicated family unique to the MTBC.
The genome is rich in repetitive DNA, such as IS6110 insertion sequences, and in multigene families and duplicated housekeeping genes (Cole et al., 1998). Sixteen copies of the IS6110 sequence and six copies of the more stable element IS1081 were found to reside within the genome of H37Rv. Due to the variable number of IS6110 elements in strains these were utilised in a DNA fingerprinting protocol which quickly evolved into the first international gold standard for genotyping of MTBC (van Embden et al., 1993). Typing of the MTBC in the context of strain diversity is discussed in the following section.
1.2.2 Typing the MTBC
Members of the MTBC are considered genetically monomorphic, with a high level of genomic sequence similarity and negligible horizontal gene transfer (Hirsh et al., 2004; Liu et al., 2006). As such, the MTBC displays a classic clonal population structure and evolves by descent (Achtman, 2008), so that mutations in a parental strain become defining markers for all of its progeny. As a consequence, many genotyping tools useful in other species do not transfer effectively to the MTBC (Achtman, 2008; Comas et al., 2009). The development of tools to measure genetic variation in the MTBC was the first step towards the robust framework needed to quantify genetic variation between strains, before secondary questions, such as the effect of strain variation on TB disease, could be asked. Before discussing the lineages of the MTBC it is first necessary to give a brief history of MTBC typing and of the evolution of tools that measure genetic diversity in a robust and definitive manner.
As introduced above, the early 1990s saw the establishment of IS6110 restriction fragment length polymorphism (RFLP) typing as the gold standard of the MTBC typing (van Embden et al., 1993). The method is based on strain differences in the IS6110 copy numbers, ranging from 0 to about 25, as well as the variability in the chromosomal positions of the insertion sequences. Large collections were subsequently typed and the first families of strains with a common genotype were uncovered in the MTBC (Van Soolingen, 2001). It was found that some strains were at a higher frequency and across a wider geographic area, suggesting differential success rates in terms of infection and geographical spread (Van Soolingen, 2001). Although non-sequence based tools including the above RFLP technique, and other methods such as Pulsed-Field Gel Electrophoresis (PFGE) are useful for typing of monomorphic bacteria at the fine scale, they have many drawbacks, including problems of reproducibility between laboratories (Achtman, 2008).
Sequence-based tools such as spoligotyping and MIRU-VNTR have largely replaced RFLP typing, and are currently the official gold standards for epidemiological typing of the MTBC (Supply et al., 2001). Spoligotyping is the mycobacterial name given to the clustered regularly interspaced short palindromic repeats (CRISPR) typing method, which is based on counting unique spacer regions between a series of direct repeats in the M. tuberculosis genome (Grissa et al., 2008). The second method, MIRU-VNTR or mycobacterial interspersed repetitive units variable number tandem repeats, classifies strains by comparison of strain-specific numbers of repeats of short DNA sequences at various genomic positions (Lindstedt, 2005). Databases have been built around the results of typing tens of thousands of patient isolates with these methods, such as SpolDB4 (Brudey et al., 2006) and MIRU-VNTRplus (Weniger et al., 2010). Although spoligotyping and MIRU-VNTR have been invaluable from an epidemiological view, they are not ideal for studying evolutionary questions as they are susceptible to convergent evolution. Convergent evolution describes the occurrence of the same genotype in two strains that is not due to descent, and this impacts the robustness of derived phylogenies (Comas et al., 2009). This scenario arises due to the limited number of loci on which the methods are based. A study by Comas et al. found that phylogenies built using either method had low discriminatory power and were incongruent with those based on a recent SNP-based typing method (Comas et al., 2009). It was therefore argued that for evolutionary studies the MTBC should be typed using robust SNP or large sequence polymorphism (LSP) markers (Comas et al., 2009).
Typing the MTBC by LSPs, or gene deletions, exploits the absence of horizontal gene transfer in the MTBC, which makes each deletion event unique and therefore a robust and informative phylogenetic marker. Whilst LSPs have been used to resolve the main lineages of the MTBC (Gagneux et al., 2006a; Reed et al., 2009), deletions are less abundant than SNPs, and LSP typing was also largely based on deletions found in the reference strain H37Rv, making SNPs the best choice for sampling MTBC diversity. To date numerous studies have utilised SNP markers to classify strains and explore the evolutionary history of the MTBC (Baker et al., 2004; Comas et al., 2010; Gagneux & Small, 2007; Hershberg et al., 2008). However, SNP analyses can suffer from the same problems as previous studies based on LSPs, such as using SNPs based on prior information, which can introduce a discovery bias, or simply using a non-representative set of strains. In 2008, Hershberg et al. used de novo sequencing of multiple genes from 108 global MTBC strains to identify novel SNPs and constructed the most complete phylogenetic tree of the MTBC (Hershberg et al., 2008). Subsequent whole genome sequencing of a smaller set of strains in 2010 defined the MTBC lineages at the highest possible resolution, the single nucleotide level (Comas et al., 2010).
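The principle that mutations arising in a parental strain become markers for all of its progeny can be illustrated with a minimal sketch. The Python below is an illustration only and not the method of any of the studies cited above: the strain names, SNP positions and lineage assignments are hypothetical, as is the function name, and a SNP is treated as lineage-specific when every strain of a lineage carries it and no strain outside that lineage does.

```python
def lineage_specific_snps(calls, lineage_of):
    """Return, for each lineage, the SNPs shared by all of its strains
    and absent from every strain of every other lineage.

    calls      -- dict mapping strain name to a set of SNP positions
    lineage_of -- dict mapping strain name to its lineage label
    """
    # Group the SNP sets of the strains by lineage.
    by_lineage = {}
    for strain, snps in calls.items():
        by_lineage.setdefault(lineage_of[strain], []).append(snps)

    result = {}
    for lineage, members in by_lineage.items():
        # SNPs present in every member of this lineage.
        shared = set.intersection(*members)
        # SNPs seen in any strain outside this lineage.
        outside = set()
        for strain, snps in calls.items():
            if lineage_of[strain] != lineage:
                outside |= snps
        result[lineage] = shared - outside
    return result


# Hypothetical toy data: positions are arbitrary integers.
toy_calls = {
    "strainA": {10, 20, 30},
    "strainB": {10, 20, 40},
    "strainC": {20, 50, 60},
    "strainD": {50, 70},
}
toy_lineages = {
    "strainA": "Lineage 2",
    "strainB": "Lineage 2",
    "strainC": "Lineage 4",
    "strainD": "Lineage 4",
}
print(lineage_specific_snps(toy_calls, toy_lineages))
```

In this toy example position 20 is shared by both Lineage 2 strains but is excluded because it also occurs in a Lineage 4 strain. With real data the outcome depends on how representative the strain set is, which is the discovery bias noted above.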
1.2.3 The phylogenetic lineages of the MTBC
The global population structure of the MTBC is defined by six main phylogenetic lineages, named Lineage 1 to 6 (Comas et al., 2010), although these have also been described by their geographic distribution and other naming schemes in previous studies (Filliol et al., 2003; Gagneux et al., 2006a; Hershberg et al., 2008). The largest phylogeny of global MTBC diversity is shown in Figure 1.2. Lineages are coloured based on previous deletion analysis in a global set of strains (Gagneux et al., 2006a), and the same colouring scheme is continued throughout this thesis. The phylogeny is based on a multilocus sequence analysis (MLSA) of SNPs identified from the sequencing of 89 genes in 108 MTBC strains (Hershberg et al., 2008). The MLSA also included seven animal-adapted strains, which were all shown to cluster within one of the M. africanum lineages (Lineage 6). Of special note is the Beijing sub-lineage of Lineage 2, which is of interest in the context of association with multidrug resistance and recent expansion
(Borrell & Gagneux, 2009); this is discussed further in section 1.3.2. In addition to strains clustering into six main lineages, two major groupings were observed, the “ancient” and “modern” lineages (Figure 1.2). Lineage 1 and the two M. africanum lineages are referred to as ancient as they branched off from a common ancestor at an early stage of evolution, whilst the remaining three modern lineages diverged at a later time point (Lineage 2, 3, and 4). Previously, studies have classified MTBC strains into two groups based on the presence of a single genomic deletion known as TbD1 (Brosch et al., 2002), but here it was demonstrated this separation is more than a single deletion (Hershberg et al., 2008). TbD1 is in the relatively long branch prior to the separation of Lineages 2, 3 and 4 shown in Figure 1.2, thus representing more genetic variation between the ancient and modern lineages than had been suggested by TbD1. As mentioned previously, recently a rare seventh MTBC lineage was identified, and this has a phylogenetic location that is between the ancient and modern lineages in Figure 1.2, although the Lineage 7 branch point is before TbD1 (Firdessa et al., 2013); Lineage 7 was published in March 2013 and therefore is not discussed further in this thesis.
Strains used in the MLSA study were derived from a global collection of 875 strains from 80 countries that were previously characterised by genome wide deletion analysis (Gagneux et al., 2006a), and represent the broadest sample of genetic and geographic MTBC diversity to date. In the study by Gagneux et al. and following analyses, it was found that the MTBC diversity is highly geographically structured (Gagneux et al., 2006a; Hershberg et al., 2008). This is shown in Figure 1.3, where for example Lineage 4 is the dominant lineage in terms of geographical spread across the continents of Europe, America and Africa, whilst Lineage 2 is predominantly found in East Asia.
[Figure 1.2: phylogenetic tree. Branch labels: Lineage 1 (The Philippines, Rim of Indian Ocean), M. africanum Lineage 5 (West Africa 1) and M. africanum Lineage 6 (West Africa 2), grouped as ancient lineages; Lineage 3 (India, East Africa), Lineage 2 (East Asia, including Beijing) and Lineage 4 (Europe, America, Africa), grouped as modern lineages.]
Figure 1.2. The most complete phylogeny of the human adapted MTBC. Maximum Parsimony phylogeny of MTBC built using 89 concatenated gene sequences in 108 strains. The branches are colored according to the main lineages defined previously based on LSP deletion analysis (Gagneux et al., 2006a). Although not part of this study, the animal strains were part of the previous MLSA study and shown here for reference. Adapted from Hershberg et al. (2008). Image reproduced under the Creative Commons Attribution License (CCAL).
Figure 1.3. Distribution of the MTBC lineages globally. The six lineages display a strong geographic structure, with each dot representing the dominant lineage in each of the 80 countries represented in the strain collection. Adapted from Gagneux et al. (2006a) and Hershberg et al. (2008). Image reproduced under the CCAL.
1.2.4 Origin of the MTBC
Early estimates dated the MTBC to 15,000 to 20,000 years ago, when it was hypothesised that animal domestication during the Neolithic transition was the cause of TB in humans (Sreevatsan et al., 1997a). More recent estimates place the MTBC at 70,000 or more years old, linked with early human migrations out of Africa (Hershberg et al., 2008). It is interesting that the continent that harbours the greatest MTBC genetic diversity is Africa, with all six lineages represented (Figure 1.3). Based on the MLSA data by Hershberg et al., it was postulated that the MTBC originated in Africa and accompanied the Out-of-Africa migrations of modern humans approximately 70,000 years ago (Hershberg et al., 2008). In this evolutionary model it is suggested that the two ancient M. africanum lineages (Lineage 5 and 6) remained in Africa, whilst the other lineages spread with human migrations into Eurasia, with the three modern MTBC lineages seeding Europe, India and China. Expansions in human population over the last few centuries then led to the rapid expansion of these modern lineages (Gagneux, 2012). In 2010, Comas et al. generated the first whole-genome global phylogeny of human-adapted MTBC (Comas et al., 2010). This phylogeny resolved the lineages at much greater resolution than previous analyses, and demonstrated that the two M. africanum lineages are the most basal. These two lineages are found exclusively in West Africa (de Jong et al., 2010), and whilst the reason for this is unknown, this evidence further supports the model that the MTBC originated in Africa (Gagneux, 2012; Hershberg et al., 2008).
1.2.5 Selective pressures acting within the MTBC
Genetic diversity is introduced and fixed into populations by the four primary evolutionary forces: mutation, natural selection, genetic drift and gene flow (Robinson et al., 2010a). Mutation is a stochastic process affecting DNA regardless of function, but only those mutations that ‘survive’ the processes of genetic drift and selection will be detected in the genome. Genetic drift is a change in allele frequency over time due to random sampling over the course of multiple generations. Importantly, it is dependent on effective population size; smaller populations are more strongly affected by genetic drift than larger populations. In contrast, natural selection is a non-random process determined by the differential survival of genetic variants within a population (Robinson et al., 2010a). Finally, gene flow in the form of horizontal gene transfer (HGT) or recombination can shuffle mutations and introduce new genetic information into populations. Importantly, while mycobacterial species display gene flow, it has not been detectable in the MTBC (Hirsh et al., 2004; Supply et al., 2003), thus leaving the three former evolutionary forces acting within the MTBC. Mutation, selection and drift are intrinsically interdependent, and Hershberg et al. used the MLSA dataset to explore the evolutionary forces that might have shaped the MTBC genetic diversity (Hershberg et al., 2008). Comparison of nonsynonymous SNPs (which cause an amino acid change) to synonymous SNPs (no amino acid change) can provide a measure of the selective pressures acting within a sequence. This is expressed as the dN/dS ratio, whereby the ratio of nonsynonymous SNPs to potential nonsynonymous SNPs (dN) is divided by the corresponding synonymous ratio (dS); a ratio near unity indicates the absence of selection, whilst the ratio increases under positive selection and decreases under purifying selection (Rocha et al., 2006). Positive selection describes the process of certain alleles increasing in frequency due to a greater fitness than others, whilst purifying selection purges deleterious alleles, likely generated by nonsynonymous SNPs, from the population. Applied to the MLSA it was found that 62% of the SNPs were nonsynonymous and 38% synonymous, corresponding to a dN/dS ratio of 0.57. To put this in context, the dN/dS ratio for M. canettii, the outlying member of the MTBC, was 0.18, and in two sequenced Mycobacterium avium strains the dN/dS was 0.17 (see phylogeny in Figure 1.1). Similar ratios were observed across all other Actinobacteria, hence the dN/dS seen in the MTBC is markedly high compared to other mycobacteria. It was concluded that in the MTBC purifying selection is strongly reduced.
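The normalisation in this definition is easy to overlook: the counts of nonsynonymous and synonymous SNPs are each divided by the number of sites at which such a change could occur, which is why proportions of 62% and 38% do not translate directly into the ratio. A minimal Python sketch of the calculation follows. It is not the procedure used by Hershberg et al.; the function names are hypothetical, sites are counted with a simplified Nei-Gojobori style scheme in which the stop signal is treated as an additional residue symbol, no correction for multiple substitutions is applied, and the SNP and site counts in the usage lines are invented for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_sites(cds):
    """Return (nonsynonymous_sites, synonymous_sites) for a coding sequence.

    Each codon position contributes the fraction of its three possible
    single-base changes that leave the encoded residue unchanged.
    Assumes an upper-case sequence of unambiguous bases (T, C, A, G).
    """
    syn_sites = 0.0
    n_codons = len(cds) // 3
    for i in range(n_codons):
        codon = cds[3 * i:3 * i + 3]
        residue = CODON_TABLE[codon]
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == residue:
                    syn_sites += 1.0 / 3.0
    total_sites = 3.0 * n_codons
    return total_sites - syn_sites, syn_sites


def dn_ds(n_nonsyn, n_syn, nonsyn_sites, syn_sites):
    """Uncorrected dN/dS: SNPs per site in each class, then the ratio."""
    dn = n_nonsyn / nonsyn_sites
    ds = n_syn / syn_sites
    return dn / ds


# Toy usage: ATG has no synonymous changes, GGG is fourfold degenerate
# at its third position.
print(count_sites("ATGGGG"))
# Invented counts: 62 nonsynonymous and 38 synonymous SNPs over
# hypothetical totals of 300 nonsynonymous and 100 synonymous sites.
print(dn_ds(62, 38, 300.0, 100.0))
```

Because most single-base changes in a codon alter the amino acid, nonsynonymous sites typically outnumber synonymous sites, so a majority of nonsynonymous SNPs is compatible with a ratio below one.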
The consequence of reduced purifying selection in the MTBC was examined at the level of conservation of amino acid positions in the 89 genes sequenced across the MTBC strains. Orthologs were found for 62 genes in mycobacteria distantly related to the MTBC strains, and using a multiple sequence alignment of these genes the amino acid positions were divided into either conserved or variable positions. This categorised 64% of the amino acid positions in mycobacteria as conserved, and 36% as variable. Mutations within conserved positions are more likely to have a functional effect than those at variable positions. Nonsynonymous changes in M. canettii predominantly fell into variable positions (72%), but the majority (58%) of amino acid mutations in the MTBC fell into the conserved positions. This percentage was not dissimilar from that expected if purifying selection in the MTBC was no longer making a distinction among mutations in these two classes of sites (Hershberg et al., 2008).
1.3 Phenotypic diversity
Whilst the outcome of human tuberculosis infection and the resulting disease is highly variable and has been attributed to many factors, including host and environmental variables, the impact of bacterial strain variation on the clinical outcome of human infection by the MTBC remains an open question. A number of studies have explored the phenotypic differences between specific strains. Many of the earlier studies were based on a small set of canonical laboratory reference strains, whilst later studies moved to clinical strains and were increasingly informed by the phylogenetic structure of the MTBC. The laboratory strain studies are discussed first in the next subsection, followed by a discussion of clinical strain phenotypes.
1.3.1 Laboratory strains
As introduced above, many early studies were based on a few characterised reference strains, namely the laboratory strains H37Rv, H37Ra and Erdman, and the vaccine strain M. bovis BCG (reviewed in Coscolla & Gagneux, 2010). In addition to these strains, two reference clinical strains, CDC1551 and HN878, isolated from TB outbreaks in Tennessee and Texas respectively, have also been used (Jones et al., 1999; Valway et al., 1998). In a phylogenetic context these strains are not representative of MTBC diversity, with H37Rv, H37Ra, Erdman and CDC1551 all from Lineage 4, whilst HN878 is part of the Beijing subgroup of Lineage 2 (Figure 1.2).
One of the clearest phenotypic differences among the above laboratory and clinical reference strains is shown by strain HN878 during infection. HN878 is consistently associated with a low inflammatory response and increased virulence in both in vitro macrophage studies and in vivo animal models compared to the other laboratory strains (Manca et al., 1999; Manca et al., 2001; Manca et al., 2005). In a mouse challenge study using several clinical strains, HN878 was found to be hypervirulent, causing unusually early death of infected immune-competent mice (Manca et al., 2001). The hypervirulence of HN878 was suggested to be due to the failure of this strain to stimulate the Th1-type immunity needed to control M. tuberculosis infection (Manca et al., 2001).
All studies that utilise laboratory strains suffer from the same issue of strain adaptation to laboratory conditions. This mechanism was exploited to create the laboratory strain
H37Ra, an avirulent M. tuberculosis strain that was generated by culturing H37, the parental strain of H37Rv, on solid egg medium and selecting for resistance to lysis (Steenken, 1935). This phenomenon can also affect clinical strains but can be managed through minimal handling and passaging of cells, thereby limiting the number of generations and potential for mutation. Adaptation can lead to changes in the virulence of the strain, such as the loss of phthiocerol dimycocerosate (PDIM) from strain H37Rv grown in vitro. PDIM is a wax-like compound and an important cell wall lipid associated with mycobacterial virulence (Domenech & Reed, 2009). The other laboratory strain, H37Ra, does not synthesise a number of cell surface antigens, including sulfolipid-1, trehalose mycolates, as well as PDIM (Chesne-Seck et al., 2008). As H37Rv and other laboratory strains have been passaged for many decades outside of the human host (Ioerger et al., 2010), their relevance in studies of infection and virulence is debatable. This is further underscored by the genomic diversity seen in strains of H37Rv, which has been grown in numerous laboratories throughout the world effectively in an unintentional in vitro evolution experiment, resulting in their separation by multiple SNPs and frameshift insertion and deletions (indels) (Ioerger et al., 2010).
1.3.2 Clinical strain phenotype
Whilst there is currently little evidence of common phenotypic differences at the lineage level, multiple phenotypes have been identified in nearly forty studies investigating the virulence and immunological characteristics of clinical strains (Coscolla & Gagneux, 2010). One consistent phenotype is the lower induction of proinflammatory cytokines by the Beijing sub-lineage of Lineage 2 (Figure 1.2) compared to H37Rv and other strains. This group of strains is so described as they are endemic in many parts of East Asia, and account for the majority of cases of TB in these regions (Qian et al., 1999); they have also been described as the W-Beijing family of strains (Glynn et al., 2002). The Beijing group has subsequently become the focus of numerous studies owing to its recent spread in human populations (Cowley et al., 2008), and association with multidrug resistance (Borrell & Gagneux, 2009). Whilst the characteristics that predispose this family of strains to such clinical outcomes have not been fully resolved, Reed et al. (2007) showed that Beijing strains accumulate large quantities of triglycerides in in vitro aerobic culture, and that this was linked to the constitutive over expression of genes that are members of the DosR-controlled regulon. DosR is induced during conditions that are likely to occur during latent infection, such as by nitric oxide and low oxygen tension and is thought to contribute to bacterial persistence (Kumar et al., 2007). One
consequence of this constitutive expression is the observed accumulation of large quantities of triglycerides during in vitro aerobic culture conditions, in contrast to non-Beijing strains. The authors hypothesise that the triglycerides provide an adaptive advantage to the Beijing strain family by acting as an energy source during infection (Reed et al., 2007), which would represent the first example of an in vitro phenotypic characteristic shared at the MTBC strain sub-lineage level (Nicol & Wilkinson, 2008).
From a clinical perspective, early studies of MTBC strain variation found that strains from South India were less virulent and had increased susceptibility to oxidative stress compared to strains from Great Britain (Mitchison et al., 1960; Mitchison et al., 1963). Although these strains were not genotyped at the time, it can be speculated using the current knowledge of MTBC phylogeography that this represents a divide between Lineage 1 (Indo-Oceanic) and Lineage 4 strains (Coscolla & Gagneux, 2010). Another example of differences between MTBC strains detected at the clinical level is Lineage 2, which has been associated with extrapulmonary (Kong et al., 2007) and meningeal TB (Caws et al., 2008) compared to strains from other lineages. Several studies have also associated Lineage 2 with HIV coinfection (Caws et al., 2006), but the experimental phenotype is not clear and has been contested in other studies which found no significant associations (de Jong et al., 2009). In summary, the extent to which clinical MTBC phenotypes are shared by strains belonging to broader phylogenetic lineages is largely unknown, but this may reflect the previous paucity of research in this area (Nicol & Wilkinson, 2008). In the context of increasing evidence that the amount of sequence variation in the MTBC has been underestimated, genetic diversity may have important phenotypic consequences, including an impact on areas such as drug and vaccine design (Gagneux & Small, 2007).
1.4 Linking genotype to phenotype
The first step towards understanding the influence of genetic diversity in the MTBC on TB infection is to understand the molecular mechanisms that link strain diversity to phenotype. This is a challenging area of research and there are few examples of such studies for the MTBC. The previously described study by Reed et al. linked the accumulation of triacylglycerides to the constitutive over-expression of the DosR regulon (Reed et al., 2007). This has recently been partially associated with a 350 kb genomic duplication that is present in some strains from Lineage 2 (Domenech et al., 2010). A second example is a link between the hypervirulence of some Lineage 2 strains and the production of the immune modulatory phenolic glycolipid (PGL). It was found that the laboratory strain H37Rv and other members of Lineage 4 do not produce PGL due to a seven base pair frameshift deletion in the pks1/15 gene cluster, which encodes a polyketide synthase involved in the production of PGL (Constant et al., 2002). If pks1/15 is disrupted in the Lineage 2 laboratory strain HN878, then the hypoinflammatory and hypervirulent phenotype is lost (Reed et al., 2004). However, this phenotype is more complex than simply the presence of an intact pks1/15. Insertion of an intact pks1/15 into the Lineage 4 H37Rv laboratory strain did not result in increased virulence (Sinsimer et al., 2008), thus demonstrating the importance of taking into account the lineage genetic background of the strain in question.
With advances in sequencing technology, the number of MTBC strains sequenced and the associated number of SNPs identified are rapidly increasing (Stucki & Gagneux, 2012). Figure 1.4 shows the number of MTBC genome sequences within the NCBI Short Read Archive (SRA), a repository for next-generation sequencing data, which currently stands at 4,913 MTBC genome sequences. SNPs are the most common form of genetic variation in the MTBC, followed by insertions and deletions (indels), and a total of 9,037 SNPs were discovered by sequencing twenty-one clinical strains of MTBC (Comas et al., 2010). Whilst this presents an opportunity to understand the impact of such SNPs, there are also considerable challenges due to the sheer number of SNPs identified, which will only grow with the associated increase in comparative genome sequencing studies.
[Figure 1.4: bar chart. x-axis: Year (2008 to 2013); y-axis: Number of genomes in NCBI SRA (0 to 5000); labelled bar values include 355, 1799, 4675 and 4913.]
Figure 1.4. The number of MTBC genome sequences in the NCBI Short Read Archive (SRA). The database was queried on 21-02-2013 using the search term Mycobacterium tuberculosis complex. The year 2013 is not complete and only representative of nearly the first two months of the year.
1.4.1 In silico prediction of functional SNPs
Whilst identifying SNPs in bacterial genomics studies is becoming relatively simple through whole genome sequencing using one of the second-generation technologies (Loman et al., 2012), understanding the effects of sequence variations has become a major effort in mutation research (Thusberg & Vihinen, 2009). Experimental study of the molecular effects of all MTBC SNPs identified in recent studies, such as those found in the above twenty-one genome study, is unfeasible. The development of computational methods to screen for SNPs likely to have a functional effect from those that are neutral has therefore been a highly active field within bioinformatics, and a number of computational tools have been created for this purpose (Bao & Cui, 2005; Cingolani et al., 2012; Ng & Henikoff, 2006). From here on, the term functional SNP is used to refer to those SNPs that are expected to alter gene expression or function, and therefore associated with a phenotype. Use of such methods to predict functional SNPs can help prioritise additional research on those SNPs more likely to affect protein function.
Methods that predict whether a SNP has a functional effect use either sequence or structural information, or a combination of both, to form the prediction. Such methods rely on the evidence that mutations which affect protein function tend to occur at evolutionarily conserved positions, or are buried in the interior of the protein structure (Ng & Henikoff, 2006). Predictions based on sequence information typically follow a common procedure, as implemented by Ng & Henikoff in their SIFT prediction algorithm (Ng & Henikoff, 2003). Firstly, an input sequence is used in a database search for homologous sequences. These are used to create a multiple sequence alignment, which identifies the evolutionarily conserved positions, and these are inferred to be important for function. A scoring method based on the frequency of each amino acid at each position, and the severity of an amino acid change, is then applied to each position in the input sequence. The introduction of an amino acid that does not appear at a given position in the alignment can still be classified as neutral rather than functional, as predictions also use the physicochemical properties of the amino acids already present in the alignment. For example, if a position in an alignment contains the hydrophobic amino acids isoleucine, leucine and valine, then this position can likely only contain hydrophobic amino acids, and changes to other hydrophobic amino acids, such as methionine, will likely not have a functional effect (Ng & Henikoff, 2003; Ng & Henikoff, 2006).
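The sequence-based procedure described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the SIFT algorithm: the toy alignment, the coarse four-class grouping of amino acids and the function name predict_effect are all assumptions made for the example, and real tools use probabilistic scores derived from the alignment rather than a fixed rule.

```python
# Assumed, coarse physicochemical grouping of the twenty amino acids.
GROUPS = {
    "hydrophobic": "AILMFVWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_CLASS = {aa: name for name, members in GROUPS.items() for aa in members}


def predict_effect(alignment, position, new_aa):
    """Classify a substitution at one alignment column.

    alignment -- list of equal-length aligned protein sequences
    position  -- zero-based column index
    new_aa    -- the amino acid introduced by the SNP
    """
    # Residues observed in the homologues at this column (gaps ignored).
    observed = {seq[position] for seq in alignment if seq[position] != "-"}
    if new_aa in observed:
        return "tolerated"
    # An unseen residue is still tolerated if the column contains a single
    # physicochemical class and the new residue belongs to that class.
    classes = {AA_CLASS[aa] for aa in observed}
    if len(classes) == 1 and AA_CLASS[new_aa] in classes:
        return "tolerated"
    return "possibly functional"


# Toy alignment: the third column holds isoleucine, leucine and valine.
toy_alignment = ["MKIA", "MKLA", "MRVA"]
print(predict_effect(toy_alignment, 2, "M"))  # methionine, hydrophobic
print(predict_effect(toy_alignment, 2, "D"))  # aspartate, negatively charged
```

The two calls mirror the example in the text: methionine is absent from the column but shares its hydrophobic character, whereas aspartate does not.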
19 1.4 Linking genotype to phenotype
1.4.2 Gene expression diversity
After the genome sequence of the first M. tuberculosis strain was published in 1998 (Cole et al., 1998), and the extent of genetic diversity was beginning to be uncovered (Comas et al., 2010; Hershberg et al., 2008), the next logical step in understanding the consequences of such genetic diversity is to build upwards from the genomic information layer. Uncovering the complexity of phenotypic differences in the MTBC likely requires the integration of multiple layers of biological information (Comas & Gagneux, 2009), and moving from the DNA to RNA level to explore MTBC transcriptional diversity is discussed in the following section.
In the first systematic survey of variation in mRNA expression, Gao et al. compared the gene expression of ten clinical isolates of M. tuberculosis in addition to the reference strains H37Rv and H37Ra (Gao et al., 2005). All isolates were grown in vitro under exponential growth conditions. The authors found that 527 (15%) of the genes tested were variable amongst the isolates, highlighting for the first time strain-to-strain variability in expression under identical growth conditions. Combined with gene function information, it was found that genes involved in lipid metabolism were statistically over-represented amongst the variable genes; it was speculated that this could have implications for virulence, as lipids and lipid metabolism are thought to have an important role in host-pathogen interactions (Barry, 2001; Forrellad et al., 2012; Reed et al., 2004). A further 16% of genes were consistently expressed, and as might be expected this class was over-represented by genes in the information pathways class; this class consists of genes associated with replication, transcription and translation (Lew et al., 2011), which are consequently highly expressed in actively growing bacteria. Approximately two-thirds of the remaining genes in the study were equally split between low or undetectable and unexpressed classes. Many of these genes were classed as unknown hypotheticals, and so could represent incorrect annotation of coding regions, or alternatively discovery bias through the use of only one culture condition (Gao et al., 2005). Overall the study identified transcriptional variation amongst a set of clinical isolates, with implications in the choice of drug targets for vaccine development and diagnostic markers. The study predates the robust classification of the phylogenetic lineages of the MTBC (Gagneux et al., 2006a), which limits the use of the results in a phylogenetic context.
More recently, a study of transcriptional variation amongst clinical isolates of the MTBC has been undertaken within a phylogenetic framework using microarray technology (Homolka et al., 2010). The authors included fifteen clinical strains from four MTBC lineages (Lineages 1, 2, 4 and 6), plus the reference strains H37Rv and CDC1551, which are part of Lineage 4. Under in vitro exponential growth conditions the authors identified 364 genes (9.1% of all annotated genes) differentially expressed between strains of different lineages in at least one pairwise comparison. Several genotypic signals were identified, such as the dysregulation of the virS-mymA operon in Lineage 1, thought to be involved in maintenance of the cell wall structure (Singh et al., 2003), and over-expression of the dosR two component regulator in the Beijing strains, which controls the DosR regulon and is described in section 1.3.2. Analyses were extended to the transcriptional response of intracellular bacilli before and after infection of resting and activated murine macrophages. Apart from identifying the core universal induction or repression of 280 genes (7.0%) in all strains regardless of state compared to in vitro expression, a proportion of genes (293 genes; 7.3%) displayed significant genotypic patterns in response to the intracellular conditions in the macrophage (Homolka et al., 2010). This study currently represents the most comprehensive survey of transcriptional diversity in the human-adapted MTBC. The presence of genotypic signals implicates the effect of the underlying genotypic diversity, driven by large deletions, indels, and coding and noncoding SNPs, although this was not explored in the study.
In 2007, the global transcriptional differences between a strain of M. bovis and the reference strain H37Rv were investigated by microarray (Golby et al., 2007). This study provides a useful comparison between a human-adapted strain and M. bovis, which, whilst it can be sustained in humans, is regarded as a primary pathogen of wild and domesticated animals (as discussed in section 1.1.2). Under nutrient limited conditions and in steady state growth, it was found that 92 genes (2.3%) had 3-fold differential expression. Genes showing higher expression were equally split between the two strains. Focusing again on the major gene functional categories, a large proportion of differentially expressed genes encoded proteins involved in the cell wall, lipid metabolism, gene regulators, the PE/PPE protein family, and toxin–antitoxin (TA) gene pairs.
The growing understanding that regulatory processes are often mediated by RNA molecules beyond the classical view of protein based regulation was combined with advances in sequencing technology to uncover the total transcriptome of M. tuberculosis by RNA-sequencing (RNA-seq) (Arnvig et al., 2011). The RNA-seq method is discussed in the following section (1.4.3). All RNA molecules from in vitro exponential and stationary phase cultures of M. tuberculosis strain H37Rv were sequenced, and it was found that more than a quarter of all sequence reads mapped to intergenic regions; this excluded the highly expressed ribosomal RNAs involved in protein synthesis. Accounting for the size of the intergenic regions based on the H37Rv genome size, this represented a 2-fold higher density of noncoding RNA expression compared to gene expression (mRNA transcription). The non-coding RNA ranged from 5’ and 3’ untranslated regions (UTRs), antisense transcripts, and intergenic small RNA (sRNA) molecules. Although based on the reference strain H37Rv, the work provides an important benchmark for future studies of transcriptional diversity in MTBC strains, demonstrating the significant quantity of RNA expression that had not been detectable in previous microarray based studies.
1.4.3 High-throughput DNA sequencing technology
Our awareness of greater levels of genetic diversity in the MTBC has been largely driven by technology changes in sequencing, and next-generation high-throughput DNA sequencing is likely to play an important role in improving our understanding of TB (Loman et al., 2012); whilst this technology is often described as next-generation sequencing, this term is likely to become less useful as the technology advances by further generations. As introduced earlier, in 2010 Comas et al. sequenced twenty-one representative clinical MTBC strains, and this was performed using Illumina sequencing by synthesis technology (Comas et al., 2010; Loman et al., 2012). This genome set has since become an ideal basis on which to perform later phylogenetic studies employing ever increasing numbers of MTBC strains (Bentley et al., 2012). This section briefly introduces the technology, focusing specifically on the methods used in this thesis, namely genome and RNA-sequencing using the Illumina sequencing platform.
Recent advances in DNA sequencing technologies have enabled the determination of nucleotide sequence at a greater data throughput, a shorter amount of time and at lower cost than was previously possible using capillary-based Sanger sequencing (Shendure & Ji, 2008). Several novel approaches have been developed including 454 (pyrosequencing) and Illumina sequencing, previously known as Solexa sequencing. The Illumina system was established at NIMR in 2010, initially by an Illumina Genome Analyser IIx sequencer (GA), and later on by the Illumina HiSeq2000 (HS); the HS sequencer was the result of technical developments and has five times greater data output than the older GA sequencer (Loman et al., 2012). The Illumina method involves sequencing millions of short reads, initially 36bp but more recently ~100bp, using a flowcell based system for capturing DNA. It is the flowcell in which the sequencing reactions take place, which is divided into eight lanes, and therefore up to eight different samples can be added. This limitation of sample number has been removed by recent multiplexing technology, which utilises sequence tags to track each sample and therefore increases the number of individual samples added to each flowcell lane (Meyer & Kircher, 2010).
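The principle of tracking samples by sequence tags can be sketched as follows. This is a minimal illustration only: it assumes the tag forms the first bases of each read, whereas on the sequencer the tag may be read separately, and the tag sequences and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length):
    """Group reads by sample using the tag assumed to start each read."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        # Reads whose tag matches no known sample are kept apart.
        sample = tag_to_sample.get(tag, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical tags and reads, for illustration only.
tags = {"ACGT": "strain_A", "TGCA": "strain_B"}
reads = ["ACGTGGCCTTAA", "TGCAGGCCTTAA", "ACGTCCGGAATT", "NNNNCCGGAATT"]
print(demultiplex(reads, tags, tag_length=4))
```

Pooled reads from a single lane are therefore separated back into their samples of origin, which is what allows more than eight samples per flowcell.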
Briefly, there are three broad stages in the generation of sequence data: library preparation, amplification and sequencing. Libraries are initially constructed by one of several methods that generate a mixture of DNA fragments with ligated adaptor sequences up to several hundred bp in length. These are amplified using PCR primers attached to a flowcell, resulting in the physical clustering of the DNA templates across the flowcell, creating a lawn of sequence fragments (Shendure & Ji, 2008). This is followed by sequencing, consisting of multiple cycles of single base extensions using fluorescently labeled reversible terminator nucleotides and imaging to detect which base has been incorporated, thereby determining the base in the sequence (Bentley et al., 2008). At the end of each cycle the labeled nucleotide is cleaved and another round of terminators is added; the number of cycles therefore determines the length of the reads generated.
The Illumina sequencing platform generates considerable quantities of data per run, with each flowcell producing up to 6 billion reads which translates into 600 Gigabase (Gb) of sequence data. Apart from creating demands on storage capacity, with image data from each flowcell requiring 32 terabytes of temporary storage, a robust informatics pipeline is required to handle the downstream analysis (Bentley, 2010). There are two main analytical approaches to using the sequence data, one involves aligning to a reference sequence, also known as a mapped assembly, and the other is reference free and therefore a de novo assembly. The short read data generated by the Illumina sequencers is most applicable to the former method, and is very useful in the discovery of SNPs and phylogenetics.
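The relationship between read number, read length and total sequence yield quoted above is simple arithmetic, sketched below. The genome size of approximately 4.4 Mb for M. tuberculosis and the figure of ten million reads per sample are assumptions used only to illustrate the resulting depth of coverage.

```python
APPROX_GENOME_BP = 4_400_000  # assumed approximate M. tuberculosis genome size


def yield_gb(n_reads, read_length_bp):
    """Total sequence yield in gigabases."""
    return n_reads * read_length_bp / 1e9


def mean_coverage(n_reads, read_length_bp, genome_bp=APPROX_GENOME_BP):
    """Mean depth of coverage if all reads map evenly across the genome."""
    return n_reads * read_length_bp / genome_bp


# 6 billion reads of ~100bp per flowcell, as quoted in the text.
print(yield_gb(6_000_000_000, 100))
# Hypothetical sample of 10 million 100bp reads.
print(round(mean_coverage(10_000_000, 100), 1))
```

The calculation assumes that every read maps and that coverage is uniform, so it gives an upper estimate of the depth achieved in practice.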
High-throughput sequencing has translated into numerous publications that provide new insight into the evolution and genomic diversity of bacteria (Comas et al., 2010; Holt et al., 2008; Qi et al., 2009). This technology is being applied to other disciplines, such as transcriptomics, where whole genome sequencing of RNA transcripts (RNA-seq) is creating a powerful new approach to characterisation of the bacterial transcriptome (Perkins et al., 2009). For over ten years, microarray technology has allowed the simultaneous monitoring of expression levels of all annotated genes in cell populations (Schena et al., 1998). Whilst microarrays have been instrumental in our understanding of transcription, generating a wealth of publications and data based on this technology, limitations in its applicability have begun to be reached (Mortazavi et al., 2008). Inherent issues such as the limited dynamic range for the detection of transcript levels, cross hybridisation and the need for normalisation provide some explanation for the explosion in use of second generation technologies in the analysis of transcriptomes (Marguerat & Bähler, 2010). As well as surveying the total transcriptional landscape, adaptation of the library making process can facilitate Transcriptional Start Site (TSS) mapping, whereby the precise position of transcription initiation can be determined in a genome-wide manner (Filiatrault et al., 2011; Sharma et al., 2010b). This can provide greater understanding of the transcriptional output, and in the human pathogen Helicobacter pylori revealed a complex structure of TSS within operons and opposite to annotated genes (Sharma et al., 2010b).
1.5 Thesis Outline
In this thesis, the identification and effect of lineage-specific genetic variation within the phylogenetic lineages of the MTBC is investigated using computational methods and high-throughput sequencing technology. This is driven by the overarching hypothesis that fixation of mutations at evolutionarily conserved positions in the lineages of M. tuberculosis, either due to a relaxed selective constraint or positive selection, has resulted in functional consequences that separate the MTBC lineages. Chapter 3 begins with the construction of a representative 28-genome phylogeny using Illumina sequencing data. Comparative analysis focuses on the detection of all lineage-specific single nucleotide polymorphisms (SNPs), providing the first glimpse of the total SNP diversity that separates the main phylogenetic lineages from each other. The lineage-specific coding SNPs are used to investigate the evolutionary pressures acting within the lineages using population genetics measures and gene function categories. Chapter 4 applies in silico tools to the lineage-specific SNPs to predict those likely to have a functional effect. The focus is on the largest group of genetic variation, the nonsynonymous SNPs, and a significant overrepresentation of transcriptional regulators with predicted functional SNPs was detected. Chapter 5 moves from the DNA to RNA level using a transcriptomic approach. RNA-sequencing of multiple strains from two lineages was performed, and differential expression analysis used to define lineage-specific transcriptomes. Along with the differential expression of genes between the lineages, the experimental method used allowed novel expression of noncoding and antisense transcripts to be detected. In the context of previously identified lineage-specific SNPs, significant associations were found between the genomic and transcriptomic data, which were found to arise by three main mechanisms. These have the potential to alter the response of isolates to differing microenvironments and to modulate expression of ligands involved in innate immune recognition.
Chapter 2 Materials and Methods
The following chapter details all protocols used in this thesis. It begins with the basic laboratory methods used in the culture of Mycobacterium tuberculosis and the strains used. Genome and RNA sequencing are outlined next, alongside the bioinformatics analysis tools used to interpret these data. Details of MTBC strains and specific bioinformatics analyses are given in results Chapters 3 to 5.
2.1 General microbiological methods
2.1.1 Containment 3 laboratory
All culturing of M. tuberculosis strains was performed in a Biosafety Level 3 laboratory, and work undertaken within a Class II flow cabinet at a negative pressure of at least 160kPA.
2.1.2 General chemicals and reagents
Unless otherwise stated all laboratory chemicals were purchased from Sigma-Aldrich. Buffers were prepared as aqueous solutions using distilled water, and solutions were sterilised either by autoclaving or filtration (Millipore, 0.22μm) depending on the volume.
2.1.3 Bacterial culture and storage
Growth of M. tuberculosis strains used in this study was performed in liquid Middlebrook 7H9 growth media (Difco, Becton Dickinson). The 7H9 media was supplemented with 0.5% glycerol (Fisher Scientific), 10% Middlebrook ADC (Albumin, Dextrose, Catalase), and to help prevent clumping of the cells during growth, 0.05% Tween-80. This is a standard rich nutrient medium used to culture M. tuberculosis (Atlas & Snyder, 2006). Cultures were grown in one litre roller bottles (Nalgene) in a rolling incubator at 37°C. For long-term storage all isolates were stored at -20°C in 2ml cryo tubes (Sigma-Aldrich), and supplemented with 10% glycerol to increase viable cell number during storage.
2.1.4 Growth curves
Growth curves of the bacterial strains used in this study were performed to determine the previously unknown growth rates of the clinical isolates, which is critical for the extraction of RNA from the correct growth phase for subsequent experiments. This would also provide important phenotypic data on potential differences in in vitro growth rates between the lineages.
Inoculation of 50ml conical screw cap falcon tubes (Fisher Scientific) containing 10ml 7H9 medium was performed two days prior to the start of the growth curve experiment. On starting the experiment a roller bottle with 100ml 7H9 was inoculated with the pre-culture so that the starting OD was 0.01 (the lower limit of detection by the spectrophotometer). Samples of 1ml were taken every 24 hrs and the OD measured in a 1ml cuvette.
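A growth rate can be estimated from two such OD readings taken during exponential phase. The sketch below assumes simple exponential growth between the two time points; the function name and the example readings are hypothetical and do not represent data from this thesis.

```python
import math


def doubling_time_hours(od_start, od_end, elapsed_hours):
    """Doubling time assuming exponential growth between two OD readings."""
    # Specific growth rate (per hour) from the ratio of the two readings.
    growth_rate = math.log(od_end / od_start) / elapsed_hours
    return math.log(2) / growth_rate


# Hypothetical readings: OD rises from 0.1 to 0.4 over two daily samples.
print(round(doubling_time_hours(0.1, 0.4, 48), 1))
```

The estimate is only meaningful when both readings fall within exponential phase; a fit across several time points would be more robust than the two-point calculation shown here.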
2.1.4.1 Optical density (OD) measurements
The optical density (OD) method was used to measure the growth of mycobacterial cultures in the above protocol. This is a rapid method that employs a spectrophotometer to measure the difference in light transmission at a certain wavelength before and after passing through a path length of a culture sample in a cuvette. Here an Amersham Bioscience spectrophotometer was used for all OD measurements. All readings were taken at a wavelength of 600nm (OD600), and sterile 7H9 used as a reference. Saturation of absorbance occurs above an OD of 1, therefore any readings above this were taken from a diluted sample and multiplied by the dilution factor afterwards (typically 1:10).
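The dilution correction described above amounts to the following calculation. The function names are illustrative, and the example reading is hypothetical.

```python
SATURATION_OD = 1.0  # readings above this are unreliable, as stated above


def needs_dilution(reading, saturation=SATURATION_OD):
    """True if a reading exceeds the saturation limit of the instrument."""
    return reading > saturation


def corrected_od(diluted_reading, dilution_factor):
    """Scale a reading from a diluted sample back to the undiluted culture."""
    return diluted_reading * dilution_factor


# Hypothetical culture: the undiluted reading saturates, so a 1:10 dilution
# is measured and multiplied by the dilution factor.
print(needs_dilution(1.4))
print(corrected_od(0.25, 10))
```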
2.2 Molecular biology techniques
2.2.1 Genomic DNA extraction
Genomic DNA was extracted using the CTAB method described previously (van Soolingen et al., 1991). 20ml of culture with an OD of ~0.5 was transferred into a sterile 50ml conical tube and centrifuged at 3000xg for 10mins to pellet the bacteria. The supernatant was decanted and the pellet resuspended in 1ml lysis buffer. The suspension was transferred into a 2ml screw cap tube, and placed into a water bath at 90°C for 1hr. Following this step the crude cells and lysate were transferred to a containment 2 laboratory. The cells were pelleted at 13000xg, the supernatant discarded, and the pellet resuspended in 400µl lysis buffer and 100µl of 10mg/ml lysozyme, gently mixed, and incubated at 37°C for 2 hrs.
The cell lysis step consisted of the addition of 50µl 20% SDS and 25µl Proteinase K to the cell mix. The sample was incubated at 55°C for 40mins and 250µl of 4M NaCl added and gently mixed. 160µl of preheated CTAB was added and incubated for 10 minutes. To separate the DNA from protein contamination, 900µl chloroform-isoamyl alcohol (24:1) was added and the biphasic suspension vortexed, then centrifuged for 10mins at 13000xg at 4°C to separate the phases. The upper phase containing the DNA mix was transferred to a clean 2ml eppendorf. DNA was precipitated with 700µl cold isopropanol and mixed by gently inverting the tube. Following a 2hr or overnight precipitation, the sample was centrifuged at 13,000xg for 10mins at 4°C. The supernatant was decanted and the pellet air dried. 1xTE buffer was added to dissolve the DNA, which was then stored at 4°C.
2.2.2 RNA Isolation and handling
Inoculation of 10ml 7H9 media in falcon tubes from previously frozen bacterial stock was performed per experiment to enable the rapid growth of pre-cultures before scaling up to larger growth volumes. Following approximately two days and before OD reached 0.8, this culture was used to inoculate a roller bottle containing up to 180ml 7H9 liquid media.
As determined by growth curve experiments (section 5.3.1), exponential phase cultures were harvested at an OD of between 0.4 and 0.8, whilst stationary phase cultures were harvested one week after the OD had reached 1.0. When ready, cultures were cooled rapidly by addition of ice directly into the culture, and centrifuged at 12,000xg for 15 mins at 4°C. The supernatant was subsequently decanted. RNA was then isolated from the pellet using the FastRNA Pro blue kit from QBiogene/MP Bio following the manufacturer’s instructions. Briefly, 1ml of RNApro solution was added to the pellet and the cells resuspended by pipetting, and 1ml transferred to a blue-cap tube containing Lysing Matrix B. The cell mix in the tube was homogenised in a FastPrep Ribolyser (QBiogene/MP Bio) for 40secs at a setting of 6.0, and centrifuged at 12000xg for 5mins at 4°C. The upper phase was transferred to a fresh microcentrifuge tube, incubated for 5mins, 300µl chloroform added, vortexed for 10secs and further centrifuged at 12,000xg for 5mins. Following transfer of the upper phase to a fresh microcentrifuge tube, 500µl of cold ethanol was added and the tube inverted five times.
Following this step the RNA suspension was transferred to a containment level 2 laboratory and precipitated for at least 2hrs or alternatively overnight. After precipitation, the sample was centrifuged at 12,000xg for 15mins at 4°C, the supernatant removed and the pellet washed in 500μl of cold 75% ethanol (made with DEPC-H2O). The ethanol was aspirated and the pellet air-dried at room temperature for 5mins, then the RNA resuspended in 100 μl of DEPC-H2O.
2.2.3 Quantification of DNA and RNA by Nanodrop
A Nanodrop spectrophotometer (version ND-1000) was used to detect the quantity of DNA and RNA following the above protocols. This requires 1μl of sample to be placed on to the Nanodrop pedestal. The Nanodrop then measures the absorption of the sample at a range of wavelengths (230-350nm). The absorbance correlates with the concentration of nucleic acid present, given in ng/μl. The Nanodrop also provides a measure of the quality of DNA or RNA extraction. Nucleic acids and proteins have absorbance maxima at 260 and 280nm, respectively. A 260/280 ratio of ~1.8 is generally accepted as high quality for DNA, and a ratio of ~2.0 as high quality for RNA. If the ratios of DNA or RNA extractions were appreciably lower than these values, a repeated round of purification was performed to remove potential protein or other contamination that may be present in the sample.
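The purity check described above can be expressed as a short calculation. The expected ratios are those given in the text; the tolerance used to define "appreciably lower" is an assumption introduced for illustration, as no numerical cut-off is stated, and the absorbance values are hypothetical.

```python
# Expected 260/280 absorbance ratios for high quality samples (from the text).
EXPECTED_RATIO = {"DNA": 1.8, "RNA": 2.0}


def a260_a280_ratio(a260, a280):
    """Ratio of absorbance at 260nm (nucleic acid) to 280nm (protein)."""
    return a260 / a280


def needs_repurification(a260, a280, kind, tolerance=0.2):
    """Flag a sample whose ratio falls below the expected value.

    The tolerance is an assumed threshold, not a value from the protocol.
    """
    return a260_a280_ratio(a260, a280) < EXPECTED_RATIO[kind] - tolerance


# Hypothetical readings.
print(needs_repurification(1.0, 0.5, "RNA"))  # ratio 2.0
print(needs_repurification(1.0, 0.8, "DNA"))  # ratio 1.25
```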
2.2.4 Determination of DNA and RNA integrity by micro fluidics
Both RNA and DNA concentrations were first measured using Nanodrop, followed by quality control using the Agilent 2100 Bioanalyser. The Bioanalyser is a chip-based capillary electrophoresis machine for sizing, quantification and quality control of DNA, RNA, as well as proteins and cells. Depending on the sample type, the nucleic acid was measured using the Agilent DNA 1000 chip or Agilent RNA 6000 nano chip following the manufacturer’s instructions.
2.2.5 Removal of DNA contamination from RNA samples
Rigorous DNase treatment of all RNA samples was performed using the TURBO DNase free kit (Applied Biosystems). This procedure can remove > 200µg DNA per ml. Up to 5µg total RNA was treated in volumes of 50µl according to the manufacturer’s instructions. Briefly, 0.1 volume of 10X TURBO buffer and 1µl (2U) TURBO DNase was added to the 50µl total RNA aliquot and mixed well. This was incubated at 37°C for 20mins, followed by an additional 1µl (2U) TURBO DNase, and a further 20min incubation. To terminate the reaction 0.2 volumes DNase Inactivation Reagent was added and incubated for 5mins at room temperature. The sample was then centrifuged at 13,000xg for 2mins and the supernatant, containing the DNase free RNA, transferred to a fresh microcentrifuge tube and stored at -20°C.
2.2.6 Polymerase chain reaction (PCR)
PCR was used to amplify specific regions of DNA. For general PCR amplification of template DNA Supermix (Invitrogen) was used. Specific protocols, including DNA-seq, RNA-seq and qRT-PCR, used the manufacturer’s recommended reagents and are described in the following sections. All PCR reactions were done in 0.2ml RNase- and DNase-free thin wall PCR tubes (Ambion) using an Applied Biosystems Veriti Thermal Cycler. As a negative control the same reaction was conducted in the absence of a DNA template.
2.3 Materials
2.3.1 Mycobacterium tuberculosis strains
At the start of this project, strain stocks were generated for the entire duration of the project. Stocks were taken from a strain collection at NIMR derived from a global collection isolated in San Francisco (Gagneux et al., 2006a). Handling of stocks was kept to a minimum to minimise the effect of laboratory adaptation; strains were cultured for one week at NIMR to obtain sufficient stocks for this thesis. Stocks were frozen at OD 0.4-0.8 to prepare stocks for subsequent exponential phase transcriptome sequencing experiments.
The MTBC strains used in this thesis are described in the respective results chapters (Chapters 3 and 4).
2.4 DNA-seq
Following extraction and quality control of DNA described in the above method, the Epicentre Nextera DNA kit was used to generate Illumina sequencing ready DNA libraries. Briefly, the Nextera method employs in vitro transposition to simultaneously fragment and tag DNA in a single-tube reaction, thereby facilitating the rapid generation of DNA libraries; accounting for all quality control procedures, library preparation can take less than two days. The manufacturer’s instructions were followed, and the High-Molecular-Weight Buffer (HMW) used, which generates fragments of 175-700bp and is recommended for paired-end sequencing. A limited PCR step was performed, consisting of a 72°C 3min extension step to denature the templates, followed by nine cycles of 95°C for 10secs, 62°C for 30secs and 72°C for 3mins. The amplified DNA fragments were subsequently purified using the Zymo column DNA Clean & Concentrator-5 kit.
Libraries for additional MTBC strains that were not part of this study were also generated using the above method at the same time, and therefore the Nextera barcoded adapters were used in the above PCR step. These can be used to add up to twelve unique barcodes to the Nextera libraries, enabling multiplexing of the libraries to reduce the sequencing cost.
2.5 RNA-seq
Following trialling of several methods to generate cDNA libraries ready for sequencing from the RNA extractions, two methods were chosen for the generation of transcriptomes in this thesis. The two methods are described below; one generates transcriptomes for differential expression analysis (2.5.1), whilst the other was used for transcriptional start site (TSS) mapping analysis (2.5.2).
2.5.1 Strand-specific RNA-seq libraries
The strand-specific protocol for transcriptome sequencing is largely based on the small RNA sample preparation protocol from Illumina (part # 1001375), but with exclusion of polyA-tail and size selection methods in order to capture all RNA species. Total RNA from the above DNase treated RNA extraction was randomly fragmented, and specific 5’ and 3’ adapters were attached to the ends of the RNA; the adapters are complementary to oligonucleotides immobilised on the glass surface of the Illumina flowcell. The protocol consists of six main steps: fragmentation, phosphatase treatment, PNK treatment, ligation of the adapters, reverse transcription and PCR amplification. These are followed by purification steps using Solid Phase Reversible Immobilisation (SPRI) beads.
Fragmentation: Initially between 3-5µg of DNase treated RNA was fragmented following the described Illumina protocol with the 10X fragmentation reagent. This was stopped with the stop solution and put on ice, the volume increased to 100µl with RNase free water and precipitated by adding 3 volumes of 100% ethanol, 0.1 volumes of sodium acetate (3M) (Ambion Cat # AM9740) and 0.05 volumes of glycogen. This was precipitated for at least 30 minutes at -20°C. The pellet was washed with 500µl of 70% ethanol, air dried on ice and resuspended in 16µl of RNase free water in a 200µl PCR tube.
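The reagent volumes for the ethanol precipitation above follow directly from the sample volume, as sketched below. The function name is illustrative and the calculation simply applies the stated proportions (3, 0.1 and 0.05 volumes) to the sample volume.

```python
def precipitation_volumes(sample_ul):
    """Reagent volumes (µl) for ethanol precipitation of a given sample volume."""
    return {
        "ethanol_100pct_ul": 3 * sample_ul,        # 3 volumes
        "sodium_acetate_3M_ul": 0.1 * sample_ul,   # 0.1 volumes
        "glycogen_ul": 0.05 * sample_ul,           # 0.05 volumes
    }


# The fragmented RNA is brought to 100µl before precipitation.
for reagent, volume in precipitation_volumes(100).items():
    print(reagent, round(volume, 2))
```

The same proportions apply to the 200µl sample in the phenol purification step that follows.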
Phosphatase treatment: The sample was treated with 2µl Antarctic phosphatase with 10X Phosphatase buffer (NEB Cat # M0289S) and incubated for 30mins at 37°C, 5mins at 65°C and held at 4°C.
PNK treatment: To the previous PCR tube 2µl T4 Polynucleotide Kinase (PNK) (NEB Cat # M0201S), 17µl water, 5µl 10X PNK buffer, 5µl ATP (10mM) (Epicentre Cat # R109AT) and 1µl RNAse OUT (Invitrogen, part # 10777-019) were added and incubated for 60mins at 37°C and held at 4°C.
Phenol purification: The sample was transferred to a new 1.5ml microcentrifuge tube and the volume increased to 200µl by addition of RNase free water (Ambion, Cat # AM9920). 200µl acid phenol (Ambion Cat # AM9720) was added and the sample vortexed, and after centrifuging for 15mins at room temperature the upper phase was transferred to a new microcentrifuge tube. 3 volumes of cold 100% ethanol, 0.1 volumes of sodium acetate and 0.05 volumes of glycogen were added and the sample precipitated for 30mins or overnight. Following precipitation the sample was centrifuged for 25mins at 4°C, the pellet washed in 70% ethanol and air dried on ice. Once dry, 5µl RNase free water was added to the pellet.
Ligation of the adapters: Adapters were from the Illumina small RNA preparation kit with the v1.5 sRNA 3’ Adaptor (Illumina cat # FC-102-1009). Following the manufacturer’s instructions the 3’ sRNA adaptor v1.5 and then the SRA 5’ Adapter were ligated to 5µl RNA from the previous step.
Reverse Transcribe and Amplify: 4µl of the 5’ and 3’ ligated RNA was mixed with 1µl diluted (1:5) SRA RT primer from the Illumina small RNA kit and heated at 70°C for 2mins. The standard SuperScript II Reverse transcriptase kit with 100mM DTT and 5X first strand buffer (Invitrogen, part # 18064-014) was used to reverse transcribe the ligated RNA following the manufacturer’s instructions.
PCR Amplification: Using the Phusion DNA Polymerase kit (NEB part # M0530S) following the manufacturer’s instructions, 10µl of the product from the reverse transcription reaction was amplified in a thermal cycler using the following conditions: 30secs at 98°C, 17 cycles of: 10secs at 98°C, 30secs at 60°C, 30secs at 72°C, followed by 10mins at 72°C and then holding at 4°C.
Purification of libraries: The SPRI bead purification system (Agencourt AMPure from Beckman Coulter Genomics) was used to remove residual reagents from the previous steps to leave a purified DNA sample. The standard manufacturer’s instructions were used for two rounds of SPRI bead purification. The final supernatant was transferred to a fresh labelled RNase free tube, along with another 4µl aliquot for assessing library concentration and purity (using a Bioanalyser), and stored at -20°C.
2.5.2 TSS 5’ enriched RNA-seq libraries
Terminator-5’-phosphate-dependent exonuclease (Epicentre Biotechnologies) was used to deplete processed RNAs in cDNA samples used in TSS mapping analysis. Total RNA was sent to Vertis Biotechnologie AG (Freising, Germany) and Illumina ready libraries were constructed using the same protocol as above, but with the addition of the Terminator-5’-phosphate-dependent exonuclease step to remove all RNA transcripts without a 5’ triphosphate cap. This step removes degraded mRNAs and rRNAs, thereby biasing the sequencing of only the 5’ end of mRNA transcripts and facilitating the mapping of transcriptional start sites (TSS).
2.6 Illumina sequencing DNA (genome) and cDNA (RNA-seq) libraries
The library sequencing stage was performed by the high-throughput sequencing (HTS) group at NIMR under the supervision of Abdul Sesay. Generated libraries were quality checked by Agilent DNA 1000 chip and quantified by Qubit (Invitrogen).
Briefly, sequencing libraries were denatured with sodium hydroxide and a 2nM dilution of the library was loaded onto a single lane of an Illumina Genome Analyser IIx (GA) or HiSeq2000 (HS) flowcell. Cluster formation, primer hybridisation and single or paired-end sequencing were performed using proprietary reagents according to the manufacturer’s recommended protocol (Illumina).
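Preparing a 2nM dilution requires the library concentration in molar terms. The sketch below shows one common way to convert a mass concentration and mean fragment size into molarity, assuming an average mass of 660 g/mol per base pair of double-stranded DNA; this conversion and the example values are assumptions for illustration and are not taken from the sequencing group's protocol.

```python
AVERAGE_BP_MASS = 660  # assumed average g/mol per base pair of dsDNA


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a mass concentration (ng/µl) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVERAGE_BP_MASS * mean_fragment_bp)


def stock_volume_ul(stock_nm, target_nm, final_volume_ul):
    """Volume of stock needed for a dilution, from C1 x V1 = C2 x V2."""
    return target_nm * final_volume_ul / stock_nm


# Hypothetical library: 13.2 ng/µl with a mean fragment size of 500bp.
stock = library_molarity_nm(13.2, 500)
print(round(stock, 1))
print(round(stock_volume_ul(stock, target_nm=2, final_volume_ul=20), 2))
```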
2.7 Quantitative RT-PCR
To confirm differential expression identified by RNA-seq, qRT-PCR was carried out on a 7500 Fast Real-Time PCR System (Applied Biosystems) using Fast SYBR Green Master Mix (Applied Biosystems). To minimise problems with normalisation across plates, each 96-well plate followed a closed experimental design in which all clinical strain samples were included. RNA without RT (RT-) was analysed alongside cDNA (RT+). Standard curves were generated for each gene analysed, and the quantities of cDNA within the samples were calculated from cycle threshold values. Three biological replicates were tested, giving three qRT-PCR plates per gene. Data were averaged, adjusted for chromosomal DNA contamination (RT+ minus RT-) and normalised to the corresponding 16S rRNA values.
cDNA for quantitative RT-PCR was made with random primers and Superscript III according to manufacturer's instructions (Invitrogen). 2µg of DNase treated total RNA from each respective strain was used as the starting material. Three biological replicates per strain were used in this study.
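The quantification steps described in this section can be illustrated with a minimal sketch. It assumes a linear standard curve of the form Ct = slope × log10(quantity) + intercept; the function names and the slope and intercept values in the usage note are hypothetical and are not taken from the instrument software used in this study.

```python
def quantity_from_ct(ct, slope, intercept):
    """Interpolate a quantity from a standard curve.

    Assumes the curve Ct = slope * log10(quantity) + intercept.
    """
    return 10 ** ((ct - intercept) / slope)


def normalised_expression(ct_rt_plus, ct_rt_minus, ct_16s, gene_curve, ref_curve):
    """Return (RT+ quantity minus RT- quantity) divided by the 16S rRNA quantity.

    gene_curve and ref_curve are (slope, intercept) tuples for the
    target gene and the 16S rRNA standard curves respectively.
    """
    q_plus = quantity_from_ct(ct_rt_plus, *gene_curve)
    q_minus = quantity_from_ct(ct_rt_minus, *gene_curve)  # genomic DNA carry-over
    q_ref = quantity_from_ct(ct_16s, *ref_curve)
    return (q_plus - q_minus) / q_ref
```

With an illustrative curve of slope -3.0 and intercept 40.0, a Ct of 34 corresponds to 100 units and a Ct of 40 to 1 unit, so subtracting the RT- signal removes 1% of the RT+ quantity before normalisation.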
2.7.1 Primer sequences
Primers were designed using the Primer 3 software (Rozen & Skaletsky, 2000), ordered from Sigma at 100µM concentration in 100µl aliquots, and stored at -20°C. Primers used in the RNA-seq study in Chapter 5 are shown in Table 2.1.
Table 2.1. Primer sequences used in the qRT-PCR study. Seven toxin-antitoxin genes were measured by qRT-PCR, and the 16S rRNA sequence was used in normalisation. In the primer column the suffix denotes the forward (F) and reverse (R) primers.
Gene | qRT-PCR primer | Sequence (5’ - 3’)
Rv2063 | mazE7_F | TCCACGACGATTAGGGTTTC
Rv2063 | mazE7_R | ACATCGAGATTCCCCGTTC
Rv2274A | mazE8_F | CGAACCAGAAACCCTTCCT
Rv2274A | mazE8_R | GACGACTCTGCTCCCAACTC
Rv2830c | vapB22_F | GATCGAGATCACCAAACACG
Rv2830c | vapB22_R | GGTGGTGAAGAGTTCGTCGT
Rv2758c | vapB21_F | GTATGCTCTCCGGGTGTGAC
Rv2758c | vapB21_R | TGTCGTGGTACCCAGTTCCT
Rv1398c | vapB10_F | GGACCTGCAGGCTATAAACG
Rv1398c | vapB10_R | GCAAGGTGCTGTTCACGAC
Rv1397c | vapC10_F | TGGACTTGGCGACTATCTGA
Rv1397c | vapC10_R | GGAAATGCCACACGTTGAG
Rv2527 | vapC17_F | CGATATCGGCGAACTTGAAT
Rv2527 | vapC17_R | CAGTGACGTTTGTTGGCTGT
16S | 16S_F | AAGAAGCACCGGCCAACTAC
16S | 16S_R | TCGCTCCTCAGCGTCAGTTA
2.8 MTBC annotation datasets
2.8.1 Coding sequence annotations
All gene annotations were based on the reference H37Rv genome sequence (Cole et al., 1998) and using the most recent annotations from the Tuberculist database, release 24 (December 2011) (Lew et al., 2011). In total there are 4,015 protein coding gene sequences, 13 pseudogenes, 45 tRNAs and 3 rRNAs.
2.8.2 Functional Categories
The genes can be classified based on the function of the encoded proteins. Using the Tuberculist database annotations there are ten functional categories, listed below (Lew et al., 2011):
1. virulence, detoxification and adaptation
2. lipid metabolism
3. information pathways
4. cell wall and cell processes
5. intermediary metabolism and respiration
6. unknown
7. regulatory proteins
8. conserved hypotheticals
9. insertion sequences and phages
10. PE/PPE
2.8.3 Essential M. tuberculosis genes
Definition of gene essentiality was based on experiments using transposon mutagenesis to generate single gene knockouts, followed by transposon site hybridization after growth on 7H11 agar or in mice (Sassetti et al., 2003; Sassetti & Rubin, 2003). On the basis of these studies a total of 760 genes fell into the category of essential genes and the remaining genes were classed as nonessential. This follows the same convention as Comas et al. (2010).
2.9 Bioinformatics software
2.9.1 Artemis
The genome browsing and annotation tool, Artemis (Carver et al., 2008), from the Wellcome Trust Sanger Institute was used extensively throughout this work. Importantly, this tool enables new features to be overlaid onto published annotations, and the user plot function allows transcription data to be plotted against the genome.
2.9.2 Quality control of raw RNA-sequencing data
2.9.2.1 FastQC
Raw reads were first filtered to discard low quality reads, which improves mapping by reducing run time and increasing the number of mapped reads. Raw fastq files from the Illumina machine were inspected using FastQC version 0.9.3 (downloaded 20-6-11, Babraham Bioinformatics). FastQC provides a modular set of analyses in a GUI environment and is written in Java. Three outputs were used to determine whether a run had passed QC: the Phred quality scores across the read length (displayed as a box and whisker plot), the per base N content, and over-represented Illumina primer sequences.
2.9.2.2 SolexaQA
Fastq files passing the initial QC were filtered using SolexaQA version 1.7 (Cox et al., 2010) (downloaded April 2011). SolexaQA is a Perl-based software package for quality analysis of Illumina data. The DynamicTrim.pl script within this package was used to remove poor quality bases from reads. Specifically, bases with Phred scores < 13 (which corresponds to an error probability of p > 0.05) were trimmed from the 5’ and 3’ ends of reads until all remaining bases were at or above this threshold. The Perl scripts were run on a Linux server.
Trimming of reads was performed with the command:
$ DynamicTrim.pl [in.fastq] -h 13
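The relationship between the Phred cutoff and the error probability, and the effect of trimming, can be sketched as follows. This is an illustration only: it assumes that trimming keeps the longest contiguous run of bases at or above the cutoff, which may differ in detail from the behaviour of DynamicTrim.pl, and the function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a base-call error."""
    return 10 ** (-q / 10)


def longest_segment_at_or_above(quals, cutoff=13):
    """Return (start, end) of the longest run of qualities >= cutoff.

    The end index is exclusive; (0, 0) is returned if no base passes.
    """
    best = (0, 0)
    start = None
    # A sentinel value below the cutoff closes a run that reaches the read end.
    for i, q in enumerate(list(quals) + [cutoff - 1]):
        if q >= cutoff:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best
```

A Phred score of 13 gives an error probability of approximately 0.0501, which is why scores below 13 correspond to p > 0.05.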
The resulting trimmed fastq file was used as input to the LengthSort.pl script. This removes reads that were of poor quality over a high percentage of their length and are therefore too short for mapping. The default parameter was used, removing reads < 25 bases:
$ perl LengthSort.pl [in.fastq] > [out.fastq]
2.9.3 Transcriptome mapping software
An analysis pipeline was created to manage the high throughput sequencing datasets generated by this study. Each file can contain about 150 million reads consisting of 10 Gigabases of sequence data. A reference based assembly was used for this study, and mapping was performed against the reference genome H37Rv using BWA (version 0.5.9) (Li & Durbin, 2009). The raw sequence data file in the fastq format (Cock et al., 2009) was mapped to the reference genome in fasta format using the following commands:
Index the reference sequence using bwa index.
$ bwa-0.5.9 index [in.fasta]
The reads in the fastq file were mapped to the indexed fasta using the following commands:
$ bwa-0.5.9 aln -I [in.reference.fasta] [in.fastq] > [out.fastq.sai]
$ bwa-0.5.9 samse [in.reference.fasta] [in.fastq.sai] [in.fastq] > [align.sam]
For later processing and storage the mapped file in sam format is converted to the binary format, BAM, using SAMtools (Li et al., 2009).
$ samtools view -bS [in.align.sam] > [out.align.bam]
The bam file is sorted to further reduce storage size and indexed for viewing the BAM file in Artemis.
$ samtools sort [in.align.bam] [out.align.sorted.bam]
$ samtools index [in.align.bam]
Basic mapping statistics after this stage were viewed using the SAMtools idxstats command.
$ samtools idxstats [in.align.bam]
Input files for Artemis plots were produced by combining the per-base coverage of the two strands with the unix paste command.
$ paste [genomeCoverageBed reverse strand.out] [genomeCoverageBed forward strand.out] > artemis.plot.out
2.9.4 Calculation of mapped read frequencies per feature region
Coverage of reads mapping to sense and antisense gene annotations and sRNAs was calculated using the BEDtools package (Quinlan & Hall, 2010). Specifically, the coverageBed and genomeCoverageBed utilities were used for extraction of gene regions and whole genome coverage plots respectively. BEDtools works with four file formats widely used for HTS data: BED, GFF, VCF and SAM/BAM. Gene and intergenic annotations based on H37Rv were parsed into the BED (Browser Extensible Data) format using standard Linux command line tools. The BED format consists of one line per feature, each line containing a minimum of three tab-delimited fields: chr (chromosome name), chr start (start position), chr end (end position). Two of the optional fields were used in this study: name (feature, e.g. gene name) and strand (either forward or reverse strand). These optional fields enable calculation of the number of reads that map to either the coding (sense) or non-coding (antisense) strand of the gene in question.
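As an illustration of the BED layout described above, the following sketch formats one annotated feature as a BED line. It is not the command line procedure used in this study. The function name and the chromosome label are hypothetical, and the sketch assumes that input coordinates are 1-based and inclusive, as in most genome annotations, whereas BED uses a 0-based start and an exclusive end. Because strand is the sixth BED column, a placeholder score is written in the fifth.

```python
def to_bed_line(chrom, start_1based, end_1based, name, strand):
    """Format one feature as a six-column, tab-delimited BED line."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # BED starts are 0-based and ends are exclusive, so only the start shifts.
    fields = [chrom, str(start_1based - 1), str(end_1based), name, "0", strand]
    return "\t".join(fields)


# Example with a hypothetical chromosome label and an illustrative gene record.
print(to_bed_line("H37Rv", 1, 1524, "Rv0001", "+"))
```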
As described above, the coverageBed utility was used to identify the number of reads mapping to each annotated feature, such as a gene. The following was used to count reads mapping to a specific strand in the BAM file, in this case the forward strand.
$ bamToBed -i [in.align.bam] | grep -w + | coverageBed -a stdin -b [annotations.bed] > [plus.strand.out]
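The logic of counting sense and antisense reads for one feature can be sketched as below. This is a simplified illustration rather than a reimplementation of coverageBed: it assumes half-open intervals in shared coordinates and assumes that the strand of a mapped read matches the strand of its originating transcript, which depends on the library protocol. The function name is hypothetical.

```python
def count_sense_antisense(reads, gene_start, gene_end, gene_strand):
    """Count reads overlapping a gene on the same (sense) or opposite strand.

    reads is an iterable of (start, end, strand) tuples using half-open
    intervals in the same coordinate system as the gene.
    """
    sense = antisense = 0
    for start, end, strand in reads:
        if start < gene_end and end > gene_start:  # intervals overlap
            if strand == gene_strand:
                sense += 1
            else:
                antisense += 1
    return sense, antisense
```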
The genomeCoverageBed utility provides a useful base-by-base output of read depth that can be imported into Artemis, and was also used in deletion analysis. The following was used to identify all read depths on the forward strand:
$ genomeCoverageBed -strand + -d -ibam [in.align.bam] -g [genome_length.bed] > [plus.strand.out]
2.9.5 R
R is an open source environment for statistical programming and analysis (R Development Core Team, 2008). Bioconductor, a collection of packages written in R, was used as it provides tools for the analysis and comprehension of high-throughput genomic data. Specific packages used are described in the Methods of Chapter 5 in relation to RNA-seq analysis.
2.9.6 Perl scripts
Ad hoc Perl scripts were written to aid in the parsing of flat file formats for use in programs such as Artemis and R. In addition, the Perl script genomicDeletions.pl was written to identify genomic deletions in genome sequencing data (Appendix A).
2.9.7 GraphPad Prism 5.0
GraphPad Prism 5.0c for OS X was used for the plotting and analysis of data. The software contains comprehensive statistical analysis and presentation tools.
Chapter 3 Lineage-specific SNPs
3.1 Introduction
Genetic variation within the M. tuberculosis complex (MTBC) is higher than previously recognised. From studies of Large Sequence Polymorphisms (LSPs), to targeted multi locus sequence analysis (MLSA), and finally whole genome sequencing (WGS), each method has provided a greater resolution of the genetic variation that exists between clinical isolates (Comas et al., 2010; Gagneux & Small, 2007; Hershberg et al., 2008). The most comprehensive set of phylogenetically representative strains sequenced using new high throughput sequencing (HTS) technology was published recently (Comas et al., 2010). For the first time all branches within the MTBC phylogenetic tree could be resolved, encompassing the six major MTBC phylogenetic lineages. Genome sequences of the twenty-one clinical strains sequenced in the previous study are publicly available, making this an ideal reference phylogeny on which to base further analyses. The genomes were sequenced at high depth (40 to 90-fold coverage) using the Illumina sequencing platform, making it possible to capture the most complete picture yet of MTBC nucleotide diversity.
Single Nucleotide Polymorphisms (SNPs) are the most common form of genetic variation in the MTBC, and driven by advances in sequencing technology an extensive and ever growing catalogue of SNPs amongst clinical isolates of M. tuberculosis has been identified (Comas et al., 2010; Stucki & Gagneux, 2012). As described in Chapter 1, analysis of SNPs in 89 genes from 99 human MTBC isolates provided strong evidence that human MTBC originated in Africa and accompanied the Out-of-Africa migrations of modern humans approximately 70,000 years ago (Hershberg et al., 2008). The six human MTBC lineages exhibit a strong global population structure (Gagneux et al., 2006a) and phenotypic diversity has been associated with the different MTBC lineages. This includes the ability to elicit an immune response in vivo (Portevin et al., 2011), and clinical associations with extrapulmonary tuberculosis (Kong et al., 2005; Kong et al., 2007). The role that MTBC genomic diversity plays in TB disease remains an open question, but one that can now be explored using a rational data driven approach (Coscolla & Gagneux, 2010).
Using available MTBC genome datasets, it is now possible to identify all SNPs that contribute to the background genetic variation of the six lineages. Due to the clonal population structure of MTBC (Supply et al., 2003), the majority of this variation is expected to be exclusive to the lineage in question, and therefore absent from strains of all other lineages. This presents an opportunity to understand the nature of this lineage-specific variation, and is expected to provide insight into how the hypothesised reduced purifying selection in the MTBC has shaped the lineages (Hershberg et al., 2008).
3.1.1 Aims
The aim of the work presented in this chapter was to characterise whole genome variation within the MTBC at the lineage-specific level using M. tuberculosis and M. africanum clinical isolates. As the identification of lineage-specific SNPs relies on a representative phylogeny, the initial aim was to generate a robust phylogeny comprising strains sequenced using second-generation sequencing technology. Following generation of a robust phylogeny, the specific aims of the analysis were to:
• identify lineage-specific SNPs from the main six lineages. These SNPs make up the basal branch of each lineage.
• gain insights into the evolution of the MTBC, focusing on the type and frequency of genetic changes within and across the phylogenetic lineages.
• measure the selective pressures on different gene function categories across the lineages.
3.2 Materials and Methods
3.2.1 Genome collection used in study
In total twenty-eight phylogenetically representative strains were used in this study. Twenty-seven were collected from previously published resources, either through deposited data in public databases or published studies (Comas et al., 2010). Accession numbers are as follows: SRP001137, SRA009341, SRA009367, SRA008875, SRA009637. An additional strain was sequenced as part of this study (Lineage 2 strain N0031). Data have been deposited in the EBI SRA under the accession number ERX192819. Details of the strains, country of isolation, and metrics from the mapping performed for this study are shown in Table 3.1.
3.2.2 Genome sequencing
Genomic DNA for N0031 was extracted using the CTAB method described [previously in Methods], and 2µg DNA used for sequencing on the Illumina HiSeq platform. Sequencing libraries were constructed using the Epicentre Nextera DNA kit according to manufacturer’s instructions. Paired-end 75 base read sequencing was performed in a single Illumina flowcell lane as part of a multiplexed run. In total 10.6 million reads were generated, corresponding to an average sequence depth of 180 reads.
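The stated depth can be checked with a short calculation. The sketch below assumes that depth is approximated as the number of reads multiplied by the read length and divided by the genome length, and uses the H37Rv genome length of 4,411,532 bp; the function name is illustrative, and the mapped depth will differ slightly because not all reads map.

```python
def expected_depth(n_reads, read_length, genome_length):
    """Approximate average depth as total sequenced bases per genome base."""
    return n_reads * read_length / genome_length


# 10.6 million reads of 75 bases against the 4,411,532 bp H37Rv genome.
depth = expected_depth(10_600_000, 75, 4_411_532)
print(round(depth))  # approximately 180-fold
```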
3.2.3 Mapping genome sequences
MAQ (Li et al., 2008) was used to map the reads produced by the Illumina sequencer to the reference genome. The most recent common ancestor of MTBC was used as the reference sequence as described previously (Comas et al., 2010). This sequence is based on the H37Rv genome (NC_000962) but with H37Rv alleles substituted by those of the reconstructed common ancestor of the strains. Standard MAQ parameters were used, removing SNPs with a Phred score <30, read depth of <5, and non-unique matches. A non-redundant list of variable positions called with high confidence in at least one strain was constructed and used to recover the base call in all other strains. SNPs and indels called within repetitive regions (genes annotated as PE/PPE/insertions/phages) were removed.
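The filtering thresholds and the construction of the non-redundant position list can be sketched as follows. The record layout (dictionaries with pos, quality, depth and unique keys) and the function names are assumptions made for illustration and do not reflect the MAQ output format.

```python
def passes_filter(call, min_quality=30, min_depth=5):
    """Keep calls with Phred >= 30, read depth >= 5 and a unique match."""
    return (call["quality"] >= min_quality
            and call["depth"] >= min_depth
            and call["unique"])


def variable_positions(calls_by_strain):
    """Return sorted positions called with high confidence in at least one strain."""
    positions = set()
    for calls in calls_by_strain.values():
        positions.update(c["pos"] for c in calls if passes_filter(c))
    return sorted(positions)
```

The resulting list would then be used to look up the base call at each position in every strain, including strains in which the position did not pass the filter.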
3.2.4 Phylogenetic analysis
Phylogenetic analysis was based on filtered SNPs detected when each strain was compared against the most recent common ancestor of the sequences, as explained above (section 3.2.3). Concatenated SNPs from 13,086 variable genomic positions were used to infer the phylogenetic relationships between strains using the neighbour-joining method. Both coding and noncoding positions were included. The resulting tree was generated with MEGA (Tamura et al., 2011), using 1000 bootstrap replications for clade support, and the observed number of substitutions as the measure of genetic distance. In cases where SNP calls were missing from individual strains, pairwise-deletion was performed and missing data in the specific comparison ignored. As an outgroup, the distantly related M. canetti (strain K116) was used to root the tree. For presentation purposes the branch length of the M. canetti outgroup was reduced by only including SNP positions shared by the MTBC and M. canetti. Trees in Newick tree format were imported into FigTree v1.3.1, a graphical viewer of phylogenetic trees and a program for producing publication-ready figures. FigTree was downloaded from: http://tree.bio.ed.ac.uk/software/figtree/.
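The distance measure with pairwise deletion can be illustrated with a minimal sketch. It assumes each strain is represented by a string of concatenated base calls of equal length, with missing calls encoded as N, - or ?; these encodings and the function name are assumptions rather than the MEGA input format.

```python
def snp_distance(seq_a, seq_b, missing="N-?"):
    """Count differing positions, ignoring sites missing in either strain."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a not in missing and b not in missing and a != b
    )
```

Because deletion is pairwise, a site missing in one strain is only excluded from comparisons involving that strain and still contributes to all other comparisons.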
3.2.5 Categorising SNPs
SNPs were categorised as nonsynonymous (an amino acid change) or synonymous (no change) using snpEff (Cingolani et al., 2012). Source code was downloaded from: https://snpeff.svn.sourceforge.net/svnroot/snpeff/SnpEffect/trunk, and run as a local installation. As an input snpEff takes two files: a database for the reference genome, and a SNP file in the Variant Call Format (VCF). It was necessary to generate a custom reference database based on the ancestral genome sequence of the MTBC. The database was built within snpEff using the package's command line modules: the ancestral sequence was supplied in fasta format, and gene coordinates were parsed into the Gene Transfer Format version 2.2 (GTF 2.2) using the Tuberculist database gene annotations, version 22 (May 2011), to define regions encoding genes. Annotation of SNPs by functional category was based on the Tuberculist database. Genes are grouped into ten functional categories as described previously (section 2.8.2).
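The distinction between synonymous and nonsynonymous SNPs can be illustrated with a minimal sketch based on the standard genetic code. This is not how snpEff is implemented; it assumes the codon is given in coding-strand orientation, and the function name is hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(codon, offset, alt_base):
    """Classify a substitution at position offset (0, 1 or 2) of a codon."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "nonsynonymous"
```

For example, GGT to GGC leaves glycine unchanged, whereas GAA to GCA replaces glutamic acid with alanine.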
3.2.6 dN/dS calculation
dN/dS was calculated by dividing dN by dS. dN is calculated by dividing the sum of nonsynonymous SNPs by the total number of potential nonsynonymous sites in coding sequences, and dS is the sum of synonymous SNPs divided by the total number of synonymous sites in coding sequences. Due to the low number of SNPs in the MTBC, instead of calculating the dN/dS per gene, gene concatenates were generated based on two classifications: firstly, genes defined as essential or nonessential on the basis of transposon screens (Sassetti et al., 2003; Sassetti & Rubin, 2003), and secondly, the Tuberculist gene functional categories. For each concatenate, the Nei-Gojobori method was implemented in SNAP to define synonymous and nonsynonymous substitutions by pairwise comparison using the inferred ancestral genome (Korber, 2000).
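The ratio defined above can be written as a short function. This sketch uses the simple proportions given in the text, without any correction for multiple substitutions that dedicated software may apply; the function name and the counts in the usage note are hypothetical.

```python
def dn_ds(nonsyn_snps, syn_snps, nonsyn_sites, syn_sites):
    """Return dN/dS from SNP counts and the numbers of potential sites."""
    dn = nonsyn_snps / nonsyn_sites  # nonsynonymous SNPs per nonsynonymous site
    ds = syn_snps / syn_sites        # synonymous SNPs per synonymous site
    return dn / ds
```

With illustrative counts of 60 nonsynonymous SNPs over 3,000 sites and 40 synonymous SNPs over 1,000 sites, dN is 0.02, dS is 0.04 and dN/dS is 0.5.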
Table 3.1. Twenty-eight strains used in this study. Patient place of birth and country of strain isolation are given. Depth of coverage and number of filtered SNPs relative to the reference H37Rv are shown.
Strain name | Alternative name (1) | Lineage | Patient place of birth | Country of isolation | Average mapped depth | Number of reads | Percent genome coverage (2) | Filtered SNPs | Study source of genome
MTB_95_0545 | N0032 | Lineage 1 | Laos | San Francisco | 77.37 | 7,621,946 | 99.75 | 1,834 | Comas et al. (2010)
MTB_T17 | N0121 | Lineage 1 | The Philippines | San Francisco | 72.59 | 7,130,412 | 99.36 | 1,867 | Comas et al. (2010)
N0157 | MTB_T92 | Lineage 1 | The Philippines | San Francisco | 46.01 | 5,068,053 | 98.85 | 1,883 | Comas et al. (2010)
MTB_K21 | - | Lineage 1 | Zimbabwe | San Francisco | 77.99 | 7,112,888 | 99.29 | 1,937 | Comas et al. (2010)
MTB_K67 | - | Lineage 1 | Comoro Islands | San Francisco | 78.29 | 7,097,284 | 98.95 | 1,910 | Comas et al. (2010)
MTB_K93 | - | Lineage 1 | Tanzania | San Francisco | 65.52 | 6,017,391 | 99.22 | 1,883 | Comas et al. (2010)
N0070 | EAS050 | Lineage 1 | Indonesia | San Francisco | 55.07 | 3,421,436 | 99.04 | 1,290 | Unpublished (Comas et al., 2013)
N0072 | EAS053 | Lineage 1 | India | San Francisco | 59.49 | 3,696,378 | 98.89 | 1,934 | Unpublished (Comas et al., 2013)
N0153 | MTB_T83 | Lineage 1 | Vietnam | San Francisco | 61.56 | 3,573,058 | 97.18 | 1,854 | Broad Institute (SRA009341)
MTB_00_1695 | N0001 | Lineage 2 | Japan | San Francisco | 77.92 | 7,394,236 | 99.02 | 1,280 | Comas et al. (2010)
N0031 | MTB_94_M4241A | Lineage 2 | China | San Francisco | 179.69 | 21,138,728 | 99.23 | 1,229 | This study (SRA TBC)
N0052 | MTB_98_1833 | Lineage 2 | China | San Francisco | 64.49 | 6,395,114 | 99.10 | 1,279 | Comas et al. (2010)
MTB_M4100A | N0110 | Lineage 2 | South Korea | San Francisco | 40.47 | 4,022,290 | 98.94 | 1,276 | Comas et al. (2010)
N0145 | MTB_T67 | Lineage 2 | China | San Francisco | 78.77 | 7,616,603 | 98.73 | 1,293 | Comas et al. (2010)
MTB_T85 | N0155 | Lineage 2 | China | San Francisco | 61.65 | 6,159,284 | 99.04 | 1,305 | Comas et al. (2010)
MTB_91_0079 | N0022 | Lineage 3 | Ethiopia | San Francisco | 74.03 | 7,228,038 | 99.14 | 1,271 | Comas et al. (2010)
MTB_SG1 | N0114 | Lineage 3 | India | San Francisco | 66.34 | 3,850,822 | 99.25 | 1,330 | Broad Institute (SRA009637)
MTB_K49 | - | Lineage 3 | Tanzania | San Francisco | 75.52 | 6,845,266 | 99.25 | 1,263 | Comas et al. (2010)
H37Rv | - | Lineage 4 | USA | - | Reference | - | - | - | -
MTB_4783_04 | - | Lineage 4 | Sierra-Leone | San Francisco | 78.12 | 7,466,814 | 98.78 | 733 | Comas et al. (2010)
MTB_GM_1503 | - | Lineage 4 | The Gambia | San Francisco | 82.26 | 7,891,933 | 99.08 | 791 | Comas et al. (2010)
MTB_K37 | - | Lineage 4 | Uganda | San Francisco | 59.86 | 5,480,451 | 98.85 | 661 | Comas et al. (2010)
MTB_Erdman | - | Lineage 4 | - | - | 32.69 | 4,333,184 | 98.22 | 862 | Broad Institute (SRA008875)
MTB_KZN_K605 | - | Lineage 4 | South Africa | South Africa | 93.51 | 11,458,643 | 99.52 | 771 | Broad Institute (SRA009637)
MAF_11821_03 | - | Lineage 5 | Sierra-Leone | San Francisco | 78.22 | 7,491,737 | 99.02 | 1,959 | Comas et al. (2010)
MAF_5444_04 | - | Lineage 5 | Ghana | San Francisco | 79.75 | 7,578,690 | 98.92 | 1,959 | Comas et al. (2010)
MAF_4141_04 | - | Lineage 6 | Sierra-Leone | San Francisco | 72.62 | 7,027,143 | 98.61 | 2,045 | Comas et al. (2010)
MAF_GM_0981 | - | Lineage 6 | The Gambia | San Francisco | 76.39 | 7,350,873 | 99.00 | 2,065 | Comas et al. (2010)
MTB_K116 | - | M. canetti | Djibouti | Djibouti | 93.01 | 6,544,254 | 96.32 | 1,018 | Comas et al. (2010)
1 Alternative strain name as used in previous MLSA and genome study is included to preserve the link to a new systematic naming convention used in the strain collection (Comas et al., 2010; Hershberg et al., 2008).
2 Based on H37Rv reference genome.
3.3 Results
3.3.1 A globally representative 28-genome human-adapted MTBC phylogeny
To identify and extract all lineage-specific SNPs, a representative genome collection was built from previously published and newly sequenced M. tuberculosis strains (Table 3.1). This set of genomes formed the dataset used to identify the lineage-specific SNPs analysed in this study; a subset of these strains is also followed in Chapter 5 using a transcriptomic approach (RNA-sequencing). The majority of the strains used in this phylogeny were published by Comas et al. (2010), consisting of twenty-one genomes sequenced on the Illumina platform. As previously reported, these genomes have a mean 72-fold sequence depth, with 98.9% coverage of the reference genome (Comas et al., 2010). A further six genome sequences were downloaded from the European Nucleotide Archive (ENA), and the last strain, N0031, was sequenced as part of this study. Strain N0031 was included in the previous MLSA study and is therefore known to be a rare Lineage 2 strain that is ancestral to the Beijing subgroup (Hershberg et al., 2008). For this reason the strain was selected for sequencing to capture the greatest possible within-lineage diversity. All strains were sequenced using the Illumina platform with a minimum 32-fold average sequence depth (Table 3.1).
Using the H37Rv genome as reference, a mapping assembly was built for the twenty-eight strains using MAQ (Li et al., 2008). SNPs were removed if they had low associated Phred quality scores or read numbers, or if they fell within annotated repeat regions such as PE/PPE regions (see 3.2.3). Such regions are families of genes encoding proteins carrying Proline-Glutamic acid (PE) or Proline-Proline-Glutamic acid (PPE) motifs found near the N-terminus (Cole et al., 1998), and are inherently difficult to map using short read technology such as Illumina. In total 39,764 SNPs were identified in the strains relative to the reference, and the number of filtered SNPs per strain is shown in Table 3.1. Many of these SNPs are present in more than one strain, leaving a high level of redundancy in the SNP lists. A non-redundant list of SNPs was constructed, highlighting 13,088 nucleotide positions that were variable across the 4.4Mb genome. These positions therefore harbour a SNP in one or more of the 28 strains, and were subsequently used to derive a genome wide phylogeny. A Neighbour-Joining phylogeny, constructed using MEGA5 (Tamura et al., 2011), is shown in Figure 3.1.
Strains group into six main phylogenetic lineages, with bootstrap values indicating strong statistical support (Figure 3.1). The phylogenetic structure and strain groupings are completely congruent with the most recent whole genome based phylogeny (Comas et al., 2010), and previous MLSA and gene deletion based phylogenies (Comas et al., 2010; Gagneux et al., 2006a; Hershberg et al., 2008). The same lineage colouring scheme used in previous studies is continued here, and will be continued where applicable throughout the thesis (Comas et al., 2010; Hershberg et al., 2008). Naming of lineages from 1 to 6 follows the convention of Comas et al. (2010). Mycobacterium canetti (strain K116) was used to root the phylogenetic tree, as it is the closest known relative of the MTBC (Gutierrez et al., 2005). The number of SNPs has been artificially reduced for M. canetti in the phylogeny (Figure 3.1). This reduction was performed for aesthetic reasons due to the large number of singletons between M. canetti and any of the other MTBC strains used in this study. For example, between the reconstructed most recent common ancestor of the MTBC sequence used in this study (see section 3.2.3) and the M. canetti genome sequence used there are 12,319 SNPs, compared to the ~1,500 SNPs between any other MTBC strain and the ancestral sequence.
[Figure 3.1 image: neighbour-joining tree; tips grouped as Lineage 1 to Lineage 6 with MCAN_K116 as outgroup; node labels are bootstrap values; scale bar, 200 SNPs.]
Figure 3.1. Neighbour-joining phylogeny based on 13,088 variable common nucleotide positions across 28 human-adapted MTBC genome sequences. Scale bar shows the number of SNPs. The six lineages are coloured as defined previously (Hershberg et al., 2008). The root has been truncated due to the large numbers of changes that separate M. canetti from the rest of the phylogeny. Node support after 1,000 bootstrap replications with all nodes > 75. M. canetti strain K116 was used as the phylogenetic outgroup.
[Table 3.2: lower-triangular matrix of pairwise SNP distances; rows and columns are the MTBC strains and M. canetti K116.]
Table 3.2. Estimates of evolutionary divergence between strains. Pairwise SNP distances in the 28 genome phylogeny. Strain names in the matrix are the same as in Figure 3.1.
Across- and within-lineage genetic diversity was next investigated using the phylogeny. A SNP distance matrix was constructed from the number of base differences per pairwise strain comparison (Table 3.2). Across the phylogeny the average number of SNPs per pairwise comparison is 1,544, which translates to an average of one SNP per 2.857 kb, based on the length of the H37Rv reference genome (4.411532 Mb). In contrast, comparison with the phylogenetic outgroup M. canetti gave on average one SNP per 0.358 kb, a SNP density nearly 8 times higher than within the MTBC.
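The SNP densities quoted above follow from dividing the genome length by the number of SNPs. The sketch below reproduces the arithmetic; the value of 12,319 SNPs for M. canetti is the count against the ancestral sequence given earlier in this section, and its use here is an assumption that appears consistent with the quoted 0.358 kb figure. The function name is illustrative.

```python
GENOME_LENGTH = 4_411_532  # H37Rv reference genome length in bases


def kb_per_snp(n_snps, genome_length=GENOME_LENGTH):
    """Average sequence length in kb per SNP."""
    return genome_length / n_snps / 1000


mtbc = kb_per_snp(1544)      # average MTBC pairwise comparison
canetti = kb_per_snp(12319)  # M. canetti against the ancestral sequence
print(round(mtbc, 3), round(canetti, 3), round(mtbc / canetti, 1))
```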
Within-lineage variation was next measured by taking the average of all pairwise comparisons between strains of each lineage, shown in Figure 3.2. Average within-lineage diversity ranged from 397 SNPs (sd=36) between any Lineage 3 strains, to 811 SNPs (sd=193) between any Lineage 1 strains. Lineages 2 and 1 have the greatest within-lineage variation, with a standard deviation of 222 and 193 SNPs respectively. This is nearly twice that of Lineage 4 (sd=104) and over five times the variation seen in Lineage 3 (sd=36). Lineage 1 also has the greatest number of genome sequences in the phylogeny (9 strain genomes). This might indicate a discovery bias, where an increasing number of genome sequences uncovers more within-lineage variation. Whilst this cannot be ruled out, there was not a significant correlation between the number of strains per lineage and average within-lineage variation (Pearson r = 0.73, p = 0.10). Furthermore, the M. africanum lineages (Lineages 5 and 6) had the fewest representative strains per lineage sequenced at the time of this study owing to the restricted number of strains available, but their diversity is still greater than that of Lineage 3, with Lineage 6 diversity comparable to all but Lineage 1. Overall it would appear that Lineages 1 and 2 have the greatest within-lineage SNP diversity.
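The within-lineage summary can be sketched as the mean and standard deviation of all pairwise distances between strains of one lineage. The data structure (a dictionary keyed by unordered strain pairs), the function name and the use of the sample standard deviation are assumptions for illustration; the distances in the usage note are invented and are not values from Table 3.2.

```python
from itertools import combinations
from statistics import mean, stdev


def within_lineage_diversity(strains, distance):
    """Return (mean, sd) of pairwise SNP distances among the given strains.

    distance maps frozenset({strain_a, strain_b}) to a SNP count.
    """
    values = [distance[frozenset(pair)] for pair in combinations(strains, 2)]
    return mean(values), stdev(values)
```

For three hypothetical strains with pairwise distances of 390, 400 and 410 SNPs, the function returns a mean of 400 and a standard deviation of 10.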
[Figure 3.2 plot. y-axis: Number of SNPs (0 to 1200). x-axis: Lineage 1, Lineage 5, Lineage 6 (Ancient); Lineage 2, Lineage 3, Lineage 4 (Modern).]
Figure 3.2. Within-lineage SNP diversity. The number of SNPs per pairwise comparison of all strains per lineage. Lineages are ordered by Ancient and Modern groups. Error bars indicate mean and standard deviation (sd). There was not a significant correlation between the number of strains per lineage and average within-lineage variation (Pearson r = 0.73, p = 0.10).
3.3.2 Identification of all lineage-specific SNPs
Using the underlying connections from the derived whole genome phylogeny, it was possible for the first time to identify and extract all SNPs that are common to all strains from each of the six lineages. Due to the clonal nature of the MTBC (Supply et al., 2003), SNPs within these branches are largely exclusive to the respective lineage. All alleles on the derived phylogeny were traced throughout the tree and the nodes for each lineage branch were used to isolate all SNPs that contribute to this branch (Figure 3.3A). For example, the 163 SNPs between node 5 and 7 define Lineage 4 strains (red lineage), and in all but a few rare cases are exclusive to the lineage. The SNPs were subsequently defined as lineage-specific, and form the main dataset for the following analysis; SNPs found in more than one lineage branch represent homoplasic nucleotide positions and are described later in section 3.3.4.
In total 2,794 lineage-specific SNPs were identified (Figure 3.3B), and these are distributed throughout the genome, as shown in Figure 3.4 (the full list is shown in Appendix B). The number of lineage-specific SNPs ranges from 124 (Lineage 2) to 698 (Lineage 5). The highest numbers of lineage-specific SNPs are in the two M. africanum lineages (Lineages 5 and 6). In addition to the six lineages, SNPs from the relatively long phylogenetic branch that is basal to the three modern lineages (Lineages 2, 3 and 4) have also been included in this study (Figure 3.3B). This branch defines the three modern lineages and consists of 319 SNPs. From here on this branch is called the modern lineage branch.
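As a rough illustration of what "lineage-specific" means in practice, the sketch below flags positions where every strain of a lineage carries the same derived allele and no strain outside the lineage carries it. This is a simplified proxy and an assumption of this example: the analysis described above traces alleles through the reconstructed nodes of the phylogeny, which this sketch does not do, and it would not recover a shared basal branch such as the modern lineage branch. All names and data are hypothetical.

```python
def lineage_specific_positions(alleles, ancestral, lineage_of, lineage):
    """Positions where all strains in `lineage` share one derived allele
    that is absent from every strain outside the lineage."""
    inside = [s for s in alleles if lineage_of[s] == lineage]
    outside = [s for s in alleles if lineage_of[s] != lineage]
    hits = []
    for pos, ancestral_allele in ancestral.items():
        inside_alleles = {alleles[s][pos] for s in inside}
        if len(inside_alleles) != 1:
            continue  # the lineage is polymorphic (or empty) at this position
        derived = next(iter(inside_alleles))
        if derived == ancestral_allele:
            continue  # the lineage retains the ancestral state
        if all(alleles[s][pos] != derived for s in outside):
            hits.append(pos)
    return sorted(hits)


# Hypothetical data: two strains in each of two lineages, three positions.
ancestral = {100: "C", 200: "G", 300: "A"}
alleles = {
    "a1": {100: "T", 200: "G", 300: "A"},
    "a2": {100: "T", 200: "G", 300: "C"},
    "b1": {100: "C", 200: "G", 300: "A"},
    "b2": {100: "C", 200: "A", 300: "A"},
}
lineage_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
specific_to_a = lineage_specific_positions(alleles, ancestral, lineage_of, "A")
specific_to_b = lineage_specific_positions(alleles, ancestral, lineage_of, "B")
```

Position 300 is excluded for lineage A because only one of its strains carries the change, which corresponds to a within-lineage rather than a lineage-defining SNP.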
Figure 3.3. Isolating lineage-specific SNPs from the phylogeny. A. Ancestral states reconstructed at each node of the tree to extract SNPs belonging to the lineage branches, the so-called lineage-specific SNPs. For example, 163 SNPs between node 5 and 7 define Lineage 4 strains (red lineage). B. All SNPs identified from the lineage branches of the six lineages, including the Modern lineage branch (coloured in black), which defines the three modern lineage strains. Arrows show the number of lineage-specific coding and noncoding SNPs. Scale bar at bottom indicates number of SNPs.
Figure 3.4. Distribution of the lineage-specific SNPs across the genome. Genes on forward and reverse strands shown in outer rings as blue and red respectively. Mapped lineage-specific SNPs depicted in six inner rings, with the SNP colouring based on lineage phylogeny colours. From the innermost ring: Lineage 4, 3, 2, 1, 6, and 5. Genome structure and size based on H37Rv.
3.3.3 Distribution of SNPs
The most recent M. tuberculosis annotations at the time of this study were used to classify the lineage-specific SNPs as non-coding (intergenic SNPs) or coding (Tuberculist database release 24). The average percentage of SNPs falling into these two regions across all the lineages is shown in Figure 3.5. It can be seen that the vast majority of SNPs (86.4%) fall within annotated coding regions. This is not unexpected, as 91.3% of the M. tuberculosis genome is annotated as coding (based on the H37Rv reference). However, after adjusting for the difference in sequence length between coding and non-coding regions, SNPs are not distributed equally between the two, with a nearly 2-fold higher SNP density in intergenic regions (1.0 SNPs per kb of intergenic sequence compared to 0.6 SNPs per kb of coding sequence) (χ², p < 0.0001). This may not be surprising, as SNPs in coding regions are more likely to be removed through purifying selection; the selective pressures acting on the coding regions are investigated later in the chapter (section 3.3.7). Coding SNPs can be further divided into those that cause a change in the amino acid encoded by the codon (a nonsynonymous SNP), and those that cause no change in the amino acid (a synonymous SNP). On average 55% of all SNPs are nonsynonymous, shown in Figure 3.5. Table 3.3 shows the frequency of SNP types for each lineage. Although rare, nonsynonymous SNPs were also found to introduce a stop codon (1.3% of all SNPs), and these were found across all lineages (Table 3.3). Conversely, three nonsynonymous SNPs removed an existing stop codon, contributing approximately 0.1% of all lineage-specific SNPs.
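The length adjustment behind the density comparison can be made explicit with a short calculation. The sketch below assumes the intergenic and coding SNP totals obtained by summing the rows of Table 3.3 (379 and 2,415), the H37Rv genome length, and the 91.3% coding fraction quoted above; the function name is hypothetical.

```python
GENOME_LENGTH_BP = 4_411_532   # H37Rv reference genome
CODING_FRACTION = 0.913        # fraction of the genome annotated as coding

# Totals summed across the columns of Table 3.3.
INTERGENIC_SNPS = 379
CODING_SNPS = 2794 - INTERGENIC_SNPS


def snps_per_kb(n_snps: int, region_length_bp: float) -> float:
    """SNP density expressed as SNPs per kb of sequence."""
    return n_snps / (region_length_bp / 1000)


coding_density = snps_per_kb(CODING_SNPS, GENOME_LENGTH_BP * CODING_FRACTION)
intergenic_density = snps_per_kb(
    INTERGENIC_SNPS, GENOME_LENGTH_BP * (1 - CODING_FRACTION)
)
```

Dividing each count by the length of its own region is what allows the two categories to be compared despite the genome being overwhelmingly coding.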
The direction of amino acid change was determined using the reconstructed ancestral sequence of the MTBC. This sequence is similar to the H37Rv genome structure and has the same nucleotide length, but with H37Rv alleles substituted by those inferred from a reconstruction of the ancestral states using the derived phylogeny (section 3.2.4). Inference of the ancestral alleles is possible because the chromosome is effectively a single linkage group and all descendants share characteristics of the single ancestral cell (Comas et al., 2010). Using the ancestral sequence is therefore advantageous, as it enables the evolutionary direction of nucleotide change to be determined, rather than defining the change relative to the reference strain H37Rv, which can be problematic as it is a Lineage 4 strain.
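The classification of coding SNPs, read in the ancestral-to-derived direction, can be sketched as below. This is an illustrative example rather than the pipeline used in the study: it assumes the standard codon assignments and that the ancestral and derived codons have already been extracted in the reading frame of the gene. The function name and example codons are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ancestral_codon: str, derived_codon: str) -> str:
    """Classify a codon change from the ancestral to the derived state."""
    before = CODON_TABLE[ancestral_codon.upper()]
    after = CODON_TABLE[derived_codon.upper()]
    if before == after:
        return "synonymous"
    if after == "*":
        return "stop gain"
    if before == "*":
        return "stop loss"
    return "nonsynonymous"
```

Because the comparison starts from the ancestral codon, swapping the two arguments (as happens implicitly when a derived reference strain is used as the baseline) turns a stop gain into a stop loss and reverses the reported amino acid change.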
[Figure 3.5 pie chart. Intergenic 14%; Non-synonymous 55%; Synonymous 30%; Stop gain 1%; Stop loss <1%.]
Figure 3.5. The average number of lineage-specific SNPs broken down into non-coding and coding types. Coding SNPs are further subdivided into synonymous, nonsynonymous, and nonsynonymous SNPs that affect stop codons, either through the introduction of a stop codon in a coding sequence (stop gain) or removal of an existing stop codon (stop loss).
Table 3.3. Summary of lineage-specific SNPs. This total includes the nonsynonymous SNPs indicated in the table that affect stop codons, either through an introduction of a stop codon in a coding sequence, (stop gain) or removal of existing stop codons (stop loss).
SNP type        Lineage 1  Lineage 5  Lineage 6  Lineage 2  Lineage 3  Lineage 4  Modern lineage
Intergenic      59         90         86         16         53         18         57
Nonsynonymous   248        395        381        75         183        99         184
  Stop gain     8          10         6          1          3          3          5
  Stop loss     0          0          0          0          2          0          1
Synonymous      156        213        207        33         117        46         78
Total SNPs      463        698        674        124        353        163        319
There were 1,556 genes (38.7% of all annotated genes) with one or more lineage SNPs. Three quarters (75.1%) of the genes with a lineage SNP harboured a single SNP (Figure 3.6A). The number of SNPs per gene ranged from 0 to a maximum of 8 and followed a Poisson distribution, suggesting that there is no clustering of SNPs at the level of individual genes (Figure 3.6B). The single gene with the highest frequency of SNPs (Rv2524c, fas) encodes a probable fatty acid synthase and has multiple SNPs present in Lineages 4, 5 and 6. Typical of lipid-associated genes in M. tuberculosis, fas is long at 9.21 kb, compared to an average M. tuberculosis gene length of 1.0 kb. This is likely the cause of the high number of SNPs: plotting the nucleotide length of all genes with a lineage-specific SNP against SNP frequency revealed a positive correlation (Pearson r = 0.43, p<0.0001) (Figure 3.6C).
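A Poisson model of this kind can be fitted by setting its single parameter to the mean number of SNPs per gene. The sketch below is an assumption-laden illustration, not the fitting procedure used for Figure 3.6B: the observed counts are the bar values as read from Figure 3.6A and should be treated as approximate, and the function name is hypothetical.

```python
from math import exp, factorial


def fit_poisson(observed):
    """Fit a Poisson model to {snps_per_gene: number_of_genes}.

    Returns the estimated rate (mean SNPs per gene) and the expected
    number of genes in each observed class under that rate.
    """
    n_genes = sum(observed.values())
    rate = sum(k * n for k, n in observed.items()) / n_genes
    expected = {
        k: n_genes * exp(-rate) * rate ** k / factorial(k) for k in observed
    }
    return rate, expected


# Genes with 0 to 8 lineage SNPs (assumed from the bar labels of Figure 3.6A).
observed_counts = {0: 2464, 1: 1000, 2: 377, 3: 109, 4: 51,
                   5: 11, 6: 5, 7: 2, 8: 1}
rate, expected_counts = fit_poisson(observed_counts)
```

Comparing `expected_counts` with `observed_counts` class by class (for example on a log scale, as in Figure 3.6B) shows where the data depart from the model; an excess of genes in the high-count classes would point towards clustering.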
[Figure 3.6 plots. A. x-axis: Number of SNPs per gene (0 to 8); y-axis: Number of genes; values at top of bars: 2464, 1000, 377, 109, 51, 11, 5, 2, 1. B. x-axis: Number of SNPs per gene (0 to 8); y-axis: Number of genes (log10). C. x-axis: Number of SNPs per gene; y-axis: Gene length (nucleotides).]
Figure 3.6 Distribution of lineage SNPs per gene. A. Frequency of SNPs per gene, with actual SNP numbers recorded at top of bars. B. Poisson model (shown in red) fitted to the data. The y-axis is plotted as a log10 scale to better show the SNP distribution. C. Correlation between the number of SNPs per gene and gene length.
3.3.4 Monomorphic population structure and homoplasic SNPs
The MTBC displays a highly clonal population structure (Supply et al., 2003; Hirsh et al., 2004). Consistent with this structure, a negligible degree of homoplasy was observed in the lineages. Of the 2,794 lineage-specific SNPs identified, four were homoplasic (0.14%; Table 3.4). Each of these SNPs has the same nucleotide change in two or more lineages, and three of the four cause synonymous changes.
As shown in Table 3.4, the first homoplasy (SNP 1), at genomic position 1480945, introduces a synonymous C to G mutation into codon 519 of Rv1319c, which encodes a possible adenylate cyclase (Cole et al., 1998). This mutation occurs in all Lineage 3 and 5 strains (Figure 3.7A), indicating convergent evolution of this nucleotide position between an ancient and a modern lineage. Interestingly, homoplasy 2 also occurs in Rv1319c, at position 1480948, three nucleotides from the first homoplasy. Furthermore, it occurs in the same lineages (Lineages 3 and 5), causing a synonymous C to T mutation in the preceding codon (codon 518). Inspection of the MAQ alignment files confirmed that this was not an artefact of poor sequencing over the region: the surrounding 100 bp region in strains from Lineages 3 and 5 was mapped with high confidence, shown by MAQ quality scores of 1.0 (Heng, 2008). An insertion or deletion could cause erroneous SNPs to be called in close proximity, but this would also reduce the associated MAQ quality scores for the region, and this was not found to be the case. Together this suggests that these two SNPs have been called with high confidence, and that the two homoplasies are likely true.
The third and fourth homoplasic SNPs occur in Rv2082, which encodes a conserved hypothetical protein. The two homoplasies are present in Lineages 1, 2 and 6, introducing synonymous (A94A) and nonsynonymous (T96A) SNPs (Figure 3.7B). Again these occur in both modern and ancient lineages and are located close together, this time within four nucleotides of each other. The gene has no known function, but independent mutation of the same allele across three lineages might suggest biological relevance. Although not within the lineage branch, some strains from Lineage 4, including H37Rv, also carry these two SNPs as a sub-lineage homoplasy.
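Detecting homoplasies of this kind amounts to finding the same derived allele at the same position on more than one lineage branch. The sketch below assumes each branch has already been reduced to a set of (position, derived allele) pairs; the positions and alleles follow Table 3.4 and Table 3.5, while the function name and the additional Lineage 6 entry are hypothetical.

```python
from collections import defaultdict


def find_homoplasies(branch_snps):
    """Return {(position, allele): [lineages]} for every derived allele
    that appears on two or more lineage branches."""
    seen = defaultdict(set)
    for lineage, snps in branch_snps.items():
        for position, allele in snps:
            seen[(position, allele)].add(lineage)
    return {
        key: sorted(lineages)
        for key, lineages in seen.items()
        if len(lineages) > 1
    }


branch_snps = {
    "Lineage 1": {(2338990, "G"), (2338994, "G"), (2566768, "C")},
    "Lineage 2": {(2338990, "G"), (2338994, "G")},
    "Lineage 3": {(1480945, "G"), (1480948, "T")},
    "Lineage 4": {(2566768, "A")},
    "Lineage 5": {(1480945, "G"), (1480948, "T")},
    "Lineage 6": {(2338990, "G"), (2338994, "G"), (999, "T")},  # 999 is made up
}
homoplasies = find_homoplasies(branch_snps)
```

Keying on the allele as well as the position keeps sites such as 2566768, which carry different derived alleles in different lineages, separate from true homoplasies.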
Table 3.4. Homoplasic nucleotide positions within the lineage branches. Independent mutation of the same nucleotide position occurring across the phylogenetic tree. SNP position based on the reference strain H37Rv genome coordinates.
Homoplasy  Gene     SNP position  Ancient allele  SNP allele  Mutation  Lineages  Gene product
1          Rv1319c  1480945       C               G           T519T     3, 5      adenylate cyclase
2          Rv1319c  1480948       C               T           E518E     3, 5      adenylate cyclase
3          Rv2082   2338990       C               G           A94A      1, 2, 6   hypothetical protein
4          Rv2082   2338994       A               G           T96A      1, 2, 6   hypothetical protein
Figure 3.7. Homoplasic lineage SNPs. A. Homoplasy 1 and 2 occur in Rv1319c in Lineages 3 and 5. B. Homoplasy 3 and 4 occur in Rv2082 in Lineages 1, 2 and 6.
In addition to the four homoplasic positions at the nucleotide level shown in Table 3.4, there was one intergenic SNP, at nucleotide position 2566768, that was mutated to an adenine in Lineage 4 but to a cytosine in Lineage 1 (Table 3.5). Finally, at the amino acid level, the residue at position 733 within Rv0339c harbours different nonsynonymous SNPs in Lineages 3 and 5. Rv0339c encodes a transcriptional regulatory protein, and the two SNPs result in a change to a different amino acid in each lineage (Table 3.5).
Table 3.5. Variable genomic positions within the lineages. Two nucleotide positions harbour different SNPs across the lineages.
SNP  Gene           SNP position  Ancient allele  SNP allele  Mutation    Lineage  Gene product
1    Rv2294-Rv2295  2566768       G               A           intergenic  4        hypothetical protein
     Rv2294-Rv2295  2566768       G               C           intergenic  1        hypothetical protein
2    Rv0339c        406251        A               G           D733G       3        transcriptional regulatory protein
     Rv0339c        406251        A               C           D733A       5        transcriptional regulatory protein
3.3.5 Creation of pseudogenes
In total, thirty-nine SNPs were found to affect stop codons. SNPs can either cause the premature introduction of a new stop codon at any point in the annotated coding sequence (a nonsense SNP), or more rarely remove an existing stop codon. Thirty-six of the SNPs cause the former type of nonsense mutation. As shown previously (section 3.3.3), the majority of SNPs occur in isolation within genes, and nonsense SNPs follow this distribution, thus leading to the potential generation of thirty-five pseudogenes in the respective lineages (Table 3.6A). The remaining three nonsynonymous SNPs have the reverse effect, causing the loss or removal of an existing stop codon (Table 3.6B).
Whilst all lineages have accumulated nonsense SNPs, the three ancient lineages have the greatest frequency, accounting for nearly two-thirds of the SNPs affecting stop codons (24 out of 39, all 24 being nonsense SNPs). To test if this is due to the longer branch lengths of these lineages compared to the modern lineages, and so a reflection of the greater time that these lineages have had to accumulate nonsense mutations, the number of nonsense SNPs was compared to the total number of SNPs found in each respective lineage branch, shown in Table 3.7. Lineage 2 has the shortest branch length and one nonsense SNP, whilst Lineage 5 has the longest branch and the most nonsense SNPs. A significant correlation was found between branch length and the number of pseudogenes (Pearson r = 0.8477, p = 0.0160).
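The relationship between branch length and nonsense SNP count can be explored with a plain Pearson correlation on the values of Table 3.7. The sketch below is a minimal standard-library implementation; it is an illustration only, and because the exact inputs used for the reported coefficient are not stated, the value it produces need not match the reported one to all decimal places.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Table 3.7, in the order Lineage 1, 5, 6, 2, 3, 4, Modern.
branch_length = [463, 698, 674, 124, 353, 163, 319]
nonsense_snps = [8, 10, 6, 1, 3, 3, 5]
r = pearson_r(branch_length, nonsense_snps)
```

With only seven branches the estimate is sensitive to individual points, so the coefficient is best read alongside the raw values in the table.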
It can be seen from Table 3.6 that a large proportion of the SNPs affecting stop codons are within genes annotated as encoding hypothetical proteins (21 out of 39 SNPs). Using the formal functional gene categories defined by Tuberculist, it was tested whether these SNPs were distributed across all gene function categories. Whilst all categories were affected by one or more SNPs, as expected the conserved hypothetical category contained the largest proportion (15 SNPs, 38.5%). Due to the low number of nonsense SNPs, it was not possible to stratify the functional groups by individual lineage, but the distribution of nonsense SNPs was not significantly different for any of the functional categories using the ancient and modern lineage groupings (Table 3.8) (Mann-Whitney U test, p = 0.24).
Table 3.6 Nonsense SNPs. In total thirty-nine SNPs cause a change in the encoded stop codon. A. Introduction of a stop codon within the coding sequence. B. Removal of an existing stop codon. The stop codon is indicated by an asterisk (*) in column 3. Rows are ordered by gene.
A. Stop introduction
Gene              Mutation  Lineage  Gene product
Rv0064            Q862*     5        hypothetical protein
Rv0134 ephF       W152*     1        epoxide hydrolase
Rv0146            Y94*      3        hypothetical protein
Rv0325            Q75*      4        hypothetical protein
Rv0329c           R141*     6        hypothetical protein
Rv0368c           S277*     5        hypothetical protein
Rv0402c mmpL1     R376*     5        transmembrane transport protein
Rv0457c           W119*     1        peptidase
Rv0490 senX3      R410*     6        two component sensor histidine kinase
Rv0574c           Q149*     5        hypothetical protein
Rv0610c           Q305*     1        hypothetical protein
Rv0621            W355*     Modern   hypothetical protein
Rv0836c           W218*     4        hypothetical protein
Rv0906            Q183*     1        hypothetical protein
Rv1251c           E875*     3        hypothetical protein
Rv1504c           E200*     Modern   hypothetical protein
Rv1870c           L212*     Modern   hypothetical protein
Rv1912c fadB5     G63*      3        oxidoreductase
Rv1965 yrbE3B     W11*      5        integral membrane protein
Rv2079            Q609*     2        hypothetical protein
Rv2132            Y60*      5        hypothetical protein
Rv2187 fadD15     Y81*      6        long-chain-fatty-acid-CoA ligase
Rv2187 fadD15     W43*      1        long-chain-fatty-acid-CoA ligase
Rv2299c htpG      Q109*     6        heat shock protein 90
Rv2339 mmpL9      S917*     5        transmembrane transport protein
Rv2690c           R658*     Modern   hypothetical protein
Rv2788 sirR       Q131*     1        transcriptional repressor
Rv2797c           Q273*     5        hypothetical protein
Rv2818c           Q304*     6        hypothetical protein
Rv2850c           R515*     5        magnesium chelatase
Rv2994            W68*      1        integral membrane protein
Rv3079c           E120*     1        hypothetical protein
Rv3373 echA18     G214*     Modern   enoyl-CoA hydratase
Rv3416 whiB3      E71*      5        transcriptional regulatory protein
Rv3729            W369*     6        transferase
Rv3898c           Q111*     4        hypothetical protein

B. Stop removal
Gene              Mutation  Lineage  Gene product
Rv0257            *23R      Modern   hypothetical protein
Rv1641 infC       *202S     3        translation initiation factor IF-3
Rv1921c lppF      *424G     3        lipoprotein
Table 3.7. Nonsense SNPs by lineage. Thirty-six lineage-specific nonsynonymous SNPs result in the introduction of a stop codon within the coding sequence (nonsense SNP). The number of nonsense SNPs is correlated to branch length.
Lineage  Nonsense  Branch length (SNPs)
1        8         463
5        10        698
6        6         674
2        1         124
3        3         353
4        3         163
Modern   5         319
Table 3.8 Nonsense SNPs grouped by functional category. Nonsense SNPs separated by functional category of the affected gene, and into modern (Lineages 2, 3 and 4) and ancient groups (Lineages 1, 5 and 6). Rows are ordered by descending total number of SNPs per functional category.
Functional category                       Total  Modern  Ancient
conserved hypotheticals                   15     7       8
cell wall and cell processes              7      3       4
lipid metabolism                          5      3       2
intermediary metabolism and respiration   3      0       3
regulatory proteins                       3      0       3
virulence, detoxification, adaptation     3      0       3
unknown                                   2      1       1
information pathways                      1      1       0
3.3.5.1 Nonsense and stop codon removal SNPs in essential genes
The thirty-eight genes harbouring the thirty-nine nonsense and stop codon removal SNPs were next grouped by gene essentiality. These groups are based on the genome-wide analyses of mutants that were unable to grow in vitro on Middlebrook 7H11 agar or in the spleens of intravenously infected mice (Sassetti et al., 2003; Sassetti & Rubin, 2003).
Strikingly, all but two of the genes harbouring a SNP involved in creation or removal of a stop codon were nonessential. Of the affected genes, 36 were nonessential and 2 essential, out of a genome-wide total of 2,986 nonessential and 760 essential genes (χ² test; p = 0.0362). Given that a nonsense SNP within an essential gene would very likely cause a loss of function of the encoded protein and hence cell death, this result is perhaps unsurprising. One of the two exceptions is in Lineage 6, an M. africanum lineage. Here a SNP at codon 410 of senX3 (Rv0490) replaces an arginine residue with a stop codon. senX3 encodes a predicted secreted two component sensor histidine kinase (Malen et al., 2007). Whilst this has the potential to severely affect the function of the encoded protein, the precise position of the SNP within the gene determines the length of the protein truncation, and so the likely severity. SenX3 is 410 amino acids long, which places the new stop codon directly adjacent to the existing ancestral stop codon and results in the loss of only one amino acid residue from the protein C-terminus; such a short truncation is likely to have little or no effect on gene function, which may explain why the SNP has been allowed to persist in the lineage.
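One way to examine whether affected genes are depleted among essential genes is a chi-square goodness-of-fit test against the genome-wide proportions. The sketch below is an assumption of this example: the text does not state which form of the χ² test was applied (goodness-of-fit or contingency table, with or without continuity correction), so the p-value it returns is not expected to reproduce the reported one exactly. For one degree of freedom the p-value can be obtained from the complementary error function, which keeps the example within the standard library.

```python
from math import erfc, sqrt


def chi_square_two_classes(observed, background):
    """Chi-square goodness-of-fit for two classes (1 degree of freedom).

    `observed` holds the counts in the sample; `background` holds the
    genome-wide counts that define the expected proportions.
    Returns (chi-square statistic, p-value).
    """
    n = sum(observed)
    total = sum(background)
    expected = [n * b / total for b in background]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with df = 1
    return chi2, p_value


# Affected genes (nonessential, essential) and genome-wide totals from the text.
chi2, p_value = chi_square_two_classes([36, 2], [2986, 760])
```

Under these assumptions roughly eight of the 38 affected genes would be expected to be essential, against the two observed.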
A similar scenario exists in the second essential gene harbouring a SNP affecting a stop codon. infC (Rv1641) encodes the translation initiation factor IF-3, one of the three initiation factors in bacteria (Malys & McCarthy, 2011). IF3 binds to the 30S ribosomal subunit and shifts the equilibrium between 70S ribosomes and their 50S and 30S subunits by promoting dissociation of 30S from 50S, thereby enabling subsequent binding of mRNA (Liveris et al., 1993); it is therefore required for the initiation of protein biosynthesis in bacteria. Lineage 3 strains carry a nonsynonymous SNP that removes the existing stop codon at codon position 202 and introduces a serine residue (Table 3.6B). This could lead to translation of infC continuing into the following intergenic region and potential fusion to the next encoded gene, rpmI. However, 27 nucleotides downstream from the removed stop codon is another in-frame stop codon, at position 1852903. Therefore infC
in Lineage 3 is 27 nucleotides longer, and the protein 9 amino acids longer, than in the rest of the MTBC. Again, this is unlikely to be harmful to the cell.
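The consequence of losing a stop codon can be worked out by scanning downstream, codon by codon, for the next in-frame stop. The sketch below assumes the input sequence is given on the coding strand and begins at the mutated (former) stop codon; the sequence used is invented to mimic the infC case and is not the real downstream sequence.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_codons(sequence_from_lost_stop: str) -> Optional[int]:
    """Number of extra codons translated, counting the mutated former stop
    codon, before the next in-frame stop. Returns None if no in-frame stop
    is found in the supplied sequence."""
    seq = sequence_from_lost_stop.upper()
    for start in range(0, len(seq) - 2, 3):
        if seq[start:start + 3] in STOP_CODONS:
            return start // 3
    return None


# Invented example: former stop now encoding serine (TCA), eight further
# sense codons, then an in-frame stop 27 nucleotides from the start.
toy_downstream = "TCA" + "GCT" * 8 + "TGA" + "CCC"
extra_codons = readthrough_codons(toy_downstream)
extra_nucleotides = None if extra_codons is None else extra_codons * 3
```

Returning None rather than a number when no stop is found makes it explicit that more downstream sequence is needed, for example when translation might run into the next gene.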
3.3.5.2 Length of protein truncation
The majority of SNPs that affect a stop codon introduce a termination codon within the coding sequence (36 nonsense SNPs). Whilst this has the potential to severely affect the function of the encoded protein, it was shown previously (section 3.3.5.1) that the position of the SNP within the gene should also be taken into account. Comparing the full-length ancestral protein sequence to the truncated protein revealed that truncations were distributed throughout the protein length (Figure 3.8A). There was only one example of more than one nonsense SNP within a gene: fadD15 harbours two nonsense SNPs (Y81* in Lineage 6 and W43* in Lineage 1; Table 3.6A), both of which would cause >85% loss of the protein length. The most extreme truncation, in yrbE3B (Rv1965), leads to a protein 96.3% shorter than the ancestral protein. Although yrbE3B encodes a protein of unknown function, it is highly similar to other membrane proteins, and forms part of one of the mammalian cell entry operons in M. tuberculosis (Mce3) (Cole et al., 1998). Overall, 14 SNPs (38.9% of all nonsense SNPs) cause the deletion of >50% of the ancestral amino acid sequence; such a deletion might be expected to have severe effects on the function of the gene product.
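The percentage of protein lost to a nonsense SNP follows directly from the codon at which the new stop appears. The sketch below assumes codon positions are 1-based and that the full-length protein length is known; the senX3 values come from the text, while the second example is hypothetical.

```python
def percent_truncated(protein_length: int, stop_codon_position: int) -> float:
    """Percentage of the full-length protein lost when a stop codon is
    introduced at the given 1-based codon position."""
    if not 1 <= stop_codon_position <= protein_length:
        raise ValueError("stop codon position must lie within the protein")
    residues_remaining = stop_codon_position - 1
    return 100 * (protein_length - residues_remaining) / protein_length


# senX3: R410* in a protein of 410 amino acids loses a single residue.
senx3_loss = percent_truncated(410, 410)

# Hypothetical protein of 100 residues with a stop introduced at codon 11.
toy_loss = percent_truncated(100, 11)
```

Expressing the truncation relative to the ancestral rather than the annotated reference length matters for genes where the reference annotation already reflects the truncation.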
It can be seen in Figure 3.8A that nine of the nonsense SNPs cause <1% of the protein to be truncated. Apart from senX3, which has one amino acid truncated and is described in the above section (3.3.5.1), the remaining eight genes affected by nonsense SNPs have 0% deletions. This is an artefact of basing the length of truncations on the annotations of H37Rv, a Lineage 4 strain. The analysis is therefore identifying proteins with premature stop codons introduced either on the Lineage 4 branch or on the more basal Modern lineage branch, which have then been integrated into the H37Rv annotations. Interestingly, in four cases the nonsense SNPs have created two open reading frames that have been annotated as separate genes in H37Rv: these genes are Rv0325-Rv0326, Rv1504c-Rv1503c, Rv3373-Rv3374 and Rv3898c-Rv3897c. Whilst these are effectively new genes in the respective lineages, all but one are annotated as encoding hypothetical proteins. The single exception is echA18 (Rv3373) and echA18.1 (Rv3374), which encode probable enoyl-CoA hydratases but were previously a single open reading frame (Figure 3.9).
The three SNPs that remove existing stop codons lead to proteins that are 104.3-563.6% greater in amino acid length compared to existing annotations (Figure 3.8B). This is based on the next in frame stop codon from the 3’ end of the annotated gene. infC has the smallest increase in length, and was described previously (section 3.3.5.1). The remaining two genes, lppF and Rv0257, increase by 110 (110.2% increase) and 104 (563.6%) amino acids.
[Figure 3.8 plots. A. x-axis: Percentage of protein truncated (0 to 100); genes, from top: Rv3898c, echA18, Rv2690c, Rv1870c, Rv1504c, Rv0836c, Rv0621, Rv0325, senX3, mmpL9, Rv2079, Rv0064, Rv2850c, Rv2818c, Rv0610c, Rv2132, Rv1251c, whiB3, Rv0368c, Rv0329c, sirR, ephF, Rv0906, Rv2797c, Rv3729, Rv3079c, mmpL1, Rv0574c, Rv0146, fadB5, Rv0457c, htpG, Rv2994, fadD15, fadD15, yrbE3B. B. x-axis: Percentage increase in protein length (0 to 600); genes: infC, lppF, Rv0257.]
Figure 3.8. Change in protein length due to nonsense SNPs. A. Distribution of protein truncations due to thirty-six nonsense SNPs causing premature stop codon introductions. Truncations expressed as percentage change based on H37Rv annotations. Note fadD15 shown twice due to two SNPs that introduce stop codons. Black bars indicate the deletion; grey bars are remaining protein. B. Percentage increase in protein length from three SNPs that remove existing stop codons. Striped bars indicate new protein sequence.
Figure 3.9. Gene creation by nonsense SNPs. echA18 (Rv3373) and echA18.1 (Rv3374) form a contiguous open reading frame in the ancient sequence, but introduction of a nonsense SNP in the modern branch led to the annotation of two genes in the reference H37Rv, and in all other modern lineage strains.
3.3.6 SNPs within genes associated with antibiotic resistance
Many drug resistance-conferring mutations have been identified in the MTBC and are held in the publicly available TBDReaMDB database (Sandgren et al., 2009). Identification of such mutations has been important in the development of molecular genotypic assays for drug resistance (Boehme et al., 2011; Hillemann et al., 2007). However, as shown in this study, many SNPs in the MTBC are phylogenetic markers for a lineage, and so it is important to understand the underlying phylogeny in order to distinguish SNPs within drug resistance-associated genes that are phylogenetic markers from those likely to cause drug resistance.
Using the above database, the lineage-specific SNPs were screened to identify SNPs within genes associated with drug resistance. In total, forty-six coding SNPs were identified, of which thirty-two were nonsynonymous and fourteen synonymous. Lineage-specific SNPs were found in genes associated with resistance to six of the nine antibiotics used in the treatment of tuberculosis: ethambutol (SNPs in 9 out of 13 associated genes), ethionamide (2 of 3), fluoroquinolones (2 of 2), isoniazid (11 of 23), rifampicin (1 of 2) and streptomycin (1 of 3) (Figure 3.10). A further two intergenic SNPs were in potential regulatory regions (<100bp from the translational start site) of the genes ahpC and rpoB, which are associated with isoniazid and rifampicin resistance respectively (Ramaswamy & Musser, 1998; Sherman et al., 1996). Whilst more SNPs in drug resistance-associated genes were found in the two M. africanum lineages (11 SNPs each), all lineages harboured at least one example (see Appendix C for details).
[Figure 3.10 plot. y-axis: Number of SNPs (0 to 20). x-axis: Isoniazid, Rifampicin, Ethambutol, Ethionamide, Streptomycin, Fluoroquinolones.]
Figure 3.10 Lineage-specific SNPs within genes associated with drug resistance. In total 46 SNPs were identified.
Whilst one of the genome sequences used to construct the whole genome phylogeny is from an extensively drug resistant (XDR) strain (Lineage 4 strain KZN 605), the design of this study excludes SNPs present in only one strain (singleton SNPs), and so none of the lineage-specific SNPs is directly involved in causing drug resistance. Interestingly, nine of the forty-six lineage-specific SNPs identified above were found within the TBDReaMDB database (19.6%), shown in Table 3.9. It is therefore likely that these lineage-specific SNPs have been incorrectly associated with drug resistance.
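Screening reported resistance mutations against the lineage-specific SNP set is essentially a lookup. The sketch below assumes both datasets have been reduced to (gene, amino acid change) pairs; the three lineage-specific entries are taken from Table 3.9, whereas the function name and the entry for "geneX" are hypothetical.

```python
def flag_phylogenetic_markers(reported_mutations, lineage_specific):
    """Return the reported resistance mutations that coincide with a
    lineage-specific SNP, mapped to the lineage they mark."""
    return {
        mutation: lineage_specific[mutation]
        for mutation in reported_mutations
        if mutation in lineage_specific
    }


# Subset of Table 3.9: (gene, mutation) -> lineage branch.
lineage_specific = {
    ("embR", "C110Y"): "Lineage 1",
    ("embC", "I270T"): "Modern",
    ("katG", "L463R"): "Lineage 4",
}

# Mutations reported in a resistance database; "geneX" is a made-up entry
# standing in for a mutation with no lineage-specific counterpart.
reported = [("embR", "C110Y"), ("katG", "L463R"), ("geneX", "A1V")]
markers = flag_phylogenetic_markers(reported, lineage_specific)
```

A mutation returned by this lookup is a candidate phylogenetic marker; it does not by itself show that the mutation has no effect on drug susceptibility.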
It can be seen at the top of Table 3.9 that a cysteine to tyrosine mutation (C110Y) within embR (Rv1267c) was reported by Srivastava et al. (2009). This SNP is present within all strains from Lineage 1. In that study, three genes implicated in ethambutol resistance (embB, embC and embR) were sequenced in 44 ethambutol resistant clinical strains isolated in India (Srivastava et al., 2009). The C110Y mutation was found within one of the study strains, which also had two mutations in embC (G288W and V303G). The C110Y mutation therefore identifies this strain as likely belonging to Lineage 1. Lineage 1 is not prevalent in the country from which the strains were isolated (Gagneux et al., 2006a; Gagneux & Small, 2007), which might account for there being only one instance of the SNP among the 44 strains in the study. Interestingly, Lineage 1 strains also harbour two more lineage-specific SNPs within genes involved in ethambutol resistance, one in embA (Rv3794), a P913S mutation, and another within embC (Rv3793), a N394D mutation, but these were not found in the study. However, embA was not sequenced, and the primers used to sequence embC did not extend beyond the 5’ 308bp region of embC that has sequence homology to the resistance-determining region (ERDR), and so missed the Lineage 1 SNP that is in the middle of the gene (Sreevatsan et al., 1997b; Srivastava et al., 2009). It is therefore not possible for the C110Y SNP to be involved directly in drug resistance to ethambutol.
The above study and others have identified the most common mutation reported in embC as being at codon 270 (I270T) (Srivastava et al., 2006; Srivastava et al., 2009). However, in this study the mutation was found to be a modern lineage SNP, and so is present within Lineages 2, 3 and 4. This would make the mutation highly prevalent in the study areas where the strains were isolated (Srivastava et al., 2006; Srivastava et al., 2009). The mutation is typically reported as the conversion of an existing threonine residue, but most studies use the reference strain H37Rv as the ancient allele, and therefore the direction of change is reported incorrectly; this agrees with the findings of Koser et al. (2011).
Table 3.9. Putative mutations found in drug resistance studies incorrectly associated with drug resistance. All SNPs are lineage-specific, and therefore phylogenetic markers of the respective lineages.
Lineage  Gene          Mutation  Drug resistance  Primary reference
1        Rv1267c embR  C110Y     ethambutol       Srivastava et al., 2009
3        Rv3264c manB  D152N     ethambutol       Ramaswamy et al., 2000
Modern   Rv3793 embC   I270T     ethambutol       Sreevatsan et al., 1997b; Srivastava et al., 2009
1        Rv3793 embC   N394D     ethambutol       Ramaswamy et al., 2000
3        Rv3793 embC   R738Q     ethambutol       Ramaswamy et al., 2000
1        Rv3794 embA   P913S     ethambutol       Ramaswamy et al., 2000
Modern   Rv3795 embB   A378E     ethambutol       Srivastava et al., 2006
4        Rv1908c katG  L463R     isoniazid        Heym et al., 1995
3        Rv2242        M323T     isoniazid        Ramaswamy et al., 2003
3.3.7 Conservation and removal of lineage-specific nonsynonymous SNPs
In the following section the extent to which nonsynonymous SNPs are removed from the lineages was analysed. The commonly used method of detecting selection by measuring the ratio of nonsynonymous nucleotide changes (dN) to synonymous nucleotide changes (dS) was applied to the lineage-specific SNPs (see section 3.2.6). A dN/dS >1 indicates positive selection, <1 indicates purifying selection, and a ratio at or close to 1 is regarded as neutral, or a balance of the two former selective forces. The rate of nonsynonymous SNP accumulation was first compared across the six lineages. The relatively low number of SNPs within the MTBC made calculation of dN/dS for individual genes of questionable value, and impossible for the 2,459 (61.2%) genes with no lineage-specific SNPs. As an alternative approach the dN/dS ratio was calculated using gene concatenates based firstly on all genes, then on gene essentiality and functional categories.
The mean dN/dS for the lineages was 0.67 (range 0.54-0.79), corresponding to nearly two thirds (64.8%) of SNPs causing a change to the encoded amino acid (Table 3.10). This finding is consistent with the average dN/dS based on all SNPs identified in 21 MTBC genome sequences (dN/dS=0.59), and on the sequencing of 89 genes from 108 MTBC strains (dN/dS=0.57) (Comas et al., 2010; Hershberg et al., 2008). If the lineages are grouped into the ancient and modern categories, the mean dN/dS was 0.61 and 0.72 respectively; whilst a higher rate of nonsynonymous SNP accumulation was found in the modern lineages, the difference between the two is not significant (Mann Whitney U test, p=0.2118). High dN/dS ratios are often considered to indicate a reduction in purifying selection (He et al., 2010; Hershberg et al., 2008; Holt et al., 2008), which would suggest here that all lineages are experiencing the same weak purifying selection. Alternatively, signals of weak purifying selection may be due to the close relatedness of the MTBC strains. Rocha et al. (2006) have shown that dN/dS is often higher when the organisms compared are closely related. dN/dS therefore becomes dependent on time, because there is a lag before deleterious nonsynonymous mutations are removed by purifying selection, which elevates dN/dS over short timescales.
To test how the frequencies of nonsynonymous SNPs vary over different timescales, the ratio of nonsynonymous to synonymous SNPs was compared in different branches of the phylogenetic tree. No significant difference was found between the SNP ratio of the lineage branches and that of the external branches, which include SNPs from the twenty-eight extant strains used in the phylogeny (Mann Whitney U test, p = 0.1033). The mean lineage branch ratio was 1.9, whilst that of the external branches was 1.7 (Appendix D), suggesting that nonsynonymous SNP accumulation in the MTBC is a consistent feature irrespective of time.
Table 3.10. The rate of nonsynonymous SNP accumulation across the lineages. The dN/dS ratio was used, which measures the accumulation of nonsynonymous SNPs against the background rate of synonymous SNPs.
                              Lineage 5   Lineage 6   Lineage 1   Lineage 4   Lineage 2   Lineage 3   Modern
Nonsynonymous SNP             385         374         238         96          74          182         172
Synonymous SNP                213         206         156         46          33          117         78
Nonsynonymous positions (N)   2968425     2968425     2968425     2968425     2968425     2968425     2968425
Synonymous positions (S)      1052024     1052024     1052024     1052024     1052024     1052024     1052024
dN rate                       0.000130    0.000126    0.000081    0.000032    0.000025    0.000061    0.000058
dS rate                       0.000202    0.000196    0.000148    0.000044    0.000031    0.000111    0.000074
dN/dS                         0.64        0.64        0.54        0.74        0.79        0.55        0.78
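The values in Table 3.10 can be reproduced with a short calculation. The following is a minimal sketch, assuming that dN and dS are taken as uncorrected per-site proportions (SNP count divided by the number of potential positions); the full method is described in 3.2.6, and the function name `dn_ds` is introduced here for illustration only.

```python
def dn_ds(nonsyn_snps, syn_snps, nonsyn_sites, syn_sites):
    """Return (dN, dS, dN/dS) using uncorrected per-site proportions.

    Assumption: no multiple-hit correction is applied, which is reasonable
    only because the proportions in Table 3.10 are very small.
    """
    if syn_snps == 0:
        raise ValueError("dS is zero, so dN/dS is undefined")
    dn = nonsyn_snps / nonsyn_sites
    ds = syn_snps / syn_sites
    return dn, ds, dn / ds


# Potential nonsynonymous (N) and synonymous (S) positions from Table 3.10.
N_SITES, S_SITES = 2968425, 1052024

# SNP counts from the first and last data columns of Table 3.10.
first = dn_ds(385, 213, N_SITES, S_SITES)
last = dn_ds(172, 78, N_SITES, S_SITES)

print(round(first[2], 2), round(last[2], 2))  # 0.64 0.78
```

Because every concatenate in Table 3.10 shares the same N and S, differences in dN/dS between columns are driven entirely by the SNP counts.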
3.3.7.1 Nonsynonymous SNPs within essential genes
The previous method was based on total sequence concatenates, which is a blunt method for detecting selection because it is likely to average both purifying and potential positive selection in the sequences. Further concatenates were therefore generated based on biologically relevant categories. Firstly, genes were grouped by those shown to be essential for growth by transposon mutagenesis (Sassetti et al., 2003; Sassetti & Rubin, 2003). Based on findings in other bacteria and on evolutionary theory, fewer nonsynonymous SNPs would be expected to accumulate within genes that are essential to the cell (Jordan et al., 2002). There were 335 (14.0%) nonsynonymous SNPs and 212 (8.9%) synonymous SNPs within essential genes, leaving the remaining 1215 (50.8%) nonsynonymous and 630 (26.3%) synonymous SNPs within nonessential genes. After adjusting for differences in the nucleotide length of the two categories using the number of potential nonsynonymous SNP positions, significantly fewer nonsynonymous SNPs were found within essential genes (χ2 test, p=0.0011). The average dN/dS for essential genes was also lower than for nonessential genes (0.56 and 0.68 respectively), indicating that essential genes are more conserved than nonessential genes.
3.3.7.2 Nonsynonymous SNPs within functional gene categories
Gene concatenates were next generated for all gene functional categories based on the Tuberculist database. Seven categories were tested: 1. information pathways, 2. intermediary metabolism and respiration, 3. lipid metabolism, 4. cell wall and cell processes, 5. conserved hypotheticals, 6. virulence, detoxification and adaptation and 7. regulatory proteins (Lew et al., 2011). In Figure 3.11A, the dN/dS ratios across these categories are shown. A nonparametric one-way analysis of variance of the dN/dS for each lineage and functional category found an uneven distribution (Kruskal-Wallis test, p=0.0084). Following multiple testing correction, the dN/dS of the information pathways and regulatory protein categories was found to be significantly different (Dunn's Multiple Comparison Test, p<0.05). The information pathways class might be expected to have the lowest number of nonsynonymous SNPs due to the critical function of these genes within the cell, such as in DNA replication and repair. This was confirmed by comparison of the percentage of essential genes per functional category to the dN/dS ratio, which found a significant negative correlation (Spearman r = -0.8929, p = 0.0123) (Figure 3.11B).
Whilst there was evidence that the strength of purifying selection varied between gene functional categories, only genes within the regulatory category showed strong signs of positive selection in multiple lineages (mean dN/dS = 1.16) (Table 3.11). Stratifying the regulatory functional category by lineage, the dN/dS was > 1 in Lineages 3, 4, 5 and 6. Focusing on this category, 84 regulatory proteins harboured 132 lineage-specific SNPs, of which 101 were nonsynonymous and 31 synonymous. This corresponds to a nonsynonymous to synonymous ratio of 3.3, compared to the mean ratio of 1.9 found across all functional categories. Potential positive selection (dN/dS >1) was also seen in the intermediary metabolism and respiration category for Lineage 2 only (dN/dS=1.49), and in lipid metabolism for Lineage 2 (dN/dS=1.13) and the modern lineage branch (dN/dS=1.39) (Figure 3.11A).
[Figure 3.11, panels A and B. Panel B plots the percentage of essential genes in each functional category (y-axis, 0-100) against dN/dS (x-axis, 0.0-1.5); the labelled points are information pathways and regulatory proteins.]
Figure 3.11. The rate of nonsynonymous SNP accumulation by functional category. A. Lineage dN/dS by functional category. Lineages coloured as previously and bars represent mean dN/dS. Information pathways dN/dS significantly lower than regulatory proteins (one-way ANOVA with Dunn’s post-hoc test, p <0.05). B. Correlation between essential genes per functional category (as percentages) and dN/dS. Spearman r = -0.8929, p = 0.0123.
Table 3.11. The rate of nonsynonymous SNP accumulation in each functional category. The nonsynonymous/synonymous ratio and the dN/dS ratio are shown. N = all possible nonsynonymous positions, S = all possible synonymous positions.
Functional category                        Nonsynonymous SNPs   Synonymous SNPs   Nonsynonymous/synonymous   N        S        dN/dS
information pathways                       96                   73                1.3                        202427   70831    0.46
lipid metabolism                           168                  99                1.7                        294422   102505   0.59
intermediary metabolism and respiration    394                  237               1.7                        765073   268496   0.58
cell wall and cell processes               377                  197               1.9                        595063   214706   0.69
conserved hypotheticals                    344                  175               2.0                        594619   209012   0.69
virulence, detoxification, adaptation      64                   29                2.2                        106294   38498    0.80
regulatory proteins                        101                  31                3.3                        123975   44208    1.16
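The two ratio columns of Table 3.11 follow directly from the SNP counts and the numbers of potential positions. The sketch below recomputes them and ranks the categories by dN/dS. It assumes the same uncorrected per-site proportions as in Table 3.10, and the names `TABLE_3_11` and `summarise` are illustrative only.

```python
# (category, nonsynonymous SNPs, synonymous SNPs, N positions, S positions),
# transcribed from Table 3.11.
TABLE_3_11 = [
    ("information pathways", 96, 73, 202427, 70831),
    ("lipid metabolism", 168, 99, 294422, 102505),
    ("intermediary metabolism and respiration", 394, 237, 765073, 268496),
    ("cell wall and cell processes", 377, 197, 595063, 214706),
    ("conserved hypotheticals", 344, 175, 594619, 209012),
    ("virulence, detoxification, adaptation", 64, 29, 106294, 38498),
    ("regulatory proteins", 101, 31, 123975, 44208),
]


def summarise(rows):
    """Map each category to (nonsynonymous/synonymous ratio, dN/dS)."""
    summary = {}
    for name, n_snps, s_snps, n_sites, s_sites in rows:
        count_ratio = n_snps / s_snps
        dn_ds = (n_snps / n_sites) / (s_snps / s_sites)
        summary[name] = (count_ratio, dn_ds)
    return summary


summary = summarise(TABLE_3_11)
# Categories ordered from the lowest to the highest dN/dS.
ranked = sorted(summary, key=lambda name: summary[name][1])

for name in ranked:
    count_ratio, dn_ds = summary[name]
    print(f"{name}: ratio={count_ratio:.1f}, dN/dS={dn_ds:.2f}")
```

The raw count ratio and dN/dS rank the categories in a similar order because N/S is close to 2.8 for every category, so the normalisation rescales all rows by nearly the same factor.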
3.4 Discussion
3.4.1 Strengths and limitations of this study
This study used recently published MTBC genomes sequenced by high-throughput sequencing technology to identify for the first time all SNPs that contribute to the background genetic variation within the six lineages of the MTBC. At the time of this study about thirty globally representative strains from all of the lineages had been sequenced and the genomes made publicly available. It is likely that a discovery bias exists within this small genome set, as illustrated in Figure 3.2 where it was seen that the lineages with the most genome sequences (Lineages 1, 2 and 4) had the greatest within-lineage diversity. Lineages 5 and 6 each had only two genome sequences available for use in this study. However, this study was designed to capture variation within the internal basal branches of each lineage through exploitation of the clonal population structure of the MTBC, and this should circumvent any discovery bias. Theoretically, as back mutations are rare in the MTBC, genome sequences from two strains belonging to the same lineage would capture all lineage-specific SNPs for the respective lineage, and additional genomes will only serve to reduce the branch length and so the number of lineage-specific SNPs. Finally, twenty-one of the genomes used to construct the genome phylogeny were selected from a wider collection of 875 strains characterised previously by the analysis of deletions across the genome (Comas et al., 2010; Gagneux et al., 2006a; Hershberg et al., 2008). Therefore, whilst it is expected that future studies will sequence ever-greater numbers of MTBC strains, the lineage-specific SNPs identified in this study are expected to be robust.
Removal of SNPs found within repetitive regions, such as in phages and the PE and PPE gene families, will likely have resulted in the loss of potentially important variation within the MTBC lineages. The pe genes are characterised by the presence of a proline-glutamic acid (PE) motif, whilst ppe genes contain a proline-proline-glutamic acid (PPE) motif; both families are highly variable in size and their C-terminal regions are extensively repetitive (Cole et al., 1998). Excluded regions total ~10% of the coding genome, and recently it has been shown that the large pe and ppe families harbour about 3-fold higher frequency of nonsynonymous SNPs compared to non-pe/ppe genes (McEvoy et al., 2012), which would suggest that a pool of lineage-specific variation might have been missed in this study. It was necessary to remove SNPs identified in these regions due to the inherent difficulties encountered in sequencing through repetitive regions using second generation short read technologies, such as the Illumina platform used to sequence the strains in this study. SNPs were detected in these regions in the lineage branches, but they would need to be confirmed by methods beyond the scope of this study. This is a common disadvantage of current short read sequencing technology (Loman et al., 2012), and developments in sequencing technology with longer read lengths will likely remove this limitation (Branton et al., 2008).
3.4.2 General characteristics of lineage-specific diversity
Prior to identification of the lineage-specific SNPs, a 28-genome phylogeny was built using a non-redundant set of variable nucleotide positions derived from the genome sequences. The phylogeny was largely derived from genome sequences published previously (Comas et al., 2010), supplemented by other recently published genome sequences available in the EBI SRA. An additional strain (N0031), known to be a rare Lineage 2 strain based on a previous MLSA study, was sequenced for this project to widen diversity in this lineage (Hershberg et al., 2008). The topology of the resulting phylogeny was highly congruent with other MTBC phylogenies based on SNPs and other markers, such as deletions, further highlighting the clonal population structure of the MTBC (Comas et al., 2010; Gagneux et al., 2006a).
In total 2,794 lineage-specific SNPs were identified, with each lineage differing by an average of 400 SNPs. The ancient lineages (Lineages 1, 5 and 6) harboured the most lineage-specific SNPs, which is likely a reflection of the greater time that these lineages have had to accumulate mutations. On average, two-thirds of all coding SNPs were nonsynonymous and therefore cause a change in the encoded amino acid. This is a feature of the MTBC, and has been previously identified at the genome level (Fleischmann et al., 2002; Hershberg et al., 2008). Nonsynonymous SNPs are more likely than synonymous SNPs to have a functional effect, which raises the possibility that this variation will have functional consequences in the respective MTBC lineages.
The ability to isolate the total background SNP variation that contributes to the diversity of all strains from a particular lineage (lineage-specific SNPs) was fundamental to this study. This was only possible due to the negligible level of recombination seen in the MTBC (Liu et al., 2006), and because back mutations are rarely observed (Casali et al., 2012). Therefore a SNP in the parental strain becomes a defining marker for the rest of the progeny. Homoplasic nucleotide positions, in which a SNP cannot be explained without convergence when mapped onto the tree, have previously been reported to be rare in the MTBC and are typically found only in cases of drug resistance or compensatory mutations (Casali et al., 2012; Comas et al., 2011). Similar examples have been found in other bacterial studies, such as the sequencing of MRSA strains, where the authors found few homoplasic SNPs, and those that were identified corresponded to mutations conferring antibiotic resistance (Harris et al., 2010). In this study only four homoplasic SNPs were found (0.14% of all lineage-specific SNPs), in which lineage-specific SNPs with the same nucleotide change were present in more than one lineage (Table 3.4). Independent fixation of SNPs across multiple lineages could represent signals of selective pressure acting on these positions, an interpretation strengthened by the distribution of the four SNPs, which cluster within two genes and lie within a few nucleotides of each other. Whilst these may have biological significance, the respective genes are not associated with drug resistance. Further work would be needed to confirm these SNPs and to understand whether they have a biological function.
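The definition of a homoplasic SNP used above (the same nucleotide change at the same position in more than one lineage branch) lends itself to a simple set-based check. The following is a minimal sketch under that definition; the function name, the tuple layout and the toy positions are hypothetical and are not taken from Table 3.4.

```python
from collections import defaultdict


def find_homoplasies(lineage_snps):
    """Return SNPs that appear in the lineage-specific set of more than one lineage.

    lineage_snps maps a lineage name to a set of
    (genome_position, ancestral_base, derived_base) tuples.
    """
    lineages_by_snp = defaultdict(set)
    for lineage, snps in lineage_snps.items():
        for snp in snps:
            lineages_by_snp[snp].add(lineage)
    return {
        snp: sorted(lineages)
        for snp, lineages in lineages_by_snp.items()
        if len(lineages) > 1
    }


# Invented toy data for illustration only.
toy = {
    "Lineage 1": {(1000, "C", "T"), (2500, "G", "A")},
    "Lineage 2": {(3100, "A", "G")},
    "Lineage 5": {(2500, "G", "A"), (4200, "T", "C")},
}

print(find_homoplasies(toy))
```

Keying on the full (position, ancestral, derived) tuple means that two lineages mutating the same position to different bases are not reported, which matches the stricter definition given in the text.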
The lineage-specific SNPs can also be exploited in SNP typing assays to genotype strains at either the lineage or the sub-lineage level (Bergval et al., 2012; Kahla et al., 2011; Stucki et al., 2012). SNP typing is suggested to be the new gold standard of phylogenetic classification of MTBC (Comas et al., 2009), and the majority of the SNPs identified in the lineage branches in this study, excluding the above homoplasies, would be applicable to such typing assays. At the epidemiological level, genotyping of strains has also been driven by the need for rapid tests to identify drug resistant strains. Resistance to the first-line TB drugs rifampicin and isoniazid (multidrug resistant tuberculosis or MDR-TB), and now also to some second-line drugs (extensively drug resistant tuberculosis or XDR-TB), has led to a growth in molecular genotypic drug susceptibility testing, such as the Genotype MTBDRplus (Hain Lifescience) and Xpert MTB/RIF (Cepheid) (Boehme et al., 2011; Hillemann et al., 2007; McNerney et al., 2012). Several SNPs were identified within drug resistance-associated genes that are not themselves associated with drug resistance, but act as evolutionary markers (Table 3.9). Some mutations reported in previous studies as highly prevalent within drug resistant strains have been shown here to be lineage-specific markers. Other studies have also questioned some associations of SNPs with drug resistance. A significant association between a SNP within Rv2629 and rifampicin resistance was found based on a study of over 100 rifampicin resistant strains (Wang et al., 2007), but this was subsequently shown to be a phylogenetic marker of Lineage 2, specifically of the Beijing group of strains (Homolka et al., 2009). Similar analyses of inhA SNPs linked to isoniazid resistance, and of embC SNPs linked to ethambutol resistance, have shown that these are instead phylogenetic markers and unlikely to be the cause of drug resistance (Projahn et al., 2011; Ramaswamy et al., 2000). From the perspectives of typing strains for evolutionary analysis, and linking genotype to phenotype to identify potential molecular causes of drug resistance, it is clear that an understanding of the underlying phylogenetic structure of the MTBC is critical.
Whilst several lineage-specific SNPs within genes associated with drug resistance are unlikely to be direct causes of resistance, some could play an indirect role in modulating the fitness cost of drug resistance mutations. It has been shown that strains from different lineages but with identical rifampicin resistance mutations show different levels of fitness cost (Gagneux et al., 2006b). In a wider context, the Beijing family of strains within Lineage 2 is often associated with drug resistance (Borrell & Gagneux, 2009; Parwati et al., 2010). It has therefore been suggested that strain genetic background plays a role in the spread of drug resistant strains (Muller et al., 2013), although the actual molecular mechanisms are currently unknown. Pre-existing mutations in genes associated with drug resistance, such as the lineage-specific SNPs found in this study, may increase the tolerance of the cell to future drug resistance mutations through higher baseline fitness, or through epistatic interactions between the genetic background of the strain and drug resistance mutations (Muller et al., 2013).
3.4.3 Insights into the evolution of M. tuberculosis lineages
It has been hypothesised that, due to historical human migrations and serial transmission bottlenecks arising from the low infectious dose of tuberculosis, the MTBC has a small effective population size (Hershberg et al., 2008). This phenomenon can lead to increased random genetic drift relative to natural selection, limiting the removal of potentially functional mutations (Smith et al., 2006a). As discussed above, about two-thirds of all coding SNPs cause a change in the encoded amino acid; however, SNPs that introduce a stop codon or alter an existing one are highly likely to cause a loss of function. Although rare (1.3% of all lineage-specific SNPs), thirty-five lineage-specific pseudogenes were identified due to the introduction of stop codons in the lineage branches. These genes may have been allowed to lose their function either due to the genome-wide loss of selective constraint in the MTBC, or because selection may have been relaxed during adaptation to a new niche in the respective lineages. The former hypothesis is more likely, however, as neither the frequency of pseudogene creation nor the functional category of the affected genes differed between lineages. Furthermore, most genes were conserved hypotheticals and all but one were nonessential for growth; the exception was senX3, but the nonsense SNP in Lineage 6 resulted in a modest loss of one amino acid, which is unlikely to affect function. The annotated H37Rv genome sequence contains thirteen pseudogenes (Lew et al., 2011), and it is likely that all of these pseudogenes are the result of random drift and will eventually be removed by deletions, leaving a more tightly packed and reduced genome.
With so little variation in the MTBC it is not currently possible to measure selection in each gene, although future whole genome studies employing hundreds to thousands of MTBC genomes may enable this. One approach to analysing selection in DNA sequence data is the dN/dS ratio, which provides a measure of the accumulation of nonsynonymous SNPs against the background of synonymous SNPs, which are assumed to be silent. The dN/dS measure has been applied to many bacterial species to understand their evolutionary histories, including Salmonella typhi (Roumagnac et al., 2006), Clostridium difficile (He et al., 2010) and previously the MTBC (Hershberg et al., 2008). However, the method was originally developed for the analysis of genetic sequences from divergent species (Kimura, 1977), and it has recently been suggested that it is inappropriate for the analysis of variation within a population (Kryazhimskiy & Plotkin, 2008). The problem with such comparisons is the potentially short timescales involved, whereby slightly deleterious mutations that have not yet been removed by selection cannot be separated from substitutions that are fixed in the population (Rocha et al., 2006); this has been shown to lead to high dN/dS values for closely related bacteria, often approaching 1 (Rocha et al., 2006). If this is the case in the MTBC, the external branches of the phylogeny, which include SNPs from the extant strains, might be expected to harbour more nonsynonymous SNPs than the lineage branches that were the focus of this study, because the proportion of nonsynonymous mutations would be expected to decrease over time as they are purged by purifying selection. In this study, the ratio of nonsynonymous to synonymous SNPs did not differ between the external tips of the tree and the lineage branches (1.9 in the lineage branches and 1.7 in the external branches). This is in agreement with other studies (Hershberg et al., 2008), and together these results show that nonsynonymous SNPs are not more intensely purged than synonymous SNPs, which would suggest that the high dN/dS is not due to the close relatedness of the strains.
Previous studies of MTBC variation found genome-wide dN/dS values of 0.57 (Hershberg et al., 2008) and 0.60 (Comas et al., 2010). These suggest strongly reduced purifying selection acting within the MTBC. It has been suggested that this reduced selection is caused by the small effective population size of the MTBC, which is a consequence of the clonality of the MTBC and serial population bottlenecks during transmission of TB (Hershberg et al., 2008; Smith et al., 2006a). The mean dN/dS for the lineage branches found in this study was 0.67, with no significant difference in the overall dN/dS per lineage. The lack of significant differences between the lineages suggests that the hypothesised reduction in purifying selection is a general feature across the lineages. When all genes were categorised by essentiality, however, the effects of purifying selection could still be detected in the MTBC, with significantly fewer nonsynonymous SNPs in essential genes. Furthermore, when genes were split by annotated function, the category with the most critical function to the cell had the lowest dN/dS. This information pathways category consists of genes involved in critical cellular functions, including the transcriptional and translational machinery. At the other end of the spectrum, the regulatory gene category had the greatest accumulation of nonsynonymous SNPs; four of the lineages (Lineages 3, 4, 5 and 6) had dN/dS ratios >1, indicating potential positive selection within this class, with three to five times more nonsynonymous than synonymous SNPs.
High frequencies of nonsynonymous SNPs in regulatory genes have been detected previously. In 2011, Schürch et al. sequenced several isolates from the Beijing family of the MTBC, a subgroup of Lineage 2, and found overrepresentation of nonsynonymous SNPs in the regulatory and associated signal transduction pathways (Schürch et al., 2011). As previously discussed, gene concatenates were used in this study, which has the disadvantage of averaging the selective forces acting on the sequences and thereby providing only a summary of the pressure acting on them; it is not possible to identify individual genes potentially under positive selection. Furthermore, when the frequency of SNPs clustering within genes was analysed, no gene harboured a rate that deviated from the expected Poisson distribution. This suggests that specific genes in the regulatory category are not highly variable, but that the whole category is accumulating the greatest ratio of nonsynonymous SNPs, which in turn may affect the regulatory networks of the respective lineages. Overall, this has shown that the loss of selective constraint is a common feature of all lineages, and functional genetic diversity is anticipated, specifically due to the high number of amino acid changing SNPs.
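The comparison of per-gene SNP counts with a Poisson expectation, mentioned above, can be illustrated as follows. This is a sketch of one plausible formulation and not the procedure used in this study: it assumes a single genome-wide SNP rate per nucleotide, an expected count proportional to gene length, an upper-tail test and a Bonferroni correction. The gene names and counts are invented.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf_below_k = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )
    return max(0.0, 1.0 - cdf_below_k)


def clustered_genes(genes, alpha=0.05):
    """Return genes whose SNP count exceeds the Poisson expectation.

    genes maps a gene name to (length_nt, snp_count). The expected count
    for each gene is the pooled SNP rate multiplied by its length.
    """
    total_length = sum(length for length, _ in genes.values())
    total_snps = sum(count for _, count in genes.values())
    rate = total_snps / total_length
    threshold = alpha / len(genes)  # Bonferroni-corrected significance level
    return [
        name
        for name, (length, count) in genes.items()
        if poisson_upper_tail(count, rate * length) < threshold
    ]


# Invented toy data: geneD carries far more SNPs than its length predicts.
toy_genes = {
    "geneA": (1000, 1),
    "geneB": (1200, 1),
    "geneC": (900, 0),
    "geneD": (1000, 12),
}

print(clustered_genes(toy_genes))  # ['geneD']
```

In the toy data the outlier itself inflates the pooled rate, which makes the test conservative; an empty result, as reported for the lineage-specific SNPs, indicates that no gene stands out from the background rate.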
Chapter 4 In silico prediction of functional Single Nucleotide Polymorphisms
4.1 Introduction
Current knowledge of the effect of genetic variation in the M. tuberculosis complex (MTBC) is limited, but it has been suggested that much of the genetic variation in the MTBC will have functional consequences due to a reduction in purifying selection (Hershberg et al., 2008). This concept was further investigated by Hershberg et al. through comparison of the rates of nonsynonymous SNPs, and therefore amino acid changes, within conserved amino acid positions between the MTBC and M. canettii (Hershberg et al., 2008). Positions were classified as conserved based on the gene sequences of all other mycobacterial species. Reduced selection would be detected by a difference in the number of amino acid changes falling in conserved and variable sites between M. canettii and the MTBC. This was found to be the case, with nonsynonymous SNPs falling in conserved amino acid positions 27% of the time in M. canettii, but at just over double that frequency (58%) in the MTBC.
While underscoring the reduced selective constraint in the MTBC, this also raises the possibility that much of the genetic variation could have a functional impact. Nonsynonymous SNPs have the potential to affect gene expression or the function of the encoded protein, which can have a range of phenotypic consequences for the cell. Most nonsynonymous SNPs are deleterious and eventually removed through the process of purifying selection (Balbi & Feil, 2007), but as demonstrated in this and other studies, the capacity to remove such SNPs is diminished in the MTBC due to low levels of purifying selection. This raises the question of how many and which nonsynonymous SNPs actually have a functional consequence. Based on an extrapolation of the aforementioned MLSA dataset, the actual number of functional SNPs was estimated. Specifically, the decreased number of nonsynonymous SNPs falling in conserved positions in M. canettii was used to estimate the number of nonsynonymous SNPs that would have been removed in the MTBC if purifying selection were similar to that of M. canettii, or any other Actinobacteria. It was suggested that about 40% of the amino acid changes in the MTBC would result in functional consequences, and if the small gene set was unbiased, genome-wide this translates to about 300 functional SNPs per average pairwise comparison of MTBC strains; strains that diverged more recently would have few functional SNPs, whilst the most divergent strain comparisons would have up to 500 functional SNPs (Hershberg et al., 2008).
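A back-of-the-envelope calculation shows how the two percentages quoted above (58% and 27%) can lead to a figure of about 40%. The reasoning below is an assumption made for illustration and is not necessarily the calculation performed by Hershberg et al. (2008): changes at variable positions are taken to be unaffected by selection, and changes at conserved positions are scaled down until they make up the same share as in M. canettii.

```python
def excess_conserved_fraction(p_cons_relaxed, p_cons_reference):
    """Fraction of all amino acid changes that would be removed.

    p_cons_relaxed:   share of changes at conserved positions in the group
                      under relaxed selection (here the MTBC, 0.58).
    p_cons_reference: the same share in the reference group
                      (here M. canettii, 0.27).
    Assumption: the number of changes at variable positions stays fixed.
    """
    variable = 1.0 - p_cons_relaxed
    # Conserved-site changes expected if the conserved share matched the reference.
    expected_conserved = p_cons_reference * variable / (1.0 - p_cons_reference)
    return p_cons_relaxed - expected_conserved


estimate = excess_conserved_fraction(0.58, 0.27)
print(round(estimate, 2))  # 0.42
```

Under these assumptions roughly 42% of MTBC amino acid changes sit at conserved positions where they would have been purged, which is of the same order as the published estimate of about 40%.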
Whilst the study represented the most complete analysis of genetic diversity at the time, the MLSA approach assays variation within a small sample of the genome. Whole genome sequencing datasets enable this hypothesis to be tested without the risk of gene selection bias, and critically all of the predicted functional SNPs can be identified for the first time. The focus here is on the nonsynonymous SNPs identified in Chapter 3. This is the dominant SNP type identified in the MTBC, and it is more amenable to in silico prediction because it causes an amino acid change, the effect of which can be measured by the methods described below.
The main body of research into predicting the effects of nonsynonymous SNPs has been undertaken in eukaryotic systems, specifically in human genetic studies (Ng & Henikoff, 2006). SNPs constitute about 90% of human protein sequence variability (Collins et al., 1998), and the importance of nonsynonymous SNPs in humans is illustrated by the database of disease-causing variants, the Human Gene Mutation Database (HGMD) (Stenson et al., 2012). In this database, nonsynonymous SNPs make up about half of the genetic variants that are known to cause disease (Stenson et al., 2012). In silico methods fall into two main groups, based on either sequence or structural information, and some hybrid methods now exist that use a mix of the two approaches (Thusberg & Vihinen, 2009). The overarching basis of all amino acid substitution based predictions is the evidence that mutations which affect protein function tend to occur at evolutionarily conserved positions, suggesting that predictions could be based on sequence homology (Miller & Kumar, 2001). It was also found that disease-causing mutations had common structural features that distinguish them from neutral SNPs, suggesting that structural features could also be used in predictions (Sunyaev et al., 2000; Wang & Moult, 2001). In 2001, Wang & Moult used the human SNPdb database to model disease-causing mutations onto their respective wild-type protein structures and found that 83% of disease-causing mutations affect protein stability. These key studies spawned the development of algorithms to differentiate between functional and neutral SNPs. Some are based on sequence homology, such as SIFT (Ng & Henikoff, 2003) and PANTHER (Thomas et al., 2003), whilst others use structural features, such as TopoSNP (Stitziel et al., 2004). As described, some combine many predictive features, and one example is the prediction method PolyPhen (Ramensky et al., 2002).
4.1.1 Aims
The work presented in this chapter is a comprehensive genome-wide prediction and characterisation of MTBC lineage-specific nonsynonymous SNPs. The specific aims were to:
• computationally predict functional nonsynonymous SNPs.
• gain insight into the impact of functional SNPs across the lineages.
• generate a focused SNP set that can be followed in experimental systems.
4.2 Materials and Methods
4.2.1 SIFT
Prediction of nonsynonymous SNPs likely to affect protein function was performed using the Sorting Intolerant From Tolerant (SIFT) algorithm (Ng & Henikoff, 2003). SIFT version 4.0.2 (downloaded February 2010) was installed as a stand-alone version on a Linux server. A custom bash routine was written to analyse all SNPs in several batches.
The SIFT prediction is based on sequence conservation and the type of amino acid change. Briefly, SIFT looks for homologs of the gene of interest in other bacteria and 1) scores the conservation of the positions where mutations are found, and 2) weights this score by the nature of the amino acid change. These measures are incorporated into a normalised probability score, with scores ≤ 0.05 indicating a functional SNP prediction. The classification threshold was previously optimised for performance on a data set comprising 55 LacI-related sequences, including paralogs (Ng & Henikoff, 2001). Furthermore, if fewer than three sequences were aligned over the SNP position, the prediction was excluded.
A further conservation measure was used to avoid predictions based on alignments of sequences that are too similar to one another, which would contaminate the multiple sequence alignment and bias SIFT towards predicting more functional SNPs. The recommended conservation score threshold of <3.5 was used, thereby filtering out those genes and associated predictions above this threshold. As a bacterial database from which to generate the protein sequence alignment, all publicly available mycobacterial genome sequences outside of the M. tuberculosis complex (MTBC) were used. Predictions were therefore based on mycobacterial homologs, but not on species that are evolutionarily too close to the query sequences, which could again contaminate the alignment with sequences likely to harbour the SNP allele to be tested. The non-MTBC database consisted of thirteen complete mycobacterial genomes, shown in Figure 4.1 and Table 4.1.
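The three filters described above (score cutoff, minimum alignment depth and conservation threshold) can be summarised in a short sketch. The class and field names below are hypothetical and do not correspond to the column names of the SIFT output, and treating a conservation value of exactly 3.5 as excluded is an assumption, since the text only states that a <3.5 threshold was used.

```python
from dataclasses import dataclass

SCORE_CUTOFF = 0.05       # scores <= 0.05 are predicted to be functional
MIN_ALIGNMENT_DEPTH = 3   # fewer aligned sequences at the position: excluded
MAX_CONSERVATION = 3.5    # conservation score must be below this value


@dataclass
class SiftResult:
    snp_id: str
    score: float
    seqs_at_position: int
    conservation: float


def classify(result):
    """Return 'functional', 'tolerated' or 'excluded' for one SIFT prediction."""
    if result.seqs_at_position < MIN_ALIGNMENT_DEPTH:
        return "excluded"
    if result.conservation >= MAX_CONSERVATION:
        return "excluded"
    return "functional" if result.score <= SCORE_CUTOFF else "tolerated"


# Invented example predictions for illustration only.
examples = [
    SiftResult("snp_a", 0.01, 10, 3.0),
    SiftResult("snp_b", 0.20, 10, 3.0),
    SiftResult("snp_c", 0.01, 2, 3.0),
    SiftResult("snp_d", 0.01, 10, 3.8),
]
print([classify(r) for r in examples])
```

Applying the two exclusion filters before the score cutoff means that a low score is never reported as functional when the underlying alignment is too shallow or too conserved to support it.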
[Figure 4.1: phylogenetic tree of the thirteen mycobacterial genomes listed in Table 4.1, rooted on Nocardia farcinica (IFM 10152); bootstrap support values shown on branches; scale bar 50.]
Figure 4.1. SIFT database phylogeny. BLAST database constructed for SIFT. Neighbour-Joining phylogeny based on concatenated 16S rRNA and rpoB nucleotide sequences from the thirteen available mycobacterial genomes. Node support after 1000 bootstrap repetitions shown on branches. Scale bar indicates number of SNPs. The tree is rooted using the outgroup Nocardia farcinica. The MTBC was not included to prevent contamination of the predictions by closely related sequences; if present the MTBC would diverge from M. leprae.
Table 4.1. SIFT database of non-MTBC species. Thirteen complete whole genome sequences were published at time of this study. Genomes downloaded from NCBI.
Genome                                  Description
M. leprae TN                            Causative agent of human leprosy. Leads to permanent damage to the skin, nerves, limbs and eyes if left untreated
M. leprae Br4923                        As above
M. ulcerans AGY99                       An emerging pathogen that causes Buruli ulcer
M. marinum M                            Causes a tuberculosis-like disease in cold-blooded animals, and a peripheral granulomatous disease in humans
M. avium subsp. paratuberculosis K10    Causes tuberculosis in birds and disseminated infections in immunocompromised humans
M. avium 104                            See above
M. abscessus ATCC 19977                 Environmental bacterium that causes lung, wound, and skin infections
M. smegmatis str. MC2 155               Generally non-pathogenic, capable of causing soft tissue lesions
M. sp. JLS                              A pyrene-degrading bacterium isolated from the soil
M. sp. MCS                              As above
M. sp. KMS                              As above
M. gilvum PYR-GCK                       As above
M. vanbaalenii PYR-1                    Capable of degrading a variety of aromatic hydrocarbons
4.2.2 Indels
Short indels (ranging from 1 to about 20 nt) were identified in Lineage 1 and 2 genome strains using the indelpe module in MAQ (Li et al., 2008). All Lineage 1 and 2 genomes used in Chapter 3 were used in this analysis. The tab-delimited output file includes the start position and the indel type (inserted or deleted nucleotides). From this file it was possible to identify frameshift mutations as those indels whose length is not divisible by three, the codon length. Indels are inherently difficult to identify in short read data, and so only a targeted analysis of two lineages was performed.
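The frameshift rule described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the (position, sequence) record layout and the function name are assumptions, as the exact column order of the indelpe output is not reproduced here. The first two records use the indels reported in Table 4.4; the third is a hypothetical in-frame example.

```python
def is_frameshift(indel_length: int) -> bool:
    """Return True when an indel of this length changes the reading frame."""
    if indel_length <= 0:
        raise ValueError("indel length must be positive")
    # A codon is three nucleotides, so only multiples of three keep the frame.
    return indel_length % 3 != 0


# (start position, inserted or deleted nucleotides); layout is an assumption.
indels = [
    (1151486, "AC"),  # two-base deletion in Rv1028c (Table 4.4)
    (4305063, "T"),   # single-base insertion in Rv3830c (Table 4.4)
    (1000, "GCA"),    # hypothetical in-frame indel
]

frameshifts = [(pos, seq) for pos, seq in indels if is_frameshift(len(seq))]
```

Only the length of the inserted or deleted sequence matters for this test, so the same function applies to insertions and deletions alike.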
4.2.3 Homology modelling
Prediction of protein structure was performed using Protein Homology/analogy Recognition Engine V 2.0 (Phyre2) (Kelley & Sternberg, 2009). Phyre2 is available at: http://www.sbg.bio.ic.ac.uk/phyre2. The Phyre2 server has been described in detail previously (Bennett-Lovsey et al., 2008; Kelley & Sternberg, 2009; Mao et al., 2012). Briefly, nine ancestral (wild-type) regulatory protein-coding sequences were submitted to the Phyre2 server. A non-redundant fold library is constructed based on known protein sequences mined from the Structural Classification of Proteins (SCOP) database and Protein Data Bank (PDB). The query protein sequence is scanned against a non-redundant sequence database, and a profile Hidden Markov model (HMM) is generated. PSI-BLAST is used to collect close and remote sequence homologues, an alignment is constructed and secondary structure is predicted. The profile HMM and the secondary structure are then used to scan the fold library. This alignment process returns a score on which all alignments are ranked, and an E-value is generated. The top twenty scoring matches are then used to generate full 3-D models of each sequence, which are reported to the user. For each regulatory protein, the highest confidence model (>99%) with the greatest coverage was used in the subsequent analysis. Whilst it was possible to generate a homology model for all regulators, for four proteins the structure did not cover the SNP region and so was not used in later analysis.
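The model selection rule (confidence above 99%, greatest coverage, SNP position covered) can be sketched as below. This is an illustrative sketch only: the `Model` record, the function names and the template identifiers are hypothetical and are not part of the Phyre2 output format. The residue range 214-334 and SNP position 316 correspond to the VirS model shown in Figure 4.6; the other two models are invented for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Model(NamedTuple):
    template: str      # hypothetical template identifier
    confidence: float  # model confidence, in percent
    start: int         # first modelled residue (1-based)
    end: int           # last modelled residue (inclusive)


def coverage(model: Model) -> int:
    """Number of residues covered by the model."""
    return model.end - model.start + 1


def pick_model(models: Sequence[Model], snp_position: int,
               min_confidence: float = 99.0) -> Optional[Model]:
    """Pick the confident model with the greatest coverage that spans the SNP."""
    usable = [m for m in models
              if m.confidence > min_confidence
              and m.start <= snp_position <= m.end]
    if not usable:
        return None  # no usable structure: the SNP is left out of later analysis
    # Rank by coverage first, then by confidence to break ties.
    return max(usable, key=lambda m: (coverage(m), m.confidence))


models = [
    Model("template_a", 99.9, 214, 334),
    Model("template_b", 99.5, 250, 300),  # does not span residue 316
    Model("template_c", 95.0, 1, 340),    # fails the confidence cutoff
]
best = pick_model(models, snp_position=316)
```

Returning `None` mirrors the four regulators for which no model covered the SNP region.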
4.2.4 Change in protein stability
Prediction of SNPs that destabilise the protein structure was made using CUPSAT (Parthiban et al., 2006). The CUPSAT server is available at: http://cupsat.tu-bs.de/. CUPSAT predicts the change in free energy of protein unfolding between wild-type and mutant proteins (ΔΔG) using structural environment specific atom potentials and torsion angle potentials. The prediction is based on existing PDB protein structures, or user supplied structures. The output consists of information about the mutation site, its structural features (solvent accessibility, secondary structure and torsion angles), and comprehensive information about changes in protein stability for the nineteen possible substitutions of a specific amino acid (Parthiban et al., 2006). A mutation is categorised as destabilising if protein stability is lost (negative ΔΔG) or stabilising if protein stability increases (positive ΔΔG). Changes in stability with a magnitude of < 0.5 ΔΔG are not considered significant, and are classified as neutral mutations.
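The three-way classification of ΔΔG values can be written as a short function. This is a sketch of the rule as stated above, not of CUPSAT itself; the function name is an assumption, and values exactly at the 0.5 threshold are treated as significant because only changes below 0.5 are described as neutral.

```python
def classify_stability(ddg: float, threshold: float = 0.5) -> str:
    """Classify a predicted change in protein stability (ddG, kcal/mol)."""
    if abs(ddg) < threshold:
        return "neutral"  # change too small to be considered significant
    return "destabilising" if ddg < 0 else "stabilising"
```

For example, the values reported later in this chapter for VirS (-2.03), Zur (0.47) and MprA (2.83) fall into the destabilising, neutral and stabilising classes respectively.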
4.3 Results
4.3.1 Predicting functional SNPs within control set
The Sorting Intolerant From Tolerant (SIFT) algorithm was first tested on a set of SNPs that are highly likely to affect protein function in the MTBC. Drug resistance in the MTBC is largely caused by SNPs (Ramaswamy & Musser, 1998; Riska et al., 2000), and many of these drug resistance-conferring mutations have been identified and are housed in the TBDream database (Sandgren et al., 2009) (database downloaded on 07-06-10). In total a non-redundant set of 87 SNPs was extracted, consisting of SNPs from the following genes: ahpC, kasA and katG (SNPs associated with isoniazid resistance), embB (ethambutol resistance), gyrA and gyrB (fluoroquinolone resistance), pncA (pyrazinamide resistance) and rpoB (rifampicin resistance).
In addition to the drug resistance conferring SNPs, a literature search of experimentally determined functional SNPs in the MTBC was conducted to supplement the test set. SNPs from two additional genes, pykA and mmaA3, were included from this search (Behr et al., 2000; Keating et al., 2005). One of the early signs of variation amongst the MTBC was the variation in carbon utilisation (Goldman, 1963; Winder & Brennan, 1966). A characteristic of M. bovis was the inability to grow on glycerol as a sole carbon source, unlike M. tuberculosis, instead requiring the addition of pyruvate to the growth medium in vitro (Wayne, 1994). A mutation within pykA in M. bovis, encoding pyruvate kinase, was found to render this enzyme inactive and thereby disrupt the use of carbohydrates as an energy source (Keating et al., 2005). The nonsynonymous SNP (E220D) is also found in strains of M. africanum and M. microti (which infects voles), and these cultures are also supplemented with pyruvate (Keating et al., 2005; Wayne, 1994). The second nonsynonymous SNP (G98D) in mmaA3 is present within most strains of M. bovis BCG, such as BCG-Pasteur (Behr et al., 2000). A defining characteristic of mycobacteria is their capacity to synthesise mycolic acids, and it had been known that some BCG strains could not synthesise methoxymycolates, one type of mycolic acid (Minnikin et al., 1983). The G98D mutation was subsequently found to be responsible for this difference (Behr et al., 2000; Yuan et al., 1998).
SIFT was applied to the test SNP set and the results filtered as described (section 4.2.1), removing regions covered by <3 homologs and alignments with too little sequence variation with which to form a reliable prediction. In total 63 SNP predictions were made for the control set. Of the 61 predictions for drug resistance associated SNPs, 48 (78.7%) were predicted to impact protein function, leaving the remaining 13 SNPs (21.3%) predicted to be tolerated. The two pykA and mmaA3 SNPs were also predicted functional, both receiving the lowest SIFT score of 0.00. Together, nearly 80% of the SNP set was predicted functional, which may suggest a false negative error rate of approximately 20%. It should be stressed, however, that not all of the SNPs within the drug resistance set are experimentally confirmed to be involved in drug resistance; some are instead only associated with it. Additionally, promoter mutations could also be the cause of drug resistance, such as the inhA promoter mutations that cause isoniazid resistance (Musser et al., 1996); non-coding SNPs inherently cannot be tested in this type of analysis.
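The control set arithmetic can be checked with a few lines of code. This is a worked example using only the counts quoted above; the variable names are illustrative, and the cutoff of SIFT score ≤ 0.05 for a functional prediction is the one given in the caption of Table 4.2.

```python
SIFT_CUTOFF = 0.05  # scores at or below this value are predicted functional


def predicted_functional(sift_score: float) -> bool:
    """Apply the SIFT score cutoff used in this chapter."""
    return sift_score <= SIFT_CUTOFF


# Counts quoted in the text for the control set.
drug_functional = 48   # drug resistance SNPs predicted functional
drug_tolerated = 13    # drug resistance SNPs predicted tolerated
other_functional = 2   # pykA and mmaA3 SNPs, both with a SIFT score of 0.00

drug_total = drug_functional + drug_tolerated
control_total = drug_total + other_functional

drug_pct = 100 * drug_functional / drug_total
overall_pct = 100 * (drug_functional + other_functional) / control_total
false_negative_pct = 100 - overall_pct
```

Under these counts the 78.7% figure refers to the 61 drug resistance predictions, while the proportion across all 63 control predictions is slightly higher.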
4.3.2 Predicted functional nonsynonymous SNPs
All lineage-specific nonsynonymous SNPs identified in Chapter 3 were entered into the dataset for this study (N=1550 SNPs). Predictions could be made for 1339 (86.4%) of the SNPs. Removal of predictions based on genes that were highly conserved reduced this set by 37.8% (506 SNPs), leaving 833 SNP predictions. SNPs within genes that harboured little sequence diversity were not included as such predictions would be biased, potentially causing increased functional mutation calls and thereby an increased false positive error rate (Ng & Henikoff, 2003).
In total, 371 nonsynonymous SNPs were predicted to affect gene function (Table 4.2). The ancient lineages (Lineages 1, 5 and 6) were found to harbour nearly double the number of predicted functional SNPs compared with the modern lineages (246 vs 125 functional SNPs respectively). However, the three ancient lineages also have the longest branch lengths, as shown in Chapter 3. To account for any influence of branch length, the number of functional and tolerated SNPs was expressed as percentages (Figure 4.2). The percentage of SNPs predicted functional, for which predictions could be made, ranged from 40.9-48.4% across the lineages, with a mean of 44.5%. There was no significant difference between the frequency of predicted functional and tolerated SNPs across the lineages (Mann-Whitney, p = 0.4817). Additionally, no difference was observed between the number of functional SNPs by the ancient and modern classification, with a mean of 44.7% and 44.1% predicted functional SNPs respectively.
As a further control, all genes with predicted functional SNPs were categorised as essential or nonessential on the basis of transposon mutagenesis screens (Sassetti et al., 2003; Sassetti & Rubin, 2003). Using these two categories, 54 (14.6%) of the genes with functional predictions were essential. This would suggest a 14.6% false positive error rate for SIFT predictions, which is close to the previously described false positive error rate for the SIFT algorithm (~20%) (Ng & Henikoff, 2003).
Table 4.2. Predicted tolerated and functional SNPs using SIFT. SNPs with a SIFT score ≤ 0.05 are predicted functional, and predictions with conservation scores of 3.5 or above were filtered out.
Lineage | Tolerated | Functional | Total predictions
L1 | 79 | 74 | 153
L2 | 25 | 18 | 43
L3 | 52 | 44 | 96
L4 | 33 | 23 | 56
L5 | 111 | 89 | 200
L6 | 118 | 83 | 201
Modern branch | 44 | 40 | 84
Total | 462 | 371 | 833
[Figure 4.2 bar chart. y-axis: Number of SNPs (%), 0-100; series: Tolerated SNP, Functional SNP; x-axis: Lineage 1, Lineage 5, Lineage 6, Lineage 2, Lineage 3, Lineage 4, Modern lineage.]
Figure 4.2. SIFT predictions. To account for differences in lineage branch lengths, the percentage of SNPs predicted as being functional and tolerated is shown. Horizontal dashed line indicates the average percentage of predicted functional SNPs (44.5%).
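The percentages can be recomputed from the counts in Table 4.2. The sketch below is illustrative only; in particular, grouping the Modern branch row with Lineages 2, 3 and 4 is an assumption, chosen because it reproduces the reported ancient and modern means.

```python
# (tolerated, functional) counts per branch, taken from Table 4.2.
table_4_2 = {
    "L1": (79, 74),
    "L2": (25, 18),
    "L3": (52, 44),
    "L4": (33, 23),
    "L5": (111, 89),
    "L6": (118, 83),
    "Modern branch": (44, 40),
}

ANCIENT = ["L1", "L5", "L6"]
MODERN = ["L2", "L3", "L4", "Modern branch"]  # grouping is an assumption


def pct_functional(branch: str) -> float:
    """Percentage of predictions on a branch that are functional."""
    tolerated, functional = table_4_2[branch]
    return 100 * functional / (tolerated + functional)


def mean_pct(branches) -> float:
    """Unweighted mean of the per-branch percentages."""
    return sum(pct_functional(b) for b in branches) / len(branches)


ancient_mean = mean_pct(ANCIENT)
modern_mean = mean_pct(MODERN)

total_functional = sum(f for _, f in table_4_2.values())
total_predictions = sum(t + f for t, f in table_4_2.values())
overall_pct = 100 * total_functional / total_predictions
```

Expressing each branch as a percentage removes the effect of the very different numbers of SNPs per branch, which is the point of the normalisation.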
4.3.3 Impact of nonsynonymous SNPs outside of the human adapted MTBC
To test if the high percentage of predicted functional SNPs is restricted to the MTBC or is a common phenomenon in mycobacteria, all SNPs were identified between the reconstructed ancestor of the MTBC sequences and M. canettii, the closely related outgroup of the MTBC. Out of a total of 12,319 coding SNPs, 4,245 (34.5%) were nonsynonymous. Compared to the percentage of nonsynonymous SNPs found within the lineage branches of the MTBC (64.8%), M. canettii has nearly half the proportion of nonsynonymous SNPs. These nonsynonymous SNPs were then screened for potential functional impact using SIFT. Out of a total of 2,416 possible predictions, 522 (21.6%) were predicted functional, significantly fewer than in the MTBC (chi-square, p<0.0001). This would suggest that in contrast to the MTBC, the majority of changes in M. canettii are functionally neutral.
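A comparison of this kind can be sketched as a chi-square test on a 2 x 2 contingency table. The exact table used in the study is not given, so the sketch below assumes that the 371 functional predictions out of 833 for the MTBC (Table 4.2) were compared with the 522 out of 2,416 for the outgroup, and it uses the uncorrected Pearson statistic; both choices are assumptions. The p-value uses the identity that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x/2)).

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))  # one degree of freedom
    return stat, p_value


# Rows: MTBC lineage branches, outgroup comparison.
# Columns: predicted functional, predicted tolerated.
mtbc_functional, mtbc_total = 371, 833
outgroup_functional, outgroup_total = 522, 2416

stat, p_value = chi_square_2x2(
    mtbc_functional, mtbc_total - mtbc_functional,
    outgroup_functional, outgroup_total - outgroup_functional,
)
```

With proportions of 44.5% and 21.6% on samples of this size, the statistic is large and the p-value falls far below the reported bound of 0.0001.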
4.3.4 Clustering of functional SNPs
There was little evidence of functional SNPs clustering within specific genes, which could be indicative of adaptive selection. The majority of genes did not harbour a predicted functional SNP (3701 genes, 92.1%), whilst those that did harboured 1-5 SNPs per gene, as shown in Figure 4.3A. The frequency of SNPs mainly followed the expected distribution given by the Poisson model fitted to the data; however, there were a few exceptions: Rv2079, fadD15 (Rv2187) and Rv0465c. The three genes that deviate from the expected number of SNPs had SNP numbers ranging from 4-5 per gene (Figure 4.3B).
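The expected counts under a Poisson model can be reproduced from the gene counts shown on the bars of Figure 4.3A. This is a minimal sketch; the fitting procedure used in the study is not described, so the rate is estimated here as the mean number of SNPs per gene (the maximum-likelihood estimate), which is an assumption.

```python
import math

# Number of genes with k predicted functional SNPs (bar labels, Figure 4.3A).
observed = {0: 3701, 1: 278, 2: 35, 3: 3, 4: 1, 5: 2}

n_genes = sum(observed.values())
n_snps = sum(k * count for k, count in observed.items())
rate = n_snps / n_genes  # mean SNPs per gene, used as the Poisson rate


def expected_genes(k: int) -> float:
    """Expected number of genes with exactly k SNPs under the fitted model."""
    return n_genes * math.exp(-rate) * rate ** k / math.factorial(k)


expected = {k: expected_genes(k) for k in observed}
excess = {k: observed[k] - expected[k] for k in observed}
```

Under this fit the expected number of genes with four or five SNPs is well below one, so the three observed genes stand out from the model.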
All three genes are above the average gene length of 1003nt, ranging from 1425-2514nt, which could account for the increased number of predicted functional SNPs. However, out of the fifteen nonsynonymous SNPs found within the three genes, only one was not predicted to be functional, which would not be expected based on the genome-wide distribution of predicted functional and tolerated SNPs (chi-square, p=0.0002). Therefore, whilst these are relatively long genes, this does not account for the skewed number of predicted functional nonsynonymous SNPs.
Little is known about Rv2079, which has four predicted functional SNPs. It is a conserved hypothetical gene of unknown function, and SNPs are found in four lineages (1, 2, 5 and 6); in Lineage 2 a nonsynonymous SNP causes the introduction of a stop codon. Combined with evidence that this gene is nonessential for growth based on transposon screens (Sassetti et al., 2003; Sassetti & Rubin, 2003), it is possible that functional mutations are accumulating because Rv2079 is either incorrectly annotated as a gene, or in the case of Lineage 2 has become a pseudogene. The other outliers were fadD15 and Rv0465c, which contain five predicted functional SNPs each. The genes belong to different functional categories, lipid metabolism and regulatory proteins, respectively. As before, fadD15 functional SNPs are across multiple lineages (1, 3, 4 and 6), and one SNP is also present in the modern lineage branch. Therefore all the Modern lineages have one or two functional SNPs in fadD15. Furthermore, in Lineages 1 and 5, the two SNPs are nonsense and result in the introduction of stop codons in the lineages. Function is again not known for fadD15, but it encodes a fatty-acid-CoA synthetase and is likely involved in lipid degradation (Cole et al., 1998).
The other gene with five predicted functional SNPs, Rv0465c, is a probable transcriptional regulator (Cole et al., 1998). It shares high sequence identity with the RamB protein from Corynebacterium glutamicum, which is in the same phylum as M. tuberculosis. As well as binding to its own promoter to autoregulate expression, RamB controls isocitrate lyase (icl1), which is part of the glyoxylate cycle (Micklinghoff et al., 2009). Although not annotated in the most current release of the Tuberculist database (Release 26, December 2012), it has been given the gene name ramB by Micklinghoff et al. (2009), and this has been adopted in the following sections. Characteristic of regulators, the mycobacterial ramB has a DNA binding domain, which is in the N-terminus of the 465 amino acid protein, including the helix-turn-helix domain (HTH), from amino acid residues 21 to 40, as based on the PROSITE database. One of the two predicted functional SNPs in Lineage 6 is located within the HTH domain (N36D), which might be expected to directly affect the capacity of the protein to bind DNA. All other functional SNPs, found in Lineages 1, 4 and 5, are located throughout the first half of the protein, leaving only Lineages 2 and 3 with a likely functioning ramB.
Next, the distribution of the predicted functional SNPs across the genome was calculated, shown in Figure 4.4. Functional SNPs were located across the genome, and appear to follow the same distribution profile of the nonsynonymous SNP frequencies, as identified in Chapter 3. On average, there is one functional SNP per 10.9kb of coding sequence.
[Figure 4.3 bar charts. Panel A: y-axis Predicted functional SNPs, x-axis Number of SNPs in gene (0-5); bar labels 3701, 278, 35, 3, 1 and 2 for 0, 1, 2, 3, 4 and 5 SNPs per gene. Panel B: y-axis Predicted functional SNPs (Log10), x-axis Number of SNPs in gene (0-5).]
Figure 4.3. Distribution of predicted functional SNPs per gene. A. SNPs per gene range from 0-5, with the actual number of genes shown at the top of each bar. Line indicates predicted values under a Poisson distribution fitted to the data. B. y-axis plotted on a log10 scale to highlight deviation from the expected number at high SNP numbers per gene.
[Figure 4.4 plot. Series: Predicted functional SNPs, Nonsynonymous SNPs; x-axis: Genome position (Mb), 0-4; y-axes: Predicted functional SNPs and Nonsynonymous SNPs.]
Figure 4.4. Frequency distribution of predicted functional SNPs across genome. SNPs were placed into bins of 0.1Mb. Right y-axis predicted functional SNPs, left y-axis nonsynonymous SNPs.
4.3.5 Functional category analysis of functional SNPs
To determine if the predicted functional SNPs are within specific gene categories or instead evenly distributed, the genes with predicted functional SNPs were grouped by the Tuberculist functional categories (Lew et al., 2011). The percentage of functional SNPs within each of the eight functional categories was compared to the percentage representation of the respective category genome-wide, and is shown in Figure 4.5. In this way, the unequal distribution of genes within specific categories was normalised and functional SNP distribution expressed as a ratio. Ratios >1 represent functional categories over-represented with functional SNPs, whereas <1 indicates under-representation. Categories significantly over-represented with functional SNPs were lipid metabolism (2.4-fold) and regulatory proteins (1.6-fold) (chi-square, false discovery rate adjusted p < 0.05). Interestingly, information pathways was the most under-represented category, with 2.0-fold fewer predicted functional SNPs than would have been expected (chi-square, false discovery rate adjusted p=0.04) (Table 4.3). Genes within the conserved hypotheticals category were also significantly under-represented.
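The representation ratios can be recomputed from the gene and SNP counts reported in Table 4.3. The sketch below is illustrative: it assumes that the genome-wide denominator is the sum of the genes in the eight categories, and that under-representation is reported as the negative reciprocal of the ratio (so a ratio of 0.5 appears as -2.0). Both assumptions are inferred from the reported values rather than stated in the text.

```python
# category: (gene number, functional SNPs), taken from Table 4.3.
table_4_3 = {
    "information pathways": (242, 12),
    "conserved hypotheticals": (1031, 63),
    "virulence, detoxification, adaptation": (238, 17),
    "intermediary metabolism and respiration": (936, 88),
    "cell wall and cell processes": (773, 91),
    "regulatory proteins": (198, 31),
    "unknown": (16, 3),
    "lipid metabolism": (271, 66),
}

total_genes = sum(genes for genes, _ in table_4_3.values())
total_snps = sum(snps for _, snps in table_4_3.values())


def representation(category: str) -> float:
    """Signed fold representation of functional SNPs in a category."""
    genes, snps = table_4_3[category]
    ratio = (snps / total_snps) / (genes / total_genes)
    # Ratios below one are shown as negative fold changes.
    return ratio if ratio >= 1 else -1 / ratio


signed_fold = {cat: round(representation(cat), 1) for cat in table_4_3}
```

Normalising by the share of genes in each category prevents large categories from appearing enriched simply because they contain more genes.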
[Figure 4.5 bar chart. x-axis: Functional category representation (ticks -3, -2, 1, 2, 3); categories from top to bottom: lipid metabolism *; unknown; regulatory proteins *; cell wall and cell processes; intermediary metabolism and respiration; virulence, detoxification, adaptation; conserved hypotheticals *; information pathways *.]
Figure 4.5. Functional category representation. Values on the x-axis are ratios, representing the deviation from the expected number of predicted functional SNPs per category. Ratios > 1 indicate overrepresentation, <1 underrepresentation, and ~1 indicates that the number of predicted functional SNPs is on par with the expected number. Categories are based on Tuberculist annotations. * indicates p <0.05 by individual chi-square test followed by multiple test correction (False Discovery Rate method) (Benjamini & Hochberg, 1995).
Table 4.3. Functional category representation. The number of predicted functional SNPs within genes from each respective category. Representation of category expressed as ratios. Independent chi-square tests performed for all categories, followed by multiple test correction (False Discovery Rate method) (Benjamini & Hochberg, 1995).
Functional category | Gene number | Functional SNPs | Representation | chi-square (adjusted p-value)
information pathways | 242 | 12 | -2.0 | 0.04
conserved hypotheticals | 1031 | 63 | -1.6 | <0.01
virulence, detoxification, adaptation | 238 | 17 | -1.4 | 0.27
intermediary metabolism and respiration | 936 | 88 | -1.1 | 0.55
cell wall and cell processes | 773 | 91 | 1.2 | 0.18
regulatory proteins | 198 | 31 | 1.6 | 0.04
unknown | 16 | 3 | 1.9 | 0.55
lipid metabolism | 271 | 66 | 2.4 | <0.01
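The False Discovery Rate correction applied to the per-category tests follows the step-up procedure of Benjamini & Hochberg (1995), which can be sketched as below. The unadjusted p-values are not reported in this chapter, so the function is shown on its own; the function name is an assumption and any input passed to it here would be illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a bigger one.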
An alternative method to account for the number of functional SNPs per category was also calculated. This was based on the number of functional SNPs per potential nonsynonymous SNP position in each functional category. Using this method, it was again found that the information pathways category had accumulated the fewest functional SNPs (12 functional SNPs out of 202,427 potential nonsynonymous positions, 0.006%). The lipid and regulatory categories had accumulated the most functional SNPs, with 0.02% and 0.03% of all potential nonsynonymous positions harbouring a functional SNP respectively. In summary, this method highlights the same over- and under-represented gene categories identified previously.
Stratification of the predicted functional SNPs by lineage in the functional categories found no significant difference by non-parametric one-way analysis of variance (Kruskal-Wallis test, p=0.99). This would suggest that whilst there is a significant difference in representation of functional SNPs within the above four gene categories, it is not driven by specific lineages but is instead a phenomenon across the MTBC lineages.
4.3.6 Functional impairment of Lineage 1 and 2 regulatory proteins
It has been shown that two functional categories, regulatory proteins and lipid metabolism, have accumulated a greater number of predicted functional SNPs than expected. The following section focuses on the over-represented regulatory category, and specifically on the predicted functional mutations within Lineages 1 and 2, which are the focus of the transcriptomic study in Chapter 5. This provides an opportunity to combine additional predictive information, such as structural features, with the previous sequence-based predictions, whilst also providing a reduced SNP set to initially guide the transcriptome analysis.
Eleven genes within the two lineages harbour lineage-specific SNPs predicted by SIFT analysis as likely to impair protein function, and a further gene harbours a nonsense mutation (Table 4.4). Targeted analysis of insertion and deletion (indel) mutations in the lineage branches identified a further two genes with indels that cause frameshifts (Table 4.4). The frameshift mutation in Lineage 2, within Rv3830c, removes the existing stop codon, likely causing run through and fusion with the downstream gene Rv3829c. Similarly, the two-base frameshift deletion within Rv1028c (kdpD) at chromosome position 1151486 leads to the introduction of a stop codon at codon position 235 and a resulting 625 (72.8%) amino acid truncation of the ancestral protein. kdpD is a two-component transcriptional sensor and controls the expression of the kdpABC operon, which in Escherichia coli is involved in potassium transport at low potassium concentrations (Walderhaug et al., 1992). A third indel was found within mce1R (Rv0165c), at chromosome position 194305. However, the same two-nucleotide insertion (CC) was found across all Lineage 1 and 2 strains, and so was removed from the analysis as this likely represents a two-base deletion that is specific to the H37Rv sequence used in the reference-based mapping.
Table 4.4. Transcriptional regulators with predicted functional SNPs and indels. Eleven SNPs were predicted functional based on SIFT analysis. One SNP causes a nonsense mutation (stop gain). Two indels cause frameshift mutations. n/a: not possible to predict with SIFT.
Gene | Regulator type | SNP | Mutation | Lineage | SIFT score
Rv1846c BlaI | penicillinase repressor | T 2096430 G | L57R | 1 | 0.05
Rv3082c VirS | AraC | T 3447480 G | L316R | 1 | 0.01
Rv3167c | TetR | C 3536008 A | P17Q | 1 | 0.02
Rv0465c RamB | HTH-XRE | A 555945 G | Q121R | 1 | 0.02
Rv1032c TcrS | 2-component sensor | C 1157771 G | S62C | 1 | 0.01
Rv3736 | AraC | G 4187063 A | G144R | 1 | 0.01
Rv0844c NarL | 2-component regulator | G 940602 C | G169R | 2 | 0.00
Rv0377 | LysR | G 455325 C | R302P | 2 | 0.00
Rv0275 | TetR | T 331588 C | L24S | Modern | 0.00
Rv0981 MprA | 2-component regulator | A 1097023 G | S70G | Modern | 0.04
Rv2359 Zur | Fur | G 2641840 A | R64H | Modern | 0.02
Rv2788 SirR | Fe-dependent repressor | C 3097349 | Q131X | 1 | n/a
Rv3830c | TetR | insertion: 4305063 T | S208 frameshift | 2 | n/a
Rv1028c KdpD | 2-component sensor | deletion: 1151486 AC | H67 frameshift | 1 | n/a
4.3.6.1 Change in protein stability
The sequence-based predictions of functional impairment of transcriptional regulators were refined through incorporation of structure-based information. The location of each SNP was placed in the context of protein domain information, such as identification of SNPs within the functionally important DNA binding helix-turn-helix (HTH) domain. Protein domain annotations were extracted from the Pfam database (Punta et al., 2012). These were then complemented with predictions of the protein stability (ΔG) of wild-type and mutant protein structures, enabling the change in protein stability (ΔΔG) to be calculated. Compromised protein folding and decreased stability of the protein product are major pathogenic consequences of nonsynonymous SNPs, affecting the ability of the protein to function (Wang & Moult, 2001; Yue et al., 2005).
To calculate ΔΔG it is necessary to have protein structures for each of the regulators. Only two of the eleven regulators with predicted functional SNPs have had their protein structures resolved and are publicly available in the Protein Data Bank (PDB) (Burley, 2013); these are BlaI (PDB ID: 2G9W) and NarL (3EUL) (Sala et al., 2009; Schnell et al., 2008). For the remaining nine regulators, homology modeling was performed using the Phyre2 server (Kelley & Sternberg, 2009) as described in Methods (section 4.2.3). Following this it was still not possible to construct protein models for four of the regulators, either due to the low quality of the model or because the SNP position was not covered. The remaining seven regulators were entered into the analysis.
The CUPSAT server was used to predict ΔΔG (Parthiban et al., 2006). Protein stability is categorised as destabilising (-ΔΔG), neutral (0 ΔΔG) or stabilising (+ΔΔG). Changes in stability with a magnitude of < 0.5 ΔΔG are not considered significant (see section 4.2.4). Five of the regulator SNPs were predicted to cause a loss of protein stability, one protein structure increased in stability following the SNP, and one predicted energy change was too small to classify as either stabilising or destabilising, and so is likely neutral (Table 4.5). Combined with the protein domain information, all five destabilising SNPs were located within HTH DNA binding domains, and likely affect the regulatory function of the protein: Rv0275, Rv0844c (narL), Rv1846c (BlaI), Rv3082c (virS) and Rv3167c. These were classified as having “high predictive scores” and form a reduced set of transcriptional regulators predicted to be functionally impaired (Table 4.5). For example, a SNP in Lineage 1 strains introduces an arginine residue into the conserved position of the virS HTH domain, which is predicted to destabilise the structure and cause a loss of function (Figure 4.6).
Table 4.5. Regulatory proteins with predicted functional SNPs and indels in Lineages 1 and 2. Sequence based predictions of functional SNPs are combined with Pfam protein domain information and prediction of changes in protein stability (ΔΔG). n/a: unable to calculate ΔΔG as the mutation is an indel or nonsense SNP, unkn: unable to generate a protein structure using homology modelling.
Gene | Mutation | Lineage | Domain | Protein stability (ΔΔG; kcal/mol)
high predictive score
Rv0275 | L24S | Modern | helix-turn-helix | -3.18
Rv0844c NarL | G169R | 2 | helix-turn-helix | -4.66
Rv1028c KdpD | H67 frameshift | 1 | 2-component sensor | n/a
Rv1846c BlaI | L57R | 1 | helix-turn-helix | -8.72
Rv2788 SirR | Q131X | 1 | Fe-dependent repressor | n/a
Rv3082c VirS | L316R | 1 | helix-turn-helix | -2.03
Rv3167c | P17Q | 1 | helix-turn-helix | -1.21
Rv3830c | S208 frameshift | 2 | low complexity fusion | n/a
low predictive score
Rv0465c RamB | Q121R | 1 | low complexity | unkn
Rv0377 | R302P | 2 | low complexity | unkn
Rv0981 MprA | S70G | Modern | cheY | 2.83
Rv1032c TcrS | S62C | 1 | low complexity | unkn
Rv2359 Zur | R64H | Modern | helix-turn-helix | 0.47
Rv3736 | G144R | 1 | arabinose-binding | unkn
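The split between high and low predictive scores in Table 4.5 can be reproduced with a simple rule. The rule below is a reading of the table, not a definition given in the text: a regulator is scored high if its mutation truncates or frameshifts the protein, or if the SNP is destabilising (ΔΔG ≤ -0.5 kcal/mol) and lies within a helix-turn-helix domain. Missing ΔΔG values (n/a or unkn in the table) are represented as `None`.

```python
# (gene, mutation, domain, ddG in kcal/mol or None), taken from Table 4.5.
regulators = [
    ("Rv0275", "L24S", "helix-turn-helix", -3.18),
    ("Rv0844c", "G169R", "helix-turn-helix", -4.66),
    ("Rv1028c", "H67 frameshift", "2-component sensor", None),
    ("Rv1846c", "L57R", "helix-turn-helix", -8.72),
    ("Rv2788", "Q131X", "Fe-dependent repressor", None),
    ("Rv3082c", "L316R", "helix-turn-helix", -2.03),
    ("Rv3167c", "P17Q", "helix-turn-helix", -1.21),
    ("Rv3830c", "S208 frameshift", "low complexity", None),
    ("Rv0465c", "Q121R", "low complexity", None),
    ("Rv0377", "R302P", "low complexity", None),
    ("Rv0981", "S70G", "cheY", 2.83),
    ("Rv1032c", "S62C", "low complexity", None),
    ("Rv2359", "R64H", "helix-turn-helix", 0.47),
    ("Rv3736", "G144R", "arabinose-binding", None),
]


def predictive_score(mutation, domain, ddg, threshold=0.5):
    """Assumed scoring rule: truncating, or destabilising within an HTH domain."""
    truncating = mutation.endswith("frameshift") or mutation.endswith("X")
    destabilising_hth = (ddg is not None and ddg <= -threshold
                         and domain == "helix-turn-helix")
    return "high" if truncating or destabilising_hth else "low"


high = [gene for gene, mut, dom, ddg in regulators
        if predictive_score(mut, dom, ddg) == "high"]
```

Under this rule Zur stays in the low group despite its HTH location, because its ΔΔG of 0.47 falls below the significance threshold.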
[Figure 4.6 homology model with the L316R position marked, and sequence alignment. Position markers: 281, 310, 320.]
QUERY LIERERRAQA ARYLAQPGLY LSQIAVLLGY SEQSALNRSC RRWFGMTPRQ YRAYGGVSGR *
mmi:MMAR_3320 VVDDVRREVT ERYLRDSDMT LTHLARQLGY AEQSVLSRSC QRWFGASPAS LRAXXXXXXX X
mmi:MMAR_5276 LIDEVRKETA DRYLRTTAMS LSHLARELGY AEQSVLTRSC KRWFGIGPAA YRAXXXXXXX X
mul:MUL_4350 LIDEVRKETA DRYLRTTAMS LSHLARELGY AEQSVLTRSC KRWFGIGPAA YRAXXXXXXX X
mab:MAB_3997c LVDQIRREAA ERLLSDTDLS LDHLSRQLGY AEQSVFTRSC KRWFGTTPSA YRSXXXXXXX X
mgi:Mflv_5495 LVDQTRRDTA QRLLLDTALS LDQLACPLXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X
mab:MAB_3623 LLDTIRLDLA DHLVTSDRHS LTEISEMLAF SSPSNFSRWF RGHRAMSPRT WRXXXXXXXX X
mmc:Mmcs_3216 LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mkm:Mkms_3278 LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mjl:Mjls_3227 LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mgi:Mflv_4594 LRQSCLRESA MMLLITRSMS ASQIATELGY GDLANFSHAF KRWTGRSPSE YRXXXXXXXX X
mab:MAB_0715c IRDAALRTEA IKSLEDGSES LNDLSVRLGF SELSAFTRAF RRWTGASPAQ YRXXXXXXXX X
mab:MAB_2050 LRQSFLQERA ILRILDRSVS VSEIAAELGY ADLTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mmi:MMAR_3156 LRQAFLRERA MLQLLDRSLS VSEIATDLGY SDLANFSHAF KRWTGRSPSE FRXXXXXXXX X
Figure 4.6. Predicted loss of function of virS transcriptional regulator in Lineage 1. Homology model of wild-type VirS protein, covering amino acid residues 214 - 334. Arrow indicates Lineage 1 SNP at amino acid position 316 within the HTH domain. CUPSAT analysis of the ancestral and mutant protein predicts a destabilisation of the structure (ΔΔG = -2.03 kcal/mol). Sequence conservation of the region used in the SIFT prediction is shown on the right hand side, with the ancestral MTBC sequence shown at the top of the sequence alignment. Standard one-letter amino acid nomenclature is used, with X indicating a gap in the alignment.
4.4 Discussion
4.4.1 Strengths and limitations of the study
The overall aim of this study was to computationally measure the impact of SNPs in the MTBC, focusing specifically on SNPs that contribute to lineage-specific variation identified in Chapter 3. Over 1,500 nonsynonymous SNPs were identified in the lineage branches of the MTBC, and the phenotypic effects of these are unknown. Unlike other bacterial species, the majority of SNPs in the MTBC (over two-thirds) are nonsynonymous, and this SNP set was the focus of this computational study. Such SNPs are more tractable to computational prediction methods than synonymous and intergenic SNPs, as the impact of the amino acid substitutions can be measured using the properties of the amino acid, such as residue volume change, as well as the evolutionary conservation of the specific nucleotide position based on multiple sequence alignments. This is reflected in the development of computational prediction methods based mainly on nonsynonymous SNPs (Ng & Henikoff, 2006). However, noncoding SNPs can clearly also have an impact on gene function, such as the mutation of regulatory regions found in M. tuberculosis drug resistance (Müller et al., 2011; Riska et al., 2000). More recently it has also been suggested that synonymous SNPs are less silent than previously assumed (Plotkin & Kudla, 2011). Despite not having an effect on the resulting protein sequence, synonymous SNPs, and therefore synonymous codon changes, have shaped gene expression through the phenomenon of codon-usage bias (Plotkin & Kudla, 2011). Differential use of synonymous codons can affect RNA processing, protein translation and protein folding (Plotkin & Kudla, 2011); industrial applications have exploited this to increase gene expression over 1000-fold through introduction of synonymous SNP changes (Gustafsson et al., 2004).
Furthermore, in human-based studies, a synonymous mutation has also been shown to change the substrate specificity of the multidrug-resistance protein 1 (MDR1), although the precise mechanism is not yet understood (Kimchi-Sarfaty et al., 2007; Komar, 2007). Together this demonstrates the potential functional importance of all SNP types, and it is likely that future study of M. tuberculosis genomic variation will attribute instances of functional variation not only to nonsynonymous SNPs but also to synonymous and noncoding SNPs.
Whilst experimental methods exist to characterise the functional effect of SNPs, such as site-directed mutagenesis, studying the molecular effects of mutations in the MTBC is time-consuming, laborious and unfeasible at this scale. Computational methods can therefore provide useful and reliable information about the effects of amino acid substitutions at an initial stage. There are two main methods to predict the functional effect of coding nonsynonymous SNPs. The first relies on mapping the SNP to the three-dimensional protein structure and the second takes a sequence-based approach, assessing the nature of the position and introduced amino acid type. At the time of writing, there were protein structures for 259 (6.4%) of all annotated M. tuberculosis proteins in the Protein Data Bank (Burley, 2013). This number has not increased significantly in the interim period, and currently 314 genes have associated protein structures (December, 2012) (Burley, 2013). To ensure that this was a comprehensive study of the effects of lineage-specific nonsynonymous SNPs, it was decided to use the second prediction method, based on sequence homology, thus maximising the number of SNP predictions. The method chosen was the Sorting Intolerant From Tolerant (SIFT) algorithm (Ng & Henikoff, 2003). Although SIFT relies solely on amino acid sequence to make the prediction, it has been shown to perform similarly to methods based on different evolutionary and structural features, and critically can be applied to many more of the lineage-specific SNPs (Saunders & Baker, 2002; Sunyaev et al., 2001). It has been suggested that a combination of the two main prediction methods (sequence and structure based) will likely improve the accuracy of predictions (Bao & Cui, 2005; Thusberg & Vihinen, 2009), but the chosen method was viewed as an acceptable trade-off.
More in-depth structural work can be applied at a later, targeted stage, as was done in this study for the genes within the regulatory protein category. However, even at this stage, four of the eleven (36.4%) regulatory proteins with nonsynonymous SNPs could not be entered into structure-based predictions, owing to the lack of structural information; of the remaining proteins, only two had experimentally determined structures, so intensive homology modelling was required to increase the size of the structural dataset.
Moving from SNPs, short insertions and deletions (indels) also have potential functional consequences, particularly indels of a length not divisible by three, which lead to a change in the reading frame. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying them lag behind methods for calling SNPs in terms of sensitivity and specificity (Albers et al., 2011). For this reason, it was decided not to include a genome-wide analysis of indels, but to focus instead on a few potential indels in genes involved in regulatory function. Indels are also rarer than SNPs in the MTBC, and for these reasons the identification of SNPs has received the greatest attention in such studies so far. They are effectively the lower hanging fruit. It is likely that these issues will be resolved and that indels will receive more attention as newer algorithms to detect them are developed (Albers et al., 2011), and as potentially longer reads from third-generation sequencing technologies are utilised.
4.4.2 Validation of the SIFT method
For the first time it was possible to identify all potential functional SNPs in the lineages of MTBC. As described in Chapter 3, these SNPs represent the background variation that contributes to the underlying lineage genetic diversity. Identification of SNPs more likely to contribute to functional diversity focuses later analyses on predicted phenotypically important SNPs, and on a broader scale tests the hypothesis that a high proportion of SNPs within the MTBC will be functional, likely due to reduced purifying selection acting within MTBC (Hershberg et al., 2008).
The SIFT algorithm was first run on a test SNP set that would be expected to be enriched for functional SNPs, and so act as a positive control for the performance of the method. The set was based on SNPs associated with drug resistance from the current release of the TBDReaM database (Sandgren et al., 2009). It was found that 79.4% of SNPs were predicted functional by SIFT, leaving 20.6% of SNPs associated with drug resistance predicted to be functionally neutral. This potential false negative error rate of ~20% is close to that previously described by the authors of SIFT (Ng & Henikoff, 2001; Ng & Henikoff, 2003). However, it is important to note that the majority of SNPs in the positive control set are putative mutations found in drug-resistant clinical M. tuberculosis isolates, and so may be only incidentally associated with, rather than causally involved in, drug resistance (Sandgren et al., 2009). The control set is therefore likely to contain some SNPs that are not functional, and so is not a completely robust test of the SIFT algorithm. As an alternative test, it was found that significantly fewer predicted functional SNPs fell within the genes previously characterised as being essential for growth, and that the proportion of functional SNPs that did fall within the group of essential genes (14.6%) is again close to the expected false positive error rate of SIFT. Together this provides confidence in the later SNP predictions.
4.4.3 Half of lineage-specific SNPs are predicted to have functional consequences
Applying SIFT to all lineage-specific SNPs, it was possible to make predictions for >85% of the set, and strikingly just under half were predicted to have a functional effect. The mean percentage of functional SNPs across all lineages was 44.5%, and no significant difference was found between the individual lineages, or when lineages were grouped into ancient and modern categories. This prediction is very close to the estimate made by Hershberg et al. (2008), who estimated that ~40% of the SNPs within the MTBC are functional by extrapolating from the SNPs found within a set of 89 genes sequenced in 99 human M. tuberculosis isolates. To contrast with the high proportion of functional SNPs in the MTBC, all SNPs between an M. canettii strain, the closely related outlier from the MTBC, and the reconstructed M. tuberculosis ancestor were identified; only 21.6% of these nonsynonymous SNPs were predicted to be functional, which is less than half of the proportion seen in the MTBC. This suggests that the hypothesised reduction in purifying selection acting within the MTBC is generating substantial diversity. Interestingly, a similar phenomenon has been observed in humans, where recent demographic expansions have led to the accumulation of low frequency genetic variants associated with strong functional effects (Keinan & Clark, 2012; Tennessen et al., 2012). Considering the tight link between the MTBC and its human host, it is interesting to speculate that these human expansions might have had a similar effect on the genetic diversity of the MTBC (Hershberg et al., 2008).
Although purifying selection is likely reduced in the MTBC, it was still possible to detect signals of this force, both through increased removal of predicted functional SNPs from genes classed as essential for growth compared with nonessential genes, and through clustering of SNPs beyond the expected distribution. When grouped by functional category, genes encoding proteins in the information pathways category accumulated significantly fewer predicted functional SNPs than expected. Conversely, functional SNPs were over-represented in genes encoding proteins that perform regulatory functions and those involved in lipid metabolism. Interestingly, it was also found that the transcriptional regulator ramB had accumulated more functional SNPs than expected, spanning four of the lineages. Within the regulatory protein category, attention was then focused on Lineage 1 and 2 SNPs; these two lineages form the basis of the transcriptomic study in Chapter 5, and so additional mutational and structural information was integrated to identify regulators likely to be functionally impaired, for use in the subsequent study. It was found that several SNPs lie within the HTH DNA-binding domain of the regulatory proteins, such as a Lineage 1 SNP in virS. VirS regulates its own transcription and is also a positive regulator of an adjacent, divergently expressed MymA locus, which has experimentally been shown to be involved in virulence in guinea pigs (Singh et al., 2003; Singh et al., 2005). Taking these SNPs together with several frameshift mutations arising from short indels, it is hypothesised that specific lineages have functionally impaired regulators, and that this has the potential to give rise to phenotypic diversity. The effects of such mutations should be detectable at the transcriptional level, and part of the following chapter (Chapter 5) explores this hypothesis.
In summary, this study has identified a set of nonsynonymous SNPs likely to have functional consequences in the MTBC. However, it is not possible to determine from the SIFT predictions how these mutations affect protein function. There are four possible evolutionary fates for SNPs: the mutant is beneficial; causes a severe fitness cost and so is lost from the population; is functionally neutral; or finally is neither beneficial nor excessively harmful, but slightly deleterious (Balbi & Feil, 2007). Slightly deleterious SNPs are the largest class, and in Escherichia coli it has been estimated that for every beneficial mutation there are 10^5 slightly deleterious mutations (Kibota & Lynch, 1996). As illustrated in Figure 4.7, it can be anticipated that many of the predicted functional SNPs identified in this study will fall within this slightly deleterious category, whilst the proportion of SNPs that have a greater impact, or are “more” functional, is unknown, but is likely determined by a combination of selective and stochastic forces, such as the level of purifying selection acting within the organism.
[Figure 4.7 schematic. Horizontal axis: increasing severity of SNP. Vertical axis: increasing number of SNPs. Regions, in order of increasing severity: tolerated SNPs (~60%); functional SNPs (~40%), including “more” functional SNPs; harmful, cell death SNPs (?%).]
Figure 4.7. Spectrum of functional SNPs. The consequences of nonsynonymous SNPs range from tolerated/neutral to functional; at the extreme, SNPs result in cell death and are therefore not observed in the bacterial population. In the MTBC ~40% of SNPs were predicted functional in this study, but their severity is unknown.
Chapter 5 Screening the effect of lineage-specific variation by sequence-based transcriptional profiling
5.1 Introduction
M. tuberculosis infection is characterised by a typically protracted period of asymptomatic infection followed by progression to active disease in a minority of individuals. Throughout these stages of infection, M. tuberculosis is exposed to a range of microenvironments and stresses, including acidic pH, reactive oxygen species, and nutrient starvation (Barry et al., 2009). Genome sequencing of the M. tuberculosis reference strain H37Rv by Cole et al. revealed a complex network of transcriptional regulation, including thirteen sigma factors, eleven two-component regulators, eleven serine-threonine protein kinases and over one hundred predicted transcription factors (Cole et al., 1998). At the initiation of this study, the extent of transcriptional variation between clinical isolates from the six main lineages was unknown, and the contribution of the underlying genetic diversity to such variation was an open question.
In 2007, a microarray-based study comparing H37Rv and the animal-adapted M. bovis growing under steady-state conditions revealed that the human and bovine pathogens showed differential expression of ninety-two genes, which encoded a range of functions, including cell wall and secreted proteins, transcriptional regulators, PE/PPE proteins, lipid metabolism and toxin–antitoxin pairs (Golby et al., 2007). It is now known that on average ~1500 SNPs separate any two MTBC strains (section 3.3.1), which raises the likelihood that human-adapted MTBC strains will also display a similar quantity of differential expression. Shortly after identification of the six main human-adapted MTBC lineages, a microarray-based study in 2010 surveyed for the first time differences in gene expression amongst clinical isolates of the MTBC (Homolka et al., 2010). The study was based on a total of fifteen MTBC clinical isolates from Lineage 1, the Beijing group of Lineage 2, two sub-lineages of Lineage 4, and Lineage 6. It found specific transcriptional patterns, both in vitro and during intracellular growth, that followed the ancient and modern lineage groupings. This demonstrated that strains from defined phylogenetic groups display similar gene expression, and suggests the importance of understanding the underlying genetic background. However, the strains used were not genome sequenced, which limited the scope of the study, as the expression differences could not be related to specific genetic variation.
The previous chapters would not have been possible without the availability of whole genome sequences, and such data are now crucial to experiments linking genotype to phenotype. Previous transcriptomic studies have relied on microarray-based methods, but recent advances in DNA sequencing technologies have enabled the determination of RNA expression through sequencing of cDNA prepared by reverse transcription of total cellular RNA (RNA-seq). This provides a dynamic range several orders of magnitude greater than other technologies, at the greatest possible resolution. The first sequence-based transcriptome of M. tuberculosis strain H37Rv was published in 2011 by Arnvig et al., and whilst H37Rv is not a clinical isolate, this work demonstrated the power of RNA-seq to capture the complete transcriptional landscape of M. tuberculosis (Arnvig et al., 2011).
5.1.1 Aims
The aims of this chapter were to survey the transcriptome profiles of M. tuberculosis clinical isolates from Lineages 1 and 2, and to understand the effects of lineage-specific variation identified in the previous chapters. Specific aims were to:
• characterise M. tuberculosis transcriptomes using a sequence based approach
• capture lineage-specific transcription profiles in the transcriptome sets
• explore the functional impact of lineage-specific SNPs identified in Chapters 3 and 4
5.2 Methods
5.2.1 Clinical isolates in study
5.2.1.1 Strains sequenced using RNA-seq
Strains are from a collection of M. tuberculosis isolates from foreign-born tuberculosis patients in San Francisco, who contracted the infection in their country of origin (Gagneux et al., 2006a). All strains are drug susceptible and have been typed in previous studies (Table 5.1) (Gagneux et al., 2006a; Hershberg et al., 2008). Three strains were selected from each of Lineages 1 and 2, to represent the genetic diversity in the lineages. Figure 5.1 shows the previously described MTBC phylogeny based on MLSA, with the strains used in the RNA-seq study highlighted (Hershberg et al., 2008). From Lineage 1, two strains are from the large Rim of Indian Ocean subgroup (strains N0072 and N0153) and one is a representative of the Philippines subgroup (strain N0157). From Lineage 2, two Beijing strains were selected (strains N0145 and N0052) and a less common non-Beijing strain (N0031). Figure 5.1 uses the original naming schema, but from this point on the later adopted ‘N’ number strain naming will be used. To preserve the two naming conventions, both are given in Table 5.1. All strains have been genome sequenced in previous studies or as part of this thesis.
5.2.1.2 Additional growth curve experiment strains
The determination of growth rates for the RNA-seq study strains was supplemented by the clinical isolates shown in Table 5.2. In total, six additional strains (three each from Lineages 1 and 2) were included to explore potential lineage-specific differences in exponential phase growth rate. The reference laboratory strain H37Rv was also included.
Table 5.1. Lineage 1 and 2 strains used in the RNA-seq study. All strains were previously genome sequenced except strain N0031, which was sequenced for this thesis in Chapter 3. This study refers to the strain names used in the Gagneux group, but original strain names used by Hershberg et al. (2008) are shown for reference. In addition to lineage, the region of difference (RD), which has been historically used to type the strains, is indicated. Geographic distribution and prevalence of lineage based on previous classifications (Coscolla & Gagneux, 2010).
Strain   MLSA strain name   Lineage   RD lineage   Lineage geographic distribution   Patient origin
N0153    T83                1         RD239        Rim of Indian Ocean               Vietnam
N0072    EAS053             1         RD239        Rim of Indian Ocean               India
N0157    T92                1         RD239        The Philippines                   The Philippines
N0145    T67                2         RD105        Beijing                           China
N0052    98_1833            2         RD105        Beijing                           China
N0031    94_M4241A          2         RD105        Non-Beijing                       China
Table 5.2. Additional strains used in growth curve experiment. Three additional strains from each of Lineage 1 and Lineage 2 were included in the growth curve experiments in combination with the previously described six RNA-seq study strains. All are clinical strains isolated as part of the San Francisco strain collection (Gagneux et al., 2006a). The Genome column indicates the genome sequencing status of the strain.
Strain   Strain ID   Lineage   RD lineage   Patient origin    Genome
N0043    96_4329     1         RD239        Burma             Y
N0075    EAS080      1         RD239        Vietnam           N
N0121    T17         1         RD239        The Philippines   Y
N0041    96_2104     2         RD105        Vietnam           N
N0053    98_1863     2         RD105        China             Y
N0140    T47         2         RD105        Macau             N
5.2.1.3 Additional qRT-PCR strains
Confirmation by qRT-PCR of the lineage-specific expression of selected genes used all of the RNA-seq strains described above, together with four additional strains, two each from Lineages 1 and 2. These are shown below in Table 5.3.
Table 5.3. Additional strains used in qRT-PCR confirmation. Two strains from each of Lineage 1 and Lineage 2 were included in the RNA-seq confirmation. All are clinical strains isolated as part of the San Francisco strain collection (Gagneux et al., 2006a). The Genome column indicates the genome sequencing status of the strain. One strain is currently not genome sequenced, but this was not required for the aims of the qRT-PCR study.
Strain   Strain ID   Lineage   RD lineage   Patient origin    Genome
N0043    96_4329     1         RD239        Burma             Y
N0121    T17         1         RD239        The Philippines   Y
N0041    96_2104     2         RD105        Vietnam           N
N0053    98_1863     2         RD105        China             Y
Figure 5.1. Strains sequenced in RNA-seq study. Circles indicate the six Lineage 1 and 2 strains used in the RNA-seq study. Phylogenetic tree of the MTBC adapted from Hershberg et al. (2008). Image reproduced under the Creative Commons Attribution License (CCAL).
5.2.2 Cluster analysis
Hierarchical cluster analysis of the transcriptomes was performed using the hclust function in R with the complete linkage method. Spearman distances were calculated from the dissimilarity matrix of pairwise correlations of total gene expression (N=4,015 genes), expressed as Reads Per Kilobase per Million mapped reads (RPKM). Clade support was assessed with 1000 bootstrap replications using the R function pvclust. Comparison of the total gene expression per strain with SNP distance was performed using normalised read counts transformed with the variance stabilising transformation (VST) implemented in the DESeq package (Anders & Huber, 2010). VST is a monotonic function, calculated for each sample such that the variance in the count data becomes independent of the mean.
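As an illustration of the clustering procedure described above, the following is a minimal sketch in Python rather than the R functions (hclust, pvclust) actually used in this study. It assumes that the Spearman distance is taken as one minus the Spearman correlation, and the strain labels and expression values are invented toy data; bootstrap clade support is not reproduced.

```python
from itertools import combinations


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for one gene."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_distance(x, y):
    """Assumed distance: 1 - Spearman rho (Pearson correlation of ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return 1.0 - cov / (sx * sy)


def complete_linkage(profiles):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, height)."""
    names = sorted(profiles)
    dist = {frozenset(pair): spearman_distance(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(names, 2)}

    def linkage(a, b):
        # Complete linkage: distance between the two most dissimilar members.
        return max(dist[frozenset((i, j))] for i in a for j in b)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


# Hypothetical RPKM profiles for four strains over five genes (toy values).
toy_profiles = {
    "L1_a": [10, 20, 30, 40, 50],
    "L1_b": [12, 18, 33, 41, 60],
    "L2_a": [50, 40, 30, 20, 10],
    "L2_b": [55, 42, 28, 25, 5],
}
merges = complete_linkage(toy_profiles)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

In this toy example the two strains of each hypothetical lineage share the same rank order of expression and so merge first, with the two lineage clusters joining last at the maximum distance.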
5.2.3 Differential expression analysis
Statistical testing for the main differential expression analysis was performed using DESeq (Anders & Huber, 2010). DESeq is a method based on the negative binomial distribution and implemented in the R statistical environment. Raw read counts were first normalised using DESeq to adjust for differences in library size. Reads from technical replicates were combined and treated as one sample. Genes deleted at either strain or lineage level were first removed from the analysis (N=223 genes); deletions were identified from genome coverage using the respective strain's genome, with a threshold of <90% gene coverage defining a deletion. Features (annotated genes, antisense RNAs or sRNAs) whose normalised expression overlapped between strains from different lineages, owing to strain-specific expression, were filtered out, leaving 1,606 features entered into the analysis. For the purpose of testing for lineage-specific differential expression in DESeq, strains from the same lineage were treated as biological replicates, and the mean expression of the two lineages was compared. Significant differential expression was defined as p<0.05 (p-value adjusted for multiple testing using the Benjamini-Hochberg method).
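Two of the steps above, the coverage-based deletion filter and the Benjamini-Hochberg adjustment, can be sketched as follows. This is an illustrative Python sketch and not the DESeq implementation used in this study; the function names, the input p-values and the coverage figures are hypothetical.

```python
def is_deleted(covered_bases, gene_length_bp, threshold=0.90):
    """Flag a gene as deleted when less than 90% of its length is covered."""
    return covered_bases / gene_length_bp < threshold


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank among ascending p-values
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four features.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
print(adj_p, significant)
```

With these toy values only the first feature remains below the 0.05 threshold after adjustment, although three of the four raw p-values were below it.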
5.2.4 Transcriptional Start Site (TSS) calling
Custom Perl scripts were written for TSS calling. Briefly, the increment in reads from one genome position to the next consecutive base was calculated for all genomic positions, with an increment significantly above the average background coverage defined as a candidate TSS. TSS peak height was considered representative of the level of expression from the TSS. To build a genome-wide TSS map for M. tuberculosis, automated annotation of the putative TSSs was performed according to their genomic distribution, similar to a previous TSS analysis using RNA-seq data (Sharma et al., 2010b).
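The increment-based calling step can be sketched as below. This is a Python illustration, not the custom Perl scripts used in this study. The criterion for "significantly above background" is not specified here, so the sketch assumes a simple hypothetical rule: an increment greater than a chosen multiple (the fold parameter) of the mean coverage. The coverage values are invented.

```python
def call_candidate_tss(coverage, fold=3.0):
    """Return [(position, increment)] where the read depth jumps sharply.

    coverage: per-base read depth along one strand (0-based positions).
    fold: hypothetical threshold, as a multiple of the mean coverage.
    """
    background = sum(coverage) / len(coverage)
    candidates = []
    for pos in range(1, len(coverage)):
        increment = coverage[pos] - coverage[pos - 1]
        if increment > fold * background:
            # The increment (peak height) is kept as a proxy for TSS expression.
            candidates.append((pos, increment))
    return candidates


# Toy coverage: a transcript starting abruptly at position 8.
toy_coverage = [0, 0, 0, 0, 0, 0, 0, 0, 50, 52, 51, 49, 2, 1, 0, 0, 0, 0, 0, 0]
print(call_candidate_tss(toy_coverage))
```

A real implementation would treat each strand separately and would need a statistically justified threshold; the fixed multiple used here is only for illustration.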
5.3 Results
5.3.1 Growth rate in vitro
It was critical to isolate the transcriptomes of all study strains from the same physiological state, ensuring that differential transcription is not simply a reflection of the stage of growth. RNA was harvested at two growth phases in this study, mid-exponential and stationary; these were defined as an Optical Density (OD600) of 0.4 to 0.6 and one week after an OD of 1.0, respectively. A difficulty of working with clinical strains compared to well-used reference strains is that their growth rates, which are required to standardise the RNA extraction process, are largely unknown.
Three representative strains from each of Lineages 1 and 2 were selected for the RNA-seq study (section 5.2.1.1), and the growth of the six strains was monitored over a 14-day period. In defined 7H9 medium (section 2.1.3), culture density (OD600) was measured daily from the initial inoculation (day 0). From frozen stocks, strains were grown in 10 ml 7H9 for two days prior to transfer into the roller bottles used for the growth curves and all RNA extractions. At day 0, a calculated volume was transferred from the pre-culture to start all growth curves at OD 0.01. This experiment was also used to identify any lineage-level differences in growth rate in vitro, and three additional strains from each of Lineages 1 and 2 were included to increase the sample size and so the statistical power of the test. The additional clinical isolates are described in section 5.2.1.2. The H37Rv laboratory strain was also included as a reference.
[Figure 5.2 plots. Panel A: optical density (OD600, logarithmic scale) against days from inoculum (0 to 14) for strains N0121, N0043, N0072, N0153, N0157, N0075, N0145, N0053, N0041, N0031, N0052, N0140 and H37Rv. Panel B: optical density (OD600, logarithmic scale) against days from inoculum (0 to 14) for Lineage 1, Lineage 2 and H37Rv.]
Figure 5.2. In vitro growth curves. A. Growth of twelve strains from Lineage 1 and 2, plus H37Rv. B. Strains pooled by lineage. Error bars are the standard error of the mean (SEM). All strains were grown in three independent experiments, and under the same conditions. Strains are coloured using previously defined lineage colouring.
Growth rates of the clinical strains did vary, with a trend for Lineage 2 strains to continue in late-exponential phase for longer than Lineage 1 strains (Figure 5.2A). This was reflected in higher OD600 readings for Lineage 2, with the Lineage 2 strain N0145 reaching an OD600 of ~10, the highest of all strains. The growth rate of the reference strain H37Rv fell in the middle of the range. By day 9 to 10 all strains had entered stationary phase. Figure 5.2B plots all strains from the same lineage as replicates, confirming the observation that Lineage 2 strains do continue in late-exponential phase for comparatively longer. However, mid-exponential growth is similar for all strains irrespective of lineage. As pre-cultures were used, all strains were in exponential growth at day zero, the start of the growth curve. Between days three and four, strains leave mid-logarithmic and enter late-logarithmic growth. For these experiments, mid-logarithmic growth was defined as OD ≤ 0.6.
Strain-specific doubling times are shown in Table 5.4. Exponential doubling times range from 13.8 ± 0.2 hrs (strain N0043) to 24.2 ± 0.6 hrs (strain N0075). This shows that the doubling times of the clinical strains can differ by around 10 hrs, which is important when synchronising RNA extraction experiments. Whilst there is some variability in the specific growth rates of the strains, the difference was not significant at the lineage level. The mean exponential doubling time was 18.2 ± 1.8 hrs for Lineage 1 and 16.4 ± 0.5 hrs for Lineage 2 (two-tailed Student's t-test, p=0.35).
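For reference, a doubling time can be derived from exponential-phase OD600 readings, and a lineage mean with its SEM from the strain values, as in the sketch below. This is a generic Python illustration that assumes a least-squares fit of ln(OD) against time; the fitting procedure actually used in this study is not described here, and the OD600 series is synthetic.

```python
import math
import statistics


def doubling_time_hours(times_h, od_values):
    """Doubling time = ln(2) / slope of a least-squares fit of ln(OD) on time."""
    logs = [math.log(od) for od in od_values]
    mean_t = sum(times_h) / len(times_h)
    mean_l = sum(logs) / len(logs)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(times_h, logs))
             / sum((t - mean_t) ** 2 for t in times_h))
    return math.log(2) / slope


def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Synthetic exponential-phase readings starting at OD 0.01 (16 h doubling).
times = [0, 24, 48, 72]
ods = [0.01 * 2 ** (t / 16) for t in times]
print(round(doubling_time_hours(times, ods), 2))

# Lineage 2 strain doubling times as listed in Table 5.4.
lineage2 = [16.2, 18.1, 16.2, 16.6, 16.9, 14.4]
l2_mean, l2_sem = mean_and_sem(lineage2)
print(round(l2_mean, 1), round(l2_sem, 1))
```

Only readings taken during exponential growth should be passed to such a fit, since the relationship between ln(OD) and time is no longer linear once cultures approach stationary phase.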
Table 5.4. In vitro growth rates. Doubling times in hours are shown for exponential phase growth with the SEM. All strains were grown in at least three independent experiments, under the conditions detailed in section 2.1.4. The lineage mean doubling time is also shown. The laboratory strain H37Rv was used as a reference. Asterisks (*) identify strains used in the RNA-seq study.
Strain    Lineage   Doubling time (hrs)   Error (SEM)   Lineage mean doubling time (hrs)
N0043     1         13.8                  0.2           18.2
N0072 *   1         16.1                  0.8
N0075     1         24.2                  0.6
N0121     1         16.0                  0.4
N0153 *   1         23.2                  1.9
N0157 *   1         15.9                  0.4
N0031 *   2         16.2                  0.1           16.4
N0041     2         18.1                  2.0
N0052 *   2         16.2                  2.1
N0053     2         16.6                  2.1
N0140     2         16.9                  1.9
N0145 *   2         14.4                  0.6
H37Rv     4         18.0                  1.7           18.0
5.3.2 RNA isolation and Illumina ready libraries