A Comparative Genomic Framework for the in Silico
Total Page:16
File Type:pdf, Size:1020Kb
A COMPARATIVE GENOMIC FRAMEWORK FOR THE IN SILICO DESIGN AND ASSESSMENT OF MOLECULAR TYPING METHODS USING WHOLE-GENOME SEQUENCE DATA WITH APPLICATION TO LISTERIA MONOCYTOGENES PETER KRUCZKIEWICZ Bachelor of Science, University of Victoria, 2010 A Thesis Submitted to the School of Graduate Studies of the University of Lethbridge in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE Department of Biological Sciences University of Lethbridge LETHBRIDGE, ALBERTA, CANADA c Peter Kruczkiewicz, 2013 Dedicated to Mariola and Zbigniew Kruczkiewicz iii Abstract Although increased genome sequencing efforts have increased our understanding of genomic variability within many bacterial species, there has been limited application of this knowledge towards assessing current molecular typing methods and developing novel molecular typing methods. This thesis reports a novel in silico comparative genomic framework where the performance of typing methods is assessed on the basis of the discriminatory power of the method as well as the concordance of the method with a whole-genome phylogeny. Using this framework, we designed a comparative genomic fingerprinting (CGF) assay for Listeria monocytogenes through optimized molecular marker selection. In silico validation and assessment of the CGF assay against two other molecular typing methods for L. monocytogenes (multilocus sequence typing (MLST) and multiple virulence locus sequence typing (MVLST)) revealed that the CGF assay had better performance than these typing methods. Hence, optimized molecular marker selection can be used to produce highly discriminatory assays with high concordance to whole-genome phylogenies. The framework described in this thesis can be used to assess current molecular typing methods against whole-genome phylogenies and design the next generation of high-performance molecular typing methods from whole-genome sequence data. Acknowledgements I would like to thank my supervisors, Drs. James Thomas and Eduardo Taboada, without whose support this thesis would not have been possible. I am grateful for the time and energy they have invested in mentoring me and guiding me towards becoming and thinking like a scientist. I feel fortunate to have had supervisors that had my best interests at heart. It was an honour to have Drs. Victor Gannon, Betty Golsteyn-Thomas and Brent Selinger on my thesis advisory committee. Their comments and suggestions lead me to much deliberation about the direction of the project. I would like to thank them for their guidance and support. I would like to thank Dr. Dele Ogunremi for coming all the way from Ottawa to be the external examiner during my thesis defense. Thank you for the many comments and insights, and for the suggestions of potential future directions for this research. I would also like to thank Dr. Roy Golsteyn for chairing my thesis defense. I would like to thank my colleagues at the Public Health Agency of Canada for their support and advice: Steven Mutschall and Dillon Barker for all of the wonderful suggestions for data analysis, software program features and algorithms; Cody Buchanan for the suggestions in improving various software programs I had developed; Cassandra Jokinen for all of the very useful and helpful advice for doing a masters; Ben Hetman for introducing me to Settlers of Catan; Chad Laing for the exchange of ideas and interesting and relevant papers; Susanna Whelpley for being so helpful and for dealing with HR on my behalf. I would like to thank our collaborators at the National Microbiology Laboratory (Dr. Matt Gilmour and his Listeria genomics team, Dr. Morag Graham and the DNA core team, Dr. Gary Van Domselaar and his bioinformatics team) for providing access to Listeria monocytogenes whole-genome sequence data. I would like to thank the Public Health Agency of Canada and the University of Lethbridge for their financial support, and the Government of Canada's Genomics Research and Development Initiative for partially funding this research. v Contents Approval/Signature Page ii Dedication iii Abstract iv Acknowledgementsv Contents vi List of Tables ix List of Figures x Abbreviations xi Symbols xiii 1 Literature Review1 1.1 Epidemiology......................................1 1.2 Bacterial population diversity.............................2 1.3 Phenotypic subtyping.................................3 1.3.1 Biotyping....................................4 1.3.2 Serotyping...................................4 1.3.3 Phage typing..................................5 1.4 Molecular subtyping..................................6 1.4.1 Pulsed-field gel electrophoresis........................6 vi Contents 1.4.2 PCR-based molecular subtyping.......................7 1.4.2.1 PCR-RFLP..............................7 1.4.2.2 Random amplified polymorphic DNA typing...........8 1.4.2.3 Amplified fragment length polymorphism typing.........8 1.4.2.4 Multilocus variable-number tandem-repeat analysis.......8 1.5 Population genetics and typing............................9 1.5.1 Multilocus enzyme electrophoresis...................... 10 1.5.2 Multilocus sequence typing.......................... 11 1.6 Comparative genomics................................. 12 1.6.1 Early whole-genome sequencing........................ 12 1.6.2 Microarray-based comparative genomic hybridization............ 13 1.7 The pan-genome paradigm for bacteria........................ 14 1.8 Next-generation sequencing and genomic epidemiology............... 15 1.9 Challenges with current molecular typing methods................. 16 1.10 Overview of the thesis................................. 18 1.10.1 Goals of the thesis............................... 19 2 Material and Methods 21 2.1 Retrieval of L. monocytogenes WGS data...................... 21 2.1.1 Retrieval of Canadian outbreak-related L. monocytogenes WGS data... 21 2.1.2 Retrieval of publicly available L. monocytogenes WGS data........ 22 2.1.3 Annotation of unannotated L. monocytogenes genomes........... 22 2.2 Core genome phylogenetic analysis of 60 L. monocytogenes strains........ 22 2.3 Identification of homologs using pairwise protein sequence BLAST searching... 23 2.4 Defining a CGF marker pool from the accessory genome of L. monocytogenes .. 23 2.5 CGF marker set selection using the Wallace coefficient............... 23 2.6 Multiplex PCR assay creation............................. 24 2.7 In silico validation of CGF assay........................... 25 vii List of Tables 3 Results and Discussion 27 3.1 Comparative genomic analysis of L. monocytogenes ................. 27 3.1.1 Defining the core genome of L. monocytogenes ............... 27 3.1.2 Determination of core genome phylogeny and SNP analysis........ 29 3.2 Optimized CGF assay design............................. 32 3.2.1 Determination of accessory genes suitable for CGF assay design...... 32 3.2.2 Optimized marker selection.......................... 33 3.2.3 Multiplex PCR assay design.......................... 43 3.3 In silico typing using WGS data........................... 46 3.3.1 Assumptions and limitations of in silico assay design and assessment... 48 3.3.2 In silico validation of the L. monocytogenes CGF40............ 52 3.3.2.1 In silico CGF40........................... 52 3.3.2.2 In silico MLST............................ 55 3.3.2.3 In silico MVLST........................... 55 3.3.2.4 Typing method performance assessment.............. 56 4 Thesis Conclusions 61 References 63 Listeria monocytogenes strain information 87 Thesis software project source code repositories 89 L. monocytogenes CGF40 marker information 90 In silico typing data 93 viii List of Tables 3.1 Adjusted Wallace and Simpson's index of diversity for in silico typing data against phylogenomic clusters................................. 59 3.2 P -values between Adjusted Wallace coefficients of different typing methods to the PGC defined at 20 SNPs................................ 60 1 The 37 Canadian outbreak-related L. monocytogenes strains............ 87 2 The 23 publicly available L. monocytogenes strains retrieved from NCBI..... 88 3 Thesis software project source code repository links................. 89 4 L. monocytogenes CGF marker protein products.................. 90 5 L. monocytogenes CGF markers and PCR primers................. 92 6 L. monocytogenes CGF40 clusters defined at fingerprint similarities of 100, 97.5, 95 and 90%....................................... 93 7 In silico-derived MLST data.............................. 96 8 In silico-derived MVLST data............................. 98 ix List of Figures 3.1 Pairwise SNP count heatmap of 60 L. monocytogenes strains........... 30 3.2 Defining phylogenomic clusters at various SNP thresholds............. 31 3.3 Absence/presence patterns of 169 unique accessory gene for 60 L. monocytogenes strains.......................................... 34 3.4 Absence/presence profiles of marker pool and a `good', `average' and `poor' marker set............................................. 38 3.5 CGF Optimizer screenshot............................... 42 3.6 Staggered product size PCR primer creation flowchart............... 45 3.7 CGF marker multiplexing flowchart.......................... 46 3.8 Microbial In Silico Typer screenshot......................... 51 3.9 In silico subtyping results of in silico L. monocytogenes CGF40 assay...... 54 x Abbreviations AFLP Amplified Fragment Length Polymorphism BLAST Basic Local Alignment Search Tool bp base pair CDS Coding DNA Sequence CGF Comparative Genomic Fingerprinting CGH Comparative Genomic