Adeola Mujidat Rotimi
Total Page:16
File Type:pdf, Size:1020Kb
Development of multi-locus barcodes for identification of bacterial strains and species in environmental samples using next generation sequencing technologies Adeola Mujidat Rotimi Development of multi-locus barcodes for identification of bacterial strains and species in environmental samples using next generation sequencing technologies by ADEOLA MUJIDAT ROTIMI Submitted in partial fufilment of the requirements for the degree Philosophiae Doctor in the Centre for Bioinformatics and Computational Biology Department of Biochemistry, Genetics and Microbiology Faculty of Natural and Agricultural Sciences University of Pretoria October, 2018 In the beginning was the Word, and the Word was with God, and the Word was God. John 1 vs 1 i Dedication This work is dedicated to GOD Almighty and my late mother (Mrs Modupe Ayodele Helen Salawu); may your gentle soul continue to rest in perfect peace ii SUBMISSION DECLARATION I, Adeola Mujidat Rotimi, declare that the thesis, which I hereby submit for the degree Philosophiae Doctor in the Department of Biochemistry, Genetics and Microbiology at the University of Pretoria, is my own work and has not previously been submitted by me for a degree at this or any other tertiary institution. SIGNATURE: _________ DATE: 17th October, 2018 iii PLAGIARISM STATEMENT UNIVERSITY OF PRETORIA FACULTY OF NATURAL AND AGRICULTURAL SCIENCES DEPARTMENT OF BIOCHEMISTRY, GENETICS AND MICROBIOLOGY Full name: Adeola Mujidat Rotimi Student number: 11371855 Title of the work: Development of multi-locus barcodes for identification of bacterial strains and species in environmental samples using next generation sequencing technologies Declaration 1. I understand what plagiarism entails and I am aware of the University’s policy in this regard. 2. I declare that this thesis is my original work. Where someone else’s work was used (whether from a printed source, the internet or any other source) due acknowledgment was given and reference was made according to departmental requirements. 3. I did not make use of another student’s previous work to submit it as my own. 4. I did not allow and will not allow anyone to copy my work with the intention of presenting it as his or her work. Signature _________ Date: 17th October, 2018 iv ACKNOWLEDGEMENTS Firstly: I would like to extend my greatest gratitude to the almighty GOD, the creator of heaven and earth. My GOD, I will forever be grateful to you Secondly: I would like to sincerely thank the following individuals: My husband: for believing in me; thanks for all your support, prayers, patience and understanding that made my PhD journey peaceful. Prof Oleg N. Reva (supervisor): for accepting me as your PhD student, your humility, endless motivation, patience, support and professional supervision in the successful completion of this research project. Prof Reva, you are a great researcher and I hope to be like you some day. May GOD bless you and yours. Dr Rian Pierneef: thank you, my good teacher, for always being willing to assist me. I really do appreciatiate your kind gestures. Prof Fourie Joubert (Head of Bioinformatics Centre): thank you for accepting me into the Bioinformatics laboratory and all the assistance rendered to me when needed Mr Johann Swart (Systems Administrator): I appreciate all the assistance given when needed My extended family (Colleagues): Lebirata, Shaza, Greg, Shawn, Dillon, Bianca, Ansie and all the others for their moral support and understanding My parents: Thank you for the great value you attach to education and all the sacrifices made for your kids to be educated. Siblings: Ademola, thanks for always being willing to help and to all my other siblings; you are all amazing, thanks for your support always. My kids: Opeyemi and Damilola, I love you both dearly. Friends: Maltina Chukwu, Iqra and Shane, for your tremendous support and encouragement. South African National Research Foundation (NRF) and University of Pretoria: for the financial support received for my PhD degree. v SUMMARY Metagenomic approaches have revealed the complexity of environmental microbiomes and the advancement in whole genome sequencing showed a significant level of genetic heterogeneity on species level. It has become clear that a superior pattern of bioactivity of bacteria applicable in biotechnology, as well as the enhanced virulence of pathogens, often requires distinguishing between closely related species or sub-species. Current methods for binning of metagenomic reads usually do not allow identification below the genus level and very often, stop at the level of families. In this work, an attempt was made to improve metagenome binning resolution by creating genome-specific barcodes, based on the core and accessory gene sequences. This protocol was implemented in novel software tools available for use and download from http://bargene.bi.up.ac.za/. The most abundant barcode genes from the core genomes were found to encode for ribosomal proteins, some other central metabolic genes and ABC transporters. The performance of the created metabarcode sequences was evaluated using artificially generated and publicly available metagenomic datasets. Furthermore, a program, Barcoding 2.0, was developed to align reads against barcode sequences and calculate various parameters for scoring the alignment results and individual barcodes. Taxonomic units were identified in metagenomic samples by comparison of the calculated barcode scores to set cut- off values. In the study, it was found that varying sample sizes, i.e. the number of reads in a metagenome and metabarcode lengths had no significant effect on the sensitivity and specificity of the algorithm. Receiver operating characteristics curves were calculated for different taxonomic groups based on the results of identification of the corresponding genomes in artificial metagenomic datasets and the reliability of distinguishing between species of the same genus or family by the program was close to 100%. The results showed that the novel online tool, BarcodeGenerator (http://bargene.bi.up.ac.za/), was an efficient approach to generating barcode sequences from a set of complete genomes provided by users. Another program, Barcoder 2.0, was made available from the same resource to enable efficient and practical use of metabarcodes for visualisation of distribution of organisms of interest in environmental and clinical samples. vi TABLE OF CONTENTS DEDICATION.......................................................................................................................... ii SUBMISSION DECLARATION ......................................................................................... iii PLAGIARISM STATEMENT .............................................................................................. iv ACKNOWLEDGEMENTS .................................................................................................... v SUMMARY ............................................................................................................................ vi TABLE OF CONTENTS ...................................................................................................... vii LIST OF FIGURES ............................................................................................................... xii LIST OF TABLES ................................................................................................................. xv LIST OF ABBREVIATIONS ............................................................................................. xvii CHAPTER 1: Literature Review ........................................................................................... 1 1.1 Sequencing technologies and advance in genomic studies .............................................. 1 1.1.1 First-generation sequencing: Classical sequencing ................................................... 2 1.1.2 Second-generation sequencing: Next-generation sequencing ................................... 2 1.1.3 Third-generation sequencing: Insight into the near future ........................................ 4 1.1.4 Fourth-generation sequencing: In situ sequencing .................................................... 5 1.2 Metagenomics .................................................................................................................. 6 1.2.1. Methods and approaches of metagenomics .............................................................. 6 1.2.2 NGS sequencing technology ................................................................................... 13 1.2.3 Assembly ................................................................................................................. 15 1.2.4. Binning and binning algorithms ............................................................................. 16 1.2.5 Taxonomy-dependent methods ................................................................................ 17 1.2.5.1 Alignment-based methods ................................................................................ 17 1.2.5.2 Composition-based methods ............................................................................. 19 1.2.5.3 Hybrid methods ................................................................................................. 20 1.2.6 Taxonomy-independent or read clustering approaches ........................................... 21 1.2.7 Strategies of validation of binning results ............................................................... 22 vii 1.2.8 Annotation of metagenomic reads