Computational Methods for the Identification and Characterization
Computational Methods for the Identification and Characterization of Non-Coding RNAs in Bacteria

Dissertation submitted to the Mathematisch-Naturwissenschaftliche Fakultät of the Eberhard Karls Universität Tübingen for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Presented by Alexander Herbig from Altenkirchen

Tübingen 2014

Date of oral examination: 30.01.2015
Dean: Prof. Dr. Wolfgang Rosenstiel
1st reviewer: PD Dr. Kay Nieselt
2nd reviewer: Prof. Dr. Daniel Huson
3rd reviewer: Prof. Dr. Rolf Backofen

Summary (Zusammenfassung)

Research findings of recent years have shown how complex the structure and regulation of even bacterial transcriptomes can be. The important role of non-coding RNAs (ncRNAs), which are not translated into proteins, is also becoming increasingly clear. These molecules perform a wide variety of tasks in the cell, such as the regulation of metabolic processes. The characterization of an organism's ncRNA genes has therefore become an indispensable part of systems biology projects. Modern high-throughput methods for DNA and RNA sequencing allow genomes and transcriptomes to be studied in great detail. The resulting data have to be analysed comparatively in order to investigate variations of the transcriptome between different organisms and environmental conditions. This requires efficient computer programs that are able to combine genomic and transcriptomic data and to carry out the corresponding analyses in an automated and reproducible manner. In addition, these approaches must be able to localize and annotate non-coding elements in their genomic context. In this dissertation I present computer programs that address these tasks.

To this end, the program nocoRNAc was developed, which detects ncRNAs in bacterial genomes and characterizes them with respect to various properties. These include, for example, the computation of transcription start and end points, secondary structure and possible interaction partners. nocoRNAc was used in an extensive transcriptome study of the antibiotic-producing bacterium Streptomyces coelicolor, which demonstrated the relevance of ncRNAs as possible regulators.

For the comparative analysis of high-resolution genome and transcriptome data from multiple organisms, the SuperGenome concept was developed in this dissertation and applied to the comparative visualization of multiple genomes. It also served as the basis for a new method for determining transcription start sites in bacterial genomes. Its application to the human pathogen Campylobacter jejuni made it possible to characterize the transcriptome of this organism at a global level. In addition, several previously unknown ncRNAs were identified, among them a previously uncharacterized CRISPR locus, which is an adaptive bacterial immune system.

The study of pathogens can also be of historical interest. The emerging field of paleogenetics deals with the reconstruction and analysis of genomes of ancient, sometimes long-extinct organisms. This dissertation introduces new methods for the automatic reconstruction and characterization of ancient bacterial genomes, which were used to investigate the evolution of Mycobacterium leprae, the causative agent of leprosy.

The algorithms and tools developed in this dissertation, together with the insights gained through them, make a valuable contribution to the understanding of bacterial genomes and transcriptomes and will continue to help in understanding their fundamental evolutionary mechanisms.

Abstract

In recent years, the complexity of even bacterial transcriptomes has become increasingly evident. The important role of so-called non-coding RNAs (ncRNAs), which do not encode proteins, is increasingly recognized, as they fulfill a variety of functions, such as the regulation of cellular processes or catalysis of other molecules. Therefore, the characterization of an organism's ncRNA repertoire has become an essential part of systems biology studies. In this context, novel high-throughput technologies in the field of DNA and RNA sequencing allow for the investigation of genomes and transcriptomes in unprecedented detail. These methodologies produce vast amounts of data that have to be analysed comparatively in order to elucidate variations between different organisms or environmental conditions. For these tasks, efficient computational methods are needed that integrate genomic and transcriptomic data from multiple data sets in an automated and reproducible manner. In addition, these approaches have to facilitate the genomic localization of ncRNA elements and their detailed annotation, e.g., with respect to promoter regions or transcription start sites, as well as their functional characterization, such as the prediction of their targets of regulation.

In this dissertation I have made a number of contributions that address these challenges. The computer program nocoRNAc was developed, which predicts ncRNAs in bacterial genomes and characterizes them with respect to multiple properties such as transcription start and end points, secondary structure and potential interaction partners.
nocoRNAc has been applied in the context of a comprehensive time series expression study of the antibiotic-producing bacterium Streptomyces coelicolor, which was cultivated under different environmental conditions. During this study the importance of ncRNAs as potential regulators became evident.

For the analysis of high-resolution genomic and transcriptomic data from multiple organisms, the SuperGenome concept was developed. The approach was applied in the context of whole-genome alignment visualization and served as the basis for an algorithm for the comparative detection of transcription start sites in bacterial genomes utilizing RNA-seq data. The application to multiple strains of the human pathogen Campylobacter jejuni allowed for the global characterization of this organism's transcriptome and led to the detection of several novel ncRNAs, among them a previously uncharacterized CRISPR locus, which represents an adaptive bacterial immune system.

Studying pathogens can also be of historic relevance. The emerging field of paleogenetics focuses on the reconstruction and analysis of genomes of ancient organisms, whose DNA has been extracted from archaeological samples, such as bones. In this dissertation I present computational methods for the reconstruction and characterization of ancient bacterial genomes, which have been applied to study the evolution of Mycobacterium leprae, the bacterium causing leprosy.

Overall, the algorithms and tools developed in this dissertation and the insights that have been gained by their application contribute to the understanding of the structure and organization of bacterial genomes and transcriptomes and will help to elucidate the basic mechanisms that drive their evolution.

Acknowledgements

First of all, I am more than grateful to PD Dr. Kay Nieselt for giving me the opportunity to contribute to so many interesting and exciting projects.
I am thankful for the outstanding support I received from her as a supervisor, for encouraging me whenever necessary and for giving me a lot of freedom for creative ideas. Not least, I would like to thank her for numerous comments and discussions that helped me to improve the quality of my thesis.

Furthermore, I want to express my appreciation to Prof. Daniel Huson for being the co-supervisor of my dissertation.

I owe gratitude to the whole group of Integrative Transcriptomics for support and inspiring discussions on countless occasions, in particular Dr. Florian Battke, Dr. Stephan Symons, Aydın Can Polatkan and Günter Jäger. Special thanks go to Dr. Florian Battke for being a great colleague, for all the joint work in multiple projects and manuscripts, but also for adventurous conference experiences and enlightening discussions about science and beyond.

I am grateful to Sabine Gebert for always being extremely helpful in the organization and planning of countless events and activities. Overall, I would like to thank all my colleagues in the Center for Bioinformatics for an outstanding working environment and for insightful conversations.

In addition, I gratefully acknowledge all those bright scientists with whom I had the opportunity to discuss my research at conferences or on other occasions. In particular, I am thankful to Prof. Rolf Backofen, Dr. Andreas Richter, Dr. Steffen Heyne, Dr. Fabrizio Costa, Prof. Peter Stadler, Prof. Robert Giegerich, Stefan Janssen, Dr. Aaron Darling and Prof. Peter Wills.

I am very glad that I had the help of very skilled students during multiple projects, especially in exploring and extending the functionalities of the software presented in this dissertation. In particular, I would like to thank Stefan Raue, Daniel Lehle, Mark Rurik, Andreas Friedrich, Linus Backert, André Hennig and Sven Fillinger.

Furthermore, I am thankful to all my collaborators in the SysMO Stream consortium, especially Prof. Wolfgang Wohlleben, Dr. Yvonne Mast, Dr. Eva Waldvogel and Merle Nentwich (University of Tübingen) as well as Dr. Michael Bonin, Dr. Michael Walter (Microarray Facility Tübingen), Prof.