Cedalion: Development of a Computational Framework to Discover Pivotal Genomic Features in Newly Sequenced Plant Species
Word count: 33446

Lennart Raman
Student number: 01102304

Supervisors: Prof. Dr. Yves Van de Peer, Dr. Oren Tzfadia

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Bioinformatics: Systems Biology

Academic Year: 2016 - 2017

Preface

The past academic lustrum at Ghent University has led me to write this final dissertation, with the aim of obtaining the degree of Master of Science in Bioinformatics. I would like to use this preface to thank the many friends, family members, supervisors and professors who helped shape me into the person I am today.

For this project specifically, I am grateful to my promoter Prof. Dr. Yves Van de Peer, who allowed me to join his team as the second Master's thesis student amongst the Cedalion members. This gave me the opportunity to work in a fascinating and, to me, very attractive field of study. Moreover, considerable thanks are owed to the project's supervisor, Dr. Oren Tzfadia. Oren helped me resolve difficulties either by answering my questions or by putting me in contact with other experts at VIB-UGent, all whilst maintaining a positive and welcoming atmosphere.

Regarding recent years, much appreciation goes out to my parents. Their moral and financial support enabled me to focus on my studies in a stable manner. I also have to acknowledge their patience, as my juvenile Bachelor years were at times less noteworthy than the period in which I followed the Master of Science in Bioinformatics.

Last but not least, my friends turned the past couple of years into genuinely unforgettable moments. They offered me welcome and necessary distraction from the sometimes difficult study periods.
Special thanks go to my girlfriend Fien Stepman, who guided me in the right direction.

I'm not waving Ghent University goodbye just yet. Later this year, I'll start a PhD at UZ Ghent. I hope to stay in close contact with my current fellow students and the great people of UGent and VIB, ideally by collaborating on joint projects.

Ghent, June 2017

Table of Contents

LIST OF ABBREVIATIONS AND SYMBOLS
  Abbreviations
  Symbols
ABSTRACT
INTRODUCTION
  1. The importance of genome sequencing and annotation
    1.1. Sequencing across different research domains
    1.2. Current research challenges in the plant Kingdom
  2. The genome sequencing project
    2.1. Collecting DNA
    2.2. Sequencing
      2.2.1. Choice of sequencing platform
      2.2.2. Read depth
      2.2.3. Validation
    2.3. Assembly
      2.3.1. Principles of assembly
      2.3.2. Tools
      2.3.3. Validation
    2.4. Structural annotation
      2.4.1. Principles of structural annotation
      2.4.2. Validation
      2.4.3. Online repositories
    2.5. Functional annotation
      2.5.1. Terms and vocabularies
      2.5.2. Not all genes can be accurately annotated
      2.5.3. Homology-based functional annotation
    2.6. Uncovering additional biological knowledge
      2.6.1. Genome-wide orthology detection
      2.6.2. Extending gene annotations
      2.6.3. Genome-level biological properties
  3. Cedalion
    3.1. Automating the whole-genome sequencing project
      3.1.1. The need for automation
      3.1.2. The ultimate goal beyond the plant Kingdom
      3.1.3. Existing whole-genome annotation pipelines
    3.2. Introducing Cedalion
      3.2.1. Etymology
      3.2.2. The workflow
    3.3. Introducing personal contribution attempts
      3.3.1. The Genome Statistics Module
      3.3.2. The GO Set Testing Module
      3.3.3. The BadiRate Module
    3.4. Benchmarking Cedalion using Zostera marina and Ulva mutabilis
      3.4.1. Zostera marina
      3.4.2. Ulva mutabilis
AIMS
  1. The Genome Statistics Module
    1.1. Feature extraction
    1.2. Outputting relevant information
    1.3. Example of an additional study
  2. The GO Set Testing Module
    2.1. Framework based on codon usage
    2.2. Framework based on structural features
  3. The BadiRate Module
    3.1. BadiRate options
    3.2. Input files generation
    3.3. Downstream processing
RESULTS
  1. The Genome Statistics Module
    1.1. Software implementation
      1.1.1. Feature extraction
      1.1.2. Feature visualization
      1.1.3. Feature ranking