'Next- Generation' Sequencing Data Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Novel Algorithm Development for ‘Next- Generation’ Sequencing Data Analysis Agne Antanaviciute Submitted in accordance with the requirements for the degree of Doctor of Philosophy University of Leeds School of Medicine Leeds Institute of Biomedical and Clinical Sciences 12/2017 ii The candidate confirms that the work submitted is her own, except where work which has formed part of jointly-authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly given within the thesis where reference has been made to the work of others. This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement ©2017 The University of Leeds and Agne Antanaviciute The right of Agne Antanaviciute to be identified as Author of this work has been asserted by her in accordance with the Copyright, Designs and Patents Act 1988. Acknowledgements I would like to thank all the people who have contributed to this work. First and foremost, my supervisors Dr Ian Carr, Professor David Bonthron and Dr Christopher Watson, who have provided guidance, support and motivation. I could not have asked for a better supervisory team. I would also like to thank my collaborators Dr Belinda Baquero and Professor Adrian Whitehouse for opening new, interesting research avenues. A special thanks to Dr Belinda Baquero for all the hard wet lab work without which at least half of this thesis would not exist. Thanks to everyone at the NGS Facility – Carolina Lascelles, Catherine Daley, Sally Harrison, Ummey Hany and Laura Crinnion – for the generation of NGS data used in this work and creating a supportive and stimulating work environment. Finally, I am extremely grateful to Sir Jules Thorn Charitable Trust, who funded this work. ii Abstract In recent years, the decreasing cost of ‘Next generation’ sequencing has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of new types of data generated by these technologies have struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation. This thesis focuses on development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics – computational prioritisation/identification of disease gene variants and identification of RNA N6 -adenosine Methylation from sequencing data. The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology and its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). The second part of chapter investigates an alternative approach to candidate variant prioritisation by leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies (Chapter 4). Chapter 5 discusses N6 AdenosineMethylation, a recently re-discovered post- transcriptional modification of RNA. The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyl-adenosine -RNA binding protein and its possible roles in the progression of viral infection. iii Contents 1. Introduction .................................................................................................................... 1 1.1 Nucleic acid sequencing......................................................................................................................... 1 1.2 ‘Next-Generation’ Sequencing ............................................................................................................. 6 1.3 “Next-Generation” Sequencing Applications ............................................................................. 10 1.4 Bioinformatics Challenges in ‘Next-Generation’ Sequencing Era ..................................... 12 1.5 Overview of this work ......................................................................................................................... 14 2. Candidate Disease Gene and Variant Prioritisation ......................................................... 16 2.1 Disease Gene/Variant Identification ............................................................................................. 16 2.1.1 Genetic Linkage and Association Studies .................................................................. 16 2.1.2 Whole Exome and Whole Genome Sequencing Studies ........................................... 18 2.1.3 Putative disease variant prioritisation....................................................................... 22 2.2 Computational Candidate Disease Gene Prioritisation ......................................................... 27 3. GeneTiER: gene Tissue Expression Ranker ...................................................................... 41 3.1 Motivation ................................................................................................................................................ 41 3.2 Methods ..................................................................................................................................................... 47 3.2.1 Software Implementation ......................................................................................... 47 3.2.2 User Input Processing ................................................................................................ 48 3.2.3 GeneTiER Database ................................................................................................... 51 3.2.4 The gene prioritization algorithm .............................................................................. 54 iv 3.3 Results and Discussion ........................................................................................................................ 55 3.3.1 Performance Assessment .......................................................................................... 55 3.3.2 Case Study Genes ...................................................................................................... 65 3.3.3 GeneTiER Application ................................................................................................ 74 3.3.4 Discussion .................................................................................................................. 76 4 OVA: Ontology Variant Analysis Tool .............................................................................. 79 4.1.1 Motivation ................................................................................................................. 79 4.1.2 Representing Biomedical Knowledge in Machine-Readable Ways ........................... 79 4.1.3 Biomedical Domain Ontologies ................................................................................. 83 4.2 Methods ..................................................................................................................................................... 85 4.2.1 Overview .................................................................................................................... 85 4.2.2 Software Implementation ......................................................................................... 87 4.3 Results and Discussion ...................................................................................................................... 103 4.3.1 Performance Assessment ........................................................................................ 103 4.3.2 OVA Application....................................................................................................... 121 4.3.3 Discussion ................................................................................................................ 132 5. N6-Methyl Adenosine Sequencing ............................................................................... 136 5.1 Introduction........................................................................................................................................... 136 5.1.2 N6-Methyl Adenosine Molecular Biology................................................................ 137 5.1.3 m6A detection .......................................................................................................... 155 v 5.1.4 Computational methods for m6A-seq data analysis ................................................ 159 Box 1. Hidden Markov models .......................................................................................................... 163 Box 2.