Computational Identification of Regulatory Features Affecting
Computational identification of regulatory features affecting splicing in the human brain

Author: Warren Emmett
Supervisors: Dr. Vincent Plagnol, Prof. Jernej Ule
UCL Genetics Institute
Department of Genetics, Evolution and Environment
PhD Thesis
September 6, 2016

Acknowledgements

"The most beautiful thing we can experience is the mysterious. It is the source of all true art and all science. He to whom this emotion is a stranger, who can no longer pause to wonder and stand rapt in awe, is as good as dead: his eyes are closed."
- Albert Einstein

As Einstein points out, a sense of wonder is so important to any scientific endeavour. The opportunity I have been given to explore cellular biology using computational methods is truly a great gift. This was largely made possible by the people who have mentored me and given me chances to improve myself, such as the option to undertake this PhD. I am incredibly grateful to Dr. Vincent Plagnol for this. I have had the time and space to rediscover my love for bioinformatics and cellular biology, which had been lost in years of short-term service work. I have gained more in these three years than I could have imagined.

I am also very grateful to Prof. Jernej Ule and his lab for the opportunity to collaborate on the recursive splicing project; it was both enjoyable and a great learning experience. I would also like to thank Dr. Chris Sibley, who was my mentor on the recursive splicing project; I was very fortunate to have had his help. He is also responsible for most of the great graphs and diagrams in the recursive splicing chapter, and I have cited the paper in these cases. Furthermore, I'd like to thank all my colleagues in the UGI: Dr. Jon White and Prof. Dave Curtis for their help with the statistics, and Lucy van Dorp, Dr. Cian Murphy, Chris Steele and Jack Humphrey for their support and enthusiasm. I have been truly blessed to know such great people and great scientists.
This thesis would not have been written without the support of my family. My mother and sister, who have both suffered great loss but have remained incredibly supportive: you truly are my heroes. My grandparents, who have been a continuous inspiration and support to me throughout my life. And, of course, my girlfriend, Emily Frontain, who kept me company during long hours of thesis writing and revision, making these tasks a true pleasure. I am eternally grateful that I could have these wonderful people in my life.

These three years have not been without sadness. My father passed away suddenly during the first year of my PhD and this was indeed a terrible loss. I did not get the chance to celebrate completion of this PhD, nor share the successes of the last few years, but he is a part of me, both literally (I have half his chromosomes!) and in every delightful moment that comes from being able to follow my passion: exploring cellular biology behind a computer with a cup of coffee.

Last year, Konstantin Sofianos, my sister's partner and best friend, passed away in a hospital in Hamburg, Germany as the sun was rising on a crisp Friday morning. The soft, warm light filled the room as I held my sister's hand in bewildered silence. He died of cancer two weeks before his 32nd birthday. He was a far brighter, more capable man than I and he is sorely missed.

I dedicate this body of work to these two fine souls and their love of knowledge and passion for work.

Contents

Acknowledgements
Abstract
1 Introduction
1.1 Next generation sequencing enables high throughput, nucleotide resolution analysis of cellular processes
1.2 Splicing in vertebrate genomes
1.2.1 Circular RNAs are a novel class of non coding RNA created by backsplicing
1.3 Analysis of RNA-seq data and evaluation of splice junctions
1.3.1 Alignment of NGS fragments
1.3.2 RNA-seq analysis software
1.4 Sequencing datasets utilized in this thesis
1.4.1 Primary datasets
1.4.2 Public datasets analysed
1.5 Overview of chapters
1.5.1 Long genes, recursive splicing and their relationship to the brain
1.5.2 Circular RNA are a pervasive phenomena linked to neuronal genes
1.5.3 Exploring polymorphic variation using large exome consortia
2 Recursive splicing in human brain
2.1 Introduction
2.1.1 Long genes form a subclass with unique characteristics
2.1.2 Cryptic elements in long genes
2.1.3 Histone modifications are a central characteristic of genes and their cell specific expression
2.2 Methods
2.2.1 Software and tools used in this Chapter
2.2.2 Ancillary public datasets
2.2.3 Expression of long genes in the brain
2.2.4 Identification of recursive splicing in brain
2.3 Results
2.3.1 Expression of long genes is enriched in the brain
2.3.2 Recursive splice sites identified in human brain
2.3.3 H3k36me3 signal is deficient in long introns
2.4 Discussion
3 Characterising circular RNA in the human brain
3.1 Introduction
3.1.1 Challenges in identifying Circular RNAs
3.2 Methods
3.2.1 Accurate quantification of Circular RNA
3.2.2 Pitfalls to identification of Circular RNA
3.2.3 Pairwise analysis of highly similar gene pairs
3.3 Results
3.3.1 High confidence circRNA
3.3.2 Backsplice junctions forming between closely related genes and gene-pseudogenes are abundant in the brain
3.3.3 Pairwise realignment of backsplice junctions
3.3.4 Reciprocal back/transplicing junctions across Tubulin genes
3.3.5 Novel circRNA found in 18S rRNA
3.3.6 Case study: circRNA differential expression in Bipolar disorder
3.4 Discussion
4 Annotating and functional determination of non-coding features using variant information from exome sequencing
4.1 Introduction
4.1.1 Variant conservation as a method to identify constrained sequence
4.1.2 Branchpoints are an essential element to exon recognition and splicing
4.1.3 Exploring splice site variation by integrating genomic variation with gene expression data
4.2 Methods
4.2.1 Calculating the cumulative variant ratio across features of interest
4.2.2 Elucidation of potentially deleterious branchpoint variants
4.2.3 Interpreting splice site variation through integration with gene expression data
4.3 Results
4.3.1 Variant ratio graphs
4.3.2 Potential branchpoint disease variants
4.3.3 Integrated variant and splice junction analysis
4.4 Discussion
4.4.1 Variant ratio graphs are a novel method to annotate human specific features
4.4.2 More data are necessary to model branchpoints effectively
4.4.3 Splice junctions accurately define splice changes
4.4.4 Minor allele frequency is not a predictor of splice site pathogenicity
4.4.5 Technical and biological variation impact data quality
4.4.6 Variant splicing score captures significant splice junction change
4.4.7 How the current study compares to recent publications
4.4.8 Optimization is essential for reproducibility of this analysis
4.4.9 Conclusion
5 Conclusion
5.1 Core features and central concepts of the thesis
5.1.1 Splicing as the primary force behind species diversity
5.1.2 Splice junction reads are key to sensitive gene expression analysis
5.1.3 Predicting damaging variation on splicing
5.1.4 The evolution of sequencing technology and its impact on splicing analysis
5.1.5 Data processing as a crucial skill in bioinformatics
5.1.6 Stepping forward, understanding splicing within the context of exon definition
5.2 Medical Implications
5.3 Further thoughts and future work
5.3.1 Recursive splicing as a genomic mechanism to control promoter usage
5.3.2 Circular RNA as novel, brain enriched RNA molecules
5.3.3 Analysis of splicing variants and their impact on transcription
5.4 Final thoughts
Appendix

Abstract

RNA splicing has enabled a dramatic increase in species complexity. Splicing occurs in over 95% of mammalian genes, allowing the development of exceptional cellular diversity without an increase in raw gene numbers. This is highlighted by the fact that humans and nematodes have nearly the same number of genes (20,000 human genes versus 19,000 genes in Caenorhabditis elegans). Although the mechanistic process of splicing is now well understood, there remain a multitude of unexplored dynamics that have only become visible with the power of next generation sequencing (NGS).

The human brain is one of the best examples of an intricate cellular structure. Neuronal cell types are incredibly diverse and specialised, and are regulated through various transcriptional mechanisms. Recently, long genes (150 kb+) have been implicated as crucial to neuronal function and their impairment has been linked to several neurological disorders. I explore this relationship further by showing that long genes are more highly expressed in the brain than in other tissues. Long genes are also distinct in that they are deficient in H3K36me3, a histone mark largely associated with splicing and active transcription. Through analysis of brain RNA-seq data, a novel splicing mechanism known as recursive splicing was identified in long introns.