Statistical Modeling of Genetic and Epigenetic Factors in Gene Structures and Transcriptional Enhancers

by

William Hutchins Majoros

Graduate Program in Computational Biology and Bioinformatics
Duke University

Approved:
Tim Reddy, Supervisor
Sayan Mukherjee
Raluca Gordân
Jen-Tsan Chi

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Program in Computational Biology and Bioinformatics in the Graduate School of Duke University

2017

Copyright by William Hutchins Majoros 2017

Abstract

Predicting the phenotypic effects of genetic variants is a major goal in modern genetics, with direct applicability in many areas including the study of diseases in humans and animals, and the breeding of agriculturally important plants. Computational methods for interpreting genetic variants rely implicitly on annotations of functional genomic elements, such as genes and regulatory elements.
Importantly, the locations and boundaries of such annotations can be altered by the presence of specific alleles, either singly or in combination, so that variant interpretation and genomic annotation should ideally be performed jointly. Such joint interpretation would enable predictions to account for the influence that one or more variants may have on the phenotypic impacts of other variants. In this dissertation I describe computational methods for variant interpretation in both gene bodies and, separately, in transcriptional enhancers that regulate the expression of genes.

In the case of gene bodies, I describe novel methods for predicting how genetic variants, either singly or in combination, can impact gene structure, which I define to be the combination of a splicing pattern together with a translation reading frame. Whereas gene structure prediction methods have to date focused exclusively on annotation of reference genomes, I introduce the novel problem of annotating personal genomes of individuals or strains, and I describe and evaluate novel methods for addressing that problem. I show (i) that these methods are able to predict complex changes in gene structures that result from genetic variants, (ii) that they are able to jointly interpret multiple variants that are not independent in their effects, and (iii) that predictions are supported by both RNA-seq data and patterns of intolerance to mutation across human populations.

In the case of transcriptional enhancers, I describe experimental and associated computational methods for assessing the impacts of genetic variants on the ability of an enhancer to drive gene expression in an episomal reporter assay. I show that these methods are able to identify variants impacting enhancer function, and I show that the functional score assigned by these methods can be used to fine-map gene expression associations.
I also describe a statistical pattern recognition method for efficiently identifying drug-responsive regulatory elements genome-wide and parsing those elements into functional sub-components. I show that this model is able to identify drug-responsive enhancers with high accuracy. I show that sub-components identified by this method are enriched for distinct sets of binding motifs for transcription factors known to mediate the response to treatment by glucocorticoids, one of the most commonly used drugs in the world. Applying this model to timecourse data, I was able to cluster predicted enhancers into sets having distinct trajectories of activity over time in response to treatment. Using experimental chromatin conformation data, I show that these trajectories associate with distinct patterns of expression for genes in physical association with these enhancers.

Dedication

This dissertation is dedicated to Brandy and Daisy.

Contents

Abstract
List of Tables
List of Figures
Acknowledgements
Chapter 1 – Outline
Chapter 2 – Background
  2.1 Gene structures
    2.1.1 Gene structure and its impact on the interpretation of genetic variants
      2.1.1.1 Transcription and splicing
      2.1.1.2 Translation
      2.1.1.3 Assaying the results of splicing and translation
      2.1.1.4 Interpreting genetic variants within the context of a fixed gene structure
      2.1.1.5 Genetic variants can alter splicing
      2.1.1.6 Genetic variants can alter translation reading frames
    2.1.2 Traditional approaches to gene structure modeling
      2.1.2.1 Hidden Markov models
      2.1.2.2 Generalized hidden Markov models
      2.1.2.3 Signal sensors
      2.1.2.4 Content sensors
      2.1.2.5 Conditional random fields
  2.2 Transcriptional enhancers
    2.2.1 Enhancer function in gene regulation
    2.2.2 Experimental methods for assaying enhancers
    2.2.3 Epigenetic indicators of enhancer state
    2.2.4 Computational models of chromatin state
      2.2.4.1 Multivariate hidden Markov models
      2.2.4.2 ChromHMM
      2.2.4.3 MUMMIE
      2.2.4.4 Segway
    2.2.5 Enhancers and disease
Chapter 3 – High-throughput interpretation of gene structure changes
  3.1 Motivation
  3.2 Methods
    3.2.1 Reconstructing haplotype sequences from a VCF file
    3.2.2 Identifying changes to splice patterns and reading frames
    3.2.3 Identifying loss of function
    3.2.4 Configuration and structured output
    3.2.5 Computational validation
  3.3 Results
    3.3.1 ACE predicts changes to gene structure
    3.3.2 ACE identifies thousands of annotated human splice sites as being potentially robust to disruption
