Statistical Modeling of Genetic and Epigenetic Factors in Gene Structures and Transcriptional Enhancers

by

William Hutchins Majoros

Graduate Program in Computational Biology and Bioinformatics Duke University

Date: ______

Approved:

______Tim Reddy, Supervisor

______Sayan Mukherjee

______Raluca Gordân

______Jen-Tsan Chi

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Program in Computational Biology and Bioinformatics in the Graduate School of Duke University

2017

ABSTRACT

Statistical Modeling of Genetic and Epigenetic Factors in Gene Structures and Transcriptional Enhancers

by

William Hutchins Majoros

Graduate Program in Computational Biology and Bioinformatics Duke University

Date: ______

Approved:

______Tim Reddy, Supervisor

______Sayan Mukherjee

______Raluca Gordân

______Jen-Tsan Chi

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Program in Computational Biology and Bioinformatics in the Graduate School of Duke University

2017

Copyright by William Hutchins Majoros 2017

Abstract

Predicting the phenotypic effects of genetic variants is a major goal in modern genetics, with direct applicability in many areas including the study of diseases in humans and animals, and the breeding of agriculturally important plants.

Computational methods for interpreting genetic variants rely implicitly on annotations of functional genomic elements, such as genes and regulatory elements. Importantly, the locations and boundaries of such annotations can be altered by the presence of specific alleles, either singly or in combination, so that variant interpretation and genomic annotation should ideally be performed jointly. Such joint interpretation would enable predictions to account for the influence that one or more variants may have on the phenotypic impacts of other variants.

In this dissertation I describe computational methods for variant interpretation in both gene bodies and, separately, in transcriptional enhancers that regulate the expression of genes. In the case of gene bodies, I describe novel methods for predicting how genetic variants, either singly or in combination, can impact gene structure, which I define to be the combination of a splicing pattern together with a translation reading frame. Whereas gene structure prediction methods have to date focused exclusively on annotation of reference genomes, I introduce the novel problem of annotating personal genomes of individuals or strains, and I describe and evaluate novel methods for addressing that problem. I show (i) that these methods are able to predict complex changes in gene structures that result from genetic variants, (ii) that they are able to jointly interpret multiple variants that are not independent in their effects, and (iii) that predictions are supported by both RNA-seq data and patterns of intolerance to mutation across human populations.

In the case of transcriptional enhancers, I describe experimental and associated computational methods for assessing the impacts of genetic variants on the ability of an enhancer to drive gene expression in an episomal reporter assay. I show that these methods are able to identify variants impacting enhancer function, and I show that the functional score assigned by these methods can be used to fine-map gene expression associations.

I also describe a statistical pattern recognition method for efficiently identifying drug-responsive regulatory elements genome-wide and parsing those elements into functional sub-components. I show that this model is able to identify drug-responsive enhancers with high accuracy. I show that sub-components identified by this method are enriched for distinct sets of binding motifs for transcription factors known to mediate the response to treatment by glucocorticoids, one of the most commonly used drugs in the world. Applying this model to timecourse data, I was able to cluster predicted enhancers into sets having distinct trajectories of activity over time in response to treatment. Using experimental chromatin conformation data, I show that these trajectories associate with distinct patterns of expression for genes in physical association with these enhancers.

Dedication

This dissertation is dedicated to Brandy and Daisy.

Contents

Abstract ...... iv

List of Tables ...... xiv

List of Figures ...... xv

Acknowledgements ...... xxv

Chapter 1 – Outline ...... 1

Chapter 2 – Background ...... 3

2.1 Gene structures ...... 4

2.1.1 Gene structure and its impact on the interpretation of genetic variants ...... 4

2.1.1.1 Transcription and splicing ...... 5

2.1.1.2 Translation ...... 13

2.1.1.3 Assaying the results of splicing and translation ...... 16

2.1.1.4 Interpreting genetic variants within the context of a fixed gene structure ...... 21

2.1.1.5 Genetic variants can alter splicing ...... 23

2.1.1.6 Genetic variants can alter translation reading frames ...... 26

2.1.2 Traditional approaches to gene structure modeling ...... 28

2.1.2.1 Hidden Markov models ...... 29

2.1.2.2 Generalized hidden Markov models ...... 33

2.1.2.3 Signal sensors ...... 35

2.1.2.4 Content sensors ...... 36

2.1.2.5 Conditional random fields ...... 36

2.2 Transcriptional enhancers ...... 39

2.2.1 Enhancer function in gene regulation ...... 39

2.2.2 Experimental methods for assaying enhancers ...... 41

2.2.3 Epigenetic indicators of enhancer state ...... 46

2.2.4 Computational models of chromatin state ...... 49

2.2.4.1 Multivariate hidden Markov models ...... 49

2.2.4.2 ChromHMM ...... 50

2.2.4.3 MUMMIE ...... 51

2.2.4.4 Segway ...... 54

2.2.5 Enhancers and disease ...... 54

Chapter 3 – High-throughput interpretation of gene structure changes ...... 57

3.1 Motivation ...... 57

3.2 Methods ...... 61

3.2.1 Reconstructing haplotype sequences from a VCF file ...... 62

3.2.2 Identifying changes to splice patterns and reading frames ...... 64

3.2.3 Identifying loss of function ...... 66

3.2.4 Configuration and structured output ...... 67

3.2.5 Computational validation ...... 67

3.3 Results ...... 70

3.3.1 ACE predicts changes to gene structure ...... 70

3.3.2 ACE identifies thousands of annotated human splice sites as being potentially robust to disruption ...... 74

3.3.3 ACE confirms previous estimates of the effect of nonsense-mediated decay on transcript levels ...... 75

3.3.4 ACE’s loss-of-function predictions in healthy individuals are highly enriched for genes tolerant to mutation ...... 78

3.3.5 ACE aids interpretation of insertion and deletion variants within genes ...... 80

3.3.6 ACE accurately reconstructs human blood-group alleles at the ABO locus ...... 83

3.3.7 ACE identifies complex gene-structure changes in a plant gene influencing flavor and nutritional content ...... 85

3.4 Discussion ...... 88

3.5 Supplementary methods ...... 92

3.5.1 Efficient reconstruction of haplotype sequences ...... 92

3.5.2 Mapping annotations to individualized sequences ...... 95

3.5.3 Predicting loss of function ...... 96

3.5.4 Probability models ...... 97

3.5.5 Configurable parameters ...... 98

3.5.6 Alignment of protein sequences ...... 100

3.5.7 Alignment and quantification of RNA-seq data ...... 100

3.5.8 Computing relative expression ratios for NMD targets ...... 102

3.5.9 Versions of software used ...... 103

Chapter 4 – Variant-aware gene structure prediction in personal genomes ...... 104

4.1 Motivation ...... 104

4.2 Methods ...... 109

4.2.1 Splice Graph Random Field ...... 109

4.2.2 SGRF features and parameter estimation ...... 112

4.2.3 Integration of SGRF into ACE+ ...... 113

4.2.4 Computational validation ...... 114

4.3 Results ...... 115

4.3.1 Prediction accuracy on 150 human genomes ...... 115

4.3.2 Logistic and splicing minigene features reflect known hnRNP but not SR protein motifs ...... 118

4.3.3 De novo splice sites are prevalent, have a wide range of effects, and are misclassified by popular tools ...... 120

4.4 Discussion ...... 124

4.5 Supplementary Methods ...... 130

4.5.1 Gene structure prediction in real genes modified with premature stop codons ...... 130

4.5.2 Regularized logistic regression training ...... 131

4.5.3 Evaluation of SGRF prediction accuracy using Geuvadis data ...... 132

4.5.4 Classification of whole exons and introns ...... 133

4.5.5 Simulating creation and destruction of splice sites ...... 134

Chapter 5 – Detecting allele-specific effects on the activity of transcriptional enhancers ...... 137

5.1 Introduction ...... 137

5.2 Results ...... 140

5.2.1 Population-scale reporter assay approach ...... 140

5.2.2 Targeted sequencing of candidate regulatory elements from a GWAS population ...... 142

5.2.3 Quantifying the effects of noncoding variation in a GWAS population ...... 143

5.2.4 Identifying regulatory variants in population STARR-seq ...... 144

5.2.5 Effects of haplotypes on regulatory element activity ...... 145

5.2.6 Fine mapping genetic associations with phenotypes ...... 148

5.3 Discussion ...... 150

5.4 Future directions ...... 153

5.4.1 Scale and bias ...... 153

5.4.2 Statistical modeling of allelic effects ...... 155

Chapter 6 – Integrative modeling of dynamic epigenetic enhancer signatures ...... 157

6.1 The glucocorticoid signaling system ...... 157

6.2 Methods ...... 159

6.2.1 A multivariate model of epigenetic dynamics ...... 159

6.2.2 Training and evaluating the model ...... 162

6.2.3 Quantifying and clustering enhancer dynamics over a 12-hour timecourse ...... 164

6.2.4 Gene expression analysis ...... 166

6.2.5 Motif enrichment analysis ...... 166

6.3 Results ...... 167

6.3.1 Parameter estimates ...... 167

6.3.2 Classification accuracy ...... 168

6.3.3 Clustering of temporal trajectories ...... 170

6.3.4 Multipeak structure of enhancer signatures ...... 172

6.4 Discussion ...... 176

Chapter 7 – Conclusions ...... 181

7.1 Interpreting genetic variants impacting gene structure ...... 181

7.2 Experimentally testing for allelic effects in enhancers ...... 190

7.3 Modeling epigenetic signatures of drug-responsive enhancers ...... 195

Appendix ...... 201

References ...... 225

Biography ...... 254

List of Tables

Table 3.1: Features of splice site models for different classes of G+C content. Order indicates number of previous positions on which the current position is conditioned in the PWM...... 98

Table 3.2: Configurable parameters in ACE, with default parameters used in the analyses described here...... 99

Table 3.3: Parameters used for trimming RNA-seq reads...... 100

Table 3.4: Versions of all software used...... 103

List of Figures

Figure 1: Introns are spliced out of transcripts, leaving only the exons (Reproduced with permission from Majoros, 2007b)...... 7

Figure 2: (A) An intron and the major trans-factors involved in its recognition. (B) Nucleotide preferences of positions flanking human donor and acceptor splice sites; letter height is proportional to frequency...... 8

Figure 3: Major forms of alternative splicing. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Genetics (Luca Cartegni, Shern L. Chew, Adrian R. Krainer, Listening to silence and understanding nonsense: exonic mutations that affect splicing, copyright 2002)...... 11

Figure 4: Three-periodic phase in coding segments. (Reproduced with permission from Majoros, 2007b)...... 16

Figure 5: Gene structure combines a splicing pattern with a coding segment. (Reproduced with permission from Majoros, 2007b) ...... 16

Figure 6: One type of splice graph. Vertices represent splice sites, and edges represent introns and exons...... 18

Figure 7: (A) A specialized type of splice graph, called an ORF graph. Vertices represent splice sites and start/stop codons. Edges represent introns and exons. Edges and vertices are annotated with scores (not shown) computed by a model. Each path corresponds to a gene prediction. The graph can be re-weighted using external evidence such as RNA-seq data. The highest-scoring path is shown in red. Phase information constrains the prediction, as coding segments are assumed to be conserved and well-formed. (B) An unconstrained ORF graph for an gene. (C) The same ORF graph as in panel B, after filtering out intron edges not supported by spliced RNA-seq reads...... 20

Figure 8: (A) State-transition diagram of an HMM-based gene structure model. States labeled with nucleotides emit only those nucleotides; these are used to emit splice sites and start/stop codons. State 0 is the silent initial/final state. E=exon, I=intron, N=intergenic. Transition patterns enforce the syntax of genes. (Reproduced with permission from Majoros, 2007b) (B) An HMM depicted as a directed graphical model. Unobservables (white vertices) are states, and observables (gray vertices) are emissions. Arrows depict dependencies. (Reproduced with permission from Majoros, 2007a)...... 32

Figure 9: State-transition diagram for a GHMM. Diamond states implement signal sensors for fixed-length features; oval states implement content sensors for variable-length features. “×3” indicates that the three coding phases can have different transition and emission probabilities. Transition patterns enforce gene syntax. (Reproduced with permission from Majoros, 2007b)...... 34

Figure 10: A linear-chain CRF. (Reproduced with permission from Majoros, 2007a)...... 37

Figure 11: A multivariate HMM for identifying miRNA target sites (Reproduced from Majoros et al., 2013; used with permission)...... 52

Figure 12: Flowchart of ACE logic. Major steps are (in order): projection of annotations from the reference, analysis of potential changes to splicing, and analysis of potential changes to translation reading frames...... 62

Figure 13: (A) ACE reconstructs explicit haplotype sequences from a phased VCF file, projects reference annotations onto them, detects possible gene structure changes, and interprets changes in terms of possible loss of function. (B) When a disrupted splice site is encountered, ACE enumerates possible alternate splice forms resulting from cryptic splicing, exon skipping, intron retention, or any combination resulting from multiple variants...... 64

Figure 14: (A) Distribution of number of alternate structures predicted per disrupted splice site. (B) Distribution of proportions of predicted cryptic-site isoforms supported by at least one spliced read, when predicted isoforms are not provided to TopHat 2 (blue) and when they are provided (red). (C) Distribution of proportions of predicted cryptic-site isoforms assigned nonzero FPKM by StringTie when predicted isoforms are not provided to StringTie (blue) and when they are provided (red). (D) Distribution of proportions of predicted cryptic-site isoforms supported by at least one spliced read for splice sites simulated to be disrupted (blue) and for those that are disrupted (red). (E) Distribution of spliced reads per junction, on log10 scale, supporting sites simulated to be disrupted (blue) versus those that are disrupted...... 73

Figure 15: (A) Distribution of log2 effect sizes of N = 578 heterozygous NMD events as measured via RNA-seq transcript quantification. Dashed line at -0.42 denotes a 25% reduction in total transcript quantity. Data were filtered to improve power (sample size≥30, mean FPKM≥1). (B) Percentiles of Residual Variant Intolerance Scores (RVIS) for N = 633 genes in which at least one individual was predicted to be homozygous for gene loss of function...... 76

Figure 16: (A) Deletion of an entire splice site (top: hg19 reference sequence; bottom: haplotypes 1 and 2 of 1000 Genomes Project sample HG00096). The resulting allele appears to retain a functional splice site despite the deletion, as concluded by ACE and supported by spliced RNA-seq reads. (B) Compensatory frameshift variants: the second variant corrects the change to the reading frame introduced by the first variant (top: hg19 reference sequence, bottom: haplotype 2 of 1000 Genomes Project sample HG00096)...... 83

Figure 17: (A) Blood-group alleles of the ABO gene (ENSG00000175164). Black: coding segment; gray: untranslated region (UTR). Reference genome hg19 has the O allele; GENCODE version 19 annotates this gene as a processed transcript with no reading frame. ACE identifies the coding segment for the O and B alleles in heterozygous individual HG00096. (Coordinates have been transformed and mapped to the forward strand). (B) Complex differences in gene structure between alleles of the waxy gene in rice, due to a single G-to-T variant in a donor splice site. ACE detects a 1 nt shift in the donor splice site in the Wxb allele, resulting in a new start codon straddling the first intron. The new start codon alters the reading frame, leading to a premature stop codon and NMD...... 86

Figure 18: (A) A splice graph random field (SGRF). Vertices denote splice sites, and edges denote exons and introns. A path from TSS (transcription start site) to TES (transcription end site) outlines a single gene structure; labels 0 and 1 denote omission or inclusion, respectively, of a vertex on the selected path. (B) Cliques and their potential functions; SGRFs have only singleton and pair cliques. Potential functions for cliques labeled with any 0 do not contribute to the score, since they do not participate in the selected path. (C) Cryptic splice sites are unannotated splice sites near an annotated splice site; disrupted splice sites exist in the reference but not in the alternate sequence; de novo splice sites exist in the alternate sequence but not in the reference...... 111

Figure 19: (A) ROC curves for the SGRF with three different content sensors; TP and FP rates were computed based on spliced RNA-seq reads from LCL cells supporting predicted novel splice junctions not found in any isoform of the gene. (B) Area under ROC curves shown in panel A. (C) Difference between logistic model AUC and minigene model AUC. (D) ROC for classification of 10,000 coding exons versus 10,000 introns (red), for 400 lincRNA exons versus 400 lincRNA introns (blue), and for minigene exons with high (N=10,947) versus low (N=16,185) inclusion rates (green), using the logistic human model for classification. (E) Density of positively-scored hexamers under the human logistic model, at relative positions in 400 noncoding lincRNA exons...... 118

Figure 20: (A) Log2(number of de novo splice sites / number of disrupted splice sites) in mutation simulations (brown: requiring only a 2 bp consensus for de novo splice sites; blue: requiring sufficiently high score under splice-site model for de novo splice sites; green: requiring sufficiently high score under the splice-site model and favorable exon definition context for de novo splice sites), and in predictions supported by RNA-seq (red); numbers above bars are raw (non-log2) ratios. (B) Estimates of relative splicing activity (red) and protein change (blue) due to de novo splice sites supported by RNA-seq. (C) Frequencies of dbSNP classifications of variants predicted to create de novo splice sites and supported by spliced RNA-seq reads. (D) Frequencies of VEP classifications of variants creating de novo splice sites supported by RNA-seq...... 122

Figure 21: (A) Population STARR-seq (“Pop-STARR”) adapts the STARR-seq assay to measure regulatory potential of multiple alleles cloned from a study population. (B) Population STARR-seq is highly reproducible. Rep1–7 are biological replicates generated from independent transfections. The x- and y-axes represent element activity (output RNA reads / input DNA reads). In each case, Spearman’s r > 0.90...... 141

Figure 22: Correlation between SNP effect sizes (x-axis: log of product of effect sizes of SNPs on haplotype) and log of observed haplotype effects (y-axis) for putative regulatory haplotypes containing more than one SNP (r = 0.54, P = 0.007). Observed haplotype effect sizes were computed as normalized ratios for each haplotype versus all pooled haplotypes at a locus: (RNAhaplotype/DNAhaplotype)/ (RNApooled/DNApooled). Solid line: regression line (slope=0.8, intercept=-0.019); dotted line: 1:1 diagonal...... 147

Figure 23: Comprehensive measurement of haplotype-specific regulatory element activity provides mechanistic insights into gene regulation. (A) Distribution of enhancer activity scores for fragments containing regulatory variants (red) and fragments containing non-regulatory variants (blue). (B) Histogram of number of SNPs per assayed element. (C) Manhattan plot of eQTLs for the long noncoding RNA LINC00881. Blue dots indicate −log10(P-value) of LINC00881 eQTL from the Geuvadis database (left y-axis); red bars indicate −log10(FDR) for variants that alter regulatory activity in the population STARR-seq assay (right y-axis). Red dotted line indicates FDR = 1.0. (D) Association between normalized expression of long noncoding gene LINC00881 in LCLs as measured by the Geuvadis project (y-axis) and the measured effect size in population STARR-seq assay (x-axis) for SNP rs73170828 (r² = 0.07, P = 7.6×10⁻⁹). (E) Allele-specific H3K27ac analysis of variants rs62274098 and rs73170828, both eQTLs proximal to and 5′ of LINC00881; read counts (y-axis) differed substantially between alleles for rs73170828 (Wilcoxon P = 0.058, binomial P = 0.004) but not for rs62274098 (Wilcoxon P = 0.9; binomial P = 0.92)...... 150

Figure 24: Multivariate Hidden Markov model (MV-HMM) of GC-responsive regulatory elements. (A) State-transition diagram of multivariate HMM shown with transition probabilities. (B) Mean emission probability for each state, at pre-dex (bottom) and at 3 hr of dex exposure (top). (C) Covariance matrices for emission distributions of each state...... 163

Figure 25: Consistency of k-means clustering across multiple runs. Each heatmap represents mean within-cluster predicted dex-induced enhancer activity after k-means clustering with different random initializations. Whiter values represent lower activity, redder values represent higher activity...... 165

Figure 26: Area under the receiver operating characteristics curve (AUC) values for the full model and for reduced versions of the model including only the features listed...... 170

Figure 27: Clustering, motif, and gene-expression analyses. (A) Results of clustering trajectories of predicted enhancer activity (left). Enrichments of motifs in clusters measured as excess proportion of elements containing a significant motif match above expectation (* means P < 0.01; right). (B) Gene expression changes for genes physically interacting with enhancers in clusters shown in panel A...... 172

Figure 28: Example enhancer signature with two peaks. HMM states have been recoded: state 0.5 = background, state 1.5 = histone flank, state 2.5 = DHS peak...... 173

Figure 29: Motif enrichment in predicted peaks versus histone flanks of predicted dex-responsive enhancers genome-wide...... 175

Figure 30: Known and predicted gene structures for alleles A and O of the human blood group gene ABO (Ensembl gene ENSG00000175164). The reference genome GRCh38 contains the O allele, which contains a frameshift leading to a premature stop codon. The structure predicted by the state-of-the-art gene finder Augustus (Stanke et al., 2006) for the O allele introduces a novel exon spanning positions 14364-14392 in order to alter the reading frame and avoid the premature stop codon at position 17791, resulting in a higher likelihood due to the strong coding signal in the long final exon of the A allele. (Coordinates have been transformed and mapped to the forward strand)...... 201

Figure 31: Distribution of distances (nt) between cryptic splice sites and annotated sites in DBASS, the Database of Aberrant Splice Sites (Buratti et al., 2007); outliers above 644nt were trimmed for illustration purposes only; these constitute <5% of the distribution...... 202

Figure 32: Example of structured Essex output (XML reports are structured similarly). Notable features include the alignment of the reference sequence to the alternate sequence via a CIGAR string indicating insertion, deletion, and match lengths; classification of variants for the reference transcript, the mapped transcript, and any putative novel transcript structures resulting from disruptions to splice sites or changes in translation reading frames (note that variant classifications can differ between alternate transcripts); protein translations for all versions of a transcript; predicted fates of transcripts and/or proteins; and detailed descriptions of disrupted splice sites (shown) and putative cryptic sites (not present in this example)...... 203

Figure 33: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least one spliced read when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 537900, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned nonzero FPKM when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least one spliced read, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 699470, P < 2.2×10⁻¹⁶)...... 204

Figure 34: Number of ACE-predicted novel isoforms (y-axis) across all Geuvadis samples estimated to meet or exceed a given TPM (Transcripts Per Million) threshold (x-axis), as estimated by Salmon (red), Kallisto (blue), and StringTie (green)...... 205

Figure 35: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least three spliced reads when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 762670, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned FPKM≥2 when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least three spliced reads, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 771330, P < 2.2×10⁻¹⁶)...... 206

Figure 36: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least three spliced reads when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 579440, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned FPKM≥2 when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least three spliced reads, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 716420, P < 2.2×10⁻¹⁶)...... 207

Figure 37: (A) Distribution of normalized read counts (normalized by total reads mapped to the locus) supporting novel splice junctions proximal to non-disrupted annotated sites. (B) Distribution of similarly normalized read counts supporting novel junctions proximal to disrupted annotated sites. The median is significantly greater than for the non-disrupted sites (Wilcoxon W = 222750000, P < 2.2×10⁻¹⁶) shown in panel A...... 208

Figure 38: (A) Scatterplot of cryptic splicing levels for cryptic sites near annotated splice sites that are disrupted (y-axis) or are not disrupted (x-axis). Each point corresponds to one annotated splice site observed to be disrupted in some individuals. Cryptic splicing levels were normalized by total reads mapped to the locus in the individual, then averaged across individuals in which the site was disrupted (y-axis) or not disrupted (x-axis). A Wilcoxon rank-sum test was applied to each point to remove nonsignificant results at an FDR (False Discovery Rate) threshold of 0.05. A majority of points (86%) lie above the y = x line. Median cryptic splicing levels were significantly higher in individuals in which the annotated splice site was disrupted (Wilcoxon signed rank test, V = 1135, P < 2.2×10⁻¹⁶). (B) Magnified view of the same scatterplot shown in panel A, showing 90% of original data points...... 209

Figure 39: (A) Distribution of proportions of transcripts with disrupted splicing among 1000 Genomes Project samples for which ACE identified at least one putative alternate splice form not predicted to entail loss of function (LOF) (maximum change in amino acid sequence was 10 amino acids). (B) Distribution of proportion of genes among 1000 Genomes Project samples identified by ACE as entailing LOF in one but not all annotated isoforms...... 210

Figure 40: (A) Distribution of log2 effect sizes of N=578 heterozygous NMD events after filtering to include only transcripts with mean FPKM≥1. (B) Distribution of log2 effect sizes of N=411 heterozygous NMD events after filtering to include only transcripts with mean FPKM≥2. (C) Distribution of log2 effect sizes of N=297 heterozygous NMD events after filtering to include only transcripts with mean FPKM≥3...... 211

Figure 41: (A) Betas (y-axis) from linear mixed-effects model with random intercepts, log2(FPKM) ~ Xb + Zu, converted to relative abundance ratios r0/2 = FPKM0 / FPKM2, where FPKMk denotes mean FPKM (fragments per kilobase of transcript per million reads mapped) among individuals predicted to have k functional alleles of a transcript. FPKM thresholds (x-axis) of 0.1 through 15 were used to pre-filter transcripts prior to fitting the model. Only transcripts expressed in at least 30 individuals were included in the analysis. The largest Beta (b = 0.37, SE = 0.01) corresponds to an r0/2 = .60. (B) Similar plot as in panel A, for mixed-effects model with both random intercepts and random slopes. Mean r0/2 = 0.49875, indicating a halving of transcript abundance, on average, in homozygous NMD targets...... 212

Figure 42: (A) Distribution of RVIS percentiles for all human genes having an RVIS score. (B) RVIS percentiles for randomly selected transcripts with similar coding length to the transcripts plotted in Figure 15B (Wilcoxon rank-sum comparison to all genes in panel A: W = 138110000, P = 0.72). (C) RVIS percentiles for randomly selected transcripts with similar total length to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 137890000, P = 0.75). (D) RVIS percentiles for randomly selected transcripts with matching numbers of exons to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 138200000, P = 0.68). (E) RVIS percentiles for randomly selected transcripts with similar G+C% to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 128960000, P = 0.99)...... 213

Figure 43: (A) Distribution of RVIS percentiles for homozygous loss-of-function genes in 1000 Genomes Project samples as predicted by ACE (Wilcoxon rank-sum comparison to RVIS percentiles for all genes in Figure 42A: W = 7378700, P < 2.2×10⁻¹⁶). (B) RVIS percentiles for heterozygous LOF genes (compared to RVIS percentiles for all genes: W = 53451000, P < 2.2×10⁻¹⁶). (C) Distribution of ncRVIS percentiles for homozygous LOF genes (compared to ncRVIS percentiles for all genes: W = 6032200, P = 5.5×10⁻¹¹). (D) ncRVIS percentiles for heterozygous LOF genes (compared to ncRVIS for all genes: W = 48724000, P < 2.2×10⁻¹⁶)...... 214

Figure 44: (A) Distribution of numbers of isoforms per gene for all of GENCODE version 19. (B) Distribution of numbers of isoforms per gene for the N = 67 genes depicted in Figure 43B for which RVIS percentile < .20...... 215

Figure 45: (A) Variant rs11278302 in Ensembl gene ENSG00000174177 results in deletion of an entire splice site; however, the resulting sequence retains a valid donor splice site consensus and the flanking sequence scores above threshold under a positional weight matrix; furthermore, more than 30 spliced reads are assigned by TopHat 2 to this splice site in each allele of a homozygous individual, indicating that splicing is retained at this site under the alternate allele. Ensembl VEP classifies the variant as having high impact, due to the apparent loss of a splice site. (B) Variant rs67712719 in Ensembl gene ENSG00000179588 introduces a frameshift which rs67322929 corrects, resulting in only two amino acid changes. These variants highly co-occur in 1000 Genomes Project individuals (97% of all 5008 haplotypes). However, Ensembl VEP classifies each individually as having high impact, as individually either would result in a frameshift...... 216

Figure 46: (A) Reference (hg19) and alternate alleles for Ensembl gene ENSG00000179588 in 1000 Genomes Project sample HG00096, haplotype 2, in the region of variants rs67712719, rs67322929, and rs67873604. (B) Interpreting variant rs67712719 alone would lead to a conclusion that the variant results in a frameshift and a large number of amino acid changes and protein truncation. (C) Similarly, interpretation of variant rs67322929 alone would lead to frameshift and a large change to the encoded protein. (D) Joint interpretation of all variants together reveals that changes are limited to four amino acids, two of which are deleted and two of which undergo substitution.

Figure 47: Distribution of lengths (in amino acids) of affected intervals between compensatory frameshifts in 1000 Genomes Project samples.

Figure 48: Distribution of simulated frameshift lengths in GENCODE protein-coding genes, assuming one frameshift per gene and uniform locations within coding segments; outliers above 3528 nt (top 1%) were omitted for display purposes only. Full data set: median=260, N=162716.

Figure 49: (A) Results of running an HMM gene finder on 19,000 broken genes. The gene finder was run on each gene, then a stop codon was inserted in a random location in the CDS without creating a splice site, and the gene finder was run on the modified sequence. In 11% of cases the gene finder predicted the same splice pattern on both the original sequence and the sequence modified to contain a premature stop. In 9% of cases, the gene finder predicted that no gene was present after the stop was inserted. In the remaining 80% of cases, the gene finder predicted a different splice pattern after the stop codon was inserted. (B) Relative position of inserted stop codon (relative to the spliced transcript) in cases in which the gene finder predicted the same splice pattern. There was a strong enrichment for stop codons inserted near the end of the coding segment, in which only the terminal portion of the protein would be affected, as well as a weaker enrichment near the beginning of the coding segment, in which the gene finder was able to find another start codon in the same reading frame that avoided the inserted stop codon. (C) Relative position of inserted stop codon in cases in which both the splice pattern did not change and the start codon was not changed.


Figure 50: Reproducibility of logistic regression training for content sensors. (A) Hexamer weights trained on 10,000 exon-intron pairs, versus weights for the same hexamers from an independent logistic regression applied to the same training cases. (B) Hexamer weights estimated from 20,000 training cases (x-axis) and weights for the same hexamers estimated from a subset of 10,000 training cases (y-axis).

Figure 51: (A) Effect of scaling factor rcontent/signal on AUC for SGRF applied to Thousand Genomes individual HG00096, using the human logistic content sensor. (B) AUC versus scaling factor for the SGRF using the minigene content sensor.

Figure 52: Predictive accuracy of logistic signal sensors versus positional weight matrices (PWMs). (A) AUC for classification of annotated donor splice sites versus decoy sites, using logistic signal sensor (red) and PWM (blue). (B) AUC for classification of annotated acceptor splice sites versus decoy sites, using logistic signal sensor (red) and PWM (blue).

Figure 53: Number of Gs in 4096 hexamers (y-axis) as a function of logistic weights; points are sorted along the x-axis by weight.

Figure 54: Example of a de novo splice site that appears to result in greater splicing activity than at the annotated splice site. Top: haplotype 1 of Thousand Genomes individual HG00118 shows evidence of splicing only at the annotated splice site. Bottom: variant rs202069778 in haplotype 2 of the same individual creates a new acceptor splice site that retains the original reading frame in the MAP4K1 gene, resulting in 8 amino acids being excluded from the encoded protein; TopHat2 aligns more spliced reads to this site than to the annotated site in this haplotype. This variant has a global MAF of 0.0002 in Thousand Genomes phase 3 samples, indicating it is possibly deleterious.


Acknowledgements

I wish to gratefully acknowledge Mark Yandell, Steven Salzberg, and Uwe Ohler for providing extensive and highly valuable mentorship during the years leading up to my graduate work at Duke, and my advisor Tim Reddy for his copious and expert guidance during my graduate studies. I also wish to thank the following individuals for their help and influence as collaborators and colleagues before or during my graduate work: Steve Finch, Mark Turner, Ian Korf, Mihaela Pertea, Brian Haas, Carson Holt, Michael Campbell, Karen Eilbeck, Song Li, Chris Vockley, Ian McDowell, Tony D’Ippolito, Graham Johnson, Sarah Cunningham, Nicky Lekprasert, Neel Mukherjee, Sayan Mukherjee, David Corcoran, Molly Megraw, Iulian Pruteanu-Malinici, Gurkan Yardimci, Justin Guinney, Stoyan Georgiev, Ayal Gussow, Dan Mace, Brad Moore, Elizabeth Rach, Andrew Allen, Raluca Gordân, Jonathan Allen, Jonathan Eisen, Mihai Pop, Mark DeLong, Greg Lamonte, Ashley Chi, Jason Stajich, Fred Dietrich, Dave MacAlpine, Jennifer Wortman, Art Delcher, Mani Subramanian, Tom Heiman, Gerry Perham, and Peter Li.


Chapter 1 – Outline

This dissertation introduces novel computational methods for the interpretation of genetic variants in the DNA encoding genes and the elements that regulate genes. As DNA sequencing costs continue to drop, the utility of sequencing for medical diagnosis and development of therapeutics will increase, but only if we are able to make sense of that sequencing data via appropriate analysis methods. The work described here aims to improve our ability to make use of sequencing technologies in medicine and the biological sciences, through computational means.

Chapter 2 reviews the relevant biological background on many aspects of genes and gene regulation, and introduces several classes of computational models that the remaining chapters will build upon.

Chapter 3 describes a method and software implementation for interpreting genetic variants that may interrupt gene structures. The method was applied to thousands of human genomes as well as several plant varieties. Results underscore the importance of interpreting genetic variants in combination, within their native genomic context.

Chapter 4 extends the method described in Chapter 3 by introducing a probabilistic model that enables the scoring and ranking of gene structures that may result from changes to genomic sequence. The model illustrates a novel direction in


gene-structure modeling in that it relies primarily on non-coding features, and it relaxes the traditional assumption that all copies of genes are well-formed and fully functional.

Chapter 5 describes a new experimental assay for determining allele-specific effects on gene regulation, and computational methods for interpreting the results of that assay. These methods have applicability in the functional fine-mapping of genetic variants previously found to be in association with a disease or other phenotype.

Chapter 6 describes a pattern recognition method for identifying drug-responsive enhancers based on their epigenetic signatures at single-nucleotide resolution. Analyses of predictions made genome-wide using this model confirm known biology of the glucocorticoid receptor, GR. The model has potential applications in the mechanistic interpretation of gene regulatory elements and the genetic variants occurring within them.

Chapter 7 summarizes the work and discusses possible future directions.


Chapter 2 – Background

Heritable traits are encoded chemically in the genome, a collection of chromosomes consisting of linear sequences of the nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T). Germline modifications to this four-letter DNA code can result in changes to phenotypes that are passed on to descendants, and thus have the potential to contribute to heritable diseases or to the evolution of new species. Such modifications are termed genetic variants, and can be discovered by comparing genome sequences between individuals in a population.

Many genetic variants have been found via genome-wide association studies (GWAS) to be associated with diseases or other phenotypes of interest. However, due to the phenomenon of linkage disequilibrium, which arises from the rarity of genetic recombination between variants residing near each other in the genome, combined with small population sizes relative to that recombination rate, variants that may be causal for a phenotype may be in strong association with other nearby variants that do not contribute causally to phenotypic outcomes. As such, genetic associations are confounded by non-causal variants that contribute to association signals. Fine-mapping of variants is the process of winnowing down the set of associated variants to those that are most likely to be causal (reviewed in Spain and Barrett, 2015). In the case of disease, knowing which variants are causal is enabling, as the variants can be used as diagnostic markers and potentially targeted by therapeutics (Vockley et al., 2017).


A popular approach to fine-mapping is to annotate functional genomic elements and then to interpret variants by predicting whether they are likely to disrupt the function of those elements. Another approach is to test each variant experimentally, as to its mechanistic effect on some aspect of molecular biology. The results of both of these approaches together inform the process of variant prioritization, in which variants deemed most likely to have a functional effect on cell biology are selected for follow-up validation in the context of the specific disease or organismal phenotype under study.

This dissertation deals primarily with the interpretation of genetic variants occurring within the contexts of two types of genomic elements: genes and transcriptional enhancers. In sections 2.1 and 2.2 I review what is currently known concerning the biology of these elements, experimental methods for assaying these elements structurally and functionally, the potential for disruption of these elements to affect phenotypes, and current methods for modeling these elements computationally.

The discussion will be specific to eukaryotes, which include all forms of life other than bacteria and archaea.

2.1 Gene structures

2.1.1 Gene structure and its impact on the interpretation of genetic variants

For the purposes of this work I define eukaryotic gene structure to be the combination of a splicing pattern for a gene together with (in the case of protein-coding genes) a translation reading frame, as described in the sections below.


2.1.1.1 Transcription and splicing

DNA influences organismal phenotypes via the expression of genes encoded in the DNA. Gene expression begins with the production of RNA molecules copied from a DNA template, a process termed transcription. Transcription in eukaryotes is carried out by RNA polymerase II (Pol II), a macromolecular complex consisting of many proteins. The polymerase assembles at the core promoter of a gene, located at the gene’s 5’ end, in preparation for transcription, which proceeds in the 5’-to-3’ direction on the sense strand. Actual transcription begins at the transcription start site (TSS), which is located a short distance 3’ of the core promoter. Transcription proceeds through the gene to the transcription end site (TES), at which point the polymerase dissociates from the DNA and is recycled for further transcription events. The result of transcription is an RNA molecule, called a transcript, that is complementary to the DNA template. The RNA transcript is built up progressively by appending ribonucleotides, one at a time, as dictated by the DNA template and base complementarity.

As the nascent RNA strand is produced and emerges from the polymerase, various enzymes associated with the C-terminal domain (CTD) of the polymerase process the emerging transcript. These modifications to the transcript are generally co-transcriptional, in that processing of early (more 5’) portions of the transcript is performed while the polymerase is still transcribing the further (more 3’) portions of the gene. The most upstream co-transcriptional processing is the 5’ capping of the RNA, which utilizes a unique guanine-guanine 5’-to-5’ bond at the 5’-most end of the transcript, to protect that end of the transcript from exonuclease activity that would quickly degrade the transcript. The 3’ end of the transcript is eventually polyadenylated by cleaving the transcript at a polyadenylation signal (typically AATAAA or ATTAAA in humans) near the TES and then appending several hundred As. Polyadenylation assists in the nuclear export of transcripts destined for the cytoplasm (where translation takes place; see section 2.1.1.2). The other major co-transcriptional activity is splicing.

In eukaryotes, RNA transcribed from genes is co-transcriptionally spliced to remove introns and retain exons (Figure 1). Whereas the unspliced sequence of a transcript is called the primary transcript or pre-mRNA, we call the spliced version the mature transcript. The mature transcript thus consists of a series of exons ligated together. Note, however, that the transcripts of a single gene can be spliced in different ways in different cell types or different conditions, resulting in alternative splicing.

Different splice forms of a gene can also co-exist within the same cell type or condition, as splicing decisions are often stochastic (Pickrell et al., 2010; Stepankiw et al., 2015).

Thus, nucleotides that are destined to be exonic in one primary transcript may be intronic in another transcript from the same gene.


Figure 1: Introns are spliced out of transcripts, leaving only the exons (Reproduced with permission from Majoros, 2007b).

An intron begins with a donor splice site and ends with an acceptor splice site (Figure 2A). In humans, most donor splice sites consist of the dinucleotide GT and most acceptor splice sites consist of the dinucleotide AG. A small minority (roughly 1%) of human donor sites have a GC or AT consensus, and a small minority (roughly 1%) of human acceptor sites have an AC consensus. Thus, most human introns begin with GT and end with AG. Note that the GT and AG are part of the intron, and are thus removed with the rest of the intron during splicing.

The donor splice site is recognized by the U1 small nuclear ribonucleoprotein (U1 snRNP) and the acceptor splice site is recognized by the U2 auxiliary factor (U2AF).

These factors recognize not only the dinucleotide consensus portions (i.e., GT for donor sites, AG for acceptor sites) of splice sites, but also have preferences for particular nucleotides flanking the consensus (Figure 2B). As such, the local sequence context


around a donor splice site influences the strength of that splice site, meaning that a GT with a favorable sequence context is more likely to be recognized by the splicing machinery than a GT with an unfavorable context. These preferred sequence contexts are somewhat organism-specific, so that models of splice sites need to be trained specifically for each species (Korf, 2004). Acceptor splice sites are typically preceded by a polypyrimidine tract that is enriched for pyrimidines (Cs and Ts), and a short distance 5’ of this is a location known as the branch point. The branch point in humans is typically an A, and is recognized and bound by the branch-binding protein (BBP).
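As an illustration of how sequence context can modulate splice site strength, the following minimal sketch scores a candidate donor site under a log-odds positional weight matrix. The frequencies are made-up placeholders rather than estimates from real human splice sites, and the window layout, the uniform background, and the name pwm_score are assumptions made for this example only.

```python
import math

# Hypothetical per-position nucleotide frequencies for a 6-nt window:
# 2 exonic positions, the GT consensus, then 2 intronic positions.
# These values are illustrative placeholders, not trained parameters.
DONOR_FREQS = [
    {"A": 0.60, "C": 0.15, "G": 0.15, "T": 0.10},
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.001, "C": 0.001, "G": 0.997, "T": 0.001},  # consensus G
    {"A": 0.001, "C": 0.001, "G": 0.001, "T": 0.997},  # consensus T
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]
BACKGROUND = 0.25  # uniform background frequency (an assumption)


def pwm_score(window, freqs=DONOR_FREQS):
    """Sum of per-position log2 odds of the window under the PWM."""
    if len(window) != len(freqs):
        raise ValueError("window length must match PWM length")
    return sum(math.log2(f[base] / BACKGROUND)
               for f, base in zip(freqs, window))


strong = pwm_score("AGGTAA")  # GT in a favorable context
weak = pwm_score("TCGTCT")    # the same GT in an unfavorable context
```

Both windows contain the GT consensus, yet the first receives a higher score than the second, which is the sense in which a splice site can be called stronger or weaker.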

Figure 2: (A) An intron and the major trans-factors involved in its recognition (B) Nucleotide preferences of positions flanking human donor and acceptor splice sites; letter height is proportional to frequency.

Once a donor site is recognized and an acceptor site is recognized 3’ of that donor site on a transcript, the intervening intronic sequence is removed via a multistep process. First, the U1 snRNP is joined and eventually replaced on the transcript by a complex consisting of other snRNPs (snRNPs U4, U5, and U6), and the U2AF and BBP are replaced on the transcript by the U2 snRNP. The U2 and U5/U6 snRNPs interact,

bringing the ends of the intron into mutual proximity, in preparation for the splicing reaction. These ribonucleoproteins form the spliceosome. At least some of the components just described are believed to be associated with the CTD of Pol II so that they are readily available to recognize their respective RNA target sequences immediately as those sequences emerge from the polymerase. Once the complete spliceosome is in place, the branch point nucleotide forms a bond with the first nucleotide of the donor site. This bond results in a three-way junction called a lariat (Figure 1). Next, the final nucleotide of the preceding exon (that was immediately 5’ of the donor site) joins the first nucleotide of the following exon. This results in the two exons being ligated into a single, continuous RNA strand. The intron, now in the form of a lariat structure, is released for degradation.

As noted above, the identities of the bases flanking the consensus dinucleotides for donor and acceptor splice sites provide for additional sequence specificity.

However, it has been shown that these preferences alone cannot explain splice-site selection by the spliceosome in humans and plants, as the proximal sequences at splice sites have insufficient information content to permit reliable discrimination of real sites from decoy sites (Lim and Burge, 2001). The information in a sequence model can be defined in terms of Shannon entropy (Shannon, 1948). The higher the entropy (uncertainty) at a position in a motif, the lower the information provided by that position in the motif. Splice site motifs generally have high information only in the 2 bp


consensus sequences. While flanking positions often have weak preferences for individual nucleotides, they tend to have relatively high entropy, and therefore low information content. As a result, many sites in the genome match a splice site motif model despite not being functional splice sites.
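The relationship between entropy and information can be made concrete with a small sketch. It assumes a uniform background over the four nucleotides, so that the information at a position is 2 bits minus the Shannon entropy of that position; the example frequencies are illustrative and are not taken from any trained splice site model.

```python
import math


def entropy_bits(freqs):
    """Shannon entropy (in bits) of one motif position."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def information_bits(freqs):
    """Information at a position, assuming a uniform 4-letter background."""
    return 2.0 - entropy_bits(freqs)


# Illustrative positions: an invariant consensus base, a weakly
# biased flanking position, and a completely uninformative position.
consensus = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
flank = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

info_consensus = information_bits(consensus)
info_flank = information_bits(flank)
info_uniform = information_bits(uniform)
```

Under these assumptions the invariant position carries the full 2 bits, the uniform position carries none, and the weakly biased flanking position carries less than a tenth of a bit.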

It is now known that additional signals are present in transcript sequences that influence splice site selection. These are believed to consist primarily of binding sites for RNA-binding proteins that regulate splicing. Heterogeneous nuclear ribonucleoproteins (hnRNPs) bind to both introns and exons, though they are believed to bind preferentially to intronic sequences, where they repress splicing at cryptic splice sites (Zarnack et al., 2013). There is also evidence that they may compact introns (Choi et al., 1986; Dreyfuss et al., 1993), and it is conjectured that this renders the splicing reaction more thermodynamically favorable for long introns. SR proteins (so named for their serine- and arginine-rich domains, or RS domains) also bind to both exons and introns, though they are widely believed to bind preferentially to exons. It has been demonstrated that they interact with spliceosomal components U1 and U2AF via their RS domains (Wu and Maniatis, 1993). As such, they are believed to actively recruit components of the spliceosome to the vicinity of splice sites. Some members of the SR protein family are also involved in export of mature transcripts from the nucleus (Shepard and Hertel, 2009).


Approximately 95% of human genes contain introns, and 95% of intron-bearing human genes can be spliced in multiple, distinct ways, resulting in alternative splicing (Pan et al., 2008). There are several forms of alternative splicing (Figure 3), including the use of alternative donor or acceptor splice sites, exon skipping, and intron retention. These are believed to be regulated by hnRNPs and SR proteins, via the differential binding of these splicing regulatory factors (SRFs) at specific locations in a primary transcript. As some hnRNPs and SR proteins are expressed in a cell-type specific manner, the presence or absence of these factors in cells is believed to enable cell-type specific splicing.

Figure 3: Major forms of alternative splicing. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Genetics (Luca Cartegni, Shern L. Chew, Adrian R. Krainer, Listening to silence and understanding nonsense: exonic mutations that affect splicing, copyright 2002).


The sequences recognized by SRFs are referred to as splicing regulatory elements (SREs), and have widely been represented using hexamers (Stadler et al., 2006; Ke et al., 2011b; Erkelenz et al., 2014; Rosenberg et al., 2015) or, less commonly, as octamers (Zhang and Chasin, 2004). SREs consist of four types: exonic splicing enhancers (ESEs) and exonic splicing silencers (ESSs) that occur within exons, and intronic splicing enhancers (ISEs) and intronic splicing silencers (ISSs) that occur within introns. Splicing enhancers are believed to promote inclusion of sequences into mature transcripts. Splicing silencers are believed to promote inclusion of sequences into introns, so that they are spliced out. In particular, splicing silencers are believed to promote exon skipping when they occur within sequences that in other conditions or other cell types are normally interpreted as exons to be included in the mature transcript. While it has traditionally been assumed that SR proteins promote exon inclusion in the mature transcript and that hnRNPs promote exon skipping, that view has been challenged, and evidence is mounting that splicing decisions are influenced by a complex combination of factors (reviewed in Pandit et al., 2013). Recent computational models of cell-type specific splicing have thus used thousands of sequence features and sophisticated machine-learning methods (Xiong et al., 2015), while other recent work has emphasized the use of simpler methods based on additive hexamer models (Rosenberg et al., 2015).
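To give a sense of what an additive hexamer model looks like in its simplest form, the sketch below sums a weight over every overlapping hexamer in a sequence. This is only a generic illustration of additivity and is not the model of Rosenberg et al. (2015); the two hexamers and their weights are hypothetical values chosen for the example, with positive weights standing in for enhancer-like and negative weights for silencer-like elements.

```python
def hexamers(seq, k=6):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def additive_score(seq, weights):
    """Sum of weights over all hexamers; unlisted hexamers contribute 0."""
    return sum(weights.get(h, 0.0) for h in hexamers(seq))


# Hypothetical weights, for illustration only.
EXAMPLE_WEIGHTS = {"GAAGAA": 1.2, "TTAGGG": -0.9}

enhancer_like = additive_score("GAAGAAGAA", EXAMPLE_WEIGHTS)  # two matches
silencer_like = additive_score("CTTAGGGC", EXAMPLE_WEIGHTS)   # one match
```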

It is important to note that, with few exceptions (Buckley et al., 2014), all spliceosomal RNA splicing in mammals occurs inside the nucleus. Translation (section


2.1.1.2), in contrast, occurs in the cytoplasm, after the transcript has been fully spliced and exported from the nucleus. As such, splicing decisions made by the eukaryotic nuclear spliceosome are complete before translation begins. Direct molecular interactions between the eukaryotic spliceosome and the cytoplasmic ribosome that performs translation have not been convincingly demonstrated, and such interactions are highly unlikely during normal operation of the cell, given that they are separated by the nuclear membrane.

2.1.1.2 Translation

Once a gene has been transcribed into an RNA transcript and the transcript has been capped, spliced, and polyadenylated, it can be exported from the nucleus for translation into a protein. Transcripts that function in ways that do not involve translation into a protein are termed noncoding transcripts. Long noncoding transcripts are denoted lncRNAs (long noncoding RNAs). A number of shorter noncoding RNAs are also known, such as transfer RNAs (tRNAs), and microRNAs (miRNAs) that participate in post-transcriptional regulation of other RNAs via sequence-specific binding. Transcripts that are destined for translation into protein are termed messenger

RNAs (mRNAs).

Translation of mRNAs into protein is performed by the ribosome, a ribonucleoprotein complex that persists stably in the cytoplasm. The ribosome initially associates with the 5’ cap of an mRNA and then travels 5’-to-3’ along the mRNA,


synthesizing a protein as it goes. Just as the DNA template indicates precisely which ribonucleotides are to be appended to the growing RNA transcript during transcription, the mRNA dictates which amino acids are to be appended to the growing polypeptide (protein). The amino acids are encoded by non-overlapping nucleotide triples, called codons. This code is read out by tRNAs that are loaded into the ribosome to read the mRNA. When a stop codon is reached (TGA, TAA, or TAG in humans), translation is terminated and the polypeptide is released. Functional polypeptides will then fold into a protein structure. Aberrant proteins will typically fail to fold properly, and will be degraded by the proteasome.

While a ribosome initially associates with the 5’ cap of an mRNA, translation does not begin at that location on the transcript. Ribosomes are believed to scan the mRNA, 5’-to-3’, searching for an acceptable start codon at which to begin translating.

The vast majority of annotated human start codons are ATG, which codes for the amino acid methionine. However, not all ATGs are recognized as start codons. Just as flanking nucleotides influence splice site selection by the spliceosome (section 2.1.1.1), nucleotides flanking an ATG can influence whether (or how often) the ribosome begins translating at that codon. In humans and other eukaryotes, the accompanying sequence that tags functional start codons is known as a Kozak sequence (Kozak, 1990).

The ribosome scanning model (Cigan et al., 1988) thus proposes that the ribosome begins scanning at the 5’ cap of the mRNA, searching for the first (most 5’) ATG that is


accompanied by a strong Kozak sequence, and begins translating at that codon. Use of this model has been shown to improve bioinformatic identification of start codons in mRNA sequences (Agarwal and Bafna, 1998). The selected start codon dictates the translation reading frame. Because translation occurs via non-overlapping codons in sequence, there are three possible reading frames on the sense strand of any given transcript. That is, for any given nucleotide in a coding segment (an interval that is translated by the ribosome), that nucleotide may be in the first position of a codon (called phase 0), the second position of a codon (phase 1), or the third position of a codon (phase 2). The start codon defines the reading frame, since the first (5’-most) nucleotide of the start codon is by definition in phase 0, and the phase of any subsequent (more 3’) nucleotide is determined via mod 3 arithmetic with respect to the start codon (Figure 4).
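The scanning model and the mod 3 definition of phase can be sketched together as follows. The test for a strong context is a deliberate simplification, assuming that only a purine three bases upstream of the ATG and a G immediately after it matter; the function names and the example sequence are invented for illustration.

```python
def find_start_codon(mrna):
    """Scan 5'-to-3' for the first ATG in a strong (simplified) context.

    Assumed context rule: purine at position -3 and G at position +4,
    counting the A of ATG as +1. ATGs too close to either end to be
    evaluated under this rule are skipped. Returns an index or None.
    """
    for i in range(3, len(mrna) - 3):
        if (mrna[i:i + 3] == "ATG"
                and mrna[i - 3] in "AG"
                and mrna[i + 3] == "G"):
            return i
    return None


def phase(position, start):
    """Codon phase (0, 1, or 2) of a nucleotide at or 3' of the start."""
    return (position - start) % 3


# Invented example: the first ATG (index 5) has a weak context and is
# passed over; the second ATG (index 16) has a strong context.
mrna = "TTTCCATGCCGCCACCATGGCTTAA"
start = find_start_codon(mrna)
phases = [phase(p, start) for p in range(start, start + 6)]
```

Here the scan selects the second ATG, and the nucleotides from that point onward cycle through phases 0, 1, 2, 0, 1, 2.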

Translating an RNA sequence in different reading frames can produce radically different amino acid sequences. For many human genes, reading frames other than the ones used by the ribosome frequently terminate in a premature stop codon, i.e., a stop codon that is 5’ of the normal stop. Note, however, that only a stop codon encountered in frame (such that the first nucleotide of the stop codon is in phase 0) will terminate the reading frame.

An open reading frame, or ORF, is thus defined as a contiguous interval on a chromosome that does not contain an in-frame stop codon, except as the last codon.
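The distinction between in-frame and out-of-frame stop codons is easy to demonstrate in code. The sketch below walks a sequence codon by codon from a chosen start position and reports where the reading frame closes; the function name and the toy sequence are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_end(seq, start):
    """Index just past the first in-frame stop codon, or None.

    Codons are read in steps of 3 from `start`, so stop triplets in
    the other two frames are never examined.
    """
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


# Toy sequence: TGA occurs at index 4 (out of frame relative to the
# ATG at index 0) and TAA occurs at index 9 (in frame).
seq = "ATGCTGACCTAAGG"
end_in_frame = orf_end(seq, 0)   # reads ATG CTG ACC TAA
end_shifted = orf_end(seq, 1)    # reads TGC TGA in the shifted frame
```

Reading from the ATG, the out-of-frame TGA is ignored and the ORF closes at the TAA, whereas shifting the frame by one nucleotide causes the same TGA to terminate the frame after only two codons.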


Figure 4: Three-periodic phase in coding segments. (Reproduced with permission from Majoros, 2007b).

Splicing and translation are separate processes, but the results of splicing can influence which parts of the mRNA are translated. Furthermore, splicing and translation together impose a complex structure on the mRNA in which multiple signals—splicing signals (splice sites and splicing regulatory elements) and coding signals (codons)—coexist in the same sequence (Figure 5). Because introns will have already been removed by the time the mRNA is translated, start and stop codons in the genome can be interrupted by introns.

Figure 5: Gene structure combines a splicing pattern with a coding segment. (Reproduced with permission from Majoros, 2007b)

2.1.1.3 Assaying the results of splicing and translation

Gene structures can be assayed experimentally. At the time of this writing, splicing patterns are commonly assayed via RNA sequencing, or RNA-seq. In the first step of RNA-seq, RNA molecules are extracted from cells and used as templates to form a complementary DNA strand, called a cDNA. The RNA and DNA are then denatured, the RNA is degraded, and the cDNA is made double-stranded by adding primers, DNA

ligase, and nucleotides. Once double-stranded DNA is fully formed, it can be sequenced using standard sequencing methods. From the resulting DNA sequences we can infer the sequences of the original RNA molecules by substituting U for T.

Once the RNA sequences are available, they can be mapped to the genes that produced them, and precise alignments can be deduced that indicate, for every nucleotide in an RNA sequence, which DNA nucleotide in the genome served as a template for that RNA residue. Alignments also indicate the locations of indels (insertions and deletions, also called gaps). Mapping and alignment of large numbers of short reads to a genome is popularly done via aligners based on the Burrows-Wheeler transform (Langmead et al., 2009; Li and Durbin, 2009), or via the use of suffix arrays (Dobin et al., 2013). Special methods are needed to map spliced reads (those spanning an intron) to a reference genome, since such reads omit the intervening intronic sequence that is present in the reference sequence. A popular solution is to identify candidate splice junctions in the reference sequence, construct synthetic spliced sequences representing those junctions, and then align reads to that library using standard alignment methods (Trapnell et al., 2009).
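The idea of a synthetic junction library can be sketched in a few lines. This is a simplified illustration rather than the procedure implemented by any particular aligner: candidate introns are given as half-open coordinate pairs, exact substring matching stands in for read alignment, and the function name, flank length, and toy genome are all assumptions.

```python
def junction_library(genome, introns, flank):
    """Build one synthetic spliced sequence per candidate intron.

    `introns` holds (start, end) pairs: `start` is the index of the
    first intronic base and `end` is the index of the first base of
    the downstream exon. Each synthetic sequence joins `flank` bases
    of upstream exon directly to `flank` bases of downstream exon.
    """
    library = {}
    for start, end in introns:
        upstream = genome[max(0, start - flank):start]
        downstream = genome[end:end + flank]
        library[(start, end)] = upstream + downstream
    return library


# Toy genome with a GT...AG intron occupying indices 6 through 14.
genome = "AAACCCGTTTTTTAGGGGTTT"
library = junction_library(genome, [(6, 15)], flank=3)

# A spliced read that has no contiguous match in the genome but
# matches the synthetic junction sequence.
read = "CCGG"
hits = [j for j, s in library.items() if read in s]
```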

Inferring gene structures from sequencing data can, in theory, be done by simply aligning the sequences of spliced transcripts to a reference genome and identifying the gaps in the alignment corresponding to introns. In practice, this approach is made more complicated by technical hurdles. Currently, RNA-seq is most commonly performed


using short-read sequencing. Because read lengths are currently shorter than most eukaryotic transcripts, RNA-seq outputs do not unambiguously indicate whole gene structures. However, by combining multiple reads one can infer gene structures. This process is known as transcript assembly.

Transcript assembly can be performed either de novo, without knowledge of a reference genome, or it can be reference-based, in which a reference genome sequence is utilized. Reference-based assembly is typically more accurate than de novo assembly, which is a harder problem due to the lack of a reference genome (Hayer et al., 2015).

Reference-based transcript assembly typically utilizes a splice graph. Formally, a graph G = (V, E) is a set of vertices V and a set of edges E = {(v, w)}, each of which is a pair of vertices in V. In a splice graph, vertices represent either individual splice sites or whole exons, depending on the type of splice graph being used.

Figure 6: One type of splice graph. Vertices represent splice sites, and edges represent introns and exons.

In one common type of splice graph, vertices represent putative splice sites and edges represent exons and introns (Figure 6). Vertices can be assigned coordinates relative to the reference genome, and edges can be assigned genomic intervals. A single path through a splice graph for a gene thus outlines precisely a splice pattern for


transcripts produced from that gene, and indicates coordinates of putative exons, introns, and splice sites. Spliced reads aligned to the reference genome can also be mapped to the splice graph. Thus, reads mapping to an exon edge serve as support for that edge, and spliced reads that exactly map to splice junctions serve as support for the corresponding intron edges in the splice graph. Algorithms have been devised that utilize read counts to score whole paths through a splice graph, in order to identify whole gene structures having the most experimental support (e.g., Trapnell et al., 2010; Bernard et al., 2014; Pertea et al., 2015). In this way, short reads are used to make inferences about whole-gene structures, despite the fact that no single read spans the entire gene.
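The idea of scoring whole paths through a splice graph can be illustrated with a minimal sketch. It assumes that vertices are splice-site coordinates, that every edge runs from a smaller to a larger coordinate (so the graph is acyclic), and that a path's score is simply the sum of the read counts on its edges. That additive score and the toy coordinates are assumptions made for illustration; the cited assemblers use more elaborate scoring.

```python
from collections import defaultdict

def best_path(edges, source, sink):
    """Return (score, path) of the highest-scoring source->sink path in a DAG.

    edges: list of (u, v, score) with u < v, so that sorting vertices by
    coordinate yields a topological order.
    """
    out = defaultdict(list)
    vertices = {source, sink}
    for u, v, s in edges:
        out[u].append((v, s))
        vertices.update((u, v))
    best = {source: (0.0, None)}           # vertex -> (score, predecessor)
    for u in sorted(vertices):
        if u not in best:
            continue                        # not reachable from the source
        for v, s in out[u]:
            cand = best[u][0] + s
            if v not in best or cand > best[v][0]:
                best[v] = (cand, u)
    if sink not in best:
        return None
    path, v = [], sink
    while v is not None:                    # trace predecessors back to source
        path.append(v)
        v = best[v][1]
    return best[sink][0], path[::-1]

# Toy graph (hypothetical): edge scores are supporting read counts.
edges = [
    (100, 200, 30),  # first exon
    (200, 300, 12),  # intron supported by 12 spliced reads
    (200, 500, 3),   # alternative intron that skips the middle exon
    (300, 400, 25),  # middle exon
    (400, 500, 10),  # intron
    (500, 600, 28),  # last exon
]
score, path = best_path(edges, 100, 600)
print(score, path)
```

Because the vertices are processed in coordinate order, each edge is examined once, so the search is linear in the size of the graph.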

In previous work I developed a novel transcript assembly method, called RSVP (Majoros et al., 2014), which utilizes a modified splice graph called an ORF graph (Figure 7). In an ORF graph, the underlying splice graph is augmented with information regarding open reading frames. In particular, each edge and each vertex are annotated with scores. A traditional generalized hidden Markov model (GHMM) (section 2.1.2.2) is used to assign initial values to these scores (Majoros et al., 2004). RSVP then re-weights the graph by forming a linear combination between the initial GHMM-based score and a function of RNA-seq read support for each feature.


Figure 7: (A) A specialized type of splice graph, called an ORF graph. Vertices represent splice sites and start/stop codons. Edges represent introns and exons. Edges and vertices are annotated with scores (not shown) computed by a gene prediction model. Each path corresponds to a gene prediction. The graph can be re-weighted using external evidence such as RNA-seq data. The highest-scoring path is shown in red. Phase information constrains the prediction, as coding segments are assumed to be conserved and well-formed. (B) An unconstrained ORF graph for an Arabidopsis thaliana gene. (C) The same ORF graph as in panel B, after filtering out intron edges not supported by spliced RNA-seq reads.

The highest-scoring path through the re-weighted graph then represents a gene structure prediction that integrates experimental evidence regarding splicing patterns (RNA-seq) with information about known codon biases in the organism under study. A strength of the method is that other approaches tend to erroneously predict introns inside exons whenever read depth drops to zero (a common problem for lowly expressed genes), whereas RSVP can utilize the genomic sequence and known codon biases to score regions of exons for which no reads were captured. A shortcoming of this method is that it is specific to protein-coding genes. Another shortcoming is that, like all standard gene-structure predictors, it assumes that reading frames are preserved, an assumption that fails in the case of genetic variants that alter reading frames. In Chapters 3 and 4 I introduce novel methods for gene structure prediction that relax this assumption.

2.1.1.4 Interpreting genetic variants within the context of a fixed gene structure

Given a gene with a known set of transcript structures, genetic variants within the gene can be interpreted as to their possible effects on the function of the gene.

Traditionally, the greatest emphasis has been placed on those variants that occur within the coding segments of protein-coding genes. A variant may alter a codon so as to encode a different amino acid. While some changes to amino acids can alter protein structure and/or function, others have little detectable effect. Those amino acids situated within a functional protein domain may be more likely to alter function when they are changed than those not in a functional domain. Because the genetic code is degenerate, many codons can be changed to another codon encoding the same amino acid. Such a change is termed a synonymous substitution. Synonymous codons often differ by a single


nucleotide, typically in the third codon position. Other variants that do change the identity of the encoded amino acid may constitute conservative substitutions if the two amino acids have similar physicochemical properties. Such properties are used by popular variant interpretation tools (e.g., Adzhubei et al., 2010). Importantly, it is necessary to know the gene structure of a gene in order to identify the reading frame and thus the codon position or positions affected by a variant. Only then can one predict the effect on the encoded amino acid sequence.
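The dependence of variant interpretation on the reading frame can be made concrete with a small sketch. It assumes the standard genetic code and a coding sequence that begins at the start codon in frame 0. The function name `classify_snv` and the example sequence are hypothetical, and the sketch ignores splicing, stop-loss variants, and indels.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def classify_snv(cds, pos, alt):
    """Classify a single-nucleotide variant within a coding sequence.

    cds: coding sequence starting at the start codon (frame 0);
    pos: 0-based position of the variant in cds; alt: alternate base.
    """
    offset = pos % 3                       # codon position affected (0, 1, or 2)
    start = pos - offset
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"                  # creates a premature stop codon
    return "missense"

cds = "ATGGCTTGGTAA"                       # codons: ATG GCT TGG TAA
print(classify_snv(cds, 5, "C"))           # GCT -> GCC, third codon position
print(classify_snv(cds, 3, "A"))           # GCT -> ACT
print(classify_snv(cds, 8, "A"))           # TGG -> TGA
```

If the assumed frame were shifted by one nucleotide, the same three variants would fall in different codons and could receive different classifications, which is why the gene structure must be known before the effect on the amino acid sequence can be predicted.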

Another commonly-used indicator of whether an amino acid substitution is likely to be deleterious is evolutionary conservation. Various methods have been devised to quantify the apparent degree of evolutionary constraint at each nucleotide location in the genome, by observing patterns of substitution across multiple species (e.g., Siepel et al., 2005), and these are commonly used to infer that, e.g., variants in constrained positions are more likely to be deleterious than those in unconstrained positions. Evolutionary conservation can also be measured between related proteins within one species, by aligning the proteins and observing substitution patterns. This is commonly used in interpreting amino acid substitutions (Kumar et al., 2009).

Even variants within a gene but outside of the coding segment can be potentially deleterious. Regulatory elements such as transcriptional enhancers (section 2.2) can occur within introns, and variants that alter those enhancers can have phenotypic effects (section 2.2.5). Exonic regions outside of coding segments, termed untranslated regions (UTRs), can also be sensitive to genetic variants, as these regions can also contain regulatory elements such as enhancers or miRNA target sites (Majoros and Ohler, 2007).

While existing variant prioritization tools utilize some of the foregoing in interpreting likely variant effects on genes, they generally assume that gene structure annotations are both correct and fixed. It has been noted that the use of different sets of annotations can have a substantial effect on variant interpretation (Frankish et al., 2015).

Furthermore, as discussed next, gene structures can themselves change as a result of genetic variants, and those changes can alter how other variants are best interpreted. In this way, genetic variants can have non-independent effects, as addressed in Chapter 3.

2.1.1.5 Genetic variants can alter splicing

Genetic variants that disrupt functional splice sites by changing their dinucleotide consensus to a sequence not recognized by the spliceosome have the potential to be deleterious (Buratti et al., 2007). Loss of a functional splice site can lead to a number of different outcomes, including exon skipping, intron retention, or the use of a different splice site. Depending on the specific context, the magnitude of the effect of these outcomes can differ, as described below.

In the case of exon skipping, in which a whole exon is omitted from the spliced transcript, the effect can depend on the exon length, and whether the exon encodes a polypeptide segment that is critical for protein function. The mammalian DMD gene encodes dystrophin, a substrate necessary for muscular function, and its disruption can


lead to Duchenne muscular dystrophy (reviewed in Roberts et al., 1994). However, many of the exons in the central portion of the gene can be safely skipped and still produce a functional dystrophin homologue (England et al., 1990). Because these exons have lengths that are divisible by 3, skipping these exons does not alter the reading frame for downstream (more 3’) exons. In contrast, skipping a coding exon having length not divisible by 3 will result in a frameshift, which if not corrected by compensatory changes downstream can lead to loss of protein function (Juan-Mateu et al., 2013).

Intron retention involves the inclusion of whole introns in the mature transcript. If translated, a retained intron will introduce new amino acids into the encoded protein, and can also alter the translation reading frame by introducing a premature stop codon, or by introducing a new start codon. A retained intron could also potentially introduce secondary structures that may impede translation (Doma and Parker, 2006). Because human introns can be very long, intron retention in humans typically results in premature termination of the reading frame and loss of function (Jung et al., 2015). Intron retention is more commonly observed in plants, which have shorter introns. Indeed, many plant genes utilize intron retention in regulated alternative splicing to produce different, functional transcripts (Kalyna et al., 2012).

In addition to exon skipping and intron retention, another possible outcome of splice site disruption is the use of a different splice site close to the disrupted site. In some cases a gene may have multiple transcripts, or splice isoforms, that differ only in that each uses a different splice site at one end of an exon, and if those splice sites are sufficiently close, disruption of one site might cause the spliceosome to select the other site (Královicová and Vořechovský, 2007). In this case, if the two transcripts have similar or identical functions, disruption of a splice site may have little or no functional consequence. In other cases, the spliceosome might select another splice site that is otherwise never, or only very rarely, used. Such a splice site is called a cryptic site, and its use is called cryptic splicing.

Use of a cryptic site may have a range of consequences, depending on its location. A cryptic site that would otherwise be intronic will result in lengthening of the proximal exon, and if it is a coding exon, this will result in additional nucleotides being translated. A cryptic site chosen from within an exon will shorten the exon. When the number of nucleotides added or removed from an exon is divisible by three, any translational reading frames will be retained, and the result will be a local modification to the encoded amino acid sequence. Whether such a local modification is likely to be deleterious depends on the context within the original protein, and could potentially be assessed by observing local evolutionary conservation profiles as described above.

When the number of nucleotides added or removed is not divisible by three, the effect on a coding segment will be to shift the reading frame, which is often disruptive to protein function.
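The divisibility-by-three reasoning used above for exon skipping and cryptic splicing can be written as a short check. This is a sketch only: `frame_effect` is a hypothetical helper, and the example lengths are invented for illustration.

```python
def frame_effect(delta_nt):
    """Classify a net change in coding-segment length, in nucleotides.

    delta_nt is negative when nucleotides are removed (e.g., a skipped exon
    or a shortened exon) and positive when they are added (e.g., a cryptic
    site that lengthens an exon).
    """
    if delta_nt == 0:
        return "unchanged"
    if delta_nt % 3 == 0:
        return "in-frame"        # whole codons gained or lost; frame preserved
    return "frameshift"          # downstream codons are read in a new frame

# Hypothetical examples:
print(frame_effect(-123))  # skipping a 123-nt coding exon removes 41 codons
print(frame_effect(+4))    # cryptic site lengthening a coding exon by 4 nt
print(frame_effect(-7))    # cryptic site shortening a coding exon by 7 nt
```

An in-frame result does not imply that the change is benign; as noted in the text, that depends on the protein context of the affected residues.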


While genetic variants that disrupt a splice site have a clear potential to alter splice patterns, variants that do not directly disrupt splice sites can still alter splicing outcomes. Variants that create a new splice site, termed a de novo splice site, can alter splicing if the new splice site is recognized by the spliceosome. Recalling that the spliceosome has preferences for specific flanking nucleotides in addition to the dinucleotide consensus sequences of donor and acceptor splice sites, variants that alter these flanking bases could potentially alter the strength of a splice site, resulting in either the strengthening of a cryptic site or the weakening of a canonical site. In addition, it has become increasingly clear in recent years that variants that alter splicing regulatory elements (SREs—section 2.1.1.1) can impact splicing outcomes, which can in turn lead to disease (Di Giacomo et al., 2013).

2.1.1.6 Genetic variants can alter translation reading frames

As alluded to earlier, changes to coding segments that have lengths not divisible by 3 alter the translation reading frame. Changing the reading frame changes the sequence of codons and thus the sequence of encoded amino acids. Unless subsequent changes restore the original reading frame downstream (section 2.1.1.2), such frameshifts affect the complete terminal portion of the encoded protein. Frameshifts that are not near the end of the encoded protein can therefore be expected to be overwhelmingly disruptive to protein function. However, even frameshifts near the end of a coding segment can conceivably result in loss of function if the remaining portion of the altered reading frame contains no in-frame stop codon, because lack of a stop codon can be detected by the cell and can result in nonstop decay and transcript degradation (Vasudevan et al., 2002; reviewed by Klauer and van Hoof, 2012).

Frameshifts can come about in a number of ways. Splicing changes are one common source of frameshifts. Genetic variants that insert or delete nucleotides (indels) are another source. A variant that creates a new start codon upstream from (5’ of) the canonical start codon can result in a frameshift if the change to the coding segment has length not divisible by three; an example is illustrated in Chapter 3 (Figure 17B). Similarly, if a variant disrupts a canonical start codon, the ribosome scanning model predicts that the ribosome will select a downstream (more 3’) start codon if possible, and that may again result in a frameshift. Use of a different start codon, even if it does not cause a frameshift, can be potentially disruptive, as this will add amino acids to, or remove them from, the encoded protein. Similarly, a variant that disrupts a canonical stop codon can result in nonstop decay as described above, or if another stop codon exists in-frame downstream of the original but now disrupted stop codon, additional amino acids may be appended to the encoded protein.

A common result of frameshifts is the introduction of a premature stop codon, as described earlier. A premature stop codon can also be created directly by a single genetic variant. Premature stop codons (also called premature termination codons, or PTCs) can trigger nonsense-mediated decay (NMD), resulting in degradation of the transcript and loss of protein production (reviewed in Lykke-Andersen and Jensen, 2015). An established heuristic for predicting NMD in mammals and plants is the 50 bp rule (Nagy and Maquat, 1998; Nyiko et al., 2013): if the distance (measured in nucleotides on the spliced transcript) of the PTC to the most 3’ exon junction is less than 50 bp, protein truncation is predicted. Otherwise, NMD is predicted. Recent analyses based on massively parallel splicing reporter assays found strong support for the 50 bp rule in human cells (Rosenberg et al., 2015).
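The 50 bp rule can be sketched as a short function. Several details here are assumptions made for illustration rather than part of the rule as stated: the distance is measured from the first base of the PTC to the first base of the last exon, a PTC lying downstream of the last exon junction is treated as predicting truncation, and a single-exon transcript (which has no junction) is likewise treated as truncation. The function name and example coordinates are hypothetical.

```python
def predict_ptc_outcome(exon_lengths, ptc_pos, threshold=50):
    """Apply the 50 bp heuristic to a premature termination codon (PTC).

    exon_lengths: exon lengths in 5'-to-3' order on the spliced transcript;
    ptc_pos: 0-based transcript coordinate of the first base of the PTC.
    Returns "NMD" or "truncation".
    """
    if len(exon_lengths) < 2:
        return "truncation"                    # assumption: no junction, no NMD
    last_junction = sum(exon_lengths[:-1])     # coordinate where last exon begins
    distance = last_junction - ptc_pos         # positive if PTC is upstream
    return "NMD" if distance >= threshold else "truncation"

exons = [200, 150, 300]                        # last junction at coordinate 350
print(predict_ptc_outcome(exons, 100))         # 250 nt upstream of the junction
print(predict_ptc_outcome(exons, 320))         # 30 nt upstream of the junction
print(predict_ptc_outcome(exons, 400))         # inside the last exon
```

Note that the outcome depends on the exon structure, so a variant that changes splicing can also change the predicted fate of a PTC elsewhere in the same transcript.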

2.1.2 Traditional approaches to gene structure modeling

Work on the development of methods for gene-structure prediction (also called gene-finding) has a long history. Early methods in the 1980s predicted individual exons and ORFs in isolation (Staden and McLachlan, 1982; Fickett, 1982). In the early 1990s, methods for chaining together exons and introns using dynamic programming were developed (Snyder and Stormo, 1993; Stormo and Haussler, 1994). In the late 1990s, shortly before the initial draft sequencing of the human genome, highly accurate and efficient methods for predicting whole-gene structures in long genomic sequences were developed, based on hidden Markov models, and by the early 2000s most of the gene-finding field had converged on the use of generalized hidden Markov models (GHMMs) (Kulp et al., 1996), also known as semi-Markov models (Burge and Karlin, 1997). In the late 2000s attention shifted from the use of GHMMs, which are generative models, to the use of conditional random fields (CRFs) (Vinson et al., 2007; Bernal et al., 2007; Gross et al., 2007), which are discriminative models. In the following sections I review these methods.

2.1.2.1 Hidden Markov models

Hidden Markov models (HMMs) have been applied to various problems in computational linguistics, most notably speech recognition (Jelinek, 1998) and part-of-speech tagging (Majoros et al., 2002), as well as a number of problems in genomics (Durbin et al., 1998). As the stochastic, generative versions of finite automata (Hopcroft and Ullman, 1979), they are effectively grammar models and thus applicable to the problem of parsing finite-length sequences of discrete symbols from a finite alphabet such as DNA or RNA. Their probabilistic nature also allows them to be utilized for classification, by assigning probabilities to sequences.

Formally, an HMM is a machine $M = (Q, A, P_t, P_e)$ having a finite set of states $Q = \{q_0, q_1, \ldots, q_{n-1}\}$ for $n = |Q|$, a finite emission alphabet $A$, a transition probability distribution $P_t(q_i \mid q_j)$ for $q_i, q_j \in Q$, and an emission distribution $P_e(x \mid y)$ for $x \in A$ and $y \in Q$. I adopt the convention that M begins and ends in state 0, denoted $q_0$, which is a silent state. At discrete time points the machine transitions from the current state $y_i$ to a next state $y_j$ stochastically, according to $P_t(y_j \mid y_i)$. For every non-silent state $q_i$, $i \neq 0$, upon entering $q_i$ the machine emits a random symbol x chosen according to $P_e(x \mid q_i)$. Upon re-entering $q_0$ the machine terminates and the emitted sequence is complete.


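Efficient algorithms for performing inference with HMMs have been widely disseminated (Rabiner, 1989; Durbin et al., 1998). Given a non-empty sequence $S = x_0 x_1 \ldots x_{L-1}$, the Viterbi algorithm computes the most probable state path, $\phi^* = (y_0, y_1, \ldots, y_{L-1})$ for $y_i \in Q$, by which the machine could have emitted S:

$$
\begin{aligned}
\phi^* &= \operatorname*{argmax}_{\phi} P(\phi \mid S) \\
&= \operatorname*{argmax}_{\phi} \frac{P(\phi, S)}{P(S)} \\
&= \operatorname*{argmax}_{\phi} P(\phi, S) \\
&= \operatorname*{argmax}_{\phi} P(S \mid \phi)\, P(\phi) \\
&= \operatorname*{argmax}_{\phi} P_t(y_L \mid y_{L-1}) \prod_{i=0}^{L-1} P_e(x_i \mid y_i)\, P_t(y_i \mid y_{i-1})
\end{aligned}
$$

where the path is implicitly flanked by $q_0$, the silent start/stop state (that is, $y_{-1} = y_L = q_0$), and $P(S) > 0$ is constant over the argmax.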

This optimization, referred to as Viterbi decoding, can be performed efficiently using dynamic programming (Viterbi, 1967). The dynamic programming recursion is given by:

$$
V(i, k) =
\begin{cases}
\max_{j} V(j, k-1)\, P_t(q_i \mid q_j)\, P_e(x_k \mid q_i) & \text{if } k > 0 \\
P_t(q_i \mid q_0)\, P_e(x_0 \mid q_i) & \text{if } k = 0
\end{cases}
$$

for $0 < i, j < |Q|$ and $0 \le k < L$. Once the recursion is complete, the final step is to choose the state $q_i$ such that $V(i, L-1)\, P_t(q_0 \mid q_i)$ is maximal among $0 < i < |Q|$. The optimal path $\phi^*$ can then be reconstructed by tracing back from cell $V(i, L-1)$ according to the j selected at each step in the first line of the recursion. Viterbi decoding takes time $O(N^2 L)$, for number of states $N = |Q|$ and sequence length $L = |S|$. For a model with a fixed number of states, this is linear in the sequence length and thus very efficient.
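The Viterbi recursion and traceback can be sketched as follows. The sketch works in log space to avoid numerical underflow (a choice made here, not stated in the text), stores transitions as `trans[j][i]` for the probability of moving from state j to state i, and treats state 0 as the silent start/stop state. The two-state toy model (an AT-rich and a GC-rich state) and all of its probabilities are invented for illustration.

```python
import math

def viterbi(seq, trans, emit):
    """Most probable path of emitting states for seq under an HMM.

    trans[j][i] = Pt(q_i | q_j); emit[i] maps symbol -> Pe(symbol | q_i).
    State 0 is the silent start/stop state, so emit[0] is unused.
    Returns (log probability, path), or (-inf, None) if seq cannot be emitted.
    """
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    n, L = len(trans), len(seq)
    V = [[-math.inf] * L for _ in range(n)]
    back = [[0] * L for _ in range(n)]
    for i in range(1, n):                                   # case k = 0
        V[i][0] = log(trans[0][i]) + log(emit[i].get(seq[0], 0.0))
    for k in range(1, L):                                   # case k > 0
        for i in range(1, n):
            j = max(range(1, n), key=lambda j: V[j][k - 1] + log(trans[j][i]))
            V[i][k] = (V[j][k - 1] + log(trans[j][i])
                       + log(emit[i].get(seq[k], 0.0)))
            back[i][k] = j
    # Termination: include the transition back into the silent state.
    last = max(range(1, n), key=lambda i: V[i][L - 1] + log(trans[i][0]))
    score = V[last][L - 1] + log(trans[last][0])
    if score == -math.inf:
        return score, None
    path = [last]
    for k in range(L - 1, 0, -1):                           # traceback
        path.append(back[path[-1]][k])
    return score, path[::-1]

# Toy model (hypothetical): state 1 is AT-rich, state 2 is GC-rich.
trans = [[0.0, 0.5, 0.5],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
emit = [{},
        {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}]
seq = "AATGCG"
score, path = viterbi(seq, trans, emit)
print(path, math.exp(score))
```

The two nested loops over states inside the loop over positions give the quadratic-in-states, linear-in-length running time described above.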

Whereas the Viterbi algorithm finds the most probable state path that could emit a given sequence, the forward algorithm computes the marginal probability of a sequence $S = x_0 x_1 \ldots x_{L-1}$, summing over all state paths by which S may have been emitted. The forward algorithm also admits an efficient dynamic programming solution similar to the Viterbi algorithm but with max operations replaced by sums. A by-product of the forward algorithm is the forward table, F, where $F(i, m) = P(q_i, x_0 x_1 \ldots x_m)$, the probability that the machine will emit subsequence $x_0 x_1 \ldots x_m$ and that it will be in state $q_i$ at time m. The backward algorithm similarly computes the backward table, B, where $B(i, m) = P(x_{m+1} x_{m+2} \ldots x_{L-1} \mid q_i)$, the probability that the machine will emit $x_{m+1} x_{m+2} \ldots x_{L-1}$ given that it is in state $q_i$ at time m. Combining the forward and backward tables, one can compute posterior probabilities $P(y_m = q_i \mid S)$, the probability that the m-th position of sequence S was emitted by state $q_i$. The Baum-Welch algorithm, a form of expectation-maximization (EM) (Dempster et al., 1977), uses the forward and backward algorithms to estimate maximum likelihood parameters for an HMM.

A simplistic HMM for gene structures is illustrated in Figure 8A. States {2, 3, 4} emit the sequence ATG (a start codon), states {8, 9, 10, 11, 12} emit TAG, TGA, or TAA (a stop codon), states {13, 14} emit GT (a donor splice site), states {16, 17} emit AG (an acceptor splice site), state 15 emits nucleotides using a distribution trained on introns, and states {5, 6, 7} emit codons according to a species-specific codon distribution. Additional states can be added to model nucleotide biases in positions flanking splice sites or start codons, and non-consensus splice sites can be accommodated by modifying, e.g., states {13, 14} to emit other dinucleotides with low probabilities.

Figure 8: (A) State-transition diagram of an HMM-based gene structure model. States labeled with nucleotides emit only those nucleotides; these are used to emit splice sites and start/stop codons. State 0 is the silent initial/final state. E=exon, I=intron, N=intergenic. Transition patterns enforce the syntax of genes. (Reproduced with permission from Majoros, 2007b) (B) An HMM depicted as a directed graphical model. Unobservables (white vertices) are states, and observables (gray vertices) are emissions. Arrows depict dependencies. (Reproduced with permission from Majoros, 2007a).

In traditional HMMs such as the one depicted in Figure 8A, the looping structure on the intron, exon, and intergenic states will induce a geometric length distribution, which is more appropriate for some features, such as introns, than others such as exons

(Majoros, 2007b).
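The geometric length distribution follows directly from the self-transition. As a brief derivation, let p denote the self-transition probability of a looping state (a symbol introduced here for illustration). The state is occupied for exactly d consecutive positions when it loops d-1 times and then exits:

```latex
P(d) = p^{\,d-1}(1 - p), \qquad d = 1, 2, 3, \ldots
\qquad\text{with mean}\qquad
E[d] = \sum_{d=1}^{\infty} d\, p^{\,d-1}(1 - p) = \frac{1}{1 - p}.
```

Because P(d) decreases monotonically in d, a single looping state always assigns its highest probability to the shortest length and cannot represent a length distribution whose mode lies at some intermediate length.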


2.1.2.2 Generalized hidden Markov models

A generalization of HMMs allows for individual states to emit more than one symbol at a time, so that whole features such as splice sites, exons, and introns can be emitted by single states. The result is called a generalized HMM (GHMM) (Kulp et al., 1996) or semi-Markov HMM (Burge and Karlin, 1997). There are a number of advantages of this approach: (1) arbitrary probabilistic submodels may be employed for computing probabilities of whole features, such as exons and introns; (2) implementation and modification of the model can be simpler, due to modularity; and (3) explicit length distributions can be imposed on coding features, which tend not to be geometric (Majoros, 2007b). Advantages 1 and 2 in particular enable more sophisticated modeling to be employed, which in practice has been found to produce more accurate predictions than those of traditional HMMs (Majoros, 2007b). An example GHMM for gene finding is illustrated in Figure 9.

The state emission submodels of a GHMM are partitioned into two classes: (1) signal sensors, which model fixed-length features such as splice sites and start/stop codons; (2) content sensors, which model variable-length features such as exons, introns, and intergenic regions. Commonly-used submodels for these sensors are described in sections 2.1.2.3 and 2.1.2.4.


Figure 9: State-transition diagram for a GHMM. Diamond states implement signal sensors for fixed-length features; oval states implement content sensors for variable-length features. “×3” indicates that the three coding phases can have different transition and emission probabilities. Transition patterns enforce gene syntax. (Reproduced with permission from Majoros, 2007b).

The ability of individual states to emit arbitrary sequences at each time point complicates decoding. Naïve GHMM decoding requires time $O(N^2 L^3)$, where the $L^3$ term arises out of the need to search arbitrarily far back in the dynamic programming matrix for trellis links (Majoros, 2007b). In practice, GHMM decoders pre-scan the sequence with a signal sensor and threshold its score to produce a sparse set of possible locations for its corresponding state to be active. This pre-scanning heuristic together with several assumptions, including geometric noncoding length distributions and factorable content sensors (Burge, 1997), enables efficient decoding of long sequences with GHMMs. The most commonly used GHMM decoding algorithm uses prefix-sum arrays (PSAs). In previous work I devised a more memory-efficient algorithm, dynamic signal propagation (DSP), that uses dynamic programming to propagate scores along a trellis (Majoros et al., 2005a). Both of these algorithms improve upon naïve GHMM decoding, such that in practice GHMM decoding with these algorithms can be as fast as decoding with a traditional HMM. These algorithms can also be used to decode certain types of conditional random fields (CRFs), as described in section 2.1.2.5.
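The general idea behind a prefix-sum array can be shown with a short sketch. It assumes a factorable content sensor, meaning that the log-probability of an interval is the sum of per-position log-probabilities, so that the score of any interval is the difference of two precomputed prefix sums. This illustrates only the prefix-sum idea, not the decoding algorithm itself, and the per-position probabilities are invented.

```python
import math
from itertools import accumulate

def prefix_sums(log_probs):
    """Return psa, where psa[k] is the sum of log_probs[0:k].

    The array has len(log_probs) + 1 entries, with psa[0] = 0.
    """
    return [0.0] + list(accumulate(log_probs))

def interval_score(psa, begin, end):
    """Log-probability of the half-open interval [begin, end) in O(1) time."""
    return psa[end] - psa[begin]

# Hypothetical per-position probabilities from a content sensor.
log_probs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.125)]
psa = prefix_sums(log_probs)
print(interval_score(psa, 1, 3))   # log(0.25 * 0.5)
```

With one such array per content sensor, a decoder can score a candidate exon or intron between any two signal positions without rescanning the interval.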

2.1.2.3 Signal sensors

The submodels in the signal states of a GHMM are commonly implemented using positional weight matrices (PWMs) (Stormo and Hartzell, 1989). In the probabilistic formulation of a PWM, each entry is a probability, and the entries in each column sum to 1. Thus, the PWM for state q is a matrix of values $P(x \mid i, q)$ for nucleotide x and position i relative to a fixed-length window evaluated by the matrix. PWMs are commonly trained by simple counting of each nucleotide at each position in a set of training examples and normalizing these counts into probabilities. A major shortcoming of PWMs is their assumption that positions are independent. A modification to PWMs called a weight array matrix (WAM) addresses this shortcoming by utilizing a Markov chain (section 2.1.2.4) within each column of the matrix, so that each position is modeled as conditional on some number of preceding positions in the matrix window, $P(x_i \mid i, x_{i-1}, x_{i-2}, \ldots, x_{i-n}, q)$. Various other models have been proposed that allow arbitrary dependencies between positions (reviewed in Majoros, 2007b).
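Training a PWM by counting and normalizing can be sketched in a few lines. The pseudocount is an addition made here so that unseen nucleotides do not receive zero probability (the text describes simple counting only), and the example donor-site windows are invented.

```python
import math

def train_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(x | i) for each window position i from aligned example sites.

    sites: equal-length strings, one per training example.
    Returns a list of dicts, one per position, each summing to 1.
    """
    width = len(sites[0])
    pwm = []
    for i in range(width):
        counts = {x: pseudocount for x in alphabet}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({x: c / total for x, c in counts.items()})
    return pwm

def log_score(pwm, window):
    """Log-probability of a window, treating positions as independent."""
    return sum(math.log(column[x]) for column, x in zip(pwm, window))

# Hypothetical donor-site windows beginning at the GT dinucleotide.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = train_pwm(sites)
print(log_score(pwm, "GTAAGT"), log_score(pwm, "CCCCCC"))
```

The sum in `log_score` is exactly the independence assumption criticized above: each position contributes a term that does not depend on its neighbors.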


2.1.2.4 Content sensors

Whereas signal sensors evaluate fixed-length features such as splice sites, content sensors compute probabilities for variable-length features such as exons and introns. In generative models such as GHMMs, the most common content sensor is the nth-order Markov chain (MC), which computes the probability of a nucleotide emitted by state q conditional on the current state and on some fixed number n of preceding nucleotides, $P(x_i \mid q, x_{i-1}, x_{i-2}, \ldots, x_{i-n})$. A popular value for n is 5, so that the conditional probabilities computed by the sensor reflect hexamer frequencies and thus capture di-codon frequencies in coding segments. Training of MCs is popularly done via simple counts that are normalized into probabilities, though for large n this can result in a poor fit for moderately sized training sets. To address this, several approaches have been devised that interpolate between different orders (i.e., conditioning on different numbers of preceding nucleotides), resulting in an interpolated Markov model (IMM) (Salzberg et al., 1998; Azad and Borodovsky, 2004).
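A count-based nth-order Markov chain can be sketched as follows. The pseudocount and the uniform fallback for contexts never seen in training are simplifications introduced here; they are not the interpolation scheme of an IMM. The training sequence and the use of order 2 (rather than 5) are chosen only to keep the example small.

```python
import math
from collections import defaultdict

def train_markov_chain(seqs, order, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(x_i | preceding `order` bases) by counting and normalizing."""
    counts = defaultdict(lambda: {x: pseudocount for x in alphabet})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {x: v / total for x, v in c.items()}
    return model

def log_prob(model, seq, order, alphabet="ACGT"):
    """Sum of log P(x_i | context) over positions i >= order.

    The first `order` bases are not scored; unseen contexts fall back to a
    uniform distribution (an assumption of this sketch).
    """
    uniform = 1.0 / len(alphabet)
    return sum(math.log(model.get(seq[i - order:i], {}).get(seq[i], uniform))
               for i in range(order, len(seq)))

model = train_markov_chain(["ACGACGACGACG"], order=2)
print(model["AC"])
print(log_prob(model, "ACGACG", 2), log_prob(model, "ACTACT", 2))
```

The number of contexts grows as 4 to the power n, which is why counts become sparse for large n and why interpolation between orders is attractive.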

2.1.2.5 Conditional random fields

HMMs are generative models, in that they give the probability of an abstract machine M generating a given sequence S via a given state path $\phi$: $P(S, \phi \mid M)$. As the states effectively label each position in the sequence, the state path is a labeling that enables the sequence to be segmented into predicted intervals corresponding to putative exons and introns. In contrast, conditional random fields (CRFs) (Sutton and McCallum, 2006) directly model the posterior probability of a labeling $\phi$, given the sequence: $P(\phi \mid S, M)$. CRFs thus attempt to directly discriminate between classes, and are therefore referred to as discriminative models. Discriminative models, when properly trained, are often capable of producing higher-accuracy predictions than generative models (Murphy, 2012). Generative models can also be trained discriminatively, by optimizing expected prediction accuracy rather than likelihood. Discriminatively trained generative models can sometimes produce higher-accuracy predictions than maximum-likelihood-trained generative models (Majoros and Salzberg, 2004).

Figure 10: A linear-chain CRF. (Reproduced with permission from Majoros, 2007a).

Just as an HMM can be interpreted as a directed graphical model (Figure 8B), CRFs can be interpreted as undirected graphical models. For gene structure prediction, a common topology for the graphical model is a linear-chain CRF, or LC-CRF (Figure 10), in which the unobservables y, corresponding to the labels assigned to individual positions in the input sequence, form a linear dependency chain. The observables (the nucleotides in the input sequence) are taken to be visible to the probability functions of the CRF, but the dependencies between observables are not explicitly modeled; thus, all observables are considered a single variate (Figure 10).


As with HMMs, CRFs may also be generalized so as to model whole features rather than individual nucleotides. The result is referred to as a generalized CRF (GCRF) or a semi-Markov CRF. As in GHMMs, GCRFs have signal sensors and content sensors, and while these need not compute properly-normalized probabilities, it is common to use standard probabilistic sensors from a GHMM within a GCRF (e.g., Vinson et al., 2007; Bernal et al., 2007). K-mers (substrings of length K, where K < 10 typically) and their frequencies are also commonly used (e.g., Bernal et al., 2007).

In order to compute the probability of a labeling, the graphical model must be decomposed into cliques. For LC-CRFs, these consist only of singleton vertices and pairs of vertices connected directly by a single edge. Associated with each clique c is a potential function, $\Phi_c(y, x)$, for y the unobservables in the clique, and x the global set of observables. The probability of a given labeling y under a conditional random field is given by:

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{c} \Phi_c(y_c, x)$$

(Sutton and McCallum, 2006) where $y_c$ is the subset of y corresponding to clique c, and $Z(x)$ is the normalizing constant obtained by summing the product of potentials over all possible labelings. Decoding with GCRFs can be done using methods directly analogous to those employed for GHMMs (Majoros, 2007a). Training issues for GCRFs are addressed in (Majoros, 2007a).
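The role of clique potentials and of the normalizing constant in a linear-chain CRF can be illustrated with a brute-force sketch. It assumes that the potentials are non-negative factors that are multiplied over the singleton and pairwise cliques and then normalized over all labelings, as in the standard formulation. Enumerating every labeling is exponential in the sequence length and is done here only for clarity; practical decoders use dynamic programming. The labels (E for exon, I for intron) and both potential functions are invented.

```python
import itertools

def crf_probability(y, x, labels, node_pot, edge_pot):
    """P(y | x) for a linear-chain CRF, normalized by explicit enumeration."""
    def unnormalized(labeling):
        score = 1.0
        for i, yi in enumerate(labeling):
            score *= node_pot(yi, x, i)                      # singleton cliques
            if i > 0:
                score *= edge_pot(labeling[i - 1], yi, x, i) # pairwise cliques
        return score

    Z = sum(unnormalized(lab)
            for lab in itertools.product(labels, repeat=len(x)))
    return unnormalized(tuple(y)) / Z

def node_pot(label, x, i):
    """Hypothetical potential: exons favor G/C, introns favor A/T."""
    gc = x[i] in "GC"
    return 2.0 if (label == "E") == gc else 1.0

def edge_pot(prev_label, label, x, i):
    """Hypothetical potential: adjacent positions tend to share a label."""
    return 3.0 if prev_label == label else 1.0

x = "GCAT"
labels = "EI"
print(crf_probability("EEII", x, labels, node_pot, edge_pot))
```

Both potential functions receive the whole observed sequence x, reflecting the point that the observables are visible to every potential while dependencies among them are not modeled.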


2.2 Transcriptional enhancers

Whereas section 2.1 focused on variation within genes, I now consider elements outside of genes that regulate gene expression. Transcriptional enhancers are relatively short (several hundred bp) genomic elements that modulate the expression levels of genes by mediating transcription initiation at promoters. In this section I review what is known regarding enhancer biology, experimental methods for measuring enhancer activity, computational methods for recognizing enhancers via their epigenetic signatures, and the role of enhancers in human disease.

2.2.1 Enhancer function in gene regulation

The coordinated regulation of global gene expression is a major determinant of organismal phenotypes. Cell differentiation in particular is determined largely through programmed changes in gene regulation (Davidson, 1990; St Johnston and Nusslein-Volhard, 1992; Arnon and Davidson, 1997; Heintzman et al., 2009; Reik, 2007). As described above, the expression of a gene begins with the transcription of RNA from a DNA template, and that transcription begins immediately proximal to a gene’s promoter, where the transcriptional machinery first assembles on the DNA. In the absence of any interaction with other genomic elements, many genes will be transcribed at very low, basal levels, at which the gene is considered to be effectively turned off. For many genes, physiological expression levels are achieved only through interactions with distal regulatory elements called enhancers, which can increase transcriptional output manyfold (Banerji et al., 1981; Banerji et al., 1983; reviewed in Shlyueva et al., 2014). This distal interaction is possible via chromosome looping, in which the DNA bends so that points in the genome that are distant in linear space become close in three-dimensional space (reviewed in Cavalli and Misteli, 2013).

Sequences within functional enhancers are enriched for binding motifs of regulatory proteins (Serfling et al., 1985; Hardison et al., 1992; Grossman et al., 2017). These binding motifs are short (generally <20 bp) and can be highly degenerate, with typically only a subset of motif positions having high information content. The binding of multiple transcription factor proteins in a cluster is a hallmark of both promoters (Tjian, 1978; Dynan and Tjian, 1983; Ohler et al., 2002) and enhancers (Serfling et al., 1985; Hardison et al., 1992). A wide variety of transcription factors (TFs) may bind at enhancers. In humans, the catalogue of known TFs numbers in the thousands (Vaquerizas et al., 2009).

For many TFs, the ability to bind to DNA is influenced by the presence of nucleosomes. Nucleosomes are complexes of histone proteins around which DNA is wrapped in the genome’s condensed state. DNA complexed with nucleosomes (or other proteins) is termed chromatin. Condensing of the genome represses gene expression, and enables cell-type-specific expression patterns to be maintained (Natarajan et al., 2012). Active enhancers and active promoters are typically depleted for nucleosome occupancy (Thurman et al., 2012). While most TFs are believed to be incapable of binding to nucleosome-bound DNA, some factors, termed pioneer factors, are thought to be able to bind to compacted chromatin and to promote nucleosome eviction (reviewed in Zaret and Carroll, 2011). For example, the FOX genes encode a family of transcription factors that are conserved between flies and humans (Weigel and Jäckle, 1990). One member of this family, FOXA1, can displace linker histone H1 (Iwafuchi-Doi et al., 2016), a histone that binds at the periphery of nucleosomes. Displacement of H1 results in the progressive unwinding of the rest of the nucleosome.

Enhancers may be megabases away from the genes they regulate (Nobrega et al., 2003; Amano et al., 2009), and can themselves be located within other genes (Perry et al., 2011), or even within a gene that they regulate (Arnold et al., 2013). Multiple enhancers can regulate the same gene (Perry et al., 2011), and one enhancer can regulate multiple genes (Nickol and Felsenfeld, 1988; González et al., 2008; Eun et al., 2013). Because of this potential complexity, and because enhancers are small and the human genome is very large, finding enhancers and determining which genes they regulate can be challenging. Experimental methods for testing putative enhancers and identifying their target genes are therefore invaluable.

2.2.2 Experimental methods for assaying enhancers

A number of experimental methods exist for assaying either the endogenous state of an enhancer in a genomic context or its ability to drive expression in a controlled setting.

One of the most commonly performed assays of endogenous enhancer state is chromatin immunoprecipitation (ChIP) followed by sequencing, or ChIP-seq (Johnson et al., 2007). In this assay, binding of a protein to DNA is fixed via formaldehyde crosslinking, the DNA is sheared into fragments, and then an antibody is used to precipitate the protein and its bound DNA fragments. Crosslinks can then be reversed and the newly unbound fragments can be sequenced. Sequencing reads are aligned to a reference genome and clusters of aligned reads are taken to indicate probable binding sites for the protein. Peak-calling software (e.g., Zhang et al., 2008b; Guo et al., 2012; Xing et al., 2012; Wang et al., 2013) is commonly used to identify peaks of read counts as probable binding locations, though the resolution of such peaks with standard ChIP-seq protocols is several hundred bp, which is substantially larger than binding sites for individual TFs.

ChIP-exo (Rhee et al., 2011) achieves higher resolution by utilizing a 5’-to-3’ exonuclease to trim away unbound portions of immunoprecipitated fragments.

Peaks are commonly identified using a differential analysis that compares read counts in one condition to another, or that compares read counts to a control. ChIP-seq peaks do not always indicate direct binding of regulatory proteins, because tethered (indirect) binding can also facilitate cross-linking and a detectable ChIP-seq peak. In addition to assaying TF binding, ChIP-seq can be used to identify histone modifications such as methylation or acetylation of lysine (section 2.2.3), assuming an antibody for the modification is available.

While ChIP-seq can be used to identify binding sites of individual TFs, other assays can identify locations exhibiting a depletion of nucleosome occupancy. DNase-seq (Boyle et al., 2008) performs digestion of DNA using a low concentration of DNase I, sequences the undigested fragments, maps those fragments to a reference genome to identify cut sites, and then identifies regions of elevated cut counts to infer a depletion of nucleosomes in that region. These DNase hypersensitive sites (DHSs) are enriched for locations where proteins are bound to DNA (The ENCODE Project Consortium, 2012), and thus DNase-seq is a more general method for identifying possible regulatory binding sites than ChIP-seq. Unlike ChIP-seq, DNase-seq does not identify the specific protein(s) bound, though motif analysis can be employed to postulate the identities of proteins or protein families that might be present (e.g., Machanick and Bailey, 2011; Pique-Regi et al., 2011). ATAC-seq (Buenrostro et al., 2013) is a more recent alternative to DNase-seq that requires fewer cells and has the advantage of a simpler experimental protocol.

Identifying the genes regulated by an enhancer is a long-standing problem. While it is common to assume that enhancers are most likely to regulate the nearest gene (in linear genomic distance), it has been estimated that fewer than 10% of enhancers regulate their nearest gene (Sanyal et al., 2012). Chromosome conformation capture assays identify regions of the genome that are not close in linear genomic space but that interact in three-dimensional space (e.g., Dekker et al., 2002; Simonis et al., 2006; Dostie et al., 2006). Hi-C (Lieberman-Aiden et al., 2009; Belton et al., 2012) is a particular chromosome conformation assay that employs high-throughput sequencing to allow interactions to be identified and quantified genome-wide. These assays do not produce nucleotide-resolution maps: a pair of interacting loci can be pinpointed only to within a 5 kb to 10 kb region. Furthermore, interactions between regions that are closer to each other than 10 kb cannot currently be detected reliably with Hi-C.

While the foregoing methods assay the epigenetic state of enhancers within their native genomic context, other methods have been developed to directly quantify the ability of an enhancer to drive gene expression. While stable integration of putative regulatory elements into the genome is possible (e.g., Maricque et al., 2017), another option is to perform transient transfection of recombinant plasmids in which an inserted enhancer is given the opportunity to drive expression of a reporter gene. Luciferase assays are commonly used to test small numbers of putative regulatory elements, but are laborious to apply for large numbers of sequences.

High-throughput reporter assays (HTRAs) (e.g., Kwasnieski et al., 2012; Melnikov et al., 2012; Patwardhan et al., 2012; Sharon et al., 2012; Arnold et al., 2013; White et al., 2013; Murtha et al., 2014) utilize an episomal reporter construct, together with a barcoding scheme and high-throughput sequencing, to measure the regulatory effects of large numbers of DNA sequences in parallel. The putative regulatory DNA is inserted into a plasmid, such that it can act as an enhancer or silencer for a reporter gene. Many thousands of such putative regulatory elements can be inserted in parallel into distinct plasmids and the whole population inserted into cells. A unique barcode is inserted into the reporter gene, so that transcripts produced from the reporter gene can be sequenced and the barcode examined to determine which unique piece of regulatory DNA was present on the plasmid. The relative amounts of RNA produced by different putative regulatory elements can be used to infer quantitative regulatory potentials for those elements, as described in detail in Chapter 5.
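As a rough illustration of the barcode-counting idea (and not the estimation procedure of Chapter 5), the following Python sketch pools barcode read counts per tested element and reports a log2 ratio of RNA to input plasmid DNA. The function name, the pseudocount, and the normalization against input DNA counts are assumptions made purely for illustration.

```python
import math
from collections import defaultdict

def element_activity(barcode_to_element, rna_counts, dna_counts, pseudocount=1.0):
    """Return log2(RNA/DNA) per element, pooling all barcodes of an element."""
    rna = defaultdict(float)
    dna = defaultdict(float)
    for barcode, element in barcode_to_element.items():
        rna[element] += rna_counts.get(barcode, 0)
        dna[element] += dna_counts.get(barcode, 0)
    # Dividing by library totals makes RNA and DNA libraries comparable.
    rna_total = sum(rna.values()) or 1.0
    dna_total = sum(dna.values()) or 1.0
    activity = {}
    for element in dna:
        r = (rna[element] + pseudocount) / rna_total
        d = (dna[element] + pseudocount) / dna_total
        activity[element] = math.log2(r / d)
    return activity

# Hypothetical counts: two barcodes for element e1, one for element e2.
barcodes = {"AAA": "e1", "CCC": "e1", "GGG": "e2"}
scores = element_activity(barcodes,
                          rna_counts={"AAA": 30, "CCC": 29, "GGG": 9},
                          dna_counts={"AAA": 10, "CCC": 9, "GGG": 19})
```

In this toy example e1 receives a positive score and e2 a negative one. A real analysis would also need to model replicate variability and barcode dropout.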

HTRAs have a number of potential limitations. The basal level of transcription initiation supported by the promoter may limit detection of certain regulatory effects.

For example, if a minimal promoter is used, so that the basal transcription level of the promoter is very low, then repressive effects might not be detectable, since absolute transcription levels cannot be reduced much below basal level. However, the promoter used in the assay can be swapped out for another promoter (e.g., Zabidi et al., 2015), and in particular a promoter with high basal transcriptional output (e.g., Juven-Gershon et al., 2006) could potentially be used in an attempt to improve detection of repression.

Another limitation is that transfection into a single cell type limits the regulatory programs that can be interrogated to those involving TFs that are expressed in that particular cell type. However, this may also be viewed as a strength, in that it provides information specific to the tested cell type. An additional consideration is that plasmids typically do not undergo chromatinization to the same extent as endogenous chromosomes. As a result, elements that would normally be repressed by chromatin state in a genomic context may respond in an episomal assay (Arnold et al., 2013). Conversely, elements active in a genomic context may fail to respond in a transient reporter lacking necessary chromatin features (Dickel et al., 2014). Testing of regulatory elements within a genomic context can also be achieved, by integrating the elements into chromosomes, and this can be done in a high-throughput manner as well (e.g., Akhtar et al., 2013; Dickel et al., 2014).

One particular high-throughput reporter assay, called Self-Transcribing Active Regulatory Region sequencing (STARR-seq) (Arnold et al., 2013), places the element to be tested into the 3’ untranslated region (UTR) of the reporter gene on a plasmid, so that mRNA transcribed from the gene will contain the regulatory element within. Sequencing those mRNA transcripts enables identification of the tested sequence, obviating the need to insert separate barcodes, since the element serves as its own barcode. Note, however, that this self-barcoding property could conceivably lead to biases compared to the use of multiple explicit barcodes per test element (Inoue and Ahituv, 2015).

In Chapter 5 I will describe a novel use of STARR-seq, in which different alleles are tested to infer genetic effects on gene regulation.

2.2.3 Epigenetic indicators of enhancer state

The experimental assays described above allow for the characterization of either the regulatory potential of putative enhancers outside a genomic context, or of the epigenetic state of regulatory elements within a genomic context. In the case of the latter assays, the main epigenetic features typically measured are chromatin accessibility, the presence of specific TFs bound directly or indirectly to DNA, and the presence of covalent histone modifications such as methylation and acetylation.

Chromatin accessibility, as commonly measured via DNase-seq or ATAC-seq, reflects a depletion of nucleosome occupancy. As such, accessible sites are candidates for locations where transcription factors might stably bind to DNA. Indeed, the potential for DNase I hypersensitivity patterns to indicate protein-DNA interactions has been known for decades (Galas and Schmitz, 1978; Stamatoyannopoulos et al., 1995). Computational models that combine DNase-seq with DNA sequence motif models have been able to predict TF ChIP data with high accuracy (Pique-Regi et al., 2011; Luo and Hartemink, 2013; Yardimci et al., 2014). DNase hypersensitive sites (DHSs) outside of promoters are thus commonly interpreted as candidates for active enhancers in the assayed cell type (The ENCODE Project Consortium, 2007; Boyle et al., 2008; The ENCODE Project Consortium, 2012).

Although active regulatory elements are generally depleted for nucleosomes, they may be flanked by stable nucleosomes, and the histones comprising those nucleosomes may exhibit post-translational modifications that can be assayed via ChIP-seq. Some commonly assayed histone modifications in gene regulatory studies are methylation and acetylation of lysines in histone H3, in particular lysines K4, K9, and K27. Acetylation marks are denoted ac, while methylation marks are denoted me1 (monomethylation), me2 (dimethylation), and me3 (trimethylation).

The strongest associations among currently known histone marks with enhancers appear to be with marks H3K4me1, H3K4me2, and H3K27ac (Ernst et al., 2011). While these marks are also present at many promoters, H3K4me3 provides some discrimination between enhancers and promoters, as it is enriched more strongly at promoters than at enhancers (Ernst et al., 2011). Other marks found to be enriched at enhancers include H3K27ac, H3K9me2, and H3K9me3 (Boyle et al., 2008) and H3K9ac (Pique-Regi et al., 2011). There is no consensus on the precise set of marks that are optimal for identifying likely enhancers, but it is clear that H3K4me1 and H3K27ac are marks of active enhancers (reviewed in Shlyueva et al., 2014).

Another strongly informative epigenetic factor is the presence of the histone acetyltransferase p300 (Eckner et al., 1994), which can be assayed via ChIP-seq. p300 (also known as Ep300) is a transcriptional co-activator that acetylates histone H3 on lysine K27. p300 is a co-factor, in that it does not bind directly to DNA, but is recruited to enhancers by other factors bound there. While not all sites with p300 present are active enhancers (Shlyueva et al., 2014 and references therein), sites with detectable p300 via ChIP-seq are strongly enriched for active enhancers (Heintzman et al., 2007; Heintzman et al., 2009).

2.2.4 Computational models of chromatin state

Performing ChIP-seq to assay epigenetic states genome-wide is now routine. This opens the possibility of using these data to identify enhancers bioinformatically. Variant prioritization efforts in particular stand to gain through the use of predicted regulatory elements, as known disease variants have been found to be enriched in noncoding regions that were predicted to be functional, based on their epigenetic state (Ernst and Kellis, 2010; Ernst et al., 2011; Butter et al., 2012; Ni et al., 2012; The ENCODE Project Consortium, 2012).

A plethora of machine-learning algorithms are available that can accommodate multiple continuous features in solving classification problems, and any of these could conceivably be used in classifying genomic regions as regulatory or non-regulatory as a function of chromatin state features. However, several aspects of the problem point to one class of model as especially suitable, namely multivariate hidden Markov models (MV-HMMs). The fact that predictive features cluster sequentially in the genome suggests a hidden Markov model, and the fact that multiple, quantitative features are to be used in the prediction suggests a multivariate model.

2.2.4.1 Multivariate hidden Markov models

As described in section 2.1.2.1, traditional HMMs for genomic sequences operate by transitioning from state to state and emitting a single nucleotide upon entering each state. The emission distribution over nucleotides is specific to each state. MV-HMMs have multivariate outputs, so that each state emits a vector of values into a set of output variables, (a_i, b_i, c_i, d_i, e_i, …), at a single time point i. This is not to be confused with GHMMs, in which each state emits values into a single variate x over a time interval, [i, i+n]. Moreover, the outputs of an MV-HMM can be discrete or continuous, and may be modeled as dependent or independent of each other. Standard methods for training and decoding HMMs can be applied to MV-HMMs once those methods have been modified to compute multivariate emission probabilities.
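To illustrate the last point, here is a minimal Python sketch of the standard forward algorithm in which only the emission function knows that observations are vectors. The data layout (a list of per-track Gaussian parameters for each state) and the assumption that tracks are independent given the state are illustrative choices, not a description of any particular software.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def emission(state, vector):
    """P(vector | state), treating tracks as independent given the state."""
    p = 1.0
    for x, (mean, sd) in zip(vector, state["tracks"]):
        p *= gaussian_pdf(x, mean, sd)
    return p

def forward_likelihood(states, start, trans, observations):
    """Forward algorithm; the recursion is unchanged from the univariate case."""
    n = len(states)
    f = [start[q] * emission(states[q], observations[0]) for q in range(n)]
    for vector in observations[1:]:
        f = [emission(states[q], vector) *
             sum(f[p] * trans[p][q] for p in range(n))
             for q in range(n)]
    return sum(f)
```

This sketch works in probability space for readability; on long sequences the recursion would need scaling or log-space arithmetic to avoid numerical underflow.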

2.2.4.2 ChromHMM

ChromHMM (Ernst and Kellis, 2010) is an MV-HMM framework designed for genomic sequences. It has been applied specifically to output variables that denote the presence of epigenetic features such as histone marks and DHSs. Outputs are averaged over 200 bp windows. Only discrete emissions are supported, though continuous outputs can be discretized for use with ChromHMM. ChromHMM was designed specifically to be used for unsupervised learning of chromatin states (profiles of epigenetic features). The process of learning chromatin states begins with a fully-connected model topology that enforces no syntax constraints, and applies EM to learn transition and emission probabilities. Output variables are assumed to be conditionally independent, given a state. After EM completes, human intervention is required post hoc to assign meaningful labels to states, based on known biology. The model can then be applied to segment (label) the genome via Viterbi or posterior decoding.
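The conditional-independence assumption can be made concrete with a small Python sketch. It assumes that each feature has been discretized to present (1) or absent (0) in each 200 bp window, so that a state's emission probability is a product of independent Bernoulli terms. The state names, marks, and probabilities below are invented for illustration.

```python
def bernoulli_emission(mark_probs, window):
    """P(window | state) as a product of independent Bernoulli terms.

    mark_probs[m] is the state's probability that mark m is present;
    window[m] is 1 if mark m was called present in this window, else 0.
    """
    p = 1.0
    for prob, present in zip(mark_probs, window):
        p *= prob if present else (1.0 - prob)
    return p

# Hypothetical states over the marks (H3K4me1, H3K27ac, H3K4me3).
enhancer_like = [0.9, 0.8, 0.1]
promoter_like = [0.3, 0.8, 0.9]
window = [1, 1, 0]  # H3K4me1 and H3K27ac present, H3K4me3 absent

p_enhancer = bernoulli_emission(enhancer_like, window)
p_promoter = bernoulli_emission(promoter_like, window)
```

Under these made-up parameters the window is far better explained by the enhancer-like state, which is the kind of contrast that post hoc state labeling relies on.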

2.2.4.3 MUMMIE

MUMMIE (MUltivariate Markov Modeling Inference Engine) is a general-purpose framework for implementing hand-crafted models that incorporate prior biological knowledge, and that are ideally trained using supervised learning from curated training sequences (Majoros et al., 2013). Incorporation of prior knowledge in machine-learning models can be beneficial when that knowledge is accurate (Mitchell, 1980; Wolpert, 1996). MUMMIE can model emissions at single-nucleotide resolution, can utilize full or sparse covariance matrices (for continuous features) to model dependencies between emission variates, and can accommodate arbitrary numbers of continuous and discrete data tracks. Continuous variates are modeled using Gaussian mixtures with an arbitrary (finite) number K of components:

$$P_e(x \mid q) = \sum_{i=1}^{K} \lambda_{qi}\, N(x; \mu_i, C_i)$$

$$N(x; \mu, C) = \frac{1}{\sqrt{(2\pi)^{d}\,|C|}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{T} C^{-1} (x-\mu)\right)$$

for state q and mixture weights $\lambda_{qi}$, with $\sum_i \lambda_{qi} = 1$ and $0 < i \le K$, where d is the dimension of x. Mixture weights $\lambda_{qi}$ are thus state-specific, allowing each state to effect its own specific emission distribution, while components $(\mu_i, C_i)$ are shared between all states, reducing the total number of parameters to be estimated and thus the propensity for over-fitting. Parameter estimation for continuous variates is performed as described in (Bilmes, 1998).
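The sharing of mixture components between states can be sketched in a few lines of Python. For brevity the sketch assumes diagonal covariance matrices, which is a simplification of the full or sparse covariances mentioned above; all names and parameter values are hypothetical.

```python
import math

def diag_gaussian(x, mean, var):
    """Multivariate Gaussian density with a diagonal covariance matrix."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def mixture_emission(x, weights_q, components):
    """Sum over i of lambda_qi * N(x; mu_i, C_i), with shared components."""
    return sum(w * diag_gaussian(x, mean, var)
               for w, (mean, var) in zip(weights_q, components))

# Two components shared by all states; each state stores only its weights.
components = [([0.0, 0.0], [1.0, 1.0]), ([4.0, 4.0], [1.0, 1.0])]
weights = {"background": [0.95, 0.05], "peak": [0.10, 0.90]}

x = [3.8, 4.1]
p_peak = mixture_emission(x, weights["peak"], components)
p_background = mixture_emission(x, weights["background"], components)
```

Adding a state in this scheme adds only K weights, whereas giving every state private components would add K means and K covariance matrices per state.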

Figure 11: A multivariate HMM for identifying miRNA target sites (Reproduced from Majoros et al., 2013; used with permission).

For mixed continuous-discrete distributions, assuming each discrete variate is conditionally independent of all other variates given the current state, the discrete variates can be updated independently using the standard EM update formula for discrete-emission HMMs (Bilmes, 1998). In order to support the modeling of multi-nucleotide motifs in DNA or RNA with dependencies between positions, MUMMIE also supports higher-order Markov chains (section 2.1.2.4) for its discrete emissions. For a symbol s with left context c (i.e., s is immediately preceded by subsequence c), the update formula for the emission of s given c and given the current state q and parameters θ from the previous iteration of EM is:

$$P_e(s \mid c, q, \theta) \leftarrow \frac{\sum_{i \in I(c,s)} F_{qi} B_{qi}}{\sum_{i \in I(c)} F_{qi} B_{qi}}$$

where $F_{qi}$ is the forward recurrence variable and $B_{qi}$ is the backward recurrence variable as described in section 2.1.2.1, $I(c)$ is the set of positions having left context c, and $I(c,s)$ is the subset of those positions at which symbol s occurs. The full emission probability in state q for a vector x containing the combined continuous and discrete tracks at a single position in the sequence is given by:

$$P_e(x \mid q) = \left[\sum_{i=1}^{K} \lambda_{qi}\, \frac{1}{\sqrt{(2\pi)^{d}\,|C_i|}} \exp\!\left(-\tfrac{1}{2}(x_{cont}-\mu_i)^{T} C_i^{-1} (x_{cont}-\mu_i)\right)\right] \prod_{j \in discrete(x)} P_e(x_j \mid c_j, q)$$

for left-context $c_j$ in track j, where discrete(x) indexes the discrete tracks in x, and $x_{cont}$ is a d-dimensional vector containing only the continuous tracks. MUMMIE’s parameter estimation is implemented using multiple threads, enabling a substantial speed-up on multiprocessing systems: for a training set consisting of N sequences, MUMMIE can utilize up to N threads, enabling at most an N-fold speed-up on a system with N or more CPUs.
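The higher-order emission update described above amounts to a weighted count of each symbol within each left context. The following Python sketch shows this for a single state and a single training sequence; it assumes that the products of the forward and backward variables have already been computed, and it skips the first few positions that lack a full-length context. The function and argument names are hypothetical.

```python
from collections import defaultdict

def update_discrete_emissions(sequence, fb_weights, order):
    """Re-estimate P(s | c, q) for one state q.

    fb_weights[i] is the precomputed product F_q(i) * B_q(i) at position i;
    the context c is the `order` symbols immediately left of position i.
    """
    numer = defaultdict(float)  # keyed by (context, symbol)
    denom = defaultdict(float)  # keyed by context
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        numer[(context, sequence[i])] += fb_weights[i]
        denom[context] += fb_weights[i]
    return {(c, s): w / denom[c]
            for (c, s), w in numer.items() if denom[c] > 0}

# First-order example with uniform weights.
probs = update_discrete_emissions("ACACAG", [1.0] * 6, order=1)
```

With uniform weights the update reduces to simple conditional frequencies: in this example C follows A in two of three occurrences, and G in the remaining one.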

MUMMIE’s utility was demonstrated by developing a model of miRNA target sites, based on PAR-CLIP data (Hafner et al., 2010) for the Argonaute protein that is involved in miRNA targeting via the RISC complex (Meister, 2013), and on evolutionary conservation (Figure 11). This model, called microMUMMIE, was found to produce high-accuracy predictions of miRNA target sites in human mRNAs (Majoros et al., 2013). A key insight that arose during the development of this model was that competitive prediction accuracy could be produced by a simple model with a small number of states. In particular, it was found that assigning one state per peak, slope, or flank in the continuous output tracks provided an effective methodology for devising a model with high parsing accuracy.

2.2.4.4 Segway

Segway (Hoffman et al., 2012) is a framework similar to MUMMIE, except that it assumes that features are conditionally independent and provides no support for modeling DNA sequence motifs. Like ChromHMM, Segway is geared toward unsupervised learning of chromatin states.

2.2.5 Enhancers and disease

A large proportion of variants associated with disease, as discovered via genome-wide association studies, are in non-coding regions of the genome and therefore likely to be regulatory in their effect (Visel et al., 2009; Sakabe et al., 2012; The ENCODE Project Consortium, 2012). Variants can alter gene regulation in a number of different ways, ranging from single nucleotide polymorphisms (SNPs) that alter TF binding, to chromosomal rearrangements that alter which genes are regulated by which enhancers (reviewed in Vockley et al., 2017).

In the case of small variants such as SNPs and short insertions/deletions (indels), a plausible mechanism for their contribution to disease is via disruption of binding potential for TFs in enhancers. Indeed, studies have found evidence that genetic variants occurring within the core binding motif can result in large differences in measured TF binding (e.g., Kasowski et al., 2010; Reddy et al., 2012). In a diploid cell line derived from B cells (GM12878, a lymphoblastoid cell line derived from Thousand Genomes Project individual NA12878) showing allelic differences in TF binding, variants previously found to be associated with disease were enriched in heterozygous binding (Reddy et al., 2012). In that study, the associated diseases were enriched for autoimmune disorders, consistent with the role of B cells in immune function. The same study found, however, that while variants within motifs could have strong effects on TF occupancy, the majority of instances of detected allelic binding differences were associated with variants outside the identified motif, suggesting other mechanisms that influence the ability of TFs to access and bind their cognate nucleotides.

There are a number of factors besides sequence motifs that influence TF binding. DNA shape is now known to strongly influence the ability of proteins to bind to DNA (Rohs et al., 2009; Gordân et al., 2013; Mathelier et al., 2016), and DNA shape is known to be strongly influenced by the underlying nucleotide sequence (Parker et al., 2009). As such, alteration of DNA shape via genetic variants is one possible mechanism for allelic differences in TF binding proximal to such variants. Nucleosome positioning is also influenced by DNA sequence (Gaffney et al., 2012), and individual variants have been found to influence chromatin accessibility as measured by DNase hypersensitivity (Degner et al., 2012). As nucleosome positioning and TF binding strongly influence each other (Raveh-Sadka et al., 2009; Wasson and Hartemink, 2009; Zhang et al., 2009), another plausible mechanism for a regulatory effect is via a variant’s direct influence on nucleosome positioning. Variants affecting chromatin accessibility have been found to be enriched for associations with the expression levels of proximal genes (Degner et al., 2012), again suggesting a likely route to disease. Yet another plausible mechanism is via variants that influence histone modifications that have a role in gene regulation. Such a mechanism has been demonstrated for promoters (Carr et al., 2007) and thus may very well be found for variants within enhancers.

In summary, there are a number of different mechanisms whereby genetic variants in enhancers may influence phenotypes. It has also been demonstrated that dysregulation of chromatin state, in the absence of proximal genetic variants, can lead to disease (reviewed in Sakabe et al., 2012). While I do not address this possibility directly in the current work, a model similar to the one described in Chapter 6 could potentially be used to identify epigenetic differences that may be causal for aberrant phenotypes. Putative sites identified in this way could be tested as to their molecular or phenotypic effects via epigenome editing with the CRISPR/dCas9 system (Hilton et al., 2015; reviewed in Vockley et al., 2017).

Chapter 3 – High-throughput interpretation of gene structure changes

Text and figures in this chapter were previously included in modified form in the following publication:

Majoros WH, Campbell MS, Holt C, DeNardo EK, Ware D, Allen AS, Yandell M, Reddy TE (2017) High-throughput interpretation of gene structure changes in human and nonhuman resequencing data, using ACE. Bioinformatics 33:1437-1446.

Author contributions: WHM designed and implemented the model and tested it on Thousand Genomes data. MSC and EKD applied the model to rice data. ASA provided advice on RVIS analysis. TER and MY supervised the work. WHM, TER, MY, MSC, and CH wrote the manuscript.

3.1 Motivation

The accurate interpretation of genetic variants and their impact on gene function is central to modern genetics, with implications for both disease studies and elucidation of basic biology. However, the complexities of eukaryotic gene structure and function challenge our ability to predict the effects of genetic variants on the products of expressed genes. The context of a genetic variant, whether in an exon, intron, or intergenic region, directly impacts the interpretation of likely variant effects. A number of bioinformatic tools are available for interpretation of individual variants, including ANNOVAR (Wang et al., 2010), SnpEff (Cingolani et al., 2012), VEP (McLaren et al., 2016), PolyPhen (Adzhubei et al., 2010), and SIFT (Kumar et al., 2009). These tools typically assume that gene structures are fixed and that multiple variants do not act in combination. A recent analysis of exome sequencing data of more than 60,000 individuals highlighted the importance of interpreting variants in the context of the entire haplotype, particularly in the case of variants that alter the annotated reading frame (Lek et al., 2016). In addition, while a number of high-quality gene annotation sets are available for humans and other species, including GENCODE (Harrow et al., 2012), RefSeq (Pruitt et al., 2014), and Ensembl (Yates et al., 2016), it has been demonstrated that variant interpretation results can be sensitive to the gene structures used in the analysis (McCarthy et al., 2014; Frankish et al., 2015).

A productive step toward improving our understanding of how genetic variants can impact gene function in an individual is to characterize the potential changes to gene structure that may be induced by sequence variants. Methods for computational modeling and prediction of eukaryotic gene structures have been well-disseminated (e.g., Guigo et al., 1992; Burge and Karlin, 1997; Lukashin and Borodovsky, 1998; Korf et al., 2001; Allen and Salzberg, 2005; Stanke et al., 2006; reviewed in Majoros, 2007b) and productively applied to the problem of annotating reference genomes, both human and non-human (Adams et al., 2000; Lander et al., 2001; Venter et al., 2001; Parra et al., 2007; Haas et al., 2008; Holt and Yandell, 2011; reviewed in Yandell and Ence, 2012).

However, traditional gene-finding approaches make several assumptions that limit their application to predicting deleterious effects on gene structure in individuals. Specifically, they assume that genes are well formed, have typical codon usage statistics, and ultimately produce functional proteins. Many approaches also take into account evolutionary conservation between species. Those assumptions enable gene-finding models to achieve high levels of accuracy in elucidating the structures of protein-coding genes in reference genomes. However, such assumptions also limit the ability of gene-finders to identify functional changes to gene structure between individuals of a species.

As an example, traditional de novo gene finders struggle to correctly model the ABO gene that determines human blood group. The allele that gives rise to the O blood group contains an early frameshift inducing a premature stop codon believed to result in either mRNA degradation or translation to a different protein lacking enzymatic activity (Yamamoto et al., 1990). Probabilistic gene finders predict an incorrect gene structure for the O allele that modifies the reading frame in order to avoid the in-frame stop codon (Figure 30), as doing so allows a downstream exon to be annotated as coding, resulting in a higher probability according to the gene-finder’s objective function. In this way, traditional gene finders conflate multiple molecular and evolutionary processes in order to integrate diverse signals and maximize predictive accuracy in identifying functional genes in reference genomes, and in doing so are hampered in their ability to identify changes to gene structure that result in loss of function in an individual.

Here we describe a novel approach (ACE: Assessing Changes to Exons) that aids the elucidation of differences in gene structure between individuals of a species. The key conceptual advance in ACE is that it does not assume that genes are fully functional in every individual. In particular, by considering within-species changes to gene structure without regard to possible downstream effects, ACE is able to identify changes to gene structure that may alter the function of the resulting protein, even if that protein is highly conserved between species. ACE can therefore predict individualized gene isoforms having altered, and possibly deleterious, protein function relative to the reference.

We demonstrate the use of ACE by generating personalized human transcriptome references for >2000 people sequenced as part of Phase 3 of the Thousand Genomes Project (The Thousand Genomes Project Consortium, 2015). We then quantify transcript expression using RNA-seq data from matched individuals for a subset of the 1000 Genomes Project sample. That analysis reveals that predicted cases of complete or partial loss of function in protein-coding genes via nonsense-mediated decay (NMD) are detectable as a reduction in transcript levels, albeit with much variation in the degree of reduction. That analysis also validates the use of ACE for identifying novel splice forms that may result when annotated splice sites are disrupted via sequence variants. In addition, we show that transcripts predicted to suffer loss of function in healthy adults are significantly depleted in genes found to be intolerant to mutation across the human population.

We designed ACE to be broadly applicable across eukaryotes. For that reason, we minimized the burden of extensive retraining for use on nonhuman species. We demonstrate that feature by confirming known phenotype-causing differences in gene structures between plant varieties.

3.2 Methods

ACE projects annotations from a reference genome onto a personal genome, and then assesses possible changes to gene structures (splicing patterns and translation reading frames) as to their potential to disrupt gene function. A flowchart depicting the major processing steps in ACE is shown in Figure 12. The following sections describe each of these steps in detail.

Figure 12: Flowchart of ACE logic. Major steps are (in order): projection of annotations from the reference, analysis of potential changes to splicing, and analysis of potential changes to translation reading frames.

3.2.1 Reconstructing haplotype sequences from a VCF file

ACE begins by reconstructing explicit haplotype sequences based on variants given in a phased VCF file (Figure 13A), including all single-nucleotide substitutions, multinucleotide substitutions, insertions, deletions, and short copy-number variants. VCF files may contain one or more samples (individuals). ACE processes each sample independently. ACE left-normalizes all variants (Tan et al., 2015) and disambiguates overlapping variants by computing the transitive closure of the overlap relation and applying the longest variant, provided all other overlapping variants are properly nested and call for consistent substitutions. ACE provides two warning levels corresponding to overlapping variants that are compatible versus those that are incompatible. Those warnings are provided in an easily parsed format to allow filtering of sequences by confidence level prior to downstream analyses. ACE uses tabix (Li, 2011) for efficient extraction of variants in pre-specified intervals (section 3.5), thus reducing the memory requirements for genome sequencing studies across large populations and facilitating parallelization on cluster compute environments. Detailed tracking of insertions and deletions allows ACE to efficiently compute a coordinate transformation to map reference annotations to haplotype sequences without the need to perform explicit sequence alignment (section 3.5).
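The idea of reconstructing a haplotype while tracking length changes, so that coordinates can be mapped without alignment, can be illustrated with a minimal Python sketch. It is not ACE's implementation: it assumes 0-based, sorted, non-overlapping variants that have already been phased onto one haplotype, and it omits VCF parsing, left-normalization, and overlap resolution. All function names are hypothetical.

```python
def apply_variants(reference, variants):
    """Build a haplotype and a list of (ref boundary, cumulative shift) pairs.

    variants: list of (pos, ref_allele, alt_allele) tuples, 0-based,
    sorted by position and non-overlapping (assumptions of this sketch).
    """
    pieces, offsets = [], []
    cursor, shift = 0, 0
    for pos, ref, alt in variants:
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at {pos}")
        pieces.append(reference[cursor:pos])
        pieces.append(alt)
        cursor = pos + len(ref)
        shift += len(alt) - len(ref)      # insertions add, deletions subtract
        offsets.append((cursor, shift))
    pieces.append(reference[cursor:])
    return "".join(pieces), offsets

def ref_to_hap(ref_pos, offsets):
    """Map a reference coordinate lying outside any variant to the haplotype."""
    shift = 0
    for boundary, cumulative in offsets:
        if ref_pos >= boundary:
            shift = cumulative
        else:
            break
    return ref_pos + shift

reference = "ACGTACGTAC"
haplotype, offsets = apply_variants(reference, [(2, "G", "GTT"), (6, "GT", "G")])
```

Here a 2 bp insertion followed by a 1 bp deletion shifts downstream coordinates by +2 and then by +1, so an annotated feature boundary at reference position 9 lands at haplotype position 10.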


Figure 13: (A) ACE reconstructs explicit haplotype sequences from a phased VCF file, projects reference annotations onto them, detects possible gene structure changes, and interprets changes in terms of possible loss of function. (B) When a disrupted splice site is encountered, ACE enumerates possible alternate splice forms resulting from cryptic splicing, exon skipping, intron retention, or any combination resulting from multiple variants.

3.2.2 Identifying changes to splice patterns and reading frames

ACE requires that all reference gene models contain valid splice site consensus sequences as defined in a user-supplied configuration file. Similarly, ACE requires that reference protein-coding gene models contain valid start and stop codons in a consistent reading frame. Reference genes that violate those constraints are reported as possible mis-annotations and removed from further consideration. For all noncoding and coding genes, ACE identifies splice sites in the reference that change in the individualized genome. Such changes may either be absolute, by disrupting a valid consensus splice site, or may weaken the splice site at flanking nucleotides. ACE evaluates the latter possibility by aligning to a probabilistic weight matrix, or PWM. Models of human splice sites are provided (section 3.5), and scripts to re-train for other organisms are also provided.
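The two kinds of change described above (loss of the consensus versus weakening at flanking positions) can be sketched in Python as follows. The 4-column matrix `TOY_DONOR_PWM`, its probabilities, the threshold, and the function names are hypothetical and chosen only for illustration; ACE's trained models and window sizes are described in section 3.5.

```python
import math

# Hypothetical 4-column donor model around the GT consensus (illustrative values).
TOY_DONOR_PWM = [
    {"A": 0.30, "C": 0.10, "G": 0.50, "T": 0.10},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
]

def pwm_log_likelihood(window, pwm):
    """Sum of log probabilities of each base in the window under the PWM."""
    if len(window) != len(pwm):
        raise ValueError("window length must equal PWM length")
    return sum(math.log(column[base]) for base, column in zip(window, pwm))

def classify_donor_change(alt_window, pwm, threshold,
                          consensus="GT", consensus_offset=1):
    """Classify a donor site in the alternate sequence.

    The consensus is checked first (an absolute change); otherwise the
    whole window is scored and compared with the threshold (weakening).
    """
    core = alt_window[consensus_offset:consensus_offset + len(consensus)]
    if core != consensus:
        return "consensus disrupted"
    if pwm_log_likelihood(alt_window, pwm) < threshold:
        return "weakened below threshold"
    return "retained"
```

With a threshold of -2.0, the window `GGTA` is retained, `CGTC` keeps its GT consensus but falls below the threshold, and `GCTA` loses the consensus outright.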

For each isoform of a gene in which a splice site is disrupted, ACE enumerates possible alternate splicing patterns for the isoform, including those in which an exon is skipped, an intron is retained, or a cryptic splice site is activated. By default, ACE identifies cryptic sites within 70 nucleotides (nt) of a disrupted site via a PWM thresholded to admit ~98% of known human splice sites. The default distance was selected after observing that ~75% of cryptic sites in DBASS, the Database of Aberrant Splice Sites (Buratti et al., 2007), are within that distance (Figure 31). For isoforms with multiple disrupted splice sites, ACE enumerates all combinations, corresponding to the set of paths through a splice graph for the gene (Figure 13B). The splice graph is constrained to include only annotated splice sites and putative cryptic sites proximal to a disrupted annotated site.
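A simplified Python sketch of this enumeration is given below. It treats each disrupted site as an independent choice among nearby cryptic sites, exon skipping, and intron retention, and takes the Cartesian product across sites. This is a stand-in for path enumeration in a splice graph, not ACE's algorithm, and it ignores structural constraints between neighboring sites. The function names, the dictionary of pre-scored candidate positions, and the threshold are hypothetical.

```python
from itertools import product

def cryptic_candidates(disrupted_pos, scored_sites, max_dist=70, threshold=0.0):
    """Positions within max_dist nt of the disrupted site scoring >= threshold."""
    return sorted(
        pos for pos, score in scored_sites.items()
        if pos != disrupted_pos
        and abs(pos - disrupted_pos) <= max_dist
        and score >= threshold
    )

def enumerate_alternatives(disrupted, scored_sites, allow_intron_retention=True):
    """List every combination of outcomes across the disrupted sites."""
    per_site = []
    for pos in disrupted:
        options = [("cryptic", c) for c in cryptic_candidates(pos, scored_sites)]
        options.append(("exon_skipping", pos))
        if allow_intron_retention:
            options.append(("intron_retention", pos))
        per_site.append(options)
    return list(product(*per_site))

# Toy example: position 130 qualifies, 180 is too far, 95 scores too low.
SITES = {100: -5.0, 130: 1.2, 180: 2.0, 95: -1.0}
ALTS = enumerate_alternatives([100], SITES)
```

Disabling intron retention, as was done for the human analyses in section 3.2.5, corresponds here to `allow_intron_retention=False`.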

ACE also identifies possible changes to reading frames. In cases in which the original start codon of a protein-coding gene is absent in the alternate sequence, ACE searches for the first downstream start codon of sufficient strength via a PWM. Changes to 5’ untranslated regions trigger a scan for upstream start codons that may be created as a result. For transcripts annotated as noncoding, ACE searches for reading frames longer than a configurable minimum length (default: 150 nt), and reports whether the reading frame exists in both the reference and alternate sequence (suggesting possible mis-annotation of the gene as noncoding) or only the alternate sequence (suggesting possible gain of function in the alternate sequence, or loss of function in the reference individual).
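The scan of noncoding transcripts for reading frames can be sketched in Python as below. This is a minimal illustration under stated assumptions: a start codon is taken to be any ATG (ACE scores start codons with a PWM), the length is measured from the A of the ATG through the stop codon, and the labels returned by the hypothetical function `classify_noncoding` are paraphrases of the interpretations given in the text. The case of a reading frame present only in the reference is not discussed in the text and is labeled separately.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(seq, min_len=150):
    """Return (start, end) of the longest ATG-to-stop ORF of >= min_len nt, or None."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                end = i + 3
                length = end - start
                if length >= min_len and (best is None or length > best[1] - best[0]):
                    best = (start, end)
                start = None
    return best

def classify_noncoding(ref_seq, alt_seq, min_len=150):
    """Compare ORF presence in the reference and alternate spliced sequences."""
    in_ref = longest_orf(ref_seq, min_len) is not None
    in_alt = longest_orf(alt_seq, min_len) is not None
    if in_ref and in_alt:
        return "possible mis-annotation as noncoding"
    if in_alt:
        return "reading frame only in alternate sequence"
    if in_ref:
        return "reading frame only in reference (case not covered in the text)"
    return "no qualifying reading frame"

# Toy sequences: a 156 nt ORF preceded by two bases, and a 9 nt ORF.
LONG = "CC" + "ATG" + "GCT" * 50 + "TAA"
SHORT = "ATGGCTTAA"
```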

3.2.3 Identifying loss of function

For protein-coding genes, ACE identifies instances of protein truncation or nonsense-mediated decay (NMD), either in the mapped transcript or in alternate transcripts proposed when a splice site is disrupted. NMD is predicted based on the linear nucleotide distance between an in-frame stop codon and the most 3’ exon junction in the spliced mRNA. Distances greater than 50 nt have been shown to trigger NMD (Nagy and Maquat, 1998), and this phenomenon appears to be conserved between vertebrates and plants (Nyiko et al., 2013). ACE also reports likely loss of function (LOF) due to lack of either a valid in-frame stop codon or lack of a start codon scoring above the PWM threshold. Scans for start/stop codons are performed on spliced transcripts, so that start/stop codons straddling an intron are not overlooked. To enable filtering at arbitrary similarity thresholds, protein alignment scores (section 3.5), defined as the percent sequence match between the reference and alternate proteins, are reported. Protein sequences are also emitted to allow detailed downstream analysis of amino acid changes by programs such as PolyPhen (Adzhubei et al., 2010), SIFT (Kumar et al., 2009), or VAAST (Hu et al., 2013).
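The 50 nt rule for NMD can be written as a short Python predicate. This is a sketch under stated assumptions: the distance is measured from the end of the stop codon to the last exon junction in spliced-mRNA coordinates (the text does not say from which base of the stop codon ACE measures), and the function name and argument layout are hypothetical.

```python
def predicts_nmd(exon_lengths, stop_end, min_dist=50):
    """Apply the 50 nt rule to a spliced transcript.

    exon_lengths: exon lengths in 5'-to-3' order.
    stop_end: 0-based mRNA coordinate just past the in-frame stop codon.
    Returns True if the stop codon lies more than min_dist nt upstream
    of the most 3' exon junction.
    """
    if len(exon_lengths) < 2:
        return False  # no exon junction, so the rule cannot apply
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end > min_dist
```

For exons of 200, 150, and 300 nt the last junction is at mRNA position 350, so a stop codon ending at position 250 (100 nt upstream) predicts NMD, while one ending at position 320 (30 nt upstream) or anywhere in the last exon does not.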

3.2.4 Configuration and structured output

ACE is fully configurable in all of the parameters described above, via a simple configuration file (section 3.5). ACE produces a highly structured output file (Figure 32) describing gene structures in the reference and alternate sequences and results of their detailed comparison. The variants incorporated into the haplotype sequences are listed and classified as to their context within gene elements. Classification of variants is performed separately for both mapped isoforms and putative novel splice forms, so as to highlight changes to a variant’s context between isoforms. Outputs are hierarchically structured to enable structured queries; outputs are in a novel format called Essex, which is based on LISP S-expressions, providing structure while retaining greater human readability than XML. We provide scripts for querying and filtering outputs and for converting to XML or GFF for use with other software.

3.2.5 Computational validation

To demonstrate the utility of ACE for large-scale genome sequencing projects, we used ACE to fully annotate the genomes of 2504 human samples sequenced by the Thousand Genomes Project. The analysis was parallelized across 500 compute nodes, and required two weeks to complete. GENCODE version 19 (Harrow et al., 2012) annotations were used as reference annotations for that analysis. To validate predicted novel isoforms, we aligned RNA-seq data from lymphoblastoid cell lines (LCLs) from 445 of the same individuals to the individualized genomes generated by ACE, using TopHat 2 (Kim et al., 2013). RNA data was obtained from the Geuvadis project (Lappalainen et al., 2013). We used StringTie (Pertea et al., 2015) to quantify transcript abundance. Recent benchmarks have shown StringTie’s accuracy to be competitive with other state-of-the-art methods, though it is also clear that transcript abundance estimation is still an inaccurate process (Hayer et al., 2015). Thus, for validation of putative novel splice forms we rely primarily on finding spliced reads that map precisely to the putative splice junctions. We provided TopHat 2 and StringTie with both reference annotations mapped to the individualized genomes, as well as novel transcripts predicted by ACE (section 3.5). For the analyses of human genes, we disabled intron retention as it has been found to be present in the Geuvadis data at lower levels than cryptic splicing and exon skipping (Lappalainen et al., 2013; Monlong et al., 2014), and has been shown to be overwhelmingly likely to lead to loss of function in human coding genes (Braunschweig et al., 2014; Jung et al., 2015).

To quantify the effect of predicted NMD events, we analyzed the relationship between transcript abundance and the number of NMD alleles in an individual, under the hypothesis that each additional NMD allele in an individual would result in a proportionate decrease in transcript abundance for a given gene isoform. We fit a linear mixed-effects model, log2(FPKM) ~ Xb + Zu, to the transcript abundance estimates provided by StringTie, where FPKM (fragments per kilobase of transcript per million reads mapped) measures transcript abundance, X is the number of functional (non-NMD) alleles, and Z is an indicator variable encoding the transcript identifier. The random-intercept term Zu incorporates a different intercept for each transcript, accounting for natural differences in expression between different transcripts and genes in LCLs.
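The role of a per-transcript intercept can be illustrated without a mixed-model library by the closely related within-group (fixed-effects) estimator, which removes each transcript's own mean before estimating a shared slope. The Python sketch below uses that swapped-in technique purely for illustration. It is not the linear mixed-effects fit used in this analysis, and the function name `within_slope` and the record layout are hypothetical.

```python
from collections import defaultdict

def within_slope(records):
    """Estimate a shared slope after centering each transcript's data.

    records: iterable of (transcript_id, n_functional_alleles, log2_fpkm).
    Centering on per-transcript means absorbs a separate intercept for
    every transcript, so only within-transcript variation informs the slope.
    """
    groups = defaultdict(list)
    for tid, x, y in records:
        groups[tid].append((x, y))
    sxy = sxx = 0.0
    for obs in groups.values():
        mx = sum(x for x, _ in obs) / len(obs)
        my = sum(y for _, y in obs) / len(obs)
        sxy += sum((x - mx) * (y - my) for x, y in obs)
        sxx += sum((x - mx) ** 2 for x, _ in obs)
    if sxx == 0:
        raise ValueError("no within-transcript variation in allele count")
    return sxy / sxx

# Synthetic data: two transcripts with very different baselines, same slope 0.5.
TOY = [("t1", 0, 1.0), ("t1", 1, 1.5), ("t1", 2, 2.0),
       ("t2", 1, 5.5), ("t2", 2, 6.0)]
```

On the synthetic records the estimate is 0.5 even though the two transcripts differ in baseline expression by several log2 units, which is the behavior the random-intercept term is meant to provide.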

Values of b were estimated after filtering transcripts at a range of minimum FPKM thresholds (applied to mean FPKM across all samples for each transcript), in order to assess stability of b estimates at different abundance thresholds. Estimates of b were transformed (section 3.5) into relative abundance ratios r_{0/2} = FPKM_0 / FPKM_2, where FPKM_k denotes mean FPKM among individuals predicted to have k functional alleles of a transcript. Thus, 1 - r_{0/2} is the proportionate reduction in NMD homozygotes relative to individuals with two functional alleles. In addition to the model with random intercepts, we also fit a model with both random intercepts and random slopes: log2(FPKM) ~ X(b + v) + Zu, where v is a random slope term for each transcript.

As the 1000 Genomes Project individuals were reportedly healthy adults, we expected isoforms with LOF in at least one individual to be enriched for genes tolerant of functional mutations. We expected this effect to be stronger for genes found with homozygous LOF, because those would exhibit both recessive and dominant effects. To test this, we analyzed the distributions of Residual Variant Intolerance Score (RVIS) (Petrovski et al., 2013) and noncoding RVIS (ncRVIS) (Petrovski et al., 2015) percentiles for genes in which ACE predicts LOF for at least one annotated isoform of the gene in 1000 Genomes Project samples. RVIS reflects the intolerance of genes to functional mutations affecting amino acids in protein-coding genes, while ncRVIS reflects intolerance to mutations in noncoding portions of genes.

To demonstrate the applicability of ACE to nonhuman species, we also analyzed 30 rice samples with fully sequenced genomes (The 3000 Rice Genomes Project, 2014).

3.3 Results

3.3.1 ACE predicts changes to gene structure

In the 1000 Genomes Project samples, ACE predicted a modest number of alternative splice forms for each disrupted splice site: 80% of cases involve at most three alternate patterns per disrupted site (median = 2, mode = 1) (Figure 14A). When the alternate structures predicted in the Geuvadis samples are provided as annotations (in addition to mapped reference annotations), TopHat 2 is able to assign spliced reads to significantly more of the putative novel junctions than if TopHat 2 is provided only mapped reference annotations (cryptic-site isoforms: Figure 14B, Wilcoxon W = 513660, P < 2.2×10^-16; exon-skipping isoforms: Figure 33A, W = 537900, P < 2.2×10^-16). Similarly, StringTie assigns nonzero FPKM values to significantly more of these putative novel splice patterns when they are provided as annotations than when they are not provided (cryptic sites: Figure 14C, W = 198020, P < 2.2×10^-16; exon-skipping: Figure 33B, W = 198020, P < 2.2×10^-16). As such, ACE improves the sensitivity of both spliced read mapping and transcript quantification for putative novel isoforms when an annotated splice site is disrupted, and it is able to do so while predicting conservative numbers of such alternate splice patterns per disrupted site.

We also applied transcript quantitation methods Salmon (Patro et al., 2016) and Kallisto (Bray et al., 2016) to the Geuvadis data and quantified the number of ACE-predicted novel transcripts that were assigned expression values above a range of thresholds (Figure 34). Due to the substantial differences between expression estimates by the three approaches, we instead used raw counts of spliced reads aligning exactly to predicted novel splice junctions to investigate the specificity of ACE’s predictions. As a negative control, we randomly sampled 3.25×10^6 non-disrupted, annotated splice sites from the Geuvadis samples, and used ACE to generate putative novel splice patterns that could result if the splice site had been disrupted. We then quantified support for these negative control splicing events via the number of spliced reads assigned by TopHat 2 to the junctions.

Due to the stochastic nature of eukaryotic splicing, some splicing at non-annotated sites is expected (Pickrell et al., 2010; Stepankiw et al., 2015). The proportion of ACE cryptic-site predictions, for disrupted splice sites, that are supported by at least one spliced read is significantly greater (Wilcoxon rank-sum test: W = 502780, P < 2.2×10^-16) than the proportion of supported predictions for the randomly selected non-disrupted sites (Figure 14D) (exon-skipping: Figure 33C; W = 699470, P < 2.2×10^-16). Similar results for all of the above comparisons were obtained when applying higher read-count or FPKM thresholds (Figure 35, Figure 36). Furthermore, the numbers of spliced reads supporting predicted novel splice junctions are significantly greater in the case of disrupted splice sites than for non-disrupted sites (raw read counts: Figure 14E, W = 785190, P < 2.2×10^-16; normalized read counts: Figure 37, W = 791430, P < 2.2×10^-16).

Among those transcripts with disrupted splice sites for which ACE predicted at least one alternate splice form, in 55.5% of cases at least one ACE prediction was supported by at least three spliced reads mapped to the novel splice junction. Possible outcomes that may comprise the remaining cases but that we did not investigate include: intron retention, use of cryptic sites further than the 70 bp limit, failure to sequence spliced products due to low intrinsic expression levels, and accelerated degradation of aberrant transcripts by RNA surveillance pathways. Sampling error may have also contributed.

When multiple cryptic sites were available and at least one site was supported by at least three spliced reads, support for more than one site was found in only 13.4% of cases, suggesting possible discrimination among available cryptic sites by the splicing machinery.


[Figure 14 panels: (A) frequency versus number of alternate structures predicted; (B) density versus proportion of cryptic-site isoforms supported by at least one spliced read, without and with hints from ACE; (C) density versus proportion of cryptic-site isoforms assigned nonzero FPKM, without and with hints from ACE; (D) density versus proportion of cryptic-site isoforms supported by at least one spliced read, for non-disrupted and disrupted sites; (E) density versus log10(reads per junction), for non-disrupted and disrupted sites.]

Figure 14: (A) Distribution of number of alternate structures predicted per disrupted splice site. (B) Distribution of proportions of predicted cryptic-site isoforms supported by at least one spliced read, when predicted isoforms are not provided to TopHat 2 (blue) and when they are provided (red). (C) Distribution of proportions of predicted cryptic-site isoforms assigned nonzero FPKM by StringTie when predicted isoforms are not provided to StringTie (blue) and when they are provided (red). (D) Distribution of proportions of predicted cryptic-site isoforms supported by at least one spliced read for splice sites simulated to be disrupted (blue) and for those that are disrupted (red). (E) Distribution of spliced reads per junction, on log10 scale, supporting sites simulated to be disrupted (blue) versus those that are disrupted (red).

As an additional negative control, we quantified mean cryptic splicing activity in the vicinity of all annotated splice sites that were disrupted in some individuals but not in others. We found that cryptic splicing levels were higher in individuals with disruption of the annotated splice site (Figure 38). That result illustrates that stochastic splicing does result in occasional use of cryptic sites, but that cryptic splicing is enriched near functional sites that have been disrupted.

3.3.2 ACE identifies thousands of annotated human splice sites as being potentially robust to disruption

To further explore the utility of ACE in identifying alternate splice forms that may arise when an annotated splice site is disrupted, we simulated disruption to every annotated splice site in every protein-coding gene in the human reference and classified each site as to whether there existed an alternate splice pattern found by ACE that could produce a highly similar protein product. Only alternate splice forms that did not result in a prediction of NMD, did not lack a start or stop codon, and encoded a protein differing by no more than ten amino acids (aa) from the reference protein were accepted as potentially retaining function.
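The acceptance criteria in the preceding paragraph amount to a simple predicate over each alternate splice form, sketched below in Python. The `AltForm` record and its field names are hypothetical stand-ins for whatever ACE reports per alternate structure. Only the three criteria themselves (no predicted NMD, start and stop codons present, at most ten amino acids different) come from the text.

```python
from dataclasses import dataclass

@dataclass
class AltForm:
    """Hypothetical summary of one alternate splice form."""
    nmd: bool          # NMD predicted for this form
    has_start: bool    # start codon present
    has_stop: bool     # in-frame stop codon present
    aa_diff: int       # amino acids differing from the reference protein

def potentially_retains_function(form, max_aa_diff=10):
    """Apply the three acceptance criteria to one alternate form."""
    return (not form.nmd
            and form.has_start
            and form.has_stop
            and form.aa_diff <= max_aa_diff)

def site_is_robust(forms, max_aa_diff=10):
    """A site counts as potentially robust if any alternate form is accepted."""
    return any(potentially_retains_function(f, max_aa_diff) for f in forms)

FORMS = [AltForm(True, True, True, 0), AltForm(False, True, True, 4)]
```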

Nearly 80,000 human splice sites (78,226 / 377,278 = 20.7%) in 15,134 genes were deemed by ACE to be potentially robust to disruption. A more conservative PWM threshold that would reject ~20% of annotated human splice sites still results in over 30,000 (32,465 / 377,278 = 8.6%) splice sites being identified as potentially robust to disruption. These results indicate that there may be ample opportunities to reduce false positives in disease studies in which splicing defects are suspected, by applying ACE for interpretation of these altered gene structures. When tissue samples are available, putative splice forms proposed by ACE can be validated against RNA-seq data by providing them as annotations to a transcript quantification pipeline as described in the previous section, or by validating protein presence via western blot.

Among Thousand Genomes Project samples, the mean proportion of transcripts with disrupted splicing for which ACE was able to identify at least one alternate structure with no predicted LOF according to the above criteria was 0.46 (SD = 0.08; Figure 39A). This is a 2.2-fold enrichment compared to the 0.21 estimated for the genome-wide scan, possibly reflecting the effects of natural selection on this control population.

3.3.3 ACE confirms previous estimates of the effect of nonsense-mediated decay on transcript levels

Nonsense-mediated decay accounted for over two-thirds (69%) of the loss-of-function predictions in the Thousand Genomes Project samples. To better understand the impact of NMD on expression of target genes, we used the Geuvadis RNA-seq data and the transcript quantification pipeline described above to quantify the effect of NMD in terms of the average reduction in transcript levels per NMD allele, relative to individuals with two functional alleles. We first restricted our analysis to heterozygous individuals.


Figure 15: (A) Distribution of log2 effect sizes of N = 578 heterozygous NMD events as measured via RNA-seq transcript quantification. Dashed line at -0.42 denotes a 25% reduction in total transcript quantity. Data were filtered to improve power (sample size≥30, mean FPKM≥1). (B) Percentiles of Residual Variant Intolerance Scores (RVIS) for N = 633 genes in which at least one individual was predicted to be homozygous for gene loss of function.

Based on the results of earlier, in vitro experiments showing that NMD achieves a halving of transcript levels in episomal mini-gene constructs (Rosenberg et al., 2015), we hypothesized that each additional NMD allele at a diploid locus would reduce total transcript levels by 25%, so that the homozygous NMD state should result in a halving of mean FPKM. In Figure 15A we show, on a log2 scale, the distribution of effect sizes E = FPKM_1 / FPKM_2 for autosomal transcripts expressed in LCLs, where FPKM_1 is the mean FPKM pooled among heterozygous individuals (having one NMD allele and one functional allele), and FPKM_2 is the mean FPKM pooled among individuals having two functional alleles. The observed distribution matches our expectation of a 25% reduction (denoted by the dashed line) among heterozygotes, albeit with much variability, as also noted previously based on a subset of this data from 119 individuals (MacArthur et al., 2012). Applying higher FPKM thresholds produced similar results (Figure 40).
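The position of the dashed line in Figure 15A follows directly from the 25% hypothesis, since log2(0.75) is approximately -0.415. A minimal Python sketch of the effect-size calculation is shown below. The function name and the toy FPKM values are hypothetical, and pooling is represented by a plain arithmetic mean.

```python
import math
from statistics import mean

def log2_effect_size(fpkm_het, fpkm_functional):
    """log2 of E = FPKM_1 / FPKM_2 for one transcript.

    fpkm_het: FPKM values from heterozygous individuals (one NMD allele).
    fpkm_functional: FPKM values from individuals with two functional alleles.
    """
    return math.log2(mean(fpkm_het) / mean(fpkm_functional))

# Expected value under a 25% reduction per NMD allele.
EXPECTED_HET = math.log2(0.75)

# Toy data in which heterozygotes show exactly a 25% reduction.
TOY_EFFECT = log2_effect_size([7.5, 7.5], [10.0, 10.0])
```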

In order to extend the analysis to include homozygotes, we fit the linear mixed-effects model described in Methods to the Geuvadis data. Using a linear mixed model with random intercepts allows us to more rigorously account for differences in expression levels between genes and isoforms, as each isoform can have a different (random) intercept. After filtering to include only transcripts expressed in at least 30 individuals (to improve statistical power) and having both NMD and non-NMD predictions, we were left with 578 heterozygous and 38 homozygous observations. All estimates of coefficient b were significantly different from zero (all P < 2×10^-27), and estimates were robust to filtering of the data at different minimum FPKM thresholds (Figure 41A). The largest estimated b = 0.37 (SE = 0.01) approaches, but does not achieve, a complete halving of transcript levels (r_{0/2} = 0.60) for homozygotes. Adding a random slope to the model produced revised estimates with a mean of 0.499 and standard deviation 0.06, which more closely matched our expectation of a complete halving (Figure 41B).
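The reported pair b = 0.37 and r_{0/2} = 0.60 is consistent with the transformation implied by the fixed-effect term of the model: if log2(FPKM) changes by b per functional allele, then FPKM_0 / FPKM_2 = 2^(-2b). The Python sketch below assumes that this is the transformation referred to in section 3.5, which is an assumption made here for illustration. The function name `abundance_ratio` is hypothetical.

```python
def abundance_ratio(b, k_low=0, k_high=2):
    """Ratio FPKM_{k_low} / FPKM_{k_high} implied by log2(FPKM) = X*b + intercept.

    The intercept cancels in the ratio, leaving 2 ** (b * (k_low - k_high)).
    """
    return 2.0 ** (b * (k_low - k_high))

R_LARGEST = abundance_ratio(0.37)      # largest random-intercept estimate
R_RANDOM_SLOPE = abundance_ratio(0.499)  # mean of random-slope estimates
```

Under this reading, b = 0.37 gives a ratio close to 0.60 and b = 0.499 gives a ratio close to 0.50, matching the statement that the random-slope estimates come nearer to a complete halving.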

NMD events resulting from the creation of new upstream start codons were omitted from the above analysis, as many of these likely constitute so-called upstream open reading frames (uORFs), which can affect gene expression in myriad ways (Barbosa et al., 2013). Fitting the above model to these uORF NMD predictions at various FPKM thresholds consistently results in an estimated b ≤ 0, indicating that NMD in uORFs is not predictable using the established methods for downstream reading frames, possibly due to their position near the 5’ cap site on the circularized RNA (Silva et al., 2008; Peixeiro et al., 2012) or to the potential for reinitiation of translation downstream (Neu-Yilik et al., 2011). As such, ACE marks all NMD predictions in uORFs as hypothetical and provides position and length information for the uORF, enabling users to interpret them on a case-by-case basis.

3.3.4 ACE’s loss-of-function predictions in healthy individuals are highly enriched for genes tolerant to mutation

All 2504 individuals in the 1000 Genomes Project sample harbored alleles predicted to suffer loss of function (LOF). Using ACE we estimated a median of 148 LOF genes per individual (range: 115-192), which is higher than the estimate of 97 based on experimental validation and stringent filtering of variants in a single European individual (MacArthur et al., 2012), but similar to the estimate of 149-182 truncation events found by Phase 3 of the Thousand Genomes Project study (The Thousand Genomes Consortium, 2015).

For healthy adults we expect LOF predictions to be enriched for genes not critical to survival, and thus to have elevated tolerance to functional mutation. As described in Methods (section 3.2), we assessed tolerance to mutation by computing RVIS and ncRVIS percentiles for all autosomal protein-coding genes predicted to suffer LOF in at least one individual. LOF was presumed if a transcript that was well-formed in the reference was predicted in the individual’s genome to suffer NMD (69% of all predicted LOF cases), to lack a start or stop codon (8% of cases), to have a disrupted splice site in a terminal exon with no viable alternative splice forms (2% of cases), or to encode a protein differing by at least 50% of its amino acids from the reference protein (21% of cases).
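The four LOF criteria listed above can be sketched as a small Python classifier. The order in which the criteria are tested, the argument names, and the use of percent sequence match (so that a protein differing by at least 50% has a match of at most 50%) are assumptions made for illustration. The text does not state how ACE assigns a transcript that meets more than one criterion.

```python
def lof_reason(nmd, has_start, has_stop, unresolved_splice_disruption,
               protein_match_pct):
    """Return the first LOF criterion met by a transcript, or None.

    unresolved_splice_disruption: True if a disrupted splice site has no
    viable alternative splice form.
    protein_match_pct: percent sequence match to the reference protein.
    """
    if nmd:
        return "NMD"
    if not (has_start and has_stop):
        return "missing start or stop codon"
    if unresolved_splice_disruption:
        return "disrupted splice site without viable alternative"
    if protein_match_pct <= 50.0:
        return "large protein change"
    return None
```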

Loss-of-function predictions were enriched for genes tolerant to mutation according to both RVIS and ncRVIS scores. The distribution of RVIS percentiles for homozygous LOF genes was highly biased toward genes tolerant to mutation (higher RVIS scores), as expected (Figure 15B). The observed distribution differs significantly from the distribution of all RVIS scores (Figure 42A) (median = 80th percentile, versus 50th percentile for all genes; Wilcoxon rank-sum test: W = 7378700, P < 2.2×10^-16). Random sets of genes having similar lengths, numbers of exons, or G+C nucleotide composition resulted in distributions that could not be distinguished from uniform (Wilcoxon rank-sum, all P > 0.6; Figure 42B-E). The bias toward tolerance was significantly higher for homozygous LOF genes than for heterozygous LOF (W = 2214700, P < 2.2×10^-16; Figure 43A-B), though heterozygous LOF genes were also significantly enriched for tolerance (median = 62nd percentile; W = 53451000, P < 2.2×10^-16). Percentiles for ncRVIS were also significantly biased toward tolerance to mutation in these genes (homozygous: median = 59th percentile, W = 6032200, P = 5.5×10^-11; heterozygous: median = 56th percentile, W = 48724000, P < 2.2×10^-16), and that bias was again higher for homozygotes than heterozygotes (W = 1812500, P = 0.005; Figure 43C-D).

Because RVIS and ncRVIS scores are assigned to genes rather than to individual isoforms, they may not indicate intolerance levels for every isoform equally. Indeed, genes with a predicted homozygous LOF in at least one individual for at least one isoform that are classified as intolerant to variation (RVIS percentile < 0.20) were found to have significantly elevated numbers of isoforms compared to all of GENCODE (Wilcoxon rank-sum, W = 1932000, P < 2.2×10^-16) (Figure 44). This observation is consistent with the possibility that the gene-level intolerance detected by RVIS might not indicate intolerance for the particular isoforms found to suffer LOF in these samples. Indeed, among the LOF predictions in 1000 Genomes Project samples, a majority of the genes were predicted to suffer LOF in some, but not all, of their isoforms (mean proportion among individuals was 0.59, SD = 0.03; Figure 39B), indicating that many LOF variants do not affect all isoforms equally.

3.3.5 ACE aids interpretation of insertion and deletion variants within genes

Insertions and deletions of short sequences can substantially alter gene structures, through their effect on translation reading frames, splice sites, or start or stop codons. Proper interpretation of such variants requires analysis of the resulting sequence within the context of the correct gene structure.


For example, in the CTU2 gene (Ensembl gene ENSG00000174177), which is involved in post-transcriptional modification of transfer RNAs, variant rs11278302 deletes an entire donor splice site (Figure 16A), suggesting a possible effect on splicing. Indeed, the Ensembl variant effect predictor, VEP (McLaren et al., 2016), classifies this common variant (minor allele frequency in 1000 Genomes Project samples = 0.22) as having “high impact” (Figure 45A). However, ACE discovers that the resulting sequence after deletion contains a valid donor consensus at the same location relative to the preceding exon, that the new splice site scores more highly under the donor-site PWM than the original donor site (-18.82 versus -19.77), and that the coding sequence remains unchanged, producing an identical protein. Furthermore, while sample HG00096 is homozygous for the alternate allele, TopHat 2 assigns 33 and 35 spliced reads respectively to the new splice junctions in the two haplotypes, consistent with ACE’s predictions.

An important class of insertion/deletion variants are frameshift mutations, which are insertions or deletions of a length not divisible by three in a coding sequence. These have the potential to radically alter encoded proteins by shifting the reading frame. Frameshifts typically induce premature in-frame stop codons resulting in truncated proteins and, often, a reduction in transcript levels via NMD. In the 1000 Genomes Project population, frameshifts were the largest contributor to predictions of NMD, accounting for 60% of predicted cases. Frameshifts were also the largest contributor to LOF predictions stemming from large protein changes, accounting for 71% of cases.

When multiple frameshifts are present in a coding segment, however, their combined effect may be less severe than the predicted effect of any one frameshift if a downstream variant restores the original reading frame. Because ACE analyzes sequences after simultaneously applying all variants present, combinations of frameshifts that mutually cancel each other by restoring the original reading frame can be detected.
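The logic of compensation can be illustrated with a short Python sketch that tracks the running reading frame across a set of indels and reports the intervals over which the frame is shifted. It is a simplification offered under stated assumptions: the function name and the (position, length change) representation are hypothetical, positions are coding-sequence coordinates, and the sketch ignores any premature stop codon that might arise inside a shifted interval.

```python
def shifted_intervals(indels):
    """Return CDS intervals over which the reading frame is shifted.

    indels: iterable of (cds_pos, length_change), with insertions positive
    and deletions negative. An interval whose end is None marks a frameshift
    that is never compensated by a downstream variant.
    """
    intervals = []
    frame, start = 0, None
    for pos, delta in sorted(indels):
        new_frame = (frame + delta) % 3
        if frame == 0 and new_frame != 0:
            start = pos                     # frame leaves the original phase
        elif frame != 0 and new_frame == 0:
            intervals.append((start, pos))  # frame restored at this variant
            start = None
        frame = new_frame
    if frame != 0:
        intervals.append((start, None))
    return intervals

# Toy example: a 1 nt deletion compensated 6 nt downstream by a 2 nt deletion,
# followed by an in-frame 3 nt deletion.
TOY_INDELS = [(100, -1), (106, -2), (110, -3)]
```

For the toy input only positions 100 to 106 are read out of frame, so at most a few codons change. Analyzing each variant in isolation would instead report two frameshifts extending to the end of the coding sequence.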

One example of compensatory frameshifts detected by ACE occurs in the ZFPM1 gene (ENSG00000179588), which plays a key role in erythroid differentiation. Within the coding segment of this gene are three common deletion variants, all within 10 nt of each other (Figure 16B). The first two deletions (rs67712719, rs67322929) induce frameshifts, while the third (rs67873604) maintains the reading frame. Either rs67712719 or rs67322929 in isolation would result in premature termination and a large change to the amino acid sequence (Figure 46). Consequently, VEP classifies both rs67712719 and rs67322929 as having “high impact” (Figure 45B). However, rs67712719 and rs67322929 commonly occur together in the 2504 1000 Genomes Project samples (4869 / 5008 = 97% of haplotypes), and the combination results in only two amino acid changes, as rs67322929 corrects the reading frame change introduced by rs67712719; the three variants together modify only four amino acids, due to their mutual proximity.


Figure 16: (A) Deletion of an entire splice site (top: hg19 reference sequence; bottom: haplotypes 1 and 2 of 1000 Genomes Project sample HG00096). The resulting allele appears to retain a functional splice site despite the deletion, as concluded by ACE and supported by spliced RNA-seq reads. (B) Compensatory frameshift variants: the second variant corrects the change to the reading frame introduced by the first variant (top: hg19 reference sequence, bottom: haplotype 2 of 1000 Genomes Project sample HG00096).

Every individual in the 1000 Genomes Project sample harbored one or more (median = 7 per individual) compensatory frameshifts affecting ≤30 amino acids. In this sample, the observed lengths of affected intervals (in amino acids) are very short on average (Figure 47), with a median length of only 1 aa, as compared to a null expectation of 260 aa for uniformly random, non-compensated frameshifts (Figure 48). This bias toward short affected lengths may reflect selection against large functional changes in proteins.

3.3.6 ACE accurately reconstructs human blood-group alleles at the ABO locus

The human ABO gene (ENSG00000175164) is responsible for human blood types. It encodes a glycosyltransferase that modifies carbohydrate content of red blood cell antigens, with the A allele producing the A antigen, the B allele the B antigen, and the O allele being non-functional (Yamamoto et al., 1990). In the non-functional O allele, a deletion of a single guanine in exon 6 creates a frameshift resulting in an in-frame stop codon in the same exon, so that only alleles A and B have a seventh coding exon.

The ABO locus is highly diverse in human populations and has assembly issues in both the GRCh37 and GRCh38 human reference genomes. The annotated allele in GRCh37 was the result of improper assembly of two different O alleles, while GRCh38 combined A and O alleles, producing a sequence identical to the known O1.01 allele (Yamamoto et al., 1990; Yip, 2002). Both assemblies now contain a patch as an alternate contig that represents an A allele. GENCODE version 19, the reference annotation for all of our analyses, annotates this gene as a processed transcript, and identifies no reading frame.

In 1000 Genomes Project sample HG00096, ACE identifies a start codon and open reading frame in both haplotypes (Figure 17A), and proposes that the gene might be mis-annotated as noncoding. In haplotype 1 ACE identifies a coding gene structure that precisely matches the known O allele. In haplotype 2 ACE identifies a structure matching both the A and B alleles; translation of this structure reveals that the amino acid sequence is identical to the known B allele (Yamamoto et al., 2014). Thus, ACE has identified this individual as heterozygous for the O and B alleles, and therefore as likely having blood type B.


As noted in section 3.1, applying a state-of-the-art gene finder to this locus yields a very different result. This stark difference highlights the importance of ACE’s method of modeling splicing decisions as independent of downstream translation effects when analyzing gene structures in re-sequencing data.

3.3.7 ACE identifies complex gene-structure changes in a plant gene influencing flavor and nutritional content

The waxy gene in domestic rice provides a test case for ACE’s ability to discover complex alterations to gene structure involving simultaneous changes to both splicing patterns and translation reading frames. Different alleles of waxy produce different ratios of amylose to amylopectin, leading to very different tastes and textures.

Moreover, as these polysaccharide starches result in substantially different glycemic indices, their relative expression in different rice varieties has nutritional relevance.


Figure 17: (A) Blood-group alleles of the ABO gene (ENSG00000175164). Black: coding segment; gray: untranslated region (UTR). Reference genome hg19 has the O allele; GENCODE version 19 annotates this gene as a processed transcript with no reading frame. ACE identifies the coding segment for the O and B alleles in heterozygous individual HG00096. (Coordinates have been transformed and mapped to the forward strand). (B) Complex differences in gene structure between alleles of the waxy gene in rice, due to a single G-to-T variant in a donor splice site. ACE

detects a 1 nt shift in the donor splice site in the Wxb allele, resulting in a new start codon straddling the first intron. The new start codon alters the reading frame, leading to a premature stop codon and NMD.

We provided ACE with the annotated Wxa allele as reference annotation and

projected this to the Wxb allele (Figure 17B) using variants provided by the 3000 Rice

Genomes Project. ACE recognizes that the G to T substitution caused by variant


id12648080 causes a disruption to the donor splice consensus at the end of the first exon in the 5’ untranslated region. It then scans for and detects a new splice site scoring above PWM threshold in the vicinity of the annotated site; the new site is 1 nt upstream of the annotated site. This 1 nt shift in the donor site results in a new splice junction in which an A at the end of the first exon joins with a TG at the beginning of the second exon. ACE recognizes the spliced ATG as a valid start codon consensus. Together with its flanking bases this putative start codon scores above PWM threshold. ACE then

proposes that the Wxb allele preferentially begins translation at this upstream start codon, and traces the resulting open reading frame, finding that it ends in a premature stop codon resulting in a prediction of NMD.

These conclusions match current understanding of how the Wxb allele functions

(Cai et al., 1998; Isshiki et al., 1998; Tian et al., 2009). The differences between Wxa and

Wxb would be particularly challenging for a traditional gene finder to identify, as gene finders based on generalized hidden Markov models (GHMMs) (section 2.1.2.2) utilize discrete states to represent multi-nucleotide features such as start codons. GHMM-based gene finders are therefore unable to predict a start codon straddling an intron using standard decoding algorithms (e.g., Majoros et al., 2005). The approach taken by ACE to separate modeling of transcription from translation simplifies the task because splicing decisions are made first. Only after introns are removed does ACE apply the ribosome


scanning model to search for a start codon. In this way, ACE more closely models the way splicing and translation are believed to occur in the eukaryotic cell.
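The order of operations described above can be illustrated with a toy sketch in Python. The sequence, coordinates, and function names are hypothetical and are not taken from ACE's implementation, and the PWM scoring of the flanking bases is omitted. The sketch only shows that an ATG split by an intron becomes visible once introns are removed before the scan for a start codon.

```python
def splice(pre_mrna, exons):
    """Concatenate exon intervals (0-based, half-open) into a spliced transcript."""
    return "".join(pre_mrna[begin:end] for begin, end in exons)


def first_start_codon(sequence, consensus=("ATG",)):
    """Return the index of the first start codon consensus, or -1 if none exists."""
    for i in range(len(sequence) - 2):
        if sequence[i:i + 3] in consensus:
            return i
    return -1


# Toy pre-mRNA: exon 1 ends in "A", the intron is GT...AG, exon 2 begins with "TG".
exon1 = "CCTCA"
intron = "GTAAGTTTTCAG"
exon2 = "TGCCGTAA"
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

transcript = splice(pre_mrna, exons)
unspliced_hit = first_start_codon(pre_mrna)   # no ATG in the unspliced sequence
spliced_hit = first_start_codon(transcript)   # ATG formed across the splice junction
```

In this toy case the scan of the unspliced sequence finds nothing, whereas the scan of the spliced transcript finds the start codon at the junction, which is the situation described for the Wxb allele.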

3.4 Discussion

The accurate detection and interpretation of gene structure differences in the genomes of individuals or strains is an important and unsolved problem, with clear relevance to genetic studies of disease and other phenotypes. As we have shown, individual variants disrupting splice sites or reading frames do not necessarily result in

LOF. Correct disambiguation of the effects of such variants, particularly within the context of individual genomes harboring combinations of variants that may interact, has the potential to substantially reduce false positives in burden testing. We have also demonstrated that traditional gene-finding models are not suited for such applications without modification, as such models make assumptions incongruous with the task of detecting possibly deleterious changes that violate conservation patterns in genes.

Here we have proposed an alternate framework for identifying and interpreting gene structure changes, in which the potentially deleterious downstream effects of changes to gene structure are not considered when proposing such changes. By withholding information regarding possible downstream effects when considering changes to gene structures, we enable ACE to identify changes that may result in a loss or change of function, and to do so in a minimally biased manner. Because ACE has very few parameters, it is more readily applicable to other species than traditional gene


finding models that utilize tens of thousands of parameters and need to be retrained for each new species (Korf, 2004). Moreover, when the intended application is to provide plausible novel gene structures to an RNA-seq pipeline, the use of a minimally biased approach favoring sensitivity over specificity may be desirable, though as noted previously, interpretation of transcript abundance estimates for these putative isoforms should be undertaken with caution, as existing methods of quantitation still leave much room for improvement (Hayer et al., 2015).

The example of the ABO gene is particularly instructive, as it demonstrates a case of different gene structures in different individuals with different and medically important phenotypes (blood type). As we have shown, state-of-the-art de novo gene finders have difficulty correctly identifying the gene structures of individual alleles of this gene. In the case of the O allele, which has fewer coding exons than the A and B alleles, there is a potential for misinterpretation of variants occurring in the gene. As the final coding exon of the A and B alleles is not present in the O allele, correct interpretation of variants in that exon depends on knowing which allele is present in an individual. Furthermore, as the O allele is likely nonfunctional, accumulation of variants in that allele is likely underway (Yamamoto et al., 1990) and may lead to false positives in identification of deleterious variants when incorrect annotations are used.

The waxy gene in rice provides another example of allelic differences in gene structure precipitated by a simple sequence variant. We speculate that there may be numerous


other genes in which the correct interpretation of variants differs between alleles in a way that depends on knowing the correct gene structure for each allele.

ACE’s predictions of loss of function in the 1000 Genomes Project samples are highly enriched for genes tolerant of functional mutation, indicating a low false positive rate for identification of loss-of-function alleles. Furthermore, our analyses of the

Geuvadis data have confirmed that the nonsense-mediated decay pathway in humans typically does not result in complete loss of transcripts, but rather achieves a quantitative reduction on the order of a halving, albeit with much variation, often leaving many copies of NMD target isoforms undegraded. Such transcripts escaping degradation will encode truncated proteins that, if they escape further checkpoints during folding, can in some cases result in deleterious gain of function and poison products (Balasubramani et al., 2015). Because ACE reports truncation products for all putative NMD targets, downstream analyses can use them to infer deleterious effects directly or via association with phenotypes.

There is much room for enhancement of our method, for example through detailed modeling of the splicing regulatory landscape and its influence on splice site selection (e.g., Rosenberg et al., 2015; Chapter 4). It is also important to note that the accuracy of ACE’s predictions depends on the accuracy of genotype phasing. In the case of the 1000 Genomes Project data used here, much effort has gone into ensuring that data are accurately phased (Delaneau et al., 2014). As sequencing costs decrease and


read lengths increase, we expect phasing accuracy to continue to improve in newer resequencing studies, which will further increase ACE’s accuracy.

The use of phased haplotypes is important for joint interpretation of variants that may interact in cis, as highlighted recently by Lek et al. (2016) using exome sequencing data from ~60,000 individuals. Those authors reported an average of 23 multinucleotide polymorphisms (multiple variants that affect the same codon) per individual, and lamented the lack of tools that can interpret variants in the context of a haplotype. The true mean number of compensatory variants will likely be higher than 23 when other compensatory mechanisms are considered, including frame-restoring indels and generation and/or use of alternate splice sites. These scenarios support a shift away from variant-centric analysis pipelines to tools such as ACE that generate haplotype-aware gene annotations as a way of understanding genetic variation in populations.

In summary, ACE represents an initial attempt at modeling gene structure differences among the individuals of a single species, using a novel approach that makes fewer assumptions than traditional gene-finding techniques. The abundance of human splice sites with possible robustness in the form of alternate splicing solutions that result in minimal changes to the encoded protein suggests that ACE may have ample opportunities to reduce false positives in disease studies in which splicing defects are identified but have unknown significance. ACE is equally applicable to identifying differences between lines of economically important animal or crop species, and it may


have utility for RNA-seq analyses and for detecting possible gain-of-function variants in cancer genomes. The design of ACE’s computational model makes it directly applicable to nonhuman species with minimal re-training, enabling studies of other model and non-model animal and plant species.

3.5 Supplementary methods

3.5.1 Efficient reconstruction of haplotype sequences

In order to enable processing of large numbers of samples efficiently, ACE utilizes indexed VCF and sequence files for rapid extraction of genic intervals for analysis. The program twoBitToFa from the KentUtils package

(https://github.com/ENCODE-DCC/kentUtils) is used to extract the reference sequence for a gene from a 2bit genome file. Variants within the gene are then extracted from a phased, indexed VCF file, using tabix. The latter results in a matrix listing the variants as rows and samples (individuals) as columns; this matrix is transposed to enable rapid processing by individual rather than by variant. The variants are then applied to the reference sequence to produce an individualized sequence in memory, one individual at a time. Once complete, each sequence is written to a FASTA file before the next individual is processed, to avoid keeping many sequences in memory when the VCF file contains many samples. This process can be parallelized by running it in batches (see below) in order to process large numbers of samples with efficient memory usage.


Application of variants to transform the reference sequence into an individualized sequence is done as follows. The portion of the reference sequence up to the first variant is copied into the individualized sequence. Then the variant is applied

(see below), and the following portion of the reference up to the next variant is appended, and the process is repeated until the end of the reference is reached. Thus, the individualized sequence is progressively grown rather than being modified in-place via insertion and deletion operations that would result in large numbers of copy operations. This results in a linear rather than a quadratic time operation. For substitutions (such as SNPs) and insertions, the appropriate allele is appended to the individualized sequence. For deletions, nothing is appended. Short copy-number variants are converted to substitution/insertion/deletion variants based on the number of copies indicated in the reference and alternate alleles.
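A minimal sketch of this append-only construction is shown here in Python. It is an illustration under stated assumptions rather than ACE's actual code: the function name and the tuple layout (0-based position, reference allele, alternate allele) are hypothetical, variants are assumed to be sorted and non-overlapping, and copy-number variants are assumed to have been converted to substitutions, insertions, or deletions beforehand.

```python
def apply_variants(reference, variants):
    """Build an individualized sequence by appending pieces from left to right.

    variants: sorted, non-overlapping (pos, ref_allele, alt_allele) tuples,
    with pos 0-based. This layout is an assumption made for illustration.
    """
    pieces = []
    cursor = 0  # next reference position not yet copied
    for pos, ref_allele, alt_allele in variants:
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged stretch before the variant
        pieces.append(alt_allele)             # shorter than ref_allele for a deletion
        cursor = pos + len(ref_allele)
    pieces.append(reference[cursor:])         # remainder after the last variant
    return "".join(pieces)


reference = "ACGTACGTAC"
variants = [
    (1, "C", "T"),      # substitution
    (4, "A", "AGG"),    # insertion of GG after the anchor base
    (6, "GTA", "G"),    # deletion of TA after the anchor base
]
individualized = apply_variants(reference, variants)
```

Each reference base is copied once and the pieces are joined at the end, so the work grows linearly with sequence length; repeated in-place insertions and deletions would instead shift the tail of the sequence for every indel.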

In some VCF files, overlapping variants are called on the same haplotype, possibly due to genotyping or phasing errors, or due to merging of variant lists from multiple sources. Variants that overlap in the same haplotype are first disambiguated by forming the transitive closure of the overlap relation and taking the longest variant (by length of the reference allele). If all overlapping variants are consistent with the longest variant (in the sense that the longest variant subsumes the individual effects of each overlapping variant), only the longest variant is applied and the sequence is marked with a “VCF Warning” on the defline to indicate that overlapping variants were encountered and successfully resolved. Examples of inconsistent variants include: multiple insertions at the same reference location; an insertion within a deletion; two SNPs at the same position calling for different alternate bases; deletions that only partially overlap (so that neither is completely enclosed within the other). If any overlapping variants are inconsistent, none of the variants in the overlap set are applied, and the sequence is marked on the defline with a “VCF Error”. These warnings and errors enable different levels of filtering downstream, as they are copied to the annotation output after genes are mapped to the resulting sequences.
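The grouping step can be sketched as follows in Python. For intervals on a line, the transitive closure of the overlap relation can be obtained by sorting on start position and merging any variant that begins before the current group ends. The function names and tuple layout are hypothetical, and the consistency check between the longest variant and the others in its group is deliberately left out of this sketch.

```python
def overlap_groups(variants):
    """Group variants whose reference spans overlap, directly or transitively.

    variants: (pos, ref_allele, alt_allele) tuples with 0-based pos.
    """
    groups = []
    group_end = -1  # end (exclusive) of the reference span covered by the current group
    for variant in sorted(variants, key=lambda v: v[0]):
        pos, ref_allele, _ = variant
        if groups and pos < group_end:
            groups[-1].append(variant)   # overlaps something already in the group
        else:
            groups.append([variant])     # starts a new group
        group_end = max(group_end, pos + len(ref_allele))
    return groups


def longest_variant(group):
    """Pick the variant with the longest reference allele."""
    return max(group, key=lambda v: len(v[1]))


variants = [
    (30, "C", "G"),
    (10, "ACGT", "A"),
    (12, "G", "T"),
    (13, "TCC", "T"),
]
groups = overlap_groups(variants)
representatives = [longest_variant(g) for g in groups]
```

Here the variant at position 13 does not end inside the variant at position 10 but still joins its group, because it begins before that variant ends; this chaining is what the transitive closure captures.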

When a variant has reference and alternate alleles with different lengths, the difference in length is appended as an insertion (I) or deletion (D) element in a CIGAR string (The SAM/BAM Format Specification Working Group, 2015), whereas matching or mismatching nucleotides are indicated as M elements. The CIGAR string is stored on the defline of the resulting FASTA file, and enables later mapping of annotations from the reference to the individualized sequence (see below). CIGAR strings are copied to the annotation report to enable further analysis by users if desired. The locations of all variants applied to the sequence are given, both in the reference coordinate system and in the individualized sequence coordinate system. Variants are listed on the defline of the FASTA file, and are copied to the annotation output files for additional downstream analyses by users if desired.
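One way to derive such a CIGAR string from a list of applied variants is sketched here in Python. The function name and tuple layout are hypothetical. The sketch assumes sorted, non-overlapping variants and treats the shared leading length of the two alleles as M, consistent with the statement above that both matching and mismatching nucleotides are M elements.

```python
def build_cigar(ref_length, variants):
    """Return a CIGAR string relating the reference to the individualized sequence.

    variants: sorted, non-overlapping (pos, ref_allele, alt_allele), pos 0-based.
    """
    ops = []  # list of [length, operation], adjacent equal operations merged

    def push(length, op):
        if length <= 0:
            return
        if ops and ops[-1][1] == op:
            ops[-1][0] += length
        else:
            ops.append([length, op])

    cursor = 0
    for pos, ref_allele, alt_allele in variants:
        push(pos - cursor, "M")                    # unchanged stretch
        shared = min(len(ref_allele), len(alt_allele))
        push(shared, "M")                          # match or mismatch
        push(len(alt_allele) - shared, "I")        # extra bases in the alternate allele
        push(len(ref_allele) - shared, "D")        # reference bases that were removed
        cursor = pos + len(ref_allele)
    push(ref_length - cursor, "M")
    return "".join(f"{length}{op}" for length, op in ops)


cigar = build_cigar(10, [(1, "C", "T"), (4, "A", "AGG"), (6, "GTA", "G")])
```

For this toy input the M and D lengths sum to the reference length (10) and the M and I lengths sum to the individualized length (also 10), which is a useful sanity check on any CIGAR produced this way.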


Processing one gene at a time enables parallelization on cluster compute systems by running multiple jobs simultaneously, each processing different sets of genes. Gene coordinates are extracted from a GTF (GFF2) file. Parallelization into N processes is thus easily achieved by segmenting a GTF file into N gene sets to be processed on N different compute nodes.
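The segmentation of a GTF file into N gene sets might look like the following Python sketch. The function name and the round-robin assignment are assumptions made for illustration; the text above does not specify how genes are distributed among jobs.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def partition_gtf_by_gene(gtf_lines, n_jobs):
    """Split GTF records into n_jobs batches, keeping each gene's records together."""
    records_by_gene = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        match = GENE_ID.search(fields[8])  # attributes are the ninth GTF column
        if match is None:
            raise ValueError("GTF record lacks a gene_id attribute")
        records_by_gene.setdefault(match.group(1), []).append(line)
    batches = [[] for _ in range(n_jobs)]
    # Round-robin assignment; dicts preserve insertion order in Python 3.7+.
    for index, records in enumerate(records_by_gene.values()):
        batches[index % n_jobs].extend(records)
    return batches


gtf = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t900\t950\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
    'chr2\ttoy\texon\t50\t80\t.\t+\t.\tgene_id "g3"; transcript_id "t3";',
]
batches = partition_gtf_by_gene(gtf, 2)
```

Each batch can then be written to its own GTF file and submitted as a separate job.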

3.5.2 Mapping annotations to individualized sequences

The positions of insertions and deletions as indicated in the CIGAR string result in a mapping from positions in the reference sequence to positions in the individualized

(alternate) sequence. This mapping is used to map reference annotations to the individualized sequence. For reference positions that have been deleted in the individualized sequence, the corresponding features are mapped to the nearest position that exists in the individualized sequence. For spliced exons, the first (most 5’) base of the splice site consensus of each splice site is first mapped and then used to identify the associated mapped exon coordinate. For start codons, the first (most 5’) base of the start codon is mapped. If a valid start codon consensus (as listed in the configuration file) is not found at the mapped position, a scan is made downstream (3’) of the mapped position for a start codon consensus that scores above the PWM threshold (see below).

Start codons upstream (5’) of the mapped location are considered only if such an upstream start codon either did not exist in the reference or scored below the PWM


threshold in the reference and scores above the PWM threshold in the individualized sequence.
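The coordinate mapping implied by a CIGAR string can be sketched in Python as follows. The function names are hypothetical. For a deleted reference position, this sketch returns the last surviving base upstream of the deletion; that is only one possible reading of "nearest position" and is an assumption, not a description of ACE's behavior.

```python
import re


def parse_cigar(cigar):
    """Turn a string such as '5M2I2M' into [(5, 'M'), (2, 'I'), (2, 'M')]."""
    return [(int(length), op) for length, op in re.findall(r"(\d+)([MID])", cigar)]


def map_position(ref_pos, cigar):
    """Map a 0-based reference position to the individualized sequence."""
    ref_cursor = 0
    alt_cursor = 0
    for length, op in parse_cigar(cigar):
        if op == "M":
            if ref_pos < ref_cursor + length:
                return alt_cursor + (ref_pos - ref_cursor)
            ref_cursor += length
            alt_cursor += length
        elif op == "I":
            alt_cursor += length          # bases present only in the individual
        else:  # "D": bases present only in the reference
            if ref_pos < ref_cursor + length:
                return max(alt_cursor - 1, 0)   # assumption: base just upstream
            ref_cursor += length
    raise ValueError("position lies beyond the end of the reference")


cigar = "5M2I2M2D1M"
mapped = [map_position(p, cigar) for p in (0, 5, 7, 9)]
```

With this CIGAR, reference position 5 shifts right by two because of the upstream insertion, position 7 falls inside the deletion and maps to the preceding base, and position 9 returns to its original coordinate because the insertion and deletion have equal length.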

3.5.3 Predicting loss of function

ACE identifies multiple classes of events that can be later filtered and interpreted as to the potential for loss of function. ACE provides detailed information pertaining to each event so that downstream inferences may be more or less conservative in inferring loss of function. For in-frame stop codons occurring earlier than the annotated stop codon, the distance of the stop codon to the most 3’ exon junction (where distance is measured in nucleotides on the spliced transcript) is compared to a minimum distance given in a configuration file (default: 50 nt) to detect nonsense-mediated decay. The actual distance to the most 3’ exon junction is reported to enable further filtering downstream if desired. The number of amino acids truncated from the resulting protein is also reported to enable further downstream filtering or analysis. Any change to the protein sequence triggers an alignment of the reference and alternate proteins (section

3.5.6). The amino acid sequences are reported, as well as the proportion of amino acids which are non-identical in the optimal alignment (as a proportion of the reference protein length), enabling downstream filtering or analysis. When a splice site is disrupted, if ACE is unable to identify a feasible alternative splice form, a prediction of

“no transcript” is given. For protein-coding genes, if no start codon or no stop codon can be found consistent with the annotated reading frame, these are also noted.
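The distance test for nonsense-mediated decay can be sketched in Python as follows. The function name is hypothetical, and two details are assumptions: the distance is measured from the end of the stop codon to the last exon junction, and a distance equal to the minimum counts as NMD. The default of 50 nt is the configuration default given above.

```python
def nmd_call(stop_start, exon_lengths, min_distance=50):
    """Return (is_nmd, distance) for a premature stop codon.

    stop_start: 0-based index of the first base of the stop codon on the
    spliced transcript. exon_lengths: exon lengths in transcript order.
    """
    if len(exon_lengths) < 2:
        return False, None                     # no exon junction exists
    last_junction = sum(exon_lengths[:-1])     # transcript offset of the last junction
    distance = last_junction - (stop_start + 3)
    return distance >= min_distance, distance


exons = [100, 120, 200]                    # last junction at transcript offset 220
far_upstream = nmd_call(90, exons)         # stop well upstream of the junction
near_junction = nmd_call(180, exons)       # stop within 50 nt of the junction
single_exon = nmd_call(90, [500])          # no junction, so no NMD call
```

Returning the distance alongside the call mirrors the reporting described above, so that a downstream filter can apply a stricter or looser cutoff.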


For the analyses reported in the foregoing sections, loss of function was inferred if any of the following were detected: NMD; loss of a splice site in which no feasible alternate splice form could be found; lack of a start codon or stop codon for protein-coding genes; protein change greater than a given threshold (50% of the protein in the analysis of genic intolerance, or 10 amino acids for the genome-wide scan for splice sites potentially robust to disruption).

3.5.4 Probability models

ACE offers several different probability models for scoring splice sites and start codons, including positional weight matrices (PWMs) with or without dependencies between columns of the matrix, and maximal dependence decomposition (MDD) (Burge, 1997), which utilizes a decision tree for detecting and modeling dependencies. For the experiments reported here, PWMs were used with either no dependencies or 1st order dependencies. For start codons, a 0th order (no dependencies) PWM capturing six nucleotides 5’ of the start codon and three nucleotides 3’ of the start codon was used.

Different splice-site models were used for different classes of G+C content, or isochores. Following Burge (1997), for the human experiments we defined four isochores based on G+C content: 0-43%, 43-51%, 51-57%, and 57-100%. Models were trained on annotated splice sites from chromosome 1. Dependence order and amount of

5’ and 3’ context (positions flanking the splice site consensus) were chosen to maximize


discriminative accuracy when applied to annotated sites versus non-annotated sites in flanking introns (see Table 3.1).

Table 3.1: Features of splice site models for different classes of G+C content. Order indicates number of previous positions on which the current position is conditioned in the PWM.

Isochore  Site type  Order  nt 5’ of site  nt 3’ of site
0-43%     donor      0      3              10
0-43%     acceptor   1      15             2
43-51%    donor      0      6              10
43-51%    acceptor   0      24             2
51-57%    donor      0      3              12
51-57%    acceptor   0      20             1
57-100%   donor      0      6              10
57-100%   acceptor   0      15             2
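Selecting a splice-site model by G+C content can be sketched in Python as follows. The model features are copied from Table 3.1; the function names, the choice of sequence over which G+C is measured, and the treatment of values falling exactly on a class boundary (assigned to the lower class here) are assumptions made for illustration.

```python
import bisect

ISOCHORE_BREAKS = [43.0, 51.0, 57.0]
ISOCHORE_LABELS = ["0-43%", "43-51%", "51-57%", "57-100%"]

# (order, nt 5' of site, nt 3' of site), as listed in Table 3.1.
SPLICE_MODEL_FEATURES = {
    ("0-43%", "donor"): (0, 3, 10), ("0-43%", "acceptor"): (1, 15, 2),
    ("43-51%", "donor"): (0, 6, 10), ("43-51%", "acceptor"): (0, 24, 2),
    ("51-57%", "donor"): (0, 3, 12), ("51-57%", "acceptor"): (0, 20, 1),
    ("57-100%", "donor"): (0, 6, 10), ("57-100%", "acceptor"): (0, 15, 2),
}


def gc_percent(sequence):
    """Percentage of G and C bases in a nonempty sequence."""
    sequence = sequence.upper()
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


def isochore(sequence):
    """Return the isochore label for a sequence, lower class on exact ties."""
    index = bisect.bisect_left(ISOCHORE_BREAKS, gc_percent(sequence))
    return ISOCHORE_LABELS[index]


label = isochore("GCGCGATATA")                          # 50% G+C
features = SPLICE_MODEL_FEATURES[(label, "acceptor")]   # model settings for that class
```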

Scripts are provided for retraining PWMs for use on other organisms. Given a

GTF file of annotated genes and a FASTA file of genomic sequence, the scripts extract examples of splice sites and start codons and estimate parameters for PWMs. PWM thresholds are calculated to achieve a required sensitivity on the training set.
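The estimation of a 0th order PWM and of a threshold that achieves a required sensitivity on the training set can be sketched in Python as follows. This is not the retraining script distributed with ACE: the function names, the log-odds form of the scores, the uniform background, and the pseudocount are all assumptions made for illustration, and the training sites are toy data.

```python
import math

ALPHABET = "ACGT"


def train_pwm(sites, background=0.25, pseudocount=1.0):
    """Estimate a 0th order log-odds PWM from equal-length example sites."""
    pwm = []
    for column in range(len(sites[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log(counts[base] / total / background)
                    for base in ALPHABET})
    return pwm


def score(pwm, window):
    """Sum the per-column scores of a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def threshold_for_sensitivity(pwm, sites, sensitivity):
    """Highest threshold that still accepts the required fraction of training sites."""
    scores = sorted((score(pwm, site) for site in sites), reverse=True)
    needed = max(math.ceil(sensitivity * len(scores)), 1)
    return scores[needed - 1]


training_sites = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA", "TTGGTATGT", "GAGGTAAGC"]
pwm = train_pwm(training_sites)
threshold = threshold_for_sensitivity(pwm, training_sites, 0.8)
accepted = sum(score(pwm, site) >= threshold for site in training_sites)
```

A candidate site in an individualized sequence would then be accepted when its window scores at or above the threshold. For the start codon model described above, the window would span six nucleotides 5’ of the codon, the codon itself, and three nucleotides 3’ of it.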

3.5.5 Configurable parameters

The following parameters are currently configurable in ACE by editing a configuration text file. Default values used in the experiments described in the main manuscript are given.


Table 3.2: Configurable parameters in ACE, with default parameters used in the analyses described here.

Parameter               Default                Description
donor-consensus         GT,AT,AC               Allowable donor consensus sequences.
acceptor-consensus      AG,AC                  Allowable acceptor consensus sequences.
start-codons            ATG                    Allowable start codon consensus sequences.
stop-codons             TGA,TAA,TAG            Allowable stop codon consensus sequences.
max-splice-shift        70 nt                  Maximal distance from an annotated splice site to a cryptic site.
min-exon-length         30 nt                  Minimum exon length resulting from cryptic splicing.
min-intron-length       30 nt                  Minimum intron length resulting from cryptic splicing.
allow-cryptic-sites     yes                    Whether to enable prediction of cryptic splice sites.
allow-exon-skipping     human: yes; rice: no   Whether to enable prediction of exon skipping.
allow-intron-retention  human: no; rice: yes   Whether to enable prediction of intron retention.
gap-open-penalty        5                      Gap open penalty for protein alignment (see below).
gap-extend-penalty      10                     Gap extend penalty for protein alignment.
bandwidth               50                     Bandwidth for banded protein alignment.
subst-matrix            PAM10                  Amino acid substitution matrix for protein alignment.
min-orf-length          150 nt                 Minimum length of open reading frames during detection of upstream ORFs or mis-annotation of genes as noncoding.
margin-around-gene      1000 nt                Amount of sequence 5’ and 3’ of annotated gene to reconstruct for analysis.
ploidy                  2                      Ploidy of organism.

3.5.6 Alignment of protein sequences

When reference and alternate protein sequences are found to be non-identical, protein alignment scores are computed using a banded dynamic-programming global alignment algorithm with affine scoring. The affine scoring is parameterized by gap open and gap extend parameters. Reported alignment scores are proportions of identical amino acids, as a percentage of the reference protein length. The default bandwidth, defined as the distance from the diagonal of the alignment matrix, of 50 amino acids was used here. A gap open penalty of 5 and gap extend penalty of 10 were used. The PAM10 substitution matrix was used only for identifying the optimal alignment.
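The reported identity score can be computed from a finished alignment as in the following Python sketch. The alignment itself (banded, affine gaps, PAM10) is not reimplemented here; the function name and the use of "-" as the gap character are assumptions, and the two aligned strings are toy data.

```python
def percent_identity(aligned_ref, aligned_alt, gap="-"):
    """Identical aligned residues as a percentage of the reference protein length."""
    if len(aligned_ref) != len(aligned_alt):
        raise ValueError("aligned sequences must have equal length")
    ref_length = sum(1 for residue in aligned_ref if residue != gap)
    identical = sum(1 for a, b in zip(aligned_ref, aligned_alt)
                    if a == b and a != gap)
    return 100.0 * identical / ref_length


# One gap in the alternate protein and one substitution at the last residue.
identity = percent_identity("MKTAYIAK", "MKT-YIAR")
```

Normalizing by the reference length rather than the alignment length means that insertions in the alternate protein do not dilute the score, whereas deleted or substituted reference residues lower it.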

3.5.7 Alignment and quantification of RNA-seq data

RNA-seq analyses were performed by aligning reads to the individualized genome sequences using TopHat 2 and estimating FPKM values using StringTie. Paired reads were first trimmed using Trimmomatic, using the following parameters:

Table 3.3: Parameters used for trimming RNA-seq reads.

Parameter      Value   Description
HEADCROP       1       remove first base due to potential for bias in random hexamer priming
LEADING        30      trim low-quality bases from 5’ end of read
SLIDINGWINDOW  7:20    trim low-quality bases from 3’ end of read
MINLEN         40      delete the read after trimming if resulting length is too short
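For readers wishing to reproduce the trimming step, the parameters in Table 3.3 correspond to the following Trimmomatic paired-end invocation, assembled here as a Python argument list. The jar path, file names, and helper name are hypothetical, and the order of the trimming steps is assumed to follow the order of the table; the command actually run for these analyses is not given in the text.

```python
def trimmomatic_command(jar, forward_reads, reverse_reads, out_prefix):
    """Build a Trimmomatic paired-end command using the settings in Table 3.3."""
    steps = ["HEADCROP:1", "LEADING:30", "SLIDINGWINDOW:7:20", "MINLEN:40"]
    # Output order expected in PE mode: forward paired, forward unpaired,
    # reverse paired, reverse unpaired.
    outputs = [f"{out_prefix}_{mate}{kind}.fastq.gz"
               for mate in (1, 2) for kind in ("P", "U")]
    return ["java", "-jar", jar, "PE", forward_reads, reverse_reads,
            *outputs, *steps]


command = trimmomatic_command(
    "trimmomatic-0.33.jar",          # hypothetical path to the jar
    "sample_1.fastq.gz",             # hypothetical input files
    "sample_2.fastq.gz",
    "sample_trimmed",
)
```

The list can be passed to `subprocess.run(command, check=True)`; building it as a list rather than a shell string avoids quoting problems with file names.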

TopHat 2 was then used to align reads to a library containing both haplotypes, in case any read aligns better to one haplotype (allele) than the other. Reference annotations mapped to the individualized haplotypes were provided to TopHat 2 as

GTF files, to both improve alignment of unspliced reads and improve detection of novel splice junctions for alignment of spliced reads. StringTie was likewise provided with annotations mapped to the individualized sequences, to improve quantification of known isoforms.

To investigate the degree to which ACE aids TopHat 2 and StringTie in identifying novel transcripts that may result when a splice site is disrupted, two runs of the RNA-seq pipeline (TopHat 2 followed by StringTie) were performed: one in which only annotated isoforms were provided (“without hints from ACE”), and one in which both annotated isoforms and novel isoforms predicted by ACE were provided (“with hints from ACE”).

To investigate the specificity of ACE’s predicted novel splice forms, two runs of the RNA-seq pipeline were again performed. In the first run, annotated splice forms as well as novel splice forms predicted by ACE for splice sites that were actually disrupted were provided as hints to the pipeline. In the second run, annotations and novel splice forms were augmented by simulating splice disruption events at intact splice sites, so


that ACE would generate additional, spurious alternate splice forms. We then analyzed the difference in mapped reads for the spurious alternate splice forms versus the alternate forms generated for actual disrupted sites.

3.5.8 Computing relative expression ratios for NMD targets

The linear mixed-effects model log2(FPKM) = Xβ + Zu was fit using the R package lme4 (version 1.1-12). Only transcripts expressed in at least 30 individuals were considered, and only those for which at least one individual contained a predicted NMD allele and for which at least one individual had no predicted NMD alleles. Transcripts were pre-filtered at a range of minimum FPKM thresholds (applied to mean FPKM across all samples for each transcript), in order to assess stability of estimates at different abundance thresholds. A pseudocount of 1×10^-6 was added to all FPKM values to avoid taking the log of zero. This pseudocount represents the detection limit, as it is the smallest FPKM value that was detected. β values were then transformed into relative abundance ratios, r0/2, as follows:

1-r0/2 thus represents the proportionate reduction in NMD homozygotes relative to individuals with two functional alleles.


3.5.9 Versions of software used

The following versions of software were used in the analyses:

Table 3.4: Versions of all software used.

Software     Version
Augustus     3.0.3
BowTie       2.2.4
KentUtils    302
lme4         1.1-12
R            3.3.1
Samtools     1.1
StringTie    1.2.1
tabix        1.2.1
TopHat       2.0.13
Trimmomatic  0.33
Salmon       0.7.2
Kallisto     0.43.0


Chapter 4 – Variant-aware gene structure prediction in personal genomes

Select text and figures from this chapter are included in modified form in the following manuscript which is currently being prepared for submission to an academic journal:

Majoros WH, Holt C, Campbell MS, Ware D, Yandell M, Reddy TE. Variant-aware gene structure prediction in personal genomes. In preparation.

Author contributions: WHM designed, implemented, and tested the model.

WHM, TER, MY, CH, and MSC wrote the manuscript.

4.1 Motivation

In eukaryotes, messenger RNAs are commonly spliced, to remove intronic sequences that do not encode amino acids, prior to nuclear export and translation.

Failure to properly remove introns can result in large changes to the resulting polypeptide, and/or failure to produce a functional protein. Similarly, splicing errors that splice an RNA at the wrong location can result in loss of existing function or gain of new function. Any of these changes can be deleterious. In humans, 95% of protein-coding genes contain introns that need to be spliced out, and 95% of those intron-bearing genes can be alternatively spliced to produce multiple distinct isoforms.

Splicing errors commonly occur in many genes at a low rate even in healthy individuals (Pickrell et al., 2010; Stepankiw et al., 2015). Genetic variants that directly interrupt normal splicing signals can dramatically increase the production of aberrant splice forms (Královicová et al., 2005; Buratti et al., 2007). While these errors have the potential to be deleterious, in some cases they result in small, benign changes such as the addition or removal of a single amino acid. Moreover, multiple variants in the same haplotype can act non-independently, so that methods that interpret each variant separately can produce incorrect predictions (Majoros et al., 2017; Chapter 3).

Computational methods are needed that can predict how combinations of variants present together on a haplotype affect gene splicing and the encoded protein.

Previous approaches to predicting aberrant splicing have focused on the use of machine-learning models that interpret each single-nucleotide polymorphism (SNP) individually and report only the predicted effect on a single splice site or exon (e.g., Xiong et al., 2015; Mort et al., 2014; Woolfe et al., 2010).

There is thus a need for whole-gene models that can integrate the effects of multiple variants in a haplotype and interpret the resulting splicing patterns as to their likely effect on the encoded protein as a whole (Guigo and Valcárcel, 2015). Ideally such models should be applicable to both single- and multi-nucleotide variants, as insertions and deletions can have large impacts on splicing signals and reading frames. Moreover, while some methods require large training sets of confirmed aberrant splicing cases or splicing variants implicated in disease (e.g., Mort et al., 2014; Woolfe et al., 2010), reliance only on annotated, non-aberrant splice forms for parameter estimation would enable retraining on any species for which an annotated reference genome is available.


Much work has gone into the development of sophisticated methods for whole-gene gene-structure prediction, as documented in the extensive literature on gene finding (e.g., Guigo et al., 1992; Burge and Karlin, 1997; Lukashin and Borodovsky, 1998; Korf et al., 2001; Pachter et al., 2002; Meyer and Durbin, 2004; Allen and Salzberg, 2005; Stanke et al., 2006; reviewed in Majoros, 2007b). These methods jointly model entire sequences and their whole-gene splicing patterns, typically via hidden Markov models (HMMs) or conditional random fields (CRFs). A common assumption of these methods is that genes are well-formed and have conserved function, which is an appropriate assumption when annotating reference genomes. The growing popularity of resequencing studies and personalized genomics has created a need for tools that can accurately annotate personal genomes and the genomes of individual animal and plant breeds. As the emphasis in these studies is typically on identifying genetic differences that may have functional consequences, the assumptions commonly made by traditional gene-finding models are violated (Figure 49; section 4.5).

In this work I describe a probabilistic gene-structure model for annotating personal genomes that explicitly accounts for genetic variants and does not assume that genes are conserved or still functional. The proposed model does not utilize translation reading frames or codon statistics, and is thus applicable to both coding and noncoding genes. Traditional ab initio gene finders rely primarily on signals within coding sequence, in particular codon biases. It has been noted that coding sequences within


eukaryotic protein-coding genes contain other signals in addition to the codons that are normally read in-frame (Itzkovitz et al., 2010). In particular, signals that promote splicing and exon inclusion often overlap coding signals, either in-frame or out-of-frame

(Zhang et al., 2008a; Woolfe et al., 2010). These signals are referred to as splicing enhancers and are believed to serve primarily as binding sites for RNA-binding factors such as SR proteins, which are a family of proteins possessing both an RNA-binding domain and an arginine- and serine-rich domain that mediates protein-protein interactions. Splicing enhancer motifs are found not only proximal to splice sites, but within exon bodies (Woolfe et al., 2010), suggesting that SR proteins bind all along the exon, forming “a network of protein-protein interactions across the exon” (Schneider et al., 2010). This scaffolding of splicing factors across the exon body is believed to mediate the process of exon definition (Robberson et al., 1990; Berget, 1990), whereby U1 and U2 snRNPs associated with the ends of the exon are brought into close spatial proximity, and which is necessary for the exon to be included in the mature transcript (Schneider et al., 2010). Meanwhile, hnRNPs are believed to bind primarily within introns, and effectively mark them for exclusion from the mature transcript. Together, these enhancing and silencing signals enable accurate discrimination of exonic from intronic sequence by the cell (Zhang et al., 2008a).

A number of feature sets comprising scored hexamers or octamers have been proposed to capture these exon-definition signals (Zhang and Chasin, 2004; Zhang et al., 2005; Stadler et al., 2006; Zhang et al., 2008a; Erkelenz et al., 2014). Most recently, a set of hexamer weights determined via massively parallel splicing reporter assays was used to evaluate individual SNPs in human exons for their potential to induce skipping of individual exons (Rosenberg et al., 2015). To our knowledge, such exon-definition features have not previously been incorporated into a whole-gene model of gene structure. To the extent that such features capture exon definition propensities, they should be informative for predicting splicing patterns of whole transcripts. By using these signals instead of codon statistics, the model we propose is applicable to predicting changes to gene structures in non-coding genes, in untranslated regions of coding genes, and in coding regions that are altered by variants that disrupt the reading frame.

Because the model can predict alterations to the reading frame, it can detect changes that may be deleterious. As such, the model is applicable to identifying differences in gene structures between recently diverged lineages, in which such differences may reflect gain or loss of function that has yet to be eliminated by natural selection but which may be of interest in a clinical setting. In contrast, comparative gene-finding models (e.g., Pachter et al., 2002; Meyer and Durbin, 2004; Majoros et al., 2005) compare multiple reference genomes of distinct species, and assume that gene structures are conserved. Finally, because the model we propose can be trained on annotated exons and introns in a reference genome rather than relying on large numbers

of curated examples of aberrant splicing, it can be retrained easily on other species for which an annotated reference genome is available.

4.2 Methods

4.2.1 Splice Graph Random Field

Gene-structure prediction methods (reviewed in Majoros, 2007b) are typically based on probabilistic graphical models such as HMMs or CRFs. Whereas HMMs model the joint probability P(f, S), for sequence S and state path f, CRFs directly model the posterior probability P(f | S), where f is considered a labeling of the sequence S:

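$$P(f \mid S) = \frac{1}{Z(S)} \exp\left( \sum_{c} F_c(f_c, S) \right)$$

Here f_c denotes the labels that f assigns to the vertices of clique c, and Z(S) is the partition function that normalizes over all possible labelings. The potential functions F_c are applied to the cliques c in a dependency graph (Sutton and McCallum, 2006). One advantage of CRFs over HMMs is that arbitrary features, such as empirical hexamer weights, can be incorporated into the F functions.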

We propose a CRF for gene structures in which each vertex in the field denotes a putative splice site and each edge denotes a putative exon or intron (Figure 18A). We refer to this model as a splice graph random field (SGRF). Labels are chosen from {0,1}, with 0 denoting omission and 1 denoting inclusion of the splice site in the predicted gene structure. Cliques in the SGRF consist of singletons and pairs of vertices directly connected via a single edge. Clique potential functions (Figure 18B) evaluate to nonzero

values only when all labels in the clique are 1, so that splice sites not included in the prediction do not contribute to its score.
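As a minimal illustration of how these clique potentials combine, the following Python sketch scores two candidate paths through a toy SGRF. The vertex names, the potential values, and the function `path_score` are hypothetical and are not taken from the ACE+ implementation; the sketch assumes only that the unnormalized log-score of a labeling is the sum of the singleton and pair potentials whose vertices are all labeled 1.

```python
from typing import Dict, List, Tuple

def path_score(path: List[str],
               vertex_potential: Dict[str, float],
               edge_potential: Dict[Tuple[str, str], float]) -> float:
    """Unnormalized log-score of the labeling that assigns 1 to the vertices on `path`.

    Vertices off the path are labeled 0, so their cliques contribute nothing.
    """
    score = sum(vertex_potential[v] for v in path)
    score += sum(edge_potential[(a, b)] for a, b in zip(path, path[1:]))
    return score

# Toy graph with made-up values: TSS -> donor -> acceptor -> TES, plus one
# alternative donor "GT2" that competes with the annotated donor "GT1".
vertex_potential = {"TSS": 0.0, "GT1": 1.2, "GT2": 0.4, "AG1": 0.9, "TES": 0.0}
edge_potential = {
    ("TSS", "GT1"): 2.0, ("TSS", "GT2"): 1.5,   # exon edges
    ("GT1", "AG1"): 0.7, ("GT2", "AG1"): 0.6,   # intron edges
    ("AG1", "TES"): 1.8,                        # exon edge
}

reference = path_score(["TSS", "GT1", "AG1", "TES"], vertex_potential, edge_potential)
alternate = path_score(["TSS", "GT2", "AG1", "TES"], vertex_potential, edge_potential)
```

In this toy example the path through the annotated donor scores higher than the path through the alternative donor, so it would be ranked first.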

SGRFs are constructed in a highly constrained manner, as follows. For a given splice isoform annotated on the reference genome, exon coordinates are projected to the genome of an individual or strain for prediction in that individual. For each projected splice site, a vertex is created and linked to the preceding vertex, resulting in a linear-chain SGRF having exactly one path that corresponds to the projected gene structure.

However, if any splice site is disrupted in the alternate sequence, its vertex is removed from the SGRF and alternate splice sites of the same type (donor/acceptor) are identified via a signal sensor (section 4.5.2) in the vicinity of the disrupted site. These are linked into the SGRF using edges of the appropriate type (exon/intron). In addition, any genetic variant that could potentially create a de novo splice site (Figure 18C) that does not exist in the reference is also added to the SGRF and linked via appropriate edges to the nearest annotated vertices already in the SGRF. In this way, the SGRF represents exactly the reference annotation when that annotation can be projected perfectly onto the alternate sequence. Only when a splice site is disrupted, or when a genetic variant creates a new putative splice site, is the SGRF expanded to include more potential paths.

We call this procedure constrained splice-graph construction (CSGC).

Figure 18: (A) A splice graph random field (SGRF). Vertices denote splice sites, and edges denote exons and introns. A path from TSS (transcription start site) to TES (transcription end site) outlines a single gene structure; labels 0 and 1 denote omission or inclusion, respectively, of a vertex on the selected path. (B) Cliques and their potential functions; SGRFs have only singleton and pair cliques. Potential functions for cliques labeled with any 0 do not contribute to the score, since they do not participate in the selected path. (C) Cryptic splice sites are unannotated splice sites near an annotated splice site; disrupted splice sites exist in the reference but not in the alternate sequence; de novo splice sites exist in the alternate sequence but not in the reference.

Decoding with an SGRF can be accomplished efficiently using dynamic programming. We use standard N-best decoding to find the N highest-scoring predictions, where N can be set by the user. For the experiments described here we used N = 10.
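The following Python sketch shows one generic way to perform N-best decoding by dynamic programming on a small directed acyclic graph. It is an illustrative assumption rather than the ACE+ decoder: the function `n_best_paths`, the vertex names, and all scores are hypothetical, and vertices are assumed to be supplied in left-to-right (topological) order.

```python
import heapq
from typing import Dict, List, Tuple

def n_best_paths(order: List[str],
                 vertex_score: Dict[str, float],
                 edges: Dict[str, List[Tuple[str, float]]],
                 n: int) -> List[Tuple[float, List[str]]]:
    """Return up to n highest-scoring paths from order[0] to order[-1].

    `order` lists the vertices in topological order, and `edges[u]` lists
    (v, edge_score) pairs for the edges leaving u.
    """
    best: Dict[str, List[Tuple[float, List[str]]]] = {v: [] for v in order}
    best[order[0]] = [(vertex_score[order[0]], [order[0]])]
    for u in order:
        # All predecessors of u have been processed, so best[u] is complete;
        # keep only its n highest-scoring partial paths.
        best[u] = heapq.nlargest(n, best[u], key=lambda item: item[0])
        for v, edge_score in edges.get(u, []):
            for score, path in best[u]:
                best[v].append((score + edge_score + vertex_score[v], path + [v]))
    return best[order[-1]]

# Toy graph with made-up scores and two competing donor sites.
vertex_score = {"TSS": 0.0, "GT1": 1.2, "GT2": 0.4, "AG1": 0.9, "TES": 0.0}
edges = {
    "TSS": [("GT1", 2.0), ("GT2", 1.5)],
    "GT1": [("AG1", 0.7)],
    "GT2": [("AG1", 0.6)],
    "AG1": [("TES", 1.8)],
}
top = n_best_paths(["TSS", "GT1", "GT2", "AG1", "TES"], vertex_score, edges, n=10)
```

Because only the n best partial paths are retained at each vertex, the work per edge is bounded by n, which keeps decoding fast on the sparse graphs produced by constrained construction.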

4.2.2 SGRF features and parameter estimation

We refer to the potential functions for singleton cliques that represent individual splice sites as signal sensors, and to those for pair cliques that represent exons and introns as content sensors. For content sensors we use a linear combination Xb of hexamer weights b with hexamer counts X in the interval spanned by the clique. These intervals do not include splice sites at the ends of the interval. b and X are 4096-dimensional vectors corresponding to the 4096 possible hexamers. Any collection of hexamer weights can thus be used as an SGRF content sensor. Hexamer counts are extracted from all reading frames on the sense strand. For signal sensors, we score a fixed window spanning the putative splice site with a small number of flanking positions on both sides (section 4.5.2). Features are indicators (0 or 1) for whether each possible nucleotide (A, C, G, T) is present at each position.
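The two kinds of features described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names and the example weights are hypothetical, the caller is assumed to have already trimmed the splice sites from the ends of the interval, and windows containing symbols other than A, C, G, T are simply skipped.

```python
from itertools import product
from typing import Dict, List

HEXAMERS: List[str] = ["".join(p) for p in product("ACGT", repeat=6)]  # 4096 hexamers
INDEX: Dict[str, int] = {h: i for i, h in enumerate(HEXAMERS)}

def hexamer_counts(seq: str) -> List[int]:
    """Count every overlapping hexamer in seq (all three frames of the given strand)."""
    counts = [0] * len(HEXAMERS)
    for i in range(len(seq) - 5):
        idx = INDEX.get(seq[i:i + 6].upper())
        if idx is not None:          # skip windows containing N or other symbols
            counts[idx] += 1
    return counts

def content_score(seq: str, weights: List[float]) -> float:
    """Linear combination of hexamer counts X with hexamer weights b."""
    return sum(x * b for x, b in zip(hexamer_counts(seq), weights))

def one_hot_window(window: str) -> List[int]:
    """Signal-sensor features: one 0/1 indicator per (position, nucleotide) pair."""
    return [int(base == nt) for base in window.upper() for nt in "ACGT"]

# Made-up weights: penalize poly(G), reward one arbitrary hexamer.
weights = [0.0] * len(HEXAMERS)
weights[INDEX["GGGGGG"]] = -1.0
weights[INDEX["ACGTAC"]] = 0.5
example_score = content_score("ACGTACGGGGGGG", weights)
```

Any 4096-element weight vector can be passed to `content_score`, which mirrors the statement that any collection of hexamer weights can serve as a content sensor.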

Training of conditional random fields is typically performed via conditional maximum likelihood (CML), which globally optimizes all model parameters jointly by maximizing P(f | S). Because CML can be computationally burdensome, we instead use piecewise training (Sutton and McCallum, 2005) with the “factor-as-piece” approximation (Sutton, 2008). Piecewise training for the SGRF consists of estimating the parameters of each potential function F_AG, F_GT, F_exon, and F_intron separately. Because each of these functions is a linear combination of hexamer counts or nucleotide indicators, and because standard logistic regression is equivalent to CML for a single-vertex CRF with a linear-combination potential function (Sutton and McCallum, 2005), we use logistic regression to estimate the parameters of each potential function individually. For pair cliques we binarize the problem into classification of exons versus introns in order to apply logistic regression. This procedure is similar to that of Domke (2014), except that we employ only a single iteration of logistic regression and omit the belief propagation step for the sake of time efficiency. We call this simplified procedure piecewise logistic. For the experiments described here we used elastic-net logistic regression (Zou and Hastie, 2005) in order to favor sparse parameterizations (section 4.5.2). For the experiments on human sequences we trained the content sensors on 10,000 pairs of exons and introns annotated in GENCODE version 19 (Harrow et al., 2012). Signal sensors were trained on 5000 donor splice sites and 5000 acceptor splice sites annotated in GENCODE v19. Score thresholds for signal sensors were selected to admit 99% of training splice sites.
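The threshold selection in the last step can be illustrated with a short Python sketch. The function `admit_threshold` and the example scores are hypothetical; the sketch assumes that a site is admitted when its signal-sensor score is greater than or equal to the threshold, which is one plausible reading of the rule rather than a description of the actual implementation.

```python
from typing import Sequence

def admit_threshold(scores: Sequence[float], admit_percent: int = 99) -> float:
    """Largest threshold t such that at least admit_percent % of scores are >= t."""
    if not scores:
        raise ValueError("need at least one training score")
    ranked = sorted(scores, reverse=True)
    # Integer ceiling of admit_percent% of the sample size (avoids float rounding).
    keep = max(1, (admit_percent * len(ranked) + 99) // 100)
    return ranked[keep - 1]

# Made-up training scores 1.0, 2.0, ..., 100.0 for illustration only.
training_scores = [float(s) for s in range(1, 101)]
threshold = admit_threshold(training_scores)
admitted = sum(s >= threshold for s in training_scores)
```

With 100 made-up scores, the threshold falls at the second-lowest score, so that 99 of the 100 training sites are admitted.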

4.2.3 Integration of SGRF into ACE+

Our previously described software, ACE (Chapter 3; Majoros et al., 2017), interprets proposed splicing changes in terms of their effects on encoded proteins, enabling prediction of loss of function via nonsense-mediated decay (NMD), nonstop decay, loss of coding potential, and protein modification or truncation. Incorporating the SGRF into this framework enables putative splicing changes predicted by the SGRF to be evaluated in terms of their likely effect on gene function, and provides a means of ranking ACE predictions based on their probability under the SGRF. We call the resulting software ACE+. ACE+ constructs personal genomes from a phased VCF file containing SNPs, multinucleotide variants, short indels, and short copy-number variants (CNVs). It then projects reference annotations onto the alternate sequence using a coordinate transformation, applies the SGRF to predict splicing changes, and interprets the resulting changes as to their potential for loss of function. Variants within each gene are then re-interpreted as to their likely effect, based on changes in the annotation predicted by the SGRF (e.g., coding variants may become noncoding variants and vice versa). Outputs are provided in a highly structured format that can be converted to XML. An API for parsing the structured output and extracting arbitrary features is provided in Perl, Python, and C++.
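To make the coordinate transformation concrete, the following Python sketch applies a list of variants to a reference sequence and records where each reference base lands in the alternate sequence. It is a simplified illustration and not the ACE+ code or its API: the `Variant` tuple, the function `apply_variants`, and the example sequence are all hypothetical, VCF parsing and phasing are omitted, and variants are assumed to be sorted and non-overlapping.

```python
from typing import List, Optional, Tuple

Variant = Tuple[int, str, str]   # (0-based reference position, ref allele, alt allele)

def apply_variants(ref: str, variants: List[Variant]) -> Tuple[str, List[Optional[int]]]:
    """Build the alternate sequence and a reference-to-alternate coordinate map.

    ref_to_alt[i] is the alternate-sequence index of reference base i, or None
    if that base was deleted.
    """
    alt_parts: List[str] = []
    ref_to_alt: List[Optional[int]] = []
    ref_pos = 0
    alt_pos = 0
    for pos, ref_allele, alt_allele in variants:
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at {pos}")
        # Copy the unchanged stretch that precedes the variant.
        alt_parts.append(ref[ref_pos:pos])
        ref_to_alt.extend(range(alt_pos, alt_pos + pos - ref_pos))
        alt_pos += pos - ref_pos
        # Reference bases covered by the variant map onto the alt allele where possible.
        for k in range(len(ref_allele)):
            ref_to_alt.append(alt_pos + k if k < len(alt_allele) else None)
        alt_parts.append(alt_allele)
        alt_pos += len(alt_allele)
        ref_pos = pos + len(ref_allele)
    # Copy the unchanged tail after the last variant.
    alt_parts.append(ref[ref_pos:])
    ref_to_alt.extend(range(alt_pos, alt_pos + len(ref) - ref_pos))
    return "".join(alt_parts), ref_to_alt

# One 2 bp insertion and one 1 bp deletion on a made-up reference.
alt_seq, ref_to_alt = apply_variants("ACGTACGT", [(2, "G", "GTT"), (5, "CG", "C")])
```

An annotated feature boundary at reference position i would then be projected to `ref_to_alt[i]`; a value of None signals that the boundary was deleted in the alternate sequence and can no longer be projected directly.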

4.2.4 Computational validation

We tested the predictive accuracy of the SGRF on 150 human genomes from the Thousand Genomes Project (The Thousand Genomes Consortium et al., 2015). RNA-seq data from lymphoblastoid cell lines (LCLs) collected by the Geuvadis project (Lappalainen et al., 2013) were used to quantify support for predicted novel splice junctions. TopHat2 (Kim et al., 2013) was used to perform spliced alignment of RNA-seq reads to personalized genome sequences. Receiver-operating characteristic (ROC) curves were constructed to enable comparison of prediction accuracy between three different content sensors: (1) piecewise logistic applied to 10,000 human exon-intron pairs; (2) piecewise logistic applied to 10,000 exon-intron pairs from Arabidopsis thaliana, as annotated in the Araport 11 release (Cheng et al., 2017); (3) hexamer weights estimated previously by Rosenberg et al. (2015) by fitting a sigmoid function to exon inclusion levels resulting from a massively parallel minigene experiment in human HEK293 cells.
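For readers unfamiliar with the summary statistic, the area under a ROC curve can be computed directly from scored predictions, as in the Python sketch below. The function `roc_auc` and the four example predictions are hypothetical; in this setting a label of 1 would stand for a predicted junction supported by spliced RNA-seq reads and 0 for an unsupported one, and the score would be the model's score for that prediction.

```python
from typing import Sequence

def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count one half. This is the Mann-Whitney formulation of the area under
    the ROC curve; it takes O(P*N) time, which is adequate for a small example.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Four made-up predictions: two supported (label 1) and two unsupported (label 0).
example_auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of supported from unsupported predictions.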

To investigate whether the piecewise logistic training procedure is hampering predictive accuracy by not training all parameters jointly, we incorporated an additional parameter r_content/signal into the model, which weights the relative contributions of content sensors versus signal sensors. We performed a sensitivity analysis by applying the model to a single Thousand Genomes individual with different values for r_content/signal (0.1, 0.2, 0.4, 0.6, 0.8, 1, 1.5, 2, 3, 4, 5, 8, 10). A substantial improvement in predictive accuracy for values of r_content/signal < 1 would indicate that the content sensors are overpowering the signal sensors, and that accuracy might be improved by global training of all parameters jointly.

4.3 Results

4.3.1 Prediction accuracy on 150 human genomes

Area under the ROC curve (AUC) values for the SGRF with different content sensors indicate that on average an AUC of approximately 0.75 is achievable using either experimentally determined or bioinformatically inferred features (Figure 19A,B). The logistic model trained on human annotations achieved the highest AUC (0.75) followed closely by the model trained on splicing minigene outputs (0.72). The logistic model

trained on Arabidopsis performed worst (0.51), indicating that training for the target organism is necessary in order to learn organism-specific exon definition features. The median difference between the logistic human model and the minigene model was positive and significant (Wilcoxon signed-rank test: V = 43489, P = 2×10^-44; Figure 19C), indicating that training on a new organism can be done effectively using logistic regression applied to annotated exons and introns. Logistic regression training took approximately 12 hours on a single CPU.

Logistic weights for human data were similar across training runs (Figure 50).

Modifying r_content/signal away from its default value of 1 did not appreciably improve prediction accuracy on individual HG00096 (Figure 51), indicating that content sensor scores are not overpowering signal sensor scores to the detriment of predictive accuracy. The precipitous drop in AUC as r_content/signal approached zero indicates that splice-site scores alone are inadequate for predicting splicing changes, consistent with previous suggestions that splice sites lack sufficient information content to allow their discrimination without genomic context information (Lim and Burge, 2001). Classification of 5000 annotated versus 5000 decoy human splice sites omitted from the training set showed that both the logistic signal sensors and PWMs achieved high classification accuracy (Figure 52), with the logistic model only slightly outperforming the PWM for donor splice sites (AUC = 0.984 versus 0.977), and exactly matching the PWM for acceptor splice sites (AUC = 0.965).

When the logistic human content sensor was used to perform direct classification of whole exons versus whole introns of matched lengths and with splice sites removed

(section 4.5.4), AUC was higher when presented with coding exons than with noncoding exons (0.89 versus 0.79; Figure 19D). This indicates that while logistic regression successfully learned exon definition features that enabled the model to recognize noncoding exons, it may also be inadvertently learning some features of the coding segments present in the training exons. Classification of binarized minigene splicing results (section 4.5.4) using the human logistic model resulted in a similar AUC (0.77) to that of classifying lincRNA exons (0.79) (Figure 19D). Positively scoring hexamers under the human logistic model were enriched deep into noncoding exons relative to introns

(Figure 19E), indicating that the features learned were not due to sequence biases proximal to splice sites.

Figure 19: (A) ROC curves for the SGRF with three different content sensors; TP and FP rates were computed based on spliced RNA-seq reads from LCL cells supporting predicted novel splice junctions not found in any isoform of the gene. (B) Area under ROC curves shown in panel A. (C) Difference between logistic model AUC and minigene model AUC. (D) ROC for classification of 10,000 coding exons versus 10,000 introns (red), for 400 lincRNA exons versus 400 lincRNA introns (blue), and for minigene exons with high (N=10,947) versus low (N=16,185) inclusion rates (green), using the logistic human model for classification. (E) Density of positively scored hexamers under the human logistic model, at relative positions in 400 noncoding lincRNA exons.

4.3.2 Logistic and splicing minigene features reflect known hnRNP but not SR protein motifs

As reported in Rosenberg et al. (2015) for the minigene features, G-rich hexamers in the human logistic model were enriched for negative scores (Figure 53), consistent with G-richness of sequences preferred by some hnRNPs (Huelga et al., 2012; Rahman et al., 2015; Mauger et al., 2008). While elastic net regularization produced a sparse model containing 1966 of the 4096 possible hexamers, all nineteen hexamers containing five or more Gs were included in the model and assigned negative scores. Consensus binding motifs for hnRNPs obtained from Huelga et al. (2012) were likewise enriched for having negative scores under the human logistic model (Wilcoxon V = 329, P = 3.9×10^-9), as well as under the minigene model (Wilcoxon V = 1763, P = 0.0002), consistent with our expectations of depleted hnRNP binding in exons. While scrambled versions of those consensus motifs are also commonly enriched for negative scores under both models (logistic model: P < 0.05 in 469/1000 scrambled motif sets; minigene model: P < 0.05 in 506/1000 scrambled motif sets), hnRNPs have been characterized as having degenerate binding motifs with low sequence specificity (Singh and Valcárcel, 2005; Huelga et al., 2012). As each consensus motif represents only the single most strongly bound sequence for a factor, enrichment of some scrambled versions of these degenerate motifs might represent weaker binding that nevertheless supports exon definition. The most strongly negatively scoring hnRNP motif under the human logistic model was for hnRNP H, which is known to bind to poly(G) sequences (Rahman et al., 2015; Mauger et al., 2008) and has been implicated in aberrant splicing (Paul et al., 2006).

SR protein motifs obtained from Long and Caceres (2009) were not significantly biased in their scores under either the logistic model (Wilcoxon V = 6101, P = 0.08) or the minigene model (Wilcoxon V = 17713, P = 0.14). The sparse logistic model included

substantially more negative features than positive features (1126 versus 840), suggesting that accurate discrimination of exons from introns relies more on features depleted from exons (e.g., hnRNP binding sites) than on features enriched in them (e.g., SR protein binding sites).

Classification of exons versus length-matched introns (with splice sites removed) using densities of known SR or hnRNP motifs produced very low AUC values (hnRNP motifs: 0.62; SR protein motifs: 0.60) that were not substantially higher than random classification (0.56), indicating that both the experimentally determined features of

Rosenberg et al. (2015) and the sparse logistic features learned from human annotations represent sequence preferences not entirely explained by known SR protein or hnRNP consensus motifs.

4.3.3 De novo splice sites are prevalent, have a wide range of effects, and are misclassified by popular tools

Predictions of the SGRF with human logistic features were highly enriched for novel isoforms utilizing de novo splice sites present in the alternate sequence but absent from the reference genome, as compared to variants disrupting existing sites. Among predictions with posterior probability > 0.9, there were 3,165 de novo splice sites predicted and 750 disrupted sites, a 4.2-fold enrichment. Predictions supported by spliced RNA-seq reads were similarly enriched relative to numbers of disrupted splice sites (Figure 20A). De novo splice sites are distinct from cryptic splice sites, in that cryptic sites exist in the reference genome (and typically also the alternate sequence) whereas de novo splice sites exist in the alternate sequence but not in the reference (Figure 18C). Simulation of a simple mutation process based on context-dependent substitution (section 4.5.5) also produced a large bias of de novo sites over disrupted sites (Figure 20A). The enrichment was greatest (125.5×) when requiring only that de novo splice sites have a canonical 2 bp consensus, and least (8.8×) when requiring that de novo sites score above threshold under the logistic splice-site model and occur in a favorable exon definition context (section 4.5.5). The existence of an enrichment in all of the simulations suggests that de novo splice sites may come into existence regularly through mutation. Moreover, each individual in the Thousand Genomes sample had on average 126 predicted de novo splice sites supported by spliced RNA-seq reads in LCLs, indicating that de novo splice sites are widespread and commonly used by the spliceosome.

As these human subjects are reported to be healthy adults, we expect natural selection to result in a strong bias for these de novo splice sites to have small effects. Consistent with this expectation, most de novo splice sites show moderate utilization in these cells (Figure 20B). Nevertheless, some de novo splice sites in this data set show high utilization (Figure 54), and individual examples of de novo splice sites with high utilization have been documented in aberrant splicing databases (Královicová et al., 2005; Buratti et al., 2007).

Figure 20: (A) Log2(number of de novo splice sites / number of disrupted splice sites) in mutation simulations (brown: requiring only a 2 bp consensus for de novo splice sites; blue: requiring sufficiently high score under splice-site model for de novo splice sites; green: requiring sufficiently high score under the splice-site model and favorable exon definition context for de novo splice sites), and in predictions

supported by RNA-seq (red); numbers above bars are raw (non-log2) ratios. (B) Estimates of relative splicing activity (red) and protein change (blue) due to de novo splice sites supported by RNA-seq. (C) Frequencies of dbSNP classifications of variants predicted to create de novo splice sites and supported by spliced RNA-seq reads. (D) Frequencies of VEP classifications of variants creating de novo splice sites supported by RNA-seq.

Furthermore, while non-NMD de novo splice sites showed the expected bias toward having small impacts on encoded proteins, 62% of RNA-supported de novo splice sites were predicted to trigger NMD, and some non-NMD de novo splice sites supported

by RNA-seq are predicted to result in large protein changes (blue curve in Figure 20B).

As such, de novo splice sites are capable of having large effects on splicing ratios and on encoded proteins.

A number of de novo splice sites have been implicated in disease, as evidenced by entries in the DBASS database (Královicová et al., 2005; Buratti et al., 2007) for diseases such as breast cancer, cystic fibrosis, hemophilia, muscular dystrophy, alpha- and beta-thalassemia, hypothyroidism, phenylketonuria, and others. These disease-related variants are documented as having a range of effects. For example, mutation E1+135C>T in the HBA2 (hemoglobin alpha subunit 2) gene is implicated in alpha-thalassemia, and is described as having 100% splicing utilization (Harteveld et al., 2004), whereas IVS17a-26A>G in the CFTR (cystic fibrosis transmembrane conductance regulator) gene, implicated in cystic fibrosis, is described as resulting in leaky splicing and a mild form of the disease (Beck et al., 1999). De novo splice sites can thus present a wide range of effects in the clinical setting.

Despite their known role in a number of diseases, de novo splice sites are commonly misinterpreted as noncoding, synonymous, or missense mutations. For the 3289 Thousand Genomes variants that were predicted to result in de novo splice sites and were supported by spliced RNA-seq reads in LCLs, dbSNP (Sherry et al., 2001) predicted only 0.9% to be involved in splicing (Figure 20C). Similarly, Ensembl’s VEP tool (McLaren et al., 2016) predicted only 3.9% to be involved in splicing (Figure 20D).

4.4 Discussion

While early efforts at computational gene prediction in the 1980s focused on finding individual exons based on their codon usage statistics, the sequencing of large chromosome segments by the human genome project circa 2000 (Venter et al., 2001;

Lander et al., 2001), together with bioinformatic advances such as the application of grammar models and dynamic programming to DNA sequence, enabled the development of methods that could efficiently and accurately predict whole gene structures in the late 1990s (Kulp et al., 1996; Burge and Karlin, 1997). These whole-gene structure annotations in turn enabled downstream analyses of genome-wide protein coding properties and the identification of gene families and functional annotation via protein similarity. As such, these bioinformatic advances had a measurable impact on our understanding of genomics.

The strong codon signal imposed on coding segments by natural selection enabled these methods to accurately chain exons together into multi-exon transcripts, by enforcing the constraint that translation reading frames must be contiguous and consistent across exons. The assumption of intact reading frames enables highly accurate prediction of genes in reference genomes, where it is natural to assume that most genes are well-formed, evolutionarily conserved, and functional. For predicting the consequences of individual genetic variation on gene structure, the assumption of contiguous reading frames matching organism-specific codon biases can result in

incorrect predictions. As a major goal of personal genomics is to identify functional variants that may be implicated in disease, these biases are problematic.

In this work we have developed a novel model in which exon definition features are used instead of coding signals, to avoid the biases that reference annotation methods impose. Our results on 150 human genomes indicate that exon definition features can be automatically learned via standard machine-learning methods applied to annotated training genes. Those learned features can then be effectively used to discriminate splicing changes that are supported by RNA-seq evidence in a single cell type. Our use of logistic regression and reliance only on annotated exons and introns for training, as opposed to curated examples of aberrant splicing, renders ACE+ applicable to other organisms, such as economically important species.

Our observation that de novo splice sites appear to be readily created via simple mutation, that each individual has numerous such sites supported by spliced RNA-seq reads in one cell type, and that such sites have the potential to have a wide range of effects both in splicing utilization and in resulting changes to amino acid sequences, indicates that this is an important class of variants. That these variants are commonly misinterpreted by popular tools as being non-splicing related indicates that there is a pressing need for the development of new models that can accurately identify these variants, particularly as a number of de novo splice sites have been implicated in diseases. Given the prevalence of de novo splice variants in individual human genomes

and the difficulty that existing tools have in identifying them as such, it is conceivable that this class of cryptic variants may account for a non-negligible fraction of unexplained disease cases.

The method we have proposed does not fully solve the problem of identifying modified splicing patterns in individual genomes, as there is much room for improvement above the 0.75 AUC reported here. First, the ability to learn exon definition features from training examples that are enriched for coding exons, without being biased toward coding features, is an unavoidable problem given that most annotated genes available for training are coding genes. Training only on out-of-frame hexamers might seem an obvious solution. However, preliminary experiments indicated that the use of only out-of-frame hexamers during training did not improve prediction accuracy. Indeed, using all frames during training produced slightly higher classification accuracy of splicing minigene outputs. Classification of randomized exonic sequences showing high versus low inclusion rates resulted in a classification accuracy of 0.765 AUC using a model trained on out-of-frame features only. Using a model trained on all reading frames resulted in a classification accuracy of 0.773 AUC.

This difference may be due to the resulting increase in sample size for training.

Furthermore, it may be expected that di-codon patterns in reading frames impose biases even on out-of-frame hexamers, and conversely that exon definition features may occur in-frame. One possible solution is to modify the regression problem to simultaneously

learn both coding and exon definition features using separate parameters, so that coding biases can be minimized in the learned exon definition parameters.

The use of massively parallel splicing minigene experiments to ascertain empirical hexamer weights is another solution, as the use of randomized exonic sequence mitigates biases due to natural selection on functional content, though other biases may remain (such as that due to NMD when the randomized exon is coding; Rosenberg et al., 2015). This solution is likely species-specific, as our results using Arabidopsis features to predict splicing in humans indicate that features trained on one species provide poor predictive accuracy on evolutionarily distant taxa. This was also seen in gene-finding with coding features (Korf, 2004). Furthermore, peculiarities of the specific minigene used can result in unintended biases, such as features that reflect base complementarity to specific splice sites in the minigene (Rosenberg et al., 2015).

The use of collections of hexamer weights to represent exon definition potential has become popular in recent years. However, these hexamer models make strong assumptions about the independence of features residing near or far from each other in linear sequence space. It is known that SR proteins extensively interact, and that SR proteins compete with both other SR proteins and with hnRNPs for binding sites in mRNAs (Pandit et al., 2013; Rahman et al., 2015), and likewise that hnRNPs can interact in positive or negative ways (Huelga et al., 2012). Other examples of interactions between splicing regulators have been documented (reviewed in Ke and Chasin, 2011).

It has also been demonstrated that splicing decisions can be influenced by epigenetic effects such as nucleosome positioning and histone modifications (reviewed in Zhou et al., 2014), and by binding of specific transcription factors at promoters and possibly even distal enhancers (reviewed in Kornblihtt et al., 2013).

There is thus much room for incorporation of features beyond simple hexamer weights. Such improvements could be incorporated into the SGRF via modification of the F functions. Assuming the constrained graph construction process remains unchanged, modifications to the F functions will not negatively impact decoding efficiency, as the graph will remain sparse. Modifications to the training process may however be necessary, particularly if new features integrate information across larger intervals or consider combinatoric interactions. While piecewise training via simple logistic regression worked well for the initial model described here, for an expanded model with wider dependencies, an iterative method such as the one proposed by

Domke (2014) that combines logistic regression with belief propagation may be required.

Our use of reference annotations as training examples results in a model that does not reflect splicing regulatory differences between cell types. As alternative splicing is often regulated in a cell-type specific manner, it can be expected that aberrant splicing will also exhibit cell-type specific patterns; such cell-type-specific effects have been modeled previously, though not in a whole-gene model that can accommodate multiple variants jointly (Xiong et al., 2015). One possible means of addressing this is to

train separate models on transcripts found to be expressed in individual cell types via RNA-seq data, such as that published by the GTEx project (Melé et al., 2015).

Our inability to link positively-scoring hexamers in either the logistic model or the minigene model published by Rosenberg et al. (2015) to known SR protein motifs could be explained by a number of possibilities. Some SR proteins and hnRNPs have been characterized as participating in both specific and nonspecific binding, and may rely on co-factors for specific binding (Singh and Valcárcel, 2005). It has also been shown that individual SR proteins can have both a positive and a negative effect on exon definition in different contexts, as can hnRNPs (Pandit et al., 2013; Huelga et al., 2012; Singh and Valcárcel, 2005). The fact that densities of known SR protein motifs or of known hnRNP motifs did not produce strong classification accuracy in the experiments described here supports the notion that while these molecules have been demonstrated to play important roles in splicing, determining their effect via simple consensus motif counts may not be feasible in general. That both the logistic model and the minigene model were able to achieve much higher classification accuracy suggests that these models are detecting features relevant to exon definition, though at present it is not known with certainty what biological significance individual hexamers in these models have. Novel experimental work will likely be required to ascertain whether these features represent unknown binding motifs for known SR proteins or hnRNPs, or possibly for unknown splicing factors or co-factors.


4.5 Supplementary Methods

4.5.1 Gene structure prediction in real genes modified with premature stop codons

To systematically investigate the behavior of traditional gene structure predictors on genes with possible loss-of-function mutations in personal genomes, we ran a commonly used HMM-based gene finder (Stanke et al., 2006) on 19,000 human genes modified to include a premature stop codon. First, the HMM was applied to the genomic sequence of each gene (including 1000 nucleotides of flanking sequence on either side) without modification, and the gene structure predicted by the HMM was noted; no comparison to the annotated gene structure was made. Then a stop codon was inserted at a randomly-selected location in the predicted coding segment, while ensuring that the inserted stop codon did not inadvertently create a canonical donor or acceptor splice site consensus. Then the HMM was run again to obtain a second prediction, this time using the sequence with the premature stop codon. The splice pattern predicted for the modified sequence was compared to the splice pattern predicted for the unmodified sequence. As traditional gene finders key largely on codon biases in contiguous reading frames, we hypothesized that the gene finder would frequently modify its predicted splice pattern so as to omit the premature stop codon.

We tabulated the number of times this occurred.


4.5.2 Regularized logistic regression training

For logistic signal and content sensors we applied regularized (elastic net) logistic regression. We used glmnet version 2.0-2, with α = 0.5 to interpolate equally between L1 and L2 regularization, with regularization strength λ selected to minimize the mean cross-validation error (Friedman et al., 2010). This value of α was chosen after observing in preliminary runs on individual HG00096 that α = 0.5 produced higher prediction accuracy as validated via RNA-seq (AUC = 0.79) than pure lasso regression (α = 0, AUC = 0.76) and pure ridge regression (α = 1.0, AUC = 0.77). For content sensors, all 4096 hexamers were used as features. Hexamer counts were extracted from all reading frames on the sense strand of training exons and introns, and regularized logistic regression was applied to learn a weight for each hexamer for classifying exons versus introns. These weights were taken together as the model for Fexon. To obtain Fintron, the weights of the exon model were negated.
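The feature extraction described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function names are hypothetical, "all reading frames" is interpreted as a sliding window with a step of one nucleotide, and hexamers containing ambiguous bases are assumed to be skipped.

```python
from collections import Counter
from itertools import product

# All 4096 possible hexamers over the DNA alphabet, in a fixed order.
HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]

def hexamer_counts(seq):
    """Count overlapping hexamers (step of 1, hence all three frames) in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    # Windows containing non-ACGT symbols (e.g., N) never match a key in
    # HEXAMERS and are therefore ignored (an assumption of this sketch).
    return [counts.get(h, 0) for h in HEXAMERS]

def content_score(seq, weights, intercept=0.0):
    """Linear score of a sequence under a vector of 4096 hexamer weights."""
    return intercept + sum(w * c for w, c in zip(weights, hexamer_counts(seq)))

def negate(weights):
    """Derive intron weights by negating exon weights, as described above."""
    return [-w for w in weights]
```

In practice the weight vector would come from the fitted elastic net model; here it is simply any list of 4096 numbers aligned with HEXAMERS.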

For signal sensors, a fixed window around each splice site was used to define indicator variables at each position within the window. For donor splice sites, 6 bp of sequence was included to the left (5’) of the 2 bp consensus, as well as 12 bp of sequence to the right (3’) of the consensus; only sequences GT, GC, and AT were accepted as valid consensuses. For acceptor splice sites, 20 bp of sequence left (5’) of the consensus and 2 bp right (3’) of the consensus were included in the window; consensuses AG and AC were considered valid. At each window position we set four indicator variables, one for each of the four possible nucleotides, to indicate which nucleotide was present at that position in the current training or test case. For example, if nucleotide C was present at that position in the training case, the indicator for C was set to 1 and the indicators for A, G, and T were set to 0. A pseudocount of 0.1 was added to all counts. Regularized logistic regression (α = 0.5) was applied to learn a vector of weights for the indicators within the window. Separate models were learned for donor splice sites and acceptor splice sites.
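As an illustration of the indicator encoding, the following sketch extracts a donor-site window and converts it to indicator variables. The function names are hypothetical, and the sketch assumes that the two consensus positions are themselves part of the encoded window (giving 6 + 2 + 12 = 20 positions and 80 indicators); the pseudocount step is not shown.

```python
NUCLEOTIDES = "ACGT"
DONOR_LEFT, DONOR_RIGHT = 6, 12          # flanking bp 5' and 3' of the consensus
DONOR_CONSENSUSES = {"GT", "GC", "AT"}   # accepted donor consensuses

def donor_window(seq, pos):
    """Return the window around a donor consensus starting at index pos.

    Returns None if the window runs off the sequence or if the dinucleotide
    at pos is not an accepted donor consensus.
    """
    start, end = pos - DONOR_LEFT, pos + 2 + DONOR_RIGHT
    if start < 0 or end > len(seq):
        return None
    if seq[pos:pos + 2] not in DONOR_CONSENSUSES:
        return None
    return seq[start:end]

def one_hot(window):
    """Four indicators per position: 1 for the observed base, 0 otherwise."""
    features = []
    for base in window:
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features
```

An acceptor-site encoder would differ only in its flank lengths (20 bp and 2 bp) and its accepted consensuses (AG and AC).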

4.5.3 Evaluation of SGRF prediction accuracy using Geuvadis data

RNA-seq data from LCLs for 150 individuals in the Geuvadis project was used to validate the SGRF predictions. For each individual and each gene in GENCODE v19, phased variants were used to construct explicit haplotype sequences for the gene, as previously described (Majoros et al., 2017; Chapter 3). Insertion/deletion variants were used to infer an alignment between the reference sequence and the personal genome.

For each annotated isoform of each gene, the isoform was projected onto the personal genomic sequence using the inferred alignment. The SGRF was then applied to produce predicted splice forms in the personal genome. Predictions were made separately using the human logistic model, the Arabidopsis logistic model, and the minigene weights from Rosenberg et al. (2015). RNA-seq data was then aligned to the personal genome by TopHat2 (Kim et al., 2013), with the projected annotations and pooled predictions provided to TopHat2 as annotations. Alignment to the personal genome rather than to the reference was performed in order to reduce reference bias in read-mapping (Degner et al., 2009). For each prediction, novel splice junctions that did not occur in any annotated isoform of the gene were considered to be consistent with RNA-seq if at least one spliced read exactly matched both coordinates of the splice junction. Transcripts not expressed in LCLs at an FPKM of at least 3 were omitted from the analysis, as these may unfairly penalize predictors. True positive rate and false positive rate were calculated by tabulating predictions at the individual splice-junction level and were then used to construct ROC curves. Reported ROC curves reflect only the accuracy of novel junctions not occurring in any annotated isoform of the gene.
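The junction-level bookkeeping can be sketched as below. This is only an illustration under stated assumptions: junctions are represented as hypothetical (intron start, intron end) coordinate pairs, the function name is invented for this example, and the subsequent derivation of true and false positive rates from these sets is not shown.

```python
def score_novel_junctions(predicted, annotated, rnaseq_junctions):
    """Split predicted novel junctions into RNA-seq supported and unsupported.

    Each junction is a (start, end) tuple. A predicted junction is novel if it
    occurs in no annotated isoform, and supported if at least one spliced read
    matched both of its coordinates exactly.
    """
    novel = set(predicted) - set(annotated)
    supported = novel & set(rnaseq_junctions)
    unsupported = novel - supported
    return supported, unsupported
```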

4.5.4 Classification of whole exons and introns

To assess whether learned hexamer weights reflected coding features, noncoding features, or both, we used the logistic regression model alone (without the SGRF) to classify an equal number of exons versus introns shortened to the same length. Splice sites and 10 bp flanking them were removed from all introns prior to shortening them to match exon lengths. Testing was performed only on exons and introns not included in the training set. The standard logistic function was used to compute P(exon|sequence) and classification was performed under the rule that P > 0.5 dictates an exon prediction and P < 0.5 dictates an intron prediction.
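The decision rule can be written compactly as follows; this is a sketch in which the function names are hypothetical and the input is assumed to be the linear score (intercept plus weighted hexamer counts) produced by the logistic model.

```python
import math

def logistic(z):
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(linear_score):
    """Exon if P(exon|sequence) > 0.5, intron if < 0.5, undecided if equal."""
    p = logistic(linear_score)
    if p > 0.5:
        return "exon"
    if p < 0.5:
        return "intron"
    return None
```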

For evaluation of prediction of minigene splicing outcomes, spliced read counts were obtained from Rosenberg et al. (2015) for the randomized sequences between alternate splice donor sites on the minigene. Sequences with appreciable numbers of read counts splicing at locations other than the two main splice sites, SD1 and SD2, were discarded. Sequences for which count(SD1) > 2 × count(SD2) were counted as negative cases of exon definition (i.e., the exon most often failed to extend to the further splice site, SD2), and sequences for which count(SD2) > 2 × count(SD1) were counted as positive cases of exon definition (i.e., the exon most often succeeded at extending to the further splice site, SD2). The logistic model trained on human annotations was tested by using the standard logistic function as above to classify the randomized sequences as positives or negatives. These predictions were evaluated against the known classifications based on spliced read counts as described above. True positive rate and false positive rate were calculated accordingly to produce an ROC curve.
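The labeling rule and the construction of an ROC curve from model scores can be sketched as follows. The function names are hypothetical, and the threshold sweep shown is a generic way of tracing an ROC curve rather than necessarily the exact procedure used here.

```python
def label_exon_definition(sd1_count, sd2_count):
    """1 = positive case (SD2 dominant), 0 = negative case (SD1 dominant).

    Returns None for sequences where neither site has more than twice the
    spliced reads of the other; such sequences receive no label.
    """
    if sd2_count > 2 * sd1_count:
        return 1
    if sd1_count > 2 * sd2_count:
        return 0
    return None

def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over scores.

    Assumes labels contains at least one 1 and at least one 0.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ordered = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Admit all cases tied at this score before emitting a point.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points
```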

4.5.5 Simulating creation and destruction of splice sites

To investigate the propensity of a simple mutation process to create or destroy splice sites, a previously-estimated context-dependent DNA substitution matrix (Allen et al., 2013) was used to simulate mutations jointly conditional on the nucleotide being mutated and the two immediately flanking nucleotides (one nt 5’ of the mutated site and one nt 3’ of the mutated site). Simulations were performed for 1,991 genes on human chromosome 1, with 1,000 mutations being applied per gene. Each mutation was reversed prior to sampling the next mutation, so that each mutated sequence had an edit distance of exactly 1 from the original genomic sequence.
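A minimal sketch of this sampling scheme is given below. It makes several assumptions that are not taken from the text: the substitution table is a uniform placeholder rather than the matrix estimated by Allen et al. (2013), positions are drawn uniformly from the interior of the sequence, and all function names are hypothetical.

```python
import random
from itertools import product

BASES = "ACGT"

def uniform_context_table():
    """Placeholder table: trinucleotide context -> {new central base: weight}."""
    table = {}
    for left, mid, right in product(BASES, repeat=3):
        table[left + mid + right] = {b: 1.0 for b in BASES if b != mid}
    return table

def single_mutants(seq, table, n_mutations, rng):
    """Sample n_mutations independent single-nucleotide mutants of seq.

    Each mutant is generated from the original sequence, which is equivalent
    to reversing each mutation before sampling the next one, so every mutant
    has an edit distance of exactly 1 from seq.
    """
    mutants = []
    for _ in range(n_mutations):
        pos = rng.randrange(1, len(seq) - 1)      # needs both flanking bases
        targets = table[seq[pos - 1:pos + 2]]     # condition on the context
        new_base = rng.choices(list(targets), weights=list(targets.values()))[0]
        mutants.append((pos, seq[:pos] + new_base + seq[pos + 1:]))
    return mutants
```

Passing a seeded random.Random instance as rng makes a simulation repeatable.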


Each mutation was then assessed as to its ability to create or destroy a splice site.

For annotated splice sites, if a mutation changed a canonical donor (GT, GC, or AT) or acceptor (AG or AC) splice consensus to a non-consensus, the site was counted as disrupted. In addition, if a mutation did not change the consensus but did modify a flanking position such that the splice-site score dropped below the threshold of the logistic sensor (chosen to admit 99% of training splice sites), it was also counted as disruption of the splice site.
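The two disruption criteria can be sketched as a single predicate. Here score_fn stands in for the logistic signal sensor and threshold for its 99% admission cutoff; both, along with the function name, are hypothetical placeholders supplied by the caller.

```python
DONOR_CONSENSUSES = {"GT", "GC", "AT"}
ACCEPTOR_CONSENSUSES = {"AG", "AC"}

def is_disrupted(mutant_seq, consensus_pos, site_type, score_fn, threshold):
    """Decide whether an annotated splice site is disrupted in a mutant.

    consensus_pos is the index of the first base of the 2 bp consensus and
    site_type is "donor" or "acceptor". score_fn(seq, pos) must return the
    signal-sensor score of the window around the site.
    """
    valid = DONOR_CONSENSUSES if site_type == "donor" else ACCEPTOR_CONSENSUSES
    # Criterion 1: the consensus itself was changed to a non-consensus.
    if mutant_seq[consensus_pos:consensus_pos + 2] not in valid:
        return True
    # Criterion 2: consensus intact, but the site score fell below threshold.
    return score_fn(mutant_seq, consensus_pos) < threshold
```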

For mutations that did not disrupt a splice site, three criteria levels were applied to determine whether the mutation could create a de novo splice site. For the least stringent level, if the mutation changed a non-consensus 2 bp sequence to any consensus sequence for either donor or acceptor splice sites, the mutation was counted as creating a de novo splice site. For the second stringency level, if a non-consensus 2 bp sequence was changed to a consensus sequence, the corresponding logistic signal sensor was applied to a window (section 4.5.2) containing the consensus and the logistic sensor was used to classify the sequence as a splice site or a non-splice site. Only sites in which both a non-consensus 2 bp sequence was transformed by the mutation into a consensus sequence and the logistic sensor classified the window as a splice site were counted as creation events. For the most stringent level, a mutation was required to create a 2 bp splice site consensus, be classified by the signal sensor as a splice site, and occur in a favorable exon definition context to be counted as a de novo splice site. Exon definition potential was measured using the logistic content sensors. For de novo splice sites occurring in annotated introns, the intervening sequence between the de novo splice site and the exon it would extend was evaluated using the exon content sensor. For de novo splice sites occurring in annotated exons, the propensity to shorten the exon was measured by applying the intronic content sensor to the sequence that would become intronic.

Thresholds for signal and content sensors were selected so as to admit 99% of training sequences. Three simulations were performed using the same random number seed, one simulation per stringency level.


Chapter 5 – Detecting allele-specific effects on the activity of transcriptional enhancers

Portions of the text and figures in this chapter were previously included in the following publication:

Vockley CM*, Guo C*, Majoros WH*, Nodzenski M, Scholtens DM, Hayes MG, Lowe WL, Reddy TE (2015) Massively parallel quantification of the regulatory effects of non-coding genetic variation in a human cohort. Genome Research 25:1206-1214.

* Denotes shared first-authorship.

Author contributions: WHM performed computational analyses to compute allelic effects from STARR-seq results and assess their statistical significance, assessed reproducibility, assessed haplotype effects and additivity of effects of variants within haplotypes, and demonstrated fine-mapping in eQTLs. TER performed allele-specific H3K27ac analyses to corroborate fine-mapping analyses. CMV and GC performed experiments. TER, CMV, GC, and WHM wrote the manuscript.

5.1 Introduction

There are now several examples of noncoding genetic variants that alter the activity of regulatory elements and contribute substantially to complex traits and human diseases (Olansky et al., 1992; Nicolae et al., 2010; Maurano et al., 2012; Corradin et al., 2014; Stadhouders et al., 2014; Guo et al., 2015). Such examples are likely representative of a larger trend that genetic variations in regulatory elements are a major contributor to complex phenotypes and disease (Maurano et al., 2012; Gusev et al., 2014). Genetic effects on gene regulation are pervasive, as demonstrated by association studies revealing expression quantitative trait loci (eQTL) for the majority of human genes (Cantor et al., 2010; Stranger and Raj, 2013; Battle et al., 2014). Recent studies have further demonstrated that genetic variants associated with DNase I hypersensitivity, a strong predictor of the presence of a regulatory element, explain a substantial proportion of eQTLs (Degner et al., 2012), and individuals who are heterozygous in those elements likely have heritable allele-specific open chromatin and transcription factor binding (Birney et al., 2010; McDaniell et al., 2010; Reddy et al., 2012). Although there is now much evidence supporting the contributions of regulatory variation to human phenotypes, systematically identifying the specific variants and regulatory elements that contribute to phenotype remains a major challenge.

One of the major reasons that challenge remains is that patterns of recombination across the genome limit the resolution of genetic association studies and prevent the identification of specific causal variants. That limitation motivates the development of complementary empirical approaches to assay the consequences of noncoding genetic variation on regulatory element activity (Feng et al., 2013; Fogarty et al., 2014; Stadhouders et al., 2014; Guo et al., 2015). In a reporter gene expression assay, for example, a gene regulatory element is cloned into a plasmid, where the element can control the expression of a fluorescent or chemiluminescent protein. The plasmid is then transfected or infected into cells, and the activity of the regulatory element is estimated by measuring the expression of the reporter gene. Several examples have now shown that reporter assays are a valuable tool to compare the function of genetically different versions of the same regulatory element and to identify noncoding variants that explain genetic associations with gene expression and phenotypes (Fogarty et al., 2014; Guo et al., 2015). Recent advances have dramatically increased the throughput of reporter assays by embedding molecular barcodes within the reporter gene that can later be observed with DNA sequencing (Patwardhan et al., 2009; Kwasnieski et al., 2012; Melnikov et al., 2012; White et al., 2013), and the regulatory activity of more than one million unique DNA fragments can now be assayed in a single experiment using such massively parallel reporter assays (Arnold et al., 2013).

Here, we have developed a novel high-throughput approach to efficiently measure the activity of regulatory elements captured from the genomes of a human study population. Previous approaches to identify genetic effects on regulatory element activity have used DNA synthesis and random mutagenesis to generate mutations in select regulatory elements (Patwardhan et al., 2009; Melnikov et al., 2012; White et al., 2013). By instead assaying putative regulatory elements captured from donor genomes, members of the Reddy lab developed an assay for high-throughput empirical measurement of the effects of regulatory variants specific to a study population. Because that method maintains haplotypes within each regulatory element, empirical measurement of the combined effects of all common, rare, and personal variants within a regulatory element is possible. The result is individual-specific measurements of regulatory element activity across the study population. Because candidate regulatory elements are assayed independently of one another, the approach is an effective strategy to identify causal mutations within large regions of statistical association between genotype and phenotype. Together, these results demonstrate that population-scale functional reporter assays are a valuable strategy for identifying specific causal genetic variants and haplotypes within genomic loci previously associated with phenotype.

5.2 Results

5.2.1 Population-scale reporter assay approach

We designed an empirical strategy to measure the activity of specific candidate regulatory elements across a population of individuals (Figure 21A). The strategy is based on the STARR-seq assay (Arnold et al., 2013) in combination with computational analyses to infer regulatory effects. Briefly, in STARR-seq, candidate regulatory elements are cloned into the 3′ untranslated region (UTR) of a reporter gene. The resulting plasmid pool is then transfected into host cells, where the cloned elements can regulate expression of the reporter gene in which they are embedded. High-throughput sequencing of the 3′ UTR of the expressed reporter gene mRNA can then be used to estimate the regulatory activity of each element.


Figure 21: (A) Population STARR-seq (“Pop-STARR”) adapts the STARR-seq assay to measure regulatory potential of multiple alleles cloned from a study population. (B) Population STARR-seq is highly reproducible. Rep1–7 are biological replicates generated from independent transfections. The x- and y-axes represent element activity (output RNA reads / input DNA reads). In each case, Spearman’s ρ > 0.90.

To leverage the STARR-seq approach to measure the activity of candidate regulatory elements across a population of individuals, we first generate a targeted sequencing library of regulatory elements from donor genomes. We then modify the resulting fragment libraries such that the sequence of the terminal 15 bp at each end of each fragment matches the ends of the cloning site in the STARR-seq backbone. We then insert the captured regulatory elements into the STARR-seq backbone and expand the resulting input library in Escherichia coli. To assay the activity of each captured fragment, we transfect the input library into a human liver carcinoma cell line, HepG2, and use 250-bp paired-end sequencing to observe the abundance of each allele of each element in the input pool of transfected DNA and in the expressed reporter gene mRNA. Using an allele-specific analysis strategy, we then estimate the effect of each allele on regulatory element activity. We call this approach population STARR-seq, or Pop-STARR.

5.2.2 Targeted sequencing of candidate regulatory elements from a GWAS population

As a demonstration of the aforementioned approach, we focused on candidate regulatory elements from a 250-kb region on Chromosome 3 (3q25) that was previously found to be associated with measures of adiposity at birth (Urbanek et al., 2013). We selected the regions to assay based on evidence from the ENCODE Project Consortium (2012) that suggests potential regulatory activity. Specifically, we aggregated open chromatin data from 40 different cell types relevant to metabolism, which yielded an initial set of 128 open chromatin sites. We further prioritized those sites by selecting DNase I hypersensitive sites (DHSs) that were present in at least two cell lines, resulting in a total of 104 DHSs. We designed 174 PCR amplicons to amplify from the 104 candidate regulatory elements. The amplicons had an average length of 409 bp. We then used multiplex PCR to amplify those elements from 95 individuals at the extremes of adiposity in the genetic association cohort (Urbanek et al., 2013).

To quantify the genetic variation in the captured elements, we sequenced the regions using paired-end 250-bp sequencing. That read length was sufficient to observe the entire sequence of each amplicon. Sequencing was completed to a median depth of 1500×, resulting in the identification of 321 genetic variants in the captured elements. Twenty-three percent of the variants identified were specific to the study population as determined by their absence from dbSNP and the 1000 Genomes Project Consortium database (Sherry et al., 2001; The 1000 Genomes Project Consortium, 2012). The ratio of transitions to transversions was similar between the captured variants and those found in the 1000 Genomes Project, suggesting that the novel variants were unlikely to be due to systematic sequencing errors. We identified a substantially greater fraction of rare and personal variants in our targeted sequencing, likely due to increased sequencing depth that supported more highly powered variant calling. The preponderance of study-specific variants emphasizes the importance of assaying regulatory elements captured from the genomes of the study population rather than from a separate cohort.

5.2.3 Quantifying the effects of noncoding variation in a GWAS population

To quantify the activity of the captured candidate regulatory elements, we cloned the captured amplicons into the 3′ UTR of the STARR-seq reporter gene (Arnold et al., 2013) to generate an input plasmid library. The input library covered 99% of the targeted sequence and included both alleles of 88% of the variants observed in targeted sequencing of the region at a median coverage of approximately 2200×. We then performed seven independent transfections of the input library into HepG2 cells and used targeted high-throughput sequencing of the expressed reporter gene transcripts to measure the allele-specific regulatory activity for each amplicon. The sequencing generated a median coverage of the target amplicons of approximately 13,000× and assayed both alleles of 283 of 321 SNPs detected in the input library. Of the assayed SNPs, 83 (29%) were rare, defined as a minor allele frequency <1%. We observed a similar fraction of rare SNPs in the input library (32%), suggesting that there was minimal bias against rare variants in the assays.

There was strong correlation between the allele ratios in each pair of output libraries (Spearman’s ρ between 0.90 and 0.97) (Figure 21B), demonstrating reproducibility of the assay. To identify individual variants that have a statistically significant effect on regulatory activity after taking into account differences in read depth, we pooled reads from the replicate output libraries and compared relative variant abundance to the input library using Fisher’s exact test, as described below. We identified 27 common and nine rare regulatory variants with a false discovery rate (FDR) < 5%. The identified variants had fold changes in regulatory activity ranging from 0.25 to 3.96, consistent with previous observations using saturation mutagenesis of enhancers (Patwardhan et al., 2012).

5.2.4 Identifying regulatory variants in population STARR-seq

Haplotype sequences were imputed using the phased VCF file by inserting phased variants into reference sequences from the hg19 genome assembly. Sequencing reads were aligned to these haplotypes using Bowtie 2 (Langmead and Salzberg, 2012) with strict match parameters (mismatch, gap open, and gap extend penalties all set to 100) to ensure exact matching to individual haplotypes. Read counts at each SNP were tallied using SAMtools mpileup (Li et al., 2009). Replicates were pooled to increase statistical power. SNPs having fewer than two reads of either input DNA or pooled RNA were discarded from further analysis. Fisher’s exact test was used to detect significant differences in minor allele frequency between input DNA and output RNA; a pseudocount of 1 was added to each table entry in Fisher’s exact test. Two-tailed P-values were adjusted to control the false discovery rate (FDR) to <5% via procedure p.adjust() in the standard R package “stats” (R Core Team, 2015), which implements the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). Of 283 SNPs tested, 36 were found significant at an FDR-adjusted level of 0.05. SNP effect sizes for each allele were computed as the ratio of normalized read counts between variants:

(RNA0/DNA0)/(RNA1/DNA1) for DNA and pooled RNA read counts for alleles 0 and 1.
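The statistical steps above can be illustrated with a small, dependency-free sketch. This is not the analysis code used in this work (which relied on R): the function names are hypothetical, the 2×2 table is assumed to have input DNA and pooled RNA as rows and the two alleles as columns, and the two-sided P-value sums the probabilities of all tables no more likely than the observed one, which is the convention used by R's fisher.test.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def allelic_p_value(dna0, dna1, rna0, rna1):
    """Test for an allele-frequency shift, with a pseudocount of 1 per cell."""
    return fisher_exact_two_sided(dna0 + 1, dna1 + 1, rna0 + 1, rna1 + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k                      # rank of pvals[i] in ascending order
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def snp_effect_size(dna0, dna1, rna0, rna1):
    """Effect size (RNA0/DNA0)/(RNA1/DNA1) from raw read counts."""
    return (rna0 / dna0) / (rna1 / dna1)
```

At sequencing depths in the thousands, the exact binomial coefficients become large integers and this loop is slow; an established statistics library would be the practical choice for real data.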

As an experimental validation of the method, five regulatory variants identified by Fisher’s exact test as having significant allelic effects were tested via a standard luciferase assay. In all five cases, the allele with the higher STARR-seq expression showed higher luciferase expression.

5.2.5 Effects of haplotypes on regulatory element activity

For 98 of the amplicons, there was more than one polymorphic site, allowing us to ask whether multiple variants act independently to alter regulatory element activity at the haplotype level. To investigate that possibility, we computed haplotype effect sizes analogous to SNP effect sizes, as normalized ratios for each haplotype versus all pooled haplotypes at a locus:

(RNAhaplotype/DNAhaplotype) / (RNApooled/DNApooled)

Statistical significance was assessed via Fisher’s exact test. That analysis allowed us to estimate the relative expression of each of the more than 450 distinct haplotypes assayed and revealed 24 haplotypes across 16 amplicons that significantly altered regulatory element activity (adjusted P < 0.05, Fisher’s exact test). We then evaluated the extent to which the independent contributions of the estimated effects of each SNP in a haplotype predicted the observed activity of the entire haplotype (Figure 22). The correlation between the effects predicted by individual SNPs and the effects of the haplotype (r = 0.54, P = 0.007) supports an overall consistency between SNP effects and their combination into haplotype effects.
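The comparison between observed and predicted haplotype effects can be sketched as follows. The function names are hypothetical, and the sketch assumes that each SNP effect size passed in is already oriented so that it refers to the allele actually carried on the haplotype; under independent (multiplicative) action of the variants, the predicted log effect is then the log of the product of the SNP effects.

```python
import math

def haplotype_effect(rna_hap, dna_hap, rna_pooled, dna_pooled):
    """Observed effect: (RNA_hap/DNA_hap) / (RNA_pooled/DNA_pooled)."""
    return (rna_hap / dna_hap) / (rna_pooled / dna_pooled)

def predicted_log_effect(snp_effects):
    """Log of the product of per-SNP effect sizes on a haplotype."""
    return sum(math.log(e) for e in snp_effects)

def pearson_r(xs, ys):
    """Pearson correlation between predicted and observed log effects."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```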


Figure 22: Correlation between SNP effect sizes (x-axis: log of product of effect sizes of SNPs on haplotype) and log of observed haplotype effects (y-axis) for putative regulatory haplotypes containing more than one SNP (r = 0.54, P = 0.007). Observed haplotype effect sizes were computed as normalized ratios for each haplotype versus all pooled haplotypes at a locus: (RNAhaplotype/DNAhaplotype)/(RNApooled/DNApooled). Solid line: regression line (slope = 0.8, intercept = −0.019); dotted line: 1:1 diagonal.

However, there was substantial residual variation that may be due to either experimental noise or synergistic effects between variants within haplotypes. Measuring haplotype-scale effects in larger populations will also be important to establish the distribution of natural functional variation in regulatory elements and may provide insights into the role of gene regulation in a wide variety of biological processes.


5.2.6 Fine mapping genetic associations with phenotypes

One of the major goals of functionally evaluating regulatory variants is to determine genetic effects on regulatory element activity that may explain genetic associations with phenotypes. To demonstrate that our strategy can support such fine mapping, we investigated a set of SNPs associated with the expression of a long noncoding RNA, LINC00881, in the region. Specifically, the Geuvadis project (Lappalainen et al., 2013) identified a cluster of nine eQTLs associated with the expression of LINC00881 in lymphoblastoid cell lines (LCLs). The variants associated with LINC00881 span ∼12 kb of the genome. The statistical significance of the association with LINC00881 was similar across all nine variants, likely due to high linkage disequilibrium across the region (Figure 23C). Four of the nine eQTLs were also assayed in the 95 individuals with our population-scale reporter assays. Only one variant, rs73170828, located 242 bp upstream of the annotated LINC00881 transcription start site, significantly altered reporter gene expression (FDR = 0.02). In the eQTL analysis and in our population-scale reporter assays, the reference allele of rs73170828 was associated with increased gene expression and increased regulatory activity, respectively (Figure 23D). Together, these results suggest that the promoter-proximal variant rs73170828 is a causal variant that regulates the transcription of LINC00881 and explains the association of the other eQTLs in the region.


As independent support of the regulatory function of rs73170828, we searched for evidence of allele-specific histone 3 lysine 27 acetylation (H3K27ac), a histone modification associated with active gene regulation (Creyghton et al., 2010). In ChIP-seq experiments performed on LCLs derived from five individuals heterozygous for rs73170828 (Kilpinen et al., 2013), there was substantially higher H3K27ac on the reference allele across the LCLs (P = 0.058, paired Wilcoxon test). Furthermore, there was an overall significant increase in the number of reads aligning to the reference allele when compared to a null model in which the same proportion of reads align to each allele (binomial P = 0.004). Those results are concordant with increased regulatory activity of the reference allele in our reporter assays and increased LINC00881 expression. The second closest assayed variant, rs62274098, did not have significant allele-specific H3K27ac (binomial P = 0.92), suggesting again that rs73170828 and not neighboring variants mechanistically contributes to the expression of LINC00881 (Figure 23E). Together, these results show that our novel approach for quantifying the effects of noncoding variation on gene regulation within cohorts reveals likely causal variants that contribute to genotype-phenotype associations.
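The binomial comparison against a null model of equal read proportions can be illustrated with an exact test such as the one below. This is a sketch under assumptions: the function name is hypothetical, and the test is taken to be two-sided, which the text does not state.

```python
from math import comb

def binomial_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test P-value.

    k is the number of reads carrying the reference allele out of n reads
    overlapping the variant; p is the null proportion (0.5 for no allelic
    imbalance). The P-value sums the probabilities of all outcomes that are
    no more likely than the observed one.
    """
    probs = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    total = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, total)
```

For example, 9 reference-allele reads out of 10 gives a P-value of 22/1024, or about 0.021, under this convention.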


Figure 23: Comprehensive measurement of haplotype-specific regulatory element activity provides mechanistic insights into gene regulation. (A) Distribution of enhancer activity scores for fragments containing regulatory variants (red) and fragments containing non-regulatory variants (blue). (B) Histogram of number of SNPs per assayed element. (C) Manhattan plot of eQTLs for the long noncoding RNA LINC00881. Blue dots indicate −log10(P-value) of LINC00881 eQTL from the Geuvadis database (left y-axis); red bars indicate −log10(FDR) for variants that alter regulatory activity in the population STARR-seq assay (right y-axis). Red dotted line indicates FDR = 1.0. (D) Association between normalized expression of long noncoding gene LINC00881 in LCLs as measured by the Geuvadis project (y-axis) and the measured effect size in population STARR-seq assay (x-axis) for SNP rs73170828 (r² = 0.07, P = 7.6×10⁻⁹). (E) Allele-specific H3K27ac analysis of variants rs62274098 and rs73170828, both eQTLs proximal to and 5′ of LINC00881; read counts (y-axis) differed substantially between alleles for rs73170828 (Wilcoxon P = 0.058, binomial P = 0.004) but not for rs62274098 (Wilcoxon P = 0.9; binomial P = 0.92).

5.3 Discussion

In this work, we developed a novel high-throughput empirical approach to measure the regulatory effects of noncoding human genetic variation directly from the DNA of individuals from a population-based study cohort. The ability to assay directly from cohort DNA samples is an important distinction from previous high-throughput reporter assays because it allows investigation of variants and haplotypes that are not present in existing databases of human genetic variation. As rare variants are typically not observed frequently enough to support a statistical association, rare-variant burden tests instead collapse or aggregate variants and correlate the overall burden of those variants with phenotypes (Li and Leal, 2008; Zawistowski et al., 2010). Although burden testing within the coding regions of the genome can leverage predicted effects on the resulting protein (Choi et al., 2012; Hu et al., 2013), modeling regulatory element activity based on sequence alone remains a major challenge. Measuring regulatory activity directly from cohort DNA provides a possible empirical solution that allows the regulatory machinery of the cell to determine the cumulative effects of all regulatory variation in the element tested and allows for inference about the activity of that regulatory element that would not be possible otherwise.

The ability to associate empirically measured regulatory function and phenotype is especially needed in light of recent studies suggesting that coordination of regulatory effects between alleles may explain how weak effects of individual noncoding variants contribute to overall phenotypes (Corradin et al., 2014; Stadhouders et al., 2014; Guo et al., 2015). As we have shown, assaying regulatory elements outside the context of genetic linkage enables identification of individual regulatory elements that contribute to observed associations with gene expression. Importantly, however, genetic linkage is

maintained within each individual regulatory element tested. That feature allows for measuring the effects of regulatory element haplotypes on element activity without the confounding effects of a nearby regulatory element. For those reasons, the approach described here can both resolve independent effects in multiple regulatory elements and maintain local epistatic interactions between variants within an individual element.

For any complex disease, multiple types of cells are likely relevant to an observed phenotype. Additionally, the causal regulatory elements may only be active under certain environmental conditions, or an interaction with the environment may amplify the effect. Transient reporter assays have been shown to recapitulate cell-type- and environment-specific gene regulation (Pennacchio et al., 2006; Gisselbrecht et al.,

2013; Shlyueva et al., 2014). Because the input plasmid libraries generated in this study are a renewable resource that can be readily expanded in E. coli, the same captured regulatory elements can be assayed in numerous cell models and environmental contexts. Doing so may have particular benefit for identifying the specific cells or environments that are more relevant to a given genetic association signal.

There are both advantages and disadvantages intrinsic to the architecture of the

STARR-seq assay platform. Among the advantages is the potential to characterize dual functioning enhancer-promoters (Arnold et al., 2013). We detected regulatory variants within TSS-proximal regions of two of the three genes located within our test locus,

suggesting that the elements that contain these variants serve as dual-function enhancer-promoters. The approach is limited by the observation that enhancers often have promoter-specific activity in transient transfection assays, indicating that alternative promoters may be required in some cases (Zabidi et al., 2015). Addressing those shortcomings will further increase the ability to assign regulatory causes to genetic associations.

Taken together, the approach demonstrated here enables measurement of the functional variation in regulatory activity across human populations and provides a novel and general path forward to identify disease-related perturbations in regulatory mechanisms after the completion of a genome-wide association study.

Detailed experimental methods and supplemental figures can be found in (Vockley et al., 2015).

5.4 Future directions

5.4.1 Scale and bias

Additional work remains to generalize the approach described above to work effectively and efficiently for longer target regions and larger populations.

In the work described above, the 250 bp paired-end sequencing reads spanned the entire amplicons. Because the amplicons were short, explicit haplotype sequences could be constructed for all regions from all individuals. After de-duplicating haplotypes, reads could be assigned unambiguously to haplotypes. When scaling up to

longer regions from larger populations, the number of possible haplotypes increases. In the limit of whole genomes, space requirements for storing the haplotypes become demanding, and indexing of the sequences for alignment becomes time-consuming.

Moreover, due to sharing of haplotype blocks between individuals, a given sequencing read may align to many whole haplotypes that share a haplotype block, and some alignment programs will interpret such reads as originating from repetitive DNA and will discard them. Thus, an alternative representation of haplotypes is needed.

The need for representing explicit haplotype sequences derives from the issue of reference bias. Aligning reads to a single reference genome could potentially induce a bias against the alleles not contained in the reference, due to the aligner penalizing mismatches between a read and the reference (Degner et al., 2009). Such biases might cause either false positives or false negatives when testing the statistical significance of variants and haplotypes. Reference bias has previously been avoided by constructing explicit haplotype sequences from a phased VCF file (e.g., Rozowsky et al., 2011; Reddy et al., 2012).

Recent developments in methods for representing populations of genomes promise to alleviate the issue of reference bias in an efficient and elegant way. Variant graphs, also called genome graphs or graph genomes (reviewed in Paten et al., 2017), represent populations of genomes in a way that collapses shared subsequences so that they are not duplicated. Individual haplotypes correspond to individual paths through

the graph. Short genetic variants such as single-nucleotide polymorphisms (SNPs) give rise to bubbles where an individual follows one path across the bubble, while the reference genome follows the other path across the bubble. Methods for efficiently aligning sequencing reads to variant graphs have been recently proposed (e.g., Sirén et al., 2014). These methods are only now being implemented in software and made available to users (e.g., HISAT2: Kim et al., 2015). When mature, stable implementations of these approaches are available, they could be incorporated in the bioinformatic analysis pipelines for population STARR-seq, as the elimination of reference bias in read alignment may improve detection of regulatory variants.
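As a rough illustration of why a variant graph is more compact than a set of explicit haplotypes, the following sketch stores shared segments once and represents each SNP as a two-branch bubble. The toy sequence, the `segments` layout, and the function names are hypothetical and are not taken from any existing graph-genome software.

```python
from itertools import product

# Hypothetical toy region: a plain string is a segment shared by all haplotypes;
# a tuple is a SNP "bubble" holding (reference allele, alternate allele).
segments = ["ACGT", ("A", "G"), "TTC", ("C", "T"), "GA"]


def haplotype_paths(segments):
    """Enumerate every path through the graph as an explicit haplotype string."""
    choices = [seg if isinstance(seg, tuple) else (seg,) for seg in segments]
    return ["".join(path) for path in product(*choices)]


def graph_size(segments):
    """Count the bases stored when shared segments are kept only once."""
    total = 0
    for seg in segments:
        if isinstance(seg, tuple):
            total += sum(len(allele) for allele in seg)
        else:
            total += len(seg)
    return total


paths = haplotype_paths(segments)
explicit_size = sum(len(p) for p in paths)  # bases needed for explicit haplotypes
compact_size = graph_size(segments)         # bases needed for the graph
```

With two bubbles the graph encodes four haplotypes; each additional independent bubble doubles the number of explicit haplotypes but adds only one extra base to the graph.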

5.4.2 Statistical modeling of allelic effects

While experimental validation using a luciferase reporter assay agreed with predictions based on Fisher’s exact test, there is room for improvement in the statistical approach used here. Fisher’s exact test has been shown to be overly conservative (Crans and Shuster, 2008). One alternative for testing that the allele ratio in the RNA differs from the allele ratio in the DNA at a single bi-allelic locus is via the beta-binomial distribution:

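$$
\begin{aligned}
P(k_{RNA} \mid n_{RNA}, k_{DNA}, n_{DNA})
&= \int_0^1 \mathrm{Beta}(p \mid k_{DNA} + 1,\; n_{DNA} - k_{DNA} + 1)\, \mathrm{Binom}(k_{RNA} \mid n_{RNA}, p)\, dp \\
&= \binom{n_{RNA}}{k_{RNA}} \frac{1}{B(k_{DNA} + 1,\; n_{DNA} - k_{DNA} + 1)} \int_0^1 p^{\,k_{RNA} + k_{DNA}} (1 - p)^{\,n_{RNA} - k_{RNA} + n_{DNA} - k_{DNA}}\, dp
\end{aligned}
$$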

where kRNA is the number of RNA reads containing the alternate allele; kDNA is the

number of DNA reads containing the alternate allele; nRNA and nDNA are the total read counts in RNA and DNA, respectively, for this locus; Beta(p|a, b) is the beta distribution;

Binom(k|n, p) is the binomial distribution; and B(a, b) is the beta function. As the beta-binomial is a posterior predictive distribution, it gives the probability of newly observed data, conditional on previously observed data, assuming that both data sets are drawn from the same distribution. As such, it can serve as the null distribution for the hypothesis that the allele ratios in RNA and DNA do not differ. Rejecting this null hypothesis implies a regulatory effect of one allele. Unlike Fisher’s exact test, which models sampling without replacement via a hypergeometric distribution, the beta-binomial models sampling with replacement under a Bernoulli process with fixed but unknown parameter. To the extent that an allelic bias in transcription initiation rate is not affected by the sampling process, the beta-binomial may be more appropriate than Fisher’s exact test for this application.
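The posterior predictive probability described above has a closed form in terms of the beta function, because the remaining integral equals B(kRNA + kDNA + 1, nRNA − kRNA + nDNA − kDNA + 1). The following minimal sketch evaluates it in log space using only the Python standard library. The function names are hypothetical, and the two-sided p-value (summing the probabilities of all outcomes no more likely than the observed one) is one possible convention assumed here for illustration; it is not a procedure specified in this work.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(k_rna, n_rna, k_dna, n_dna):
    """P(k_rna | n_rna, k_dna, n_dna) under the beta-binomial null."""
    a = k_dna + 1          # first parameter of the beta posterior from DNA
    b = n_dna - k_dna + 1  # second parameter of the beta posterior from DNA
    return exp(log_choose(n_rna, k_rna)
               + log_beta(k_rna + a, n_rna - k_rna + b)
               - log_beta(a, b))


def two_sided_p(k_rna, n_rna, k_dna, n_dna):
    """Assumed convention: sum over outcomes no more probable than observed."""
    observed = beta_binomial_pmf(k_rna, n_rna, k_dna, n_dna)
    total = 0.0
    for k in range(n_rna + 1):
        p = beta_binomial_pmf(k, n_rna, k_dna, n_dna)
        if p <= observed * (1 + 1e-9):  # tolerance for floating-point ties
            total += p
    return min(1.0, total)
```

For example, `two_sided_p(90, 100, 500, 1000)` asks how surprising 90 alternate-allele RNA reads out of 100 would be if the DNA library showed 500 out of 1000.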

Neither of these tests accounts for variability between experimental replicates. In the tests performed here, counts were simply pooled between replicates. Effectively accounting for variability between replicates might be accomplished using methods similar to those applied in differential gene expression analysis (e.g., Robinson et al., 2010). Work on adapting the latter methods to accommodate allelic effects is ongoing.

Chapter 6 – Integrative modeling of dynamic epigenetic enhancer signatures

Portions of this chapter are included in modified form in the following manuscript which is being prepared for submission to an academic journal:

McDowell IC, Barrera A, Hong LK, Leichter SM, Majoros WH, Dumitrascu B,

Luo K, D'Ippolito AM, Song L, Safi A, Vockley CM, Lu J, Kocak DD, Bartelt LC,

Gersbach CA, Hartemink AJ, Engelhardt BE, Crawford GE, Reddy TE (2017)

Glucocorticoid Receptor Recruits to Enhancers and Drives Reprogramming by Motif-directed Binding. In preparation, order of authorship not yet determined.

Author contributions: WHM designed, implemented, and tested the hidden

Markov model, and performed clustering, motif enrichment, and gene expression analyses. ICM formulated the problem and preprocessed the data to serve as inputs to the model. AMD generated Hi-C data that was used to associate enhancers with genes.

LKH, SML, and LCB performed DNase-seq and ChIP-seq experiments. CMV performed

STARR-seq experiments to identify genomic elements for training and validation.

6.1 The glucocorticoid signaling system

The ability of the cell to respond adaptively to external stimuli is critical to cellular viability. In multicellular organisms, intercellular communication via signaling molecules is mediated by nuclear receptors that respond to the presence of external signals, such as hormones, by binding to DNA and initiating a genomic response. That genomic response can entail entire cascades of gene expression changes as well as

epigenetic modifications to both genes and distal regulatory elements. Such is the case with the glucocorticoid receptor (GR).

Glucocorticoids (GCs) are steroid hormones that mediate the response to stress (Denver, 2009). Synthetic glucocorticoids, such as dexamethasone (dex), are widely used as anti-inflammatory drugs (Hsiao et al., 2010). Treatment of eukaryotic cells with dex induces GR, which in the non-induced state is sequestered in the cytoplasm by heat-shock proteins, to translocate into the nucleus. As noted in section 2.2.1, GR is capable of binding directly or indirectly to DNA. Upon induction, GR binds to thousands of locations in the genome and regulates hundreds of genes (Wang et al., 2004; Chen et al., 2008; Reddy et al., 2009; John et al., 2011; Biddie et al., 2011). As also noted previously, different modes of GR binding appear to have different regulatory impacts (Vockley et al., 2016). In particular, GR is believed to interact with members of the AP-1 family of TFs, as well as other factors such as C/EBP (Grøntved et al., 2013), to functionally tune the response to GCs. GR can have different binding partners in different contexts, with different effects. As such, the GR-mediated response to GCs genome-wide is complex and not yet fully understood.

A first step toward improving our understanding of GR-mediated gene regulation is to identify GC-responsive enhancers genome-wide. With those elements in hand we could then ascertain which features of those elements (either singly or in combination) most strongly influence gene expression responses; we might also hope to

tease apart the regulatory logic that lies behind the distinct dynamics of individual GC-responsive genes. Drug-responsive enhancers can be identified using an episomal reporter assay such as STARR-seq, as described in Chapter 5. However, applying such methods genome-wide is at present an expensive and laborious undertaking, and indeed the first efforts to apply STARR-seq genome-wide in human cells are only just beginning (e.g., Muerdter et al., 2017). In contrast, ChIP-seq and ATAC-seq are routine and inexpensive assays that, together with bioinformatic models such as the one I describe below, can be used to identify functional enhancers genome-wide.

In this chapter I describe a model-based approach to identifying dex-responsive enhancers via their epigenetic signatures. The resulting model both identifies such enhancers and parses them into putative binding regions at nucleotide resolution, and can be expected to have direct applicability in prioritizing putative binding locations so as to identify the most likely TFs involved in the function of a given enhancer. This ability will in turn likely be useful for variant prioritization, as variants that disrupt binding sites for TFs have been implicated in disease (Kasowski et al., 2010; Reddy et al.,

2012).

6.2 Methods

6.2.1 A multivariate model of epigenetic dynamics

In order to study the GC response in great detail, hundreds of experiments were performed in our lab that assayed multiple epigenetic features at multiple points over a

12-hour timecourse in human A549 (lung carcinoma cell line) cells treated with dex. A previous study by our lab found that dex-responsive and non-responsive GR binding sites differ in histone marks H3K4me1 and H3K27ac, and in the presence of co-activator p300 (Vockley et al., 2016). Our lab assayed these features, as well as H3K4me2 and GR binding, at all time points via ChIP-seq. In addition, DNase hypersensitivity was assayed via DNase-seq, and gene expression was assayed via RNA-seq. Finally, Hi-C was performed to assay chromatin contacts.

It has been noted that binding sites for activating TFs in enhancers are often flanked by histone marks such as H3K4me2 and H3K27ac and are generally positioned within accessible chromatin in cell types in which they are functional (Ernst et al., 2011).

Indeed, computational models for predicting steady-state enhancers based on epigenetic marks such as these have been described, such as ChromHMM (Ernst and Kellis, 2010).

However, ChromHMM and other existing approaches are geared toward unsupervised clustering of epigenetic marks at fixed resolution, and do not model dynamics. Segway (Hoffman et al., 2012), for example, is capable of 1 bp resolution, but it is also geared toward unsupervised clustering.

I therefore designed and trained a multivariate hidden Markov model (MV-HMM) on known dex-responsive enhancers to learn spatial and temporal dependencies between chromatin features at nucleotide resolution. My earlier success in modeling miRNA binding sites in CLIP data (section 2.2.4.3; Majoros et al., 2013) suggested a

general methodology for designing MV-HMMs for novel pattern recognition problems in the supervised learning setting. In this methodology, a single state is allocated for each peak, upslope, downslope, or background region in the prototypical multivariate profile being modeled. Given that we expect the prototypical enhancer to contain an open chromatin region (“peak”) flanked by regions of histone modifications (“histone domains”), I chose a 5-state linear structure to capture the central peak, the two flanking histone domains, and optional background regions on the flanks (Figure 24A). Each state has a self-transition (in addition to a transition to the next state in sequence) to allow each state to emit a contiguous interval via a series of transitions back to itself.

The resulting model generates a single enhancer and then terminates.
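The linear topology described above can be sketched as a transition matrix in which each state either returns to itself or advances to the next state, with the last state advancing to a terminal END column. This is only an illustrative sketch: the state names and self-transition probabilities below are hypothetical placeholders rather than the trained parameters shown in Figure 24A, and the optional nature of the flanking background states is not modeled here.

```python
STATES = ["background_left", "histone_left", "peak",
          "histone_right", "background_right"]


def linear_chain(stay_probs):
    """Build a left-to-right transition matrix with one extra END column.

    Row i holds the probability of staying in state i (self-transition) and
    of advancing to state i + 1; the final state advances to END.
    """
    n = len(stay_probs)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for i, stay in enumerate(stay_probs):
        matrix[i][i] = stay            # self-transition: extend the interval
        matrix[i][i + 1] = 1.0 - stay  # advance to the next state (or END)
    return matrix


def expected_length(stay):
    """Mean number of positions emitted by a state (geometric duration)."""
    return 1.0 / (1.0 - stay)


# Hypothetical self-transition probabilities, one per state.
trans = linear_chain([0.99, 0.995, 0.98, 0.995, 0.99])
```

Because each state's duration is geometric under this design, a self-transition probability of 0.98 corresponds to an expected interval of about 50 positions.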

Each state possesses a joint probability distribution over DNase-seq and ChIP-seq features (DNase, p300, H3K4me1, H3K4me2, H3K27ac) for two time points: time 0 (pre-dex-treatment) and some later time point, t (t hours of dex treatment), for a total of ten features. Features measured at time t thus reflect cumulative changes in response to stimulus over a t-hour interval. These emission distributions are characterized by a mean vector (Figure 24B) and a full covariance matrix (Figure 24C), so that features are not assumed to be conditionally independent as in ChromHMM and Segway. The resulting model thus represents coordinated changes in spatial epigenetic signatures over time.

The model was implemented and trained using MUMMIE (Majoros et al., 2013).

In addition to full covariance matrices, MUMMIE permits the use of sparse matrices, which may be useful if a larger-dimensional version of this model were to be implemented in the future.

6.2.2 Training and evaluating the model

I trained the model on DNA fragments that our lab had immunoprecipitated with an antibody to GR and that our lab had found (via STARR-seq experiments) to enhance transcriptional activity after 3 hours of dex treatment, as compared to a pre-dex control (Vockley et al., 2016). Peaks in p300 ChIP-seq signal were called by MACS2 v.2.1.0.20151222 (Zhang et al., 2008b), and for each peak the 2000 bp interval centered on the peak was extracted and provided to the model for training. Features (DNase-seq and ChIP-seq for p300, H3K4me1, H3K4me2, and H3K27ac) were standardized to have mean zero and standard deviation 1 over all 12 hours, separately for each 2000 bp window.
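A minimal sketch of the standardization step follows. It assumes that each feature is standardized separately and that the mean and standard deviation are pooled over all time points within a window; the dictionary layout and function name are hypothetical.

```python
from statistics import mean, pstdev


def standardize_window(signal_by_time):
    """Standardize one feature within one window, pooling all time points.

    signal_by_time maps a time point (hours) to the per-base signal values
    for that feature in this window (hypothetical data layout).
    """
    pooled = [v for track in signal_by_time.values() for v in track]
    mu = mean(pooled)
    sigma = pstdev(pooled)  # population standard deviation of the pooled values
    if sigma == 0:
        # A constant signal carries no information; map it to all zeros.
        return {t: [0.0] * len(track) for t, track in signal_by_time.items()}
    return {t: [(v - mu) / sigma for v in track]
            for t, track in signal_by_time.items()}


# Toy window of 3 bases at two time points (values are made up).
window = {0: [1.0, 2.0, 3.0], 3: [2.0, 4.0, 6.0]}
scaled = standardize_window(window)
```

Pooling across time points preserves differences between time points within a window, which the model relies on, while removing differences in overall signal scale between windows.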

GR ChIP-seq was omitted as a predictive feature to enable detection of non-GR-bound dex-responsive enhancers. As described in (Vockley et al., 2016), elements were labeled as reporter-positive or reporter-negative by applying a Wilcoxon rank-sum test to test whether STARR-seq output was significantly higher in the post-dex than in the pre-dex condition.

Figure 24: Multivariate Hidden Markov model (MV-HMM) of GC-responsive regulatory elements. (A) State-transition diagram of multivariate HMM shown with transition probabilities. (B) Mean emission probability for each state, at pre-dex (bottom) and at 3 hr of dex exposure (top). (C) Covariance matrices for emission distributions of each state.

Expectation maximization (EM) (section 2.1.2.1) was employed for learning means and covariances of a separate multivariate Gaussian distribution for emissions within each state, as well as probabilities of transitioning between states. While MUMMIE supports finite mixtures of Gaussians for the emission distributions, mixtures were not used for this model. EM was continued until the log likelihood increased by less than

0.01. All runs converged in under 150 iterations. One hundred models were randomly initialized and trained via EM. The model achieving the highest likelihood on the training set was retained as the single foreground model. Only reporter-positive elements (1229 in number) were used in training this foreground model. A single-state background model was trained on reporter-negative elements (1235 in number).

Putative enhancers were scored by computing the log-likelihood ratio (LLR) of each enhancer under the foreground model versus the background model. Likelihoods were computed using the forward algorithm (section 2.1.2.1). Area under the receiver-operating characteristics curve (AUC) values were computed using these LLRs, via five-fold cross-validation. An additional set of 300 models was trained on the full set of 1229 reporter-positive elements and the model with the highest likelihood on the training set was selected for deployment genome-wide.
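The scoring procedure can be illustrated with a deliberately simplified sketch: a log-space forward algorithm, an LLR between a foreground and a background model, and an AUC computed from LLRs. For brevity it assumes one-dimensional Gaussian emissions and a three-state toy foreground model instead of the ten-dimensional, five-state model used here; all parameter values and names are hypothetical, and this is not the MUMMIE implementation.

```python
from math import exp, inf, log, pi


def safe_log(p):
    return log(p) if p > 0 else -inf


def log_gauss(x, mean, var):
    return -0.5 * (log(2 * pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    m = max(values)
    if m == -inf:
        return -inf
    return m + log(sum(exp(v - m) for v in values))


def forward_loglik(obs, model):
    """Log-likelihood of obs, summing over all state paths (forward algorithm)."""
    start, trans, end, means, variances = model
    n = len(start)
    alpha = [safe_log(start[s]) + log_gauss(obs[0], means[s], variances[s])
             for s in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[r] + safe_log(trans[r][s]) for r in range(n)])
                 + log_gauss(x, means[s], variances[s]) for s in range(n)]
    return logsumexp([alpha[s] + safe_log(end[s]) for s in range(n)])


def llr(obs, foreground, background):
    return forward_loglik(obs, foreground) - forward_loglik(obs, background)


def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy models: (start probs, transitions, end probs, emission means, variances).
foreground = ([1.0, 0.0, 0.0],
              [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 0.8]],
              [0.0, 0.0, 0.2],
              [0.0, 3.0, 0.0], [1.0, 1.0, 1.0])
background = ([1.0], [[0.9]], [0.1], [0.0], [1.0])

llr_peaked = llr([0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0], foreground, background)
llr_flat = llr([0.0] * 7, foreground, background)
```

In this toy setting a signal with a central peak receives a positive LLR and a flat signal receives a negative one, which is the behavior the AUC evaluation relies on.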

6.2.3 Quantifying and clustering enhancer dynamics over a 12-hour timecourse

While the model described above evaluates changes between only two time points, I devised a means of using this 2-timepoint model to compute quantitative

trajectories of enhancer response over a 12-hour timecourse, as follows. Given post-dex timepoints T = {0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12} hours, I applied the model to each pair (0, t) for t ∈ T, to compute the cumulative response at each successive timepoint relative to time 0 (pre-dex). This was performed for each 2000 bp window centered on a p300 peak genome-wide, producing an enhancer activity trajectory for each putative enhancer genome-wide.

Figure 25: Consistency of k-means clustering across multiple runs. Each heatmap represents mean within-cluster predicted dex-induced enhancer activity after k-means clustering with different random initializations. Whiter values represent lower activity, redder values represent higher activity.

To facilitate clustering of trajectories, a relative activity level for each time point was computed by transforming each LLR to a value of 0, 0.5, or 1, based on whether the

LLR was less than -1000, between -1000 and 1000, or greater than 1000, respectively.

Trajectories of these transformed activity levels over time were clustered via k-means

clustering with k = 10; repeated clustering from random initial assignments produced similar clustering results (Figure 25).
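The transformation of LLRs into relative activity levels can be sketched as follows. The treatment of LLRs exactly equal to ±1000 as intermediate (0.5) is an assumption, since the text does not state whether the boundaries are inclusive; the function names and example LLR values are hypothetical.

```python
# Post-dex time points (hours) at which the 2-timepoint model was applied.
TIMEPOINTS = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12]


def activity_level(llr, threshold=1000.0):
    """Map an LLR to a relative activity level of 0, 0.5, or 1."""
    if llr < -threshold:
        return 0.0
    if llr > threshold:
        return 1.0
    return 0.5  # assumed to include the boundary values themselves


def activity_trajectory(llr_by_time):
    """Convert per-timepoint LLRs for one enhancer into an 11-point trajectory."""
    return [activity_level(llr_by_time[t]) for t in TIMEPOINTS]


# Made-up LLRs for an enhancer that responds from 2 hr onward.
example = {t: (2500.0 if t >= 2 else -200.0) for t in TIMEPOINTS}
trajectory = activity_trajectory(example)
```

Each resulting trajectory is a vector of eleven values that could then be passed to any standard k-means implementation.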

6.2.4 Gene expression analysis

To obtain empirical support for the predicted activity levels and clustering of trajectories, I compared mean activity levels for each enhancer cluster to mean expression changes in genes that are likely to be regulated by those enhancers.

Physical interactions between enhancers and genes were inferred via direct overlap of Hi-C anchor intervals (D’Ippolito et al., 2017) with an enhancer at one end of an interaction loop and the TSS of a gene at the other end of the loop. TSS annotations were obtained from GENCODE version 22 (Harrow et al., 2012). Gene expression was measured using RNA-seq in the same cell type at each point in the 12-hour timecourse.

edgeR (Robinson et al., 2010) was used to compute log2 moderated fold-changes at each timepoint compared to time 0, based on RNA read counts mapping to annotated exons.

Enhancer activity scores for each enhancer were compared to mean log2 fold changes in expression of corresponding genes (via Hi-C contacts) via Spearman rank correlation, to assess the degree to which enhancer activity scores computed by the MV-HMM predict gene expression at the same time point.
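For reference, Spearman rank correlation is the Pearson correlation of the ranks of the two variables, with tied values assigned their average rank. The following standard-library sketch illustrates the computation; the function names are hypothetical and it does not reproduce the software actually used for the analysis.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks are used, the statistic is insensitive to the different scales of enhancer activity scores and log2 fold changes.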

6.2.5 Motif enrichment analysis

To obtain further evidence supporting the biological relevance of the enhancer clusters, I analyzed motif enrichment for motifs known to play a prominent role in GR

function. In particular, using orthogonal techniques, members of our lab have established that GR motifs are enriched in enhancers that respond early in the timecourse to dex, that C/EBP motifs are enriched in enhancers that respond later in the timecourse, and that AP-1 motifs are enriched in enhancers that are repressed or do not respond to dex (McDowell et al., 2017).

Motif matches were obtained by applying MAST v.4.10.0 (Bailey et al., 2009) with a threshold of P < 0.05, using motif profiles from JASPAR (Sandelin et al., 2004). Motifs for CEBPA, CEBPG, ATF4, HLF, and NFIL3 were used to identify likely binding sites of any C/EBP TF. Motifs for FOS, FOSL1, and JUN were used to identify likely binding sites of any AP-1 TF. Enrichment of a motif within a cluster was quantified as the excess proportion of elements containing the motif relative to the expected proportion under the assumption of independence. Statistical significance of motif enrichments was assessed via Fisher’s exact test applied to a contingency table tabulating motif presence or absence in the current cluster versus all other clusters pooled.
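The enrichment measure and significance test can be sketched as follows. The sketch assumes that the expected proportion under independence is the overall proportion of elements (across all clusters) containing the motif, and it uses a two-sided Fisher's exact test that sums the probabilities of all tables no more likely than the observed one; both are assumptions made for illustration, and the function names and counts are hypothetical.

```python
from math import comb


def excess_proportion(cluster_with, cluster_size, total_with, total_size):
    """Observed in-cluster proportion minus the overall (expected) proportion."""
    return cluster_with / cluster_size - total_with / total_size


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: current cluster / all other clusters pooled.
    Columns: motif present / motif absent.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x motif-containing elements in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, total)
```

For example, a cluster of 100 elements with 30 motif-containing elements, drawn from 1000 elements of which 100 contain the motif, has an excess proportion of 0.2.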

6.3 Results

6.3.1 Parameter estimates

EM parameter estimates revealed a canonical signature of a GC-responsive element in which p300 and chromatin accessibility increased centrally and activation-associated histone modifications increased in the immediate flanks (Figure 24B). The model indicates positive covariance for p300 and chromatin accessibility in the middle of

the regulatory region, positive covariance for activation-associated histone marks in the flanks, and negative covariance between the two groups of central and flanking features

(Figure 24C). These results demonstrate that temporal changes in accessibility and occupancy of p300, H3K4me1, H3K4me2, and H3K27ac have a characteristic spatial arrangement, and this signature is jointly recognized by the collection of states comprising the model.

6.3.2 Classification accuracy

The full model achieved high cross-validation accuracy (AUC = 0.80) in classifying enhancers as dex-responsive versus non-dex-responsive (Figure 26). To assess the importance of individual features, I trained restricted versions of the model containing only subsets of the original feature set. None of these restricted versions of the model achieved an AUC as large as that of the full model (Wilcoxon rank-sum test: all comparisons P < 0.05 except full model versus 0 hr features only: AUC = 0.80 versus

0.77; P = 0.11), demonstrating that integrative modeling of these features has predictive value beyond that of any single feature. The second most accurate model (AUC = 0.77) included all features for time 0 hr only, and no features for time 3 hr. Remarkably, this 0 hr model achieved a higher AUC than the 3 hr model that included all features for time

3 hr and no features for time 0 hr (AUC of 0.77 versus 0.73; P = 0.028), suggesting that the state of the enhancer prior to dex treatment is as informative as, and possibly more informative than, the state of the enhancer after treatment. Similarly, each feature other

than DNase was individually more informative when measured at time 0 hr than at 3 hr, in terms of predicting dex-responsiveness of the enhancer (H3K27ac 0 hr AUC versus 3 hr AUC: 0.73 versus 0.60, P = 0.004; p300: 0.71 versus 0.60, P = 0.008; H3K4me1: 0.65 versus 0.50, P = 0.008; H3K4me2: 0.64 versus 0.45, P = 0.008; DNase: 0.52 versus 0.50, P = 0.21). Moreover, the accuracy of the model with only 0 hr features (AUC = 0.77) was not statistically significantly less than that of the full model (AUC = 0.80) (P = 0.11), again underscoring the importance of the pre-dex state of an enhancer in predicting dex-responsiveness. The third most informative feature set consisted of H3K27ac at both time points; the accuracy of the 2-timepoint H3K27ac model (AUC = 0.76) was comparable to that of the 2-timepoint p300 model (AUC = 0.74), as expected given the known role of p300 in acetylation of H3K27. DNase as a lone feature was least informative as compared to all other features, at either timepoint (AUC = 0.52 at time 0 hr, P < 0.01 for all comparisons; AUC = 0.50 at time 3 hr, all comparisons nonsignificant except H3K27ac: P = 0.016), indicating that chromatin accessibility alone is a poor predictor of dex-responsive enhancer activity, though change in chromatin accessibility in response to dex is more informative than either timepoint alone (AUC = 0.67, both P < 0.01).

Figure 26: Area under the receiver operating characteristics curve (AUC) values for the full model and for reduced versions of the model including only the features listed.

6.3.3 Clustering of temporal trajectories

I used the trained HMM to discover dex-responsive enhancers genome-wide and across the time course. I clustered elements into ten clusters according to the likelihood of dex-responsiveness under the MV-HMM and tested for known motif enrichment

(Figure 27A). The GR motif was enriched in clusters of elements that responded early to treatment, either within the first two hours or at hours 2–4, whereas motifs indicative of

C/EBP binding were enriched in clusters of elements with delayed or extended response, consistent with results obtained by our lab using orthogonal techniques (McDowell et al., 2017). Motifs indicative of direct AP-1 binding were enriched in the cluster containing elements predicted to lack dex-induced activation throughout the time course (cluster 4). These results suggest that motif-driven GR binding is the primary

activator of dex-responsive enhancers, as opposed to non-motif-driven (i.e., tethered) GR binding. Furthermore, I found that patterns of regulatory activity were associated with corresponding patterns of expression changes for genes with which those elements physically interacted according to Hi-C chromatin loops. Most notably, enhancers with increased activity were enriched for interactions with up-regulated genes (clusters 1, 8, and 10 in Figure 27A,B), and non-responsive enhancers were physically associated with repressed genes. The overall statistical association between enhancer activity and gene expression across clusters was very strong (Spearman r = 0.73, P = 2.8×10^−19).

Figure 27: Clustering, motif, and gene-expression analyses. (A) Results of clustering trajectories of predicted enhancer activity (left). Enrichments of motifs in clusters measured as excess proportion of elements containing a significant motif match above expectation (* means P < 0.01; right). (B) Gene expression changes for genes physically interacting with enhancers in clusters shown in panel A.

6.3.4 Multipeak structure of enhancer signatures

Visualizing the predictive features for individual enhancer signatures led to the observation that many dex-responsive enhancer signatures have multiple DNase and/or p300 peaks (Figure 28). Moreover, the peaks in the same enhancer signature can be heterotypic in terms of the changes from 0 hr to 3 hr of DNase and p300. In the example

shown in Figure 28, the left peak is non-dex-responsive in that DNase decreases from 0 hr to 3 hr (red to yellow) and p300 does not substantially increase from 0 hr to 3 hr

(green to purple). The right peak, in contrast, exhibits a strong response in both DNase and p300.

Figure 28: Example enhancer signature with two peaks. HMM states have been recoded: state 0.5 = background, state 1.5 = histone flank, state 2.5 = DHS peak.

To parse out individual peaks of multipeak enhancer signatures, I modified the

MV-HMM to allow it to predict multiple peaks per signature. In particular, I inserted an additional transition, from state 4 to state 3, to allow the model to emit arbitrary sequences of peaks separated by histone regions. Training this modified MV-HMM via

EM applied to the entire model repeatedly resulted in mis-convergence. I therefore applied a modified version of EM in which all parameters except for the transitions out

of state 4 were fixed to their previously-estimated values, and only the transitions out of state 4 were re-estimated during EM. In the example of Figure 28, the peak-state predictions (regions in which the HMM track is at y = 2.5) closely match the DNase and p300 peaks. Filtering peak predictions at a minimum length of 100 bp, the majority of signatures were identified by the modified model as having more than one peak as predicted by state 3 (time 3 hr: 9663 / 16665 enhancers = 58%). While previous research has noted that dex-responsive enhancers tend to cluster in the linear genome (Vockley et al., 2016), those previous observations were at the scale of tens of kilobases, whereas the peaks in these multipeak signatures are separated by on average 171 bp, with 90% of interpeak distances being ≤405 bp.
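Extracting peaks from a decoded state path and applying the 100 bp length filter can be sketched as follows. The sketch assumes that the state path is available as one state label per base, that states are numbered 1 through 5 with state 3 as the peak state, and that interpeak distance is measured from the end of one peak to the start of the next; the function names and the toy path are hypothetical.

```python
from itertools import groupby

PEAK_STATE = 3  # central open-chromatin state in the 5-state model


def peak_intervals(state_path, min_length=100):
    """Return half-open (start, end) intervals of peak-state runs >= min_length."""
    intervals = []
    position = 0
    for state, run in groupby(state_path):
        length = sum(1 for _ in run)
        if state == PEAK_STATE and length >= min_length:
            intervals.append((position, position + length))
        position += length
    return intervals


def interpeak_distances(intervals):
    """Gap in bp between the end of each peak and the start of the next."""
    return [nxt[0] - cur[1] for cur, nxt in zip(intervals, intervals[1:])]


# Toy state path with three peak-state runs; the third (40 bp) is filtered out.
path = ([1] * 50 + [2] * 200 + [3] * 150 + [4] * 171 + [3] * 120
        + [4] * 100 + [3] * 40 + [4] * 80 + [5] * 30)
peaks = peak_intervals(path)
gaps = interpeak_distances(peaks)
```

A signature would be counted as multipeak under this sketch when `len(peaks) > 1`.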

To assess whether these peak predictions may be useful for limiting the search space for functional TF binding sites, I tabulated motifs identified with MAST (Bailey and Gribskov, 1998) and tested their relative enrichment in predicted peaks (MV-HMM state 3) versus predicted histone regions (MV-HMM states 2 and 4) at 3 hr post-dex, using Fisher’s exact test. Motifs tested included JASPAR models for GR/AR/MR, CTCF,

AP-1 (JUN, FOS, FOSL1), KLF, FOX (FOXA1, FOXA2, FOXB1, FOXC1, FOXD3, FOXF2,

FOXO1, FOXP1, FOXQ1), and C/EBP (CEBPA, CEBPG, ATF4, HLF, NFIL3). All elements were trimmed to 100 bp to eliminate biases due to differing element lengths.

Figure 29: Motif enrichment in predicted peaks versus histone flanks of predicted dex-responsive enhancers genome-wide.

AP-1 and GR were strongly enriched in peaks versus histone flanks at 3 hr (both P < 1×10^−128; Figure 29), consistent with the known importance of these two factors in GC-responsive enhancers (Vockley et al., 2016). CTCF showed almost no enrichment in 3 hr peaks versus histone flanks, consistent with its known role as an insulator rather than a transcriptional activator. Known motifs for KLF and FOX family members also showed very little enrichment in 3 hr peaks versus flanks; this may be due to their supposed role as “pioneer factors”, which are presumed to be able to bind to closed chromatin and recruit remodeling complexes that can open the chromatin to allow other factors to bind.

To the extent that these factors are able to function as pioneers, they would be expected to be enriched in the vicinity of enhancers, but not necessarily within DNase peaks of dex-responsive enhancers as observed several hours post-dex, consistent with the observations described here. In contrast, known motifs for C/EBP family members were moderately but significantly enriched in peaks versus flanks (P = 1.3×10⁻¹²⁰), despite the supposed role of C/EBP members as pioneer factors. It may be that one or more C/EBP members play a more direct role in GC-responsive transcriptional activation than other pioneer factors; indeed, there is evidence that C/EBPβ binds cooperatively with GR (Steger et al., 2010), indicating that their motifs should be spatially clustered.

6.4 Discussion

Multivariate HMMs have been used previously for unsupervised clustering of epigenetic features and segmentation of genomes into putative elements such as genes and steady-state enhancers. In this work I have shown that a model that incorporates prior biological knowledge can be used to identify stimulus-responsive enhancers with high accuracy, and to quantify activity dynamics over a timecourse. Whereas previous works have started with a fully-connected state topology and performed unsupervised learning, I instead applied a methodology that I previously developed and found to be effective when implementing a model of miRNA target sites (Majoros et al., 2013): to assign a state to each peak, slope, or background region in the canonical multivariate signal profile, to connect those states in sequence as they occur in the modeled profile, and then to train the model using supervised training applied to a set of examples that have been verified as functional. As in the earlier miRNA target site study, this methodology has proven effective in the development of an accurate predictive model of drug-responsive enhancers, possibly due to the detailed modeling of element substructure at nucleotide resolution, and to the use of a model designed based on prior knowledge of the problem structure.

Evaluating the model on STARR-seq-verified enhancers demonstrated that integrative modeling has enabled higher prediction accuracy than any individual feature would allow on its own. As sequencing costs continue to decline and more features are measured from samples, integrative modeling will become correspondingly more important in terms of both maximizing predictive accuracy of bioinformatic tools and improving our understanding of complex systems in biology. For multivariate sequences, MV-HMMs are an obvious choice for integrative modeling, as they enable both classification of whole elements and parsing of element signatures into subcomponents such as putative TF binding sites. MUMMIE (Majoros et al., 2013), with its ability to model sequences at single-base-pair resolution, is an ideal tool for such applications, and was designed specifically for multivariate pattern-recognition tasks in which prior knowledge (in the form of known substructure of modeled elements) is to be incorporated with a hand-crafted state topology. Such modeling is more laborious than unsupervised learning of fully-connected models, as more modeling decisions must be made, and in some cases the model must be trained in pieces and then reassembled after the individual pieces are trained via supervised learning, as is commonly done for gene-finding models. However, it has been noted that unsupervised learning of fully-connected HMM topologies often results in poor convergence (Durbin et al., 1998).

While previously-published models of chromatin state have enabled genome-wide identification of regulatory elements, the work described here addresses the more specific case of drug-responsive enhancers. Biological organisms, as complex dynamical systems, are best understood not as static entities, but as dynamic machines. As our genome is also a dynamical system with distinct states that enable complex developmental and regulatory programs, and as these programs are subject to perturbations that can cause disease, models that capture genome dynamics are likely to dominate in the future. In this work I implemented a two-timepoint model that I then used iteratively to evaluate likelihoods over a timecourse. A number of alternative approaches to modeling the full timecourse could be considered for future investigation.

One possibility would be to incorporate features from all timepoints directly into the model. This approach has the drawback that the resulting model would represent a single class of trajectories. Given training examples for multiple classes of trajectories, a multiple-timepoint MV-HMM could be trained for each such class. However, for large numbers of timepoints the covariance matrix would become unwieldy, necessitating the use of sparse matrices. A more elegant model might be derived, however, by utilizing a Gaussian process model to compute emission probabilities within the MV-HMM.

A key feature of the enhanced model with the added back-transition from state 4 to state 3 is that it appears to be capable of identifying individual peaks within multipeak enhancer signatures. Motif analysis revealed that the peaks identified by the multipeak model were enriched for GR and AP-1 motifs, both of which are known to be prime factors in the glucocorticoid response. As TF motifs within enhancer peaks might be more often functionally relevant than the same motifs in flanking regions, peak calling should be highly useful in the interpretation of genetic variants that have the potential to disrupt binding sites. While peak calling in ChIP data for individual TFs is now commonplace, the use of “multivariate peak calling” as performed by the MV-HMM might have some advantages over standard univariate peak calling. In particular, the incorporation of histone modification signals as done here might enable finer identification of likely TF binding regions, as competition between the peak state and the histone flank states may help to discriminate more finely between the region of open chromatin and the nucleosome-bound regions flanking it.

Finally, the functional significance of multipeak enhancer signatures has yet to be fully explored. As noted previously, examples can be found in which one peak in a signature is dex-responsive in DNase or p300 or both, while other peaks may be non-responsive. As shown in previous work by our lab, dex-responsive enhancers often cluster spatially in the genome, and there is evidence suggesting a looping mechanism whereby direct-bound GR can interact with distal AP-1 sites to produce synergistic regulatory responses (Vockley et al., 2016). While those earlier interactions occurred at a larger spatial distance, on average, than the distance between peaks in the multipeak signatures discovered with the MV-HMM, it is conceivable that similar synergistic mechanisms might be implicated in the multipeak enhancer signatures. These possibilities are currently being pursued in our lab.

Chapter 7 – Conclusions

In this dissertation I have described efforts to improve the interpretation of genetic variants in genes and in transcriptional enhancers, through the use of traditional machine learning methods in combination with detailed model design that incorporates known biology. Improved variant interpretation has the potential to directly impact clinical outcomes, particularly as sequencing of patient DNA becomes more commonplace. As such, the focus of this work is timely and of wide relevance. I have also described a method for identifying drug-responsive enhancers based on their multivariate epigenetic signatures. This latter model likely has applications in prioritizing genomic regions to test experimentally for allelic effects, and may therefore also prove useful in variant interpretation. In the following sections I will reflect on the implications and possible future directions for all of these efforts.

7.1 Interpreting genetic variants impacting gene structure

The computational modeling of eukaryotic gene structure has a long history stretching back to the early 1980s. Over the course of several decades, emphasis shifted from the prediction of individual splice sites and exons to, in the late 1990s, prediction of whole gene structures using HMMs. The goal of that work was to annotate reference genomes, in which all genes are assumed to be functional and well-formed. Because all genes were assumed to encode functional proteins, those models were able to make use of codon bias statistics in continuous translation reading frames across successive exons in a gene. The use of that information enabled those models to achieve very high levels of predictive accuracy on reference genomes, because codon bias provides a strong signal that enables the accurate chaining together of exons that jointly maintain a valid reading frame. However, for genes that have become disrupted by genetic variants, the foregoing assumption regarding intact translational reading frames is counterproductive, because it impedes the search for disruptive variants. As demonstrated in Chapters 3 and 4, when a traditional gene structure model is applied to genes that have been disrupted so as to include a premature stop codon, the model incorrectly concludes that the splicing patterns of those genes must have changed, so that the apparent disruption to the reading frame is minimized. This behavior is a direct result of the assumption of intact reading frames, and is clearly undesirable when the goal is to find disruptive variants and accurately gauge their likelihood of being deleterious.

In Chapters 3 and 4 I described work that addresses this shortcoming in traditional gene-finding models. In this work I explored the possibility of separating the prediction of splicing changes from the interpretation of their effects on translation, an approach I call delayed interpretation. Whereas traditional gene finders have a strong bias against predicting large disruptions to translation reading frames in personal genomes, the use of delayed interpretation eliminates that bias by withholding translation signals from the process of splicing prediction. While the impact of splicing changes on translation products is still assessed and reported, that assessment is performed only after the algorithm has committed to predicting a given splicing pattern. As such, translation effects no longer influence splicing prediction.

Unfortunately, the most informative sequence features used by ab initio gene finders are translation signals and the splice site sequences, so that removal of translation signals leaves only the splice sites themselves. It has been demonstrated that splice sites lack sufficient information content to allow their accurate discrimination, absent other information, from decoy sites, in higher eukaryotes (Lim and Burge, 2001).

In the model described in Chapter 3, ACE, I utilized only splice site sequences in predicting splicing changes, which limited its ability to discriminate true splice sites from decoy sites. Because that model addresses only one type of splicing change and because the number of possible outcomes for that class of splicing changes tends to be very small, the software is still useful as an enumerator of possible splicing outcomes.

However, before the model could be extended to predict other types of splicing changes, it required additional predictive features beyond splice site scores, to avoid an excessively high false positive rate.

In Chapter 4 I therefore extended ACE into a more elaborate model, ACE+, in which additional features were used to augment the information from splice sites. In particular, ACE+ utilizes a simple model of the processes of exon definition and intron definition. I showed that this model is able to predict several types of splicing changes.

While there is room for improvement in predictive accuracy, this model represents the first attempt to predict whole gene structures in protein-coding genes using only non-coding features. The success of this model in predicting aberrant splicing outcomes demonstrates that predicting gene structures without the use of coding information is feasible. Moreover, this model more accurately reflects our current understanding of how splicing decisions are made in the cell, in which the twin processes of exon definition and intron definition influence which portions of the pre-mRNA are included or excluded from the mature transcript.

The greatest limiting factor in the model’s predictive accuracy is very likely our general lack of understanding of the processes of exon and intron definition. ACE+ relies on a simplistic model of exon and intron definition based on linear combinations of hexamer scores. Hexamer-based models of exon definition have been in use for over a decade, and are now commonly used to predict exon skipping as a result of individual genetic variants in isolation (Soukarieh et al., 2016). While the mechanistic role of any individual hexamer in such a model is generally not known, hexamer-based models of exon definition are posited to capture overall binding propensity of splicing regulatory factors, such as SR proteins and hnRNPs, to a given RNA sequence. It has been posited that the success of exon definition depends on the binding of sufficient numbers of positively-acting splicing regulatory factors within exon bodies, as these factors are believed to form a scaffold across the exon that is necessary for exon inclusion in the mature transcript (Ke and Chasin, 2010; Schneider et al., 2010). Intron definition is similarly understood, though for a different set of splicing regulatory factors. It is believed that splicing outcomes are a direct result of these processes of exon and intron definition.

As I noted in Chapter 4, it is known that splicing regulatory factors extensively interact, in both cooperative and competitive ways. SR proteins and hnRNPs can compete for the same binding sites on RNA, and they may also interact spatially via steric hindrance or via protein-protein interaction domains. Compensatory relationships between splice sites and splicing regulatory elements have also been documented (Ke et al., 2008). Existing hexamer models do not capture these interactions, because they rely on simple linear combinations of hexamer scores. This limitation impacts ACE+ because ACE+ currently uses such models in its potential functions. An important challenge for future work will be to both improve our understanding of these interactions and their impacts on splicing, and to devise a means of encoding those interactions in a way that can be directly incorporated into computational models such as ACE+.

Discovering interactions between features might be accomplished via machine-learning approaches. In particular, we might consider adding interaction terms to our regularized logistic regression model. Simply adding all possible pairwise interactions for all 4096 hexamers will likely not be feasible, due to the excessive number of resulting terms. Some form of feature selection will therefore be needed for these interaction terms. One possible heuristic is to identify pairs of hexamers that co-occur more often than expected by chance at short distances within exons, suggesting a synergistic effect. Pairs that are significantly depleted within exons may also be informative. Regularized logistic regression could then be applied to the augmented feature set, whether extracted from annotated genes or from splicing minigene outputs. The resulting model could then be utilized directly within ACE+.
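
To make the scale of the problem concrete, the following minimal sketch computes the number of distinct hexamer pairs and tallies co-occurring pairs within a single exon sequence. The function name, the distance bounds, and the toy sequence are hypothetical; the comparison of observed counts against a chance expectation, which the heuristic above would require, is not shown.

```python
from collections import Counter
from math import comb

K = 6  # hexamer length

def count_hexamer_pairs(exon, min_dist=K, max_dist=50):
    """Count unordered hexamer pairs whose start positions lie between
    min_dist and max_dist apart. The default min_dist=K excludes
    overlapping hexamers, which are trivially correlated."""
    last_start = len(exon) - K
    counts = Counter()
    for i in range(last_start + 1):
        for j in range(i + min_dist, min(i + max_dist, last_start) + 1):
            pair = tuple(sorted((exon[i:i + K], exon[j:j + K])))
            counts[pair] += 1
    return counts

# Number of distinct unordered pairs among all 4096 hexamers.
n_pairwise_terms = comb(4 ** K, 2)  # 8,386,560

# Toy exon (hypothetical): only one non-overlapping pair is possible.
toy_counts = count_hexamer_pairs("AAAAAACCCCCC")
```

Counts accumulated in this way over a set of annotated exons could then be filtered to a manageable number of candidate interaction terms before model fitting.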

Another important phenomenon not reflected in additive hexamer models is RNA secondary structure. As noted by Rosenberg et al. (2015), splicing outcomes in minigene experiments were frequently influenced by the formation of secondary structures that impeded recognition of one splice site, so that a cryptic splice site was instead selected by the spliceosome. This was detrimental to the model learned from the minigenes, as the learned features were specific to the minigene and not generalizable to a genomic context. However, computational models of secondary structure, in particular stochastic context-free grammars (SCFG), are tractable for short sequences, and could be applied in a genomic context. As such, SCFG modeling of only the sequence immediately flanking a splice site would be feasible within ACE+ without negatively impacting execution time. Longer-range secondary structures have also been implicated by the finding that specific pairs of motifs at the ends of introns predict splicing outcomes (Friedman et al., 2008; Ke and Chasin, 2010). While the latter motif patterns implicate a structure in which intron ends are paired, detection of that pattern was based on simple motif matching and is therefore computationally simpler than full SCFG modeling.

Because ACE+ utilizes a simple random field model in which whole exons and whole introns are evaluated via individual potential functions, any of the foregoing features could conceivably be incorporated into ACE+ via appropriate modifications to those potential functions. As such, ACE+ provides a flexible framework into which additional information can be incorporated in the future. As additional insights into the biology of splicing are gained, these can be incorporated into the model in a modular way by updating the model’s potential functions. Conversely, additional features which are found to improve predictive accuracy of the model might prompt the formulation of testable hypotheses regarding novel mechanisms in splicing.

One shortcoming of the use of annotations to train ACE+ is that genomic annotations are typically not cell-type specific. It is known that splicing is often regulated in a cell-type specific manner, and it is likely that aberrant splicing also has cell-type specific aspects. For organisms in which transcriptomic data has been collected in different cell types, it might be feasible to train separate versions of ACE+ for each cell type. In particular, to obtain a model specific to one cell type, we might consider training ACE+ on transcripts known to occur only in that cell type. For cell types having few cell-type specific transcripts, low sample sizes might negatively impact parameter estimates. In those cases, it might be feasible to augment the cell-type specific training data with pooled data from other cell types, but with the training examples from the target cell type up-weighted so as to more strongly influence parameter estimates.

While hexamer-based models of exon definition are already in use to predict aberrant splicing, these are generally applied only to individual SNPs, and only to predict local changes in gene structures proximal to the SNP. In Chapter 3 I demonstrated the value of analyzing variants in combination. In particular, I showed that ACE could identify cases in which one frameshift was compensated by a downstream frameshift that rescued the original reading frame. Additional forms of compensatory interaction between multiple variants might be prevalent, and could be modeled in ACE+. Any interaction contained within a single element (exon, intron, or splice site) could be modeled within the corresponding potential function for that element. Interactions between variants in different elements would require changes to the modular structure of ACE+. While additional work will be required to devise efficient means of modeling longer-range interactions, it might be expected that many compensatory interactions are local in nature and therefore could be encapsulated within individual potential functions in many cases. Furthermore, as described in Chapter 4, ACE+ utilizes a constrained splice graph construction algorithm that results in sparse graphs in its random field. Because these models are sparse, modeling of longer-range interactions may still be feasible without resulting in unacceptable execution times during decoding.
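
The arithmetic behind compensating frameshifts can be sketched as follows. This is an illustration only, not the procedure implemented in ACE: it assumes that indels are given as hypothetical (coding position, length change) pairs, and it neither checks for stop codons within the out-of-frame interval nor considers splicing changes.

```python
def frame_disrupted_intervals(indels):
    """Return (start, end) coding intervals that are read out of frame.

    indels: iterable of (coding_position, length_change) pairs, with
    insertions positive and deletions negative. An interval's end is
    None if the original reading frame is never restored.
    """
    intervals = []
    offset = 0      # cumulative length change modulo 3
    start = None    # position at which the frame was first lost
    for pos, delta in sorted(indels):
        offset = (offset + delta) % 3
        if offset != 0 and start is None:
            start = pos                     # frame lost here
        elif offset == 0 and start is not None:
            intervals.append((start, pos))  # frame rescued here
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals

# A 1-bp insertion rescued by a downstream 1-bp deletion (hypothetical).
compensated = frame_disrupted_intervals([(100, +1), (130, -1)])
```

In this toy case only the 30 coding positions between the two variants are read out of frame, whereas either variant considered in isolation would appear to disrupt the remainder of the coding sequence.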

Finally, ACE+ applied to 150 human genomes revealed an unexpected abundance of genetic variants that resulted in de novo splice sites—that is, functional splice sites absent from the reference genome. I also showed, through simulations of a mutation process, that a random mutation is more likely to create a functional splice site than to disrupt an existing splice site. As noted in Chapter 4, de novo splice sites have previously been implicated in a number of diseases. Nevertheless, this class of variants appears to be under-appreciated, as existing tools do not accurately classify them as impacting splicing. Variants that impact exon inclusion by disrupting an exonic splicing enhancer are another class of variants that are often misinterpreted as synonymous or nonsynonymous variants (Pagani and Baralle, 2004). The traditional view that only variants occurring within an actual splice site can alter splicing is clearly incorrect.

Indeed, it has been estimated that over half of disease-causing mutations within genes may impact splicing (López-Bigas et al., 2005). Improved bioinformatic approaches to identifying these variants are clearly needed. ACE and ACE+ demonstrate the value of modeling whole gene structures in identifying variants that impact splicing and interpreting their effects on encoded proteins, and it is hoped that these modeling frameworks will prove useful in the future as substrates for the development of more accurate predictors.

7.2 Experimentally testing for allelic effects in enhancers

As noted in Chapter 2, a large proportion of variants associated with disease occur in non-coding regions of the genome, and it has been suggested that many of these act by altering gene regulatory programs. Much effort has recently gone into the development of high-throughput reporter assays to identify transcriptional enhancers, and in Chapter 5 I described a related assay, Pop-STARR, which allows us to test for allelic effects of genetic variants within enhancers. While the Pop-STARR assay represents a major advance in our ability to investigate genetic components of gene regulation, there is much that can be improved in our approach to analyzing the outputs of this assay. There are also a number of opportunities for us to expand the utility of this new experimental tool.

Our initial approach to detecting allelic effects in Pop-STARR outputs simply tested for a difference in allele frequencies between DNA reads (produced by sequencing the plasmids containing the reporter gene) and RNA reads (produced by sequencing the mRNAs transcribed from the plasmids). We have used both Fisher’s exact test and the beta-binomial test to detect these differences. In both cases, we are testing the null hypothesis that allele frequencies in the RNA are drawn from the same distribution as the allele frequencies in the DNA. This null hypothesis is equivalent to stating that there is no allelic effect in the rate at which the reporter gene is transcribed into RNA. These tests were applied directly to read counts that had been pooled across biological replicates, so that the variance between replicates was ignored. Ignoring that variance might negatively impact the accuracy of our predictions of regulatory variants, because a higher-than-expected variance can result in inappropriately rejecting the null hypothesis that the RNA counts and the DNA counts are drawn from the same underlying distribution. In addition, our estimates of effect size, computed by forming ratios of allele counts in the RNA and DNA reads, can be expected to be highly unstable when read counts are small.

As such, we are currently exploring alternative approaches to both testing for the existence of allelic effects and estimating the magnitudes of those effects. One possible approach is to formulate a probabilistic generative model in which sources of variability are explicitly accounted for in the generation of each observed or latent variable in the model. For example, the variability in read counts between biological replicates could be modeled via a binomial sampling process in which the read count of one allele is drawn from a binomial distribution with unknown frequency parameter p, and different replicates are considered independent samples from that same distribution. The beta-binomial test posits such an unknown parameter and integrates over all possible values of that parameter. As currently used in testing Pop-STARR outputs, we provide the pooled read counts for the two alleles of a bi-allelic variant as pseudo-observations to parameterize the beta prior within the beta-binomial test. However, if our data are over-dispersed relative to this prior, our test may have a high false positive rate. Using an explicit generative model that accounts for replicates may allow us to fit parameters that more accurately capture this dispersion between replicates.

Given a generative model in which latent variables represent unobserved quantities such as the allele frequencies in the DNA and in the RNA, we may be able to use a Markov chain Monte Carlo (MCMC) approach to efficiently sample from the posterior distribution of effect sizes. If we can accurately reconstruct that posterior distribution, this will enable us to report not only a point estimate of the effect size of an allele, but also a credible interval indicating our confidence in that point estimate.

Testing for the existence of an allelic effect can then be done by assessing the posterior probability that the effect size is different from 1. Given the small number of variables that would likely be included in such a model, MCMC sampling should be very efficient, so long as an effective proposal function can be devised.
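
A minimal sketch of such a sampler is given below, under several assumptions that are not part of the work described here: the effect size is taken to be the odds ratio of the alternate allele in RNA versus DNA, replicates are treated as independent binomial samples sharing the same parameters (so over-dispersion between replicates is not modeled), both parameters receive normal priors on the logit and log scales, and the proposal is a simple Gaussian random walk. All function names and read counts are hypothetical.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_posterior(a, b, replicates, prior_sd=3.0):
    """a = logit of the alt-allele frequency in DNA; b = log effect size.
    Normal(0, prior_sd) priors on a and b are assumptions of this sketch."""
    lp = -(a * a + b * b) / (2.0 * prior_sd ** 2)
    for dna_alt, dna_total, rna_alt, rna_total in replicates:
        lp += dna_alt * log_sigmoid(a) + (dna_total - dna_alt) * log_sigmoid(-a)
        lp += rna_alt * log_sigmoid(a + b)
        lp += (rna_total - rna_alt) * log_sigmoid(-(a + b))
    return lp

def sample_effect_size(replicates, n_samples=20000, burn_in=2000,
                       step=0.1, seed=0):
    """Random-walk Metropolis; returns sorted posterior draws of the
    effect size exp(b)."""
    rng = random.Random(seed)
    a = b = 0.0
    current = log_posterior(a, b, replicates)
    draws = []
    for it in range(burn_in + n_samples):
        a_new = a + rng.gauss(0.0, step)
        b_new = b + rng.gauss(0.0, step)
        proposed = log_posterior(a_new, b_new, replicates)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        if it >= burn_in:
            draws.append(math.exp(b))
    return sorted(draws)

def summarize(draws):
    """Posterior median, 95% credible interval, and P(effect size > 1)."""
    n = len(draws)
    return {
        "median": draws[n // 2],
        "ci95": (draws[int(0.025 * n)], draws[int(0.975 * n)]),
        "p_gt_1": sum(d > 1.0 for d in draws) / n,
    }

# Hypothetical replicate: (DNA alt, DNA total, RNA alt, RNA total).
summary = summarize(sample_effect_size([(500, 1000, 800, 1000)]))
```

Since the effect size is continuous, the sketch reports the posterior probability that it exceeds 1 together with a credible interval; an allelic effect would be suggested when that interval excludes 1.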

Another major challenge in analyzing Pop-STARR outputs is the potential for reference bias. In order to obtain read counts for each genetic variant, both RNA and DNA reads must first be aligned to the genome. This alignment step utilizes a short-read aligner to rapidly align large numbers of reads to a human reference genome. By definition, the haploid reference genome contains only the reference allele for any variant. When a read contains an alternate allele, the aligner sees a mismatch between the allele in the read and the allele in the reference genome. This mismatch results in an alignment penalty, and in cases where there are multiple mismatches (due to sequencing errors or multiple variants in close proximity, or a combination of these) the read may be discarded by the aligner, resulting in a bias against the alternate allele. Such biases could result in incorrect assessments of allelic effects.

In the work described in Chapter 5, we mitigated the effects of reference bias by aligning to a custom set of haplotypes constructed from a phased VCF file. To the extent that phasing was accurate and all variants in the assayed genomic regions were genotyped, the use of custom haplotype sequences as a substrate for alignment should have eliminated reference bias. However, if any variants were not genotyped, or if phasing of variants was not completely accurate, some systematic biases may have remained. A more severe limitation of this method is that for larger target regions and larger numbers of donor individuals, the number of haplotype sequences that need to be constructed for the alignment step becomes large. Using a large set of haplotype sequences as the substrate for alignment is problematic because a read may align to a large number of different haplotypes sharing a common subsequence, and these multi-mapped reads are sometimes discarded by aligners.

An elegant solution to this problem is to utilize a special data structure variously known as a variant graph, a graph genome, or a genome graph (Paten et al., 2017). Such a graph represents explicit haplotype sequences as different paths through the graph, but without duplicating sequences that are common to multiple haplotypes. Reads can be aligned directly to the graph, and because the graph collapses subsequences common to multiple haplotypes, the problem of multi-mapped reads is lessened. Because the graph explicitly represents different alleles of each variant, reference bias should also be substantially reduced. Software for aligning directly to such graphs is only now being developed by various groups. As that software becomes available it could be adopted for use in analysis of allele-specific data such as the outputs of Pop-STARR.
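
The space saving offered by such a graph can be illustrated with a toy sketch restricted to bi-allelic SNPs. The representation below (a list of nodes, each holding either one shared segment or the two alleles of a SNP) and all names are hypothetical simplifications; it is not the data structure of Paten et al. (2017) or of any graph aligner, and it ignores phasing, so it admits allele combinations that may never have been observed together.

```python
from itertools import product

def build_variant_graph(reference, snps):
    """Return the graph as a list of nodes in path order.

    Each node is a list of alternative strings: one string for sequence
    shared by all haplotypes, two strings for the reference and alternate
    alleles of a SNP. snps maps a 0-based position to its alternate base.
    """
    nodes, prev = [], 0
    for pos in sorted(snps):
        if pos > prev:
            nodes.append([reference[prev:pos]])     # shared segment
        nodes.append([reference[pos], snps[pos]])   # allele bubble
        prev = pos + 1
    if prev < len(reference):
        nodes.append([reference[prev:]])
    return nodes

def haplotypes(nodes):
    """Enumerate every path through the graph as a haplotype string."""
    return {"".join(choice) for choice in product(*nodes)}

def stored_bases(nodes):
    """Total number of bases stored in the graph."""
    return sum(len(s) for node in nodes for s in node)

graph = build_variant_graph("ACGTACGT", {2: "T", 5: "A"})
```

For this toy reference the graph stores 10 bases yet spells out four haplotypes, which would require 32 bases as separate sequences. Enumerating all paths is done here only for illustration, as it becomes infeasible for realistic numbers of variants.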

While the experiments described in Chapter 5 assayed relatively short amplicons obtained using custom PCR primers, scaling this approach to longer regions can be done more efficiently by employing capture arrays applied to fragmented genomic material.

Under the latter approach, a genetic variant may be included in multiple fragments covering different, overlapping intervals on the genome. Each such fragment may contain multiple variants. For two genetic variants X and Y in close proximity on the genome, some of the fragments containing X might also contain Y, while other fragments containing X might not contain Y. As such, the regulatory potential of different fragments containing X could be influenced in different ways by the local sequence context of X. In addition, when the effect of a variant is due to its occurring within a transcription factor binding site, fragments that contain only part of that binding site will likely exhibit different effects in the assay than fragments containing the whole binding site. As such, it may be necessary to jointly model multiple variants, multiple fragments, and the relation between fragments, variants, and putative TF binding sites, in order to maximize predictive accuracy.

Constructing such a model will be challenging, due to the large number of possible factors involved and the large number of parameters to be estimated. However, such a model may allow us to elucidate interactions between variants, such as compensatory or synergistic effects, and may provide novel insights into such aspects of gene regulation as buffering and evolutionary plasticity. Indeed, the ability of Pop-STARR to provide quantitative measures of regulatory potential for individual variants as well as for variants in combination may enable the construction of sequence-based models that can improve our understanding of how enhancers work mechanistically.

For example, a model that could accurately predict Pop-STARR outputs based solely on perturbations to DNA sequence could be highly informative in terms of understanding how enhancers function, at least within the controlled setting of the reporter construct.

Translating this understanding to the endogenous chromatin context would be a worthy longer-term goal.

7.3 Modeling epigenetic signatures of drug-responsive enhancers

While the Pop-STARR assay represents a major advance in our ability to identify regulatory variants, applying this assay genome-wide to large populations is currently not feasible, due to both cost and the scale of effort required. However, a number of other assays are available which could inform the selection of regions to test experimentally for allelic effects. In particular, chromatin immunoprecipitation (ChIP) assays for histone modifications, and assays for chromatin accessibility such as ATAC-seq, are now routine and inexpensive for a single cell line. As I showed in Chapter 6, a multivariate pattern recognition model applied to epigenetic signatures composed of DNase-seq, p300, H3K4me1, H3K4me2, and H3K27ac profiles can accurately detect drug-responsive enhancers, as validated by STARR-seq. Applying this model to find all glucocorticoid-responsive enhancers genome-wide is very efficient. The results of that genome-wide scan could potentially serve as a starting point in prioritizing regions to test for allelic effects that may modulate the drug response. This could potentially be done for enhancers responding to other drugs as well, and may provide mechanistic insights into inter-individual differences in responses to individual drugs.

More generally, as sequencing becomes cheaper it continues to find use in an increasingly wide array of assays, so that multivariate measurements at nucleotide resolution along the genome become more commonplace. This naturally leads to the problem of how to effectively utilize that data in an integrative manner. Multivariate HMMs are an obvious choice for modeling multivariate sequence data, and the results described in Chapter 6 demonstrate how integrative modeling can outperform simpler models that use only a subset of features.

In Chapter 2 I described how previous work on microRNA target site prediction led me to devise a general methodology for designing hand-crafted models of multivariate signatures for genomic elements. That methodology called for using a small number of HMM states, such that each state can be assigned to a specific landmark in the canonical multivariate signature being modeled. Applying that methodology to the epigenetic enhancer signatures, I arrived at a very simple HMM with only five states in a strict linear arrangement. That this initial model exhibited high predictive accuracy without requiring any modifications to the model topology suggests that there is indeed utility in incorporating prior knowledge of the problem structure into the model at the outset. While previous authors have demonstrated the value of unsupervised clustering to discover epigenetic patterns genome-wide in an unbiased fashion, the methodology used in the present work represents a complementary approach. This approach may be more or less useful in other contexts, depending on the problem at hand and on how much is already known about the structure of the elements being modeled. For problem domains in which little is already known, an unsupervised approach may be more appropriate. Indeed, the patterns exploited in manually constructing the model used here were previously uncovered, in part, using unsupervised models (e.g., Ernst et al., 2011).
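As an illustration of the linear topology described above, the following is a minimal sketch and not the implementation used in this work. It assumes diagonal-covariance Gaussian emissions (the actual model may use full covariance matrices), a single self-transition probability `p_stay` shared by all states, and hypothetical function names. It uses the Viterbi algorithm to find the most probable path through a strictly left-to-right HMM, so that each position in a window is assigned to one landmark state.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density at vector x of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def viterbi_linear(obs, means, variances, p_stay=0.9):
    """Most probable path through a strictly left-to-right HMM.

    The path starts in state 0 and ends in the last state. From each
    state the model may only stay (p_stay) or advance to the next
    state (1 - p_stay). All parameter values are illustrative.
    """
    n_states, n_obs = len(means), len(obs)
    if n_obs < n_states:
        raise ValueError("need at least one observation per state")
    log_stay, log_move = math.log(p_stay), math.log(1.0 - p_stay)
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_obs)]
    back = [[0] * n_states for _ in range(n_obs)]
    score[0][0] = log_gauss_diag(obs[0], means[0], variances[0])
    for t in range(1, n_obs):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_move if s > 0 else neg_inf
            best, prev = (stay, s) if stay >= move else (move, s - 1)
            if best > neg_inf:
                emit = log_gauss_diag(obs[t], means[s], variances[s])
                score[t][s] = best + emit
                back[t][s] = prev
    # Trace back from the final state at the last position.
    path = [n_states - 1]
    for t in range(n_obs - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][-1]
```

Given a window of feature vectors (for example, one accessibility value and one histone-mark value per position), `viterbi_linear` returns a state label per position together with the log probability of the best path. Comparing that score against the score of a background model would be one way to turn the segmentation into a classifier; that step is omitted here.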

A number of enhancements to the model described here are possible. Perhaps the most obvious enhancement would be to include motif information in the model. Currently the model uses only continuous attributes as predictive features. However, the underlying framework in which this model is implemented supports the use of joint continuous and discrete multivariate distributions, including the use of explicit DNA or RNA sequences. This could be used to incorporate motif information into the prediction process. As noted in Chapter 6, some motifs previously implicated in the glucocorticoid response were found here to be enriched in regions matched by the HMM’s central DNase peak state, relative to other parts of an enhancer. Meanwhile, other motifs, particularly those believed to represent binding of pioneer factors or insulators, were substantially less enriched in that peak state. These patterns suggest an augmented model containing state-specific distributions over motif occurrences. Such a model might exhibit higher predictive accuracy than the model described in Chapter 6. Such a model might also be useful in identifying distinct classes of enhancers, based on the presence of different TFs, either singly or in combination. The incorporation of ChIP TF signals into the model could also be useful in seeking either higher predictive accuracy or the ability to identify different classes of enhancers.

The enrichment of specific motifs in the peak state suggests another use for the model beyond the classification of whole enhancers as drug-responsive versus non-responsive: the segmentation of individual enhancers into their sub-components. In particular, the enrichment of specific motifs within the central DNase peak state suggests that the model may be useful for identifying the parts of enhancers in which it is most useful to search for genetic variants having regulatory effects. These segments could be cloned into a reporter construct, such as that used by Pop-STARR, to test for allelic effects. Indeed, it may be useful to use Pop-STARR to assay both the central DNase peak regions and separately the histone flank regions of a large set of enhancers, and then to assess whether regulatory variants are more commonly found in the DNase peak regions than in the histone-enriched flanks. The distribution of regulatory variants across these different enhancer components may be informative as to the role of sequence features in determining enhancer function. For example, some regulatory variants may act via effects on TF binding motifs or local DNA shape near TF binding sites, whereas variants in flanking regions might impact nucleosome occupancy and the ability to support stable epigenetic signals in those regions.

In Chapter 6 I also described an approach to computing temporal profiles of enhancer activity dynamics, by applying the HMM to epigenetic signatures at successive timepoints. I showed that these temporal activity profiles reproducibly clustered into a small number of trajectories. Each such cluster represents a set of enhancers having similar dynamics over the 12-hour timecourse. Using Hi-C data, I then showed that gene expression dynamics were correlated with the enhancer activity dynamics, for genes in physical association with those enhancers. This demonstrates that changes in epigenetic state at an enhancer are informative in predicting gene expression responses.

A useful goal would be to seek a model which could predict those epigenetic state changes based solely on DNA sequence. Connecting that model to the enhancer HMM would then allow us to predict gene expression responses based only on DNA sequence.

A further goal would then be to predict how genetic variants can modulate that response, as alluded to earlier.

Finally, a possible extension to the HMM that would more directly address the temporal aspects of epigenetic state is to replace the multivariate Gaussian emission distributions with a multivariate Gaussian process model. The multivariate Gaussian distributions currently used in the model are stationary. By contrast, a Gaussian process model can represent biases in responses over time, and a multivariate Gaussian process model could represent such changes in several variates simultaneously. Using such a model to compute HMM emission probabilities would result in a multivariate Gaussian process HMM (MGP-HMM). Such a model would address dynamics directly, as opposed to the ad hoc method used here, in which successive timepoints are evaluated independently. The main challenge in developing an MGP-HMM would be in training the model. Given that EM is commonly used for training HMMs and that EM solutions for training Gaussian process models have been proposed (Ranjan et al., 2016), an EM formulation for training an MGP-HMM might be feasible.
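To make the proposed emission model more concrete, the following is a minimal sketch of how a Gaussian process could score a time series of one signal at one genomic position. It is an illustration under stated assumptions rather than a proposed design: it is univariate for brevity, it assumes a constant mean and a squared-exponential kernel, and the hyperparameter values and function names are hypothetical. In an MGP-HMM, each state would have its own such model, and the resulting log density would play the role of the state's emission log probability.

```python
import math

def rbf_kernel(times, length_scale, signal_var, noise_var):
    """Squared-exponential covariance over timepoints, plus noise on the diagonal."""
    n = len(times)
    return [[signal_var * math.exp(-0.5 * ((times[i] - times[j]) / length_scale) ** 2)
             + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def gp_log_likelihood(values, times, mean,
                      length_scale=4.0, signal_var=1.0, noise_var=0.1):
    """Log density of a time series under a GP with constant mean.

    Hyperparameter defaults are arbitrary illustrative values.
    """
    low = cholesky(rbf_kernel(times, length_scale, signal_var, noise_var))
    resid = [v - mean for v in values]
    # Forward substitution: solve L z = resid, so that z.z = resid^T K^-1 resid.
    z = []
    for i, r in enumerate(resid):
        z.append((r - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(zi * zi for zi in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(times)))
    return -0.5 * (quad + log_det + len(times) * math.log(2.0 * math.pi))
```

Because the kernel ties neighboring timepoints together, a smooth trajectory receives a higher log density than an erratic one with the same mean, which is the property that evaluating successive timepoints independently cannot capture.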

Appendix

Figure 30: Known and predicted gene structures for alleles A and O of the human blood group gene ABO (Ensembl gene ENSG00000175164). The reference genome GRCh38 contains the O allele, which contains a frameshift leading to a premature stop codon. The structure predicted by the state-of-the-art gene finder Augustus (Stanke et al., 2006) for the O allele introduces a novel exon spanning positions 14364-14392 in order to alter the reading frame and avoid the premature stop codon at position 17791, resulting in a higher likelihood due to the strong coding signal in the long final exon of the A allele. (Coordinates have been transformed and mapped to the forward strand).

Figure 31: Distribution of distances (nt) between cryptic splice sites and annotated sites in DBASS, the Database of Aberrant Splice Sites (Buratti et al., 2007); outliers above 644nt were trimmed for illustration purposes only; these constitute <5% of the distribution.

Figure 32: Example of structured Essex output (XML reports are structured similarly). Notable features include the alignment of the reference sequence to the alternate sequence via a CIGAR string indicating insertion, deletion, and match lengths; classification of variants for the reference transcript, the mapped transcript, and any putative novel transcript structures resulting from disruptions to splice sites or changes in translation reading frames (note that variant classifications can differ between alternate transcripts); protein translations for all versions of a transcript; predicted fates of transcripts and/or proteins; and detailed descriptions of disrupted splice sites (shown) and putative cryptic sites (not present in this example).

Figure 33: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least one spliced read when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 537900, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned nonzero FPKM when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least one spliced read, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 699470, P < 2.2×10⁻¹⁶).

Figure 34: Number of ACE-predicted novel isoforms (y-axis) across all Geuvadis samples estimated to meet or exceed a given TPM (Transcripts Per Million) threshold (x-axis), as estimated by Salmon (red), Kallisto (blue), and StringTie (green).

Figure 35: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least three spliced reads when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 762670, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned FPKM≥2 when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least three spliced reads, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 771330, P < 2.2×10⁻¹⁶).

Figure 36: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least three spliced reads when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 579440, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned FPKM≥2 when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least three spliced reads, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 716420, P < 2.2×10⁻¹⁶).

Figure 37: (A) Distribution of normalized read counts (normalized by total reads mapped to the locus) supporting novel splice junctions proximal to non-disrupted annotated sites. (B) Distribution of similarly normalized read counts supporting novel junctions proximal to disrupted annotated sites. The median is significantly greater than for the non-disrupted sites (Wilcoxon W = 222750000, P < 2.2×10⁻¹⁶) shown in panel A.

Figure 38: (A) Scatterplot of cryptic splicing levels for cryptic sites near annotated splice sites that are disrupted (y-axis) or are not disrupted (x-axis). Each point corresponds to one annotated splice site observed to be disrupted in some individuals. Cryptic splicing levels were normalized by total reads mapped to the locus in the individual, then averaged across individuals in which the site was disrupted (y-axis) or not disrupted (x-axis). A Wilcoxon rank-sum test was applied to each point to remove nonsignificant results at an FDR (False Discovery Rate) threshold of 0.05. A majority of points (86%) lie above the y = x line. Median cryptic splicing levels were significantly higher in individuals in which the annotated splice site was disrupted (Wilcoxon signed rank test, V = 1135, P < 2.2×10⁻¹⁶). (B) Magnified view of the same scatterplot shown in panel A, showing 90% of original data points.

Figure 39: (A) Distribution of proportions of transcripts with disrupted splicing among 1000 Genomes Project samples for which ACE identified at least one putative alternate splice form not predicted to entail loss of function (LOF) (maximum change in amino acid sequence was 10 amino acids). (B) Distribution of proportion of genes among 1000 Genomes Project samples identified by ACE as entailing LOF in one but not all annotated isoforms.

Figure 40: (A) Distribution of log2 effect sizes of N=578 heterozygous NMD events after filtering to include only transcripts with mean FPKM≥1. (B) Distribution of log2 effect sizes of N=411 heterozygous NMD events after filtering to include only transcripts with mean FPKM≥2. (C) Distribution of log2 effect sizes of N=297 heterozygous NMD events after filtering to include only transcripts with mean FPKM≥3.

Figure 41: (A) Betas (y-axis) from linear mixed-effects model with random intercepts, log2(FPKM) ~ Xb + Zu, converted to relative abundance ratios r_{0/2} = FPKM_0 / FPKM_2, where FPKM_k denotes mean FPKM (fragments per kilobase of transcript per million reads mapped) among individuals predicted to have k functional alleles of a transcript. FPKM thresholds (x-axis) of 0.1 through 15 were used to pre-filter transcripts prior to fitting the model. Only transcripts expressed in at least 30 individuals were included in the analysis. The largest beta (b = 0.37, SE = 0.01) corresponds to r_{0/2} = 0.60. (B) Similar plot as in panel A, for mixed-effects model with both random intercepts and random slopes. Mean r_{0/2} = 0.49875, indicating a halving of transcript abundance, on average, in homozygous NMD targets.
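The conversion from a fitted beta to the ratio of mean FPKM at zero versus two functional alleles can be sketched as follows. This is a worked example under an assumption not stated in the caption: that the fixed-effect covariate is the number of functional alleles (0, 1, or 2), so that the beta is a per-allele slope on the log2 scale. The function name is hypothetical. Under that assumption the reported values are mutually consistent, since a beta of 0.37 gives a ratio of approximately 0.60.

```python
def abundance_ratio(beta, allele_difference=2):
    """Ratio of mean FPKM at 0 versus 2 functional alleles.

    Assumes beta is the change in log2(FPKM) per functional allele,
    so removing `allele_difference` alleles scales FPKM by
    2 ** (-beta * allele_difference).
    """
    return 2.0 ** (-beta * allele_difference)
```

Under the same assumption, a beta of 0.5 would correspond to an exact halving of abundance between two functional alleles and none.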

Figure 42: (A) Distribution of RVIS percentiles for all human genes having an RVIS score. (B) RVIS percentiles for randomly selected transcripts with similar coding length to the transcripts plotted in Figure 15B (Wilcoxon rank-sum comparison to all genes in panel A: W = 138110000, P = 0.72). (C) RVIS percentiles for randomly selected transcripts with similar total length to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 137890000, P = 0.75). (D) RVIS percentiles for randomly selected transcripts with matching numbers of exons to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 138200000, P = 0.68). (E) RVIS percentiles for randomly selected transcripts with similar G+C% to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 128960000, P = 0.99).

Figure 43: (A) Distribution of RVIS percentiles for homozygous loss-of-function genes in 1000 Genomes Project samples as predicted by ACE (Wilcoxon rank-sum comparison to RVIS percentiles for all genes in Figure 42A: W = 7378700, P < 2.2×10⁻¹⁶). (B) RVIS percentiles for heterozygous LOF genes (compared to RVIS percentiles for all genes: W = 53451000, P < 2.2×10⁻¹⁶). (C) Distribution of ncRVIS percentiles for homozygous LOF genes (compared to ncRVIS percentiles for all genes: W = 6032200, P = 5.5×10⁻¹¹). (D) ncRVIS percentiles for heterozygous LOF genes (compared to ncRVIS for all genes: W = 48724000, P < 2.2×10⁻¹⁶).

Figure 44: (A) Distribution of numbers of isoforms per gene for all of GENCODE version 19. (B) Distribution of numbers of isoforms per gene for the N = 67 genes depicted in Figure 43B for which RVIS percentile < .20.

Figure 45: (A) Variant rs11278302 in Ensembl gene ENSG00000174177 results in deletion of an entire splice site; however, the resulting sequence retains a valid donor splice site consensus and the flanking sequence scores above threshold under a positional weight matrix; furthermore, more than 30 spliced reads are assigned by TopHat 2 to this splice site in each allele of a homozygous individual, indicating that splicing is retained at this site under the alternate allele. Ensembl VEP classifies the variant as having high impact, due to the apparent loss of a splice site. (B) Variant rs67712719 in Ensembl gene ENSG00000179588 introduces a frameshift which rs67322929 corrects, resulting in only two amino acid changes. These variants highly co-occur in 1000 Genomes Project individuals (97% of all 5008 haplotypes). However, Ensembl VEP classifies each individually as having high impact, as individually either would result in a frameshift.

Figure 46: (A) Reference (hg19) and alternate alleles for Ensembl gene ENSG00000179588 in 1000 Genomes Project sample HG00096, haplotype 2, in the region of variants rs67712719, rs67322929, and rs67873604. (B) Interpreting variant rs67712719 alone would lead to a conclusion that the variant results in a frameshift and a large number of amino acid changes and protein truncation. (C) Similarly, interpretation of variant rs67322929 alone would lead to frameshift and a large change to the encoded protein. (D) Joint interpretation of all variants together reveals that changes are limited to four amino acids, two of which are deleted and two of which undergo substitution.

Figure 47: Distribution of lengths (in amino acids) of affected intervals between compensatory frameshifts in 1000 Genomes Project samples.

Figure 48: Distribution of simulated frameshift lengths in GENCODE protein-coding genes, assuming one frameshift per gene and uniform locations within coding segments; outliers above 3528 nt (top 1%) were omitted for display purposes only. Full data set: median=260, N=162716.

Figure 49: (A) Results of running an HMM gene finder on 19,000 broken genes. The gene finder was run on each gene, then a stop codon was inserted in a random location in the CDS without creating a splice site, and the gene finder was run on the modified sequence. In 11% of cases the gene finder predicted the same splice pattern on both the original sequence and the sequence modified to contain a premature stop. In 9% of cases, the gene finder predicted that no gene was present after the stop was inserted. In the remaining 80% of cases, the gene finder predicted a different splice pattern after the stop codon was inserted. (B) Relative position of inserted stop codon (relative to the spliced transcript) in cases in which the gene finder predicted the same splice pattern. There was a strong enrichment for stop codons inserted near the end of the coding segment, in which only the terminal portion of the protein would be affected, as well as a weaker enrichment near the beginning of the coding segment, in which the gene finder was able to find another start codon in the same reading frame that avoided the inserted stop codon. (C) Relative position of inserted stop codon in cases in which both the splice pattern did not change and the start codon was not changed.

Figure 50: Reproducibility of logistic regression training for content sensors. (A) Hexamer weights trained on 10,000 exon-intron pairs, versus weights for the same hexamers from an independent logistic regression applied to the same training cases. (B) Hexamer weights estimated from 20,000 training cases (x-axis) and weights for the same hexamers estimated from a subset of 10,000 training cases (y-axis).

Figure 51: (A) Effect of scaling factor r_{content/signal} on AUC for SGRF applied to Thousand Genomes individual HG00096, using the human logistic content sensor. (B) AUC versus scaling factor for the SGRF using the minigene content sensor.

Figure 52: Predictive accuracy of logistic signal sensors versus positional weight matrices (PWMs). (A) AUC for classification of annotated donor splice sites versus decoy sites, using logistic signal sensor (red) and PWM (blue). (B) AUC for classification of annotated acceptor splice sites versus decoy sites, using logistic signal sensor (red) and PWM (blue).

Figure 53: Number of Gs in 4096 hexamers (y-axis) as a function of logistic weights; points are sorted along the x-axis by weight.

Figure 54: Example of a de novo splice site that appears to result in greater splicing activity than at the annotated splice site. Top: haplotype 1 of Thousand Genomes individual HG00118 shows evidence of splicing only at the annotated splice site. Bottom: variant rs202069778 in haplotype 2 of the same individual creates a new acceptor splice site that retains the original reading frame in the MAP4K1 gene, resulting in 8 amino acids being excluded from the encoded protein; TopHat2 aligns more spliced reads to this site than to the annotated site in this haplotype. This variant has a global MAF of 0.0002 in Thousand Genomes phase 3 samples, indicating it is possibly deleterious.

References

Adams MD, et al. (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195.

Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR (2010) A method and server for predicting damaging missense mutations. Nature Methods 7:248-249.

Agarwal P, Bafna V (1998) The ribosome scanning model for translation initiation: implications for gene prediction and full-length cDNA detection. Proc Int Conf Intell Syst Mol Biol 6:2-7.

Akhtar W, de Jong J, Pindyurin AV, Pagie L, Meuleman W, de Ridder J, Berns A, Wessels LF, van Lohuizen M, van Steensel B (2013) Chromatin position effects assayed by thousands of reporters integrated in parallel. Cell 154:914-927.

Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P (2002) Molecular Biology of the Cell, 4th edition. Garland Science.

Allen AS, Berkovic SF, Cossette P, Delanty N, Dlugos D, Eichler EE, Epstein MP, Glauser T, Goldstein DB, Han Y, Epi4K Consortium, Epilepsy Phenome/Genome Project (2013) De novo mutations in epileptic encephalopathies. Nature 501:217–221.

Allen JE, Salzberg SL (2005) JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21:3596–3603.

Amano T, Sagai T, Tanabe H, Mizushina Y, Nakazawa H, Shiroishi T (2009) Chromosomal dynamics at the Shh locus: limb bud-specific differential regulation of competence and active transcription. Developmental Cell 16:47-57.

Arnold CD, Gerlach D, Stelzer C, Boryń ŁM, Rath M, Stark A (2013) Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339:1074-1077.

Arnone MI, Davidson EH (1997) The hardwiring of development: organization and function of genomic regulatory systems. Development 124:1851-1864.

Azad RK, Borodovsky M (2004) Effects of choice of DNA sequence model structure on gene identification accuracy. Bioinformatics 20:993-1005.

Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research 37:W202-8.

Bailey TL, Gribskov M (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14:48-54.

Balasubramani A, Larjo A, Bassein JA, Chang X, Hastie RB, Togher SM, Lähdesmäki H, Rao A (2015) Cancer-associated ASXL1 mutations may act as gain-of-function mutations of the ASXL1–BAP1 complex. Nature Communications 6:7307.

Banerji J, Rusconi S, Schaffner W (1981) Expression of a β-globin gene is enhanced by remote SV40 DNA sequences. Cell 27:299–308.

Barbosa C, Peixeiro I, Romao L (2013) Gene expression regulation by upstream open reading frames and human disease. PLoS Genetics 9(8):e1003529.

Battle A, Mostafavi S, Zhu X, Potash JB, Weissman MM, McCormick C, Haudenschild CD, Beckman KB, Shi J, Mei R, et al. (2014) Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Research 24:14–24.

Beck S, Penque D, Garcia S, Gomes A, Farinha C, Mata L, Gulbenkian S, Gil-Ferreira K, Duarte A, Pacheco P, Barreto C, Lopes B, Cavaco J, Lavinha J, Amaral MD (1999) Cystic fibrosis patients with the 3272-26A-->G mutation have mild disease, leaky alternative mRNA splicing, and CFTR protein at the cell membrane. Human Mutation 14:133-144.

Belton JM, McCord RP, Gibcus JH, Naumova N, Zhan Y, Dekker J (2012) Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58:268-276.

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57:289–300.

Bernal A, Crammer K, Hatzigeorgiou A, Pereira F (2007) Global discriminative learning for higher-accuracy computational gene prediction. PLoS Computational Biology 16:e54.

Bernard E, Jacob L, Mairal J, Vert JP (2014) Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics 30:2447-2455.

Berget SM (1990) Exon recognition in vertebrate splicing. Journal of Biological Chemistry 270:2411-2414.

Biddie SC, John S, Sabo PJ, Thurman RE, Johnson TA, Schiltz RL, Miranda TB, Sung MH, Trump S, Lightman SL, Vinson C, Stamatoyannopoulos JA, Hager GL (2011) Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Molecular Cell 43:145-155.

Bilmes J (1998) A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. ICSI TR-97-021, U.C. Berkeley, USA.

Birney E, Lieb JD, Furey TS, Crawford GE, Iyer VR (2010) Allele-specific and heritable chromatin signatures in humans. Hum Mol Genet 19:R204–R209.

Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE (2008) High-resolution mapping and characterization of open chromatin across the genome. Cell 132:311–322.

Braunschweig U, Barbosa-Morais NL, Pan Q, Nachman EN, Alipanahi B, Gonatopoulos-Pournatzis T, Frey B, Irimia M, Blencowe BJ (2014) Widespread intron retention in mammals functionally tunes transcriptomes. Genome Research, doi:10.1101/gr.177790.114.

Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525-527.

Buckley PT, Khaladkar M, Kim J, Eberwine J (2014) Cytoplasmic intron retention, function, splicing, and the sentinel RNA hypothesis. Wiley Interdiscip Rev RNA 5:223-230.

Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ (2013) Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10:1213-1218.

Buratti E, Baralle M, Baralle FE (2006) Defective splicing, disease and therapy: searching for master checkpoints in exon definition. Nucleic Acids Research 34:3494-3510.

Buratti E, Chivers M, Královicová J, Romano M, Baralle M, Krainer AR, Vorechovsky I (2007) Aberrant 5' splice sites in human disease genes: mutation pattern, nucleotide structure and comparison of computational tools that predict their utilization. Nucleic Acids Research 35:4250-4256.

Burge (1997) Identification of genes in human genomic DNA. PhD Dissertation. Stanford University.

Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268:78-94.

Butter F, Davison L, Viturawong T, Scheibe M, Vermeulen M, Todd JA, Mann M (2012) Proteome-wide analysis of disease-associated SNPs that show allele-specific transcription factor binding. PLoS Genetics 8(9):e1002982.

Cai X, Wang ZY, Xing YY, Zhang JL, Hong MM (1998) Aberrant splicing of intron 1 leads to the heterogeneous 5’ UTR and decreased expression of waxy gene in rice cultivars of intermediate amylose content. Plant Journal 14:459-465.

Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet 86:6–22.

Carr MS, Yevtodiyenko A, Schmidt CL, Schmidt JV (2007) Allele-specific histone modifications regulate expression of the Dlk1–Gtl2 imprinted domain. Genomics 89:280-290.

Cavalli G, Misteli T (2013) Functional implications of genome topology. Nature Structural & Molecular Biology 20:290-299.

Cheng C-Y, Krishnakumar V, Chan A, Schobel S, Town CD (2017) Araport 11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant Journal 89:789–804.

Chen W, Dang T, Blind RD, Wang Z, Cavasotto CN, Hittelman AB, Rogatsky I, Logan SK, Garabedian MJ (2008) Glucocorticoid receptor phosphorylation differentially affects target gene expression. Molecular Endocrinology 22:1754-1766.

Choi YD, Grabowski PJ, Sharp PA, Dreyfuss G (1986) Heterogeneous Nuclear Ribonucleoproteins: Role in RNA Splicing. Science 231:1534-1539.

Choi Y, Sims GE, Murphy S, Miller JR, Chan AP (2012) Predicting the functional effect of amino acid substitutions and indels. PLoS One 7:e46688.

Cigan AM, Feng L, Donahue TF (1988) tRNAi(met) functions in directing the scanning ribosome to the start site of translation. Science 242:93-97.

Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6:80-92.

Corradin O, Saiakhova A, Akhtar-Zaidi B, Myeroff L, Willis J, Cowper-Sallari R, Lupien M, Markowitz S, Scacheri PC (2014) Combinatorial effects of multiple enhancer variants in linkage disequilibrium dictate levels of gene expression to confer susceptibility to common traits. Genome Research 24:1–13.

Crans GG, Shuster JJ (2008) How conservative is Fisher’s exact test? A quantitative evaluation of the two-sample comparative binomial trial. Statistics in Medicine 27:3598-3611.

Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA, et al. (2010) Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci 107:21931–21936.

Cui K, Zhao K (2012) Genome-wide approaches to determining nucleosome occupancy in metazoans using MNase-Seq. Methods in Molecular Biology 833:413-419.

Davidson EH (1990) How embryos work: a comparative view of diverse modes of cell fate specification. Development 108:365-389.

Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK (2009) Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25:3207-3212.

Degner JF, Pai AA, Pique-Regi R, Veyrieras JB, Gaffney DJ et al. (2012) DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482:390–394.

Dekker J, Rippe K, Dekker M, Kleckner N (2002) Capturing chromosome conformation. Science 295:1306-1311.

Denver RJ (2009) Structural and functional evolution of vertebrate neuroendocrine stress systems. Annals of the New York Academy of Sciences 1163:1-16.

Delaneau O, Marchini J, 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications 5:3934.

Dempster A, Laird N, Rubin D (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B (Methodological) 39:1-38.

Dickel DE, Zhu Y, Nord AS, Wylie JN, Akiyama JA, Afzal V, Plajzer-Frick I, Kirkpatrick A, Göttgens B, Bruneau BG, Visel A, Pennacchio LA (2014) Function-based identification of mammalian enhancers using site-specific integration. Nature Methods 11:566-571.

Di Giacomo D, Gaildrat P, Abuli A, Abdat J, Frébourg T, Tosi M, Martins A (2013) Functional Analysis of a Large set of BRCA2 exon 7 Variants Highlights the Predictive Value of Hexamer Scores in Detecting Alterations of Exonic Splicing Regulatory Elements. Human Mutation 34:1547-1557.

D'Ippolito AM, McDowell IC, Barrera A, Hong LK, Leichter SM, Bartelt LC, Vockley CM, Majoros WH, Safi A, Song L, Gersbach CA, Crawford GE, Reddy TE (2017) Glucocorticoids modulate the pre-existing chromatin structure to regulate gene expression. In preparation.

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21.

Doma MK, Parker R (2006) Endonucleolytic cleavage of eukaryotic mRNAs with stalls in translation elongation. Nature 440:561–564.

Domke J (2014) Training structured predictors through iterated logistic regression. In: Advanced Structured Prediction. MIT Press.

Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C, Green RD, Dekker J (2006) Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Research 16:1299-1309.

Dreyfuss G, Matunis MJ, Piñol-Roma S, Burd CG (1993) hnRNP proteins and the biogenesis of mRNA. Annu Rev Biochem 62:289-321.

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Dynan WS, Tjian R (1983) The promoter-specific transcription factor Sp1 binds to upstream sequences in the SV40 early promoter. Cell 35:79–87.

Eckner R, Ewen ME, Newsome D, Gerdes M, DeCaprio JA, Lawrence JB, Livingston DM (1994) Molecular cloning and functional analysis of the adenovirus E1A-associated 300-kD protein (p300) reveals a protein with properties of a transcriptional adaptor. Genes and Development 8:869-884.

The ENCODE Project Consortium (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799–816.

The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57-74.

England SB, Nicholson LV, Johnson MA, Forrest SM, Love DR, et al. (1990) Very mild muscular dystrophy associated with the deletion of 46% of dystrophin. Nature 343:180–182.

Erkelenz S, Theiss S, Otte M, Widera M, Peter JO, Schaal H (2014) Genomic HEXploring allows landscaping of novel potential splicing regulatory elements. Nucleic Acids Research 42:10681–10697.

Ernst J, Kellis M (2010) Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology 28:817-825.

Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, Bernstein BE (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473:43-49.

Eun B, Sampley ML, Good AL, Gebert CM, Pfeifer K (2013) Promoter cross-talk via a shared enhancer explains paternally biased expression of Nctc1 at the Igf2/H19/Nctc1 imprinted locus. Nucleic Acids Research 41:817-826.

Feng Q, Vickers KC, Anderson MP, Levin MG, Chen W, Harrison DG, Wilke RA (2013) A common functional promoter variant links CNR1 gene expression to HDL cholesterol level. Nature Communications 4:1973.

Fickett JW (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Research 10:5303-5318.

Fogarty MP, Cannon ME, Vadlamudi S, Gaulton KJ, Mohlke KL (2014) Identification of a regulatory variant that binds FOXA1 and FOXA2 at the CDC123/CAMK1D type 2 diabetes GWAS locus. PLoS Genetics 10:e1004633.

Frankish A, Uszczynska B, Ritchie GR, Gonzalez JM, Pervouchine D, Petryszak R, Mudge JM, Fonseca N, Brazma A, Guigo R, Harrow J (2015) Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics 16(Suppl 8):S2.

Friedman BA, Stadler MB, Shomron N, Ding Y, Burge CB (2008) Ab initio identification of functionally interacting pairs of cis-regulatory elements. Genome Research 18:1643-1651.

Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33:1-22.

Gaffney DJ, McVicker G, Pai AA, Fondufe-Mittendorf YN, Lewellen N, Michelini K, Widom J, Gilad Y, Pritchard JK (2012) Controls of Nucleosome Positioning in the Human Genome. PLoS Genetics 8:e1003036.

Galas DJ, Schmitz A (1978) DNase footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Research 5:3157-3170.

Gisselbrecht SS, Barrera LA, Porsch M, Aboukhalil A, Estep PW III, Vedenko A, Palagi A, Kim Y, Zhu X, Busser BW, et al. (2013) Highly parallel assays of tissue-specific enhancers in whole Drosophila embryos. Nat Methods 10:774–780.

Goldstein I, Hager GL (2017) Dynamic enhancer function in the chromatin context. WIREs Syst Biol Med 2017, e1390. doi: 10.1002/wsbm.1390.

González AN, Lu H, Erickson JW (2008) A shared enhancer controls a temporal switch between promoters during Drosophila primary sex determination. PNAS 105:18436-18441.

Gordân R, Shen N, Dror I, Zhou T, Horton J, Rohs R, Bulyk ML (2013). Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep 3:1093–1104.

Grøntved L, John S, Baek S, Liu Y, Buckley JR, Vinson C, Aguilera G, Hager GL (2013) C/EBP maintains chromatin accessibility in liver and facilitates glucocorticoid receptor recruitment to steroid response elements. EMBO Journal 32:1568-1583.

Gross SS, Do CB, Sirota M, Batzoglou S (2007) CONTRAST: A discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biology 8:R269.

Grossman SR, Zhang X, Wang L, Engreitz J, Melnikov A, Rogov P, Tewhey R, Isakova A, Deplancke B, Bernstein BE, Mikkelsen TS, Lander ES (2017) Systematic dissection of genomic features determining transcription factor binding and enhancer function. PNAS 114:E1291-E1300.

Guigo R, Knudsen S, Drake N, Smith T (1992) Prediction of gene structure. Journal of Molecular Biology 226:141–157.

Guigo R, Valcárcel J (2015) Prescribing splicing. Science 347:124-125.

Guo C, Ludvik AE, Arlotto ME, Hayes MG, Armstrong LL, Scholtens DM, Brown CD, Newgard CB, Becker TC, Layden BT, Lowe WL, Reddy TE (2015) Coordinated regulatory variation associated with gestational hyperglycaemia regulates expression of the novel hexokinase HKDC1. Nature Communications 6:6069.

Guo Y, Mahony S, Gifford DK (2012) High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Computational Biology 8:e1002638.

Gusev A, Lee SH, Trynka G, Finucane H, Vilhjálmsson BJ, Xu H, Zang C, Ripke S, Bulik-Sullivan B, Stahl E, et al. (2014) Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am J Hum Genet 95:535–552.

Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR (2008) Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9:R7.

Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T (2010) Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 141:129-141.

Hardison R, Xu J, Jackson J, Mansberger J, Selifonova O, Grotch B, Biesecker J, Petrykowska H, Miller W (1993) Comparative analysis of the locus control region of the rabbit beta-like globin gene cluster: HS3 increases transient expression of an embryonic epsilon-globin gene. Nucleic Acids Research 21:1265-1272.

Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Research 22:1760-1774.

Harteveld CL, Wijermans PW, van Delft P, Rasp E, Haak HL, Giordano PC (2004) An alpha-thalassemia phenotype in a Dutch Hindustani, caused by a new point mutation that creates an alternative splice donor site in the first exon of the alpha2-globin gene. Hemoglobin 28:255-259.

Hayer KE, Pizarro A, Lahens NF, Hogenesch JB, Grant GR (2015) Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinformatics 31:3938-3945.

Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW, Ching KA, Antosiewicz-Bourget JE, Liu H, Zhang X, Green RD, Lobanenkov VV, Stewart R, Thomson JA, Crawford GE, Kellis M, Ren B (2009) Histone modifications at human enhancers reflect global cell-type specific gene expression. Nature 459:108–112.

Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, Wang W, Weng Z, Green RD, Crawford GE, Ren B (2007) Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genetics 39:311-318.

Hilton IB, D’Ippolito AM, Vockley CM, Thakore PI, Crawford GE, Reddy TE, Gersbach CA (2015) Epigenome editing by a CRISPR-Cas9-based acetyltransferase activates genes from promoters and enhancers. Nat Biotechnology 33:510-517.

Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods 9:473-476.

Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491.

Hopcroft J, Ullman JD (1979) Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.

Hsiao CJ, Cherry DK, Beatty PC, Rechtsteiner EA (2010) National Ambulatory Medical Care Survey: 2007 summary. Natl Health Stat Report 27:1-32.

Huelga SC, Vu AQ, Arnold JD, Liang TY, Liu PP, Yan BY, Donohue JP, Shiue L, Hoon S, Brenner S, Ares M Jr, Yeo GW (2012) Integrative Genome-wide Analysis Reveals Cooperative Regulation of Alternative Splicing by hnRNP Proteins. Cell Reports 1:167-178.

Hu H, Huff CD, Moore B, Flygare S, Reese MG, Yandell M (2013) VAAST 2.0: improved variant classification and disease-gene identification using a conservation-controlled amino acid substitution matrix. Genet Epidemiol 37:622-634.

Inoue F, Ahituv N (2015) Decoding enhancers using massively parallel reporter assay. Genomics 106:159-164.

Isshiki M, Morino K, Nakajima M, Okagaki RJ, Wessler SR, Izawa T, Shimamoto K (1998) A naturally occurring functional allele of the rice waxy locus has a GT to TT mutation at the 5’ splice site of the first intron. Plant Journal 15:133–138.

Itzkovitz S, Hodis E, Segal E (2010) Overlapping codes within protein-coding sequences. Genome Research 20:1582-1589.

Iwafuchi-Doi M, Donahue G, Kakumanu A, Watts JA, Mahony S, Pugh BF, Lee D, Kaestner KH, Zaret KS (2016) The Pioneer Transcription Factor FoxA Maintains an Accessible Nucleosome Configuration at Enhancers for Tissue-Specific Gene Activation. Molecular Cell 62:79-91.

Jelinek F (1998) Statistical Methods for Speech Recognition. MIT Press.

Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 316:1497-1502.

John S, Sabo PJ, Thurman RE, Sung MH, Biddie SC, Johnson TA, Hager GL, Stamatoyannopoulos JA (2011) Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nature Genetics 43:264-268.

Juan-Mateu J, González-Quereda L, Rodríguez MJ, Verdura E, Lázaro K, Jou C, Nascimento A, Jiménez-Mallebrera C, Colomer J, Monges S, Lubieniecki F, Foncuberta ME, Pascual-Pascual SI, Molano J, Baiget M, Gallano P (2013) Interplay between DMD point mutations and splicing signals in Dystrophinopathy phenotypes. PLoS One 8:e59916.

Jung H, Lee D, Lee J, Park D, Kim YJ, Park WY, Hong D, Park PJ, Lee E (2015) Intron retention is a widespread mechanism of tumor-suppressor inactivation. Nature Genetics 47:1242–1248.

Juven-Gershon T, Cheng S, Kadonaga JT (2006) Rational design of a super core promoter that enhances gene expression. Nature Methods 3:917-922.

Kalyna M, Simpson CG, Syed NH, Lewandowska D, Marquez Y, Kusenda B, Marshall J, Fuller J, Cardle L, McNicol J, Dinh HQ, Barta A, Brown JW (2012) Alternative splicing and nonsense-mediated decay modulate expression of important regulatory genes in Arabidopsis. Nucleic Acids Research 40:2454-2469.

Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, et al. (2010) Variation in transcription factor binding among humans. Science 328:232–235.

Ke S, Chasin LA (2010) Intronic motif pairs cooperate across exons to promote pre-mRNA splicing. Genome Biology 11:R84.

Ke S, Chasin LA (2011) Context-dependent splicing regulation. RNA Biology 8:384-388.

Ke S, Shang S, Kalachikov SM, Morozova I, Yu L, Russo JJ, Ju J, Chasin LA (2011) Quantitative evaluation of all hexamers as exonic splicing elements. Genome Research 21:1360–1374.

Ke S, Zhang XH, Chasin LA (2008) Positive selection acting on splicing motifs reflects compensatory evolution. Genome Research 18:533-543.

Kilpinen H, Waszak SM, Gschwind AR, Raghav SK, Witwicki RM, Orioli A, Migliavacca E, Wiederkehr M, Gutierrez-Arcelus M, Panousis NI, et al. (2013) Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342:744–747.

Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12:357-360.

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 14:R36.

Klauer AA, van Hoof A (2012) Degradation of mRNAs that lack a stop codon: a decade of nonstop progress. Wiley Interdiscip Rev RNA 3:649-660.

Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59.

Korf I, Flicek P, Duan D, Brent MR (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17:S140–S148.

Kornblihtt AR, Schor IE, Alló M, Dujardin G, Petrillo E, Muñoz MJ (2013) Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat Rev Mol Cell Biol 14:153-165.

Kozak M (1990) Downstream secondary structure facilitates recognition of initiator codons by eukaryotic ribosomes. PNAS 87:8301–8305.

Královicová J, Christensen MB, Vořechovský I (2005) Biased exon/intron distribution of cryptic and de novo 3' splice sites. Nucleic Acids Research 33:4882-4898.

Královicová J, Vořechovský I (2007) Global control of aberrant splice site activation by auxiliary splicing sequences: evidence for a gradient in exon and intron definition. Nucleic Acids Research 35:6399–6413.

Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol 4:134-142.

Kumar P, Henikoff S, Ng PC (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols 4:1073-1081.

Kwasnieski JC, Mogno I, Myers CA, Corbo JC, Cohen BA (2012) Complex effects of nucleotide variants in a mammalian cis-regulatory element. PNAS 109:19498–19503.

Lander ES, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.

Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.

Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357–359.

Lappalainen T, Sammeth M, Friedländer MR, 't Hoen PA, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlöf J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HP, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM; Geuvadis Consortium, Lehrach H, Schreiber S, Sudbrak R, Carracedo A, Antonarakis SE, Häsler R, Syvänen AC, van Ommen GJ, Brazma A, Meitinger T, Rosenstiel P, Guigó R, Gut IG, Estivill X, Dermitzakis ET (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501:506-511.

Lek M, et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285–291.

Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83:311–321.

Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326:289-293.

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760.

Li H (2011) Tabix: fast retrieval of features from generic TAB-delimited files. Bioinformatics 27:718-719.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB (2001) A computational analysis of sequence features involved in recognition of short introns. PNAS 98:11193–11198.

Long JC, Caceres JF (2009) The SR protein family of splicing factors: master regulators of gene expression. Biochem J 417:15-27.

López-Bigas N, Audit B, Ouzounis C, Parra G, Guigó R (2005) Are splicing mutations the most frequent cause of hereditary disease? FEBS Letters 579:1900-1903.

Luo K, Hartemink AJ (2013) Using DNase digestion data to accurately identify transcription factor binding sites. Pacific Symposium on Biocomputing 2013:80–91.

Lukashin AV, Borodovsky M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26:1107–1115.

Lykke-Andersen S, Jensen TH (2015) Nonsense-mediated mRNA decay: an intricate machinery that shapes transcriptomes. Nature Reviews Molecular Cell Biology 16:665-677.

MacArthur DG, et al. (2012) A systematic survey of loss-of-function variants in human protein-coding genes. Science 335:823-828.

Machanick P, Bailey TL (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27:1696-1697.

Majoros W, Ohler U (2007) Spatial preferences of microRNA targets in 3' untranslated regions. BMC Genomics 8:152.

Majoros W, Pertea M, Delcher A, Salzberg SL (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 6:16.

Majoros WH (2007a) Conditional Random Fields. Online supplement to: Methods for Computational Gene Prediction. http://www.geneprediction.org/book/CRFs.pdf

Majoros WH (2007b) Methods for Computational Gene Prediction. Cambridge University Press.

Majoros WH, Campbell MS, Holt C, DeNardo EK, Ware D, Allen AS, Yandell M, Reddy TE (2017) High-throughput interpretation of gene structure changes in human and nonhuman resequencing data, using ACE. Bioinformatics 33:1437–1446.

Majoros WH, Lebeck N, Ohler U, Li S (2014) Improved transcript isoform discovery using ORF graphs. Bioinformatics 30:1958-1964.

Majoros WH, Lekprasert P, Mukherjee N, Skalsky RL, Corcoran DL, Cullen BR, Ohler U (2013) MicroRNA target site identification by integrating sequence and binding information. Nature Methods 10:630-633.

Majoros WH, Pertea M, Antonescu C, Salzberg SL (2003) GlimmerM, Exonomy, and Unveil: Three ab initio Eukaryotic Genefinders. Nucleic Acids Research 31:3601-3604.

Majoros WH, Pertea M, Salzberg SL (2004) TIGRscan and GlimmerHMM: two open source ab initio eukaryotic gene finders. Bioinformatics 20:2878-2879.

Majoros WH, Pertea M, Salzberg SL (2005) Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics 21:1782-1788.

Majoros WH, Salzberg SL (2004) An empirical analysis of training protocols for probabilistic gene finders. BMC Bioinformatics 5:206.

Majoros WH, Subramanian GM, Yandell MD (2002) Identification of Key Concepts in Biomedical Literature using a Modified Markov Heuristic. Bioinformatics 19:402-407.

Maricque BB, Dougherty JD, Cohen BA (2017) A genome-integrated massively parallel reporter assay reveals DNA sequence determinants of cis-regulatory activity in neural cells. Nucleic Acids Research 45:e16.

Mathelier A, Xin B, Chiu T-P, Yang L, Rohs R, Wasserman WW (2016) DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Systems 3:278–286.

Mauger DM, Lin C, Garcia-Blanco MA (2008) hnRNP H and hnRNP F complex with Fox2 to silence fibroblast growth factor receptor 2 exon IIIc. Molecular and Cellular Biology 28:5403-5419.

Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. (2012) Systematic localization of common disease-associated variation in regulatory DNA. Science 337:1190–1195.

McCarthy DJ, et al. (2014) Choice of transcripts and software has a large effect on variant annotation. Genome Medicine 6:26.

McDaniell R, Lee BK, Song L, Liu Z, Boyle AP, Erdos MR, Scott LJ, Morken MA, Kucera KS, Battenhouse A, et al. (2010) Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328:235–239.

McDowell IC, et al. (2017) Glucocorticoid Receptor Recruits to Enhancers and Drives Reprogramming by Motif-directed Binding. In preparation.

McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, Cunningham F (2016) The Ensembl Variant Effect Predictor. Genome Biology 17:122.

Meister G (2013) Argonaute proteins: functional insights and emerging roles. Nature Reviews Genetics 14:447–459.

Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, Young TR, Goldmann JM, Pervouchine DD, Sullivan TJ, Johnson R, Segrè AV, Djebali S, Niarchou A; GTEx Consortium, Wright FA, Lappalainen T, Calvo M, Getz G, Dermitzakis ET, Ardlie KG, Guigó R (2015) The human transcriptome across tissues and individuals. Science 348:660-665.

Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, Feizi S, Gnirke A, Callan CG Jr, Kinney JB, Kellis M, Lander ES, Mikkelsen TS (2012) Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nature Biotechnology 30:271-277.

Meyer IM, Durbin R (2004) Gene structure conservation aids similarity based gene prediction. Nucleic Acids Research 32:776-783.

Mitchell TM (1980) The need for bias in learning generalizations. Technical Report CBM-TR-117, Rutgers University.

Monlong J, Calvo M, Ferreira PG, Guigó R (2014) Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nature Communications 5:4698.

Montgomery SB, et al. (2011) Rare and common regulatory variation in population-scale sequenced human genomes. PLoS Genetics 7:e1002144.

Mort M, Sterne-Weiler T, Li B, Ball EV, Cooper DN, Radivojac P, Sanford JR, Mooney SD (2014) MutPred Splice: machine learning-based prediction of exonic variants that disrupt splicing. Genome Biology 15:R19.

Muerdter F, Boryń LM, Woodfin AR, Neumayr C, Rath M, Zabidi MA, Pagani M, Haberle V, Kazmar T, Catarino RR, Schernhuber K, Arnold CD, Stark A (2017) Resolving systematic errors in widely-used enhancer activity assays in human cells enables genome-wide functional enhancer characterization. bioRxiv. https://doi.org/10.1101/164590

Murphy KP (2012) Machine Learning: A Probabilistic Perspective. MIT Press.

Murtha M, Tokcaer-Keskin Z, Tang Z, Strino F, Chen X, Wang Y, Xi X, Basilico C, Brown S, Bonneau R, et al. (2014) FIREWACh: high-throughput functional detection of transcriptional regulatory modules in mammalian cells. Nature Methods 11:559–565.

Nagy E, Maquat LE (1998) A rule for termination-codon position within intron-containing genes: when nonsense affects mRNA abundance. Trends Biochem Sci 23:198–199.

Natarajan A, Yardimci GG, Sheffield NC, Crawford GE, Ohler U (2012) Predicting cell-type-specific gene expression from regions of open chromatin. Genome Research 22:1711-1722.

Neu-Yilik G, Amthor B, Gehring NH, Bahri S, Paidassi H, Hentze MW, Kulozik AE (2011) Mechanism of escape from nonsense-mediated mRNA decay of human β-globin transcripts with nonsense mutations in the first exon. RNA 17:843–854.

Nickol JM, Felsenfeld G (1988) Bidirectional Control of the Chicken β- and ε-globin Genes by a Shared Enhancer. PNAS 85:2548-2552.

Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, Cox NJ (2010) Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6:e1000888.

Ni Y, Hall AW, Battenhouse A, Iyer VR (2012) Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data. BMC Genetics 13:46.

Nobrega MA, Ovcharenko I, Afzal V, Rubin EM (2003) Scanning human gene deserts for long-range enhancers. Science 302:413.

Nyiko T, Kerényi F, Szabadkai L, Benkovics AH, Major P, Sonkoly B, Mérai Z, Barta E, Niemiec E, Kufel J, Silhavy D (2013) Plant nonsense-mediated mRNA decay is controlled by different autoregulatory circuits and can be induced by an EJC-like complex. Nucleic Acids Research 41:6715–6728.

Ohler U, Liao GC, Niemann H, Rubin GM (2002) Computational analysis of core promoters in the Drosophila genome. Genome Biology 3(12):RESEARCH0087.

Olansky L, Welling C, Giddings S, Adler S, Bourey R, Dowse G, Serjeantson S, Zimmet P, Permutt MA (1992) A variant insulin promoter in non-insulin-dependent diabetes mellitus. J Clin Invest 89:1596–1602.

Pachter L, Alexanderson M, Cawley S (2002) Applications of generalized pair hidden Markov models to alignment and gene finding problems. Journal of Computational Biology 9:389-399.

Pagani F, Baralle FE (2004) Genomic variants in exons and introns: identifying the splicing spoilers. Nature Reviews Genetics 5:389-396.

Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics 40:1413-1415.

Pandit S, Zhou Y, Shiue L, Coutinho-Mansfield G, Li H, Qiu J, Huang J, Yeo GW, Ares M Jr, Fu XD (2013) Genome-wide Analysis Reveals SR Protein Cooperation and Competition in Regulated Splicing. Molecular Cell 50:223–235.

Parker SCJ, Hansen L, Abaan HO, Tullius TD, Margulies EH (2009) Local DNA Topography Correlates with Functional Noncoding Regions of the Human Genome. Science 324:389-392.

Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23:1061–1067.

Paten B, Novak AM, Eizenga JM, Garrison E (2017) Genome graphs and the evolution of genome inference. Genome Research 27:665-676.

Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2016) Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference. bioRxiv doi: http://dx.doi.org/10.1101/021592

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee SI, Cooper GM, et al. (2012) Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology 30:265–270.

Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, Shendure J (2009) High resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat Biotechnol 27:1173–1175.

Paul S, Dansithong W, Kim D, Rossi J, Webster NJ, Comai L, Reddy S (2006) Interaction of muscleblind, CUG-BP1 and hnRNP H proteins in DM associated aberrant IR splicing. EMBO Journal 25:4271-4283.

Peixeiro I, Inácio Â, Barbosa C, Silva AL, Liebhaber SA, Romão L (2012) Interaction of PABPC1 with the translation initiation complex is critical to the NMD resistance of AUG-proximal nonsense mutations. Nucleic Acids Research 40:1160–1173.

Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature 444:499–502.

Perry MW, Boettiger AN, Levine M (2011) Multiple enhancers ensure precision of gap gene-expression patterns in the Drosophila embryo. PNAS 108:13570-13575.

Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotech doi:10.1038/nbt.3122.

Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB (2013) Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genetics 9:e1003709.

Petrovski S, Gussow AB, Wang Q, Halvorsen M, Han Y, Weir WH, Allen AS, Goldstein DB (2015) The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity. PLoS Genetics 11:e1005492.

Pickrell JK, Pai AA, Gilad Y, Pritchard JK (2010) Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genetics 6:e1001236.

Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK (2011) Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Research 21:447–455.

Pruitt KD, et al. (2014) RefSeq: an update on mammalian referenced sequences. Nucleic Acids Research 42(Database):D756-763.

Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77:257-286.

Rahman MA, Azuma Y, Nasrin F, Takeda J, Nazim M, Bin Ahsan K, Masuda A, Engel AG, Ohno K (2015) SRSF1 and hnRNP H antagonistically regulate splicing of COLQ exon 16 in a congenital myasthenic syndrome. Scientific Reports 5:13208.

Ranjan R, Huang B, Fatehi A (2016) Robust Gaussian process modeling using EM algorithm. Journal of Process Control 42:125-136.

Raveh-Sadka T, Levo M, Segal E (2009) Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Research 19:1480–1496.

R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.

Reddy TE, Gertz J, Pauli F, Kucera KS, Varley KE, Newberry KM, Marinov GK, Mortazavi A, Williams BA, Song L, Crawford GE, Wold B, Willard HF, Myers RM (2012) Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Research 22:860–869.

Reddy TE, Pauli F, Sprouse RO, Neff NF, Newberry KM, Garabedian MJ, Myers RM (2009) Genomic determination of the glucocorticoid response reveals unexpected mechanisms of gene regulation. Genome Research 19:2163-2171.

Reik W (2007) Stability and flexibility of epigenetic gene regulation in mammalian development. Nature 447:425-432.

Rhee HS, Sung H, Pugh BF (2011) Comprehensive Genome-wide Protein-DNA Interactions Detected at Single-Nucleotide Resolution. Cell 147:1408–1419.

Robberson BL, Cote GJ, Berget SM (1990) Exon definition may facilitate splice site selection in RNAs with multiple exons. Molecular and Cellular Biology 10:84-94.

Roberts RG, Gardner RJ, Bobrow M (1994) Searching for the 1 in 2,400,000: a review of dystrophin gene point mutations. Human Mutation 4:1-11.

Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139-140.

Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B (2009) The role of DNA shape in protein-DNA recognition. Nature 461:1248-1253.

Rosenberg AB, Patwardhan RP, Shendure J, Seelig G (2015) Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163:698–711.

Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M (2011) AlleleSeq: analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology 7:522.

Sakabe NJ, Savic D, Nobrega MA (2012) Transcriptional enhancers in development and disease. Genome Biology 13:238.

Salzberg SL, Pertea M, Delcher AL, Gardner MJ, Tettelin H (1998) Interpolated Markov models for eukaryotic gene finding. Genomics 59:24-31.

Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32:D91-D94.

Sanyal A, Lajoie BR, Jain G, Dekker J (2012) The long-range interaction landscape of gene promoters. Nature 489:109–113.

Schneider M, Will CL, Anokhina M, Tazi J, Urlaub H, Lührmann R (2010) Exon definition complexes contain the tri-snRNP and can be directly converted into B-like precatalytic splicing complexes. Molecular Cell 38:223-235.

Serfling E, Jasin M, Schaffner W (1985) Enhancers and eukaryotic gene transcription. Trends in Genetics 1:224-230.

Shannon CE (1948) A Mathematical Theory of Communication. Bell System Technical Journal 27:379–423.

Sharon E, Kalma Y, Sharp A, Raveh-Sadka T, Levo M, Zeevi D, Keren L, Yakhini Z, Weinberger A, Segal E (2012) Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nature Biotechnology 30:521–530.

Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29:308–311.

Shlyueva D, Stampfel G, Stark A (2014) Transcriptional enhancers: from properties to genome-wide predictions. Nature Reviews Genetics 15:272-286.

Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 15:1034-1050.

Silva AL, Ribeiro P, Inácio A, Liebhaber SA, Romão L (2008) Proximity of the poly(A)-binding protein to a premature termination codon inhibits mammalian nonsense-mediated mRNA decay. RNA 14:563–576.

Simonis M, Klous P, Splinter E, Moshkin Y, Willemsen R, de Wit E, van Steensel B, de Laat W (2006) Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nature Genetics 38:1348-1354.

Singh R, Valcárcel J (2005) Building specificity with nonspecific RNA binding proteins. Nat Struct Mol Biol 12:645-653.

Sirén J, Välimäki N, Mäkinen V (2014) Indexing Graphs for Path Queries with Applications in Genome Research. IEEE/ACM Trans Comput Biol Bioinform 11:375-388.

Snyder EE, Stormo GD (1993) Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Research 21:607-613.

Soukarieh O, Gaildrat P, Hamieh M, Drouet A, Baert-Desurmont S, Frébourg T, Tosi M, Martins A (2016) Exonic Splicing Mutations Are More Prevalent than Currently Estimated and Can Be Predicted by Using In Silico Tools. PLoS Genetics DOI:10.1371/journal.pgen.1005756.

Spain SL, Barrett JC (2015) Strategies for fine-mapping complex traits. Human Molecular Genetics 24:R111–R119.

Staden R, McLachlan AD (1982) Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Research 10:141-156.

Stadhouders R, Aktuna S, Thongjuea S, Aghajanirefah A, Pourfarzad F, van Ijcken W, Lenhard B, Rooks H, Best S, Menzel S, et al. (2014) HBS1L-MYB intergenic variants modulate fetal hemoglobin via long-range MYB enhancers. J Clin Invest 124:1699–1710.

Stadler MB, Shomron N, Yeo GW, Schneider A, Xiao X, Burge CB (2006) Inference of splicing regulatory activities by sequence neighborhood analysis. PLoS Genetics 24:e191.

Stamatoyannopoulos JA, Goodwin A, Joyce T, Lowrey CH (1995) NF-E2 and GATA binding motifs are required for the formation of DNase I hypersensitive site 4 of the human beta-globin locus control region. EMBO Journal 14:106–116.

Stanke M, Schöffmann O, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62.

Steger DJ, Grant GR, Schupp M, Tomaru T, Lefterova MI, Schug J, Manduchi E, Stoeckert CJ Jr, Lazar MA (2010) Propagation of adipogenic signals through an epigenomic transition state. Genes and Development 24:1035-1044.

Stepankiw N, Raghavan M, Fogarty EA, Grimson A, Pleiss JA (2015) Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic Acids Research 43:8488–8501.

St Johnston D, Nüsslein-Volhard C (1992) The origin of pattern and polarity in the Drosophila embryo. Cell 68:201–219.

Stormo GD, Hartzell GW (1989) Identifying protein-binding sites from unaligned DNA fragments. PNAS 86:1183-1187.

Stormo GD, Haussler D (1994) Optimally parsing a sequence into different classes based on multiple types of evidence. Proc Int Conf Intell Syst Mol Biol 2:369-375.

Stranger BE, Raj T (2013) Genetics of human gene expression. Curr Opin Genet Dev 23:627–634.

Sutton C (2008) Efficient training methods for conditional random fields. PhD , University of Massachusetts, Amherst.

Sutton C, McCallum A (2005) Piecewise training for undirected models. Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence. p568-575.

Sutton C, McCallum A (2006) An introduction to conditional random fields for relational learning. In: Getoor L & Taskar B (eds.) Introduction to statistical relational learning. MIT Press.

Tan A, Abecasis GR, Kang HM (2015) Unified Representation of Genetic Variants. Bioinformatics 31:2202-2204.

The 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR (2015) A global reference for human genetic variation. Nature 526:68-74.

The 3000 Rice Genomes Project (2014) The 3000 rice genomes project. GigaScience 3:7.

The SAM/BAM Format Specification Working Group (2015) Sequence Alignment/Map Format Specification. https://samtools.github.io/hts-specs/SAMv1.pdf

Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee BK, Lee K, London D, Lotakis D, Neph S, Neri F, Nguyen ED, Qu H, Reynolds AP, Roach V, Safi A, Sanchez ME, Sanyal A, Shafer A, Simon JM, Song L, Vong S, Weaver M, Yan Y, Zhang Z, Zhang Z, Lenhard B, Tewari M, Dorschner MO, Hansen RS, Navas PA, Stamatoyannopoulos G, Iyer VR, Lieb JD, Sunyaev SR, Akey JM, Sabo PJ, Kaul R, Furey TS, Dekker J, Crawford GE, Stamatoyannopoulos JA (2012) The accessible chromatin landscape of the human genome. Nature 489:75-82.

Tian Z, Qian Q, Liu Q, Yan M, Liu X, Yan C, Liu G, Gao Z, Tang S, Zeng D, Wang Y, Yu J, Gu M, Li J (2009) Allelic diversities in rice starch biosynthesis lead to a diverse array of rice eating and cooking qualities. PNAS 106:21760–21765.

Tjian R (1978) The binding site on SV40 DNA for a T antigen-related protein. Cell 13:165–179.

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28:511-515.

Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105-1111.

Urbanek M, Hayes MG, Armstrong LL, Morrison J, Lowe LP, Badon SE, Scheftner D, Pluzhnikov A, Levine D, Laurie CC, et al. (2013) The chromosome 3q25 genomic region is associated with measures of adiposity in newborns in a multi-ethnic genome-wide association study. Hum Mol Genet 22:3583–3596.

Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM (2009) A census of human transcription factors: function, expression and evolution. Nature Reviews Genetics 10:252-263.

Vasudevan S, Peltz SW, Wilusz CJ (2002). Non-stop decay—a new mRNA surveillance pathway. BioEssays 24:785–788.

Venter JC, et al. (2001) The sequence of the human genome. Science 291:1304–1351.

Vinson J, DeCaprio D, Pearson M, Luoma S, Galagan J (2007) Comparative Gene Prediction using Conditional Random Fields. In: B Schölkopf, J Platt, T Hoffman (eds.), Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA.

Visel A, Rubin EM, Pennacchio LA (2009) Genomic views of distant-acting enhancers. Nature 461:199-205.

Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13:260–269.

Vockley CM, Barrera A, Reddy TE (2017) Decoding the role of regulatory element polymorphisms in complex disease. Current Opinion in Genetics & Development 43:38–45.

Vockley CM, D’Ippolito AM, McDowell IC, Majoros WH, Safi A, Song L, Crawford GE, Reddy TE (2016) Direct GR Binding Sites Potentiate Clusters of TF Binding across the Human Genome. Cell 166:1269–1281.

Vockley CM*, Guo C*, Majoros WH*, Nodzenski M, Scholtens DM, Hayes MG, Lowe WL, Reddy TE (2015) Massively parallel quantification of the regulatory effects of non-coding genetic variation in a human cohort. Genome Research 25:1206-1214.

Wang JC, Derynck MK, Nonaka DF, Khodabakhsh DB, Haqq C, Yamamoto KR (2004) Chromatin immunoprecipitation (ChIP) scanning identifies primary glucocorticoid receptor target genes. PNAS 101:15603-15608.

Wang J, Lunyak VV, Jordan IK (2013) BroadPeak: a novel algorithm for identifying broad peaks in diffuse ChIP-seq datasets. Bioinformatics 29:492-493.

Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 38(16):e164.

Wasson T, Hartemink AJ (2009) An ensemble model of competitive multi-factor binding of the genome. Genome Research 19:2101-2112.

Weigel D, Jäckle H (1990) The fork head domain, a novel DNA-binding motif of eucaryotic transcription factors? Cell 63:455–456.

White MA, Myers CA, Corbo JC, Cohen BA (2013) Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks. Proc Natl Acad Sci 110:11952–11957.

Woolfe A, Mullikin JC, Elnitski L (2010) Genomic features defining exonic variants that modulate splicing. Genome Biology 11:R20.

Wolpert D (1996) The lack of a priori distinctions between learning algorithms. Neural Computation 8:1341-1390.

Wu JY, Maniatis T (1993) Specific interactions between proteins implicated in splice site selection and regulated alternative splicing. Cell 75:1061-1070.

Xing H, Mo Y, Liao W, et al. (2012) Genome-wide localization of protein-DNA binding and histone modification by a Bayesian change-point method with ChIP-seq data. PLoS Computational Biology 8:e1002613.

Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RK, Hua Y, Gueroussov S, Najafabadi HS, Hughes TR, Morris Q, Barash Y, Krainer AR, Jojic N, Scherer SW, Blencowe BJ, Frey BJ (2015) The human splicing code reveals new insights into the genetic determinants of disease. Science 347:1254806.

Yamamoto F, Clausen H, White T, Marken J, Hakomori S (1990) Molecular genetic basis of the histo-blood group ABO system. Nature 345:229–233.

Yamamoto F, Cid E, Yamamoto M, Saitou N, Bertranpetit J, Blancher A (2014) An integrative evolution theory of histo-blood group ABO and related genes. Sci Rep 4:6601.

Yandell M, Ence D (2012) A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13:329-342.

Yardimci GG, Frank CL, Crawford GE, Ohler U (2014) Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic Acids Research 42:11865–11878.

Yates A, et al. (2016) Ensembl 2016. Nucleic Acids Research 44(Database):D710-716.

Yip SP (2002) Sequence variation at the human ABO locus. Annals of Human Genetics 66(Pt 1):1-27.

Zabidi MA, Arnold CD, Schernhuber K, Pagani M, Rath M, Frank O, Stark A (2015) Enhancer–core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518:556–559.

Zaret KS, Carroll JS (2011) Pioneer transcription factors: establishing competence for gene expression. Genes and Development 25:2227–2241.

Zarnack K, König J, Tajnik M, Martincorena I, Eustermann S, Stévant I, Reyes A, Anders S, Luscombe NM, Ule J (2013) Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell 152:453-466.

Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zollner S (2010) Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am J Hum Genet 87:604–617.

Zhang C, Li W-H, Krainer AR, Zhang MQ (2008) RNA landscape of evolution for optimal exon and intron discrimination. PNAS 105:5797-5802.

Zhang XH, Chasin LA (2004) Computational definition of sequence motifs governing constitutive exon splicing. Genes and Development 18:1241-50.

Zhang XH, Kangsamaksin T, Chao MS, Banerjee JK, Chasin LA (2005) Exon inclusion is dependent on predictable exonic splicing enhancers. Molecular and Cellular Biology 25:7323-7332.

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biology 9:R137.

Zhang Y, Moqtaderi Z, Rattner BP, Euskirchen G, Snyder M, Kadonaga JT, Liu XS, Struhl K (2009) Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo. Nature Structural & Molecular Biology 16:847-852.

Zhou HL, Luo G, Wise JA, Lou H (2014) Regulation of alternative splicing by local histone modifications: potential roles for RNA-guided mechanisms. Nucleic Acids Research 42:701-713.

Zou H, Hastie T (2005) Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society Series B 67:301–320.

Biography

Bill Majoros was born on December 25th, 1970 in Quakertown, PA. For his undergraduate work he attended Penn State University, where he was supported in part by the Chris Mader memorial scholarship, and from which he received a BS in Computer Science with High Distinction (magna cum laude). During the late 1990s he worked as a Senior Researcher in computational linguistics for the Thomson Corporation. From 1999 to 2001 he contributed to the initial sequencing and analysis of the human genome at Celera Genomics in Rockville, MD. In 2007 Cambridge University Press published his book Methods for Computational Gene Prediction. To date he has also authored fifteen first-author peer-reviewed articles and appeared as co-author on an additional twenty peer-reviewed articles. Several of these are listed below:

Majoros WH, Campbell MS, Holt C, DeNardo EK, Ware D, Allen AS, Yandell M, Reddy TE (2017) High-throughput interpretation of gene structure changes in human and nonhuman resequencing data, using ACE. Bioinformatics 33:1437–1446.

Majoros WH, Lebeck N, Ohler U, Li S (2014) Improved transcript isoform discovery using ORF graphs. Bioinformatics 30:1958-1964.

Majoros WH, Lekprasert P, Mukherjee N, Skalsky RL, Corcoran DL, Cullen BR, Ohler U (2013) MicroRNA target site identification by integrating sequence and binding information. Nature Methods 10:630-633.

Majoros WH, Ohler U (2010) Modeling the Evolution of Regulatory Elements by Simultaneous Detection and Alignment with Phylogenetic Pair HMMs. PLoS Computational Biology 6(12): e1001037.

Majoros WH, Ohler U (2008) Complexity Reduction in Context-dependent DNA Substitution Models. Bioinformatics 25:185-82.
