Statistical Modeling of Genetic and Epigenetic Factors in Gene Structures and Transcriptional Enhancers
by
William Hutchins Majoros
Graduate Program in Computational Biology and Bioinformatics
Duke University
Approved:
Tim Reddy, Supervisor
Sayan Mukherjee
Raluca Gordân
Jen-Tsan Chi
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Program in Computational Biology and Bioinformatics in the Graduate School of Duke University
2017
Copyright by William Hutchins Majoros 2017
Abstract
Predicting the phenotypic effects of genetic variants is a major goal in modern genetics, with direct applicability in many areas including the study of diseases in humans and animals, and the breeding of agriculturally important plants.
Computational methods for interpreting genetic variants rely implicitly on annotations of functional genomic elements, such as genes and regulatory elements. Importantly, the locations and boundaries of such annotations can be altered by the presence of specific alleles, either singly or in combination, so that variant interpretation and genomic annotation should ideally be performed jointly. Such joint interpretation would enable predictions to account for the influence that one or more variants may have on the phenotypic impacts of other variants.
In this dissertation I describe computational methods for variant interpretation in both gene bodies and, separately, in transcriptional enhancers that regulate the expression of genes. In the case of gene bodies, I describe novel methods for predicting how genetic variants, either singly or in combination, can impact gene structure, which I define to be the combination of a splicing pattern together with a translation reading frame. Whereas gene structure prediction methods have to date focused exclusively on annotation of reference genomes, I introduce the novel problem of annotating personal genomes of individuals or strains, and I describe and evaluate novel methods for addressing that problem. I show (i) that these methods are able to predict complex changes in gene structures that result from genetic variants, (ii) that they are able to jointly interpret multiple variants that are not independent in their effects, and (iii) that predictions are supported by both RNA-seq data and patterns of intolerance to mutation across human populations.
In the case of transcriptional enhancers, I describe experimental and associated computational methods for assessing the impacts of genetic variants on the ability of an enhancer to drive gene expression in an episomal reporter assay. I show that these methods are able to identify variants impacting enhancer function, and I show that the functional score assigned by these methods can be used to fine-map gene expression associations.
I also describe a statistical pattern recognition method for efficiently identifying drug-responsive regulatory elements genome-wide and parsing those elements into functional sub-components. I show that this model is able to identify drug-responsive enhancers with high accuracy. I show that sub-components identified by this method are enriched for distinct sets of binding motifs for transcription factors known to mediate the response to treatment by glucocorticoids, one of the most commonly used drugs in the world. Applying this model to timecourse data, I was able to cluster predicted enhancers into sets having distinct trajectories of activity over time in response to treatment. Using experimental chromatin conformation data, I show that these trajectories associate with distinct patterns of expression for genes in physical association with these enhancers.
Dedication
This dissertation is dedicated to Brandy and Daisy.
Contents
Abstract
List of Tables
List of Figures
Acknowledgements
Chapter 1 – Outline
Chapter 2 – Background
2.1 Gene structures
2.1.1 Gene structure and its impact on the interpretation of genetic variants
2.1.1.1 Transcription and splicing
2.1.1.2 Translation
2.1.1.3 Assaying the results of splicing and translation
2.1.1.4 Interpreting genetic variants within the context of a fixed gene structure
2.1.1.5 Genetic variants can alter splicing
2.1.1.6 Genetic variants can alter translation reading frames
2.1.2 Traditional approaches to gene structure modeling
2.1.2.1 Hidden Markov models
2.1.2.2 Generalized hidden Markov models
2.1.2.3 Signal sensors
2.1.2.4 Content sensors
2.1.2.5 Conditional random fields
2.2 Transcriptional enhancers
2.2.1 Enhancer function in gene regulation
2.2.2 Experimental methods for assaying enhancers
2.2.3 Epigenetic indicators of enhancer state
2.2.4 Computational models of chromatin state
2.2.4.1 Multivariate hidden Markov models
2.2.4.2 ChromHMM
2.2.4.3 MUMMIE
2.2.4.4 Segway
2.2.5 Enhancers and disease
Chapter 3 – High-throughput interpretation of gene structure changes
3.1 Motivation
3.2 Methods
3.2.1 Reconstructing haplotype sequences from a VCF file
3.2.2 Identifying changes to splice patterns and reading frames
3.2.3 Identifying loss of function
3.2.4 Configuration and structured output
3.2.5 Computational validation
3.3 Results
3.3.1 ACE predicts changes to gene structure
3.3.2 ACE identifies thousands of annotated human splice sites as being potentially robust to disruption
3.3.3 ACE confirms previous estimates of the effect of nonsense-mediated decay on transcript levels
3.3.4 ACE’s loss-of-function predictions in healthy individuals are highly enriched for genes tolerant to mutation
3.3.5 ACE aids interpretation of insertion and deletion variants within genes
3.3.6 ACE accurately reconstructs human blood-group alleles at the ABO locus
3.3.7 ACE identifies complex gene-structure changes in a plant gene influencing flavor and nutritional content
3.4 Discussion
3.5 Supplementary methods
3.5.1 Efficient reconstruction of haplotype sequences
3.5.2 Mapping annotations to individualized sequences
3.5.3 Predicting loss of function
3.5.4 Probability models
3.5.5 Configurable parameters
3.5.6 Alignment of protein sequences
3.5.7 Alignment and quantification of RNA-seq data
3.5.8 Computing relative expression ratios for NMD targets
3.5.9 Versions of software used
Chapter 4 – Variant-aware gene structure prediction in personal genomes
4.1 Motivation
4.2 Methods
4.2.1 Splice Graph Random Field
4.2.2 SGRF features and parameter estimation
4.2.3 Integration of SGRF into ACE+
4.2.4 Computational validation
4.3 Results
4.3.1 Prediction accuracy on 150 human genomes
4.3.2 Logistic and splicing minigene features reflect known hnRNP but not SR protein motifs
4.3.3 De novo splice sites are prevalent, have a wide range of effects, and are misclassified by popular tools
4.4 Discussion
4.5 Supplementary methods
4.5.1 Gene structure prediction in real genes modified with premature stop codons
4.5.2 Regularized logistic regression training
4.5.3 Evaluation of SGRF prediction accuracy using Geuvadis data
4.5.4 Classification of whole exons and introns
4.5.5 Simulating creation and destruction of splice sites
Chapter 5 – Detecting allele-specific effects on the activity of transcriptional enhancers
5.1 Introduction
5.2 Results
5.2.1 Population-scale reporter assay approach
5.2.2 Targeted sequencing of candidate regulatory elements from a GWAS population
5.2.3 Quantifying the effects of noncoding variation in a GWAS population
5.2.4 Identifying regulatory variants in population STARR-seq
5.2.5 Effects of haplotypes on regulatory element activity
5.2.6 Fine mapping genetic associations with phenotypes
5.3 Discussion
5.4 Future directions
5.4.1 Scale and bias
5.4.2 Statistical modeling of allelic effects
Chapter 6 – Integrative modeling of dynamic epigenetic enhancer signatures
6.1 The glucocorticoid signaling system
6.2 Methods
6.2.1 A multivariate model of epigenetic dynamics
6.2.2 Training and evaluating the model
6.2.3 Quantifying and clustering enhancer dynamics over a 12-hour timecourse
6.2.4 Gene expression analysis
6.2.5 Motif enrichment analysis
6.3 Results
6.3.1 Parameter estimates
6.3.2 Classification accuracy
6.3.3 Clustering of temporal trajectories
6.3.4 Multipeak structure of enhancer signatures
6.4 Discussion
Chapter 7 – Conclusions
7.1 Interpreting genetic variants impacting gene structure
7.2 Experimentally testing for allelic effects in enhancers
7.3 Modeling epigenetic signatures of drug-responsive enhancers
Appendix
References
Biography
List of Tables
Table 3.1: Features of splice site models for different classes of G+C content. Order indicates number of previous positions on which the current position is conditioned in the PWM.
Table 3.2: Configurable parameters in ACE, with default parameters used in the analyses described here.
Table 3.3: Parameters used for trimming RNA-seq reads.
Table 3.4: Versions of all software used.
List of Figures
Figure 1: Introns are spliced out of transcripts, leaving only the exons (Reproduced with permission from Majoros, 2007b).
Figure 2: (A) An intron and the major trans-factors involved in its recognition. (B) Nucleotide preferences of positions flanking human donor and acceptor splice sites; letter height is proportional to frequency.
Figure 3: Major forms of alternative splicing. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Genetics (Luca Cartegni, Shern L. Chew, Adrian R. Krainer, Listening to silence and understanding nonsense: exonic mutations that affect splicing, copyright 2002).
Figure 4: Three-periodic phase in coding segments. (Reproduced with permission from Majoros, 2007b).
Figure 5: Gene structure combines a splicing pattern with a coding segment. (Reproduced with permission from Majoros, 2007b).
Figure 6: One type of splice graph. Vertices represent splice sites, and edges represent introns and exons.
Figure 7: (A) A specialized type of splice graph, called an ORF graph. Vertices represent splice sites and start/stop codons. Edges represent introns and exons. Edges and vertices are annotated with scores (not shown) computed by a gene prediction model. Each path corresponds to a gene prediction. The graph can be re-weighted using external evidence such as RNA-seq data. The highest-scoring path is shown in red. Phase information constrains the prediction, as coding segments are assumed to be conserved and well-formed. (B) An unconstrained ORF graph for an Arabidopsis thaliana gene. (C) The same ORF graph as in panel B, after filtering out intron edges not supported by spliced RNA-seq reads.
Figure 8: (A) State-transition diagram of an HMM-based gene structure model. States labeled with nucleotides emit only those nucleotides; these are used to emit splice sites and start/stop codons. State 0 is the silent initial/final state. E=exon, I=intron, N=intergenic. Transition patterns enforce the syntax of genes. (Reproduced with permission from Majoros, 2007b) (B) An HMM depicted as a directed graphical model. Unobservables (white vertices) are states, and observables (gray vertices) are emissions. Arrows depict dependencies. (Reproduced with permission from Majoros, 2007a).
Figure 9: State-transition diagram for a GHMM. Diamond states implement signal sensors for fixed-length features; oval states implement content sensors for variable-length features. “×3” indicates that the three coding phases can have different transition and emission probabilities. Transition patterns enforce gene syntax. (Reproduced with permission from Majoros, 2007b).
Figure 10: A linear-chain CRF. (Reproduced with permission from Majoros, 2007a).
Figure 11: A multivariate HMM for identifying miRNA target sites (Reproduced from Majoros et al., 2013; used with permission).
Figure 12: Flowchart of ACE logic. Major steps are (in order): projection of annotations from the reference, analysis of potential changes to splicing, and analysis of potential changes to translation reading frames.
Figure 13: (A) ACE reconstructs explicit haplotype sequences from a phased VCF file, projects reference annotations onto them, detects possible gene structure changes, and interprets changes in terms of possible loss of function. (B) When a disrupted splice site is encountered, ACE enumerates possible alternate splice forms resulting from cryptic splicing, exon skipping, intron retention, or any combination resulting from multiple variants.
Figure 14: (A) Distribution of number of alternate structures predicted per disrupted splice site. (B) Distribution of proportions of predicted cryptic-site isoforms supported by at least one spliced read, when predicted isoforms are not provided to TopHat 2 (blue) and when they are provided (red). (C) Distribution of proportions of predicted cryptic-site isoforms assigned nonzero FPKM by StringTie when predicted isoforms are not provided to StringTie (blue) and when they are provided (red). (D) Distribution of proportions of predicted cryptic-site isoforms supported by at least one spliced read for splice sites simulated to be disrupted (blue) and for those that are disrupted (red). (E) Distribution of spliced reads per junction, on log10 scale, supporting sites simulated to be disrupted (blue) versus those that are disrupted.
Figure 15: (A) Distribution of log2 effect sizes of N = 578 heterozygous NMD events as measured via RNA-seq transcript quantification. Dashed line at -0.42 denotes a 25% reduction in total transcript quantity. Data were filtered to improve power (sample size ≥ 30, mean FPKM ≥ 1). (B) Percentiles of Residual Variant Intolerance Scores (RVIS) for N = 633 genes in which at least one individual was predicted to be homozygous for gene loss of function.
Figure 16: (A) Deletion of an entire splice site (top: hg19 reference sequence; bottom: haplotypes 1 and 2 of 1000 Genomes Project sample HG00096). The resulting allele appears to retain a functional splice site despite the deletion, as concluded by ACE and supported by spliced RNA-seq reads. (B) Compensatory frameshift variants: the second variant corrects the change to the reading frame introduced by the first variant (top: hg19 reference sequence, bottom: haplotype 2 of 1000 Genomes Project sample HG00096).
Figure 17: (A) Blood-group alleles of the ABO gene (ENSG00000175164). Black: coding segment; gray: untranslated region (UTR). Reference genome hg19 has the O allele; GENCODE version 19 annotates this gene as a processed transcript with no reading frame. ACE identifies the coding segment for the O and B alleles in heterozygous individual HG00096. (Coordinates have been transformed and mapped to the forward strand). (B) Complex differences in gene structure between alleles of the waxy gene in rice, due to a single G-to-T variant in a donor splice site. ACE detects a 1 nt shift in the donor splice site in the Wxb allele, resulting in a new start codon straddling the first intron. The new start codon alters the reading frame, leading to a premature stop codon and NMD.
Figure 18: (A) A splice graph random field (SGRF). Vertices denote splice sites, and edges denote exons and introns. A path from TSS (transcription start site) to TES (transcription end site) outlines a single gene structure; labels 0 and 1 denote omission or inclusion, respectively, of a vertex on the selected path. (B) Cliques and their potential functions; SGRFs have only singleton and pair cliques. Potential functions for cliques labeled with any 0 do not contribute to the score, since they do not participate in the selected path. (C) Cryptic splice sites are unannotated splice sites near an annotated splice site; disrupted splice sites exist in the reference but not in the alternate sequence; de novo splice sites exist in the alternate sequence but not in the reference.
Figure 19: (A) ROC curves for the SGRF with three different content sensors; TP and FP rates were computed based on spliced RNA-seq reads from LCL cells supporting predicted novel splice junctions not found in any isoform of the gene. (B) Area under ROC curves shown in panel A. (C) Difference between logistic model AUC and minigene model AUC. (D) ROC for classification of 10,000 codon exons versus 10,000 introns (red), for 400 lincRNA exons versus 400 lincRNA introns (blue), and for minigene exons with high (N=10,947) versus low (N=16,185) inclusion rates (green), using the logistic human model for classification. (E) Density of positively-scored hexamers under the human logistic model, at relative positions in 400 noncoding lincRNA exons.
Figure 20: (A) Log2(number of de novo splice sites / number of disrupted splice sites) in mutation simulations (brown: requiring only a 2 bp consensus for de novo splice sites; blue: requiring sufficiently high score under splice-site model for de novo splice sites; green: requiring sufficiently high score under the splice-site model and favorable exon definition context for de novo splice sites), and in predictions supported by RNA-seq (red); numbers above bars are raw (non-log2) ratios. (B) Estimates of relative splicing activity (red) and protein change (blue) due to de novo splice sites supported by RNA-seq. (C) Frequencies of dbSNP classifications of variants predicted to create de novo splice sites and supported by spliced RNA-seq reads. (D) Frequencies of VEP classifications of variants creating de novo splice sites supported by RNA-seq.
Figure 21: (A) Population STARR-seq (“Pop-STARR”) adapts the STARR-seq assay to measure regulatory potential of multiple alleles cloned from a study population. (B) Population STARR-seq is highly reproducible. Rep1–7 are biological replicates generated from independent transfections. The x- and y-axes represent element activity (output RNA reads / input DNA reads). In each case, Spearman’s r > 0.90.
Figure 22: Correlation between SNP effect sizes (x-axis: log of product of effect sizes of SNPs on haplotype) and log of observed haplotype effects (y-axis) for putative regulatory haplotypes containing more than one SNP (r = 0.54, P = 0.007). Observed haplotype effect sizes were computed as normalized ratios for each haplotype versus all pooled haplotypes at a locus: (RNA_haplotype/DNA_haplotype)/(RNA_pooled/DNA_pooled). Solid line: regression line (slope=0.8, intercept=-0.019); dotted line: 1:1 diagonal.
Figure 23: Comprehensive measurement of haplotype-specific regulatory element activity provides mechanistic insights into gene regulation. (A) Distribution of enhancer activity scores for fragments containing regulatory variants (red) and fragments containing non-regulatory variants (blue). (B) Histogram of number of SNPs per assayed element. (C) Manhattan plot of eQTLs for the long noncoding RNA LINC00881. Blue dots indicate −log10(P-value) of LINC00881 eQTL from the Geuvadis database (left y-axis); red bars indicate −log10(FDR) for variants that alter regulatory activity in the population STARR-seq assay (right y-axis). Red dotted line indicates FDR = 1.0. (D) Association between normalized expression of long noncoding gene LINC00881 in LCLs as measured by the Geuvadis project (y-axis) and the measured effect size in population STARR-seq assay (x-axis) for SNP rs73170828 (r² = 0.07, P = 7.6×10⁻⁹). (E) Allele-specific H3K27ac analysis of variants rs62274098 and rs73170828, both eQTLs proximal to and 5′ of LINC00881; read counts (y-axis) differed substantially between alleles for rs73170828 (Wilcoxon P = 0.058, binomial P = 0.004) but not for rs62274098 (Wilcoxon P = 0.9; binomial P = 0.92).
Figure 24: Multivariate Hidden Markov model (MV-HMM) of GC-responsive regulatory elements. (A) State-transition diagram of multivariate HMM shown with transition probabilities. (B) Mean emission probability for each state, at pre-dex (bottom) and at 3 hr of dex exposure (top). (C) Covariance matrices for emission distributions of each state.
Figure 25: Consistency of k-means clustering across multiple runs. Each heatmap represents mean within-cluster predicted dex-induced enhancer activity after k-means clustering with different random initializations. Whiter values represent lower activity, redder values represent higher activity.
Figure 26: Area under the receiver operating characteristics curve (AUC) values for the full model and for reduced versions of the model including only the features listed.
Figure 27: Clustering, motif, and gene-expression analyses. (A) Results of clustering trajectories of predicted enhancer activity (left). Enrichments of motifs in clusters measured as excess proportion of elements containing a significant motif match above expectation (* means P < 0.01; right). (B) Gene expression changes for genes physically interacting with enhancers in clusters shown in panel A.
Figure 28: Example enhancer signature with two peaks. HMM states have been recoded: state 0.5 = background, state 1.5 = histone flank, state 2.5 = DHS peak.
Figure 29: Motif enrichment in predicted peaks versus histone flanks of predicted dex-responsive enhancers genome-wide.
Figure 30: Known and predicted gene structures for alleles A and O of the human blood group gene ABO (Ensembl gene ENSG00000175164). The reference genome GRCh38 contains the O allele, which contains a frameshift leading to a premature stop codon. The structure predicted by the state-of-the-art gene finder Augustus (Stanke et al., 2006) for the O allele introduces a novel exon spanning positions 14364-14392 in order to alter the reading frame and avoid the premature stop codon at position 17791, resulting in a higher likelihood due to the strong coding signal in the long final exon of the A allele. (Coordinates have been transformed and mapped to the forward strand).
Figure 31: Distribution of distances (nt) between cryptic splice sites and annotated sites in DBASS, the Database of Aberrant Splice Sites (Buratti et al., 2007); outliers above 644nt were trimmed for illustration purposes only; these constitute <5% of the distribution.
Figure 32: Example of structured Essex output (XML reports are structured similarly). Notable features include the alignment of the reference sequence to the alternate sequence via a CIGAR string indicating insertion, deletion, and match lengths; classification of variants for the reference transcript, the mapped transcript, and any putative novel transcript structures resulting from disruptions to splice sites or changes in translation reading frames (note that variant classifications can differ between alternate transcripts); protein translations for all versions of a transcript; predicted fates of transcripts and/or proteins; and detailed descriptions of disrupted splice sites (shown) and putative cryptic sites (not present in this example).
Figure 33: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least one spliced read when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 537900, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned nonzero FPKM when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least one spliced read, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 699470, P < 2.2×10⁻¹⁶).
Figure 34: Number of ACE-predicted novel isoforms (y-axis) across all Geuvadis samples estimated to meet or exceed a given TPM (Transcripts Per Million) threshold (x-axis), as estimated by Salmon (red), Kallisto (blue), and StringTie (green).
Figure 35: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least three spliced reads when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 762670, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned FPKM ≥ 2 when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least three spliced reads, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 771330, P < 2.2×10⁻¹⁶).
Figure 36: (A) Distribution of proportions of ACE-predicted exon-skipping isoforms supported by at least three spliced reads when such isoforms are not provided (blue) or are provided (red) as hints to TopHat 2 (Wilcoxon W = 579440, P < 2.2×10⁻¹⁶). (B) Proportions of exon-skipping isoforms assigned FPKM ≥ 2 when such isoforms are not provided (blue) or are provided (red) as hints to StringTie (W = 198020, P < 2.2×10⁻¹⁶). (C) Proportions of exon-skipping isoforms supported by at least three spliced reads, for annotated splice sites that are not disrupted (blue) and annotated splice sites that are disrupted (red) (W = 716420, P < 2.2×10⁻¹⁶).
Figure 37: (A) Distribution of normalized read counts (normalized by total reads mapped to the locus) supporting novel splice junctions proximal to non-disrupted annotated sites. (B) Distribution of similarly normalized read counts supporting novel junctions proximal to disrupted annotated sites. The median is significantly greater than for the non-disrupted sites (Wilcoxon W = 222750000, P < 2.2×10⁻¹⁶) shown in panel A.
Figure 38: (A) Scatterplot of cryptic splicing levels for cryptic sites near annotated splice sites that are disrupted (y-axis) or are not disrupted (x-axis). Each point corresponds to one annotated splice site observed to be disrupted in some individuals. Cryptic splicing levels were normalized by total reads mapped to the locus in the individual, then averaged across individuals in which the site was disrupted (y-axis) or not disrupted (x-axis). A Wilcoxon rank-sum test was applied to each point to remove nonsignificant results at an FDR (False Discovery Rate) threshold of 0.05. A majority of points (86%) lie above the y = x line. Median cryptic splicing levels were significantly higher in individuals in which the annotated splice site was disrupted (Wilcoxon signed rank test, V = 1135, P < 2.2×10⁻¹⁶). (B) Magnified view of the same scatterplot shown in panel A, showing 90% of original data points.
Figure 39: (A) Distribution of proportions of transcripts with disrupted splicing among 1000 Genomes Project samples for which ACE identified at least one putative alternate splice form not predicted to entail loss of function (LOF) (maximum change in amino acid sequence was 10 amino acids). (B) Distribution of proportion of genes among 1000 Genomes Project samples identified by ACE as entailing LOF in one but not all annotated isoforms.
Figure 40: (A) Distribution of log2 effect sizes of N=578 heterozygous NMD events after filtering to include only transcripts with mean FPKM ≥ 1. (B) Distribution of log2 effect sizes of N=411 heterozygous NMD events after filtering to include only transcripts with mean FPKM ≥ 2. (C) Distribution of log2 effect sizes of N=297 heterozygous NMD events after filtering to include only transcripts with mean FPKM ≥ 3.
Figure 41: (A) Betas (y-axis) from linear mixed-effects model with random intercepts, log2(FPKM) ~ Xb + Zu, converted to relative abundance ratios r_0/2 = FPKM_0 / FPKM_2, where FPKM_k denotes mean FPKM (fragments per kilobase of transcript per million reads mapped) among individuals predicted to have k functional alleles of a transcript. FPKM thresholds (x-axis) of 0.1 through 15 were used to pre-filter transcripts prior to fitting the model. Only transcripts expressed in at least 30 individuals were included in the analysis. The largest Beta (b = 0.37, SE = 0.01) corresponds to an r_0/2 = 0.60. (B) Similar plot as in panel A, for mixed-effects model with both random intercepts and random slopes. Mean r_0/2 = 0.49875, indicating a halving of transcript abundance, on average, in homozygous NMD targets.
Figure 42: (A) Distribution of RVIS percentiles for all human genes having an RVIS score. (B) RVIS percentiles for randomly selected transcripts with similar coding length to the transcripts plotted in Figure 15B (Wilcoxon rank-sum comparison to all genes in panel A: W = 138110000, P = 0.72). (C) RVIS percentiles for randomly selected transcripts with similar total length to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 137890000, P = 0.75). (D) RVIS percentiles for randomly selected transcripts with matching numbers of exons to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 138200000, P = 0.68). (E) RVIS percentiles for randomly selected transcripts with similar G+C% to the transcripts plotted in Figure 15B (comparison to all genes in panel A: W = 128960000, P = 0.99)...... 213
Figure 43: (A) Distribution of RVIS percentiles for homozygous loss-of-function genes in 1000 Genomes Project samples as predicted by ACE (Wilcoxon rank-sum comparison to RVIS percentiles for all genes in Figure 42A: W = 7378700, P < 2.2×10-16). (B) RVIS percentiles for heterozygous LOF genes (compared to RVIS percentiles for all genes: W = 53451000, P < 2.2×10-16). (C) Distribution of ncRVIS percentiles for homozygous LOF genes (compared to ncRVIS percentiles for all genes: W = 6032200, P = 5.5×10-11). (D) ncRVIS percentiles for heterozygous LOF genes (compared to ncRVIS for all genes: W = 48724000, P < 2.2×10-16)...... 214
Figure 44: (A) Distribution of numbers of isoforms per gene for all of GENCODE version 19. (B) Distribution of numbers of isoforms per gene for the N = 67 genes depicted in Figure 43B for which RVIS percentile < .20...... 215
Figure 45: (A) Variant rs11278302 in Ensembl gene ENSG00000174177 results in deletion of an entire splice site; however, the resulting sequence retains a valid donor splice site consensus and the flanking sequence scores above threshold under a positional weight matrix; furthermore, more than 30 spliced reads are assigned by TopHat 2 to this splice site in each allele of a homozygous individual, indicating that splicing is retained at this site under the alternate allele. Ensembl VEP classifies the variant as having high impact, due to the apparent loss of a splice site. (B) Variant rs67712719 in Ensembl gene
ENSG00000179588 introduces a frameshift which rs67322929 corrects, resulting in only two amino acid changes. These variants highly co-occur in 1000 Genomes Project individuals (97% of all 5008 haplotypes). However, Ensembl VEP classifies each individually as having high impact, as individually either would result in a frameshift...... 216
Figure 46: (A) Reference (hg19) and alternate alleles for Ensembl gene ENSG00000179588 in 1000 Genomes Project sample HG00096, haplotype 2, in the region of variants rs67712719, rs67322929, and rs67873604. (B) Interpreting variant rs67712719 alone would lead to a conclusion that the variant results in a frameshift and a large number of amino acid changes and protein truncation. (C) Similarly, interpretation of variant rs67322929 alone would lead to frameshift and a large change to the encoded protein. (D) Joint interpretation of all variants together reveals that changes are limited to four amino acids, two of which are deleted and two of which undergo substitution...... 217
Figure 47: Distribution of lengths (in amino acids) of affected intervals between compensatory frameshifts in 1000 Genomes Project samples...... 218
Figure 48: Distribution of simulated frameshift lengths in GENCODE protein-coding genes, assuming one frameshift per gene and uniform locations within coding segments; outliers above 3528nt (top 1%) were omitted for display purposes only. Full data set: median=260, N=162716...... 219
Figure 49: (A) Results of running an HMM gene finder on 19,000 broken genes. The gene finder was run on each gene, then a stop codon was inserted in a random location in the CDS without creating a splice site, and the gene finder was run on the modified sequence. In 11% of cases the gene finder predicted the same splice pattern on both the original sequence and the sequence modified to contain a premature stop. In 9% of cases, the gene finder predicted that no gene was present after the stop was inserted. In the remaining 80% of cases, the gene finder predicted a different splice pattern after the stop codon was inserted. (B) Relative position of inserted stop codon (relative to the spliced transcript) in cases in which the gene finder predicted the same splice pattern. There was a strong enrichment for stop codons inserted near the end of the coding segment, in which only the terminal portion of the protein would be affected, as well as a weaker enrichment near the beginning of the coding segment, in which the gene finder was able to find another start codon in the same reading frame that avoided the inserted stop codon. (C) Relative position of inserted stop codon in cases in which both the splice pattern did not change and the start codon was not changed...... 220
Figure 50: Reproducibility of logistic regression training for content sensors. (A) Hexamer weights trained on 10,000 exon-intron pairs, versus weights for the same hexamers from an independent logistic regression applied to the same training cases. (B) Hexamer weights estimated from 20,000 training cases (x-axis) and weights for the same hexamers estimated from a subset of 10,000 training cases (y-axis)...... 221
Figure 51: (A) Effect of scaling factor r_content/signal on AUC for SGRF applied to Thousand Genomes individual HG00096, using the human logistic content sensor. (B) AUC versus scaling factor for the SGRF using the minigene content sensor...... 221
Figure 52: Predictive accuracy of logistic signal sensors versus positional weight matrices (PWMs). (A) AUC for classification of annotated donor splice sites versus decoy sites, using logistic signal sensor (red) and PWM (blue). (B) AUC for classification of annotated acceptor splice sites versus decoy sites, using logistic signal sensor (red) and PWM (blue)...... 222
Figure 53: Number of Gs in 4096 hexamers (y-axis) as a function of logistic weights; points are sorted along the x-axis by weight...... 223
Figure 54: Example of a de novo splice site that appears to result in greater splicing activity than at the annotated splice site. Top: haplotype 1 of Thousand Genomes individual HG00118 shows evidence of splicing only at the annotated splice site. Bottom: variant rs202069778 in haplotype 2 of the same individual creates a new acceptor splice site that retains the original reading frame in the MAP4K1 gene, resulting in 8 amino acids being excluded from the encoded protein; TopHat2 aligns more spliced reads to this site than to the annotated site in this haplotype. This variant has a global MAF of 0.0002 in Thousand Genomes phase 3 samples, indicating it is possibly deleterious...... 224
Acknowledgements
I wish to gratefully acknowledge Mark Yandell, Steven Salzberg, and Uwe Ohler for providing extensive and highly valuable mentorship during the years leading up to my graduate work at Duke, and my advisor Tim Reddy for his copious and expert guidance during my graduate studies. I also wish to thank the following individuals for their help and influence as collaborators and colleagues before or during my graduate work: Steve Finch, Mark Turner, Ian Korf, Mihaela Pertea, Brian Haas, Carson Holt, Michael Campbell, Karen Eilbeck, Song Li, Chris Vockley, Ian McDowell, Tony D’Ippolito, Graham Johnson, Sarah Cunningham, Nicky Lekprasert, Neel Mukherjee, Sayan Mukherjee, David Corcoran, Molly Megraw, Iulian Pruteanu-Malinici, Gurkan Yardimci, Justin Guinney, Stoyan Georgiev, Ayal Gussow, Dan Mace, Brad Moore, Elizabeth Rach, Andrew Allen, Raluca Gordân, Jonathan Allen, Jonathan Eisen, Mihai Pop, Mark DeLong, Greg Lamonte, Ashley Chi, Jason Stajich, Fred Dietrich, Dave MacAlpine, Jennifer Wortman, Art Delcher, Mani Subramanian, Tom Heiman, Gerry Perham, and Peter Li.
Chapter 1 – Outline
This dissertation introduces novel computational methods for the interpretation of genetic variants in the DNA encoding genes and the elements that regulate genes. As DNA sequencing costs continue to drop, the utility of sequencing for medical diagnosis and development of therapeutics will increase, but only if we are able to make sense of that sequencing data via appropriate analysis methods. The work described here aims to improve our ability to make use of sequencing technologies in medicine and the biological sciences, through computational means.
Chapter 2 reviews the relevant biological background on many aspects of genes and gene regulation, and introduces several classes of computational models that the remaining chapters will build upon.
Chapter 3 describes a method and software implementation for interpreting genetic variants that may interrupt gene structures. The method was applied to thousands of human genomes as well as several plant varieties. Results underscore the importance of interpreting genetic variants in combination, within their native genomic context.
Chapter 4 extends the method described in Chapter 3 by introducing a probabilistic model that enables the scoring and ranking of gene structures that may result from changes to genomic sequence. The model illustrates a novel direction in
gene-structure modeling in that it relies primarily on non-coding features, and it relaxes the traditional assumption that all copies of genes are well-formed and fully functional.
Chapter 5 describes a new experimental assay for determining allele-specific effects on gene regulation, and computational methods for interpreting the results of that assay. These methods have applicability in the functional fine-mapping of genetic variants previously found to be in association with a disease or other phenotype.
Chapter 6 describes a pattern recognition method for identifying drug-responsive enhancers based on their epigenetic signatures at single-nucleotide resolution. Analyses of predictions made genome-wide using this model confirm known biology of the glucocorticoid receptor, GR. The model has potential applications in the mechanistic interpretation of gene regulatory elements and the genetic variants occurring within them.
Chapter 7 summarizes the work and discusses possible future directions.
Chapter 2 – Background
Heritable traits are encoded chemically in the genome—a collection of chromosomes consisting of linear sequences of the nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T). Germline modifications to this four-letter DNA code can result in changes to phenotypes that are passed on to descendants, and thus have the potential to contribute to heritable diseases or to the evolution of new species. Such modifications are termed genetic variants, and can be discovered by comparing genome sequences between individuals in a population.
Many genetic variants have been found via genome-wide association studies (GWAS) to be associated with diseases or other phenotypes of interest. However, due to the phenomenon of linkage disequilibrium, which arises from the rarity of genetic recombination between variants residing near each other in the genome, combined with small population sizes relative to that recombination rate, variants that may be causal for a phenotype may be in strong association with other nearby variants that do not contribute causally to phenotypic outcomes. As such, genetic associations are confounded by non-causal variants that contribute to association signals. Fine-mapping of variants is the process of winnowing down the set of associated variants to those that are most likely to be causal (reviewed in Spain and Barrett, 2015). In the case of disease, knowing which variants are causal is enabling, as the variants can be used as diagnostic markers and potentially targeted by therapeutics (Vockley et al., 2017).
A popular approach to fine-mapping is to annotate functional genomic elements and then to interpret variants by predicting whether they are likely to disrupt the function of those elements. Another approach is to test each variant experimentally, as to its mechanistic effect on some aspect of molecular biology. The results of both of these approaches together inform the process of variant prioritization, in which variants deemed most likely to have a functional effect on cell biology are selected for follow-up validation in the context of the specific disease or organismal phenotype under study.
This dissertation deals primarily with the interpretation of genetic variants occurring within the contexts of two types of genomic elements: genes and transcriptional enhancers. In sections 2.1 and 2.2 I review what is currently known concerning the biology of these elements, experimental methods for assaying these elements structurally and functionally, the potential for disruption of these elements to affect phenotypes, and current methods for modeling these elements computationally.
The discussion will be specific to eukaryotes, which include all forms of life other than bacteria and archaea.
2.1 Gene structures
2.1.1 Gene structure and its impact on the interpretation of genetic variants
For the purposes of this work I define eukaryotic gene structure to be the combination of a splicing pattern for a gene together with (in the case of protein-coding genes) a translation reading frame, as described in the sections below.
2.1.1.1 Transcription and splicing
DNA influences organismal phenotypes via the expression of genes encoded in the DNA. Gene expression begins with the production of RNA molecules copied from a DNA template, a process termed transcription. Transcription in eukaryotes is carried out by RNA polymerase II (Pol II), a macromolecular complex consisting of many proteins.
The polymerase assembles at the core promoter of a gene, located at the gene’s 5’ end, in preparation for transcription, which proceeds in the 5’-to-3’ direction on the sense strand. Actual transcription begins at the transcription start site (TSS), which is located a short distance 3’ of the core promoter. Transcription proceeds through the gene to the transcription end site (TES), at which point the polymerase dissociates from the DNA and is recycled for further transcription events. The result of transcription is an RNA molecule, called a transcript, that is complementary to the DNA template. The RNA transcript is built up progressively by appending ribonucleotides, one at a time, as dictated by the DNA template and base complementarity.
As the nascent RNA strand is produced and emerges from the polymerase, various enzymes associated with the C-terminal domain (CTD) of the polymerase process the emerging transcript. These modifications to the transcript are generally co-transcriptional, in that processing of early (more 5’) portions of the transcript occurs while the polymerase is still transcribing the further (more 3’) portions of the gene. The most upstream co-transcriptional processing is the 5’ capping of the RNA,
which utilizes a unique guanine-guanine 5’-to-5’ bond at the 5’-most end of the transcript, to protect that end of the transcript from exonuclease activity that would quickly degrade the transcript. The 3’ end of the transcript is eventually polyadenylated by cleaving the transcript at a polyadenylation signal (typically AATAAA or ATTAAA in humans) near the TES and then appending several hundred As. Polyadenylation assists in the nuclear export of transcripts destined for the cytoplasm (where translation takes place; see section 2.1.1.2). The other major co-transcriptional activity is splicing.
In eukaryotes, RNA transcribed from genes is co-transcriptionally spliced to remove introns and retain exons (Figure 1). Whereas the unspliced sequence of a transcript is called the primary transcript or pre-mRNA, we call the spliced version the mature transcript. The mature transcript thus consists of a series of exons ligated together. Note, however, that the transcripts of a single gene can be spliced in different ways in different cell types or different conditions, resulting in alternative splicing.
Different splice forms of a gene can also co-exist within the same cell type or condition, as splicing decisions are often stochastic (Pickrell et al., 2010; Stepankiw et al., 2015).
Thus, nucleotides that are destined to be exonic in one primary transcript may be intronic in another transcript from the same gene.
Figure 1: Introns are spliced out of transcripts, leaving only the exons (Reproduced with permission from Majoros, 2007b).
An intron begins with a donor splice site and ends with an acceptor splice site (Figure 2A). In humans, most donor splice sites consist of the dinucleotide GT and most acceptor splice sites consist of the dinucleotide AG. A small minority (roughly 1%) of human donor sites have a GC or AT consensus, and a small minority (roughly 1%) of human acceptor sites have an AC consensus. Thus, most human introns begin with GT and end with AG. Note that the GT and AG are part of the intron, and are thus removed with the rest of the intron during splicing.
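The GT...AG rule can be illustrated with a minimal Python sketch. The toy sequence, the function names, and the 0-based coordinates are assumptions made for illustration only; they are not taken from any software described in this dissertation. The sketch lists every GT and AG dinucleotide as a candidate site and removes one intron, including its GT and AG.

```python
def candidate_splice_sites(seq):
    """Return 0-based positions of every GT (candidate donor) and AG (candidate acceptor)."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def splice_out(seq, donor, acceptor):
    """Remove the intron that starts at the donor GT and ends with the acceptor AG."""
    if seq[donor:donor + 2] != "GT" or seq[acceptor:acceptor + 2] != "AG":
        raise ValueError("intron must begin with GT and end with AG")
    if acceptor < donor + 2:
        raise ValueError("acceptor must lie 3' of the donor")
    # The GT and AG belong to the intron, so both are removed with it.
    return seq[:donor] + seq[acceptor + 2:]


# Hypothetical toy pre-mRNA: exon "ATGGCC", intron "GTAAGTCTTTTTCAG", exon "GGATAA".
pre_mrna = "ATGGCC" + "GTAAGTCTTTTTCAG" + "GGATAA"
donors, acceptors = candidate_splice_sites(pre_mrna)
print(donors, acceptors)
print(splice_out(pre_mrna, 6, 19))
```

Even in this short sequence there are more candidate dinucleotides than true splice sites, which is why the consensus alone cannot identify functional sites.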
The donor splice site is recognized by the U1 small nuclear ribonucleoprotein (U1 snRNP) and the acceptor splice site is recognized by the U2 auxiliary factor (U2AF).
These factors recognize not only the dinucleotide consensus portions (i.e., GT for donor sites, AG for acceptor sites) of splice sites, but also have preferences for particular nucleotides flanking the consensus (Figure 2B). As such, the local sequence context
around a donor splice site influences the strength of that splice site, meaning that a GT with a favorable sequence context is more likely to be recognized by the splicing machinery than a GT with an unfavorable context. These preferred sequence contexts are somewhat organism-specific, so that models of splice sites need to be trained specifically for each species (Korf, 2004). Acceptor splice sites are typically preceded by a polypyrimidine tract that is enriched for pyrimidines (Cs and Ts), and a short distance 5’ of this is a location known as the branch point. The branch point in humans is typically an A, and is recognized and bound by the branch-binding protein (BBP).
Figure 2: (A) An intron and the major trans-factors involved in its recognition. (B) Nucleotide preferences of positions flanking human donor and acceptor splice sites; letter height is proportional to frequency.
Once a donor site is recognized and an acceptor site is recognized 3’ of that donor site on a transcript, the intervening intronic sequence is removed via a multistep process. First, the U1 snRNP is joined and eventually replaced on the transcript by a complex consisting of other snRNPs (snRNPs U4, U5, and U6), and the U2AF and BBP are replaced on the transcript by the U2 snRNP. The U2 and U5/U6 snRNPs interact,
bringing the ends of the intron into mutual proximity, in preparation for the splicing reaction. These ribonucleoproteins form the spliceosome. At least some of the components just described are believed to be associated with the CTD of Pol II so that they are readily available to recognize their respective RNA target sequences immediately as those sequences emerge from the polymerase. Once the complete spliceosome is in place, the branch point nucleotide forms a bond with the first nucleotide of the donor site. This bond results in a three-way junction called a lariat (Figure 1). Next, the final nucleotide of the preceding exon (that was immediately 5’ of the donor site) joins the first nucleotide of the following exon. This results in the two exons being ligated into a single, continuous RNA strand. The intron, now in the form of a lariat structure, is released for degradation.
As noted above, the identities of the bases flanking the consensus dinucleotides for donor and acceptor splice sites provide for additional sequence specificity.
However, it has been shown that these preferences alone cannot explain splice-site selection by the spliceosome in humans and plants, as the proximal sequences at splice sites have insufficient information content to permit reliable discrimination of real sites from decoy sites (Lim and Burge, 2001). The information in a sequence model can be defined in terms of Shannon entropy (Shannon, 1948). The higher the entropy (uncertainty) at a position in a motif, the lower the information provided by that position in the motif. Splice site motifs generally have high information only in the 2 bp
consensus sequences. While flanking positions often have weak preferences for individual nucleotides, they tend to have relatively high entropy, and therefore low information content. As a result, many sites in the genome match a splice site motif model despite not being functional splice sites.
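The relationship between entropy and information can be made concrete with a short Python sketch. It assumes the common convention that the information at a motif position is the maximum possible entropy (2 bits for DNA) minus the observed Shannon entropy, with no small-sample correction. The four aligned toy sites and the function name are hypothetical, not data from this work.

```python
import math
from collections import Counter


def position_information(sites, alphabet="ACGT"):
    """Per-position information (bits) for equal-length aligned sites.

    Assumed convention: information = log2(alphabet size) - Shannon entropy.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    max_entropy = math.log2(len(alphabet))
    info = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        info.append(max_entropy - entropy)
    return info


# Hypothetical toy donor sites: one flanking base, the GT consensus, one flanking base.
toy_sites = ["AGTA", "CGTC", "GGTG", "TGTT"]
info = position_information(toy_sites)
print(info)
```

In this toy alignment the two consensus positions carry the full 2 bits each, while the uniformly distributed flanking positions carry none, mirroring the pattern described above in an exaggerated form.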
It is now known that additional signals are present in transcript sequences that influence splice site selection. These are believed to be composed primarily of binding sites for RNA-binding proteins that regulate splicing. Heterogeneous nuclear ribonucleoproteins (hnRNPs) bind to both introns and exons, though they are believed to bind preferentially to intronic sequences, where they repress splicing at cryptic splice sites (Zarnack et al., 2013). There is also evidence that they may compact introns (Choi et al., 1986; Dreyfuss et al., 1993), and it is conjectured that this renders the splicing reaction more thermodynamically favorable for long introns. SR proteins (so named for their serine- and arginine-rich domains, or RS domains) also bind to both exons and introns, though they are widely believed to bind preferentially to exons. It has been demonstrated that they interact with spliceosomal components U1 and U2AF via their RS domains (Wu and Maniatis, 1993). As such, they are believed to actively recruit components of the spliceosome to the vicinity of splice sites. Some members of the SR protein family are also involved in export of mature transcripts from the nucleus (Shepard and Hertel, 2009).
Approximately 95% of human genes contain introns, and 95% of intron-bearing human genes can be spliced in multiple, distinct ways, resulting in alternative splicing (Pan et al., 2008). There are several forms of alternative splicing (Figure 3), including the use of alternative donor or acceptor splice sites, exon skipping, and intron retention. These are believed to be regulated by hnRNPs and SR proteins, via the differential binding of these splicing regulatory factors (SRFs) at specific locations in a primary transcript. As some hnRNPs and SR proteins are expressed in a cell-type specific manner, the presence or absence of these factors in cells is believed to enable cell-type specific splicing.
Figure 3: Major forms of alternative splicing. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Genetics (Luca Cartegni, Shern L. Chew, Adrian R. Krainer, Listening to silence and understanding nonsense: exonic mutations that affect splicing, copyright 2002).
The sequences recognized by SRFs are referred to as splicing regulatory elements (SREs), and have widely been represented using hexamers (Stadler et al., 2006; Ke et al., 2011b; Erkelenz et al., 2014; Rosenberg et al., 2015) or, less commonly, as octamers (Zhang and Chasin, 2004). SREs consist of four types: exonic splicing enhancers (ESEs) and exonic splicing silencers (ESSs) that occur within exons, and intronic splicing enhancers (ISEs) and intronic splicing silencers (ISSs) that occur within introns. Splicing enhancers are believed to promote inclusion of sequences into mature transcripts. Splicing silencers are believed to promote inclusion of sequences into introns, so that they are spliced out. In particular, splicing silencers are believed to promote exon skipping when they occur within sequences that in other conditions or other cell types are normally interpreted as exons to be included in the mature transcript. While it has traditionally been assumed that SR proteins promote exon inclusion in the mature transcript and that hnRNPs promote exon skipping, that view has been challenged, and evidence is mounting that splicing decisions are influenced by a complex combination of factors (reviewed in Pandit et al., 2013). Recent computational models of cell-type specific splicing have thus used thousands of sequence features and sophisticated machine-learning methods (Xiong et al., 2015), while other recent work has emphasized the use of simpler methods based on additive hexamer models (Rosenberg et al., 2015).
It is important to note that, with few exceptions (Buckley et al., 2014), all spliceosomal RNA splicing in mammals occurs inside the nucleus. Translation (section
2.1.1.2), in contrast, occurs in the cytoplasm, after the transcript has been fully spliced and exported from the nucleus. As such, splicing decisions made by the eukaryotic nuclear spliceosome are complete before translation begins. Direct molecular interactions between the eukaryotic spliceosome and the cytoplasmic ribosome that performs translation have not been convincingly demonstrated, and such interactions are highly unlikely during normal operation of the cell, given that they are separated by the nuclear membrane.
2.1.1.2 Translation
Once a gene has been transcribed into an RNA transcript and the transcript has been capped, spliced, and polyadenylated, it can be exported from the nucleus for translation into a protein. Transcripts that function in ways that do not involve translation into a protein are termed noncoding transcripts. Long noncoding transcripts are denoted lncRNAs (long noncoding RNAs). A number of shorter noncoding RNAs are also known, such as transfer RNAs (tRNAs), and microRNAs (miRNAs) that participate in post-transcriptional regulation of other RNAs via sequence-specific binding. Transcripts that are destined for translation into protein are termed messenger RNAs (mRNAs).
Translation of mRNAs into protein is performed by the ribosome, a ribonucleoprotein complex that forms and persists stably in the cytoplasm. The ribosome initially associates with the 5’ cap of an mRNA and then travels 5’-to-3’ along the mRNA,
synthesizing a protein as it goes. Just as the DNA template indicates precisely which ribonucleotides are to be appended to the growing RNA transcript during transcription, the mRNA dictates which amino acids are to be appended to the growing polypeptide (protein). The amino acids are encoded by non-overlapping nucleotide triples, called codons. This code is read out by tRNAs that are loaded into the ribosome to read the mRNA. When a stop codon is reached (TGA, TAA, or TAG in humans), translation is terminated and the polypeptide is released. Functional polypeptides will then fold into a protein structure. Aberrant proteins will typically fail to properly fold, and will be degraded by a proteasome structure.
While a ribosome initially associates with the 5’ cap of an mRNA, translation does not begin at that location on the transcript. Ribosomes are believed to scan the mRNA, 5’-to-3’, searching for an acceptable start codon at which to begin translating.
The vast majority of annotated human start codons are ATG, which codes for the amino acid methionine. However, not all ATGs are recognized as start codons. Just as flanking nucleotides influence splice site selection by the spliceosome (section 2.1.1.1), nucleotides flanking an ATG can influence whether (or how often) the ribosome begins translating at that codon. In humans and other eukaryotes, the accompanying sequence that tags functional start codons is known as a Kozak sequence (Kozak, 1990).
The ribosome scanning model (Cigan et al., 1988) thus proposes that the ribosome begins scanning at the 5’ cap of the mRNA, searching for the first (most 5’) ATG that is
accompanied by a strong Kozak sequence, and begins translating at that codon. Use of this model has been shown to improve bioinformatic identification of start codons in mRNA sequences (Agarwal and Bafna, 1998). The selected start codon dictates the translation reading frame. Because translation occurs via non-overlapping codons in sequence, there are three possible reading frames on the sense strand of any given transcript. That is, for any given nucleotide in a coding segment (an interval that is translated by the ribosome), that nucleotide may be in the first position of a codon (called phase 0), the second position of a codon (phase 1), or the third position of a codon (phase 2). The start codon defines the reading frame, since the first (5’-most) nucleotide of the start codon is by definition in phase 0, and the phase of any subsequent (more 3’) nucleotide is determined via mod 3 arithmetic with respect to the start codon (Figure 4). Translating an RNA sequence in different reading frames can produce radically different amino acid sequences. For many human genes, reading frames other than the ones used by the ribosome frequently terminate in a premature stop codon, i.e., a stop codon that is 5’ of the normal stop. Note, however, that only a stop codon encountered in frame (such that the first nucleotide of the stop codon is in phase 0) will terminate the reading frame. An open reading frame, or ORF, is thus defined as a contiguous interval on a chromosome that does not contain an in-frame stop codon, except as the last codon.
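The mod 3 arithmetic and the in-frame requirement for stop codons can be sketched in a few lines of Python. The toy transcript, the function names, and the 0-based coordinates on a spliced transcript are illustrative assumptions. The sequence is written in the DNA alphabet, as elsewhere in this chapter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def phase(position, start_codon_position):
    """Phase (0, 1, or 2) of a nucleotide at or 3' of the start codon."""
    if position < start_codon_position:
        raise ValueError("position must not be 5' of the start codon")
    return (position - start_codon_position) % 3


def first_in_frame_stop(transcript, start_codon_position):
    """Position of the first stop codon whose first nucleotide is in phase 0, or None."""
    for i in range(start_codon_position, len(transcript) - 2, 3):
        if transcript[i:i + 3] in STOP_CODONS:
            return i
    return None


# Hypothetical spliced transcript; the start codon ATG begins at position 2.
transcript = "CCATGAAATGACCTAA"
start = 2
print(phase(3, start))                          # the T of ATG
print(first_in_frame_stop(transcript, start))
```

Here the TGA that overlaps the start codon (beginning at position 3) is in phase 1 and is therefore ignored, whereas the TGA beginning at position 8 is in phase 0 and terminates the reading frame.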
Figure 4: Three-periodic phase in coding segments. (Reproduced with permission from Majoros, 2007b).
Splicing and translation are separate processes, but the results of splicing can influence which parts of the mRNA are translated. Furthermore, splicing and translation together impose a complex structure on the mRNA in which multiple signals—splicing signals (splice sites and splicing regulatory elements) and coding signals (codons)—coexist in the same sequence (Figure 5). Because introns will have already been removed by the time the mRNA is translated, start and stop codons in the genome can be interrupted by introns.
Figure 5: Gene structure combines a splicing pattern with a coding segment. (Reproduced with permission from Majoros, 2007b)
2.1.1.3 Assaying the results of splicing and translation
Gene structures can be assayed experimentally. At the time of this writing, splicing patterns are commonly assayed via RNA sequencing, or RNA-seq. In the first step of RNA-seq, RNA molecules are extracted from cells and used as templates to form a complementary DNA strand, called a cDNA. The RNA and DNA are then denatured, the RNA is degraded, and the cDNA is made double-stranded by adding primers, DNA ligase, and nucleotides. Once double-stranded DNA is fully formed, it can be sequenced using standard sequencing methods. From the resulting DNA sequences we can infer the sequences of the original RNA molecules by substituting U for T.
Once the RNA sequences are available, they can be mapped to the genes that produced them, and precise alignments can be deduced that indicate, for every nucleotide in an RNA sequence, which DNA nucleotide in the genome served as a template for that RNA residue. Alignments also indicate the locations of indels (insertions and deletions, also called gaps). Mapping and alignment of large numbers of short reads to a genome is popularly done via aligners based on the Burrows-Wheeler transform (Langmead et al., 2009; Li and Durbin, 2009), or via the use of suffix arrays (Dobin et al., 2013). Special methods are needed to map spliced reads (those spanning an intron) to a reference genome, since such reads omit the intervening intronic sequence that is present in the reference sequence. A popular solution is to identify candidate splice junctions in the reference sequence, construct synthetic spliced sequences representing those junctions, and then align reads to that library using standard alignment methods (Trapnell et al., 2009).
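The junction-library idea can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any cited aligner: exact substring matching stands in for a real alignment method, and the function names, the toy reference, and the half-open 0-based intron intervals are all assumptions.

```python
def junction_library(reference, introns, flank):
    """Build synthetic spliced sequences for candidate introns.

    introns: list of (start, end) 0-based half-open intervals on the reference.
    Returns {(start, end): (sequence, offset)}, where offset marks the junction.
    """
    library = {}
    for start, end in introns:
        left = reference[max(0, start - flank):start]    # end of upstream exon
        right = reference[end:end + flank]               # start of downstream exon
        library[(start, end)] = (left + right, len(left))
    return library


def spans_junction(read, junction_seq, offset):
    """True if some exact occurrence of the read crosses the junction offset."""
    i = junction_seq.find(read)
    while i != -1:
        if i < offset < i + len(read):
            return True
        i = junction_seq.find(read, i + 1)
    return False


# Hypothetical reference with one candidate intron at [6, 21).
reference = "ATGGCCGTAAGTCTTTTTCAGGGATAA"
library = junction_library(reference, [(6, 21)], flank=4)
seq, offset = library[(6, 21)]
print(seq, offset)
print(spans_junction("CCGG", seq, offset))   # crosses the exon-exon junction
print(spans_junction("GGCC", seq, offset))   # lies entirely within the upstream exon
```

Only reads that cross the junction offset count as evidence for the intron; a read contained in a single flank would align equally well to the unspliced reference.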
Inferring gene structures from sequencing data can, in theory, be done by simply aligning the sequences of spliced transcripts to a reference genome and identifying the gaps in the alignment corresponding to introns. In practice, this approach is made more complicated by technical hurdles. Currently, RNA-seq is most commonly performed
using short-read sequencing. Because read lengths are currently shorter than most eukaryotic transcripts, RNA-seq outputs do not unambiguously indicate whole gene structures. However, by combining multiple reads one can infer gene structures. This process is known as transcript assembly.
Transcript assembly can be performed either de novo, without knowledge of a reference genome, or it can be reference-based, in which a reference genome sequence is utilized. Reference-based assembly is typically more accurate than de novo assembly, which is a harder problem due to the lack of a reference genome (Hayer et al., 2015).
Reference-based transcript assembly typically utilizes a splice graph. Formally, a graph G = (V, E) is a set of vertices V and a set of edges E = {(v, w)}, each of which is a pair of vertices in V. In a splice graph, vertices represent either individual splice sites or whole exons, depending on the type of splice graph being used.
Figure 6: One type of splice graph. Vertices represent splice sites, and edges represent introns and exons.
In one common type of splice graph, vertices represent putative splice sites and edges represent exons and introns (Figure 6). Vertices can be assigned coordinates relative to the reference genome, and edges can be assigned genomic intervals. A single path through a splice graph for a gene thus outlines precisely a splice pattern for
transcripts produced from that gene, and indicates coordinates of putative exons, introns, and splice sites. Spliced reads aligned to the reference genome can also be mapped to the splice graph. Thus, reads mapping to an exon edge serve as support for that edge, and spliced reads that exactly map to splice junctions serve as support for the corresponding intron edges in the splice graph. Algorithms have been devised that utilize read counts to score whole paths through a splice graph, in order to identify whole gene structures having the most experimental support (e.g., Trapnell et al., 2010;
Bernard et al., 2014; Pertea et al., 2015). In this way, short reads are used to make inferences about whole-gene structures, despite the fact that no single read spans the entire gene.
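The idea of scoring whole paths through a splice graph can be illustrated with a minimal sketch. It assumes that vertices are genomic coordinates of putative splice sites (so that sorting by coordinate gives a topological order), that each edge carries a single additive score such as a read count, and that the best path is simply the one with the largest total. The function name `best_path` and the toy edge list are hypothetical; the published assemblers cited above use more elaborate scoring and optimization schemes than this.

```python
from collections import defaultdict

def best_path(edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path.

    edges: iterable of (from_vertex, to_vertex, score). Vertices are genomic
    coordinates, and every edge must run left to right, so processing the
    vertices in sorted order is a valid topological order for the graph.
    Returns None if the sink cannot be reached from the source.
    """
    outgoing = defaultdict(list)
    vertices = {source, sink}
    for u, v, score in edges:
        if u >= v:
            raise ValueError("edges must run 5' to 3' on the reference")
        outgoing[u].append((v, score))
        vertices.update((u, v))

    best = {source: (0.0, None)}  # vertex -> (best score so far, predecessor)
    for u in sorted(vertices):
        if u not in best:
            continue  # not reachable from the source
        for v, score in outgoing[u]:
            candidate = best[u][0] + score
            if v not in best or candidate > best[v][0]:
                best[v] = (candidate, u)

    if sink not in best:
        return None
    path, v = [], sink
    while v is not None:  # follow predecessor links back to the source
        path.append(v)
        v = best[v][1]
    return best[sink][0], path[::-1]

# Hypothetical toy gene: scores stand in for supporting read counts.
TOY_EDGES = [
    (100, 200, 30.0),  # exon 1
    (200, 300, 12.0),  # intron supported by 12 spliced reads
    (200, 350, 3.0),   # alternative intron supported by 3 spliced reads
    (300, 400, 25.0),  # exon 2, long form
    (350, 400, 20.0),  # exon 2, short form
]
print(best_path(TOY_EDGES, 100, 400))  # (67.0, [100, 200, 300, 400])
```

Because every edge runs in the 5' to 3' direction, a single left-to-right pass suffices, and the cost is linear in the number of edges after sorting the vertices.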
In previous work I developed a novel transcript assembly method, called RSVP
(Majoros et al., 2014), which utilizes a modified splice graph called an ORF graph (Figure
7). In an ORF graph, the underlying splice graph is augmented with information regarding open reading frames. In particular, each edge and each vertex are annotated with scores. A traditional generalized hidden Markov model (GHMM) (section 2.1.2.2) is used to assign initial values to these scores (Majoros et al., 2004). RSVP then re-weights the graph by forming a linear combination between the initial GHMM-based score and a function of RNA-seq read support for each feature.
Figure 7: (A) A specialized type of splice graph, called an ORF graph. Vertices represent splice sites and start/stop codons. Edges represent introns and exons. Edges and vertices are annotated with scores (not shown) computed by a gene prediction model. Each path corresponds to a gene prediction. The graph can be re-weighted using external evidence such as RNA-seq data. The highest-scoring path is shown in red. Phase information constrains the prediction, as coding segments are assumed to be conserved and well-formed. (B) An unconstrained ORF graph for an Arabidopsis thaliana gene. (C) The same ORF graph as in panel B, after filtering out intron edges not supported by spliced RNA-seq reads.
The highest-scoring path through the re-weighted graph then represents a gene structure prediction that integrates experimental evidence regarding splicing patterns
(RNA-seq) with information about known codon biases in the organism under study. A
strength of the method is that, whereas other approaches tend to erroneously predict introns inside exons whenever read depth drops to zero (a common problem for lowly expressed genes), RSVP can utilize the genomic sequence and known codon biases to score regions of exons for which no reads were captured. A shortcoming of this method is that it is specific to protein-coding genes. Another shortcoming is that, like all standard gene-structure predictors, it assumes that reading frames are preserved, an assumption that fails in the case of genetic variants that alter reading frames. In
Chapters 3 and 4 I introduce novel methods for gene structure prediction that relax this assumption.
2.1.1.4 Interpreting genetic variants within the context of a fixed gene structure
Given a gene with a known set of transcript structures, genetic variants within the gene can be interpreted as to their possible effects on the function of the gene.
Traditionally, the greatest emphasis has been placed on those variants that occur within the coding segments of protein-coding genes. A variant may alter a codon so as to encode a different amino acid. While some changes to amino acids can alter protein structure and/or function, others have little detectable effect. Those amino acids situated within a functional protein domain may be more likely to alter function when they are changed than those not in a functional domain. Because the genetic code is degenerate, many codons can be changed to another codon encoding the same amino acid. Such a change is termed a synonymous substitution. Synonymous codons often differ by a single
nucleotide, typically in the third codon position. Other variants that do change the identity of the encoded amino acid may constitute conservative substitutions if the two amino acids have similar physicochemical properties. Such properties are used by popular variant interpretation tools (e.g., Adzhubei et al., 2010). Importantly, it is necessary to know the gene structure of a gene in order to identify the reading frame and thus the codon position or positions affected by a variant. Only then can one predict the effect on the encoded amino acid sequence.
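The dependence of variant interpretation on the reading frame can be made concrete with a short sketch. It assumes the standard genetic code, a coding sequence that begins at the first base of the start codon (so that the frame is known), and a single-nucleotide substitution. The function name `classify_substitution` and the toy sequence are hypothetical and do not correspond to any published tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(cds, pos, alt):
    """Classify a substitution at 0-based offset `pos` of an in-frame CDS.

    Returns (kind, reference amino acid, variant amino acid), where kind is
    one of "synonymous", "missense", "nonsense", or "stop-loss".
    """
    phase = pos % 3                      # position within the codon (0, 1, 2)
    codon_start = pos - phase
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete final codon")
    alt_codon = ref_codon[:phase] + alt + ref_codon[phase + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    elif ref_aa == "*":
        kind = "stop-loss"
    else:
        kind = "missense"
    return kind, ref_aa, alt_aa

cds = "ATGGCTTGGTAA"  # toy CDS encoding Met-Ala-Trp-stop
print(classify_substitution(cds, 5, "C"))  # GCT -> GCC: ('synonymous', 'A', 'A')
print(classify_substitution(cds, 3, "A"))  # GCT -> ACT: ('missense', 'A', 'T')
print(classify_substitution(cds, 8, "A"))  # TGG -> TGA: ('nonsense', 'W', '*')
```

If the assumed gene structure, and hence the frame, were different, the same genomic substitution would fall in a different codon position and could receive a different classification.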
Another commonly used indicator of whether an amino acid substitution is likely to be deleterious is evolutionary conservation. Various methods have been devised to quantify the apparent degree of evolutionary constraint at each nucleotide location in the genome, by observing patterns of substitution across multiple species
(e.g., Siepel et al., 2005), and these are commonly used to infer that, e.g., variants in constrained positions are more likely to be deleterious than those in unconstrained positions. Evolutionary conservation can also be measured between related proteins within one species, by aligning the proteins and observing substitution patterns. This is commonly used in interpreting amino acid substitutions (Kumar et al., 2009).
Even variants within a gene but outside of the coding segment can be potentially deleterious. Regulatory elements such as transcriptional enhancers (section 2.2) can occur within introns, and variants that alter those enhancers can have phenotypic effects
(section 2.2.5). Exonic regions outside of coding segments, termed untranslated regions
(UTRs), can also be sensitive to genetic variants, as these regions can contain regulatory elements such as enhancers or miRNA target sites (Majoros and Ohler, 2007).
While existing variant prioritization tools utilize some of the foregoing in interpreting likely variant effects on genes, they generally assume that gene structure annotations are both correct and fixed. It has been noted that the use of different sets of annotations can have a substantial effect on variant interpretation (Frankish et al., 2015).
Furthermore, as discussed next, gene structures can themselves change as a result of genetic variants, and those changes can alter how other variants are best interpreted. In this way, genetic variants can have non-independent effects, as addressed in Chapter 3.
2.1.1.5 Genetic variants can alter splicing
Genetic variants that disrupt functional splice sites by changing their dinucleotide consensus to a sequence not recognized by the spliceosome have the potential to be deleterious (Buratti et al., 2007). Loss of a functional splice site can lead to a number of different outcomes, including exon skipping, intron retention, or the use of a different splice site. Depending on the specific context, the magnitude of the effect of these outcomes can differ, as described below.
In the case of exon skipping, in which a whole exon is omitted from the spliced transcript, the effect can depend on the exon length, and whether the exon encodes a polypeptide segment that is critical for protein function. The mammalian DMD gene encodes dystrophin, a protein necessary for muscle function, and its disruption can
lead to Duchenne muscular dystrophy (reviewed in Roberts et al., 1994). However, many of the exons in the central portion of the gene can be safely skipped and still produce a functional dystrophin homologue (England et al., 1990). Because these exons have lengths that are divisible by 3, skipping these exons does not alter the reading frame for downstream (more 3’) exons. In contrast, skipping a coding exon having length not divisible by 3 will result in a frameshift, which if not corrected by compensatory changes downstream can lead to loss of protein function (Juan-Mateu et al., 2013).
Intron retention involves the inclusion of whole introns into the mature transcript.
If translated, a retained intron will introduce new amino acids into the encoded protein, and can also alter the translation reading frame, introduce a premature stop codon, or introduce a new start codon. A retained intron could also potentially introduce secondary structures that may impede translation (Doma and Parker, 2006). Because human introns can be very long, intron retention in humans typically results in premature termination of the reading frame and loss of function (Jung et al., 2015).
Intron retention is more commonly observed in plants, which have shorter introns.
Indeed, many plant genes utilize intron retention in regulated alternative splicing to produce different, functional transcripts (Kalyna et al., 2012).
In addition to exon skipping and intron retention, another possible outcome of splice site disruption is the use of a different splice site close to the disrupted site. In some cases a gene may have multiple transcripts, or splice isoforms, that differ only in
that each uses a different splice site at one end of an exon, and if those splice sites are sufficiently close, disruption of one site might cause the spliceosome to select the other site (Královicová and Vořechovský, 2007). In this case, if the two transcripts have similar or identical functions, disruption of a splice site may have little or no functional consequence. In other cases, the spliceosome might select another splice site that is otherwise never or only very rarely used. Such a splice site is called a cryptic site, and its use is called cryptic splicing.
Use of a cryptic site may have a range of consequences, depending on its location. A cryptic site that would otherwise be intronic will result in lengthening of the proximal exon, and if it is a coding exon, this will result in additional nucleotides being translated. A cryptic site chosen from within an exon will shorten the exon. When the number of nucleotides added or removed from an exon is divisible by three, any translational reading frames will be retained, and the result will be a local modification to the encoded amino acid sequence. Whether such a local modification is likely to be deleterious depends on the context within the original protein, and could potentially be assessed by observing local evolutionary conservation profiles as described above.
When the number of nucleotides added or removed is not divisible by three, the effect on a coding segment will be to shift the reading frame, which is often disruptive to protein function.
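The divisibility-by-three reasoning used above for exon skipping and cryptic splicing reduces to simple modular arithmetic, sketched below. The sketch assumes that only the net number of coding nucleotides gained or lost matters, that the affected exon is entirely coding, and, for the cryptic-site helper, a donor site on the forward strand. Both function names are hypothetical.

```python
def frame_effect(delta_nt):
    """Describe the effect of a net change of `delta_nt` coding nucleotides.

    Positive values mean nucleotides were added to the coding segment and
    negative values mean they were removed.
    """
    if delta_nt == 0:
        return "unchanged"
    if delta_nt % 3 == 0:
        kind = "insertion" if delta_nt > 0 else "deletion"
        return f"in-frame {kind} of {abs(delta_nt) // 3} codon(s)"
    return "frameshift"

def cryptic_donor_effect(canonical_donor, cryptic_donor):
    """Effect of using a cryptic donor site in place of the canonical one.

    Assumes a forward-strand gene, so that a cryptic donor downstream of the
    canonical site (within the intron) lengthens the upstream exon, and one
    upstream of it (within the exon) shortens that exon.
    """
    return frame_effect(cryptic_donor - canonical_donor)

print(frame_effect(-150))                # skipping a 150 nt coding exon
print(frame_effect(-149))                # skipping a 149 nt coding exon
print(cryptic_donor_effect(1000, 1012))  # exon lengthened by 12 nt
print(cryptic_donor_effect(1000, 995))   # exon shortened by 5 nt
```

An in-frame result does not imply a benign change; it indicates only that downstream codons are read in their original frame.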
While genetic variants that disrupt a splice site have a clear potential to alter splice patterns, variants that do not directly disrupt splice sites can still alter splicing outcomes. Variants that create a new splice site, termed a de novo splice site, can alter splicing if the new splice site is recognized by the spliceosome. Recalling that the spliceosome has preferences for specific flanking nucleotides in addition to the dinucleotide consensus sequences of donor and acceptor splice sites, variants that alter these flanking bases could potentially alter the strength of a splice site, resulting in either the strengthening of a cryptic site or the weakening of a canonical site. In addition, it has become increasingly clear in recent years that variants that alter splicing regulatory elements (SREs—section 2.1.1.1) can impact splicing outcomes, which can in turn lead to disease (Di Giacomo et al., 2013).
2.1.1.6 Genetic variants can alter translation reading frames
As alluded to earlier, insertions into or deletions from coding segments alter the translation reading frame when their lengths are not divisible by 3. Changing the reading frame changes the sequence of codons and thus the sequence of encoded amino acids. Unless subsequent changes restore the original reading frame downstream (section 2.1.1.2), such frameshifts affect the complete terminal portion of the encoded protein. Frameshifts that are not near the end of the encoded protein can therefore be expected to be overwhelmingly disruptive to protein function. However, even frameshifts near the end of a coding segment can conceivably result in loss of function if the remaining portion of the altered
reading frame contains no in-frame stop codon, because lack of a stop codon can be detected by the cell and can result in nonstop decay and transcript degradation
(Vasudevan et al., 2002; reviewed by Klauer and van Hoof, 2012).
Frameshifts can come about in a number of ways. Splicing changes are one common source of frameshifts. Genetic variants that insert or delete nucleotides (indels) are another source. A variant that creates a new start codon upstream from (5’ of) the canonical start codon can result in a frameshift if the change to the coding segment has length not divisible by three; an example is illustrated in Chapter 3 (Figure 17B).
Similarly, if a variant disrupts a canonical start codon, the ribosome scanning model predicts that the ribosome will select a downstream (more 3’) start codon if possible, and that may again result in a frameshift. Use of a different start codon, even if it does not cause a frameshift, can be potentially disruptive, as this will add amino acids to, or remove them from, the encoded protein. Likewise, a variant that disrupts a canonical stop codon can result in nonstop decay as described above, or if another stop codon exists in-frame downstream of the original but now disrupted stop codon, additional amino acids may be appended to the encoded protein.
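The start- and stop-codon scenarios above can be explored with a small sketch of a strongly simplified scanning model. It assumes that the ribosome always initiates at the first ATG on the spliced transcript, ignoring sequence context around the start codon, and that translation continues to the first in-frame stop codon. The function name `scan_orf` and the toy transcripts are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def scan_orf(transcript, min_start=0):
    """Locate the reading frame chosen by a simplified scanning model.

    Returns (start, stop), the 0-based offsets of the first ATG at or after
    `min_start` and of the first in-frame stop codon after it. `stop` is None
    when no in-frame stop codon exists (a candidate for nonstop decay).
    Returns None when the transcript contains no ATG at all.
    """
    start = transcript.find("ATG", min_start)
    if start < 0:
        return None
    for i in range(start + 3, len(transcript) - 2, 3):
        if transcript[i:i + 3] in STOPS:
            return start, i
    return start, None

REF        = "CCATGGCTATGGGATAAGG"  # 5' UTR, ATG GCT ATG GGA TAA, 3' UTR
START_LOST = "CCACGGCTATGGGATAAGG"  # canonical ATG changed to ACG
STOP_LOST  = "CCATGGCTATGGGATACGG"  # canonical TAA changed to TAC

print(scan_orf(REF))         # (2, 14)
print(scan_orf(START_LOST))  # (8, 14): downstream ATG, 6 nt later, same frame
print(scan_orf(STOP_LOST))   # (2, None): no in-frame stop codon remains
```

Comparing the new start offset with the original one modulo 3 indicates whether the alternative start codon preserves the original reading frame.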
A common result of frameshifts is the introduction of a premature stop codon, as described earlier. A premature stop codon can also be created directly by a single genetic variant. Premature stop codons (also called premature termination codons, or PTCs) can trigger nonsense-mediated decay (NMD), resulting in degradation of the transcript and loss
of protein production (reviewed in Lykke-Andersen and Jensen, 2015). An established heuristic for predicting NMD in mammals and plants is the 50 bp rule (Nagy and
Maquat, 1998; Nyiko et al., 2013): if the distance (measured in nucleotides on the spliced transcript) of the PTC to the most 3’ exon junction is less than 50 bp, protein truncation is predicted. Otherwise, NMD is predicted. Recent analyses based on massively parallel splicing reporter assays found strong support for the 50 bp rule in human cells
(Rosenberg et al., 2015).
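The 50 bp rule can be written as a short predicate. The sketch below assumes that distance is measured on the spliced transcript from the first base of the PTC to the first base of the last exon, and that a PTC located in the last exon, or in a single-exon transcript, is treated as escaping NMD; the text above does not fix these conventions, so they are assumptions. The function name `predict_ptc_outcome` and the example exon lengths are hypothetical.

```python
def predict_ptc_outcome(ptc_pos, exon_lengths, threshold=50):
    """Apply the 50 bp rule to a premature termination codon (PTC).

    ptc_pos: 0-based offset of the first base of the PTC on the spliced
        transcript.
    exon_lengths: exon lengths in 5'-to-3' order along the transcript.
    Returns "NMD" or "truncation".
    """
    if len(exon_lengths) < 2:
        return "truncation"  # assumption: no exon junction, so no NMD
    last_junction = sum(exon_lengths[:-1])  # offset of the last exon's first base
    distance = last_junction - ptc_pos      # <= 0 when the PTC is in the last exon
    return "NMD" if distance >= threshold else "truncation"

exons = [200, 150, 300]  # last exon junction at transcript offset 350
print(predict_ptc_outcome(120, exons))  # 230 nt upstream of the junction: NMD
print(predict_ptc_outcome(320, exons))  # 30 nt upstream: truncation
print(predict_ptc_outcome(400, exons))  # within the last exon: truncation
```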
2.1.2 Traditional approaches to gene structure modeling
Work on the development of methods for gene-structure prediction, also called gene-finding, has a long history. Early methods in the 1980s predicted individual exons and ORFs in isolation (Staden and McLachlan, 1982; Fickett, 1982). In the early 1990s, methods for chaining together exons and introns using dynamic programming were developed (Snyder and Stormo, 1993; Stormo and Haussler, 1994). In the late 1990s, shortly before the initial draft sequencing of the human genome, highly accurate and efficient methods for predicting whole-gene structures in long genomic sequences were developed, based on hidden Markov models, and by the early 2000s most of the gene-finding field had converged on the use of generalized hidden Markov models (GHMMs)
(Kulp et al., 1996), also known as semi-Markov models (Burge and Karlin, 1997). In the late 2000s attention shifted from the use of GHMMs, which are generative models, to the
use of conditional random fields (CRFs) (Vinson et al., 2007; Bernal et al., 2007; Gross et al.,
2007), which are discriminative models. In the following sections I review these methods.
2.1.2.1 Hidden Markov models
Hidden Markov models (HMMs) have been applied to various problems in computational linguistics, most notably speech recognition (Jelinek, 1998) and part-of-speech tagging (Majoros et al., 2002), as well as a number of problems in genomics
(Durbin et al., 1998). As the stochastic, generative versions of finite automata (Hopcroft and Ullman, 1979), they are effectively grammar models and thus applicable to the problem of parsing finite-length sequences of discrete symbols from a finite alphabet such as DNA or RNA. Their probabilistic nature also allows them to be utilized for classification, by assigning probabilities to sequences.
Formally, an HMM is a machine $M = (Q, A, P_t, P_e)$ having a finite set of states $Q = \{q_0, q_1, \ldots, q_{n-1}\}$ for $n = |Q|$, a finite emission alphabet $A$, a transition probability distribution $P_t(q_i \mid q_j)$ for $q_i \in Q$, $q_j \in Q$, and an emission distribution $P_e(x \mid y)$ for $x \in A$ and $y \in Q$. I adopt the convention that $M$ begins and ends in state 0, denoted $q_0$, which is a silent state. At discrete time points the machine transitions from the current state $y_i$ to a next state $y_j$ stochastically, according to $P_t(y_j \mid y_i)$. For every non-silent state $q_i$, $i \neq 0$, upon entering $q_i$ the machine emits a random symbol $x$ chosen according to $P_e(x \mid q_i)$. Upon re-entering $q_0$ the machine terminates and the emitted sequence is complete.
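The generative behavior just described can be sketched directly in code. The example assumes that transition probabilities are stored as a row-stochastic matrix indexed as trans[current][next], that emission distributions are stored per non-silent state as dictionaries, and that state 0 is the silent start/end state. The function name `sample_hmm`, the `max_len` safeguard, and the two-state toy model are hypothetical.

```python
import random

def sample_hmm(trans, emit, rng=None, max_len=10_000):
    """Generate one (state path, sequence) pair from an HMM.

    trans[i][j] is the probability of moving from state i to state j.
    emit[i] maps each symbol to its emission probability in state i (i != 0).
    State 0 is silent: the machine starts there and stops on re-entering it.
    """
    rng = rng or random.Random()
    states, symbols = [], []
    state = 0
    while len(symbols) < max_len:  # guard against models that rarely terminate
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
        if state == 0:
            break                  # re-entered the silent state: terminate
        alphabet = list(emit[state])
        weights = [emit[state][a] for a in alphabet]
        symbols.append(rng.choices(alphabet, weights=weights)[0])
        states.append(state)
    return states, "".join(symbols)

# Toy model: state 1 is AT-rich and state 2 is GC-rich.
TRANS = [
    [0.0, 0.5, 0.5],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
EMIT = {
    1: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
path, seq = sample_hmm(TRANS, EMIT, random.Random(0))
print(seq)
print("".join(map(str, path)))
```

Each run yields a sequence together with the state path that generated it; in parsing applications only the sequence is observed, and the path must be inferred.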
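Efficient algorithms for performing inference with HMMs have been widely disseminated (Rabiner, 1989; Durbin et al., 1998). Given a non-empty sequence $S = x_0 x_1 \ldots x_{L-1}$, the Viterbi algorithm computes the most probable state path, $\phi^* = (y_0, y_1, \ldots, y_{L-1})$ for $y_i \in Q$, by which the machine could have emitted $S$:

$$
\begin{aligned}
\phi^* &= \operatorname*{argmax}_{\phi} P(\phi \mid S) \\
       &= \operatorname*{argmax}_{\phi} \frac{P(\phi, S)}{P(S)} \\
       &= \operatorname*{argmax}_{\phi} P(\phi, S) \\
       &= \operatorname*{argmax}_{\phi} P(S \mid \phi)\, P(\phi)
\end{aligned}
$$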
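As a concrete illustration of this decoding problem, the following minimal sketch computes the most probable path by dynamic programming in log space. It assumes transition probabilities stored as trans[current][next], per-state emission dictionaries, and the silent start/end state 0, so that the path score includes both the initial transition out of state 0 and the final transition back into it. The function name `viterbi` and the toy model are hypothetical.

```python
import math

def viterbi(seq, trans, emit):
    """Return (log probability, state path) of the most probable path for seq.

    trans[i][j] is the probability of moving from state i to state j;
    emit[i] maps symbols to emission probabilities for each state i != 0.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")

    def log(p):
        return math.log(p) if p > 0 else -math.inf

    emitting = range(1, len(trans))
    # Initialization: leave the silent state and emit the first symbol.
    score = {q: log(trans[0][q]) + log(emit[q].get(seq[0], 0.0)) for q in emitting}
    back = []  # back[t][q] = best predecessor of state q at position t + 1
    for x in seq[1:]:
        prev, score, ptr = score, {}, {}
        for q in emitting:
            best = max(emitting, key=lambda p: prev[p] + log(trans[p][q]))
            score[q] = prev[best] + log(trans[best][q]) + log(emit[q].get(x, 0.0))
            ptr[q] = best
        back.append(ptr)
    # Termination: the machine must finally return to the silent state.
    last = max(emitting, key=lambda q: score[q] + log(trans[q][0]))
    total = score[last] + log(trans[last][0])
    path = [last]
    for ptr in reversed(back):  # trace the back-pointers to recover the path
        path.append(ptr[path[-1]])
    return total, path[::-1]

# Toy model: state 1 is AT-rich and state 2 is GC-rich.
TRANS = [
    [0.0, 0.5, 0.5],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
EMIT = {
    1: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
logp, path = viterbi("ATATGCGC", TRANS, EMIT)
print(path)  # [1, 1, 1, 1, 2, 2, 2, 2]
```

Working in log space replaces products of many small probabilities with sums, which avoids numerical underflow on long sequences. The running time is proportional to the sequence length times the square of the number of states.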