Fstitch: a Fast and Simple Algorithm for Detecting Nascent RNA Transcripts

FStitch: A fast and simple algorithm for detecting nascent RNA transcripts Joseph Azofeifa Mary A. Allen Manuel Lladser Department of Computer BioFrontiers Institute Department of Applied Science University of Colorado Mathematics University of Colorado 596 UCB, JSCBB University of Colorado 596 UCB, JSCBB Boulder, CO 80309 526 UCB Boulder, CO 80309 Boulder, CO 80309 ∗ Robin Dowell Department of MCD Biology & Computer Science BioFrontiers Institute University of Colorado 596 UCB, JSCBB Boulder, CO 80309 ABSTRACT Keywords We present a fast and simple algorithm to detect nascent Nascent Transcription, Logisitic Regression, Hidden Markov RNA transcription in global nuclear run-on sequencing Models (GRO-seq). GRO-seq is a relatively new protocol that cap- tures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady state 1. INTRODUCTION RNA levels, which are affected by transcription, post-trans- Almost all cellular stimulations trigger global transcriptional criptional processing, and RNA stability. A detailed study changes. To date, most studies of transcription have em- of GRO-seq data has the potential to inform on many as- ployed RNA-seq or microarrays. These assays, though pow- pects of the transcription process. GRO-seq data, however, erful, measure steady state RNA levels. Consequently, they presents unique analysis challenges that are only beginning are not true measures of transcription because steady state to be addressed. Here we describe a new algorithm, Fast levels are influenced by not only transcription but also RNA Read Stitcher (FStitch), that takes advantage of two pop- stability. Only recently has a method for direct measur- ular machine-learning techniques, a hidden Markov model ment of transcription genome-wide become available. This (HMM) and logistic regression to robustly classify which re- technique, known as global run-on sequencing (GRO-seq), gions of the genome are transcribed. Our algorithm builds simultaneously detects the amount and direction of actively on the strengths of previous approaches but is accurate, de- engaged polymerases at every position within the genome[7]. pendent on very little training data, robust to varying read GRO-seq has already drastically influenced our understand- depth, annotation agnostic, and fast. ing of the transcription process, as most of the genome is transcribed but rapidly degraded. Categories and Subject Descriptors The earliest and most common approach to GRO-seq anal- I.2.1 [ARTIFICIAL INTELLIGENCE]: Applications and ysis is annotation centric[7, 22, 17, 14] and only two efforts Expert Systems|Medicine and science have attempted to identify regions of active transcription directly from GRO-seq data[2, 11]. The first of these ap- General Terms proaches used a two state Hidden Markov model that was Algorithms, Experimentation parametrized based on available annotations[11]. This approach has the advantage of calling large contiguous re- ∗corresponding author [email protected] gions as transcribed, but fails to call many unannotated regions because their length and transcription levels do not Permission to make digital or hard copies of all or part of this work for mimic well annotated regions. The more recent approach, personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear called Vespucci, uses a sliding-window (specified by two user- this notice and the full citation on the first page. Copyrights for components dependent parameters) that joins transcripts together based of this work owned by others than ACM must be honored. Abstracting with on read depth, but requires the user to tune the algorithm credit is permitted. To copy otherwise, or republish, to post on servers or to with each new dataset[2]. The windowing scheme, in prin- redistribute to lists, requires prior specific permission and/or a fee. Request ciple, has the benefit of not depending on annotation, how- permissions from [email protected]. ever, in practice, regions of transcription are often broken BCB’14, September 20–23, 2014, Newport Beach, CA, USA. Copyright 2014 ACM 978-1-4503-2894-4/14/09 ...$15.00. into discontiguous sections, requiring the use of annotations http://dx.doi.org/10.1145/2649387.2649427 to improve their strategy[2]. The design of our software is largely motivated by both the strengths and shortcomings of these previous efforts[11, 21, 2]. In particular, we propose a fast and robust method that takes advantage of a logistic regression classifier em- bedded within a hidden Markov model as a means of learning non-linear decision boundaries that classify regions of active nascent transcription[19]. Such an algorithmic technique shares a similar structure with a Maximum Entropy Markov Models[19]. Thusly, our methodology allows for parameters to adapt on the fly to new data. It is annotation Figure 1: Contig length and coverage statistics dis- agnostic, effectively identifies cohesive regions of active tran- criminate active from inactive nascent transcription. scription, has a rapid runtime, and is easily parameterized A contig (blue) is defined by the length of overlapping reads by providing a small number of training examples. x1 in Table 1. Coverage statistics define mean, median, mode and variance of reads (black bars) across a contig. In places where there are no reads, a gap (green) is defined 2. MATERIALS AND METHODS by a negative length value and all contig coverage statistics 2.1 Algorithm Description are set to zero. The GRO-seq technique measures nascent transcripts pro- duced from actively engaged polymerase. Because splicing 2.2 Datasets has not yet occurred, each transcript covers a contiguous re- This study takes advantage of three previously published gion of the underlying genome, reflecting the extent of poly- GRO-seq datasets: MCF-7 [11], IMR90 [7] and HCT116[1] merase activity. Sequencing reads obtained from the GRO- as well as two previously published ChIP-Poll II datasets: seq protocol represent a sampling of the underlying tran- HCT116[13] and MCF7 [15]. For each experiment, raw reads scripts in proportion to their relative abundances. Ideally, were mapped and aligned to the hg19 genome using Bowtie2 overlapping reads could be merged into contigs, or regions with the command bowtie -S -t -v 2 -best[16]. ENCODE of continuous read coverage, defining regions of active tran- provided H3K27ac and DNAse peak calls for IMR90 [21, 6], scription. But because of uneven sampling, read coverage MCF7 [9, 12] and HCT116 [9, 24] as well as ChIA-PET peak within active regions may not be continuous. Furthermore, calls for HCT116[10]. the sequencing and mapping process is noisy, therefore reads can also map to inactive regions. We hand annotated the entire length of chromosome 1 in the GRO-seq dataset from HCT116[1] to perform k-fold cross Transcription can be modeled as a discrete time-series in- validation. For all testing, 95% of the labeled dataset was dexed by genomic coordinates where transcriptional activ- removed from training and used to asses model accuracy. To ity of adjacent base-pairs are correlated. Similar to prior be clear, the entire labeled HCT116 training data contains models of GRO-seq[11], we model this process as an ergodic 17,776 labeled active regions. In both the IMR90 and MCF7 first-order Markov chain where transcription oscillates be- GRO-seq datasets, 7 regions considered active and 7 regions tween active and inactive states. Unlike previous models, considered inactive were labeled for parameter estimation. which classify individual nucleotides, our model emits from These training sets with genomic coordinates and labels are each state a contig representative of an active or inactive freely downloadable at http://dowell.colorado.edu. region (Figure 1). Each contig can be described by two feature classes: contig length (maximum length of overlapping reads) and contig coverage statistics (Table 1). Active 2.3 Parameter Estimation states, in general, contain a combination of long regions with Both the Markov model transition probabilities and the con- high signal interspersed with short regions of relatively no ditional state emission probabilities are estimated via a user signal. Hence our HMM framework allows for the classifica- defined, labeled training set. Given that read mapping can tion of a continuous active region, containing one or more be noisy and nascent transcripts can be present at very low contigs, despite the variability in coverage of individual nu- levels, estimating the parameters that define transcription cleotides that is inherent in short read sequencing data. from those describing read-mapping poses a difficult prob- lem. To this end, the algorithm requires a small training Many of the features of contigs are dependent on sequencing dataset, provided by the user, in which regions of the genome depth of a particular experiment. Therefore, we must learn are defined as either active or inactive. We show in section the emission and transition probabilities of each state from 3.1 that little training data is needed to retain high model ac- a user provided training set. In our case, the training set curacy. Intuitively, we define model accuracy as the fraction corresponds to regions of active and inactive transcription. of base pairs where the user-label and classification-label Given a training set, we learn the conditional

Fstitch: a Fast and Simple Algorithm for Detecting Nascent RNA Transcripts

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support