BIOINFORMATICS DOI: 10.1093/Bioinformatics/Btg1003
Vol. 19 Suppl. 1 2003, pages i34–i43
BIOINFORMATICS
DOI: 10.1093/bioinformatics/btg1003
Bioinformatics 19(1) © Oxford University Press 2003; all rights reserved.

Predicting bacterial transcription units using sequence and expression data

Joseph Bockhorst 1,2,∗, Yu Qiu 3, Jeremy Glasner 3, Mingzhu Liu 3, Frederick Blattner 3 and Mark Craven 2,1

1 Department of Computer Sciences, 2 Department of Biostatistics & Medical Informatics and 3 Laboratory of Genetics, University of Wisconsin, Madison, Wisconsin 53706, USA

∗ To whom correspondence should be addressed.

Received on January 6, 2003; accepted on February 20, 2003

ABSTRACT

Motivation: A key aspect of elucidating gene regulation in bacterial genomes is identifying the basic units of transcription. We present a method, based on probabilistic language models, that we apply to predict operons, promoters and terminators in the genome of Escherichia coli K-12. Our approach has two key properties: (i) it provides a coherent set of predictions for related regulatory elements of various types and (ii) it takes advantage of both DNA sequence and gene expression data, including expression measurements from inter-genic probes.

Results: Our experimental results show that we are able to predict operons and localize promoters and terminators with high accuracy. Moreover, our models that use both sequence and expression data are more accurate than those that use only one of these two data sources.

Availability: Our E.coli transcription-unit predictions are available from http://www.biostat.wisc.edu/gene-regulation/.

Contact: [email protected]; [email protected]

INTRODUCTION

A central challenge in computational biology is to uncover the complete gene regulation network of an organism. This challenge can now be profitably attacked given the availability of complete genomes and high-throughput technologies for interrogating the states of cells. A key step in addressing the challenge is to assemble a ‘parts list’ of the regulatory elements for a given genome. We present an approach, based on probabilistic language models, that uses sequence and expression data to predict a variety of regulatory elements in prokaryotic genomes. We apply this approach to the task of predicting transcription units in the genome of Escherichia coli K-12. Our approach has two key properties. First, the approach can provide a coherent set of predictions for related regulatory elements of various types. Second, the approach can take advantage of both DNA sequence data and gene expression data.

The prediction task that we consider here is the following. Given the genome sequence and gene expression data for a prokaryotic organism, predict all of the transcription units (TUs) in the genome. The definition of a transcription unit that we assume is: the complete extent of a sequence of DNA that is transcribed to produce a single mRNA transcript. That is, each of our predicted transcription units consists of a sequence of co-transcribed genes along with the associated polymerase binding site and transcription termination signal. Our approach is to use a probabilistic grammar for DNA sequence and expression data. We train this model from a set of known operons, promoters and terminators. The definition of operon we use is a set of genes that are transcribed as a unit under some condition. Note that, in contrast to our definition of a transcription unit, our operon definition does not include the exact start and end of transcription points. The gene expression data we use comes from Affymetrix oligonucleotide arrays for E.coli that include probes for both coding and inter-genic regions. Because they contain probes corresponding to inter-genic regions, these arrays are quite advantageous for identifying the 5′ and 3′ ends of transcription units. However, our approach is general enough that it does not require expression data from this type of array. Our approach can use expression data from any type of microarray, and is still applicable even when expression data is not available.

There is a large body of research on using computational methods to recognize regulatory elements in prokaryotic DNA. This work includes methods that learn models to recognize promoters (Pedersen et al., 1996; Thieffry et al., 1998), terminators (Brendel et al., 1986; Carafa et al., 1990; Ermolaeva et al., 2000), and operons (Salgado et al., 2000; Moreno-Hagelsieb and Collado-Vides, 2002). Our work differs from these efforts in several respects. First, in making our predictions we use not only sequence data, but also expression data. Second, our model simultaneously predicts promoters, terminators and operons instead of treating these as separate prediction tasks. Third, we predict the entire extent of transcription units.

In earlier research (Craven et al., 2000; Bockhorst et al., 2003), we developed probabilistic methods for predicting operons using sequence and expression data. The work presented herein extends our earlier work by predicting the complete extent of transcription units, as opposed to predicting just which sequences of genes are in the same operons. Additionally, our new approach extends our earlier work by simultaneously predicting a coherent set of promoter, terminator and operon predictions for a complete genome. The notion of coherency here is that a set of predictions satisfies the relationships that necessarily hold among the regulatory elements being predicted. For example, each predicted TU must be bracketed by a promoter and a terminator. Also, the expression data used previously contained expression intensities only for genes while here we also utilize inter-genic probe measurements.

Yada et al. (1999) developed an approach to predict transcription units from sequence data, and Tjaden et al. (2002) developed an approach to predict transcription units using expression data from Affymetrix E.coli arrays. However, in contrast to our approach, these methods use only a single source of data. As our experiments show, the combination of DNA sequence data and expression data results in more accurate predictions than either alone. Additionally, the use of DNA sequence data allows our method to make predictions for genes that are not expressed under the measured conditions. Finally, we predict transcription terminators whereas Tjaden et al. do not, and our approach can naturally be extended to predict other regulatory elements, such as transcription factor binding sites, as well.

Several other research groups have addressed the task of predicting operons. Ermolaeva et al. (2001) use comparisons of gene order in multiple genomes to predict operons. Zheng et al. (2002) use information about biochemical pathways to predict metabolism-related operons. Since these methods use different sources of evidence than we do, we view them as complementary.

DATA REPRESENTATION

Our approach to predicting transcription units is based on a probabilistic language model, which describes TUs in terms of three sequences of observations. The first sequence of observations, which we denote by x, consists of the nucleotides of the DNA sequence. The second sequence of observations, which we denote by y, is composed of the similarities between the codon usage biases of neighboring ORFs. The third sequence of observations, which we denote by z, is based on the expression measurements made by the probes on the oligonucleotide array. These three sequences are aligned by associating each codon usage and expression observation with a specific position in the DNA sequence. Table 1 provides a toy example illustrating each of these sequences.

For the analysis considered here, we assume that we are given the (predicted) coordinates of every ORF in the genome. The basic units that our model processes when making predictions are runs of genes. We define a run to be a sequence of DNA that (i) contains genes only on one strand, and (ii) is bracketed by genes on the opposite strand. Given such a run, we would like our model to predict all of the TUs contained within it. That is, we want the model to identify the sequences of genes that are co-transcribed, along with the precise 5′ and 3′ ends of the DNA sequence corresponding to each transcript.

In the remainder of this section we describe in detail how we derive codon-usage and expression observation sequences.

Codon usage sequence

A variety of factors influence an ORF’s codon usage properties (Karlin et al., 1998) including some, such as gene function, expression level and evolutionary history, that also influence the grouping of ORFs into transcription units. In previous work we found consideration of codon usage properties to be beneficial to an operon prediction task (Bockhorst et al., 2003).

To derive a sequence of codon usage based observations from a run we first associate each ORF $o$ in the run with a set of codon bias vectors $\{b_a^o\}$, one for each amino acid $a$. Let $uvw$ be a codon that codes for $a$ and $n^o_{uvw}$ be the number of times codon $uvw$ appears in $o$. Then, the elements of the bias vectors are given by:

$$b^o_{a,uvw} = \hat{f}^o(uvw \mid a) - \bar{f}(uvw \mid a)$$

where $\bar{f}(uvw \mid a)$ is the frequency with which $a$ is encoded by $uvw$ (relative to other codings for $a$) over the whole genome and

$$\hat{f}^o(uvw \mid a) = \frac{n^o_{uvw} + \bar{f}(uvw \mid a)}{\sum_{xyz \in \mathrm{codons}(a)} n^o_{xyz} + 1}$$

is the smoothed frequency with which $a$ is coded for by $uvw$ in $o$. The sum in the denominator ranges over all codons that code for amino acid $a$.

Next, we calculate the codon usage similarity between each pair of neighboring ORFs in the run. The codon usage similarity between the two ORFs $o$ and $q$ is defined as:

$$y_i = \mathrm{Sim}(o, q) = \sum_a b_a^o \cdot b_a^q.$$
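As an illustration only, the codon bias vectors and the codon usage similarity defined above can be sketched in Python. The two-amino-acid codon table, the genome-wide frequencies and all function names are assumptions made for this sketch; a real analysis would use the full genetic code and frequencies estimated from the genome.

```python
from collections import Counter

# Hypothetical toy codon table: amino acid -> synonymous codons (assumption;
# a real analysis would cover the full genetic code).
CODONS = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}

# Made-up genome-wide frequencies f-bar(uvw|a), for illustration only.
GENOME_FREQ = {"K": {"AAA": 0.75, "AAG": 0.25}, "F": {"TTT": 0.5, "TTC": 0.5}}


def codon_counts(orf_seq):
    """Count the in-frame codons of an ORF given as a DNA string."""
    return Counter(orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 2, 3))


def smoothed_freq(counts, genome_freq, aa):
    """f-hat^o(uvw|a) = (n^o_uvw + f-bar(uvw|a)) / (sum_xyz n^o_xyz + 1)."""
    total = sum(counts[c] for c in CODONS[aa])
    return {c: (counts[c] + genome_freq[aa][c]) / (total + 1) for c in CODONS[aa]}


def bias_vectors(orf_seq, genome_freq):
    """b^o_{a,uvw} = f-hat^o(uvw|a) - f-bar(uvw|a), one vector per amino acid."""
    counts = codon_counts(orf_seq)
    bias = {}
    for aa in CODONS:
        fhat = smoothed_freq(counts, genome_freq, aa)
        bias[aa] = {c: fhat[c] - genome_freq[aa][c] for c in CODONS[aa]}
    return bias


def similarity(bias_o, bias_q):
    """Sim(o, q) = sum over amino acids a of the dot product b^o_a . b^q_a."""
    return sum(bias_o[aa][c] * bias_q[aa][c] for aa in CODONS for c in CODONS[aa])


b_o = bias_vectors("AAAAAAAAG", GENOME_FREQ)  # ORF o: AAA twice, AAG once
b_q = bias_vectors("AAGAAGAAA", GENOME_FREQ)  # ORF q: AAG twice, AAA once
sim_oq = similarity(b_o, b_q)
```

Because the genome-wide frequencies for an amino acid sum to one, the smoothing term adds exactly one pseudo-count per amino acid, so the smoothed frequencies also sum to one and an ORF containing no codons for a given amino acid receives a zero bias vector for it.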
Table 1. A toy example of an observation sequence for a region of DNA at an ORF boundary

   region                                intergenic (lower case) followed by ORF (upper case)
   probes                                •• • ••
x  DNA sequence                          c    c    c    g    a    g    a    a    g    a    A    T    G    C    G    T
y  ORF-ORF codon usage similarity        –    –    –    –    –    –    –    –    –    –    0.4  –    –    –    –    –
   ORF-ORF expression correlation        –    –    –    –    –    –    –    –    –    –    0.6  –    –    –    –    –
z  upstream expression correlation       –    0.8  –    –    0.7  –    –    –    0.4  –    –    –    –    –    –    –
   downstream expression correlation     –    0.3  –    –    0.4  –    –    –    0.2  –    –    –    –    –    –    –

Nucleotides shown in lower case correspond to inter-genic positions and those shown in upper case correspond to coding positions.
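One possible way to hold the aligned observation sequences of Table 1 in memory is sketched below. The `Observation` class, its field names and the `align` helper are hypothetical and introduced only for illustration; positions are 0-based, and a value of `None` stands for a position that carries no observation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """One aligned position: a nucleotide plus optional measurements."""
    nucleotide: str                          # x: lower case = inter-genic, upper case = coding
    codon_sim: Optional[float] = None        # ORF-ORF codon usage similarity
    orf_expr_corr: Optional[float] = None    # ORF-ORF expression correlation
    upstream_corr: Optional[float] = None    # upstream expression correlation
    downstream_corr: Optional[float] = None  # downstream expression correlation


def align(dna, **tracks):
    """Attach sparse {position: value} tracks (0-based) to each nucleotide."""
    obs = [Observation(nt) for nt in dna]
    for name, values in tracks.items():
        for pos, val in values.items():
            setattr(obs[pos], name, val)
    return obs


# The toy example of Table 1: ten inter-genic positions, then the ORF start.
toy = align(
    "cccgagaagaATGCGT",
    codon_sim={10: 0.4},
    orf_expr_corr={10: 0.6},
    upstream_corr={1: 0.8, 4: 0.7, 8: 0.4},
    downstream_corr={1: 0.3, 4: 0.4, 8: 0.2},
)
```

Keeping one record per nucleotide mirrors the alignment described in the text, in which each codon usage and expression observation is associated with a specific position in the DNA sequence.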