
Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis

Harmen J. Bussemaker*†, Hao Li†‡, and Eric D. Siggia†

Center for Studies in Physics and Biology, The Rockefeller University, Box 25, 1230 York Avenue, New York, NY 10021

Communicated by Mitchell J. Feigenbaum, The Rockefeller University, New York, NY, June 8, 2000 (received for review April 18, 2000)

*Present address: Swammerdam Institute for Life Sciences and Amsterdam Center for Computational Science, University of Amsterdam, Kruislaan 318, 1098 SM Amsterdam, The Netherlands.
†E-mail: [email protected], [email protected], or [email protected].
‡Present address: Departments of Biochemistry and Biophysics, University of California, San Francisco, CA 94143.
Article published online before print: Proc. Natl. Acad. Sci. USA, 10.1073/pnas.180265397. Article and publication date are at www.pnas.org/cgi/doi/10.1073/pnas.180265397.

The availability of complete genome sequences and mRNA expression data for all genes creates new opportunities and challenges for identifying DNA sequence motifs that control gene expression. An algorithm, "MobyDick," is presented that decomposes a set of DNA sequences into the most probable dictionary of motifs or words. This method is applicable to any set of DNA sequences: for example, all upstream regions in a genome or all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter ones of various lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally determined sites (a significance level of 18 standard deviations). When analyzing all of the genes up-regulated during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the expression subclusters individually. Applying MobyDick to the genes derepressed when the general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners.

The availability of complete genome sequences has facilitated whole-genome expression analysis and led to rapid accumulation of gene expression data from high-density DNA microarray experiments.
Correlating the information coded in the genome sequences to these expression data is crucial to understanding how transcription is regulated on a genomic scale. A great challenge is to decipher the information coded in the regulatory regions of genes that control transcription. So far the development of computational tools for identifying regulatory elements has lagged behind those for sequence comparison and gene discovery. One approach has been to delineate, as sharply as possible, a group of 10-100 coregulated genes (1-3) and then find a pattern common to most of the upstream regions. The analysis tools used range from general multiple-alignment algorithms yielding a weight matrix (4-6) to comparison of the frequency counts of substrings with some reference set (1). These approaches typically reveal a few responsive elements per group of genes in the specific experiment analyzed.
Although multiple copies of a single binding site may confer expression of a reporter gene with a minimal core promoter, real genomes are not designed as disjoint groups of genes each controlled by a single factor. Instead, to tailor a gene's expression to many different conditions, multiple control elements are required and signals are integrated (7). It is assumed that genomewide control is achieved by a combinatorial use of multiple sequence elements. For instance, the microarray experiments for yeast have shown quantitatively that most occurrences of proven binding motifs are in nonresponding genes, and many responding genes do not have the motif (8, 9). Thus there are multiple sequence motifs to be found (10). Other variables such as copy number and motif position relative to translation start cannot single out the active genes. Regulation by long-range changes in chromosome structure cannot explain the observations
in several experiments (8, 9), where 500-1,000 genes that respond are uniformly distributed over the 16 chromosomes. Chromatin structure within the regulatory region can influence transcription, but for the few cases analyzed in detail, this structure is itself under the control of cis-acting elements and therefore amenable to the type of analysis we propose (compare ref. 10). The multiple sequence motifs can be missed because of insufficient sequence data if genes are divided into small subgroups and analyzed separately (11). To detect multiple regulatory elements with optimal statistics, all of the regulatory regions in the genome need to be analyzed together. Alternatively, to identify regulatory elements responsive to a specific genetic or environmental change, all genes that respond in a DNA microarray experiment should be analyzed as a group.
We present an algorithm, "MobyDick," suitable for discovering multiple motifs from a large collection of sequences, e.g., all of the upstream regions of the yeast genome or all of the genes up-regulated during sporulation. The approach we took formalizes how one would proceed to decipher a "text" consisting of a long string of letters written in an unknown language in which the words are not delineated. The algorithm is based on a statistical mechanics model that segments the string probabilistically into "words" and concurrently builds a "dictionary" of these words. MobyDick can simultaneously find hundreds of different motifs, each of them present in only a small subset of the sequences, e.g., 10-100 copies within the 6,000 upstream regions in the yeast genome. The algorithm does not need an external reference data set to calibrate probabilities and finds the optimal lengths of motifs automatically. We illustrate and validate the approach by segmenting a scrambled English novel, by extracting regulatory motifs from the entire yeast genome, and by analyzing data generated from a few DNA microarray experiments.

Theory and Methods

The regulatory sequences in a eukaryotic genome typically have elements or motifs that are shared by sets of functionally related genes. These motifs are separated by random spacers that have diverged sufficiently to be modeled as a random background. The sequence data are modeled as the concatenation of words w drawn at random with frequency p_w from a probabilistic "dictionary" D; there is no syntax. The words can be of different lengths, and typically regulatory elements emerge as longer words whereas shorter words represent background. Intuitively, to build a dictionary from a text, one starts from the frequency of individual letters, finds over-represented pairs, adds them to the dictionary, determines their probabilities, and continues to build larger fragments in this way. The algorithm loops over a fitting step to compute the optimal assignment of p_w given the entries w in the dictionary, followed by a prediction step, which adds new words, and terminates when no words are found to be over-represented above a certain prescribed threshold.
Given the entries w of a dictionary (the lexicon), the optimal p_w is found by maximizing Z(S, p_w), the probability of obtaining the sequence S for a given set of normalized dictionary probabilities p_w. For our model,

    Z = \sum_P \prod_w (p_w)^{N_w(P)},    [1]

where the sum is over all possible segmentations P of S, i.e., all possible ways in which the sequence S can be generated by concatenation of words in the dictionary, and N_w(P) denotes the number of times word w is used in a given segmentation. For example, if the dictionary D = {A, T, AT} and the sequence S = TATA, then there are only two possible ways S can be segmented, T.A.T.A and T.AT.A, where "." denotes a word separator. The corresponding Z(TATA) = p_A^2 p_T^2 + p_A p_T p_AT.
For a given set of dictionary words w, we fit the p_w to the sequence S by maximizing the likelihood function in Eq. 1 with the constraint that p_w >= 0 and \sum_w p_w = 1. This condition is equivalent to solving for p_w from the equation

    p_w = \langle N_w \rangle / \sum_{w'} \langle N_{w'} \rangle,    [2]

where ⟨N_w⟩ = p_w (∂/∂p_w) ln Z is the average number of words w in the ensemble defined by Z. The right-hand side of Eq. 2 can, of course, be defined for an arbitrary dictionary, and it is very fruitful to interpret Eq. 2 as a map or discrete dynamical system, i.e., an arbitrary set of p_w is inserted on the right and generates a new set, p_w', on the left. The fixed point of this map is equivalent to the extremal condition on Eq. 1. Eq. 2 can be solved by simple iteration, and at each step Z increases. This can be quite slow because when the right-hand side of Eq. 2 is linearized around the current value of p_w by taking a derivative, the resulting matrix has eigenvalues close to 1. However, given this matrix we can raise it to the power of 2^n by doing n matrix multiplications. These product matrices can be used to approximate the result of 2^n iterations of Eq. 2, and we take the largest n such that Z increases and all p_w stay positive. When the p_w are close enough to the fixed point so that all of the eigenvalues are less than 1 (the map contracts) we switch to Newton's method to find the fixed point to machine accuracy.
A dynamic programming-like technique, similar in spirit to transfer matrix methods in statistical mechanics, permits the computation of Z and its various derivatives (up to the second order) in O(LDl) operations, where L is the total sequence length, D the dictionary size, and l the maximum word length. (Further details are available elsewhere§.) Computation of the averages on the right-hand side of Eq. 2 by Monte Carlo methods would be hopelessly inaccurate for our purposes. If the defining words are known, the statistical errors on p_w are set by the Gaussian fluctuations of the background, e.g., if p_{A,C,G,T} = 1/4 and there is only one other word in the text of length l, its probability is uncertain by (L 4^(-l))^(-1/2). Our algorithm handles dictionaries up to 10^3 words routinely, and we have verified that the maximum is unique by randomizing the starting p_w and reconverging.
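To make the fitting step concrete, here is a minimal Python sketch (our illustration, not the authors' code; the function names forward, backward, expected_counts, and fit are ours). It computes Z of Eq. 1 by a forward recursion, obtains the expected counts ⟨N_w⟩ from a forward-backward pass, and solves Eq. 2 by plain iteration rather than the matrix-power and Newton acceleration described above. For genome-scale data the recursions would need rescaling or log-space arithmetic to avoid underflow.

    from collections import defaultdict

    def forward(S, p):
        # F[i] = sum over all segmentations of S[:i] of the product of word probabilities
        F = [0.0] * (len(S) + 1)
        F[0] = 1.0
        for i in range(1, len(S) + 1):
            for w, pw in p.items():
                k = len(w)
                if k <= i and S[i - k:i] == w:
                    F[i] += F[i - k] * pw
        return F                      # F[len(S)] is Z(S, p) of Eq. 1

    def backward(S, p):
        # B[i] = sum over all segmentations of S[i:] of the product of word probabilities
        B = [0.0] * (len(S) + 1)
        B[len(S)] = 1.0
        for i in range(len(S) - 1, -1, -1):
            for w, pw in p.items():
                k = len(w)
                if i + k <= len(S) and S[i:i + k] == w:
                    B[i] += pw * B[i + k]
        return B

    def expected_counts(S, p):
        # <N_w> = p_w d(ln Z)/d(p_w): average number of times w is delimited as a word
        F, B = forward(S, p), backward(S, p)
        Z = F[len(S)]
        N = defaultdict(float)
        for i in range(len(S)):
            for w, pw in p.items():
                if S[i:i + len(w)] == w:
                    N[w] += F[i] * pw * B[i + len(w)] / Z
        return N

    def fit(S, words, n_iter=500):
        # Iterate Eq. 2, p_w <- <N_w> / sum_w' <N_w'>, from a uniform start.
        # (The paper accelerates this map with matrix powers and Newton's method.)
        p = {w: 1.0 / len(words) for w in words}
        for _ in range(n_iter):
            N = expected_counts(S, p)
            total = sum(N.values())
            p = {w: N.get(w, 0.0) / total for w in words}
        return p

    # Toy example from the text: with D = {A, T, AT}, the string TATA has two
    # segmentations, T.A.T.A and T.AT.A, so Z = p_A^2 p_T^2 + p_A p_T p_AT.
    probs = fit("TATA", ["A", "T", "AT"])

Plain iteration of Eq. 2 is essentially an expectation-maximization update for this segmentation model, which is consistent with the statement above that Z increases at every step.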
In the prediction step, we do statistical tests on longer words based on their predicted frequencies from the current dictionary. For example, if the current dictionary D is {A, T, AT}, then the expected frequency of the length-3 word TAT (assumed to be embedded in a longer string) can be deduced from the various ways it can be made by concatenations of the words in the dictionary: AT.A.T, AT.AT, T.A.T, and T.AT. Note that the occurrence of a word is contingent on a partition; its frequency is the total probability of all partitions that delimit it, which has no simple relation to the number of times the substring occurs in the data. In practice, to check the completeness of the dictionary, we consider all pairs of dictionary words w, w' and ask whether the average number of occurrences of the composite word ww' created by juxtaposition exceeds by a statistically significant amount the number predicted by the model¶, ⟨N_w⟩ p_w' (or equivalently ⟨N_w'⟩ p_w). If so, the composite word is added to the dictionary. Here the statistical significance of longer words is based on the probability of shorter words, thus eliminating the need for an external reference data set to define probability. It should be noted that the background model for the statistical test evolves as the dictionary grows. Most other models assume a static background (1, 4-6, 12).
Although the juxtaposition of dictionary words is a very efficient means for narrowing the search for underpredicted strings, it can miss words that are not built from fragments already in the dictionary. Thus we have designed routines to search for over-represented motifs exhaustively within certain classes¶. In addition to specific strings, we search for clusters of strings defined by including up to two International Union of Pure and Applied Chemistry symbols representing the 12 distinct subsets of two or more bases. Even if the frequency distribution of all strings up to a given length is consistent with Gaussian fluctuations, there can be correlations among the fluctuations and thus meaningful signals in the cluster. Motivated by many biological examples, we also searched for motifs consisting of two short strings separated by a gap, so-called dimers. Once new motifs are added to the dictionary, the new maximization step tests the relevance of all words in parallel and sets to zero the probabilities of less specific words when a more specific one will account for the over-represented string. Once a piece of a motif has been found, it can be grown or contracted (by adding or removing letters), again in parallel for all motifs. In this way words with optimal lengths will emerge.
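The juxtaposition test can be sketched in the same toy framework (again our illustration, reusing the forward/backward tables and expected counts from the previous sketch). The paper computes the mean and standard deviation of the juxtaposition count from the segmentation probabilities; the sketch below substitutes a simpler Poisson-style sqrt(prediction) denominator and applies the 1/N_t acceptance threshold quoted in the footnote below.

    import math

    def juxtaposition_zscores(S, p, F, B, N):
        # F, B: forward/backward tables from the previous sketch; N: expected counts <N_w>
        Z = F[len(S)]
        scores = {}
        for w, pw in p.items():
            for w2, pw2 in p.items():
                k, k2 = len(w), len(w2)
                # ensemble-averaged number of times w is immediately followed by w2 as words
                obs = 0.0
                for i in range(len(S) - k - k2 + 1):
                    if S[i:i + k] == w and S[i + k:i + k + k2] == w2:
                        obs += F[i] * pw * pw2 * B[i + k + k2] / Z
                pred = N.get(w, 0.0) * pw2            # model prediction <N_w> p_w'
                if pred > 0.0:
                    scores[(w, w2)] = (obs - pred) / math.sqrt(pred)   # Poisson-style z-score
        return scores

    def words_to_add(scores):
        # Accept a pair when its one-sided Gaussian tail probability is below 1/N_t,
        # N_t being the number of candidates tested (the footnote's threshold).
        if not scores:
            return []
        n_t = len(scores)
        tail = lambda z: 0.5 * math.erfc(z / math.sqrt(2.0))
        return [w + w2 for (w, w2), z in scores.items() if tail(z) < 1.0 / n_t]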
Having the dictionary still does not permit one to "read" the text in a unique way, because the decomposition into words is probabilistic, and there are many ways to segment the data into words; yet key words do emerge. Let Ξ_w be the number of matches to the word w anywhere in the sequence, and ⟨N_w⟩ the average number of times the string w is delimited as a word among all segmentations of the data; then the ratio ⟨N_w⟩/Ξ_w serves as a quality factor. When the ratio is close to one, almost all occurrences of the string w are attributed to the word w itself, instead of being made by concatenations of other words; thus the word w can be clearly delineated from the background. The dictionary does not distinguish between signal and background, although shorter words typically have lower quality. Methods based on frequency counts will register as significant (i.e., over-represented) any substring overlapping with a dictionary word, whereas our fit will assign p_w > 0 only to the dictionary word itself.

¶To convert the observed number of words w created by juxtaposition, the under-represented n-mer counts in the sequence, Ξ_w, the clustering of n-mers, etc., into a probability, we subtract the mean and divide by the standard deviation (both as predicted by the current dictionary) to get a z-score, σ_w. The threshold value for adding a word w then is given by P(σ_w) < 1/N_t, where P(σ_w) is the probability of having a z-score larger than σ_w based on a unit Gaussian, and N_t is the total number of words in the category being examined. For juxtaposition the expected mean and deviation are computed from the probabilities of the various segmentations, as was done for Eq. 2 and its derivative. For the n-mer counts, we simply prepared a large number of synthetic copies of the sequence data by drawing from the dictionary and counted the number of copies of each string. For 1-2 Mb of data, n <= 8 is the limit imposed by statistics. Similarly for dimers we examined all pairs of n-mers (of length n <= 5) separated by a gap of 3-30 N symbols (matching any base).

§Bussemaker, H. J., Li, H. & Siggia, E. D. (2000) Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, Aug. 18-23, 2000, La Jolla, CA.
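The synthetic-copies null for the n-mer counts described in the footnote can be sketched as follows (illustrative code with hypothetical function names; the number of replicates and other details are arbitrary choices, not the authors' settings).

    import random
    from collections import Counter

    def synthetic_sequence(p, length):
        # concatenate words drawn i.i.d. with probabilities p_w until `length` bases are reached
        words, probs = zip(*p.items())
        out, n = [], 0
        while n < length:
            w = random.choices(words, weights=probs, k=1)[0]
            out.append(w)
            n += len(w)
        return "".join(out)[:length]

    def nmer_counts(S, n):
        return Counter(S[i:i + n] for i in range(len(S) - n + 1))

    def nmer_zscores(S, p, n, replicates=100):
        # z-score of each observed n-mer count against the dictionary-generated null
        obs = nmer_counts(S, n)
        sums, sqsums = Counter(), Counter()
        for _ in range(replicates):
            c = nmer_counts(synthetic_sequence(p, len(S)), n)
            for kmer, cnt in c.items():
                sums[kmer] += cnt
                sqsums[kmer] += cnt * cnt
        z = {}
        for kmer, o in obs.items():
            mean = sums[kmer] / replicates
            var = max(sqsums[kmer] / replicates - mean * mean, 1e-9)   # guard against zero variance
            z[kmer] = (o - mean) / var ** 0.5
        return z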

Table 1. Known cell cycle sites (lines 1-6; from ref. 8) and some metabolic sites (lines 7-12; from ref. 1) that match words from our genomewide dictionary

Name    Consensus sequence      Dictionary words
MCB     ACGCGT                  AAACGCGT  ACGCGTCGCGT  CGCGACGCGT  TGACGCGT
SCB     CRCGAAA                 ACGCGAAA
SCB'    ACRMSAAA                ACGCGAAA  ACGCCAAA  AACGCCAA
Swi5    RRCCAGCR                GCCAGCG  GCAGCCAG
SIC1    GCSCRGC                 GCCCAGCC  CCGCGCGG
MCM1    TTWCCYAAWNNGGWAA        TTTCCNNNNNNGGAAA
NIT     GATAAT                  TGATAATG
MET     TCACGTG                 RTCACGTG  TCACGTGM  CACGTGAC  CACGTGCT
PDR     TCCGCGGA                TCCGCGG
HAP     CCAAY                   AACCCAAC
MIG1    KANWWWWATSYGGGGW        TATATGTG  CATATATG  GTGGGGAG
GAL4    CGGN11CCG               CGGN11CCG


The strings in the dictionary words that match the consensus sequence are underlined.

Results

It is instructive to validate the performance of our algorithm with English text. We took a portion of the novel Moby Dick comprising about 10^5 characters (the first 10 chapters) and reduced it to a lowercase string containing no spaces or other punctuation characters. The starting dictionary was just the English alphabet; words were added by juxtaposition only and were required to have at least two copies in the data. The original text had 4,214 unique words, of which 2,630 were a single copy, which our algorithm was forced to break up. The final dictionary had 3,600 words. Among the most significant 1,800 were 800 English words, 800 concatenations of English words, and only 200 fragments. Virtually all of the 1,600 text words with two or more copies occurred somewhere in our dictionary. To mimic the biological situation, we also introduced random letters (whose frequency matched the text) between each word to increase L by a factor of 3. The final dictionary then had 2,450 words. Among the most significant 1,050 were 700 English words and 40 composite words.
In a sense English is too simple: as the probability of obtaining longer words by chance is very small, simply checking for all strings with two or more copies would identify many words. Biological data are different: virtually all 8-mers have two or more copies in the combined upstream regions in yeast, yet less than 1% occur in our dictionary, as described below.
A dictionary was prepared for all of the upstream regions in the yeast genome. To maximize the homogeneity of our data, we define control regions to extend upstream from the translation start site to the next coding region on either strand, but not more than 600 bp. All exact repeats longer than 16 bases were removed with the program REPUTER (http://bibiserv.techfak.uni-bielefeld.de/reputer/) and fragments of fewer than 100 bases were ignored. This operation removed 14% of the data.
Starting from a single-base dictionary, words were added via an exhaustive search procedure. We experimented with many ways of adding words to the dictionaries. For the data presented here, we added over-represented n-mers for lengths increasing from n = 1 to n = 8; subsequently, words in the dictionary were allowed to grow one base at a time if the extended word was over-represented. All significant motifs with at most two International Union of Pure and Applied Chemistry symbols and a total of eight characters then were added. We imposed a minimum of two copies for any dictionary entry.
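Schematically, this protocol is an outer loop of the following form (a sketch that reuses the fit and nmer_zscores functions from the earlier sketches; the word-growth and IUPAC stages are reduced to a comment, and the acceptance threshold follows the 1/N_t rule of the Theory and Methods footnote).

    import math

    def build_dictionary(S, max_n=8, min_copies=2, replicates=50):
        words = ["A", "C", "G", "T"]                  # single-base starting dictionary
        p = fit(S, words)
        for n in range(2, max_n + 1):
            z = nmer_zscores(S, p, n, replicates)     # synthetic-copies null from the sketch above
            n_t = len(z)
            for kmer, score in z.items():
                tail = 0.5 * math.erfc(score / math.sqrt(2.0))
                if tail < 1.0 / n_t and S.count(kmer) >= min_copies and kmer not in words:
                    words.append(kmer)
            p = fit(S, words)                         # refit all probabilities in parallel
        # The full procedure then grows or contracts words one base at a time while the
        # extension stays over-represented, and adds IUPAC-degenerate motifs and dimers.
        return p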
The final dictionary had 1,200 words, of which 100 were clusters of similar oligonucleotides. On average, two-thirds of the sequence data were segmented into single-letter words and an additional 15% into words of length 4 or less. About 500 dictionary entries fell above a plausible significance level as defined by the quality factor. These included good matches to about half of the motifs found in refs. 1, 8, and 9, including the MCB, SCB, and MCM1 cell-cycle motifs and the URS1 motif for sporulation (Table 1). We also found about 400 words in the form of two short segments separated by a gap, the most significant of which clustered together to generate the motifs in Table 2.

Table 2. A sample of the motifs assembled by clustering the most statistically significant word pairs of length 4-5 with a spacing of 3-30 from a genomewide dictionary

Motif                  Factor (13)    -log10(P)
KHCGTHTNNNNTGAYW       ABF1           28
ATRTCACTNNNNACGD       RAP1, ABF1     52
TTTCCNNNNNNGGAAA       MCM1           6
ATACANNNNTACAT         ?              10
CCGTNNNNNGCGAT         ?              9
GGGCNNNNNNACCCG        ?              7
CACGTGNNNNNCACGTG      ?              7

The clusters were sharp enough that all the word pairs in a cluster aligned. The probability cutoff is log10 of the total number of the pairs sampled, (533)^2 for the heterodimers and 30 x 533 for the homodimers, where 533 is the total number of words of length 4-5. The probability score for the most significant dimer in the cluster is tabulated.

Table 3 gives a more comprehensive measure of the extent to which our dictionary is predicting functional biological sites by comparing it with a database of experimentally determined sites (13). This database of experimentally confirmed regulatory sites in yeast has entries covering about 200 genes and about 110 factors. Almost all sites occur upstream of the translation start. For comparison with this database, we filtered our dictionaries based on a significance score [N_w/Ξ_w > 0.2 and Ξ_w <= 100, to filter out poly(A) and similar strings] and matched (by inclusion) against the 443 nonrepetitive sites of length 5-30. The number of sites hit by chance is calculated by marking the dictionary words on the sequence data (to account for correlations) and then treating the database entries as intervals to be placed at random.
Of the 443 nonredundant sites in the database, we hit 114 with a combination of compact words (words with no gaps in the middle) and dimers (two short segments separated by a variable gap), which is 18 standard deviations beyond expectation. As an additional control for the statistical significance of our matches, we scrambled the dictionary words by randomly permuting the letters in each word, and then matched them against the database.
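The chance-hit estimate can be approximated by a simple Monte Carlo along the following lines (our interpretation of the procedure, with hypothetical function names; a hit is scored when a randomly placed interval of the same length as a database site fully contains a marked dictionary word, matching the "by inclusion" criterion; strand handling and other details are omitted).

    import random

    def mark_word_matches(S, words):
        # (start, end) spans of every dictionary-word occurrence in S
        spans = []
        for w in words:
            start = S.find(w)
            while start != -1:
                spans.append((start, start + len(w)))
                start = S.find(w, start + 1)
        return spans

    def chance_hits(S, words, site_lengths, trials=1000):
        # place each database site, as an interval of its own length, at a random position
        # and score a hit when the interval fully contains some marked dictionary word
        spans = mark_word_matches(S, words)
        counts = []
        for _ in range(trials):
            hits = 0
            for length in site_lengths:
                i = random.randrange(0, len(S) - length + 1)
                if any(i <= a and b <= i + length for a, b in spans):
                    hits += 1
            counts.append(hits)
        mean = sum(counts) / trials
        sd = (sum((c - mean) ** 2 for c in counts) / trials) ** 0.5
        return mean, sd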

Table 3. Statistics of matches to the yeast promoter database of ref. 13

Data set               Observed    Expected +/- SD
Compact words          94          23 +/- 4.8
Dimers                 26          2.7 +/- 1.8
Scrambled dictionary   33          14 +/- 3.3
Ref. 12                30          9 +/- 2.9

The first two entries list the matches by the compact words (words with no gap) and dimers (two short segments separated by a variable gap) in our dictionary. The third entry gives the statistics after the words are scrambled by a random permutation of the letters in each word. For the last entry, we cut off the list of motifs from ref. 12 ordered by significance when the total number of Ξ counts in the data equaled that for the dictionary set (approx. 21,000). All other filters were identical.

In contrast to less systematic assignments of probability, our dictionary correctly predicts the copy number of all strings of length 8 or less, plus certain classes of clusters of which the members are related by a couple of single-base mutations. Thus we can say that for the categories of motifs that we searched for exhaustively, about 74% of the experimental sites are not statistically over-represented. This statistic, of course, does not exclude the possibility that some of these experimental sites escape our detection because of their great variability or that some of the experimental assignments are incorrect. It is difficult to assess our false-positive rate, because the database covers such a small fraction of the genome.

The above whole-genome analysis uses only information from the genome sequence and is independent of context. For cases in which the responses of all of the genes to a specific genetic or environmental perturbation are measured, a context-dependent dictionary can be constructed by applying MobyDick to a subset of genes that are active under a particular condition, and words in the dictionary can be more readily correlated with the existing knowledge base. We illustrate this approach by analyzing two DNA microarray experiments carried out on yeast: sporulation and TUP1 deletion.
A dictionary was built for the approximately 500 genes that are up-regulated during sporulation (9). Using the most stringent criterion for word addition, the sporulation dictionary had only 250 unique words, of which 150 were significant; 66% of the data were segmented into single-letter words and an additional 22% by words of length 4 or less.
The two known principal sporulation elements, URS1 and MSE, have consensus motifs 5'-DSGGCGGC and 5'-CRCAAAW (9). Dictionary words that strongly overlap with these patterns are TCGGCGGC, GGCGGCAAA, TGGGCGGCTA, CTTCGGCGGC, GGCGGCTAAA, and CACAAAA, CGTCACAAA, GTCACAAAA, where the overlap with the consensus sequence is underlined, plus several words that match the reverse complements. We have verified in this instance that the frequencies of all 8-mers containing GGCGGC or the MSE consensus are fit by the dictionary to within expected statistical fluctuations. For example, there are 35 copies of CGCAAAW in our data set vs. 28 predicted from the dictionary, although CGCAAAW itself is not a dictionary word.
We have predicted many additional motifs in the sporulation gene data set as well as identified known ones. There are about 20 clusters of dictionary words (with 10 or more copies among the sporulation genes) and 16 specific words whose probability of occurrence in some temporal classes as defined in ref. 9 is less than 10^-4, based on their genomewide frequency. These are likely to be sporulation-specific elements, for example, elements mediating mid-late or late modes of transcription. A subset is shown in Table 4. The TGTG cluster is similar to the one summarized by Mitchell (14). We also matched the significant words in the sporulation dictionary against the yeast promoter database (13) and hit 56 sites; 14 are expected by chance. Thus many words in the sporulation dictionary are biologically meaningful, yet not specific to sporulation. A parallel dictionary fit to the 3' untranslated region gave much less signal.
An interesting example of combinatorial control in yeast is the general repression system involving the Tup1-Ssn6 repressor (15). This repressor does not bind to DNA itself, but is brought to specific gene sets by a number of sequence-specific DNA-binding proteins. For example, Tup1-Ssn6 binds to the alpha2-Mcm1 complex, which recognizes specific regulatory elements in a-specific genes and represses these genes in alpha and a/alpha cells. Similarly, the combination of Tup1-Ssn6 and the DNA binding protein Mig1 represses glucose-repressible genes when glucose concentration is high. We have built a dictionary for the approximately 220 genes that are derepressed by a factor of more than 2-fold when TUP1 is deleted (16). It contains 160 significant motifs, of which 42 occurred in the tup1 deletion data set with high frequency, with chance probability less than 10^-4 based on their genomewide counts. These are potential binding sites for the regulatory partners of Tup1-Ssn6. Among our top five motifs we found the binding sites for Mig1, alpha2-Mcm1, and a motif that is specific to genes in the seripauperin family. By contrast, the Mig1 binding site was detected in the Tup1 deletion data set in ref. 1 by restricting to a subset of 25 genes that also are up-regulated during diauxic shift.
It is known that Tup1-Ssn6 binds to Rox1 to repress hypoxic genes (17). However, nine of 11 genes listed in ref. 17 exhibit less than a 2-fold change in response to the Tup1 deletion and thus are not included in the data set. As a result, we missed the Rox1 site in the Tup1 dictionary, but we did capture several versions of the Rox1 site in our genomewide dictionary. A similar situation occurred for a1/alpha2, which represses haploid-specific genes. These results suggest that Tup1 deletion is probably not sufficient to derepress certain classes of genes and that certain activators also are required. We also missed the binding site for another Tup1 partner, Crt1, which represses DNA damage-inducible genes, of which several examples are in the data set. The frequency of the Crt1 site is fit well by our dictionary and thus is not over-represented.

Table 4. Several of the most significant word clusters from our sporulation dictionary that also are over-represented in one or more of the subgroups of the sporulation genes

Motif                                                    Observed    Expected    Classes
TGTCACCT|TGTGTCA|TGTGTCAC|TTGTGTC                        35          7           4
GTCAGTAA|TCAGTAAT                                        14          3           4
GACACA|GCCACA                                            52          2           4
CGCGACGC|GACGCGA|GACGCGAA|GAGGCGA|GAGGCGAC|GCGACGCG      18          3           1, 2, 5
GCGGCTAA|GGCGGCAAA|GGCGGCTAAA                            10          1           1
CGTCACAAA|GTCACAAAA                                      17          1           4

Seven subgroups (temporal classes) were defined in ref. 9 based on their temporal expression profiles. A motif is defined by a cluster of similar dictionary words; the remaining columns represent the temporal class(es) in which the motif is over-represented (column 4), the number of copies of the motif in the corresponding temporal class(es) (column 2), and number expected based on the total counts in all promoters (column 3). The last two entries are elaborations of the URS1 and MSE motifs that better discriminate responding from nonresponding genes.

Discussion

We have implemented an algorithm that does a maximum-likelihood fit of data sets as large as several megabases to a probabilistic model: words drawn at random from a dictionary with prescribed frequency. The algorithm loops over a frequency estimation step for a known set of words, followed by the generation of new words by comparing the model predictions with the data. The first step quantifies the probabilities attached to each segmentation of the data, the number of which scales exponentially with the data length. The average base in the sequence typically will be part of several words. Although the probability estimation step converges to a unique global optimum, the word addition step is more heuristic. Namely, we only check that the model fits the data for a limited set of patterns, i.e., n-mers (up to length 8), clusters thereof, dimers, and words formed by juxtaposing dictionary entries.
The dictionary words can be ordered by their quality factor, ⟨N_w⟩/Ξ_w, which represents the fraction of occurrences of the string w in the data that are delineated as words by our model, not the result of a chance juxtaposition of other words or word fragments. It is natural to use the quality factor to distinguish words of potential biological significance from those that parameterize the "background." Our fit is very sensitive; p_w will be nonzero if Ξ_w differs by more than sqrt(Ξ_w) from the value expected when the word w is absent. Words at the limit of detection will have a quality factor that scales as 1/sqrt(L). Short words invariably have low quality factors, but certain long words such as poly(A) do also, when there is a large size range of such sites in the data.
Our algorithm has extracted with high statistical significance many known cis-regulatory elements from data sets as large as the entire yeast genome, using no other empirical information. By contrast, for the much smaller sporulation data set, MEME (6) only found the URS1 and MSE sequences (and a weak third motif) after the data were further divided into seven clusters (9). We found these, plus about 20 others, several of which matched experiments (14, 18, 19), plus a statistically significant portion of ref. 13. Algorithms that contrast oligonucleotide frequencies have extracted regulatory elements from small groups of coregulated genes (1), but cannot be applied genomewide and will register many substrings overlapping with the true motif. They also require one to preselect a cluster of genes to examine rather than letting the co-occurrence of high-quality words define the clusters. Our algorithm also provides a model for the data in that it generates the same statistics as the original sequence. We have verified this for the number of occurrences of all n-mers up to length 8, clusters defined by motifs with a couple of International Union of Pure and Applied Chemistry symbols, and dimers composed of words of length 4-5 separated by a fixed number of N-symbols.
Searching for regulatory elements by enumerating repeated patterns without reference to an internal or intrinsic probabilistic model can be biased if a random sample of the yeast genome is used to assign significance, because gross differences between coding and noncoding dominate the comparison (12) (Table 3). A code that enumerates regular expressions selects patterns based on absolute counts rather than probability and requires input of several parameters (20). Data compression algorithms, despite a similarity in terminology (e.g., the adaptive dictionary of the Ziv-Lempel code family), satisfy very different design criteria (an invertible coding constructed on a single pass; ref. 21) than the data model that we fit to. They would attempt to compress data constructed by randomly drawing single bases (e.g., p_A = p_T = 0.4, p_C = p_G = 0.1), by encoding repeated substrings [e.g., poly(A) and poly(T)], whereas our fit would just return the single-base frequencies. Formal models of language as schematized by the Chomsky hierarchy have been applied to DNA sequence for some time and have proved useful for discovering sequences with interesting secondary structure at the RNA level, but syntactic rules for regulatory regions have yet to be defined (22). Hidden Markov models (23) are a common way to segment biological data, but they generally are used with relatively few segment types (e.g., promoter, exon, intron), each described by many parameters; we work with many segments, most described by a single parameter.
Our algorithm admits many elaborations, such as identifying reverse complements and incorporating an additional degree of freedom into our maximum-likelihood method to segment the data into regions each with a different dictionary. Weight matrices also can be introduced in this way. Even in its present form, our algorithm, if allowed to fit an entire yeast chromosome with all words of length 4 or less, will put 99% of the weight into length-3 words (codons) and place almost all of the other length-4 words into noncoding regions.
We cannot yet treat long, fuzzy patterns such as protein motifs over the 20-letter alphabet, for which weight matrix methods were designed (4-6). There are many interesting motifs in yeast, however, which are less than 10 bp long and have only a few variable sites, which we can readily detect.

We thank M. Zhang for making the data files from his yeast promoter database available. We thank C. Henley, Ira Herskowitz, and Andrew Murray for their advice and comments. The National Science Foundation supported E.D.S. and H.J.B., and H.L. was a W. M. Keck fellow at The Rockefeller University.

1. van Helden, J., Andre, B. & Collado-Vides, J. (1998) J. Mol. Biol. 281, 827-842.
2. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868.
3. Roth, F. R., Hughes, J. D., Estep, P. W. & Church, G. M. (1998) Nat. Biotechnol. 16, 939-945.
4. Stormo, G. D. & Hartzell, G. W. (1989) Proc. Natl. Acad. Sci. USA 86, 1183-1187.
5. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. & Wootton, J. C. (1993) Science 262, 208-214.
6. Bailey, T. L. & Elkan, C. (1995) Machine Learning J. 21, 51-83.
7. Chu, C. H., Bolouri, H. & Davidson, E. H. (1998) Science 279, 1896-1902.
8. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9, 3273-3297.
9. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998) Science 282, 699-705.
10. Struhl, K., Kadosh, D., Keaveney, M., Kuras, L. & Moqtaderi, Z. (1998) Cold Spring Harbor Symp. Quant. Biol. 63, 413-421.
11. Zhang, M. Q. (1999) Genome Res. 9, 681-688.
12. Brazma, A., Johnassen, I., Vilo, J. & Ukkonen, E. (1998) Genome Res. 8, 1202-1215.
13. Zhu, J. & Zhang, M. Q. (1999) Bioinformatics 15, 607-611.
14. Mitchell, A. P. (1994) Microbiol. Rev. 58, 56-70.
15. Wahi, M., Komachi, K. & Johnson, A. D. (1998) Cold Spring Harbor Symp. Quant. Biol. 63, 447-457.
16. DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997) Science 278, 680-686.
17. Deckert, J., Torres, A. M., Hwang, S. M., Kastaniotis, A. J. & Zitomer, R. S. (1998) Genetics 150, 1429-1441.
18. Gailus-Durner, V., Xie, J., Chintamaneni, C. & Vershon, A. (1996) Mol. Cell. Biol. 16, 2777-2786.
19. Ozsarac, N., Straffon, M., Dalton, H. E. & Dawes, I. W. (1997) Mol. Cell. Biol. 17, 1152-1159.
20. Rigoutsos, I. & Floratos, A. (1998) Bioinformatics 14, 55-67.
21. Witten, I. H., Moffat, A. & Bell, T. C. (1999) Managing Gigabytes: Compressing and Indexing Documents and Images (Morgan Kaufmann, San Francisco).
22. Searls, D. B. (1997) Comput. Appl. Biosci. 13, 333-344.
23. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis (Cambridge Univ. Press, Cambridge).
