Finding Patterns in Biological Sequences
Broňa Brejová, Chrysanne DiMarco, Tomáš Vinař
Department of Computer Science, University of Waterloo

Sandra Romero Hidalgo
Department of Statistics, University of Waterloo

Gina Holguin, Cheryl Patten
Department of Biology, University of Waterloo

Project report for CS798g, Fall 2000
Abstract

In this report we provide an overview of known techniques for the discovery of patterns in biological sequences (DNA and proteins). We also provide biological motivation and describe methods for biological verification of such patterns. Finally, we list publicly available tools and databases for pattern discovery. An on-line supplement is available at http://genetics.uwaterloo.ca/~tvinar/cs798g/motif.
Contents

1 Introduction
2 Biological Motivation for Pattern Discovery
  2.1 Pattern discovery in proteins
  2.2 Pattern discovery in non-coding regions
  2.3 Tandem Repeats
3 Algorithms
  3.1 Introduction
    3.1.1 Computer science questions
    3.1.2 Input sequences
    3.1.3 Types of patterns
  3.2 Exhaustive search
    3.2.1 Enumerating all patterns
    3.2.2 Exhaustive search on graphs
  3.3 Creating long patterns from short patterns
    3.3.1 TEIRESIAS algorithm
    3.3.2 Work related to TEIRESIAS algorithm
  3.4 Iterative heuristic methods
    3.4.1 Gibbs sampling
    3.4.2 Other iterative methods
    3.4.3 From iteration to PTAS
  3.5 Machine learning methods
    3.5.1 Expectation maximization
    3.5.2 Hidden Markov models
    3.5.3 Improvements of HMM models
  3.6 Methods using additional information
    3.6.1 Finding motifs in aligned sequences
    3.6.2 Global properties of a sequence
    3.6.3 Using phylogenetic tree
    3.6.4 Use of secondary/tertiary structure
  3.7 Finding Tandem Repeats
4 Statistical Significance
  4.1 z-score
  4.2 Information content
  4.3 Sensitivity, specificity and related measures
5 Biological Verification
6 Success Stories
  6.1 Tuberculosis
  6.2 Coiled coils in histidine kinases
  6.3 Conclusion
7 Conclusion
References
A Publicly Available Software Tools
B Databases of Patterns and Motifs
  B.1 Databases of protein domains
  B.2 Databases of transcription factors
  B.3 Other databases
1 Introduction

The goal of our project was to study known techniques for discovering patterns in biological sequences. The patterns we want to find usually correspond to functionally or structurally important elements in proteins or DNA sequences. The underlying assumption is that these important regions are better conserved in evolution and therefore occur more frequently than expected. Pattern discovery is one of the fundamental problems in bioinformatics. It can be used in multiple sequence alignment, protein structure and function prediction, characterization of protein families, promoter signal detection, and other areas.

An introduction to the problem and the biological motivation for motif discovery can be found in Section 2.

Section 3 gives an overview of known algorithms. We have concentrated on the problem of discovering previously unknown patterns. From a biological point of view it is equally important to have tools for finding known patterns in new sequences; however, this is usually not so interesting from an algorithmic point of view. Therefore we touch on this issue only lightly, in cases where it is not obvious. We also do not try to compare individual methods based on their performance or their ability to find the most relevant patterns. The reason is that individual approaches vary widely in the type of pattern they try to find, performance guarantees and so on. Even authors of experimental comparative studies such as [Hudak and Mcclure, 1999] have difficulty determining which method performed better on a given dataset. It is impossible to do so based only on descriptions of algorithms. Still, we have tried to choose for our study the algorithms that seem to contain the most interesting ideas.

Patterns found by algorithms may or may not be the patterns which we really want to find. Sections 4 and 5 provide means to assess the quality of the found patterns. In particular, Section 4 discusses measures of statistical significance of a pattern and Section 5 shows how to verify patterns using biological experiments.

To illustrate the importance of pattern finding methods we have included a few examples where such software tools led to new biological discoveries (Section 6).

Finally, Section 7 closes our discussion, and Appendices A and B list publicly available software tools for motif discovery and databases of motifs.

2 Biological Motivation for Pattern Discovery

Nucleotide and protein sequences contain patterns or motifs that have been preserved through evolution because they are important to the structure or function of the molecule. In proteins, these conserved sequences may be involved in the binding of the protein to its substrate or to another protein, may comprise the active site of an enzyme or may determine the three dimensional structure of the protein. Nucleotide sequences outside of coding regions in general tend to be less conserved among organisms, except where they are important for function, that is, where they are involved in the regulation of gene expression. Discovery of motifs in protein and nucleotide sequences can lead to determination of function and to elucidation of evolutionary relationships among sequences.

2.1 Pattern discovery in proteins

With the accumulation of nucleotide sequences for the entire genomes of many different organisms comes the need to make sense out of all of the information. Attempts have been made to organize all of the proteins encoded in these genomic sequences into families based on the presence of common signature sequences [Linial et al., 1997, Rigoutsos et al., 1999]. Members of protein families are often characterized by more than one motif (on average each family has 3-4 conserved regions) which increases the certainty that a protein has been assigned to a correct family [Nevill-Manning et al., 1998]. Hierarchical trees of protein clusters often reveal functional and evolutionary relationships among proteins. Starting with a single "seed" sequence, protein families can be characterized in order to find ancient ancestor sequences [Neuwald et al., 1997]. First, proteins related to a query sequence are found by searching the databases for similar sequences. Sequences revealed from this initial screen are then used as query sequences to search for other family members and the process is repeated to exhaustion. All of the sequences are aligned in order to identify conserved regions which are used to generate models that represent ancient conserved regions. The rationale behind this approach is that if protein A is related to protein B, and B is related to C, then A is also related to C. Model refinement parallels divergent evolution in that each subsequent alignment reveals progressively more distant relatives.
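
The iterative search described above (query, collect hits, reuse the hits as queries, repeat to exhaustion) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of [Neuwald et al., 1997]: the function names kmer_set, similar and expand_family are hypothetical, and the similarity test (Jaccard index of shared 3-mers with an assumed threshold of 0.6) merely stands in for a real database search and alignment step.

```python
from collections import deque


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similar(a, b, k=3, threshold=0.6):
    """Toy similarity test (an assumption, not a real homology search):
    Jaccard index of the two k-mer sets compared against a threshold."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return False
    return len(ka & kb) / len(ka | kb) >= threshold


def expand_family(seed, database, is_similar=similar):
    """Collect every database sequence reachable from the seed through a
    chain of pairwise-similar sequences (breadth-first search)."""
    family = {seed}
    queue = deque([seed])
    while queue:                      # "repeated to exhaustion"
        query = queue.popleft()
        for candidate in database:
            if candidate not in family and is_similar(query, candidate):
                family.add(candidate)
                queue.append(candidate)  # each hit becomes a new query
    return family
```

For example, with the hypothetical database ["ACDEFGHIK", "CDEFGHIKL", "DEFGHIKLM", "WWWWYYYYV"] and the first sequence as the seed, the third sequence is not similar enough to the seed under this toy measure, but it is still collected because it is similar to the second one. This is the "A is related to B, B is related to C" rationale in miniature; it also shows why such chaining can admit increasingly distant relatives.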
By this method, proteins are assigned to a family based on sequence homology as determined primarily by alignment. If an alignment finds homology between a query protein and a particular family of proteins, a phylogenetic relationship between them is automatically assumed [Barker et al., 1996]. There are two problems with this assumption: 1) significant sequence similarities are not always indicative of close evolutionary relations, and 2) despite limited sequence homology, proteins can have structural and mechanistic similarities, and even common ancestry not apparent through alignment. Perhaps structural information should also be considered when attempting to classify proteins that are highly divergent in homology, yet functionally equivalent.

Such an approach has been used to identify motifs in proteins that may be related through convergent evolution. Leucine zipper sequences, involved in protein dimerization, appear in diverse families that lack a common ancestor and thus may be an example of a convergent motif. These regions lack sequence similarity, being comprised of a repeating pattern of a single conserved leucine residue separated by six highly variable amino acids. This pattern repeats on average about four times in a protein. Because of the high variability in the sequence of the motif, pattern discovery is combined with secondary structure prediction; leucine zippers form coiled coil structures that are involved in dimer formation [Bornberg-Bauer et al., 1998]. Further confounding pattern prediction, leucine is sometimes replaced with methionine, valine or isoleucine. This flexibility in motif sequence which reflects flexibility in biological function (proteins with leucine zipper domains can often form different combinations of hetero- and homodimers) must be considered in pattern prediction for proteins.

Identification of patterns that have been conserved through evolution can lead to the association of these sequences with protein function or structure. A first step to function prediction is to look for sequence features that are common to groups of proteins with a specified activity but are absent from proteins without the activity. Savoie et al. [Savoie et al., 1999] used such an approach to develop a recognition rule for sequences that determine whether a peptide will activate a T-cell response. These motifs are essentially antigenic determinants that elicit an immune response and can be used to develop vaccines. The premise of all attempts to assign function to unknown proteins by pattern recognition is that highly conserved sequences have been preserved through evolution because they are important to the function or structure of the protein. While intuitively this seems a valid assumption, it is possible that some conserved sequences may simply correspond to regions with a lower rate of mutation.

Although functional motifs may not be apparent in the protein primary sequence when they consist of single conserved amino acid residues separated by long, variable regions, these conserved residues may come together to form a functional group when the protein is folded into its three dimensional structure [Califano, 2000]. On the other hand, patterns of conserved sequences can often highlight elements that are responsible for structural similarity between proteins and can be used to predict the three dimensional structure of a protein.

Because some amino acids share similar characteristics such as size, charge or hydrophobicity, substitutions are often permitted in protein motifs even where residues are important to structure or function. Nevill-Manning et al. [Nevill-Manning et al., 1998] describe a method for discovering conserved motifs that characterize a protein family but are somewhat flexible in the amino acids allowed in particular positions within the motif. Groups of amino acids occurring at each position in a motif with significant frequency were identified and used to characterize subsets of motifs that are biologically relevant. While a motif should be sensitive enough to allow identification of new family members with a minimum of false negatives, there may be a tradeoff in terms of specificity; lower specificity leads to the identification of false positive sequences.

Once biological dictionaries of protein sequence patterns are constructed (see Appendix B for examples of protein motif databases), they can be used to predict the function of newly discovered or unknown proteins, or to screen genomic databases for other proteins with similar function. Chloroplasts are the photosynthetic organelles of plant cells that also perform many other functions such as hormone synthesis. More than 200 proteins are involved in chloroplast activity; some of these are encoded on the chloroplast genome while others are encoded on the nuclear genome of the plant cell. The latter proteins are synthesized in the cytosol and are then transported to the chloroplast. Thus, targeting of proteins to the chloroplast and localization of proteins within the chloroplast are complex and are specified by sequences within the protein. Peltier et al. [Peltier et al., 2000] systematically characterized chloroplast proteins by two dimensional gel
electrophoresis, mass spectrometry, and protein sequencing. Using a combination of de novo motif discovery and detection of known motifs, they identified motifs that specify the function and location of many of the chloroplast proteins.

Protein function can be determined by detection of characteristic motifs even in the absence of homology outside of the motif sequence. RNA genomes found in many viruses such as HIV and Ebola, replicate frequently and rapidly and therefore accumulate errors at a high rate. Thus, genomic sequences among these viruses tend to be highly divergent. Although biological and biochemical data indicate a common ancestry, there is often no statistically significant homology among sequences. To predict the function of a protein from an RNA virus, one could look for conserved motifs [Mcclure and Kowalski, 1999]. Viral genomes are typically small and encode only a few proteins necessary for viral replication. Conserved motifs are therefore likely to specify the function or structure of a protein because if the sequence deviated significantly from the consensus sequence, then the protein would likely not be functional and the virus would be unable to reproduce.

2.2 Pattern discovery in non-coding regions

Similar to patterns in proteins, motifs in non-coding sequences can be used to determine the function of nucleotide sequences on a global level, for example, to find all promoters in a genome, or to determine specific function such as regions involved in tissue-specific regulation of gene expression. Non-coding sequences are generally not well conserved, therefore, the presence of a conserved sequence in the region upstream of a gene usually implies that it is functionally important. Finding all promoters in large genomic sequences necessitates the identification of features that are common to all promoters but are not present in non-promoter sequences. This is a difficult problem, especially in eukaryotic organisms which do not have a single core promoter and are usually associated with multiple regulatory factors. Some approaches include the identification of global signals that interact with RNA polymerase and general transcription factors (e.g., TATA and CAAT boxes, CpG islands), the detection of upstream regions with a high density of transcription factor binding sites (although these are often not clustered), and the identification of sequence characteristics that influence DNA three dimensional structure (regions downstream of the TATA box tend to be highly bendable while regions upstream have low bendability) [Pedersen et al., 1999].

Often the goal is not only to locate promoters, but to understand how genes under the control of these promoters are regulated. This involves identifying specific regulatory sequences in promoter regions, for example, that bind to specific transcription factors in response to a biological signal. Known transcription factor binding motifs can be modeled to find additional binding sites in a genome and thereby identify genes that are regulated in the same manner. Geraghty et al. [Geraghty et al., 1999] were able to identify new genes that were coordinately regulated by the fatty acid oleate in the genome of Saccharomyces cerevisiae. These genes have a common upstream regulatory sequence known as the oleate response element (ORE). The S. cerevisiae genome database was screened for this element, constraining the search to within 500 bases upstream of ORFs greater than 100 codons. Because genes controlled by an ORE were expected to be targeted to the peroxisome (membrane bound organelles involved in, among other things, fatty acid metabolism), the coding sequences downstream of predicted OREs were screened for the presence of a peroxisome motif. The proteins encoded by these genes provided new insights into interesting metabolic mechanisms of the cell. Similarly, Roulet et al. [Roulet et al., 2000] found new sites that bind to members of the CTF/NFI family of eukaryotic transcription factors. These motifs are difficult to detect because they consist of two short motif sequences separated by a spacer of variable length; half sites are also recognized. By synthesizing a series of oligonucleotides with variations on the consensus sequence and determining their binding efficiency to a CTF/NFI transcription factor by gel shift assay, they were able to develop a model that was successfully used to find new binding sites.

Prediction of regulatory protein binding sites can help to infer the function of a gene when homologous genes of known function are not available. Yada et al. [Yada et al., 1997] outline the development of a recognition rule for all of the different sigma factor binding sites (a sigma factor is a component of RNA polymerase involved in promoter recognition) in the Bacillus subtilis genome. They applied this information to the prediction of sigma factors that would bind to the promoters of uncharacterized genes and initiate expression. The function of these unknown genes can then be hypothesized by analogy to the known function of other genes regulated by the same
sigma factor.

When the transcription factor binding motifs are unknown, they can be found by searching for common elements in the upstream regions of genes that are known to be coregulated (such genes are known as regulons) [Mironov et al., 1999, Hughes et al., 2000]. A comparative approach can be used to predict regulatory sequences in different genomes; however, the cognate regulatory factor must be known to be conserved [Gelfand et al., 2000]. Tavazoie et al. [Tavazoie et al., 1999] examined life cycle-dependent patterns of gene expression in Saccharomyces cerevisiae by first collecting and analyzing mRNA using microarrays, and grouping the corresponding open reading frames according to the specific point in the yeast's life cycle in which they were expressed. Sequence patterns that are common and specific to each group were then identified in the upstream regions of these genes; these are likely to be involved in developmentally-specific regulation of gene expression. The motifs were used to detect additional sites in the genome with the goal of eventually understanding the regulatory networks within the cell. The advantage of this approach is that it is not influenced by any prior knowledge of genes that might be expressed or the organism that they are expressed in, and is therefore particularly useful to understand regulation in organisms for which very little biology is known. A similar approach could be used to analyze gene expression in different tissues, in response to a given stimulus (for example, an environmental signal), or as a cell transitions from a normal to an abnormal state.

It may be of interest to find motifs in RNA sequences; however, RNA molecules such as tRNA, rRNA and catalytic RNA are usually more conserved in structure than in sequence. The properties of an RNA molecule are often determined by its structure which can take various forms as a consequence of intramolecular basepairing within the single stranded molecule, for example, pseudoknots, hairpin loops, bulges, etc. Thus, aligning RNA solely on the basis of conserved sequences is often misleading. Rather, to look for motifs, an RNA sequence is searched for regions that could potentially basepair to form secondary structure; of course, distance constraints would have to be applied and a minimum number of base pairs would be required [Gorodkin et al., 1997b]. One mechanism of transcription termination in bacteria, the so-called rho independent termination, involves the formation of a secondary structure known as a stem-loop (or hairpin) in mRNA just upstream of the termination site. Intramolecular complementary base pairing results in formation of a stem which is capped by a loop of unpaired bases. Ermolaeva et al. [Ermolaeva et al., 2000] used secondary structure patterns to predict transcription terminators in twelve bacterial genomes. They searched genomic sequences for an mRNA motif characterized by a stable stem sequence (i.e., high in GC content, with only one basepair mismatch in a sequence that would generate a stem of 4-20 nucleotides), followed closely by a short U-rich region, and within a reasonable distance of an open reading frame.

The recognition of regulatory sequences in eukaryotic and prokaryotic DNA sequences has had relatively limited success. Regulation of gene expression is complex. It is common for transcription factors to bind to their DNA target sites in cooperation with many other factors that act synergistically to induce gene expression. In many cases transcription factors can form different combinations of hetero- and homodimers that bind with different specificities; these can often bind in their monomeric form. Other problems that hinder the development of good models for promoter prediction include the relatively low number of characterized promoters and poor understanding of the signals for start and stop of transcription and translation, especially in eukaryotes. In a review by Fickett and Hatzigeorgiou [Fickett and Hatzigeorgiou, 1997], currently available promoter prediction programs were tested and found at best about half of eukaryotic promoters.

For the detection of protein binding sequences in DNA, the spatial distribution of amino acids with respect to a DNA substrate should be examined in addition to the interaction of the protein with a specific DNA sequence. Kono and Sarai [Kono and Sarai, 1999] surmised that the distribution of amino acids around bases found in 130 protein-DNA complexes in the protein data bank could be used to derive empirical interaction potentials, and thus to predict DNA target sites for DNA-binding proteins. A strict sequence correspondence between amino acids and bases was not found, although preferences were evident. However, the interactions between amino acids with a strong base preference and their cognate DNA target could be formed with other amino acids.

Sometimes the initial assumptions used to find motifs are oversimplified. The objective of Kochetov et al. [Kochetov et al., 1999] was to predict properties of mRNA sequences that influence levels of
translation. They compared the mRNA sequences of highly expressed genes with the mRNA sequences of poorly expressed genes to detect sequence features essential for efficient expression. The selection of highly expressed mRNA was based on the assumption that an abundance of polypeptides is a consequence of efficient translation, and conversely, that inefficient expression results in a scarcity of polypeptides. This approach does not take into account other factors that influence polypeptide levels such as RNA stability, promoter activity, etc. It is possible to have a very efficient translation process and still have a short supply of polypeptides. For the cell, more does not necessarily mean better and it is doubtful that efficient translation processes in the cell have been selected through evolution because they result in greater production of polypeptides.

2.3 Tandem Repeats

A particularly interesting problem in pattern-finding involves the detection of tandem repeats, which are two or more contiguous, approximate copies of a pattern of nucleotides. Tandem duplication occurs as a result of mutational events in which an original segment of DNA, the pattern, is converted into a sequence of individual copies. With the progression of time, the individual copies within a tandem repeat may undergo other "uncoordinated" mutations which render the once-identical copies in the original pattern now only approximate variations of each other.

The prevalence of tandem repeats is surprisingly high in genomic sequences. [Benson, 1999] notes that "Tandem repeats are presumed to occur frequently in genomic sequences, comprising perhaps 10% or more of the human genome. But, accurate characterization of the properties of tandem repeats has been limited by the inability to easily detect them". As Benson also goes on to say, the detection of tandem repeats has come to assume an increasing importance in genomic research, for both positive and negative reasons. On the negative side, the appearance of specific kinds of tandem repeats has been linked to a number of different diseases, including Huntington's disease, myotonic dystrophy, spinal and bulbar muscular atrophy, and Friedreich's ataxia. In each of these cases, individuals with the disease have a huge increase in the number of copies of a trinucleotide pattern, into the hundreds or even thousands. On the positive side, however, it appears that tandem repeats may play a role in gene regulation (interacting with transcription factors, altering the structure of the chromatin, or acting as protein binding sites) [Benson, 1999, p.573] and in the development of immune system cells.

3 Algorithms

3.1 Introduction

Algorithmic approaches to pattern discovery exhibit surprising variety. They can be classified according to different, more or less orthogonal criteria. In our report we group algorithms together mainly based on the approaches they use. In this introduction we present other possible classifications of the algorithms. We concentrate on two issues: how the biological task is formulated in computer science terms, and what kinds of patterns are used in the programs. We also introduce notation used throughout this section.

3.1.1 Computer science questions

The problem of pattern discovery appears in different areas of biology. Good examples are protein binding sites (including but not limited to discovery of elements regulating gene expression) and motifs conserved in members of protein families. For a more detailed discussion of the biological aspects of motif finding, see Section 2. Now we will discuss how to translate such biological questions into more formal computer science problems.

Classification problems. One class of problems are classification problems. These occur, for example, in protein families. One of the goals of finding common motifs in protein families is to use these motifs as classifiers: given an unknown protein, we can classify it as a member or non-member of a family, based on whether it contains the motifs characteristic for the family. In this case we may formulate the question as a machine learning problem: given a set of sequences belonging to the family (positive examples) and a set of sequences not belonging to the family (negative examples), one may wish to find a function f which for each protein decides whether it belongs to the family or not. In the context of motif discovery we are mostly interested in such classes of functions f that involve matching some discovered patterns against the unknown sequence. Note that negative examples are simply other known proteins taken from protein databases such as SWISS-PROT. Quite often people start only from positive examples and use negative examples
for evaluation of their classifiers. In this case they usually solve the problem of finding suitable significant patterns as described below. A detailed discussion of possible formalizations of pattern finding as a classification problem can be found in a review [Brazma et al., 1998].

Finding significant patterns. Motif discovery is not always formulated as a classification problem. For example, if we want to find a regulatory element, we might have a set of regions likely to contain this element. However, this does not mean that the element cannot occur elsewhere in the genome, or that all of these sequences must contain a common regulatory element. Also, in the context of protein family motifs, we are interested in finding conserved regions that may indicate structurally or functionally important elements, regardless of whether they have enough specificity to distinguish between this family and other families. In this context it is more complicated to formulate the question precisely. Usually people define a class of patterns they want to find and are interested in discovering the highest scoring pattern from this class that has enough support. Various approaches differ in how they define the support and the score of a pattern.

Support of a pattern usually means the number of sequences in which the pattern occurs. We can require that the pattern occurs in all sequences, or in at least a minimum number of sequences specified by the user. In some cases the number of occurrences is not specified but is part of a scoring function: a longer pattern with fewer occurrences can sometimes be more interesting than a shorter pattern with more occurrences. The situation is even more complicated in the case of probabilistic patterns, such as hidden Markov models. Deterministic patterns either match a sequence or not (zero or one), whereas probabilistic models give a probability between 0 and 1. Therefore there are different degrees of "matching". It is necessary either to set some threshold on what should be considered a match or to include these matching probabilities in the score of the pattern.

Methods for scoring patterns also differ from paper to paper. A score can describe only the pattern itself (e.g. its length, degree of ambiguity etc.) or it can be based on the occurrences of the pattern (their number, how much these occurrences differ from the pattern). Scoring functions are sometimes based on statistical significance. For example we may ask, what is the probability that the pattern would have so many occurrences if the sequences were generated at random. If this probability is small, the pattern is statistically significant. A more detailed discussion of statistical significance of patterns can be found in Section 4.

The goal of an algorithm may be to find the best pattern (i.e. usually the highest scoring one), or to find several best scoring patterns, or all patterns with some predefined level of support and score.

Pattern discovery vs. pattern matching. So far we have discussed the problem of pattern discovery, i.e. the algorithm is supposed to discover a pattern unknown in advance. However, in biology many consensus sequences are known and it is important to have tools that find occurrences of known patterns in new sequences. This problem will be called pattern matching. Programs for pattern matching can be quite general, i.e. they get the pattern as a part of the input, or they can be built to recognize only one particular kind of pattern. In the latter case authors usually try to fine-tune the parameters of the system to get better sensitivity and specificity of the algorithm. From a computer science point of view these programs are not so interesting; however, they are very useful for biologists.

3.1.2 Input sequences

The input of pattern discovery programs usually consists of several sequences expected to contain the pattern. We will denote by Σ the alphabet of all possible characters occurring in the sequences. Thus Σ = {A, C, G, T} for DNA sequences and Σ is the set of all 20 amino acids for protein sequences. Most of the algorithms can be easily adapted to work with any finite alphabet (this is true for algorithms, but not necessarily for their implementations). Thus the pattern finding algorithms can also be used outside bioinformatics, or on other types of biological data.

Some algorithms use not only sequences, but also other information. For example, pattern discovery is much easier in aligned sequences. We may also use information about secondary or tertiary structure, evolutionary relationships between sequences and so on. However, most of the time we will concentrate on discovery from unaligned sequences only.

3.1.3 Types of patterns

Different programs discover patterns of different kinds. On the most general level patterns can be divided into deterministic and probabilistic. A deterministic pattern either matches a given string or not. On the other hand probabilistic patterns are
usually probabilistic models that assign to each sequence the probability that this sequence is generated by the model. The higher this probability, the better the match between sequence and pattern.

Deterministic patterns. The simplest kind of pattern is just a sequence of characters from the alphabet Σ, such as TATAAAA, the TATA box consensus sequence. We can also allow more complex patterns, adding some of the following frequently used features.

• Ambiguous character is a character corresponding to a subset of Σ. An ambiguous character matches any character from this set. Such sets are usually denoted by a list of their members enclosed in square brackets, e.g. [LF] is a set containing L and F. A-[LF]-G is a pattern in the notation used in the PROSITE database. This pattern matches 3-character subsequences starting with A, ending with G and having either L or F in the middle. For nucleotide sequences there is a special letter for each set of nucleotides, for example R=[AG], Y=[CT], W=[AT], S=[GC], B=[CGT], D=[AGT], H=[ACT], V=[ACG], N=[ACGT].

• Wild-card or don't care is a special kind of ambiguous character that matches any character from Σ. Wild-cards are denoted N in nucleotide sequences and X in protein sequences. Often they are also denoted by a dot '.'. A sequence of one or several consecutive wild-cards is called a gap, and patterns allowing wild-cards are often called gapped patterns.

• Flexible gap is a gap of variable length. In the PROSITE database it is denoted by x(i,j), where i is the lower bound on the gap length and j is the upper bound. Thus x(4,6) matches any gap of length 4, 5, or 6. A fixed gap of length i is denoted x(i) (e.g. x(3) = ...). Finally * denotes a gap of any length (possibly 0).

The following string is an example of a PROSITE pattern containing all mentioned features: F-x(5)-G-x(2,4)-G-*-H. Some programs do not allow all these features; for example they do not allow flexible gaps, or they allow any gaps but do not allow ambiguous characters other than a wild-card.

Patterns with mismatches. One can further extend the expressive power of deterministic patterns by allowing a certain number of mismatches. The most commonly used type of mismatch is a substitution. In this case a subsequence S matches pattern P with at most k mismatches if there is a sequence S′ exactly matching P that differs from S in at most k positions.

Sometimes we may also allow insertions or deletions, i.e. the number of mismatches would be the edit distance between the substring S and the closest string matching the pattern P.

Position weight matrices. Even the most complicated deterministic patterns cannot capture some subtle information hidden in a pattern. Assume we have a pattern whose first position contains C in 40% of cases and G in 60% of cases. The ambiguous symbol [CG] gives the same importance to both nucleotides. This is not so important in strong patterns, but it may be important in weak patterns, where we need to use every piece of information to distinguish the pattern from a random sequence.

The simplest type of probabilistic pattern is the position-weight matrix (PWM). PWMs are also sometimes called position-specific score matrices (PSSM), or profiles (however profiles are often more complicated patterns, allowing gaps). A PWM is a simple ungapped pattern specified by a table. This table contains for each pair (position, character) the relative frequency of the character at that position of the pattern (see Figure 1 for an example).

Assume that the pattern (i.e. PWM) has length k (the number of columns of the table). The score of a sequence segment $x_1 \dots x_k$ of length k is

\[ \prod_{i=1}^{k} \frac{A[x_i, i]}{f(x_i)} \]

where A[c, i] is the entry of the position weight matrix corresponding to position i of the pattern and character c, and f(c) is the background frequency of character c in all considered sequences. This product represents the odds score that the sequence segment $x_1 \dots x_k$ belongs to the probability distribution represented by the PWM [Dorohonceanu and Nevill-Manning, 2000]. In order to simplify computation of the score we can store log-odds scores $\log (A[c, i]/f(c))$ in the table, instead of plain frequencies A[c, i]. Then the following formula gives us the log-odds score instead of the odds score (A′[c, i] is an entry of the table containing log-odds scores):

\[ \sum_{i=1}^{k} A'[x_i, i]. \]

PWM with relative frequencies
A    0.26   0.22   0.00   0.00   0.43   1.00   0.11
C    0.17   0.18   0.59   0.00   0.26   0.00   0.35
G    0.09   0.15   0.00   0.00   0.30   0.00   0.00
T    0.48   0.45   0.41   1.00   0.00   0.00   0.54

PWM with log-odds scores (using f(c) = 1/4)
A   -3.94  -4.18    −∞     −∞   -3.22  -2.00  -5.18
C   -4.56  -4.47  -2.76    −∞   -3.94    −∞   -3.51
G   -5.47  -4.74    −∞     −∞   -3.74    −∞     −∞
T   -3.06  -3.15  -3.29  -2.00    −∞     −∞   -2.89

Figure 1: Position weight matrix of the vertebrate branch point in the form of a table, and the corresponding visual representation as a sequence logo. The sequence logo was created using on-line software at http://www.cbs.dtu.dk/gorodkin/appl/slogo.html

Position-weight matrices can be visualized in the form of sequence logos [Schneider and Stephens, 1990] (see Figure 1). Each column of a sequence logo corresponds to one position of the pattern. Relative heights of the characters in one column are proportional to the frequencies A[c, i] at the corresponding position of the pattern. The characters are displayed sorted according to their frequency, with the most frequent character on top. Each column is scaled so that its total height is proportional to the information content of the position, computed as

\[ \log_2 |\Sigma| + \sum_{c} A[c, i] \log_2 A[c, i]. \]

The value $\log_2 |\Sigma|$ is added in order to obtain positive values. It depends on the size of the alphabet Σ. Sequence logos were further improved by [Gorodkin et al., 1997a] to take the background distribution into account and to display characters that occur less frequently than expected upside down.

A quick look at a sequence logo reveals the most preserved positions, consensus characters at all positions etc. Notice that the sizes of characters in different columns cannot be directly compared.

Stochastic models. All types of patterns discussed so far are explicit in the sense that the user can easily see important characteristics of the occurrences of a pattern. Sometimes it is advantageous to represent a pattern in a more implicit form, usually as some discrimination rule, which decides whether a given sequence is an occurrence of the modeled pattern or not. Such a discrimination rule can be based on some stochastic model, such as a hidden Markov model (HMM), or can employ machine learning methods, for example neural nets, and so on.

It is possible to argue whether such rules constitute a pattern at all, but obviously they can be trained (which corresponds to pattern discovery) and then they can be used for discrimination (which corresponds to pattern matching). Therefore they are applicable in pattern-related tasks such as protein family classification, binding site discovery etc. In some cases (such as HMMs with simple topology) it is even possible to obtain some information about the pattern modeled, such as relative frequencies of characters at individual conserved positions.

3.2 Exhaustive search

Many computer science problems related to pattern discovery are provably hard, therefore one cannot hope to find a fast algorithm which would guarantee to find the best possible solution. Therefore many approaches are based on exhaustive search. Although such algorithms may run in exponential time in the worst case, programs often use sophisticated pruning techniques that make the search feasible for typical input data.

3.2.1 Enumerating all patterns.

The simplest approach to pattern discovery is to enumerate all possible patterns satisfying constraints given by the user, for each pattern find its occurrences in input sequences and based on these
occurrences assign a score or statistical significance to each pattern. Then we may output the patterns with the highest score or all patterns with scores above some threshold.

For example if we want to find the most significant nucleotide pattern of length 10, allowing at most 2 mismatches, we can enumerate all possible strings of length 10 over the alphabet {A, C, G, T} (there are $4^{10} = 1,048,576$ such strings). Each string is a potential pattern. We find all its occurrences with at most 2 mismatches in the input sequences and compute the score of the pattern. Then we report the pattern with the highest score.

This method is however suitable only for short and simple patterns, because the running time grows exponentially with the length of the pattern. The number of possibilities is even larger if we allow patterns containing wild-cards, ambiguous characters, gaps etc. On the other hand the running time usually grows linearly with increasing length of the input sequences. Therefore the enumeration approach may be suitable for finding short patterns in a huge amount of data.

The advantage of this method is that it is guaranteed to find the best pattern. We may easily output an arbitrary number of high scoring patterns, and we may also choose relatively complicated scoring functions, as long as they can be easily computed based on the pattern and its occurrences. Also we can allow mismatches, even insertions and deletions.

Application of enumerative method. Many protein binding sites in DNA are actually short ungapped motifs, with certain variability. They can be quite well modeled with simple patterns allowing a small number of mismatches. Therefore we can apply exhaustive search to find this type of binding sites. Recent examples of this approach can be found in [van Helden et al., 1998, Tompa, 1999]. In both papers the authors use straightforward enumeration of all possible patterns and concentrate more on estimating the statistical significance of their occurrence. [van Helden et al., 1998] tries to find a pattern that appears in several copies in most sequences (GATA box). Therefore they consider patterns consisting of 4-9 nucleotides, not allowing mismatches, and they try to find patterns that occur more often than others, taking into account the background distribution. [Tompa, 1999] allows mismatches and tries to find the statistical significance of the given number of occurrences of the pattern with mismatches.

Enumerating gapped patterns. In some contexts it is more reasonable to search for patterns with gaps. An example of such a system is MOTIF [Smith et al., 1990]. MOTIF finds patterns with 3 conserved amino acids, separated by two fixed gaps (for example A...Q....I). The gaps can have length 0, 1, . . . , d where d is a parameter specified by the user. The number of possible patterns is $20^3 d^2$. MOTIF does not allow any mismatches, however the pattern does not need to occur in all sequences. If the sequences contain a conserved region of more than 3 positions, then there will be many patterns, each containing a different subset of conserved positions from this region. Therefore in the next step the algorithm removes patterns occurring close to each other. Then all matches of a particular pattern are aligned and based on this alignment the pattern is extended by finding the consensus in the columns of the alignment. The pattern can also be extended to both sides, if possible.

Pruning pattern enumeration. If we want to find longer or more ambiguous patterns, we cannot use straightforward exhaustive search. Assume we want to find a long ungapped pattern occurring, possibly with some mismatches, in at least K sequences. Then we may start from short patterns (for example patterns of length 1) that appear in at least K sequences and extend them as long as the support does not go below K. In each step we need to extend the pattern in all possible ways and check whether the new pattern still occurs in at least K sequences. Once we get a pattern that cannot be extended without loss of support, this pattern is maximal and can be written to output. This search strategy is actually a depth first search of the tree of all possible sequences (see Figure 2). We prune branches that cannot yield any supported patterns.

Improvements of this kind can work well in some real-life situations, however the theoretical worst-case time still remains exponential. Their main advantage is that they allow searching for longer and more complicated patterns than simple exhaustive search. Examples of this strategy include the Pratt algorithm described in detail below and the first, scanning phase of the TEIRESIAS algorithm [Rigoutsos and Floratos, 1998b] (see also part 3.3.1). Both programs find patterns allowing gaps.

Pratt software. Pratt [Jonassen, 1996] is a quite advanced algorithm based on the idea of depth first search in a tree of patterns, as described above. Pratt discovers quite general patterns which contain
Sequences: aaba, abaa, aabb. Required support K = 2, without mismatches. Each node shows a pattern and its support (# of occurrences).

(empty pattern)
    a (supp. 3)
        aa (supp. 3)
            aaa (supp. 0)
            aab (supp. 2)
                aaba (supp. 1)
                aabb (supp. 1)
        ab (supp. 3)
            aba (supp. 2)
                abaa (supp. 1)
                abab (supp. 0)
            abb (supp. 1)
    b (supp. 3)
        ba (supp. 2)
            baa (supp. 1)
            bab (supp. 0)
        bb (supp. 1)

Figure 2: One way to improve exhaustive search is to search in a tree of all possible patterns. When we discover a node corresponding to a pattern that does not have enough support, we do not continue to search its children. Dashed nodes do not have enough support. Bold nodes are patterns that cannot be further extended.
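The pruned search illustrated in Figure 2 can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: exact matching (no mismatches), support counted as the number of sequences containing the pattern, and extension of patterns only to the right. The function names `support` and `maximal_patterns` are hypothetical and are not taken from any of the programs cited in this report.

```python
def support(pattern, sequences):
    """Number of sequences that contain `pattern` as an exact substring."""
    return sum(pattern in s for s in sequences)


def maximal_patterns(sequences, alphabet, k):
    """Depth-first search of the pattern tree with support-based pruning.

    Returns the patterns with support >= k for which every one-character
    extension to the right has support < k (assumption: right-extension
    only, as in the tree of Figure 2).
    """
    if k < 1:
        raise ValueError("required support k must be at least 1")
    found = []

    def extend(pattern):
        extended = False
        for c in alphabet:
            child = pattern + c
            # Prune: children without enough support are never explored.
            if support(child, sequences) >= k:
                extended = True
                extend(child)
        # A non-empty pattern with no supported child cannot be extended.
        if pattern and not extended:
            found.append(pattern)

    extend("")
    return found


# The example of Figure 2: three sequences, required support K = 2.
sequences = ["aaba", "abaa", "aabb"]
print(maximal_patterns(sequences, "ab", 2))  # ['aab', 'aba', 'ba']
```

On the sequences of Figure 2 the search never descends below aaa, abb, bb, baa or bab, because their support is already below K = 2. The search terminates because no pattern longer than the longest input sequence can have positive support.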