Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology

Kimmen Sjölander† (Computer Science, UC Santa Cruz) kimmen@cse.ucsc.edu
Kevin Karplus (Computer Engineering, UC Santa Cruz) karplus@cse.ucsc.edu
Michael Brown (UC Santa Cruz) mpbrown@cse.ucsc.edu
Richard Hughey (Computer Engineering, UC Santa Cruz) rph@cse.ucsc.edu
Anders Krogh (The Sanger Centre, England) krogh@sanger.ac.uk
I. Saira Mian (Lawrence Berkeley Laboratory, UC Berkeley) saira@cse.ucsc.edu
David Haussler (Computer Science, UC Santa Cruz) haussler@cse.ucsc.edu

UCSC Technical Report (UCSC-CRL series)

Abstract

This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.

Keywords: substitution matrices, pseudocount methods, Dirichlet mixture priors, profiles, hidden Markov models

† To whom correspondence should be addressed. Mailing address: Baskin Center for Computer Engineering and Information Sciences, Applied Sciences Building, University of California at Santa Cruz, Santa Cruz, CA 95064.

Introduction

Recently, the first complete genome for a free-living organism was sequenced. In July 1995, The Institute for Genomic Research (TIGR) announced in Science the complete DNA sequence of Haemophilus influenzae Rd (Fleischmann et al.). Along with this sequence came over a thousand predicted protein genes. It is not every day that the protein databases get such a large influx of novel proteins, and within days protein scientists were hard at work analyzing the data (Casari et al.). One of the main techniques used to analyze these proteins is to find similar proteins in the database whose structure or function are already known. When two sequences share residue identity above a threshold, and each is of sufficient length, the two sequences are said to be homologous, i.e., they share the same overall structure (Doolittle). If the structure of one of the sequences has been determined experimentally, then the structure of the new protein can be inferred from the other. If one is fortunate, and a large number of homologous sequences are found, then it may be possible to tackle the somewhat more difficult problem: inferring the new protein's functions.

However, requiring a minimum residue identity can mean that no sequences of known structure are deemed homologous to the new sequence. Does this mean that we can then assume that the three-dimensional structure of this new sequence is in a class of its own? This may be the case some fraction of the time. But it is more likely that some remote homolog exists in the database, sharing a common structure but having residue identity below the threshold.

Moreover, the problem of finding homologous sequences, close or remote, is not limited to the case where one has a single protein. One may have several sequences available for a given family, but expect that other family members exist in the databases, and want to locate these putative members. Finding these remote homologs is one of the primary motivating forces behind the development of new types of statistical models for protein families and domains in recent years. It is also a key motivation for the work presented here.

Database search using statistical models

Statistical models for proteins are objects, like profiles, that capture the statistics defining a protein family or domain. Along with parameters expressing the expected amino acids at each position in the molecule or domain (and possibly other parameters as well), a statistical model will have a scoring function for sequences with respect to the model. These models come in various forms. Profiles and their many offshoots (Gribskov et al.; Bucher et al.; Barton and Sternberg; Altschul et al.; Waterman and Perlwitz; Thompson et al.; Bowie et al.; Lüthy et al.), position-specific scoring matrices (Henikoff et al.), and hidden Markov models (HMMs) (Churchill; White et al.; Stultz et al.; Krogh et al.; Hughey; Baldi et al.; Baldi and Chauvin; Asai et al.) have all been proposed, and demonstrated effective for particular tasks under certain conditions.

In contrast with homology determination by residue identity, statistical models use a very different technique to determine whether two sequences share a common structure. During database search with these models, each sequence in the database is assigned a score (or, negatively, a cost), generally by adding the score or cost at each position in the model. For instance, a typical cost for aligning residue $a$ at position $i$ is $-\log \mathrm{Prob}(a \mid \text{position } i)$, where the base of the logarithm is arbitrary. A sequence is determined to belong to the family (or contain the domain) if the cost of aligning the sequence to the model falls below a cutoff. This cutoff can be determined experimentally, for instance by setting it to the maximum cost for any of the known members of the family, or it can be predetermined. (Two examples of presetting the cutoff are choosing a cost that is a certain number of standard deviations below the mean cost of all the proteins in the database, in which case the number of standard deviations is predetermined, and setting the cutoff based on the statistical significance of choosing the model over a null model.)
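To make this additive scoring concrete, here is a minimal Python sketch (our own illustration, not any particular profile or HMM package); the model is represented simply as a list of per-position amino acid distributions, and the alignment is assumed gapless.

```python
import math

def sequence_cost(sequence, position_probs):
    """Additive cost of aligning a sequence to a model: the sum over
    positions i of -log Prob(residue | position i).  position_probs[i]
    is a dict mapping each amino acid to its probability at position i.
    Note: a zero probability at any position makes the cost infinite
    (math.log raises an error on 0) -- the problem discussed below."""
    return sum(-math.log(probs[res])
               for res, probs in zip(sequence, position_probs))

# A sequence is classified as a family member when its cost falls below
# a cutoff, e.g.:  is_member = sequence_cost(seq, model) < cutoff
```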

Because these parameters are used to score each sequence in the database, careful tuning of the parameters representing the expected amino acids becomes essential, and zero probabilities are particularly problematic. Allowing zero probabilities at positions gives an infinite penalty to sequences having the zero-probability residues at those positions. Even if a sequence is homologous to those used in training the model, a single mismatch at such a position would render that sequence unrecognizable by the model. On the other hand, the costs at each position are additive, so small improvements in predicting the expected amino acids at each position accumulate over the length of the sequence, and can boost a model's effectiveness significantly.

Since each of these statistical models relies on having sufficient data to estimate its parameters, modeling protein families or domains for which few sequences have been identified is quite difficult. Methods that increase the accuracy of estimating the expected amino acids at each position are thus of primary importance for these models.

We tread a thin line between specificity and sensitivity in estimating these parameters. If a model is highly specific but does not generalize well, it will recognize only a fraction of those sequences in the family. In database discrimination, this model will generate false negatives: sequences that should be labeled as family members, but are instead labeled as not belonging to the family. The model is too strict, and database search with this model produces little new information. The reverse situation occurs when we sacrifice specificity for sensitivity. In this case, the model categorizes sequences which are not in the family as family members. These false positives are obtained through models that are too lax, and while true remote homologs may be included in the set identified as family members, they may be hard to identify as such if the pool is simply too large. One of the tests of the effectiveness of a statistical modeling technique, in fact, is how well it reduces the numbers of false negatives and false positives in database discrimination.

Issues in estimating expected amino acid probabilities

The following examples illustrate the kinds of issues encountered in estimating amino acid probabilities.

In the first scenario, imagine that a deep multiple alignment of many sequences has a column containing only isoleucine, and no other amino acids. In the second scenario, an alignment of three sequences also has a column containing only isoleucine, and no other amino acids. If we estimate the expected probabilities of the amino acids in these columns to be equal to the observed frequencies, then the estimate of the expected probability of each amino acid $i$ is simply the fraction of times $i$ is observed, i.e., $\hat p_i = n_i / |\vec n|$, where $n_i$ is the frequency of amino acid $i$ in the column and $|\vec n| = \sum_i n_i$. Using this method of estimating the probabilities, we would assign a probability of 1 to isoleucine, and zero to all the other amino acids, for both of these columns. But is this estimate reasonable?
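A small illustration (ours, not from the paper's software) makes the difficulty plain: the deep column and the three-sequence column produce exactly the same maximum-likelihood estimate.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ml_estimate(column):
    """Maximum-likelihood estimate: p_i = n_i / |n| for one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}

print(ml_estimate("I" * 100)["I"], ml_estimate("I" * 100)["L"])  # 1.0 0.0
print(ml_estimate("III")["I"], ml_estimate("III")["L"])          # 1.0 0.0
```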

It is illuminating to consider the analogous problem of assessing the fairness of a coin. A coin is said to be fair if $\mathrm{Prob}(\text{heads}) = \mathrm{Prob}(\text{tails}) = 1/2$. Equivalently, if we toss a fair coin $n$ times, obtaining $h$ heads and $t$ tails, we expect $h/n$ and $t/n$ to each come closer and closer to $1/2$ as $n$ approaches infinity, in accordance with the law of large numbers. Now, if we pick a coin at random and toss it three times, and it comes up heads each time, what should our estimate of the probability of heads for this coin be? If we assume that most coins are fair, then we are unlikely to change this a priori assumption based on only a few tosses. On the other hand, if we toss the coin an additional thousand times, and it comes up heads each time, at this point very few of us would insist that the coin was indeed fair. Our estimate of this coin's probability of heads is going to be 1, or quite close to it. Given an abundance of data, we will discount any previous assumptions and believe the data.

In the first scenario, for the column from the deep alignment containing only isoleucine, the evidence is strong that isoleucine is conserved at this position. Allowing any substitutions in this position is clearly not optimal, and giving isoleucine probability 1, or close to it, appears sensible.

In the second scenario, with an alignment of only three sequences, we cannot rule out the possibility that proteins in the family not included in the training set may have other amino acids at this position. In this case, we might not want to assign isoleucine probability 1, and require that all sequences in the family (or containing the domain) have an isoleucine at this position. Instead, we might want to use prior knowledge about amino acid distributions, and modify our estimate of the expected distribution to reflect that prior knowledge. In this case, we know that where isoleucine is found, other hydrophobic residues are often found, especially leucine and valine. Our estimate of the expected distribution at this position would sensibly include these residues, and perhaps the other amino acids as well, albeit with much smaller probabilities. By contrast, when we have many sequences multiply aligned, we expect the estimate $\hat p_i = n_i / |\vec n|$ to be a close approximation of the true underlying probabilities, and any prior information about typical amino acid distributions is relatively unimportant.

Thus, the natural solution is to introduce prior information into the construction of the statistical model, interpolating smoothly between reliance on the prior information concerning likely amino acid distributions in the absence of data, and confidence in the amino acid frequencies observed at each position given abundant data. Our aim in this work is to provide a statistically well-founded Bayesian framework for obtaining this prior information, and for combining it with observed amino acid frequencies.

One final comment concerning skew is in order. A skewed sample can arise in two ways. In the first, the sample is skewed simply from the luck of the draw; this kind of skew is common in small samples, and is akin to tossing a fair coin three times and observing three heads in a row. The second type of skew is more insidious, and can occur even when large samples are drawn. In this kind of skew, one subfamily is over-represented, such that a large fraction of the sequences used to train the statistical model are minor variants of each other. This disparity among the number of sequences available from different subfamilies for a given protein is the basis for the widespread use of weighting schemes (Sibbald and Argos; Thompson et al.; Henikoff and Henikoff). If one has reason to believe that the available data over-represents some subfamilies, Dirichlet mixtures can be used in conjunction with any weighting scheme desired to produce more accurate amino acid estimates: simply weight the sequences prior to computing the expected amino acids for each position using a Dirichlet mixture. Each column in the weighted data will be a vector of counts, though probably real-valued rather than integral. Because we assume weighted data may be used as input, we have incorporated this possibility into the formula given later (equation (17)) for computing the expected amino acid distributions.

Obtaining and using prior knowledge of amino acid distributions

Fortunately, even when data from a particular family may be limited, there is no lack of data in the protein sequence databases concerning the kinds of distributions which are likely (or unlikely) in particular positions in proteins. In this work, we have attempted to condense the enormous wealth of information in the databases into the form of a mixture of densities. These densities assign a probability to every possible distribution of the amino acids. We use Maximum Likelihood (Duda and Hart; Nowlan; Dempster et al.) to estimate these mixtures, i.e., we seek to find a mixture that maximizes the probability of the observed data. Often these densities capture some prototypical distributions; taken as an ensemble, they explain the observed distributions in the databases.

There are many different commonly occurring distributions. Some of these reflect a preference for hydrophobic amino acids, some for small amino acids, and some for more complex combinations of physiochemical features. Certain combinations of these features are commonly found, while others are much rarer. Degrees of conservation differ due to the presence or absence of structural or functional constraints. In the extreme case, when an amino acid is highly conserved at a certain position in the protein family (such as the proximal histidine that coordinates the heme iron in hemoglobin), the distribution of amino acids in the corresponding column of the multiple alignment is sharply peaked on that one amino acid, whereas in other cases the distribution may be spread over many possible amino acids.

With accurate prior information about which kinds of amino acid distributions are reasonable in columns of alignments, it is possible, even with only a few sequences, to identify which of the prototypical distributions characterizing positions in proteins may have generated the amino acids observed in a particular column of the emerging statistical model. Using this informed guess, we can adjust the expected amino acid probabilities so that the estimate for that position includes the possibility of amino acids that may not have been seen at all in that position, but are consistent with observed amino acid distributions in the protein databases. This has the effect of moving estimated amino acid distributions toward known distributions, and away from distributions that are unusual biologically. The models produced are more effective at generalizing to previously unseen data, and are often superior in database search and discrimination experiments (Karplus; Tatusov et al.; Bailey and Elkan; Brown et al.).

Comparison with other methods for computing these probabilities

We are certainly not the first group to notice the need for incorporating prior information about such amino acid distributions into the parameter estimation process. Indeed, our present work has several conceptual similarities with profile methods, particularly in regard to seeking meaningful amino acid distributions for use in database search and multiple alignment (Waterman and Perlwitz; Barton and Sternberg; Gribskov et al.; Bowie et al.; Lüthy et al.; Claverie). This work also has much in common with amino acid substitution matrices, which have been used effectively in database search and discrimination tasks (Henikoff and Henikoff; Altschul).

There are two drawbacks associated with the use of substitution matrices. First, each amino acid has a fixed substitution probability with respect to every other amino acid. In any particular substitution matrix, to paraphrase Gertrude Stein, an isoleucine is an isoleucine is an isoleucine. However, an isoleucine seen in one context (for instance, in a position that is functionally conserved) will have different substitution probabilities than an isoleucine seen in another context (where any hydrophobic residue may be allowed). Second, only the relative frequency of amino acids is considered, while the actual number observed is ignored. Thus, in substitution-matrix-based methods, the expected amino acid probabilities are identical for an apparently conserved column of a deep alignment containing only isoleucines, a column containing three isoleucines, or even a single isoleucine. All three situations are treated identically, and the estimates produced are indistinguishable.

The method described here addresses both of these issues. A Dirichlet mixture prior can be decomposed into individual components, each of which is a probability density over all the possible combinations of amino acids occurring at positions in proteins. Common distributions determined by functional or structural constraints are captured by these components; these then provide position-specific substitution probabilities. In producing an estimate for the expected amino acids, the formula employed (equation (17) below) gives the greatest impact on the estimation to those components which are most likely to have generated the actual amino acids observed.

For example, in the tables accompanying this report, we give a nine-component mixture estimated on the Blocks database (Henikoff and Henikoff). In this mixture, isoleucine is seen in several contexts. One component gives high probability to all conserved distributions, i.e., distributions where a single residue is preferred over all others. Another component represents distributions preferring isoleucine and valine, but allowing leucine and methionine; i.e., this component gives high probability to aliphatic residues found in beta sheets. A third component reverses the order of residues preferred by the second, preferring leucine and methionine to isoleucine, and allowing phenylalanine and valine as less likely substitutions. A fourth component favors methionine, but allows isoleucine and the other aliphatic residues, as well as phenylalanine and a few other residues. A full description of how to interpret these mixtures in general is given in the section on interpreting Dirichlet mixtures.

When only one or two isoleucines are observed, the lion's share of the probability is shared by two components: the component preferring isoleucine and valine starts off with the highest probability, while the component preferring leucine and methionine comes in second; the conserved-distribution component and the methionine-favoring component both have relatively low probability. However, the information in the column increases rapidly as the number of sequences grows, and the probabilities of each of the components change: the components favoring mixed distributions decrease in probability, while the component which favors conserved distributions grows very rapidly in probability. At ten observed isoleucines, the conserved-distribution component dominates, and the methionine-favoring component has one of the lowest probabilities of all the components. This process is demonstrated in the accompanying table.

The estimates of the expected amino acids reflect the changing contribution of these components. Given a single observed isoleucine, the estimate gives isoleucine the largest probability, but also gives appreciable probability to valine, leucine, and methionine, revealing the influence of the components with a preference for allowing substitutions with those residues. By ten observations, isoleucine receives nearly all the probability, with valine retaining a small share. We can still see the contribution of the component with its bias toward allowing valine to substitute for isoleucine, but the predominant signal is that isoleucine is required at this position.

Moreover, the second issue, the importance of the actual number of residues observed, is addressed in the estimation formula as well. Here, as the number of observations increases, the contribution of the prior information is lessened. Even if a mixture prior does not give high probability to a particular type of distribution, as the number of sequences aligned increases, the estimate for a column becomes more and more peaked around the maximum likelihood estimate for that column, i.e., $\hat p_i$ approaches $n_i / |\vec n|$ as $|\vec n|$ increases.

Importantly, when the data indicate a residue is conserved at a particular position (i.e., most or all of the sequences in an alignment contain a given residue in one position, and a sufficient number of observations are available), the expected amino acid probabilities produced by this method will remain peaked around that residue, instead of being modified to include all the residues that substitute on average for the conserved residue, as is the case with substitution matrices. (See, for example, the estimated amino acid probabilities produced by two substitution-matrix-based methods in the accompanying tables.)

Pseudocount methods are a special case of Dirichlet mixtures, where the mixture consists of a single component. In these methods, a fixed value is added to each observed amino acid count, and then the counts are renormalized, i.e., $\hat p_i = (n_i + z_i) / \sum_j (n_j + z_j)$, where $z_i$ can be the same constant for every amino acid $i$, or can vary from one amino acid to the next. They have some of the desirable properties of Dirichlet mixtures, but because they have only a single component, they are unable to represent as complex a set of prototypical distributions. (A comparison of Dirichlet mixtures with data-dependent pseudocount methods is given in Karplus and in Tatusov et al., where Dirichlet mixtures were shown to give superior results.) We include in the tables probability estimates for two popular pseudocount methods which add the same constant for each amino acid, and can thus be called zero-offset methods: Add-One, where $z_i = 1$ for all $i$, and Add-Share, where a smaller constant is used for all $i$. The single-component Dirichlet density estimated on the Blocks database is also a pseudocount method, where $z_i = \alpha_i$ is closely related to the background frequency of amino acid $i$.
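These rules fit in a few lines of Python. The sketch below is illustrative only; the $\alpha$ values at the end are placeholders, not the fitted Blocks-derived density.

```python
def pseudocount_estimate(counts, z):
    """Single-component estimate: p_i = (n_i + z_i) / sum_j (n_j + z_j)."""
    total = sum(n + zi for n, zi in zip(counts, z))
    return [(n + zi) / total for n, zi in zip(counts, z)]

k = 20
counts = [0] * k
counts[0] = 3                       # e.g., three observations of one residue
add_one = pseudocount_estimate(counts, [1.0] * k)   # zero-offset, z_i = 1
alpha = [0.05] * k                  # placeholder single-density parameters
single_dirichlet = pseudocount_estimate(counts, alpha)
```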

The work of Lüthy, McLachlan, and Eisenberg (Lüthy et al.) also has some interesting similarities to that presented here. They analyzed multiple alignments containing secondary structure information to construct a set of nine probability distributions, which we call the LME distributions, describing the distribution of amino acids in nine different structural environments. LME distributions have been shown to increase the accuracy of profiles in both database search and multiple alignment, by enabling them to take advantage of prior knowledge of secondary structure. (In more recent work, they have used different distributions (Bowie et al.).)

These distributions cannot always be used, since in many cases structural information is not available, or the statistical model employed is not designed to take advantage of such information. For example, our method for training an HMM assumes unaligned sequences are given as input to the program, and that no secondary structure information for the sequences is available. Thus, distributions associated with particular secondary structural environments, such as the LME distributions, are inappropriate for our use. Moreover, we have an additional problem using the LME distributions in this Bayesian framework. As we will show in the section on computing probabilities using a single density, Bayes' rule requires that, in computing the amino acid probabilities, the observed frequency counts be modified less strongly when the prior distribution has a very high variance. Thus, when there is no measure of the variance associated with a distribution, as is the case with the LME distributions, one must assign a variance arbitrarily in order to use the distribution to compute the expected probabilities.

In this paper, we propose the use of mixtures of Dirichlet densities (see, e.g., Bernardo and Smith) as a means of representing prior information about expected amino acid distributions. In the next section, we give a description of ways to interpret these mixtures. The mathematical foundations of the method follow: we first describe Dirichlet densities; then, for those wishing to use these mixtures, we present a Bayesian method for combining observed amino acids with these priors to produce posterior estimates of the probabilities of the amino acids; after that comes the mathematical derivation of the learning rule for estimating Dirichlet mixtures. We then present an overview of work done both at Santa Cruz (Karplus; Brown et al.) and elsewhere (Tatusov et al.; Bailey and Elkan; Henikoff and Henikoff) that demonstrates the effectiveness of these densities in a variety of statistical models, and the superiority of this technique in general over others tried. Some pointers to help users avoid underflow and overflow problems, as well as speed up the computation of mixture estimation, are treated in the final section.

We also want to emphasize, perhaps obviously, that the method described in this paper is general, and applies not only to data drawn from columns of multiple alignments of protein sequences, but can be used to characterize distributions over other alphabets as well. For example, we have done some experiments developing Dirichlet mixtures for RNA, both for single-column statistics and for pairs of columns, and we have estimated Dirichlet densities over transition probabilities between states in hidden Markov models.

For a review of the essentials of the HMM methodology we use, including architecture, parameter estimation, multiple alignments, and database searches, see Krogh et al.

Interpreting Dirichlet Mixtures

We include in this paper a nine-component mixture estimated on the Blocks database (Henikoff and Henikoff), which has given some of the best results of any mixture estimated using the techniques described here. The accompanying table gives the parameters of this mixture.

Since a Dirichlet mixture describes the expected distributions of amino acids in the data used to estimate the mixture, it is useful to look in some detail at each individual component of the mixture, to see what distributions of amino acids it favors.

Two kinds of parameters are associated with each component: the mixture coefficient $q_j$, and the $\vec\alpha_j$ parameters, which define the distributions preferred by the component. For any distribution of amino acids, the mixture as a whole assigns a probability to the distribution by combining the probabilities given to the distribution by each of the components in the mixture.

One way to characterize a component is by giving the mean expected amino acid probabilities and the variance around the mean; formulas to compute these quantities are given in the mathematical foundations section. We can also list the amino acids for each component in order by the ratio of the mean frequency of the amino acids in the component to the background frequency of the amino acids, as in the sketch below. The accompanying table lists the preferred amino acids for each component in the mixture.
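This ordering is easy to compute; the following Python sketch (names ours, illustrative only) ranks the amino acids of one component by the ratio of its mean frequencies to the background frequencies.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def preferred_residues(alpha, background):
    """Order amino acids by (alpha_i / |alpha|) / background_i, the ratio
    of the component's mean frequency to the background frequency."""
    mean = np.asarray(alpha, float) / np.sum(alpha)
    ratio = mean / np.asarray(background, float)
    return sorted(zip(AMINO_ACIDS, ratio), key=lambda x: -x[1])
```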

The mixture coefficient $q_j$ associated with a component is equal to the probability of that component given the data, averaged over all the data; i.e., it expresses the fraction of the data represented by the component. In this mixture, the components peaked around the aromatic and the non-polar hydrophobic residues represent the smallest fraction of the columns used to train the mixture, and the component representing all the highly conserved residues represents the largest fraction of the data.

The value $|\vec\alpha| = \sum_i \alpha_i$ is a measure of the peakedness of the component about the mean: higher values of $|\vec\alpha|$ indicate that distributions must be close to the mean of the component in order to be given high probability by that component. In our experience, when we allow a large number of components, we often find that many of the components that result are peaked around individual residues and have high $|\vec\alpha|$, but this may be an artifact of our optimization technique. However, when we estimate mixtures having a limited number of components (for instance, ten or fewer), we find that one component tends to have a very small $|\vec\alpha|$, allowing this component to give high probability to essentially all pure distributions. This kind of component has high probability in most of the mixtures we have estimated, evidence that nearly pure distributions are common in the databases we have used to estimate these mixtures. Since the Blocks database was selected to favor highly conserved columns, it is not surprising that the individual components of a Dirichlet mixture tuned for the Blocks database also favor conserved columns. Mixtures tuned for the HSSP set of alignments, which contains full proteins rather than just highly conserved blocks, show similar behavior, although the $|\vec\alpha|$ of the components of these mixtures are not quite as low as the $|\vec\alpha|$ of mixtures estimated on the Blocks database.

We often find that $|\vec\alpha|$ and $q$ are inversely proportional to each other. For instance, the component with the largest mixture coefficient (representing the most common distributions) also has the smallest value of $|\vec\alpha|$. The amino acids most favored by this component (tryptophan, glycine, proline, and cysteine) are indeed the most highly conserved ones. However, as the accompanying table shows, this component gives high probability to pure distributions centered around other residues as well.

Groups of amino acids that frequently substitute for each other will tend to have one component that assigns a high probability to the members of the group, and a low probability to other amino acids. These components tend to have higher $|\vec\alpha|$. For instance, the two components with the largest values of $|\vec\alpha|$ (and so the most mixed distributions) represent the polars and the non-polar hydrophobics, respectively.

A residue may be represented primarily by one component (as proline is) or by several components (as isoleucine and valine are).

(A close variant of this mixture was used in experiments elsewhere (Tatusov et al.; Henikoff and Henikoff).)

Mathematical Foundations

What are Dirichlet densities?

A Dirichlet density $\rho$ (Berger; Santner and Duffy) is a probability density over the set of all probability vectors $\vec p$, i.e., vectors whose components satisfy $p_i \geq 0$ and $\sum_i p_i = 1$. In the case of proteins, with a 20-letter alphabet, $\vec p = (p_1, \ldots, p_{20})$, where $p_i = \mathrm{Prob}(\text{amino acid } i)$. Here, each vector $\vec p$ represents a possible probability distribution over the 20 amino acids. A Dirichlet density has parameters $\vec\alpha = (\alpha_1, \ldots, \alpha_{20})$, with $\alpha_i > 0$. The value of the density for a particular vector $\vec p$ is

$$\rho(\vec p) = \frac{\prod_i p_i^{\alpha_i - 1}}{Z}, \qquad (1)$$

where $Z$ is the normalizing constant that makes $\rho$ integrate to unity. The mean value of $p_i$ given a Dirichlet density with parameters $\vec\alpha$ is

$$E[p_i] = \alpha_i / |\vec\alpha|, \qquad (2)$$

where $|\vec\alpha| = \sum_i \alpha_i$. The second moment $E[p_i p_j]$, for the case $i \neq j$, is given by

$$E[p_i p_j] = \frac{\alpha_i \, \alpha_j}{|\vec\alpha| \, (|\vec\alpha| + 1)}. \qquad (3)$$

When $i = j$, the second moment $E[p_i^2]$ is given by

$$E[p_i^2] = \frac{\alpha_i (\alpha_i + 1)}{|\vec\alpha| \, (|\vec\alpha| + 1)}. \qquad (4)$$

In the case of a mixture prior, we assume that $\rho$ is a mixture of Dirichlet densities, and hence has the form

$$\rho = q_1 \rho_1 + \ldots + q_l \rho_l, \qquad (5)$$

where each $\rho_j$ is a Dirichlet density specified by parameters $\vec\alpha_j = (\alpha_{j,1}, \ldots, \alpha_{j,20})$, and the numbers $q_1, \ldots, q_l$ are positive and sum to 1. A density of this form is called a mixture density (or, in this specific case, a Dirichlet mixture density), and the $q_j$ values are called mixture coefficients. Each of the densities $\rho_j$ is called a component of the mixture.

The mean of a mixture is the weighted sum of the means of each of the components in the mixture, weighted by their mixture coefficients. That is, $E[p_i] = \sum_j q_j \, \alpha_{j,i} / |\vec\alpha_j|$.

We use the symbol $\Theta$ to refer to the entire set of parameters defining a prior. In the case of a mixture, $\Theta = (\vec\alpha_1, \ldots, \vec\alpha_l, q_1, \ldots, q_l)$, whereas in the case of a single density, $\Theta = \vec\alpha$.
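For concreteness, equations (2) through (4) and the mixture mean translate directly into code; the following Python sketch (helper names are ours) computes the moments of a component and the mean of a mixture.

```python
import numpy as np

def dirichlet_mean(alpha):
    """E[p_i] = alpha_i / |alpha|  (equation (2))."""
    alpha = np.asarray(alpha, float)
    return alpha / alpha.sum()

def dirichlet_second_moments(alpha):
    """Matrix of E[p_i p_j]: equations (3) and (4)."""
    alpha = np.asarray(alpha, float)
    s = alpha.sum()
    m = np.outer(alpha, alpha) / (s * (s + 1.0))            # i != j case
    m[np.diag_indices_from(m)] = alpha * (alpha + 1.0) / (s * (s + 1.0))
    return m

def mixture_mean(q, alphas):
    """E[p_i] = sum_j q_j alpha_{j,i} / |alpha_j| for a mixture."""
    return sum(qj * dirichlet_mean(aj) for qj, aj in zip(q, alphas))
```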

Computing Expected Amino Acid Probabilities

As described in the introduction, in predicting the expected probabilities of amino acids at each position in a protein family or domain, one is often hampered by insufficient or skewed data. The amino acid frequencies in the available data may be far from accurate reflections of the amino acid frequencies in all the family members.

Fortunately, we are in a position to take advantage of information contained in a Dirichlet prior. As we explained above, a Dirichlet mixture density with parameters $\Theta = (\vec\alpha_1, \ldots, \vec\alpha_l, q_1, \ldots, q_l)$ defines a probability distribution over all the possible distributions of amino acids. Given a column in a multiple alignment, we can combine the information in the prior with the observed amino acid counts to form estimates, $\hat p_i$, of the probabilities of each amino acid $i$ at that position. These estimates $\hat p_i$ of the actual $p_i$ values will differ from the estimate $\hat p_i = n_i / |\vec n|$, and should be much better when the number of observations is small.

Let us suppose that we fix a numbering of the amino acids from 1 to 20. Then each column in a multiple alignment can be represented by a vector of counts of amino acids of the form $\vec n = (n_1, \ldots, n_{20})$, where $n_i$ is the number of times amino acid $i$ occurs in the column represented by this count vector.

At this point, we must explain some assumptions we have made concerning how the observed data were generated. The mathematical formulae for estimating and using Dirichlet mixture priors described in the following sections follow directly from these assumptions. We assume that the hidden process generating each count vector $\vec n$ can be modeled by the following stochastic process:

1. First, a component $j$ from the mixture $\rho$ is chosen at random, according to the mixture coefficient $q_j$.

2. Then, a probability distribution $\vec p$ is chosen independently, according to $\mathrm{Prob}(\vec p \mid \vec\alpha_j)$, the probability defined by component $j$ over all such distributions.

3. Finally, the count vector $\vec n$ is generated according to the multinomial distribution with parameters $\vec p$.

Obviously, when $\rho$ consists of a single component, the first step is trivial, since the probability of the single component is 1. In this case, the stochastic process consists of steps 2 and 3.
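This generative model is simple to simulate; the following Python sketch (ours) draws one count vector by exactly these three steps.

```python
import numpy as np

def sample_count_vector(q, alphas, num_sequences, rng=None):
    """Generate a count vector n by the three-step stochastic process:
    (1) choose component j with probability q_j, (2) draw p from the
    Dirichlet density with parameters alpha_j, (3) draw the counts from
    a multinomial with parameters p and |n| = num_sequences."""
    rng = rng or np.random.default_rng()
    j = rng.choice(len(q), p=q)                 # step 1
    p = rng.dirichlet(alphas[j])                # step 2
    return rng.multinomial(num_sequences, p)    # step 3
```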

We can now define the estimated probability $\hat p_i$ of amino acid $i$, given a Dirichlet density with parameters $\Theta$ and observed amino acid counts $\vec n$, as follows:

$$\hat p_i = \mathrm{Prob}(\text{amino acid } i \mid \Theta, \vec n) = \int_{\vec p} \mathrm{Prob}(\text{amino acid } i \mid \vec p) \, \mathrm{Prob}(\vec p \mid \Theta, \vec n) \, d\vec p. \qquad (6)$$

The first term in the integral, $\mathrm{Prob}(\text{amino acid } i \mid \vec p)$, is simply $p_i$, the $i$-th element of the distribution vector $\vec p$. The second term, $\mathrm{Prob}(\vec p \mid \Theta, \vec n)$, represents the posterior probability of the distribution $\vec p$ under the Dirichlet density with parameters $\Theta$, given that we have observed amino acid counts $\vec n$. Taken together, the integral represents the contribution of amino acid $i$ from each probability distribution $\vec p$, weighted according to the posterior probability of $\vec p$. An estimate of this type is called a mean posterior estimate.

Computing probabilities using a single density (pseudocounts)

While we find the best results in computing these expected amino acid distributions come from employing mixtures of Dirichlet densities, it is enlightening to consider the posterior estimate of an amino acid $i$ in the case of a single density.

In the case of a single-component density with parameters $\vec\alpha$, the mean posterior estimate of the probability of amino acid $i$ is defined

$$\hat p_i = \int_{\vec p} p_i \, \mathrm{Prob}(\vec p \mid \vec\alpha, \vec n) \, d\vec p. \qquad (7)$$

(Note: we could instead choose to select amino acids independently, with probability $p_i$; the optimization problem for optimizing $\Theta$ comes out the same.)

By Lemma 1 (the proof of which is found in the Appendix), the posterior probability of each distribution $\vec p$, given the count data $\vec n$ and the density with parameters $\vec\alpha$, is

Lemma 1:
$$\mathrm{Prob}(\vec p \mid \vec\alpha, \vec n) = \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_i \Gamma(n_i + \alpha_i)} \, \prod_i p_i^{n_i + \alpha_i - 1}. \qquad (8)$$

Here, as usual, $|\vec\alpha| = \sum_i \alpha_i$, $|\vec n| = \sum_i n_i$, and $\Gamma$ is the Gamma function, the continuous generalization of the integer factorial function (i.e., $\Gamma(x+1) = x\,\Gamma(x)$).

Now, if we substitute $p_i$ for $\mathrm{Prob}(\text{amino acid } i \mid \vec p)$, and the result of Lemma 1, into equation (7), we have

$$\hat p_i = \int_{\vec p} p_i \, \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_j \Gamma(n_j + \alpha_j)} \, \prod_j p_j^{n_j + \alpha_j - 1} \, d\vec p. \qquad (9)$$

Here, we can pull those terms not depending on $\vec p$ out of the integral, obtaining

$$\hat p_i = \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_j \Gamma(n_j + \alpha_j)} \int_{\vec p} p_i \prod_j p_j^{n_j + \alpha_j - 1} \, d\vec p. \qquad (10)$$

Now, noting the contribution of the $p_i$ term within the integral, and using the Dirichlet integral (proved in the Appendix)

$$\int_{\vec p} \prod_i p_i^{\alpha_i - 1} \, d\vec p = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(|\vec\alpha|)}, \qquad (11)$$

we have

$$\hat p_i = \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_j \Gamma(n_j + \alpha_j)} \cdot \frac{\Gamma(n_i + \alpha_i + 1) \prod_{j \neq i} \Gamma(n_j + \alpha_j)}{\Gamma(|\vec\alpha| + |\vec n| + 1)}. \qquad (12)$$

At this point, we can cancel out most of the terms, and take advantage of the fact that $\Gamma(x+1) = x\,\Gamma(x)$, obtaining

$$\hat p_i = \frac{n_i + \alpha_i}{|\vec n| + |\vec\alpha|}. \qquad (13)$$

These Dirichlet densities can thus be seen as vectors of pseudocounts: probability estimates are formed by adding constants to the observed counts for each amino acid, and then renormalizing. Pseudocount methods are widely used to avoid zero probabilities in building statistical models. Note that when $\vec n = \vec 0$ (in the absence of data), the estimate produced is simply $\alpha_i / |\vec\alpha|$, the normalized values of the $\vec\alpha$ parameters, which are the means of the Dirichlet density. This mean, while not necessarily the background frequency of the amino acids in the training set, is often a close approximation to it. Thus, in the absence of data, our estimate of the expected amino acid probabilities will be close to the background frequencies. The simplicity of the pseudocount method is one of the reasons Dirichlet densities are so attractive.
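In code, the estimate of equation (13) is one line; the sketch below (ours) also records the conjugacy view: by Lemma 1, the posterior over $\vec p$ is itself a Dirichlet density with parameters $\vec\alpha + \vec n$, and equation (13) is simply its mean.

```python
import numpy as np

def mean_posterior_single(counts, alpha):
    """Equation (13): (n_i + alpha_i) / (|n| + |alpha|).  By Lemma 1 the
    posterior over p is Dirichlet with parameters alpha + n, and this
    estimate is the mean of that posterior."""
    counts = np.asarray(counts, float)
    alpha = np.asarray(alpha, float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

# With no data this returns the prior mean alpha/|alpha|; as |n| grows it
# approaches the maximum-likelihood estimate n/|n|.
```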

Programs to compute the expected amino acid frequencies are available via anonymous ftp from our ftp site, ftp.cse.ucsc.edu, and on our web site at http://www.cse.ucsc.edu/research/compbio.

Computing probabilities using mixture densities

In the case of a mixture density, we compute the amino acid probabilities in a similar way:

$$\hat p_i = \mathrm{Prob}(\text{amino acid } i \mid \Theta, \vec n) = \int_{\vec p} p_i \, \mathrm{Prob}(\vec p \mid \Theta, \vec n) \, d\vec p. \qquad (14)$$

As in the case of the single density, we can substitute $p_i$ for $\mathrm{Prob}(\text{amino acid } i \mid \vec p)$. In addition, since $\rho$ is a mixture of Dirichlet densities, by the definition of a mixture (equation (5)) we can expand $\mathrm{Prob}(\vec p \mid \Theta, \vec n)$, obtaining

$$\hat p_i = \int_{\vec p} p_i \left( \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \, \mathrm{Prob}(\vec p \mid \vec\alpha_j, \vec n) \right) d\vec p. \qquad (15)$$

In this equation, $\mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n)$ is the posterior probability of the $j$-th component of the density, given the vector of counts $\vec n$ (equation (18) below). It captures our assessment that the $j$-th component was chosen in step 1 of the stochastic process generating these observed amino acids. The other term, $\mathrm{Prob}(\vec p \mid \vec\alpha_j, \vec n)$, then represents the probability of each distribution $\vec p$, given component $j$ and the count vector $\vec n$.

We can pull out terms not depending on $\vec p$ from inside the integral, giving us

$$\hat p_i = \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \int_{\vec p} p_i \, \mathrm{Prob}(\vec p \mid \vec\alpha_j, \vec n) \, d\vec p. \qquad (16)$$

At this point, we use the result from equation (13), and obtain

$$\hat p_i = \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \, \frac{n_i + \alpha_{j,i}}{|\vec n| + |\vec\alpha_j|}. \qquad (17)$$

Hence, instead of identifying one single component of the mixture that accounts for the observed data, we determine how likely each individual component is to have produced the data. Each component then contributes pseudocounts proportional to the posterior probability that it produced the observed counts. In this case, when $\vec n = \vec 0$, $\hat p_i$ is simply $\sum_j q_j \, \alpha_{j,i} / |\vec\alpha_j|$, the weighted sum of the means of each Dirichlet density in the mixture.

When a component has a very small $|\vec\alpha|$, it adds a very small bias to the observed amino acid frequencies. As we showed in the section on interpreting these mixtures, such components give high probability to all distributions peaked around individual amino acids. The addition of such a small bias allows these components not to shift the estimated amino acids away from conserved distributions, even when relatively small amounts of data are available.

By contrast, components having a larger $|\vec\alpha|$ tend to favor mixed distributions, that is, combinations of amino acids. In these cases, the individual $\alpha_{j,i}$ values tend to be relatively large for those amino acids $i$ preferred by the component. When such a component has high probability given a vector of counts, these $\alpha_{j,i}$ have a corresponding influence on the expected amino acids predicted for that position. The estimates produced may include significant probability for amino acids not seen at all in the count vector under consideration.

Moreover, examining equation (17) reveals a smooth transition between reliance on the prior information in the absence of sufficient data, and confidence that the observed frequencies in the available training data represent the expected probabilities in the family as a whole, as the number of observations increases. When the number of observations is small, the mixture prior has the greatest effect in determining the posterior estimate. But as the number of observations increases, the $n_i$ values will dominate the $\alpha$ values. Importantly, as the number of observations increases, this estimate approaches the maximum likelihood estimate, $\hat p_i = n_i / |\vec n|$.

Thus, in the case of a mixture density, we will first want to calculate the quantity $\mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n)$, for each $j$ between 1 and $l$. This quantity is computed from Bayes' rule as

$$\mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) = \frac{q_j \, \mathrm{Prob}(\vec n \mid \vec\alpha_j, |\vec n|)}{\mathrm{Prob}(\vec n \mid \Theta, |\vec n|)}. \qquad (18)$$

$\mathrm{Prob}(\vec n \mid \vec\alpha_j, |\vec n|)$ is the probability of the count vector $\vec n$ given the $j$-th component of the mixture, and is derived in the Appendix. The denominator, $\mathrm{Prob}(\vec n \mid \Theta, |\vec n|)$, is defined

$$\mathrm{Prob}(\vec n \mid \Theta, |\vec n|) = \sum_{k} q_k \, \mathrm{Prob}(\vec n \mid \vec\alpha_k, |\vec n|). \qquad (19)$$

(This formula was misreported in previous work (Brown et al.; Karplus).)
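Putting equations (17) through (19) together gives the complete estimator. The Python sketch below (ours) computes the component posteriors in log space, using the Dirichlet-multinomial form of $\mathrm{Prob}(\vec n \mid \vec\alpha, |\vec n|)$ derived in the Appendix; the multinomial coefficient is omitted because it is identical for every component and cancels in equation (18). Working in log space also anticipates the underflow issues treated in the final section.

```python
import numpy as np
from scipy.special import gammaln

def log_prob_counts(counts, alpha):
    """log Prob(n | alpha, |n|), dropping the multinomial coefficient
    |n|! / prod_i n_i!, which cancels in Bayes' rule (equation (18)):
    log G(|alpha|) - log G(|n|+|alpha|)
      + sum_i [log G(n_i+alpha_i) - log G(alpha_i)]."""
    return (gammaln(alpha.sum()) - gammaln(counts.sum() + alpha.sum())
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

def mean_posterior_mixture(counts, q, alphas):
    """Equation (17): sum_j Prob(alpha_j | Theta, n) (n_i + alpha_ji)
    / (|n| + |alpha_j|), with posteriors computed in log space."""
    counts = np.asarray(counts, float)
    alphas = [np.asarray(a, float) for a in alphas]
    logs = np.array([np.log(qj) + log_prob_counts(counts, aj)
                     for qj, aj in zip(q, alphas)])
    post = np.exp(logs - logs.max())
    post /= post.sum()              # Prob(alpha_j | Theta, n), equation (18)
    return sum(pj * (counts + aj) / (counts.sum() + aj.sum())
               for pj, aj in zip(post, alphas))
```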

Derivation of Dirichlet Densities

As noted earlier, much statistical analysis has been done on amino acid distributions found in particular secondary structural environments in proteins. However, our primary focus in developing these techniques for protein modeling has been to rely as little as possible on previous knowledge and assumptions, and instead to use statistical techniques that uncover the underlying key information in the data.

Consequently, our approach, instead of beginning with secondary structure, is to take unlabeled training data (i.e., columns from multiple alignments with no secondary structure information attached) and attempt to discover those classes of distributions of amino acids that are intrinsic to the data. The statistical method employed directly estimates the most likely Dirichlet mixture density through clustering observed counts of amino acids. In most cases, the common amino acid distributions we find are easily identified (e.g., large non-polar), but we do not set out a priori to find distributions representing known structural environments.

Given a set of $m$ columns from a variety of multiple alignments, we tally the frequency of each amino acid in each column, with the end result being a vector of counts of each amino acid for each column in the dataset. Thus, our primary data is a set of $m$ count vectors. Many multiple alignments of different protein families are included, so $m$ is typically in the thousands. We fix a numbering of the amino acids from 1 to 20, so each count vector has the form $\vec n = (n_1, \ldots, n_{20})$, where $n_i$ is the number of times amino acid $i$ occurs in the column represented by this count vector.

We have used Maximum Likelihood to estimate the parameters $\Theta$ of $\rho$ from the set of count vectors; that is, we seek those parameters that maximize the probability of occurrence of the observed count vectors. We assume the three-stage stochastic model described above was used independently to generate each of the count vectors in our observed set. Under this assumption of independence, the probability of the entire set of observed frequency count vectors is equal to the product of their individual probabilities. Thus, we seek to find the model $\Theta$ that maximizes $\prod_{t=1}^{m} \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|)$. Since maximizing the probability is equivalent to minimizing its negative logarithm, this is equivalent to finding the $\Theta$ that minimizes the objective function

$$f(\Theta) = \sum_{t=1}^{m} -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|). \qquad (20)$$

In the simplest case, we have simply fixed the number of components $l$ in the Dirichlet mixture to a particular value, and then estimated the $21l$ parameters (twenty $\alpha_i$ values for each of the $l$ components, and $l$ mixture coefficients). In other experiments, we tried to estimate $l$ as well. Unfortunately, even for fixed $l$, there does not appear to be an efficient method of estimating these parameters that is guaranteed to always find the maximum likelihood estimate. However, a variant of the standard expectation-maximization (EM) algorithm for mixture density estimation works well in practice. EM has been proved to result in closer and closer approximations to a local optimum with every iteration of the learning cycle; a global optimum, unfortunately, is not guaranteed (Dempster et al.). (An introduction to this method of mixture density estimation is given in the book by Duda and Hart; we have modified their procedure to estimate a mixture of Dirichlet rather than Gaussian densities. This method for parameter estimation has also been used for other problems in biosequence analysis (Lawrence and Reilly; Cardon and Stormo).)

As the derivations that follow can become somewhat complex, we provide two tables in the Appendix to help the reader follow them: one contains a summary of the notation we use, and the other contains an index to where certain key quantities are derived or defined.

In this section, we give the derivation of the procedure to estimate the parameters of a mixture prior. As we will show, the case where the prior consists of a single density follows directly from the general case of a mixture. In the case of a mixture, we have two sets of parameters to estimate: the $\vec\alpha_j$ parameters for each component, and the mixture coefficient $q_j$ for each component. In the case of a single density, we estimate only the $\vec\alpha$ parameters.

In practice, we estimate these parameters in a two-stage process: first we estimate the $\vec\alpha_j$, keeping the mixture coefficients $q_j$ fixed; then we estimate the $q_j$, keeping the $\vec\alpha_j$ parameters fixed. This two-stage process is iterated until all estimates stabilize.

Deriving the $\vec\alpha$ parameters

Since we require that the $\alpha_i$ be strictly positive, and we want the parameters upon which we will do gradient descent to be unconstrained, we reparameterize, setting $\alpha_{j,i} = e^{w_{j,i}}$, where $w_{j,i}$ is an unconstrained real number. Then the partial derivative of the objective function (equation (20)) with respect to $w_{j,i}$ is

$$\frac{\partial f}{\partial w_{j,i}} = \sum_{t=1}^{m} \frac{\partial \left( -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|) \right)}{\partial w_{j,i}}. \qquad (21)$$

Here we introduce Lemma 2 (the proof of which is found in the Appendix):

Lemma 2:
$$\frac{\partial \left( -\log \mathrm{Prob}(\vec n \mid \Theta, |\vec n|) \right)}{\partial \alpha_{j,i}} = \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \, \frac{\partial \left( -\log \mathrm{Prob}(\vec n \mid \vec\alpha_j, |\vec n|) \right)}{\partial \alpha_{j,i}},$$

to obtain

$$\frac{\partial f}{\partial w_{j,i}} = \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \, \frac{\partial \left( -\log \mathrm{Prob}(\vec n_t \mid \vec\alpha_j, |\vec n_t|) \right)}{\partial \alpha_{j,i}} \, \frac{\partial \alpha_{j,i}}{\partial w_{j,i}}. \qquad (22)$$

Using the fact that $\partial \alpha_{j,i} / \partial w_{j,i} = \alpha_{j,i}$, and introducing Lemma 3 (the proof of which is found in the Appendix):

Lemma 3:
$$\frac{\partial \log \mathrm{Prob}(\vec n \mid \vec\alpha, |\vec n|)}{\partial \alpha_i} = \Psi(|\vec\alpha|) - \Psi(|\vec\alpha| + |\vec n|) + \Psi(n_i + \alpha_i) - \Psi(\alpha_i),$$

where $\Psi$ is the digamma function (the derivative of $\log \Gamma$), we obtain

$$\frac{\partial f}{\partial w_{j,i}} = -\sum_{t=1}^{m} \alpha_{j,i} \, \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \left( \Psi(|\vec\alpha_j|) - \Psi(|\vec\alpha_j| + |\vec n_t|) + \Psi(n_{t,i} + \alpha_{j,i}) - \Psi(\alpha_{j,i}) \right). \qquad (23)$$

In optimizing the $\alpha$ parameters of the mixture, we do gradient descent on the weights $w_{j,i}$, taking a step in the direction of the negative gradient, controlling the size of the step by the variable $\eta$, during each iteration of the learning cycle. Thus, the gradient descent rule in the mixture case can now be defined as follows:

$$w_{j,i}^{\mathrm{new}} = w_{j,i}^{\mathrm{old}} - \eta \, \frac{\partial f}{\partial w_{j,i}} \qquad (24)$$

$$\phantom{w_{j,i}^{\mathrm{new}}} = w_{j,i}^{\mathrm{old}} + \eta \, \alpha_{j,i} \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \left( \Psi(|\vec\alpha_j|) - \Psi(|\vec\alpha_j| + |\vec n_t|) + \Psi(n_{t,i} + \alpha_{j,i}) - \Psi(\alpha_{j,i}) \right). \qquad (25)$$

Now, letting $S_j = \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t)$, this is

$$w_{j,i}^{\mathrm{new}} = w_{j,i}^{\mathrm{old}} + \eta \, \alpha_{j,i} \left( S_j \left( \Psi(|\vec\alpha_j|) - \Psi(\alpha_{j,i}) \right) + \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \left( \Psi(n_{t,i} + \alpha_{j,i}) - \Psi(|\vec\alpha_j| + |\vec n_t|) \right) \right). \qquad (26)$$

In the case of a single density, $\mathrm{Prob}(\vec\alpha \mid \Theta, \vec n) = 1$ for all vectors $\vec n$; thus $S = m$, and the gradient descent rule for a single density can be written as

$$w_i^{\mathrm{new}} = w_i^{\mathrm{old}} + \eta \, \alpha_i \left( m \left( \Psi(|\vec\alpha|) - \Psi(\alpha_i) \right) + \sum_{t=1}^{m} \left( \Psi(n_{t,i} + \alpha_i) - \Psi(|\vec\alpha| + |\vec n_t|) \right) \right). \qquad (27)$$

After each update of the $w$ weights, the $\alpha$ parameters are reset, and the process is continued until the change in the objective function falls below some predefined cutoff.
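One iteration of this update is straightforward to implement; the Python sketch below (ours) applies equation (25) to the weights of a single component, given the component posteriors of equation (18).

```python
import numpy as np
from scipy.special import digamma

def gradient_step_w(w, count_vectors, posteriors, eta=0.01):
    """Equation (25) for one component j: w holds the 20 values w_ji,
    count_vectors the n_t, and posteriors[t] = Prob(alpha_j | Theta, n_t)."""
    alpha = np.exp(w)                       # alpha_ji = e^{w_ji}
    a_sum = alpha.sum()
    grad = np.zeros_like(alpha)
    for n_t, p_t in zip(count_vectors, posteriors):
        n_t = np.asarray(n_t, float)
        grad += p_t * (digamma(a_sum) - digamma(a_sum + n_t.sum())
                       + digamma(n_t + alpha) - digamma(alpha))
    return w + eta * alpha * grad           # the alpha factor is d(alpha)/dw
```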

Mixture coefficient estimation

In the case of a mixture of Dirichlet densities, the mixture coefficients $q_j$ of each component are also estimated. However, since we require that the mixture coefficients be non-negative and sum to 1, we first reparameterize, setting $q_i = Q_i / |Q|$, where the $Q_i$ are constrained to be strictly positive, and $|Q| = \sum_i Q_i$. As in the first stage, we want to maximize the probability of the data given the model, which is equivalent to minimizing the objective function of equation (20), $f(\Theta) = \sum_{t=1}^{m} -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|)$. In this stage, we take the derivative of $f$ with respect to $Q_i$. However, instead of having to take iterative steps in the direction of the negative gradient, as we did in the first stage, we can set the derivative to zero and solve for those $q_i = Q_i / |Q|$ that maximize the probability of the data. As we will see, however, the new $q_i$ are a function of the previous $q_i$; thus, this estimation process must also be iterated.

Taking the gradient of $f$ with respect to $Q_i$, we obtain

$$\frac{\partial f}{\partial Q_i} = \sum_{t=1}^{m} \frac{\partial \left( -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|) \right)}{\partial Q_i}. \qquad (28)$$

This allows us to focus on the partial derivative of the log likelihood of a single count vector with respect to $Q_i$. By Lemma 4 (the proof of which is found in the Appendix):

Lemma 4:
$$\frac{\partial \left( -\log \mathrm{Prob}(\vec n \mid \Theta, |\vec n|) \right)}{\partial Q_i} = \frac{1}{|Q|} - \frac{\mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n)}{Q_i}.$$

When we sum over all observations $\vec n_t$, we obtain, in the case of a mixture,

$$\frac{\partial f}{\partial Q_i} = \sum_{t=1}^{m} \left( \frac{1}{|Q|} - \frac{\mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t)}{Q_i} \right) \qquad (29)$$

$$\phantom{\frac{\partial f}{\partial Q_i}} = \frac{m}{|Q|} - \frac{1}{Q_i} \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t). \qquad (30)$$

Since the gradient must vanish for those mixture coefficients giving the maximum likelihood, we set the gradient to zero and solve. Thus, the maximum likelihood setting for $q_i$ is

$$q_i = \frac{Q_i}{|Q|} = \frac{1}{m} \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t). \qquad (31)$$

Note that since $\sum_i \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t) = \sum_{t=1}^{m} \sum_i \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t) = m$, the mixture coefficients sum to 1, as required.

Since the re-estimated mixture coefficients are functions of the old mixture coefficients, we iterate this process until the change in the objective function falls below the predefined cutoff.

In summary, when estimating the parameters of a mixture prior, we alternate between re-estimating the $\vec\alpha_j$ parameters of each density in the mixture by gradient descent on the $w_{j,i}$ (resetting $\alpha_{j,i} = e^{w_{j,i}}$ after each iteration), and re-estimating and resetting the mixture coefficients as described above, until the process converges.
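The second stage is a one-line computation. The sketch below (ours) re-estimates the mixture coefficients from the component posteriors, per equation (31), and indicates how the two stages alternate.

```python
import numpy as np

def reestimate_q(posteriors):
    """Equation (31): q_j = (1/m) sum_t Prob(alpha_j | Theta, n_t).
    `posteriors` is an m x l array; row t holds the component posteriors
    for count vector n_t, so the result automatically sums to 1."""
    return np.asarray(posteriors, float).mean(axis=0)

# Outer loop (sketch): repeat { gradient steps on the w's (previous block);
# recompute component posteriors; q = reestimate_q(posteriors) } until the
# objective f of equation (20) stops improving.
```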

Results

The problem of estimating expected distributions over the amino acids in the absence of large amounts of data is not unique to hidden Markov models. Thus, other researchers have experimented with Dirichlet mixture priors, both those which we reported in Brown et al. and those which we developed and made available afterwards. In addition to the experiments we reported in Brown et al., which we summarize below, three independent groups of researchers (Tatusov et al.; Henikoff and Henikoff; Bailey and Elkan) used these mixtures in database search and discrimination experiments, while the work of Karplus is more information-theoretic, comparing the number of bits needed to encode the posterior probability estimates of the amino acids given different methods and different sample sizes.

HMM experiments

In our original paper on the use of Dirichlet mixture priors (Brown et al.), we described a series of experiments on building HMMs for the EF-hand motif. EF-hands are a short (roughly 30-residue) helix-loop-helix structure present in cytosolic calcium-modulated proteins (Nakayama et al.; Persechini et al.; Moncrief et al.). We chose EF-hands to demonstrate the ability of mixture priors to compensate for limited sample sizes because the motif's small size allowed many experiments to be performed relatively rapidly. For these experiments, we used the June release of the database of EF-hand sequences maintained by Kretsinger and coworkers (Nakayama et al.). We extracted the EF-hand structures from each of the sequences in the database, obtaining a set of EF-hand motifs, and constructed HMM training sets by randomly extracting subsets of various sizes.

The Dirichlet priors we used for these experiments were derived from two sources of multiple alignments: a subset of alignments from the HSSP database, suggested in Sander and Schneider, and multiple alignments we generated using HMMs to model the kinase, globin, and elongation factor families (Haussler et al.; Krogh et al.).

Using the maximum likelihood procedure described above, we estimated the parameters of a one-component and a nine-component Dirichlet mixture density from the count vectors obtained from the HSSP multiple alignments; we call these Dirichlet mixtures HSSP-1 and HSSP-9, respectively. Similar experiments were done for the HMM alignments, obtaining Dirichlet mixture priors with one component and nine components (HMM-1 and HMM-9).

In addition to the priors we estimated via maximum likelihood estimation, we tested the effectiveness of some additional priors: the standard uniform prior (called Add-One; see the discussion of pseudocount methods above); priors obtained directly from the amino acid distributions estimated by Lüthy, McLachlan, and Eisenberg (the LME distributions) for nine different structural environments; and an EF-hand custom prior, in which each component is derived from a column in our EF-hand multiple alignment. The prior derived from the nine LME distributions was obtained by forming Dirichlet densities for each of the nine LME amino acid distributions, with the same means as the original distributions. Since there is no measure of the expected variance around the mean associated with these distributions, we arbitrarily set the $|\vec\alpha|$ for each component, and set the mixture coefficients uniformly. The EF-hand custom prior was designed to determine a bound on the best possible performance of any Dirichlet mixture for this family.

For each training set size and each prior, several HMMs were built, using the method described in Krogh et al. We evaluated each HMM on a separate test set containing EF-hand sequences not in the training set, yielding an average negative log likelihood (NLL) score over all test sequences for each model; lower scores represent more accurate models. For every combination of training sample size and prior used, we took the average test-set NLL score across all models, and the standard deviation of the test-set NLL scores.

In these experiments, the EF-hand custom prior performed the best, followed by HMM-9, HSSP-9, LME, HMM-1, and HSSP-1; Add-One performed the worst. For example, the average test-set NLL score for HMMs trained using HMM-9 was lower than the average NLL score for the HMMs trained using the Add-One prior at every training set size tested. Details of the results of these experiments are given in Brown et al.

(In retrospect, the high $|\vec\alpha|$ of the LME prior may have handicapped this density in these experiments; a smaller $|\vec\alpha|$ might have been more effective.)

In our previous work, the NLL score has always been almost perfectly correlated with superior multiple alignments and database search. To further demonstrate the latter point, we tested some of the HMMs built from various priors on their ability to discriminate sequences containing the EF-hand domain from those not containing the domain. To do this, we chose models built from training samples of several sizes, using the Add-One, HMM-1, HMM-9, and EF-hand custom priors. For each sample size and prior, we built an HMM as above, and then used it to search the SWISS-PROT database for sequences that contain the EF-hand motif, using the method described in Krogh et al. The results of these database discrimination experiments confirmed the ordering of the priors by NLL score. Unfortunately, only one test was done for each combination of sample size and prior, so the results are not as statistically significant as those for NLL score.

Finally, we note that in these experiments, the data used to train HMM-1 and HMM-9 contained no EF-hand-specific proteins, yet these mixtures still produced a substantial increase in performance for the EF-hand HMMs estimated using these priors. This confirms that these priors do indeed capture some universal aspect of amino acid distributions that is meaningful across different protein families.

Exp eriments with other statistical mo dels

Karplus's work (Karplus a; Karplus b) compared, for several methods, the relative costs of encoding multiple alignments using the estimated posterior probabilities $\hat{p}_i$ of each amino acid $i$ in samples of various sizes drawn from count vectors from the BLOCKS database (Henikoff and Henikoff). Karplus noted the sample sizes for which each method was superior to the others, and whether a method's posterior probability estimate approaches the maximum-likelihood estimate in the limit as the number of observations grows unboundedly large. Karplus compared several methods:

1. Zero-offset methods, of which one variant is the popular Add-One, where a small positive constant is added to all amino acid counts.

2. Pseudocount methods, in which a different positive value is added for each amino acid, rather than one fixed constant for all amino acids.

3. Gribskov profile (or average score) method, where the scores are logarithmic, comparing the probability estimate of an amino acid in a particular context to the global (or background) probability of that amino acid. This method has been used by various researchers (Tatusov et al.; Gribskov et al.), employing any of several scoring matrices, such as the popular Dayhoff (Dayhoff et al.) and Blosum (Henikoff and Henikoff) matrices.

4. Substitution matrices, which encode the cost of substituting amino acid i for amino acid j, comparing two variants on this basic technique: adding scaled counts and/or pseudocounts. These methods are similar to those employed in method 3 above, but use matrix multiplication to compute Prob(amino acid i) rather than log Prob(amino acid i) scores.

5. Dirichlet mixture priors, with several mixture priors compared against each other.

For this problem, Dirichlet mixtures were always superior for sample sizes of two or more, and were very close to optimal for sample size one, where substitution matrices were optimal, and for sample size zero, where pseudocount methods based on background frequency were optimal. More recently, Karplus duplicated these experiments on columns drawn from the HSSP protein database and confirmed a similar ordering of these methods; the nine-component Dirichlet mixture reported in this paper also performs very well in his tests for all sample sizes tested.

Tatusov, Altschul, and Koonin propose a technique in Tatusov et al. for iterative refinement of protein profiles that is able to start with very few aligned sequences, or even a single protein segment, and repeatedly: compute a probability distribution over the amino acids for each column in the alignment; search a sequence database for protein segments that match the amino acid distributions specified by the model, according to some criterion; and multiply align all new protein segments to the model, until no new sequences scoring above a given cutoff are found. They tested several methods for estimating the expected distributions over the amino acids in the first part of this iterative model-building process. The resulting models were then tested at database discrimination tasks and their relative performances compared. The methods they compared were:

1. Average score method, incorporating the use of amino acid substitution matrices such as PAM (Altschul) or BLOSUM (Henikoff and Henikoff); identical to the third method tested by Karplus.

2. Log-odds Bayesian prediction using pseudocounts; identical to the second method tested by Karplus.

3. Data-dependent pseudocount method, where the pseudocounts are calculated using a substitution matrix; equivalent to one of the substitution matrix methods tested by Karplus.

4. Dirichlet mixture method, which incorporates a Dirichlet mixture prior into the log-odds Bayesian prediction method.

Tatusov et al. reported that the use of Dirichlet mixture priors (specifically, a nine-component mixture prior estimated from the Blocks database (Henikoff and Henikoff), quite similar to the Blocks9 prior given in the tables at the end of this paper) resulted in protein models with the highest accuracy in database discrimination, yielding the fewest false negatives and false positives overall of any of the methods compared.

Steven and Jorja Henikoff conducted a series of tests on the same methods, using a testing strategy similar to that described in Henikoff and Henikoff, and confirm these results (personal communication). Good results with these mixtures are also reported by Wang et al. (to appear), who in a related set of experiments created an expanded set of blocks using the same mixtures used in Tatusov et al., and then used these blocks to classify protein sequences.

In Bailey and Elkan, the authors report several extensions to their motif-finding tool MEME which incorporate prior information into the parameter estimation process. While the authors did not compare different methods for computing posterior estimates of amino acid densities (the other prior information introduced concerned motif width, presence or absence of the motif in the sequences being searched, and whether, as in the case of DNA sequences, the motif is expected to be a palindrome), they reported that the use of a Dirichlet mixture prior (in this case a mixture estimated from the BLOCKS database) boosted their protein database search accuracy significantly, especially in the case where few training sequences were available.

Implementation details

Implementing Dirichlet mixture priors for use in hidden Markov models or other stochastic models of biological sequences is not difficult, but there are many details that can cause problems if not handled carefully. This section splits the implementation details into two groups: those that are essential for getting working Dirichlet mixture code, and those that increase efficiency but are not essential.

Essential details

Earlier in this paper we gave the formulas for computing the amino acid probabilities in the cases of a single density and of a mixture density. For a single Dirichlet component, the estimation formula is trivial:

$$\hat{p}_i = \frac{n_i + \alpha_i}{|n| + |\vec\alpha|} ,$$

and no special care is needed in the implementation. For the case of a multi-component mixture, the implementation is not quite so straightforward.
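This formula is direct to implement; a minimal sketch in Python (the function name and list-based representation are our own illustration, not code from this paper):

\begin{verbatim}
def single_component_estimate(n, alpha):
    # p_i = (n_i + alpha_i) / (|n| + |alpha|) for one Dirichlet density
    total = sum(n) + sum(alpha)
    return [(ni + ai) / total for ni, ai in zip(n, alpha)]
\end{verbatim}

For example, with alpha = [1.0]*20 (an Add-One-style prior) and a count vector containing a single isoleucine, every amino acid is estimated at 1/21 except isoleucine, which gets 2/21.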

As we showed in the derivation of the mixture estimate,

$$\hat{p}_i = \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{n_i + \alpha_{ji}}{|n| + |\vec\alpha_j|} .$$

The interesting part, for computation, comes in computing $\mathrm{Prob}(\vec\alpha_j \mid n)$, whose formula is repeated here:

$$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)} .$$

We can expand $\mathrm{Prob}(n \mid \Theta, |n|)$ using the mixture decomposition to obtain

$$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\sum_{k=1}^{l} q_k\, \mathrm{Prob}(n \mid \vec\alpha_k, |n|)} .$$

Note that this is a simple renormalization of $q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)$ to sum to one. Rather than carry the normalization through all the equations, we can work directly with $\mathrm{Prob}(n \mid \vec\alpha_j, |n|)$ and put everything back together at the end.

First, we can expand it using Lemma 3, the proof of which is found in the Appendix:

$$\mathrm{Prob}(n \mid \vec\alpha_j, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha_j|)}{\Gamma(|n|+|\vec\alpha_j|)} \prod_i \frac{\Gamma(n_i + \alpha_{ji})}{\Gamma(n_i+1)\,\Gamma(\alpha_{ji})} .$$

If we rearrange some terms, we obtain

$$\mathrm{Prob}(n \mid \vec\alpha_j, |n|) = \frac{\Gamma(|\vec\alpha_j|) \prod_i \Gamma(n_i + \alpha_{ji})}{\Gamma(|n| + |\vec\alpha_j|) \prod_i \Gamma(\alpha_{ji})} \cdot \frac{\Gamma(|n|+1)}{\prod_i \Gamma(n_i+1)} .$$

The first two terms are most easily expressed using the Beta function,

$$B(x) = \frac{\prod_i \Gamma(x_i)}{\Gamma(|x|)} ,$$

where, as usual, $|x| = \sum_i x_i$. This simplifies the expression to

$$\mathrm{Prob}(n \mid \vec\alpha_j, |n|) = \frac{B(n + \vec\alpha_j)}{B(\vec\alpha_j)} \cdot \frac{\Gamma(|n|+1)}{\prod_i \Gamma(n_i+1)} .$$

The remaining Gamma functions are not easily expressed with a Beta function, but they don't need to be: since they depend only on $n$ and not on $j$, when we do the normalization to make the $\mathrm{Prob}(\vec\alpha_j \mid n)$ sum to one, this term will cancel out, giving us

$$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, B(n + \vec\alpha_j)/B(\vec\alpha_j)}{\sum_{k=1}^{l} q_k\, B(n + \vec\alpha_k)/B(\vec\alpha_k)} .$$

Plugging this formula into the mixture estimate gives us

$$\hat{p}_i = \frac{\sum_{j=1}^{l} q_j\, \frac{B(n+\vec\alpha_j)}{B(\vec\alpha_j)}\, \frac{n_i + \alpha_{ji}}{|n| + |\vec\alpha_j|}}{\sum_{k=1}^{l} q_k\, \frac{B(n+\vec\alpha_k)}{B(\vec\alpha_k)}} .$$

Since the denominator of this equation is independent of $i$, we can compute $\hat{p}_i$ by normalizing

$$X_i = \sum_{j=1}^{l} q_j\, \frac{B(n+\vec\alpha_j)}{B(\vec\alpha_j)}\, \frac{n_i + \alpha_{ji}}{|n| + |\vec\alpha_j|}$$

to sum to one. That is,

$$\hat{p}_i = \frac{X_i}{\sum_k X_k} .$$
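A direct transcription of these formulas is only a few lines of code. The sketch below (Python; the function names are ours, and we assume the mixture is given as a coefficient list q and a list of parameter vectors alphas) is adequate for small counts, but as the next paragraphs explain, the Beta-function ratios quickly leave floating-point range for realistic $|n|$:

\begin{verbatim}
import math

def beta(x):
    # B(x) = prod_i Gamma(x_i) / Gamma(|x|)
    prod = 1.0
    for xi in x:
        prod *= math.gamma(xi)
    return prod / math.gamma(sum(x))

def mixture_estimate_naive(n, q, alphas):
    # X_i = sum_j q_j * (B(n + alpha_j) / B(alpha_j))
    #             * (n_i + alpha_ji) / (|n| + |alpha_j|)
    X = []
    for i in range(len(n)):
        xi = 0.0
        for qj, aj in zip(q, alphas):
            n_plus_a = [nk + ak for nk, ak in zip(n, aj)]
            xi += qj * (beta(n_plus_a) / beta(aj)) \
                     * (n[i] + aj[i]) / (sum(n) + sum(aj))
        X.append(xi)
    total = sum(X)
    return [x / total for x in X]   # normalize so the estimates sum to one
\end{verbatim}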

The biggest problem that implementors run into is that these Beta functions can get very large or very small, outside the range of the floating-point representation of most computers. The obvious solution is to work with the logarithm of the Beta function:

$$\log B(x) = \log \frac{\prod_i \Gamma(x_i)}{\Gamma(|x|)} = \sum_i \log \Gamma(x_i) - \log \Gamma(|x|) .$$

Most libraries of mathematical routines include the lgamma function, which implements $\log \Gamma(x)$, and so using the logarithm of the Beta function is not difficult.
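For instance, a log-Beta helper takes one line in Python, assuming the standard library's math.lgamma:

\begin{verbatim}
import math

def log_beta(x):
    # log B(x) = sum_i log Gamma(x_i) - log Gamma(|x|)
    return sum(math.lgamma(xi) for xi in x) - math.lgamma(sum(x))
\end{verbatim}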

We could compute each $X_i$ using only the logarithmic notation, but it turns out to be slightly more convenient to use the logarithms just for the Beta functions:

$$X_i = \sum_{j=1}^{l} q_j\, \frac{n_i + \alpha_{ji}}{|\vec\alpha_j| + |n|}\, e^{\log B(\vec\alpha_j + n) - \log B(\vec\alpha_j)} .$$

Some care is needed in the conversion from the logarithmic representation back to floating-point, since the ratio of the Beta functions may be so large or so small that it cannot be represented as a floating-point number. Luckily, we do not really need to compute $X_i$, only $\hat{p}_i = X_i / \sum_k X_k$. This means that we can multiply $X_i$ by any constant, and the normalization will eliminate the constant. Equivalently, we can freely subtract a constant independent of $j$ and $i$ from $\log B(\vec\alpha_j + n) - \log B(\vec\alpha_j)$ before converting back to floating-point.

If we choose the constant to be $\max_j \left( \log B(\vec\alpha_j + n) - \log B(\vec\alpha_j) \right)$, then the largest logarithmic term will be zero, and all the terms will be reasonable. We could still get floating-point underflow to zero for some terms, but the $\hat{p}$ computation will still be about as good as can be done within a floating-point representation.
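Putting the pieces together, the following Python sketch (our own illustration, not code from this paper) computes $\hat{p}$ entirely with log-Beta differences, subtracting the largest difference before exponentiating so that the largest term becomes exp(0) = 1:

\begin{verbatim}
import math

def log_beta(x):
    return sum(math.lgamma(xi) for xi in x) - math.lgamma(sum(x))

def mixture_estimate(n, q, alphas):
    tot_n = sum(n)
    # log B(alpha_j + n) - log B(alpha_j), one value per component
    log_ratios = [log_beta([ni + ai for ni, ai in zip(n, aj)]) - log_beta(aj)
                  for aj in alphas]
    shift = max(log_ratios)          # the constant we can freely subtract
    X = []
    for i in range(len(n)):
        xi = 0.0
        for qj, aj, lr in zip(q, alphas, log_ratios):
            xi += qj * math.exp(lr - shift) \
                     * (n[i] + aj[i]) / (tot_n + sum(aj))
        X.append(xi)
    total = sum(X)
    return [x / total for x in X]
\end{verbatim}

Because the subtracted constant is independent of $i$ and $j$, the final normalization returns exactly the same $\hat{p}$ as the unshifted computation would in exact arithmetic.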

Efficiency improvements

The previous section gave simple computational formulas for $\hat{p}_i$. When computations of $\hat{p}$ are done infrequently (for example, for profiles, where $\hat{p}$ only needs to be computed once for each column of the profile), those equations are perfectly adequate.

When recomputing $\hat{p}$ frequently, as may be done in a Gibbs sampling program or when training a hidden Markov model, it is better to have a slightly more efficient computation. Since most of the computation time is spent in the lgamma function used for computing the log-Beta functions, the biggest efficiency gains come from avoiding the lgamma computations.


If we assume that the $\vec\alpha$ and $q$ values change less often than the values for $n$ (which is true of almost every application), then it is worthwhile to precompute $\log B(\vec\alpha_j)$, cutting the computation time almost in half.

If the $n_i$ values are mainly small integers (as is common in all the applications we've looked at), then it is worth precomputing $\log \Gamma(\alpha_{ji})$, $\log \Gamma(\alpha_{ji}+1)$, $\log \Gamma(\alpha_{ji}+2)$, and so on, out to some reasonable value. Precomputation should also be done for $\log \Gamma(|\vec\alpha_j|)$, $\log \Gamma(|\vec\alpha_j|+1)$, $\log \Gamma(|\vec\alpha_j|+2)$, and so forth. If all the $n_i$ values are small integers, this precomputation almost eliminates the lgamma function calls.

In some cases it may be worthwhile to build a special-purpose implementation of $\log \Gamma(x)$ that caches all calls in a hash table and does not call lgamma for values of $x$ that it has seen before. Even larger savings can be had when $x$ is close to previously computed values, by using interpolation rather than calling lgamma.
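As a sketch of these ideas (ours, not code from this paper), the component-dependent $\log B(\vec\alpha_j)$ can be computed once, and a memoized $\log \Gamma$ gives much of the benefit of the hash-table cache when the counts are small integers:

\begin{verbatim}
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def lgamma_cached(x):
    # Worthwhile only when the same arguments recur, as they do
    # when small integer n_i are added to fixed alpha_ji values.
    return math.lgamma(x)

def precompute_log_betas(alphas):
    # log B(alpha_j) changes only when the mixture itself changes,
    # so compute it once per component and reuse it for every column.
    return [sum(lgamma_cached(a) for a in aj) - lgamma_cached(sum(aj))
            for aj in alphas]
\end{verbatim}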

Conclusions and Future Research

Dirichlet mixture priors have been demonstrated to be more effective at forming accurate estimates of expected amino acid distributions than substitution-matrix-based methods, pseudocounts, and other such methods. In particular, the method presented in this paper has been shown to fix two primary weaknesses of substitution-matrix-based methods: focusing only on the relative frequency of the amino acids while ignoring the actual number of amino acids observed, and having fixed substitution probabilities for each amino acid. One of the potentially most problematic consequences of these drawbacks is that substitution-matrix-based methods do not produce estimates that are conserved, or mostly conserved, where the evidence is clear that an amino acid is conserved.

The method presented here addresses these issues. Given abundant training data, the estimate produced by these methods is very close to the actual frequencies observed. When little data is available, the amino acids predicted are those that are known to be associated, in different contexts, with the amino acids observed. In particular, when evidence exists that a particular amino acid is conserved at a given position, the expected amino acid estimates reflect this preference.

In database search for homologous sequences, Dirichlet mixtures have been shown to maximize sensitivity without sacrificing specificity. As a result, experiments using Dirichlet mixtures to estimate the expected amino acid distributions in a variety of statistical models for proteins result in fewer false negatives and false positives than when other methods are used.

The methods employed to estimate and use these mixtures have been shown to be firmly based on Bayesian statistics. While no biological knowledge has been introduced into the parameter-estimation process, the mixture priors that result agree with accepted biological understanding.

In order to be able to use these mixtures to find true remote homologs, mixtures should be estimated on alignments containing more distant homologs, rather than estimated from databases where fairly close homologs are aligned, as is the case for both the BLOCKS and HSSP databases. Another key area needing research is the weighting of sequences to remove bias. Previous work has concentrated on relative weighting schemes, but the total weight is also relevant when using Dirichlet mixtures.

Since the method for estimating these mixtures (EM) is sensitive to the initial parameter settings, we are exploring heuristics that enable us to explore the parameter space more effectively and obtain better mixtures. We are also exploring methods to compensate for the assumption that each column is generated independently, which, although it simplifies the math, is without biological basis. However, as the detailed analysis of Karplus (Karplus a; Karplus b) shows, the Dirichlet mixtures already available are close to optimal in their capacity for assisting in computing estimates of amino acid distributions given a single-column context. Thus, further work in this area will perhaps profit by focusing on obtaining information from relationships among the sequences, for instance as revealed in a phylogenetic tree, or in inter-columnar interactions.

Acknowledgments

We gratefully acknowledge the input and suggestions of Stephen Altschul, Tony Fink, Lydia Gregoret, Steven and Jorja Henikoff, and Graeme Mitchison. Special thanks to friends at LAFORIA, Université Pierre et Marie Curie in Paris, and the Biocomputing Group at the European Molecular Biology Laboratory at Heidelberg, who provided workstations, support, and scientific inspiration during the early stages of writing this paper. This work was supported in part by NSF grants CDA, IRI, and BIR; a DOE grant; an ONR grant; an NIH grant; a grant from the Danish Natural Science Research Council; a National Science Foundation Graduate Research Fellowship; and funds granted by the UCSC Division of Natural Sciences. This paper is dedicated to the memory of Tal Grossman, a dear friend and a true mensch.

References

Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. Basic local alignment search tool. JMB.

Altschul, Stephen F. Amino acid substitution matrices from an information theoretic perspective. JMB.

Asai, K., Hayamizu, S., and Onizuka, K. HMM with protein structure grammar. In Proceedings of the Hawaii International Conference on System Sciences, Los Alamitos, CA. IEEE Computer Society Press.

Bailey, Timothy L. and Elkan, Charles. The value of prior knowledge in discovering motifs with MEME. In ISMB, Cambridge, England.

Baldi, P. and Chauvin, Y. Smooth on-line learning algorithms for hidden Markov models. Neural Computation.

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. Adaptive algorithms for modeling and analysis of biological primary sequence information. Technical report, Net-ID, Inc., Cathy Place, Menlo Park, CA.

Barton, G. J. and Sternberg, M. J. Flexible protein sequence patterns: a sensitive method to detect weak structural similarities. JMB.

Berger, J. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Bernardo, J. M. and Smith, A. F. M. Bayesian Theory. John Wiley and Sons, first edition.

Bowie, J. U., Lüthy, R., and Eisenberg, D. A method to identify protein sequences that fold into a known three-dimensional structure. Science.

Brown, M. P., Hughey, R., Krogh, A., Mian, I. S., Sjolander, K., and Haussler, D. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter, L., Searls, D., and Shavlik, J., editors, ISMB, Menlo Park, CA. AAAI/MIT Press.

Bucher, Philipp, Karplus, Kevin, Moeri, Nicolas, and Hofmann, Kay. A flexible motif search technique based on generalized profiles. Computers and Chemistry.

Cardon, L. R. and Stormo, G. D. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. JMB.

Casari, G., Andrade, M., Bork, P., Boyle, J., Daruvar, A., Ouzounis, C., Schneider, R., Tamames, J., Valencia, A., and Sander, C. Scientific correspondence. Nature.

Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol.

Claverie, Jean-Michael. Information enhancement methods for large scale sequence analysis. Computers and Chemistry.

Claverie, Jean-Michael. Some useful statistical properties of position-weight matrices. Computers and Chemistry.

Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B.

Doolittle, R. F. Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, California.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. Wiley, New York.

Gradshteyn, I. S. and Ryzhik, I. M. Table of Integrals, Series, and Products. Academic Press, fourth edition.

Gribskov, M., Devereux, J., and Burgess, R. The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. NAR.

Gribskov, Michael, McLachlan, Andrew D., and Eisenberg, David. Profile analysis: detection of distantly related proteins. PNAS.

Gribskov, M., Lüthy, R., and Eisenberg, D. Profile analysis. Methods in Enzymology.

Haussler, D., Krogh, A., Mian, I. S., and Sjolander, K. Protein modeling using hidden Markov models: analysis of globins. In Proceedings of the Hawaii International Conference on System Sciences, Los Alamitos, CA. IEEE Computer Society Press.

Henikoff, Steven and Henikoff, Jorja G. Automated assembly of protein blocks for database searching. NAR.

Henikoff, Steven and Henikoff, Jorja G. Amino acid substitution matrices from protein blocks. PNAS.

Henikoff, Steven and Henikoff, Jorja G. Position-based sequence weights. JMB.

Henikoff, Steven and Henikoff, Jorja G. Personal communication.

Henikoff, Steven, Wallace, James C., and Brown, Joseph P. Finding protein similarities with nucleotide sequence databases. Methods in Enzymology.

Hughey, Richard. Massively parallel biosequence analysis. Technical Report UCSC-CRL, University of California, Santa Cruz, CA.

Karplus, Kevin (a). Regularizers for estimating distributions of amino acids from small samples. In ISMB, Cambridge, England.

Karplus, Kevin (b). Regularizers for estimating distributions of amino acids from small samples. Technical Report UCSC-CRL, University of California, Santa Cruz. Available via ftp from ftp.cse.ucsc.edu in pub/tr/.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. Hidden Markov models in computational biology: applications to protein modeling. JMB.

Lawrence, C. E. and Reilly, A. A. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins.

Lüthy, R., McLachlan, A. D., and Eisenberg, D. Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins: Structure, Function, and Genetics.

Moncrief, N. D., Kretsinger, R. H., and Goodman, M. Evolution of EF-hand calcium-modulated proteins. I. Relationships based on amino acid sequences. Journal of Molecular Evolution.

Nakayama, S., Moncrief, N. D., and Kretsinger, R. H. Evolution of EF-hand calcium-modulated proteins. II. Domains of several subfamilies have diverse evolutionary histories. Journal of Molecular Evolution.

Nowlan, S. Maximum likelihood competitive learning. In Touretzky, D., editor, Advances in Neural Information Processing Systems. Morgan Kaufmann.

Persechini, A., Moncrief, N. D., and Kretsinger, R. H. The EF-hand family of calcium-modulated proteins. Trends in Neurosciences.

Fleischmann, R. D., et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science.

Sander, C. and Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins.

Santner, T. J. and Duffy, D. E. The Statistical Analysis of Discrete Data. Springer-Verlag, New York.

Sibbald, P. and Argos, P. Weighting aligned protein or nucleic acid sequences to correct for unequal representation. JMB.

Stultz, C. M., White, J. V., and Smith, T. F. Structural analysis based on state-space modeling. Protein Science.

Tatusov, Roman L., Altschul, Stephen F., and Koonin, Eugene V. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. PNAS.

Thompson, Julie D., Higgins, Desmond G., and Gibson, Toby J. (a). Improved sensitivity of profile searches through the use of sequence weights and gap excision. CABIOS.

Thompson, Julie D., Higgins, Desmond G., and Gibson, Toby J. (b). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. NAR.

Wang, Jason T. L., Marr, Thomas G., Shasha, Dennis, Shapiro, Bruce, Chirn, Gung-Wei, and Lee, T. Y. (to appear). Complementary classification approaches for protein sequences. Protein Engineering.

Waterman, M. S. and Perlwitz, M. D. Line geometries for sequence comparisons. Bull. Math. Biol.

White, James V., Stultz, Collin M., and Smith, Temple F. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Mathematical Biosciences.

Tables

Table: Parameters of Mixture Prior Blocks9.
(Columns: the nine components of Blocks9. Rows: the mixture coefficient $q_j$, the value $|\vec\alpha_j|$, and the twenty amino acids A through Y.)

This table contains the parameters defining a nine-component mixture prior estimated on unweighted columns from the Blocks database. The first row gives the mixture coefficient $q_j$ for each component. The second row gives $|\vec\alpha_j| = \sum_i \alpha_{ji}$ for each component; this value reflects how peaked the distribution is around the mean: the higher the value of $|\vec\alpha|$, the lower the variance around the mean. Rows A (alanine) through Y (tyrosine) contain the values of each component's parameters for that amino acid.

This mixture is available via anonymous ftp from our ftp site (ftp.cse.ucsc.edu) and on our web site at http://www.cse.ucsc.edu/research/compbio.

Analysis of the Nine-Component Dirichlet Mixture Prior Blocks9

Table: Preferred amino acids of Blocks9.
(Each row lists, for one component, the amino acids grouped by the ratio $r$ of their frequency in the component to their background frequency, from highest $r$ to lowest.)

Component 1: SAT, CGP, NVM, QHRIKFLDW, EY
Component 2: Y, FW, H, LM, NQICVSR, TPAKDGE
Component 3: QE, KNRSHDTA, MPYG, VLIWCF
Component 4: KR, Q, H, NETMS, PWYALGVCI, DF
Component 5: LM, I, FV, WYCTQ, APHR, KSENDG
Component 6: IV, LM, CTA, F, YSPWN, EQKRDGH
Component 7: D, EN, QHS, KGPTA, RY, MVLFWIC
Component 8: M, IVLFTYCA, WSHQRNK, PEG, D
Component 9: PGW, CHRDE, NQKFYTLAM, SVI

The function used to compute the ratio of the frequency of amino acid $i$ in component $j$ relative to the background frequency predicted by the mixture as a whole is

$$r_{ji} = \frac{\alpha_{ji} / |\vec\alpha_j|}{\sum_k q_k\, \alpha_{ki} / |\vec\alpha_k|} .$$

An analysis of the amino acids favored by each component reveals the following (a sketch computing these ratios follows the list):

1. Component 1 favors small neutral residues.

2. Component 2 favors the aromatics.

3. Component 3 gives high probability to most of the polar residues, except for C, Y, and W. Since cysteine can play two roles, either as a disulfide or as a free thiol, in this component it is apparently appearing in its disulfide role.

4. Component 4 gives high probability to positively charged amino acids, especially K and R, and to Q, favoring residues with long side chains that can function as hydrogen donors.

5. Component 5 gives high probability to residues that are both large and non-polar.

6. Component 6 prefers I and V, aliphatic residues commonly found in beta sheets, and allows substitutions with L and M.

7. Component 7 gives high probability to negatively charged residues, allowing substitutions with certain of the hydrophilic polar residues.

8. Component 8 gives high probability to methionine, but allows substitution with most neutral residues, especially the aliphatics.

9. Component 9 gives high probability to distributions peaked around individual amino acids, especially P, G, W, and C.
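The ratio computation itself is straightforward; a small Python sketch (our own, assuming the same q/alphas representation used in the implementation section):

\begin{verbatim}
def preference_ratios(q, alphas):
    # r_ji = (alpha_ji / |alpha_j|) / (sum_k q_k * alpha_ki / |alpha_k|)
    means = [[ai / sum(aj) for ai in aj] for aj in alphas]
    background = [sum(qk * mk[i] for qk, mk in zip(q, means))
                  for i in range(len(means[0]))]
    return [[m[i] / background[i] for i in range(len(background))]
            for m in means]
\end{verbatim}

Sorting each component's amino acids by $r_{ji}$, largest first, reproduces the orderings shown in the table above.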

Posterior Probability of the Components of Blocks9

Table: The posterior probability of each component of Blocks9, $\mathrm{Prob}(\vec\alpha_j \mid n)$, given increasing numbers of isoleucines and no other amino acids. (Columns: the nine components; rows: the number of isoleucines observed.) Initially, component 6, which favors I and V, is most likely, but as more isoleucines are seen without any valines, component 9, which favors distributions peaked around a single residue, becomes more likely.
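The posterior shift illustrated by this table can be reproduced with the component-posterior formula alone. A toy demonstration in Python (the two-component mixture here is hypothetical, chosen only to show the effect, with one spread-out and one sharply peaked component):

\begin{verbatim}
import math

def log_beta(x):
    return sum(math.lgamma(xi) for xi in x) - math.lgamma(sum(x))

def component_posteriors(n, q, alphas):
    # Prob(alpha_j | n): renormalize q_j * B(n + alpha_j) / B(alpha_j)
    logs = [math.log(qj)
            + log_beta([ni + ai for ni, ai in zip(n, aj)]) - log_beta(aj)
            for qj, aj in zip(q, alphas)]
    shift = max(logs)
    w = [math.exp(l - shift) for l in logs]
    total = sum(w)
    return [x / total for x in w]

q = [0.5, 0.5]
alphas = [[1.0, 1.0, 0.2],    # spread over the first two letters
          [5.0, 0.1, 0.1]]    # peaked on the first letter
for k in (1, 3, 10):          # k observations of the first letter only
    print(k, component_posteriors([k, 0, 0], q, alphas))
\end{verbatim}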

Table: Estimated amino acid probabilities using various methods, given one isoleucine.
(Columns, grouped as pseudocount methods, substitution matrices, and Dirichlet densities: Add-one, Add-share, Blosum, Subst, one-comp, and Blocks9. Rows: amino acids A through Y.)

This table and the three that follow give amino acid probability estimates produced by different methods, given a varying number of isoleucines observed and no other amino acids. The methods used to estimate these probabilities are: Add-one, which adds 1 to each count and then renormalizes; Add-share, which adds a smaller fixed share to each count and renormalizes; Blosum, which does Gribskov average score (Gribskov et al.) using the Blosum matrix (Henikoff and Henikoff); Subst, which does a matrix multiply with an optimized substitution matrix; one-comp, which is a single-component Dirichlet density optimized for the Blocks database (Karplus a); and Blocks9, which is the nine-component Dirichlet mixture given in the Blocks9 parameter table above.

Table: Estimated amino acid probabilities using various methods, given three isoleucines. (Same columns and rows as the preceding table; see its caption for details.)

Table: Estimated amino acid probabilities using various methods, given five isoleucines. (Same columns and rows as the one-isoleucine table; see its caption for details.)

Table: Estimated amino acid probabilities using various methods, given ten isoleucines. (Same columns and rows as the one-isoleucine table; see its caption for details.)

A Appendix

$|x| = \sum_i x_i$, where $x$ is any vector.

$n = (n_1, \ldots, n_{20})$ is a vector of counts from a column in a multiple alignment.

$n_t$ is the $t$-th such observation in the data set.

$|n| = \sum_i n_i$: the number of amino acids observed in a given column of a multiple alignment.

$|n_t| = \sum_i n_{ti}$: the number of amino acids observed in the $t$-th count vector $n_t$.

$p = (p_1, \ldots, p_{20})$, with $\sum_i p_i = 1$: the parameters of the multinomial distributions from which the $n$ are drawn.

$\mathcal{P}$ is the set of all such $p$.

$\vec\alpha = (\alpha_1, \ldots, \alpha_{20})$, with $\alpha_i > 0$: the parameters of a Dirichlet density.

$|\vec\alpha| = \sum_i \alpha_i$: a measure of the peakedness of the Dirichlet density with parameters $\vec\alpha$.

$\vec\alpha_j$: the parameters of the $j$-th component of the Dirichlet mixture.

$\alpha_{ji}$: the value of the $i$-th parameter of the $j$-th component of the Dirichlet mixture.

$\alpha_i$: the value of the $i$-th element of a Dirichlet density.

$q_j = \mathrm{Prob}(\vec\alpha_j)$: the mixture coefficient of the $j$-th component of the mixture.

$\Theta = \{q_1, \ldots, q_l, \vec\alpha_1, \ldots, \vec\alpha_l\}$: all the parameters of the Dirichlet mixture.

$w = (w_1, \ldots, w_{20})$: unconstrained values upon which we do gradient descent during training; after each training cycle, $\alpha_{ji}$ is set to $e^{w_{ji}}$.

$w_{ji}$: the value of the $i$-th parameter of the $j$-th weight vector. (The nomenclature "weights" comes from artificial neural networks.)

$m$: the number of columns from multiple alignments used in training.

$l$: the number of components in the mixture.

$\eta$: the learning rate used to control the size of the step taken during each iteration of gradient descent.

Table: Summary of notation.

$f = -\sum_{t=1}^{m} \log \mathrm{Prob}(n_t \mid \Theta, |n_t|)$: the objective function minimized.

$\Gamma$: the Gamma function, with $\Gamma(n+1) = n!$ for integer $n$.

$\Psi(x) = \frac{\partial \log \Gamma(x)}{\partial x} = \frac{\Gamma'(x)}{\Gamma(x)}$: the Psi function.

$\mathrm{Prob}(n \mid p, |n|) = \Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)}$: the probability of $n$ under the multinomial distribution with parameters $p$.

$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)}$: the probability of $n$ under the Dirichlet density with parameters $\vec\alpha$.

$\mathrm{Prob}(n \mid \Theta, |n|) = \sum_{k=1}^{l} q_k\, \mathrm{Prob}(n \mid \vec\alpha_k, |n|)$: the probability of $n$ given the entire mixture prior.

$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)}$: shorthand for the posterior probability of the $j$-th component of the mixture, given the vector of counts $n$.

Table: Index to key derivations and definitions.

A.1 Lemma 1: $\mathrm{Prob}(n \mid p, |n|) = \Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)}$

Proof. For a given vector of counts $n$, with $p_i$ being the probability of seeing the $i$-th amino acid and $|n| = \sum_i n_i$, there are

$$\frac{|n|!}{n_1!\, n_2! \cdots n_{20}!}$$

distinct permutations of the amino acids which result in the count vector $n$. If we allow the assumption that each column is generated independently, then each such permutation has probability $\prod_i p_i^{n_i}$. Thus, the probability of a given count vector $n$, given the multinomial parameters $p$, is

$$\mathrm{Prob}(n \mid p, |n|) = \frac{|n|!}{n_1!\, n_2! \cdots n_{20}!} \prod_i p_i^{n_i} = |n|! \prod_i \frac{p_i^{n_i}}{n_i!} .$$

Since we may need to handle real-valued data, such as that obtained from using a weighting scheme on the sequences in the training set, we introduce the Gamma function, the continuous generalization of the integer factorial function:

$$\Gamma(n+1) = n! \,.$$

Substituting the Gamma function, we obtain the equivalent form

$$\mathrm{Prob}(n \mid p, |n|) = \Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)} . \qquad\Box$$
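As a quick sanity check of the lemma, consider a three-observation column containing two alanines and one cysteine, so that $n_A = 2$, $n_C = 1$, and $|n| = 3$:

$$\mathrm{Prob}(n \mid p, |n|) = \frac{\Gamma(4)}{\Gamma(3)\,\Gamma(2)}\, p_A^2\, p_C = \frac{6}{2 \cdot 1}\, p_A^2\, p_C = 3\, p_A^2\, p_C ,$$

which is just the familiar multinomial count of the three orderings AAC, ACA, and CAA.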

A.2 Lemma 2: $\mathrm{Prob}(p \mid \vec\alpha) = \frac{\Gamma(|\vec\alpha|)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i - 1}$

Proof. Under the Dirichlet density with parameters $\vec\alpha$, the probability of the distribution $p$ (where $p_i \geq 0$ and $\sum_i p_i = 1$) is defined as follows:

$$\mathrm{Prob}(p \mid \vec\alpha) = \frac{\prod_i p_i^{\alpha_i - 1}}{\int_{p' \in \mathcal{P}} \prod_i {p'_i}^{\alpha_i - 1}\, dp'} .$$

We introduce two formulas concerning the Beta function: its definition (Gradshteyn and Ryzhik),

$$B(x, y) = \int_0^1 t^{x-1}(1-t)^{y-1}\, dt = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)} ,$$

and the combining formula (Gradshteyn and Ryzhik),

$$\int_0^b t^{x-1}(b-t)^{y-1}\, dt = b^{x+y-1}\, B(x, y) .$$

This allows us to write the integral over all $p$ vectors as a multiple integral, rearrange some terms, and obtain

$$\int_{p \in \mathcal{P}} \prod_i p_i^{\alpha_i - 1}\, dp = B(\alpha_1, \alpha_2)\, B(\alpha_1 + \alpha_2, \alpha_3) \cdots = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(|\vec\alpha|)} .$$

This allows us to now give an explicit definition of the probability of the point $p$ given the Dirichlet density with parameters $\vec\alpha$:

$$\mathrm{Prob}(p \mid \vec\alpha) = \frac{\Gamma(|\vec\alpha|)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i - 1} . \qquad\Box$$

A.3 Lemma 3: $\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)}$

Proof. Since

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \int_{p \in \mathcal{P}} \mathrm{Prob}(n \mid p, |n|)\, \mathrm{Prob}(p \mid \vec\alpha)\, dp ,$$

substituting the results of Lemmas 1 and 2 into this equation, we obtain

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\prod_i \Gamma(n_i+1)\,\Gamma(\alpha_i)} \int_{p \in \mathcal{P}} \prod_i p_i^{\,n_i+\alpha_i-1}\, dp .$$

Pulling out terms not depending on $p$ from inside the integral, and using the result for this integral from Lemma 2, we obtain

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\prod_i \Gamma(n_i+1)\,\Gamma(\alpha_i)} \cdot \frac{\prod_i \Gamma(n_i+\alpha_i)}{\Gamma(|n|+|\vec\alpha|)} .$$

At this point, we can simply rearrange a few terms and obtain the equivalent form

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)} . \qquad\Box$$

A.4 Lemma 4: $\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\Gamma(|\vec\alpha| + |n|)}{\prod_i \Gamma(\alpha_i + n_i)} \prod_i p_i^{\,\alpha_i + n_i - 1}$

Proof. By Bayes' rule, the probability of the distribution $p$, given the Dirichlet density with parameters $\vec\alpha$ and the observed amino acid count vector $n$, is defined

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\mathrm{Prob}(n \mid p, \vec\alpha, |n|)\, \mathrm{Prob}(p \mid \vec\alpha)}{\mathrm{Prob}(n \mid \vec\alpha, |n|)} .$$

However, once the point $p$ is fixed, the probability of $n$ no longer depends on $\vec\alpha$. Hence,

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\mathrm{Prob}(n \mid p, |n|)\, \mathrm{Prob}(p \mid \vec\alpha)}{\mathrm{Prob}(n \mid \vec\alpha, |n|)} .$$

At this point we apply the results from the previous derivations for the quantities $\mathrm{Prob}(n \mid p, |n|)$ (Lemma 1), $\mathrm{Prob}(p \mid \vec\alpha)$ (Lemma 2), and $\mathrm{Prob}(n \mid \vec\alpha, |n|)$ (Lemma 3). This gives us

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)} \cdot \frac{\Gamma(|\vec\alpha|)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i-1}}{\frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)}} .$$

Most of the terms cancel, and we have

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\Gamma(|\vec\alpha| + |n|)}{\prod_i \Gamma(\alpha_i + n_i)} \prod_i p_i^{\,\alpha_i + n_i - 1} .$$

Note that this is the expression for a Dirichlet density with parameters $\vec\alpha + n$. This property, that the posterior density over $p$ is from the same family as the prior, characterizes all conjugate priors, and is one of the properties that make Dirichlet densities so attractive. $\Box$

A.5 Lemma 5: $\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}}$

Proof. The derivative of the logarithm of the probability of each individual observation $n$, given the mixture, with respect to $\alpha_{ji}$ is

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \frac{1}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} .$$

Applying the mixture decomposition, this gives us

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \frac{1}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial}{\partial \alpha_{ji}} \sum_{k=1}^{l} q_k\, \mathrm{Prob}(n \mid \vec\alpha_k, |n|) .$$

Since the derivative of $\mathrm{Prob}(n \mid \vec\alpha_k, |n|)$ with respect to $\alpha_{ji}$ is zero for all $k \neq j$, and the mixture coefficients $q_k$ are independent parameters, this yields

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \frac{q_j}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}} .$$

We rearrange the posterior formula somewhat and replace $\frac{q_j}{\mathrm{Prob}(n \mid \Theta, |n|)}$ by its equivalent $\frac{\mathrm{Prob}(\vec\alpha_j \mid n)}{\mathrm{Prob}(n \mid \vec\alpha_j, |n|)}$, obtaining

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{1}{\mathrm{Prob}(n \mid \vec\alpha_j, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}} .$$

Here, again using the fact that $\frac{\partial \log f(x)}{\partial x} = \frac{f'(x)}{f(x)}$, we obtain the final form

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}} . \qquad\Box$$

A.6 Lemma 6: $\frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha, |n|)}{\partial \alpha_i} = \Psi(|\vec\alpha|) - \Psi(|n| + |\vec\alpha|) + \Psi(n_i + \alpha_i) - \Psi(\alpha_i)$

Proof. In this proof we use Lemma 3, giving

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)} .$$

Since the derivatives of terms not depending on $\alpha_i$ are zero, we obtain, for a single vector of counts $n$,

$$\frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha, |n|)}{\partial \alpha_i} = \frac{\partial}{\partial \alpha_i} \left( \log \Gamma(|\vec\alpha|) - \log \Gamma(|n|+|\vec\alpha|) + \log \Gamma(n_i+\alpha_i) - \log \Gamma(\alpha_i) \right) .$$

Now, if we substitute the shorthand

$$\Psi(x) = \frac{\partial \log \Gamma(x)}{\partial x} = \frac{\Gamma'(x)}{\Gamma(x)} ,$$

we have

$$\frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha, |n|)}{\partial \alpha_i} = \Psi(|\vec\alpha|) - \Psi(|n|+|\vec\alpha|) + \Psi(n_i+\alpha_i) - \Psi(\alpha_i) . \qquad\Box$$
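This gradient is simple to evaluate numerically. A sketch (ours) in Python, assuming SciPy's digamma for the Psi function:

\begin{verbatim}
from scipy.special import digamma   # Psi(x) = d/dx log Gamma(x)

def dlogprob_dalpha(n, alpha):
    # Gradient of log Prob(n | alpha, |n|) with respect to each
    # alpha_i, in the Psi-function form of the lemma above.
    tot_n, tot_a = sum(n), sum(alpha)
    return [digamma(tot_a) - digamma(tot_n + tot_a)
            + digamma(ni + ai) - digamma(ai)
            for ni, ai in zip(n, alpha)]
\end{verbatim}

Combined with Lemma 5, weighting this vector by $\mathrm{Prob}(\vec\alpha_j \mid n)$ gives the gradient of the mixture log-likelihood with respect to the parameters of component $j$.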

A.7 Lemma 7: $\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \mathrm{Prob}(n \mid \Theta, |n|)}{|Q|}$

Proof. Substituting $\mathrm{Prob}(n \mid \Theta, |n|) = \sum_j q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)$ and replacing $q_j$ by $Q_j / |Q|$, this gives us

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\partial}{\partial Q_i} \sum_j \frac{Q_j}{|Q|}\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|) .$$

As the derivative of a sum is the sum of the derivatives, we can use the standard product rule for differentiation and obtain

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \sum_j \left( \frac{\partial (Q_j / |Q|)}{\partial Q_i}\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|) + \frac{Q_j}{|Q|}\, \frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial Q_i} \right) .$$

Since $\frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial Q_i} = 0$ for all $j$, this gives us

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \sum_j \mathrm{Prob}(n \mid \vec\alpha_j, |n|)\, \frac{\partial (Q_j / |Q|)}{\partial Q_i} .$$

Taking the derivative of the fraction $Q_j / |Q|$ with respect to $Q_i$, we obtain

$$\frac{\partial (Q_j / |Q|)}{\partial Q_i} = \frac{\partial Q_j / \partial Q_i}{|Q|} - \frac{Q_j}{|Q|^2}\, \frac{\partial |Q|}{\partial Q_i} .$$

The first term is zero when $j \neq i$, and is simply $1/|Q|$ when $j = i$. The second term is simply $Q_j / |Q|^2$, since $\partial |Q| / \partial Q_i = 1$. Thus,

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|)}{|Q|} - \sum_j \frac{Q_j}{|Q|^2}\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|) .$$

Here, $q_j = Q_j / |Q|$ allows us to replace $Q_j / |Q|^2$ with $q_j / |Q|$, giving us

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \sum_j q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{|Q|} .$$

At this point we use the mixture decomposition once more and obtain

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \mathrm{Prob}(n \mid \Theta, |n|)}{|Q|} . \qquad\Box$$

A.8 Lemma 8: $\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{1}{|Q|} \left( \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{q_i} - 1 \right)$

Proof.

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{1}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} .$$

Here we use Lemma 7, which allows us to express the derivative with respect to $Q_i$ of the log-likelihood of a single observation $n$ given the mixture as

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \mathrm{Prob}(n \mid \Theta, |n|)}{|Q|\, \mathrm{Prob}(n \mid \Theta, |n|)} = \frac{1}{|Q|} \left( \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)} - 1 \right) .$$

If we rearrange the posterior formula, we obtain $\frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)} = \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{q_i}$. This allows us to write

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{1}{|Q|} \left( \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{q_i} - 1 \right) .$$

Now we can use the identity $q_i = Q_i / |Q|$, obtaining the equivalent

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{Q_i} - \frac{1}{|Q|} . \qquad\Box$$
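In an implementation, this lemma translates into a simple gradient for the unconstrained $Q$, summed over the training count vectors. A hedged sketch (ours; the small positive floor keeping $Q$ positive is our own expedient, not part of the derivation):

\begin{verbatim}
def mixture_coefficient_step(Q, posteriors, eta=0.01):
    # d f / d Q_i = -sum_t (Prob(alpha_i | n_t) / Q_i - 1 / |Q|),
    # where f is the negative log-likelihood objective and
    # posteriors[t][i] = Prob(alpha_i | n_t).
    totQ = sum(Q)
    grad = [-sum(post[i] / Q[i] - 1.0 / totQ for post in posteriors)
            for i in range(len(Q))]
    Q_new = [max(Qi - eta * gi, 1e-10) for Qi, gi in zip(Q, grad)]
    totQ_new = sum(Q_new)
    return Q_new, [Qi / totQ_new for Qi in Q_new]  # new Q and q = Q/|Q|
\end{verbatim}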