Dirichlet Mixtures: A Method for Improving Detection of Weak
but Significant Protein Sequence Homology
Kimmen Sjölander†, Computer Science, U.C. Santa Cruz, kimmen@cse.ucsc.edu
Kevin Karplus, Computer Engineering, U.C. Santa Cruz, karplus@cse.ucsc.edu
Michael Brown, Computer Science, U.C. Santa Cruz, mpbrown@cse.ucsc.edu
Richard Hughey, Computer Engineering, U.C. Santa Cruz, rph@cse.ucsc.edu
Anders Krogh, The Sanger Centre, England, krogh@sanger.ac.uk
I. Saira Mian, Lawrence Berkeley Laboratory, U.C. Berkeley, saira@cse.ucsc.edu
David Haussler, Computer Science, U.C. Santa Cruz, haussler@cse.ucsc.edu
UCSC Technical Report
UCSC CRL
Abstract
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to
improve database search results for homologous sequences when a variable number of sequences from
a protein family or domain are known. We present a method for condensing the information in a
protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined
with observed amino acid frequencies to form estimates of expected amino acid probabilities at each
position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical
model greater generalization capacity, such that remotely related family members can be more reliably
recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and
other methods for computing these expected amino acid distributions in database search, resulting
in fewer false positives and false negatives for the families tested. This paper corrects a previously
published formula for estimating these expected probabilities, and contains complete derivations of the
Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and
suggestions for efficient implementation.
Keywords: Substitution matrices, pseudocount methods, Dirichlet mixture priors, profiles,
hidden Markov models.
† To whom correspondence should be addressed. Mailing address: Baskin Center for Computer Engineering and
Information Sciences, Applied Sciences Building, University of California at Santa Cruz, Santa Cruz, CA
Phone Fax
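
The abstract describes combining a Dirichlet mixture with observed amino acid counts to estimate the expected probabilities at a position. As a rough illustration only, the sketch below computes the standard posterior-mean estimate under a Dirichlet mixture prior: each component's pseudocounts are added to the observed counts, and the per-component estimates are averaged using the posterior probability of each component given the counts. The function names (`log_beta`, `posterior_mean`), the three-letter toy alphabet, and all numbers are assumptions made for the example and are not taken from the report; a real mixture would cover all 20 amino acids.

```python
import math


def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(math.lgamma(x) for x in v) - math.lgamma(sum(v))


def posterior_mean(counts, weights, alphas):
    """Posterior-mean probabilities under a Dirichlet mixture prior.

    counts  : observed counts n_i for each letter
    weights : mixture coefficients q_j (assumed positive, summing to 1)
    alphas  : one pseudocount vector alpha_j per component (all entries > 0)
    """
    # log of q_j * B(n + alpha_j) / B(alpha_j); the multinomial coefficient
    # is identical for every component, so it cancels on normalization.
    log_post = [
        math.log(q)
        + log_beta([n + a for n, a in zip(counts, alpha)])
        - log_beta(alpha)
        for q, alpha in zip(weights, alphas)
    ]
    # Normalize in log space to avoid underflow for large counts.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    post = [u / z for u in unnorm]

    total = sum(counts)
    probs = [0.0] * len(counts)
    for pj, alpha in zip(post, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += pj * (n + a) / denom
    return probs


# Toy example (hypothetical values): a 3-letter alphabet, two components.
toy_weights = [0.6, 0.4]
toy_alphas = [[2.0, 0.5, 0.5], [0.5, 0.5, 2.0]]
toy_probs = posterior_mean([3, 0, 1], toy_weights, toy_alphas)
```

With no observations the estimate reduces to the mixture of the component means, and as counts grow it approaches the observed frequencies. Working in log space with `math.lgamma` is one common way to keep the component posteriors numerically stable.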