Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology

Kimmen Sjölander† (Computer Science, UC Santa Cruz) kimmen@cse.ucsc.edu
Kevin Karplus (Computer Engineering, UC Santa Cruz) karplus@cse.ucsc.edu
Michael Brown (UC Santa Cruz) mpbrown@cse.ucsc.edu
Richard Hughey (Computer Engineering, UC Santa Cruz) rph@cse.ucsc.edu
Anders Krogh (The Sanger Centre, England) krogh@sanger.ac.uk
I. Saira Mian (Lawrence Berkeley Laboratory, UC Berkeley) saira@cse.ucsc.edu
David Haussler (Computer Science, UC Santa Cruz) haussler@cse.ucsc.edu

UCSC Technical Report (UCSC-CRL series)

Abstract

This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.

Keywords: substitution matrices, pseudocount methods, Dirichlet mixture priors, profiles, hidden Markov models

† To whom correspondence should be addressed. Mailing address: Baskin Center for Computer Engineering and Information Sciences, Applied Sciences Building, University of California at Santa Cruz, Santa Cruz, CA 95064.

Introduction

Recently, the first complete genome for a free-living organism was sequenced. In July 1995, The Institute for Genomic Research (TIGR) announced in Science the complete DNA sequence of Haemophilus influenzae Rd (Fleischmann et al.). Along with this sequence came over a thousand predicted protein genes. It is not every day that the protein databases get such a large influx of novel proteins, and within days protein scientists were hard at work analyzing the data (Casari et al.). One of the main techniques used to analyze these proteins is to find similar proteins in the database whose structure or function are already known. When two sequences share residue identity above a threshold, and each is of sufficient length, the two sequences are said to be homologous, i.e., they share the same overall structure (Doolittle). If the structure of one of the sequences has been determined experimentally, then the structure of the new protein can be inferred from the other. If one is fortunate, and a large number of homologous sequences are found, then it may be possible to tackle the somewhat more difficult problem: inferring the new protein's functions.

However, requiring a minimum residue identity can mean that no sequences of known structure are deemed homologous to the new sequence. Does this mean that we can then assume that the three-dimensional structure of this new sequence is in a class of its own? This may be the case some fraction of the time. But it is more likely that some remote homolog exists in the database, sharing a common structure but having residue identity below the threshold.

Moreover, the problem of finding homologous sequences, close or remote, is not limited to the case where one has a single protein. One may have several sequences available for a given family, but expect that other family members exist in the databases, and want to locate these putative members. Finding these remote homologs is one of the primary motivating forces behind the development of new types of statistical models for protein families and domains in recent years. It is also a key motivation for the work presented here.

Database search using statistical models

Statistical models for proteins are objects, like profiles, that capture the statistics defining a protein family or domain. Along with parameters expressing the expected amino acids at each position in the molecule or domain (and possibly other parameters as well), a statistical model will have a scoring function for sequences with respect to the model. These models come in various forms. Profiles and their many offshoots (Gribskov et al.; Bucher et al.; Barton and Sternberg; Altschul et al.; Waterman and Perlwitz; Thompson et al.; Bowie et al.; Lüthy et al.), position-specific scoring matrices (Henikoff et al.), and hidden Markov models (HMMs) (Churchill; White et al.; Stultz et al.; Krogh et al.; Hughey; Baldi et al.; Baldi and Chauvin; Asai et al.) have all been proposed, and demonstrated effective for particular tasks under certain conditions.

In contrast with homology determination by residue identity, statistical models use a very different technique to determine whether two sequences share a common structure. During database search with these models, each sequence in the database is assigned a score (or, negatively, a cost), generally by adding the score or cost at each position in the model. For instance, a typical cost for aligning residue $a$ at position $i$ is $-\log \mathrm{Prob}(a \mid \text{position } i)$, where the base of the logarithm is arbitrary. A sequence is determined to belong to the family (or contain the domain) if the cost of aligning the sequence to the model falls below a cutoff. This cutoff can be determined experimentally, for instance by setting it to the maximum cost for any of the known members of the family, or it can be predetermined. (Two examples of presetting the cutoff are choosing a cost that is a certain number of standard deviations below the mean cost of all the proteins in the database, in which case the number of standard deviations is predetermined, and setting the cutoff based on the statistical significance of choosing the model over a null model.)
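To make this additive scoring concrete, here is a minimal Python sketch (our own illustration, not any particular profile or HMM package); the model is represented simply as a list of per-position amino acid distributions, and the alignment is assumed gapless.

```python
import math

def sequence_cost(sequence, position_probs):
    """Additive cost of aligning a sequence to a model: the sum over
    positions i of -log Prob(residue | position i).  position_probs[i]
    is a dict mapping each amino acid to its probability at position i.
    Note: a zero probability at any position makes the cost infinite
    (math.log raises an error on 0) -- the problem discussed below."""
    return sum(-math.log(probs[res])
               for res, probs in zip(sequence, position_probs))

# A sequence is classified as a family member when its cost falls below
# a cutoff, e.g.:  is_member = sequence_cost(seq, model) < cutoff
```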

Because these parameters are used to score each sequence in the database, careful tuning of the parameters representing the expected amino acids becomes essential, and zero probabilities are particularly problematic. Allowing zero probabilities at positions gives an infinite penalty to sequences having the zero-probability residues at those positions. Even if a sequence is homologous to those used in training the model, a single mismatch at such a position would render that sequence unrecognizable by the model. On the other hand, the costs at each position are additive, so small improvements in predicting the expected amino acids at each position accumulate over the length of the sequence, and can boost a model's effectiveness significantly.

Since each of these statistical models relies on having sufficient data to estimate its parameters, modeling protein families or domains for which few sequences have been identified is quite difficult. Methods that increase the accuracy of estimating the expected amino acids at each position are thus of primary importance for these models.

We tread a thin line between specificity and sensitivity in estimating these parameters. If a model is highly specific but does not generalize well, it will recognize only a fraction of those sequences in the family. In database discrimination, this model will generate false negatives: sequences that should be labeled as family members, but are instead labeled as not belonging to the family. The model is too strict, and database search with this model produces little new information. The reverse situation occurs when we sacrifice specificity for sensitivity. In this case, the model categorizes sequences which are not in the family as family members. These false positives are obtained through models that are too lax, and while true remote homologs may be included in the set identified as family members, they may be hard to identify as such if the pool is simply too large. One of the tests of the effectiveness of a statistical modeling technique, in fact, is how well it reduces the numbers of false negatives and false positives in database discrimination.

Issues in estimating expected amino acid probabilities

The following examples illustrate the kinds of issues encountered in estimating amino acid probabilities.

In the first scenario, imagine that a deep multiple alignment of many sequences has a column containing only isoleucine, and no other amino acids. In the second scenario, an alignment of three sequences also has a column containing only isoleucine, and no other amino acids. If we estimate the expected probabilities of the amino acids in these columns to be equal to the observed frequencies, then the estimate of the expected probability of each amino acid $i$ is simply the fraction of times $i$ is observed, i.e., $\hat p_i = n_i / |\vec n|$, where $n_i$ is the frequency of amino acid $i$ in the column and $|\vec n| = \sum_i n_i$. Using this method of estimating the probabilities, we would assign a probability of 1 to isoleucine, and zero to all the other amino acids, for both of these columns. But is this estimate reasonable?
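A small illustration (ours, not from the paper's software) makes the difficulty plain: the deep column and the three-sequence column produce exactly the same maximum-likelihood estimate.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ml_estimate(column):
    """Maximum-likelihood estimate: p_i = n_i / |n| for one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}

print(ml_estimate("I" * 100)["I"], ml_estimate("I" * 100)["L"])  # 1.0 0.0
print(ml_estimate("III")["I"], ml_estimate("III")["L"])          # 1.0 0.0
```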

It is illuminating to consider the analogous problem of assessing the fairness of a coin. A coin is said to be fair if $\mathrm{Prob}(\text{heads}) = \mathrm{Prob}(\text{tails}) = 1/2$. Equivalently, if we toss a fair coin $n$ times, obtaining $h$ heads and $t$ tails, we expect $h/n$ and $t/n$ to each come closer and closer to $1/2$ as $n$ approaches infinity, in accordance with the law of large numbers. Now, if we pick a coin at random and toss it three times, and it comes up heads each time, what should our estimate of the probability of heads for this coin be? If we assume that most coins are fair, then we are unlikely to change this a priori assumption based on only a few tosses. On the other hand, if we toss the coin an additional thousand times, and it comes up heads each time, at this point very few of us would insist that the coin was indeed fair. Our estimate of this coin's probability of heads is going to be 1, or quite close to it. Given an abundance of data, we will discount any previous assumptions and believe the data.

In the first scenario, for the column from the deep alignment containing only isoleucine, the evidence is strong that isoleucine is conserved at this position. Allowing any substitutions in this position is clearly not optimal, and giving isoleucine probability 1, or close to it, appears sensible.

In the second scenario, with an alignment of only three sequences, we cannot rule out the possibility that proteins in the family not included in the training set may have other amino acids at this position. In this case, we might not want to assign isoleucine probability 1, and require that all sequences in the family (or containing the domain) have an isoleucine at this position. Instead, we might want to use prior knowledge about amino acid distributions, and modify our estimate of the expected distribution to reflect that prior knowledge. In this case, we know that where isoleucine is found, other hydrophobic residues are often found, especially leucine and valine. Our estimate of the expected distribution at this position would sensibly include these residues, and perhaps the other amino acids as well, albeit with much smaller probabilities. By contrast, when we have many sequences multiply aligned, we expect the estimate $\hat p_i = n_i / |\vec n|$ to be a close approximation of the true underlying probabilities, and any prior information about typical amino acid distributions is relatively unimportant.

Thus, the natural solution is to introduce prior information into the construction of the statistical model, interpolating smoothly between reliance on the prior information concerning likely amino acid distributions in the absence of data, and confidence in the amino acid frequencies observed at each position given abundant data. Our aim in this work is to provide a statistically well-founded Bayesian framework for obtaining this prior information, and for combining it with observed amino acid frequencies.

One final comment concerning skew is in order. A skewed sample can arise in two ways. In the first, the sample is skewed simply from the luck of the draw; this kind of skew is common in small samples, and is akin to tossing a fair coin three times and observing three heads in a row. The second type of skew is more insidious, and can occur even when large samples are drawn. In this kind of skew, one subfamily is over-represented, such that a large fraction of the sequences used to train the statistical model are minor variants of each other. This disparity among the number of sequences available from different subfamilies for a given protein is the basis for the widespread use of weighting schemes (Sibbald and Argos; Thompson et al.; Henikoff and Henikoff). If one has reason to believe that the available data over-represents some subfamilies, Dirichlet mixtures can be used in conjunction with any weighting scheme desired to produce more accurate amino acid estimates: simply weight the sequences prior to computing the expected amino acids for each position using a Dirichlet mixture. Each column in the weighted data will be a vector of counts, though probably real-valued rather than integral. Because we assume weighted data may be used as input, we have incorporated this possibility into the formula given later (equation (17)) for computing the expected amino acid distributions.

Obtaining and using prior knowledge of amino acid distributions

Fortunately, even when data from a particular family may be limited, there is no lack of data in the protein sequence databases concerning the kinds of distributions which are likely (or unlikely) in particular positions in proteins. In this work, we have attempted to condense the enormous wealth of information in the databases into the form of a mixture of densities. These densities assign a probability to every possible distribution of the amino acids. We use Maximum Likelihood (Duda and Hart; Nowlan; Dempster et al.) to estimate these mixtures, i.e., we seek to find a mixture that maximizes the probability of the observed data. Often these densities capture some prototypical distributions; taken as an ensemble, they explain the observed distributions in the databases.

There are many different commonly occurring distributions. Some of these reflect a preference for hydrophobic amino acids, some for small amino acids, and some for more complex combinations of physiochemical features. Certain combinations of these features are commonly found, while others are much rarer. Degrees of conservation differ due to the presence or absence of structural or functional constraints. In the extreme case, when an amino acid is highly conserved at a certain position in the protein family (such as the proximal histidine that coordinates the heme iron in hemoglobin), the distribution of amino acids in the corresponding column of the multiple alignment is sharply peaked on that one amino acid, whereas in other cases the distribution may be spread over many possible amino acids.

With accurate prior information about which kinds of amino acid distributions are reasonable in columns of alignments, it is possible, even with only a few sequences, to identify which of the prototypical distributions characterizing positions in proteins may have generated the amino acids observed in a particular column of the emerging statistical model. Using this informed guess, we can adjust the expected amino acid probabilities so that the estimate for that position includes the possibility of amino acids that may not have been seen at all in that position, but are consistent with observed amino acid distributions in the protein databases. This has the effect of moving estimated amino acid distributions toward known distributions, and away from distributions that are unusual biologically. The models produced are more effective at generalizing to previously unseen data, and are often superior in database search and discrimination experiments (Karplus; Tatusov et al.; Bailey and Elkan; Brown et al.).

Comparison with other methods for computing these probabilities

We are certainly not the first group to notice the need for incorporating prior information about such amino acid distributions into the parameter estimation process. Indeed, our present work has several conceptual similarities with profile methods, particularly in regard to seeking meaningful amino acid distributions for use in database search and multiple alignment (Waterman and Perlwitz; Barton and Sternberg; Gribskov et al.; Bowie et al.; Lüthy et al.; Claverie). This work also has much in common with amino acid substitution matrices, which have been used effectively in database search and discrimination tasks (Henikoff and Henikoff; Altschul).

There are two drawbacks associated with the use of substitution matrices. First, each amino acid has a fixed substitution probability with respect to every other amino acid. In any particular substitution matrix, to paraphrase Gertrude Stein, an isoleucine is an isoleucine is an isoleucine. However, an isoleucine seen in one context (for instance, in a position that is functionally conserved) will have different substitution probabilities than an isoleucine seen in another context (where any hydrophobic residue may be allowed). Second, only the relative frequency of amino acids is considered, while the actual number observed is ignored. Thus, in substitution-matrix-based methods, the expected amino acid probabilities are identical for an apparently conserved column of a deep alignment containing only isoleucines, a column containing three isoleucines, or even a single isoleucine. All three situations are treated identically, and the estimates produced are indistinguishable.

The method described here addresses both of these issues. A Dirichlet mixture prior can be decomposed into individual components, each of which is a probability density over all the possible combinations of amino acids occurring at positions in proteins. Common distributions determined by functional or structural constraints are captured by these components; these then provide position-specific substitution probabilities. In producing an estimate for the expected amino acids, the formula employed (equation (17) below) gives the greatest impact on the estimation to those components which are most likely to have generated the actual amino acids observed.

For example, in the tables accompanying this report, we give a nine-component mixture estimated on the Blocks database (Henikoff and Henikoff). In this mixture, isoleucine is seen in several contexts. One component gives high probability to all conserved distributions, i.e., distributions where a single residue is preferred over all others. Another component represents distributions preferring isoleucine and valine, but allowing leucine and methionine; i.e., this component gives high probability to aliphatic residues found in beta sheets. A third component reverses the order of residues preferred by the second, preferring leucine and methionine to isoleucine, and allowing phenylalanine and valine as less likely substitutions. A fourth component favors methionine, but allows isoleucine and the other aliphatic residues, as well as phenylalanine and a few other residues. A full description of how to interpret these mixtures in general is given in the section on interpreting Dirichlet mixtures.

When only one or two isoleucines are observed, the lion's share of the probability is shared by two components: the component preferring isoleucine and valine starts off with the highest probability, while the component preferring leucine and methionine comes in second; the conserved-distribution component and the methionine-favoring component both have relatively low probability. However, the information in the column increases rapidly as the number of sequences grows, and the probabilities of each of the components change: the components favoring mixed distributions decrease in probability, while the component which favors conserved distributions grows very rapidly in probability. At ten observed isoleucines, the conserved-distribution component dominates, and the methionine-favoring component has one of the lowest probabilities of all the components. This process is demonstrated in the accompanying table.

The estimates of the expected amino acids reflect the changing contribution of these components. Given a single observed isoleucine, the estimate gives isoleucine the largest probability, but also gives appreciable probability to valine, leucine, and methionine, revealing the influence of the components with a preference for allowing substitutions with those residues. By ten observations, isoleucine receives nearly all the probability, with valine retaining a small share. We can still see the contribution of the component with its bias toward allowing valine to substitute for isoleucine, but the predominant signal is that isoleucine is required at this position.

Moreover, the second issue, the importance of the actual number of residues observed, is addressed in the estimation formula as well. Here, as the number of observations increases, the contribution of the prior information is lessened. Even if a mixture prior does not give high probability to a particular type of distribution, as the number of sequences aligned increases, the estimate for a column becomes more and more peaked around the maximum likelihood estimate for that column, i.e., $\hat p_i$ approaches $n_i / |\vec n|$ as $|\vec n|$ increases.

Importantly, when the data indicate a residue is conserved at a particular position (i.e., most or all of the sequences in an alignment contain a given residue in one position, and a sufficient number of observations are available), the expected amino acid probabilities produced by this method will remain peaked around that residue, instead of being modified to include all the residues that substitute on average for the conserved residue, as is the case with substitution matrices. (See, for example, the estimated amino acid probabilities produced by two substitution-matrix-based methods in the accompanying tables.)

Pseudocount methods are a special case of Dirichlet mixtures, where the mixture consists of a single component. In these methods, a fixed value is added to each observed amino acid count, and then the counts are renormalized, i.e., $\hat p_i = (n_i + z_i) / \sum_j (n_j + z_j)$, where $z_i$ can be the same constant for every amino acid $i$, or can vary from one amino acid to the next. They have some of the desirable properties of Dirichlet mixtures, but because they have only a single component, they are unable to represent as complex a set of prototypical distributions. (A comparison of Dirichlet mixtures with data-dependent pseudocount methods is given in Karplus and in Tatusov et al., where Dirichlet mixtures were shown to give superior results.) We include in the tables probability estimates for two popular pseudocount methods which add the same constant for each amino acid, and can thus be called zero-offset methods: Add-One, where $z_i = 1$ for all $i$, and Add-Share, where a smaller constant is used for all $i$. The single-component Dirichlet density estimated on the Blocks database is also a pseudocount method, where $z_i = \alpha_i$ is closely related to the background frequency of amino acid $i$.
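These rules fit in a few lines of Python. The sketch below is illustrative only; the $\alpha$ values at the end are placeholders, not the fitted Blocks-derived density.

```python
def pseudocount_estimate(counts, z):
    """Single-component estimate: p_i = (n_i + z_i) / sum_j (n_j + z_j)."""
    total = sum(n + zi for n, zi in zip(counts, z))
    return [(n + zi) / total for n, zi in zip(counts, z)]

k = 20
counts = [0] * k
counts[0] = 3                       # e.g., three observations of one residue
add_one = pseudocount_estimate(counts, [1.0] * k)   # zero-offset, z_i = 1
alpha = [0.05] * k                  # placeholder single-density parameters
single_dirichlet = pseudocount_estimate(counts, alpha)
```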

The work of Lüthy, McLachlan, and Eisenberg (Lüthy et al.) also has some interesting similarities to that presented here. They analyzed multiple alignments containing secondary structure information to construct a set of nine probability distributions, which we call the LME distributions, describing the distribution of amino acids in nine different structural environments. LME distributions have been shown to increase the accuracy of profiles in both database search and multiple alignment, by enabling them to take advantage of prior knowledge of secondary structure. (In more recent work, they have used different distributions (Bowie et al.).)

These distributions cannot always be used, since in many cases structural information is not available, or the statistical model employed is not designed to take advantage of such information. For example, our method for training an HMM assumes unaligned sequences are given as input to the program, and that no secondary structure information for the sequences is available. Thus, distributions associated with particular secondary structural environments, such as the LME distributions, are inappropriate for our use. Moreover, we have an additional problem using the LME distributions in this Bayesian framework. As we will show in the section on computing probabilities using a single density, Bayes' rule requires that, in computing the amino acid probabilities, the observed frequency counts be modified less strongly when the prior distribution has a very high variance. Thus, when there is no measure of the variance associated with a distribution, as is the case with the LME distributions, one must assign a variance arbitrarily in order to use the distribution to compute the expected probabilities.

In this paper, we propose the use of mixtures of Dirichlet densities (see, e.g., Bernardo and Smith) as a means of representing prior information about expected amino acid distributions. In the next section, we give a description of ways to interpret these mixtures. The mathematical foundations of the method follow: we first describe Dirichlet densities; then, for those wishing to use these mixtures, we present a Bayesian method for combining observed amino acids with these priors to produce posterior estimates of the probabilities of the amino acids; after that comes the mathematical derivation of the learning rule for estimating Dirichlet mixtures. We then present an overview of work done both at Santa Cruz (Karplus; Brown et al.) and elsewhere (Tatusov et al.; Bailey and Elkan; Henikoff and Henikoff) that demonstrates the effectiveness of these densities in a variety of statistical models, and the superiority of this technique in general over others tried. Some pointers to help users avoid underflow and overflow problems, as well as speed up the computation of mixture estimation, are treated in the final section.

We also want to emphasize, perhaps obviously, that the method described in this paper is general, and applies not only to data drawn from columns of multiple alignments of protein sequences, but can be used to characterize distributions over other alphabets as well. For example, we have done some experiments developing Dirichlet mixtures for RNA, both for single-column statistics and for pairs of columns, and we have estimated Dirichlet densities over transition probabilities between states in hidden Markov models.

For a review of the essentials of the HMM methodology we use, including architecture, parameter estimation, multiple alignments, and database searches, see Krogh et al.

Interpreting Dirichlet Mixtures

We include in this paper a nine-component mixture estimated on the Blocks database (Henikoff and Henikoff), which has given some of the best results of any mixture estimated using the techniques described here. The accompanying table gives the parameters of this mixture.

Since a Dirichlet mixture describes the expected distributions of amino acids in the data used to estimate the mixture, it is useful to look in some detail at each individual component of the mixture, to see what distributions of amino acids it favors.

Two kinds of parameters are associated with each component: the mixture coefficient $q_j$, and the $\vec\alpha_j$ parameters, which define the distributions preferred by the component. For any distribution of amino acids, the mixture as a whole assigns a probability to the distribution by combining the probabilities given to the distribution by each of the components in the mixture.

One way to characterize a component is by giving the mean expected amino acid probabilities and the variance around the mean; formulas to compute these quantities are given in the mathematical foundations section. We can also list the amino acids for each component in order by the ratio of the mean frequency of the amino acids in the component to the background frequency of the amino acids, as in the sketch below. The accompanying table lists the preferred amino acids for each component in the mixture.
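This ordering is easy to compute; the following Python sketch (names ours, illustrative only) ranks the amino acids of one component by the ratio of its mean frequencies to the background frequencies.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def preferred_residues(alpha, background):
    """Order amino acids by (alpha_i / |alpha|) / background_i, the ratio
    of the component's mean frequency to the background frequency."""
    mean = np.asarray(alpha, float) / np.sum(alpha)
    ratio = mean / np.asarray(background, float)
    return sorted(zip(AMINO_ACIDS, ratio), key=lambda x: -x[1])
```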

The mixture coefficient $q_j$ associated with a component is equal to the probability of that component given the data, averaged over all the data; i.e., it expresses the fraction of the data represented by the component. In this mixture, the components peaked around the aromatic and the non-polar hydrophobic residues represent the smallest fraction of the columns used to train the mixture, and the component representing all the highly conserved residues represents the largest fraction of the data.

The value $|\vec\alpha| = \sum_i \alpha_i$ is a measure of the peakedness of the component about the mean: higher values of $|\vec\alpha|$ indicate that distributions must be close to the mean of the component in order to be given high probability by that component. In our experience, when we allow a large number of components, we often find that many of the components that result are peaked around individual residues and have high $|\vec\alpha|$, but this may be an artifact of our optimization technique. However, when we estimate mixtures having a limited number of components (for instance, ten or fewer), we find that one component tends to have a very small $|\vec\alpha|$, allowing this component to give high probability to essentially all pure distributions. This kind of component has high probability in most of the mixtures we have estimated, evidence that nearly pure distributions are common in the databases we have used to estimate these mixtures. Since the Blocks database was selected to favor highly conserved columns, it is not surprising that the individual components of a Dirichlet mixture tuned for the Blocks database also favor conserved columns. Mixtures tuned for the HSSP set of alignments, which contains full proteins rather than just highly conserved blocks, show similar behavior, although the $|\vec\alpha|$ of the components of these mixtures are not quite as low as the $|\vec\alpha|$ of mixtures estimated on the Blocks database.

We often find that $|\vec\alpha|$ and $q$ are inversely proportional to each other. For instance, the component with the largest mixture coefficient (representing the most common distributions) also has the smallest value of $|\vec\alpha|$. The amino acids most favored by this component (tryptophan, glycine, proline, and cysteine) are indeed the most highly conserved ones. However, as the accompanying table shows, this component gives high probability to pure distributions centered around other residues as well.

Groups of amino acids that frequently substitute for each other will tend to have one component that assigns a high probability to the members of the group, and a low probability to other amino acids. These components tend to have higher $|\vec\alpha|$. For instance, the two components with the largest values of $|\vec\alpha|$ (and so the most mixed distributions) represent the polars and the non-polar hydrophobics, respectively.

A residue may be represented primarily by one component (as proline is) or by several components (as isoleucine and valine are).

(A close variant of this mixture was used in experiments elsewhere (Tatusov et al.; Henikoff and Henikoff).)

Mathematical Foundations

What are Dirichlet densities?

A Dirichlet density $\rho$ (Berger; Santner and Duffy) is a probability density over the set of all probability vectors $\vec p$, i.e., vectors whose components satisfy $p_i \geq 0$ and $\sum_i p_i = 1$. In the case of proteins, with a 20-letter alphabet, $\vec p = (p_1, \ldots, p_{20})$, where $p_i = \mathrm{Prob}(\text{amino acid } i)$. Here, each vector $\vec p$ represents a possible probability distribution over the 20 amino acids. A Dirichlet density has parameters $\vec\alpha = (\alpha_1, \ldots, \alpha_{20})$, with $\alpha_i > 0$. The value of the density for a particular vector $\vec p$ is

$$\rho(\vec p) = \frac{\prod_i p_i^{\alpha_i - 1}}{Z}, \qquad (1)$$

where $Z$ is the normalizing constant that makes $\rho$ integrate to unity. The mean value of $p_i$ given a Dirichlet density with parameters $\vec\alpha$ is

$$E[p_i] = \alpha_i / |\vec\alpha|, \qquad (2)$$

where $|\vec\alpha| = \sum_i \alpha_i$. The second moment $E[p_i p_j]$, for the case $i \neq j$, is given by

$$E[p_i p_j] = \frac{\alpha_i \, \alpha_j}{|\vec\alpha| \, (|\vec\alpha| + 1)}. \qquad (3)$$

When $i = j$, the second moment $E[p_i^2]$ is given by

$$E[p_i^2] = \frac{\alpha_i (\alpha_i + 1)}{|\vec\alpha| \, (|\vec\alpha| + 1)}. \qquad (4)$$

In the case of a mixture prior, we assume that $\rho$ is a mixture of Dirichlet densities, and hence has the form

$$\rho = q_1 \rho_1 + \ldots + q_l \rho_l, \qquad (5)$$

where each $\rho_j$ is a Dirichlet density specified by parameters $\vec\alpha_j = (\alpha_{j,1}, \ldots, \alpha_{j,20})$, and the numbers $q_1, \ldots, q_l$ are positive and sum to 1. A density of this form is called a mixture density (or, in this specific case, a Dirichlet mixture density), and the $q_j$ values are called mixture coefficients. Each of the densities $\rho_j$ is called a component of the mixture.

The mean of a mixture is the weighted sum of the means of each of the components in the mixture, weighted by their mixture coefficients. That is, $E[p_i] = \sum_j q_j \, \alpha_{j,i} / |\vec\alpha_j|$.

We use the symbol $\Theta$ to refer to the entire set of parameters defining a prior. In the case of a mixture, $\Theta = (\vec\alpha_1, \ldots, \vec\alpha_l, q_1, \ldots, q_l)$, whereas in the case of a single density, $\Theta = \vec\alpha$.
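For concreteness, equations (2) through (4) and the mixture mean translate directly into code; the following Python sketch (helper names are ours) computes the moments of a component and the mean of a mixture.

```python
import numpy as np

def dirichlet_mean(alpha):
    """E[p_i] = alpha_i / |alpha|  (equation (2))."""
    alpha = np.asarray(alpha, float)
    return alpha / alpha.sum()

def dirichlet_second_moments(alpha):
    """Matrix of E[p_i p_j]: equations (3) and (4)."""
    alpha = np.asarray(alpha, float)
    s = alpha.sum()
    m = np.outer(alpha, alpha) / (s * (s + 1.0))            # i != j case
    m[np.diag_indices_from(m)] = alpha * (alpha + 1.0) / (s * (s + 1.0))
    return m

def mixture_mean(q, alphas):
    """E[p_i] = sum_j q_j alpha_{j,i} / |alpha_j| for a mixture."""
    return sum(qj * dirichlet_mean(aj) for qj, aj in zip(q, alphas))
```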

Computing Expected Amino Acid Probabilities

As described in the introduction, in predicting the expected probabilities of amino acids at each position in a protein family or domain, one is often hampered by insufficient or skewed data. The amino acid frequencies in the available data may be far from accurate reflections of the amino acid frequencies in all the family members.

Fortunately, we are in a position to take advantage of information contained in a Dirichlet prior. As we explained above, a Dirichlet mixture density with parameters $\Theta = (\vec\alpha_1, \ldots, \vec\alpha_l, q_1, \ldots, q_l)$ defines a probability distribution over all the possible distributions of amino acids. Given a column in a multiple alignment, we can combine the information in the prior with the observed amino acid counts to form estimates, $\hat p_i$, of the probabilities of each amino acid $i$ at that position. These estimates $\hat p_i$ of the actual $p_i$ values will differ from the estimate $\hat p_i = n_i / |\vec n|$, and should be much better when the number of observations is small.

Let us suppose that we fix a numbering of the amino acids from 1 to 20. Then each column in a multiple alignment can be represented by a vector of counts of amino acids of the form $\vec n = (n_1, \ldots, n_{20})$, where $n_i$ is the number of times amino acid $i$ occurs in the column represented by this count vector.

At this point, we must explain some assumptions we have made concerning how the observed data were generated. The mathematical formulae for estimating and using Dirichlet mixture priors described in the following sections follow directly from these assumptions. We assume that the hidden process generating each count vector $\vec n$ can be modeled by the following stochastic process:

1. First, a component $j$ from the mixture $\rho$ is chosen at random, according to the mixture coefficient $q_j$.

2. Then, a probability distribution $\vec p$ is chosen independently, according to $\mathrm{Prob}(\vec p \mid \vec\alpha_j)$, the probability defined by component $j$ over all such distributions.

3. Finally, the count vector $\vec n$ is generated according to the multinomial distribution with parameters $\vec p$.

Obviously, when $\rho$ consists of a single component, the first step is trivial, since the probability of the single component is 1. In this case, the stochastic process consists of steps 2 and 3.
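This generative model is simple to simulate; the following Python sketch (ours) draws one count vector by exactly these three steps.

```python
import numpy as np

def sample_count_vector(q, alphas, num_sequences, rng=None):
    """Generate a count vector n by the three-step stochastic process:
    (1) choose component j with probability q_j, (2) draw p from the
    Dirichlet density with parameters alpha_j, (3) draw the counts from
    a multinomial with parameters p and |n| = num_sequences."""
    rng = rng or np.random.default_rng()
    j = rng.choice(len(q), p=q)                 # step 1
    p = rng.dirichlet(alphas[j])                # step 2
    return rng.multinomial(num_sequences, p)    # step 3
```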

We can now define the estimated probability $\hat p_i$ of amino acid $i$, given a Dirichlet density with parameters $\Theta$ and observed amino acid counts $\vec n$, as follows:

$$\hat p_i = \mathrm{Prob}(\text{amino acid } i \mid \Theta, \vec n) = \int_{\vec p} \mathrm{Prob}(\text{amino acid } i \mid \vec p) \, \mathrm{Prob}(\vec p \mid \Theta, \vec n) \, d\vec p. \qquad (6)$$

The first term in the integral, $\mathrm{Prob}(\text{amino acid } i \mid \vec p)$, is simply $p_i$, the $i$-th element of the distribution vector $\vec p$. The second term, $\mathrm{Prob}(\vec p \mid \Theta, \vec n)$, represents the posterior probability of the distribution $\vec p$ under the Dirichlet density with parameters $\Theta$, given that we have observed amino acid counts $\vec n$. Taken together, the integral represents the contribution of amino acid $i$ from each probability distribution $\vec p$, weighted according to the posterior probability of $\vec p$. An estimate of this type is called a mean posterior estimate.

Computing probabilities using a single density (pseudocounts)

While we find the best results in computing these expected amino acid distributions come from employing mixtures of Dirichlet densities, it is enlightening to consider the posterior estimate of an amino acid $i$ in the case of a single density.

In the case of a single-component density with parameters $\vec\alpha$, the mean posterior estimate of the probability of amino acid $i$ is defined

$$\hat p_i = \int_{\vec p} p_i \, \mathrm{Prob}(\vec p \mid \vec\alpha, \vec n) \, d\vec p. \qquad (7)$$

(Note: we could instead choose to select amino acids independently, with probability $p_i$; the optimization problem for optimizing $\Theta$ comes out the same.)

By Lemma 1 (the proof of which is found in the Appendix), the posterior probability of each distribution $\vec p$, given the count data $\vec n$ and the density with parameters $\vec\alpha$, is

Lemma 1:
$$\mathrm{Prob}(\vec p \mid \vec\alpha, \vec n) = \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_i \Gamma(n_i + \alpha_i)} \, \prod_i p_i^{n_i + \alpha_i - 1}. \qquad (8)$$

Here, as usual, $|\vec\alpha| = \sum_i \alpha_i$, $|\vec n| = \sum_i n_i$, and $\Gamma$ is the Gamma function, the continuous generalization of the integer factorial function (i.e., $\Gamma(x+1) = x\,\Gamma(x)$).

Now, if we substitute $p_i$ for $\mathrm{Prob}(\text{amino acid } i \mid \vec p)$, and the result of Lemma 1, into equation (7), we have

$$\hat p_i = \int_{\vec p} p_i \, \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_j \Gamma(n_j + \alpha_j)} \, \prod_j p_j^{n_j + \alpha_j - 1} \, d\vec p. \qquad (9)$$

Here, we can pull those terms not depending on $\vec p$ out of the integral, obtaining

$$\hat p_i = \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_j \Gamma(n_j + \alpha_j)} \int_{\vec p} p_i \prod_j p_j^{n_j + \alpha_j - 1} \, d\vec p. \qquad (10)$$

Now, noting the contribution of the $p_i$ term within the integral, and using the Dirichlet integral (proved in the Appendix)

$$\int_{\vec p} \prod_i p_i^{\alpha_i - 1} \, d\vec p = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(|\vec\alpha|)}, \qquad (11)$$

we have

$$\hat p_i = \frac{\Gamma(|\vec\alpha| + |\vec n|)}{\prod_j \Gamma(n_j + \alpha_j)} \cdot \frac{\Gamma(n_i + \alpha_i + 1) \prod_{j \neq i} \Gamma(n_j + \alpha_j)}{\Gamma(|\vec\alpha| + |\vec n| + 1)}. \qquad (12)$$

At this point, we can cancel out most of the terms, and take advantage of the fact that $\Gamma(x+1) = x\,\Gamma(x)$, obtaining

$$\hat p_i = \frac{n_i + \alpha_i}{|\vec n| + |\vec\alpha|}. \qquad (13)$$

These Dirichlet densities can thus be seen as vectors of pseudocounts: probability estimates are formed by adding constants to the observed counts for each amino acid, and then renormalizing. Pseudocount methods are widely used to avoid zero probabilities in building statistical models. Note that when $\vec n = \vec 0$ (in the absence of data), the estimate produced is simply $\alpha_i / |\vec\alpha|$, the normalized values of the $\vec\alpha$ parameters, which are the means of the Dirichlet density. This mean, while not necessarily the background frequency of the amino acids in the training set, is often a close approximation to it. Thus, in the absence of data, our estimate of the expected amino acid probabilities will be close to the background frequencies. The simplicity of the pseudocount method is one of the reasons Dirichlet densities are so attractive.
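In code, the estimate of equation (13) is one line; the sketch below (ours) also records the conjugacy view: by Lemma 1, the posterior over $\vec p$ is itself a Dirichlet density with parameters $\vec\alpha + \vec n$, and equation (13) is simply its mean.

```python
import numpy as np

def mean_posterior_single(counts, alpha):
    """Equation (13): (n_i + alpha_i) / (|n| + |alpha|).  By Lemma 1 the
    posterior over p is Dirichlet with parameters alpha + n, and this
    estimate is the mean of that posterior."""
    counts = np.asarray(counts, float)
    alpha = np.asarray(alpha, float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

# With no data this returns the prior mean alpha/|alpha|; as |n| grows it
# approaches the maximum-likelihood estimate n/|n|.
```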

Programs to compute the expected amino acid frequencies are available via anonymous ftp from our ftp site, ftp.cse.ucsc.edu, and on our web site at http://www.cse.ucsc.edu/research/compbio.

Computing probabilities using mixture densities

In the case of a mixture density, we compute the amino acid probabilities in a similar way:

$$\hat p_i = \mathrm{Prob}(\text{amino acid } i \mid \Theta, \vec n) = \int_{\vec p} p_i \, \mathrm{Prob}(\vec p \mid \Theta, \vec n) \, d\vec p. \qquad (14)$$

As in the case of the single density, we can substitute $p_i$ for $\mathrm{Prob}(\text{amino acid } i \mid \vec p)$. In addition, since $\rho$ is a mixture of Dirichlet densities, by the definition of a mixture (equation (5)) we can expand $\mathrm{Prob}(\vec p \mid \Theta, \vec n)$, obtaining

$$\hat p_i = \int_{\vec p} p_i \left( \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \, \mathrm{Prob}(\vec p \mid \vec\alpha_j, \vec n) \right) d\vec p. \qquad (15)$$

In this equation, $\mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n)$ is the posterior probability of the $j$-th component of the density, given the vector of counts $\vec n$ (equation (18) below). It captures our assessment that the $j$-th component was chosen in step 1 of the stochastic process generating these observed amino acids. The other term, $\mathrm{Prob}(\vec p \mid \vec\alpha_j, \vec n)$, then represents the probability of each distribution $\vec p$, given component $j$ and the count vector $\vec n$.

We can pull out terms not depending on $\vec p$ from inside the integral, giving us

$$\hat p_i = \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \int_{\vec p} p_i \, \mathrm{Prob}(\vec p \mid \vec\alpha_j, \vec n) \, d\vec p. \qquad (16)$$

At this point, we use the result from equation (13), and obtain

$$\hat p_i = \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \, \frac{n_i + \alpha_{j,i}}{|\vec n| + |\vec\alpha_j|}. \qquad (17)$$

Hence, instead of identifying one single component of the mixture that accounts for the observed data, we determine how likely each individual component is to have produced the data. Each component then contributes pseudocounts proportional to the posterior probability that it produced the observed counts. In this case, when $\vec n = \vec 0$, $\hat p_i$ is simply $\sum_j q_j \, \alpha_{j,i} / |\vec\alpha_j|$, the weighted sum of the means of each Dirichlet density in the mixture.

When a component has a very small $|\vec\alpha|$, it adds a very small bias to the observed amino acid frequencies. As we showed in the section on interpreting these mixtures, such components give high probability to all distributions peaked around individual amino acids. The addition of such a small bias allows these components not to shift the estimated amino acids away from conserved distributions, even when relatively small amounts of data are available.

By contrast, components having a larger $|\vec\alpha|$ tend to favor mixed distributions, that is, combinations of amino acids. In these cases, the individual $\alpha_{j,i}$ values tend to be relatively large for those amino acids $i$ preferred by the component. When such a component has high probability given a vector of counts, these $\alpha_{j,i}$ have a corresponding influence on the expected amino acids predicted for that position. The estimates produced may include significant probability for amino acids not seen at all in the count vector under consideration.

Moreover, examining equation (17) reveals a smooth transition between reliance on the prior information in the absence of sufficient data, and confidence that the observed frequencies in the available training data represent the expected probabilities in the family as a whole, as the number of observations increases. When the number of observations is small, the mixture prior has the greatest effect in determining the posterior estimate. But as the number of observations increases, the $n_i$ values will dominate the $\alpha$ values. Importantly, as the number of observations increases, this estimate approaches the maximum likelihood estimate, $\hat p_i = n_i / |\vec n|$.

Thus, in the case of a mixture density, we will first want to calculate the quantity $\mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n)$, for each $j$ between 1 and $l$. This quantity is computed from Bayes' rule as

$$\mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) = \frac{q_j \, \mathrm{Prob}(\vec n \mid \vec\alpha_j, |\vec n|)}{\mathrm{Prob}(\vec n \mid \Theta, |\vec n|)}. \qquad (18)$$

$\mathrm{Prob}(\vec n \mid \vec\alpha_j, |\vec n|)$ is the probability of the count vector $\vec n$ given the $j$-th component of the mixture, and is derived in the Appendix. The denominator, $\mathrm{Prob}(\vec n \mid \Theta, |\vec n|)$, is defined

$$\mathrm{Prob}(\vec n \mid \Theta, |\vec n|) = \sum_{k} q_k \, \mathrm{Prob}(\vec n \mid \vec\alpha_k, |\vec n|). \qquad (19)$$

(This formula was misreported in previous work (Brown et al.; Karplus).)
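Putting equations (17) through (19) together gives the complete estimator. The Python sketch below (ours) computes the component posteriors in log space, using the Dirichlet-multinomial form of $\mathrm{Prob}(\vec n \mid \vec\alpha, |\vec n|)$ derived in the Appendix; the multinomial coefficient is omitted because it is identical for every component and cancels in equation (18). Working in log space also anticipates the underflow issues treated in the final section.

```python
import numpy as np
from scipy.special import gammaln

def log_prob_counts(counts, alpha):
    """log Prob(n | alpha, |n|), dropping the multinomial coefficient
    |n|! / prod_i n_i!, which cancels in Bayes' rule (equation (18)):
    log G(|alpha|) - log G(|n|+|alpha|)
      + sum_i [log G(n_i+alpha_i) - log G(alpha_i)]."""
    return (gammaln(alpha.sum()) - gammaln(counts.sum() + alpha.sum())
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

def mean_posterior_mixture(counts, q, alphas):
    """Equation (17): sum_j Prob(alpha_j | Theta, n) (n_i + alpha_ji)
    / (|n| + |alpha_j|), with posteriors computed in log space."""
    counts = np.asarray(counts, float)
    alphas = [np.asarray(a, float) for a in alphas]
    logs = np.array([np.log(qj) + log_prob_counts(counts, aj)
                     for qj, aj in zip(q, alphas)])
    post = np.exp(logs - logs.max())
    post /= post.sum()              # Prob(alpha_j | Theta, n), equation (18)
    return sum(pj * (counts + aj) / (counts.sum() + aj.sum())
               for pj, aj in zip(post, alphas))
```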

Derivation of Dirichlet Densities

As noted earlier, much statistical analysis has been done on amino acid distributions found in particular secondary structural environments in proteins. However, our primary focus in developing these techniques for protein modeling has been to rely as little as possible on previous knowledge and assumptions, and instead to use statistical techniques that uncover the underlying key information in the data.

Consequently, our approach, instead of beginning with secondary structure, is to take unlabeled training data (i.e., columns from multiple alignments with no secondary structure information attached) and attempt to discover those classes of distributions of amino acids that are intrinsic to the data. The statistical method employed directly estimates the most likely Dirichlet mixture density through clustering observed counts of amino acids. In most cases, the common amino acid distributions we find are easily identified (e.g., large non-polar), but we do not set out a priori to find distributions representing known structural environments.

Given a set of $m$ columns from a variety of multiple alignments, we tally the frequency of each amino acid in each column, with the end result being a vector of counts of each amino acid for each column in the dataset. Thus, our primary data is a set of $m$ count vectors. Many multiple alignments of different protein families are included, so $m$ is typically in the thousands. We fix a numbering of the amino acids from 1 to 20, so each count vector has the form $\vec n = (n_1, \ldots, n_{20})$, where $n_i$ is the number of times amino acid $i$ occurs in the column represented by this count vector.

We have used Maximum Likelihood to estimate the parameters $\Theta$ of $\rho$ from the set of count vectors; that is, we seek those parameters that maximize the probability of occurrence of the observed count vectors. We assume the three-stage stochastic model described above was used independently to generate each of the count vectors in our observed set. Under this assumption of independence, the probability of the entire set of observed frequency count vectors is equal to the product of their individual probabilities. Thus, we seek to find the model $\Theta$ that maximizes $\prod_{t=1}^{m} \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|)$. Since maximizing the probability is equivalent to minimizing its negative logarithm, this is equivalent to finding the $\Theta$ that minimizes the objective function

$$f(\Theta) = \sum_{t=1}^{m} -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|). \qquad (20)$$

In the simplest case, we have simply fixed the number of components $l$ in the Dirichlet mixture to a particular value, and then estimated the $21l$ parameters (twenty $\alpha_i$ values for each of the $l$ components, and $l$ mixture coefficients). In other experiments, we tried to estimate $l$ as well. Unfortunately, even for fixed $l$, there does not appear to be an efficient method of estimating these parameters that is guaranteed to always find the maximum likelihood estimate. However, a variant of the standard expectation-maximization (EM) algorithm for mixture density estimation works well in practice. EM has been proved to result in closer and closer approximations to a local optimum with every iteration of the learning cycle; a global optimum, unfortunately, is not guaranteed (Dempster et al.). (An introduction to this method of mixture density estimation is given in the book by Duda and Hart; we have modified their procedure to estimate a mixture of Dirichlet rather than Gaussian densities. This method for parameter estimation has also been used for other problems in biosequence analysis (Lawrence and Reilly; Cardon and Stormo).)

As the derivations that follow can become somewhat complex, we provide two tables in the Appendix to help the reader follow them: one contains a summary of the notation we use, and the other contains an index to where certain key quantities are derived or defined.

In this section, we give the derivation of the procedure to estimate the parameters of a mixture prior. As we will show, the case where the prior consists of a single density follows directly from the general case of a mixture. In the case of a mixture, we have two sets of parameters to estimate: the $\vec\alpha_j$ parameters for each component, and the mixture coefficient $q_j$ for each component. In the case of a single density, we estimate only the $\vec\alpha$ parameters.

In practice, we estimate these parameters in a two-stage process: first we estimate the $\vec\alpha_j$, keeping the mixture coefficients $q_j$ fixed; then we estimate the $q_j$, keeping the $\vec\alpha_j$ parameters fixed. This two-stage process is iterated until all estimates stabilize.

Deriving the $\vec\alpha$ parameters

Since we require that the $\alpha_i$ be strictly positive, and we want the parameters upon which we will do gradient descent to be unconstrained, we reparameterize, setting $\alpha_{j,i} = e^{w_{j,i}}$, where $w_{j,i}$ is an unconstrained real number. Then the partial derivative of the objective function (equation (20)) with respect to $w_{j,i}$ is

$$\frac{\partial f}{\partial w_{j,i}} = \sum_{t=1}^{m} \frac{\partial \left( -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|) \right)}{\partial w_{j,i}}. \qquad (21)$$

Here we introduce Lemma 2 (the proof of which is found in the Appendix):

Lemma 2:
$$\frac{\partial \left( -\log \mathrm{Prob}(\vec n \mid \Theta, |\vec n|) \right)}{\partial \alpha_{j,i}} = \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n) \, \frac{\partial \left( -\log \mathrm{Prob}(\vec n \mid \vec\alpha_j, |\vec n|) \right)}{\partial \alpha_{j,i}},$$

to obtain

$$\frac{\partial f}{\partial w_{j,i}} = \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \, \frac{\partial \left( -\log \mathrm{Prob}(\vec n_t \mid \vec\alpha_j, |\vec n_t|) \right)}{\partial \alpha_{j,i}} \, \frac{\partial \alpha_{j,i}}{\partial w_{j,i}}. \qquad (22)$$

Using the fact that $\partial \alpha_{j,i} / \partial w_{j,i} = \alpha_{j,i}$, and introducing Lemma 3 (the proof of which is found in the Appendix):

Lemma 3:
$$\frac{\partial \log \mathrm{Prob}(\vec n \mid \vec\alpha, |\vec n|)}{\partial \alpha_i} = \Psi(|\vec\alpha|) - \Psi(|\vec\alpha| + |\vec n|) + \Psi(n_i + \alpha_i) - \Psi(\alpha_i),$$

where $\Psi$ is the digamma function (the derivative of $\log \Gamma$), we obtain

$$\frac{\partial f}{\partial w_{j,i}} = -\sum_{t=1}^{m} \alpha_{j,i} \, \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \left( \Psi(|\vec\alpha_j|) - \Psi(|\vec\alpha_j| + |\vec n_t|) + \Psi(n_{t,i} + \alpha_{j,i}) - \Psi(\alpha_{j,i}) \right). \qquad (23)$$

In optimizing the $\alpha$ parameters of the mixture, we do gradient descent on the weights $w_{j,i}$, taking a step in the direction of the negative gradient, controlling the size of the step by the variable $\eta$, during each iteration of the learning cycle. Thus, the gradient descent rule in the mixture case can now be defined as follows:

$$w_{j,i}^{\mathrm{new}} = w_{j,i}^{\mathrm{old}} - \eta \, \frac{\partial f}{\partial w_{j,i}} \qquad (24)$$

$$\phantom{w_{j,i}^{\mathrm{new}}} = w_{j,i}^{\mathrm{old}} + \eta \, \alpha_{j,i} \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \left( \Psi(|\vec\alpha_j|) - \Psi(|\vec\alpha_j| + |\vec n_t|) + \Psi(n_{t,i} + \alpha_{j,i}) - \Psi(\alpha_{j,i}) \right). \qquad (25)$$

Now, letting $S_j = \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t)$, this is

$$w_{j,i}^{\mathrm{new}} = w_{j,i}^{\mathrm{old}} + \eta \, \alpha_{j,i} \left( S_j \left( \Psi(|\vec\alpha_j|) - \Psi(\alpha_{j,i}) \right) + \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_j \mid \Theta, \vec n_t) \left( \Psi(n_{t,i} + \alpha_{j,i}) - \Psi(|\vec\alpha_j| + |\vec n_t|) \right) \right). \qquad (26)$$

In the case of a single density, $\mathrm{Prob}(\vec\alpha \mid \Theta, \vec n) = 1$ for all vectors $\vec n$; thus $S = m$, and the gradient descent rule for a single density can be written as

$$w_i^{\mathrm{new}} = w_i^{\mathrm{old}} + \eta \, \alpha_i \left( m \left( \Psi(|\vec\alpha|) - \Psi(\alpha_i) \right) + \sum_{t=1}^{m} \left( \Psi(n_{t,i} + \alpha_i) - \Psi(|\vec\alpha| + |\vec n_t|) \right) \right). \qquad (27)$$

After each update of the $w$ weights, the $\alpha$ parameters are reset, and the process is continued until the change in the objective function falls below some predefined cutoff.
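One iteration of this update is straightforward to implement; the Python sketch below (ours) applies equation (25) to the weights of a single component, given the component posteriors of equation (18).

```python
import numpy as np
from scipy.special import digamma

def gradient_step_w(w, count_vectors, posteriors, eta=0.01):
    """Equation (25) for one component j: w holds the 20 values w_ji,
    count_vectors the n_t, and posteriors[t] = Prob(alpha_j | Theta, n_t)."""
    alpha = np.exp(w)                       # alpha_ji = e^{w_ji}
    a_sum = alpha.sum()
    grad = np.zeros_like(alpha)
    for n_t, p_t in zip(count_vectors, posteriors):
        n_t = np.asarray(n_t, float)
        grad += p_t * (digamma(a_sum) - digamma(a_sum + n_t.sum())
                       + digamma(n_t + alpha) - digamma(alpha))
    return w + eta * alpha * grad           # the alpha factor is d(alpha)/dw
```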

Mixture coefficient estimation

In the case of a mixture of Dirichlet densities, the mixture coefficients $q_j$ of each component are also estimated. However, since we require that the mixture coefficients be non-negative and sum to 1, we first reparameterize, setting $q_i = Q_i / |Q|$, where the $Q_i$ are constrained to be strictly positive, and $|Q| = \sum_i Q_i$. As in the first stage, we want to maximize the probability of the data given the model, which is equivalent to minimizing the objective function of equation (20), $f(\Theta) = \sum_{t=1}^{m} -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|)$. In this stage, we take the derivative of $f$ with respect to $Q_i$. However, instead of having to take iterative steps in the direction of the negative gradient, as we did in the first stage, we can set the derivative to zero and solve for those $q_i = Q_i / |Q|$ that maximize the probability of the data. As we will see, however, the new $q_i$ are a function of the previous $q_i$; thus, this estimation process must also be iterated.

Taking the gradient of $f$ with respect to $Q_i$, we obtain

$$\frac{\partial f}{\partial Q_i} = \sum_{t=1}^{m} \frac{\partial \left( -\log \mathrm{Prob}(\vec n_t \mid \Theta, |\vec n_t|) \right)}{\partial Q_i}. \qquad (28)$$

This allows us to focus on the partial derivative of the log likelihood of a single count vector with respect to $Q_i$. By Lemma 4 (the proof of which is found in the Appendix):

Lemma 4:
$$\frac{\partial \left( -\log \mathrm{Prob}(\vec n \mid \Theta, |\vec n|) \right)}{\partial Q_i} = \frac{1}{|Q|} - \frac{\mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n)}{Q_i}.$$

When we sum over all observations $\vec n_t$, we obtain, in the case of a mixture,

$$\frac{\partial f}{\partial Q_i} = \sum_{t=1}^{m} \left( \frac{1}{|Q|} - \frac{\mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t)}{Q_i} \right) \qquad (29)$$

$$\phantom{\frac{\partial f}{\partial Q_i}} = \frac{m}{|Q|} - \frac{1}{Q_i} \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t). \qquad (30)$$

Since the gradient must vanish for those mixture coefficients giving the maximum likelihood, we set the gradient to zero and solve. Thus, the maximum likelihood setting for $q_i$ is

$$q_i = \frac{Q_i}{|Q|} = \frac{1}{m} \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t). \qquad (31)$$

Note that since $\sum_i \sum_{t=1}^{m} \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t) = \sum_{t=1}^{m} \sum_i \mathrm{Prob}(\vec\alpha_i \mid \Theta, \vec n_t) = m$, the mixture coefficients sum to 1, as required.

Since the re-estimated mixture coefficients are functions of the old mixture coefficients, we iterate this process until the change in the objective function falls below the predefined cutoff.

In summary, when estimating the parameters of a mixture prior, we alternate between re-estimating the $\vec\alpha_j$ parameters of each density in the mixture by gradient descent on the $w_{j,i}$ (resetting $\alpha_{j,i} = e^{w_{j,i}}$ after each iteration), and re-estimating and resetting the mixture coefficients as described above, until the process converges.
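The second stage is a one-line computation. The sketch below (ours) re-estimates the mixture coefficients from the component posteriors, per equation (31), and indicates how the two stages alternate.

```python
import numpy as np

def reestimate_q(posteriors):
    """Equation (31): q_j = (1/m) sum_t Prob(alpha_j | Theta, n_t).
    `posteriors` is an m x l array; row t holds the component posteriors
    for count vector n_t, so the result automatically sums to 1."""
    return np.asarray(posteriors, float).mean(axis=0)

# Outer loop (sketch): repeat { gradient steps on the w's (previous block);
# recompute component posteriors; q = reestimate_q(posteriors) } until the
# objective f of equation (20) stops improving.
```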

Results

The problem of estimating expected distributions over the amino acids in the absence of large amounts of data is not unique to hidden Markov models. Thus, other researchers have experimented with Dirichlet mixture priors, both those which we reported in Brown et al. and those which we developed and made available afterwards. In addition to the experiments we reported in Brown et al., which we summarize below, three independent groups of researchers (Tatusov et al.; Henikoff and Henikoff; Bailey and Elkan) used these mixtures in database search and discrimination experiments, while the work of Karplus is more information-theoretic, comparing the number of bits needed to encode the posterior probability estimates of the amino acids given different methods and different sample sizes.

HMM experiments

In our original paper on the use of Dirichlet mixture priors (Brown et al.), we described a series of experiments on building HMMs for the EF-hand motif. EF-hands are a short (roughly 30-residue) helix-loop-helix structure present in cytosolic calcium-modulated proteins (Nakayama et al.; Persechini et al.; Moncrief et al.). We chose EF-hands to demonstrate the ability of mixture priors to compensate for limited sample sizes because the motif's small size allowed many experiments to be performed relatively rapidly. For these experiments, we used the June release of the database of EF-hand sequences maintained by Kretsinger and coworkers (Nakayama et al.). We extracted the EF-hand structures from each of the sequences in the database, obtaining a set of EF-hand motifs, and constructed HMM training sets by randomly extracting subsets of various sizes.

The Dirichlet priors we used for these experiments were derived from two sources of multiple alignments: a subset of alignments from the HSSP database, suggested in Sander and Schneider, and multiple alignments we generated using HMMs to model the kinase, globin, and elongation factor families (Haussler et al.; Krogh et al.).

Using the maximum likelihood procedure described above, we estimated the parameters of a one-component and a nine-component Dirichlet mixture density from the count vectors obtained from the HSSP multiple alignments; we call these Dirichlet mixtures HSSP-1 and HSSP-9, respectively. Similar experiments were done for the HMM alignments, obtaining Dirichlet mixture priors with one component and nine components (HMM-1 and HMM-9).

In addition to the priors we estimated via maximum likelihood estimation, we tested the effectiveness of some additional priors: the standard uniform prior (called Add-One; see the discussion of pseudocount methods above); priors obtained directly from the amino acid distributions estimated by Lüthy, McLachlan, and Eisenberg (the LME distributions) for nine different structural environments; and an EF-hand custom prior, in which each component is derived from a column in our EF-hand multiple alignment. The prior derived from the nine LME distributions was obtained by forming Dirichlet densities for each of the nine LME amino acid distributions, with the same means as the original distributions. Since there is no measure of the expected variance around the mean associated with these distributions, we arbitrarily set the $|\vec\alpha|$ for each component, and set the mixture coefficients uniformly. The EF-hand custom prior was designed to determine a bound on the best possible performance of any Dirichlet mixture for this family.

For each training set size and each prior, several HMMs were built, using the method described in Krogh et al. We evaluated each HMM on a separate test set containing EF-hand sequences not in the training set, yielding an average negative log likelihood (NLL) score over all test sequences for each model; lower scores represent more accurate models. For every combination of training sample size and prior used, we took the average test-set NLL score across all models, and the standard deviation of the test-set NLL scores.

In these experiments, the EF-hand custom prior performed the best, followed by HMM-9, HSSP-9, LME, HMM-1, and HSSP-1; Add-One performed the worst. For example, the average test-set NLL score for HMMs trained using HMM-9 was lower than the average NLL score for the HMMs trained using the Add-One prior at every training set size tested. Details of the results of these experiments are given in Brown et al.

(In retrospect, the high $|\vec\alpha|$ of the LME prior may have handicapped this density in these experiments; a smaller $|\vec\alpha|$ might have been more effective.)

In our previous work, the NLL score has always been almost perfectly correlated with superior multiple alignments and database search. To further demonstrate the latter point, we tested some of the HMMs built from various priors on their ability to discriminate sequences containing the EF-hand domain from those not containing the domain. To do this, we chose models built from training samples of several sizes, using the Add-One, HMM-1, HMM-9, and EF-hand custom priors. For each sample size and prior, we built an HMM as above, and then used it to search the SWISS-PROT database for sequences that contain the EF-hand motif, using the method described in Krogh et al. The results of these database discrimination experiments confirmed the ordering of the priors by NLL score. Unfortunately, only one test was done for each combination of sample size and prior, so the results are not as statistically significant as those for NLL score.

Finally, we note that in these experiments, the data used to train HMM-1 and HMM-9 contained no EF-hand-specific proteins, yet these mixtures still produced a substantial increase in performance for the EF-hand HMMs estimated using these priors. This confirms that these priors do indeed capture some universal aspect of amino acid distributions that is meaningful across different protein families.

Exp eriments with other statistical mo dels

Karplus's work (Karplus a; Karplus b) compared, for several methods, the relative costs of encoding multiple alignments using the estimated posterior probabilities $\hat{p}_i$ of each amino acid $i$ in samples of various sizes drawn from count vectors from the BLOCKS database (Henikoff and Henikoff). Karplus noted the sample sizes for which each method was superior to the others, and whether a method's posterior probability estimate approaches the maximum-likelihood estimate in the limit as the number of observations grows unboundedly large. Karplus compared several methods:

1. Zero-offset methods, of which one variant is the popular Add-One, where a small positive constant is added to all amino acid counts.

2. Pseudocount methods, in which a different positive value is added for each amino acid, rather than one fixed constant for all amino acids.

3. Gribskov profile (or average score) method, where the scores are logarithmic, comparing the probability estimate of an amino acid in a particular context to the global (or background) probability of that amino acid. This method has been used by various researchers (Tatusov et al.; Gribskov et al.), employing any of several scoring matrices, such as the popular Dayhoff (Dayhoff et al.) and Blosum (Henikoff and Henikoff) matrices.

4. Substitution matrices, which encode the cost of substituting amino acid i for amino acid j, comparing two variants on this basic technique: adding scaled counts and/or pseudocounts. These methods are similar to those employed in method 3 above, but use matrix multiplication to compute Prob(amino acid i) rather than log Prob(amino acid i) scores.

5. Dirichlet mixture priors, with several mixture priors compared against each other.

For this problem, Dirichlet mixtures were always superior for sample sizes of two or more, and were very close to optimal for sample size one, where substitution matrices were optimal, and for sample size zero, where pseudocount methods based on background frequency were optimal. More recently, Karplus duplicated these experiments on columns drawn from the HSSP protein database and confirmed a similar ordering of these methods; the nine-component Dirichlet mixture reported in this paper also performs very well in his tests for all sample sizes tested.

Tatusov, Altschul, and Koonin propose a technique in Tatusov et al. for iterative refinement of protein profiles that is able to start with very few aligned sequences, or even a single protein segment, and repeatedly: compute a probability distribution over the amino acids for each column in the alignment; search a sequence database for protein segments that match the amino acid distributions specified by the model, according to some criterion; and multiply align all new protein segments to the model, until no new sequences scoring above a given cutoff are found. They tested several methods for estimating the expected distributions over the amino acids in the first part of this iterative model-building process. The resulting models were then tested at database discrimination tasks and their relative performances compared. The methods they compared were:

1. Average score method, incorporating the use of amino acid substitution matrices such as PAM (Altschul) or BLOSUM (Henikoff and Henikoff); identical to the third method tested by Karplus.

2. Log-odds Bayesian prediction using pseudocounts; identical to the second method tested by Karplus.

3. Data-dependent pseudocount method, where the pseudocounts are calculated using a substitution matrix; equivalent to one of the substitution matrix methods tested by Karplus.

4. Dirichlet mixture method, which incorporates a Dirichlet mixture prior into the log-odds Bayesian prediction method.

Tatusov et al. reported that the use of Dirichlet mixture priors (specifically, a nine-component mixture prior estimated from the Blocks database (Henikoff and Henikoff), quite similar to the Blocks9 prior given in the tables at the end of this paper) resulted in protein models with the highest accuracy in database discrimination, yielding the fewest false negatives and false positives overall of any of the methods compared.

Steven and Jorja Henikoff conducted a series of tests on the same methods, using a testing strategy similar to that described in Henikoff and Henikoff, and confirm these results (personal communication). Good results with these mixtures are also reported by Wang et al. (to appear), who in a related set of experiments created an expanded set of blocks using the same mixtures used in Tatusov et al., and then used these blocks to classify protein sequences.

In Bailey and Elkan, the authors report several extensions to their motif-finding tool MEME which incorporate prior information into the parameter estimation process. While the authors did not compare different methods for computing posterior estimates of amino acid densities (the other prior information introduced concerned motif width, presence or absence of the motif in the sequences being searched, and whether, as in the case of DNA sequences, the motif is expected to be a palindrome), they reported that the use of a Dirichlet mixture prior (in this case a mixture estimated from the BLOCKS database) boosted their protein database search accuracy significantly, especially in the case where few training sequences were available.

Implementation details

Implementing Dirichlet mixture priors for use in hidden Markov models or other stochastic models of biological sequences is not difficult, but there are many details that can cause problems if not handled carefully. This section splits the implementation details into two groups: those that are essential for getting working Dirichlet mixture code, and those that increase efficiency but are not essential.

Essential details

Earlier in this paper we gave the formulas for computing the amino acid probabilities in the cases of a single density and of a mixture density. For a single Dirichlet component, the estimation formula is trivial:

$$\hat{p}_i = \frac{n_i + \alpha_i}{|n| + |\vec\alpha|} ,$$

and no special care is needed in the implementation. For the case of a multi-component mixture, the implementation is not quite so straightforward.
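This formula is direct to implement; a minimal sketch in Python (the function name and list-based representation are our own illustration, not code from this paper):

\begin{verbatim}
def single_component_estimate(n, alpha):
    # p_i = (n_i + alpha_i) / (|n| + |alpha|) for one Dirichlet density
    total = sum(n) + sum(alpha)
    return [(ni + ai) / total for ni, ai in zip(n, alpha)]
\end{verbatim}

For example, with alpha = [1.0]*20 (an Add-One-style prior) and a count vector containing a single isoleucine, every amino acid is estimated at 1/21 except isoleucine, which gets 2/21.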

As we showed in the derivation of the mixture estimate,

$$\hat{p}_i = \sum_{j=1}^{l} \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{n_i + \alpha_{ji}}{|n| + |\vec\alpha_j|} .$$

The interesting part, for computation, comes in computing $\mathrm{Prob}(\vec\alpha_j \mid n)$, whose formula is repeated here:

$$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)} .$$

We can expand $\mathrm{Prob}(n \mid \Theta, |n|)$ using the mixture decomposition to obtain

$$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\sum_{k=1}^{l} q_k\, \mathrm{Prob}(n \mid \vec\alpha_k, |n|)} .$$

Note that this is a simple renormalization of $q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)$ to sum to one. Rather than carry the normalization through all the equations, we can work directly with $\mathrm{Prob}(n \mid \vec\alpha_j, |n|)$ and put everything back together at the end.

First, we can expand it using Lemma 3, the proof of which is found in the Appendix:

$$\mathrm{Prob}(n \mid \vec\alpha_j, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha_j|)}{\Gamma(|n|+|\vec\alpha_j|)} \prod_i \frac{\Gamma(n_i + \alpha_{ji})}{\Gamma(n_i+1)\,\Gamma(\alpha_{ji})} .$$

If we rearrange some terms, we obtain

$$\mathrm{Prob}(n \mid \vec\alpha_j, |n|) = \frac{\Gamma(|\vec\alpha_j|) \prod_i \Gamma(n_i + \alpha_{ji})}{\Gamma(|n| + |\vec\alpha_j|) \prod_i \Gamma(\alpha_{ji})} \cdot \frac{\Gamma(|n|+1)}{\prod_i \Gamma(n_i+1)} .$$

The first two terms are most easily expressed using the Beta function,

$$B(x) = \frac{\prod_i \Gamma(x_i)}{\Gamma(|x|)} ,$$

where, as usual, $|x| = \sum_i x_i$. This simplifies the expression to

$$\mathrm{Prob}(n \mid \vec\alpha_j, |n|) = \frac{B(n + \vec\alpha_j)}{B(\vec\alpha_j)} \cdot \frac{\Gamma(|n|+1)}{\prod_i \Gamma(n_i+1)} .$$

The remaining Gamma functions are not easily expressed with a Beta function, but they don't need to be: since they depend only on $n$ and not on $j$, when we do the normalization to make the $\mathrm{Prob}(\vec\alpha_j \mid n)$ sum to one, this term will cancel out, giving us

$$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, B(n + \vec\alpha_j)/B(\vec\alpha_j)}{\sum_{k=1}^{l} q_k\, B(n + \vec\alpha_k)/B(\vec\alpha_k)} .$$

Plugging this formula into the mixture estimate gives us

$$\hat{p}_i = \frac{\sum_{j=1}^{l} q_j\, \frac{B(n+\vec\alpha_j)}{B(\vec\alpha_j)}\, \frac{n_i + \alpha_{ji}}{|n| + |\vec\alpha_j|}}{\sum_{k=1}^{l} q_k\, \frac{B(n+\vec\alpha_k)}{B(\vec\alpha_k)}} .$$

Since the denominator of this equation is independent of $i$, we can compute $\hat{p}_i$ by normalizing

$$X_i = \sum_{j=1}^{l} q_j\, \frac{B(n+\vec\alpha_j)}{B(\vec\alpha_j)}\, \frac{n_i + \alpha_{ji}}{|n| + |\vec\alpha_j|}$$

to sum to one. That is,

$$\hat{p}_i = \frac{X_i}{\sum_k X_k} .$$
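A direct transcription of these formulas is only a few lines of code. The sketch below (Python; the function names are ours, and we assume the mixture is given as a coefficient list q and a list of parameter vectors alphas) is adequate for small counts, but as the next paragraphs explain, the Beta-function ratios quickly leave floating-point range for realistic $|n|$:

\begin{verbatim}
import math

def beta(x):
    # B(x) = prod_i Gamma(x_i) / Gamma(|x|)
    prod = 1.0
    for xi in x:
        prod *= math.gamma(xi)
    return prod / math.gamma(sum(x))

def mixture_estimate_naive(n, q, alphas):
    # X_i = sum_j q_j * (B(n + alpha_j) / B(alpha_j))
    #             * (n_i + alpha_ji) / (|n| + |alpha_j|)
    X = []
    for i in range(len(n)):
        xi = 0.0
        for qj, aj in zip(q, alphas):
            n_plus_a = [nk + ak for nk, ak in zip(n, aj)]
            xi += qj * (beta(n_plus_a) / beta(aj)) \
                     * (n[i] + aj[i]) / (sum(n) + sum(aj))
        X.append(xi)
    total = sum(X)
    return [x / total for x in X]   # normalize so the estimates sum to one
\end{verbatim}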

The biggest problem that implementors run into is that these Beta functions can get very large or very small, outside the range of the floating-point representation of most computers. The obvious solution is to work with the logarithm of the Beta function:

$$\log B(x) = \log \frac{\prod_i \Gamma(x_i)}{\Gamma(|x|)} = \sum_i \log \Gamma(x_i) - \log \Gamma(|x|) .$$

Most libraries of mathematical routines include the lgamma function, which implements $\log \Gamma(x)$, and so using the logarithm of the Beta function is not difficult.
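For instance, a log-Beta helper takes one line in Python, assuming the standard library's math.lgamma:

\begin{verbatim}
import math

def log_beta(x):
    # log B(x) = sum_i log Gamma(x_i) - log Gamma(|x|)
    return sum(math.lgamma(xi) for xi in x) - math.lgamma(sum(x))
\end{verbatim}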

We could compute each $X_i$ using only the logarithmic notation, but it turns out to be slightly more convenient to use the logarithms just for the Beta functions:

$$X_i = \sum_{j=1}^{l} q_j\, \frac{n_i + \alpha_{ji}}{|\vec\alpha_j| + |n|}\, e^{\log B(\vec\alpha_j + n) - \log B(\vec\alpha_j)} .$$

Some care is needed in the conversion from the logarithmic representation back to floating-point, since the ratio of the Beta functions may be so large or so small that it cannot be represented as a floating-point number. Luckily, we do not really need to compute $X_i$, only $\hat{p}_i = X_i / \sum_k X_k$. This means that we can multiply $X_i$ by any constant, and the normalization will eliminate the constant. Equivalently, we can freely subtract a constant independent of $j$ and $i$ from $\log B(\vec\alpha_j + n) - \log B(\vec\alpha_j)$ before converting back to floating-point.

If we choose the constant to be $\max_j \left( \log B(\vec\alpha_j + n) - \log B(\vec\alpha_j) \right)$, then the largest logarithmic term will be zero, and all the terms will be reasonable. We could still get floating-point underflow to zero for some terms, but the $\hat{p}$ computation will still be about as good as can be done within a floating-point representation.
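Putting the pieces together, the following Python sketch (our own illustration, not code from this paper) computes $\hat{p}$ entirely with log-Beta differences, subtracting the largest difference before exponentiating so that the largest term becomes exp(0) = 1:

\begin{verbatim}
import math

def log_beta(x):
    return sum(math.lgamma(xi) for xi in x) - math.lgamma(sum(x))

def mixture_estimate(n, q, alphas):
    tot_n = sum(n)
    # log B(alpha_j + n) - log B(alpha_j), one value per component
    log_ratios = [log_beta([ni + ai for ni, ai in zip(n, aj)]) - log_beta(aj)
                  for aj in alphas]
    shift = max(log_ratios)          # the constant we can freely subtract
    X = []
    for i in range(len(n)):
        xi = 0.0
        for qj, aj, lr in zip(q, alphas, log_ratios):
            xi += qj * math.exp(lr - shift) \
                     * (n[i] + aj[i]) / (tot_n + sum(aj))
        X.append(xi)
    total = sum(X)
    return [x / total for x in X]
\end{verbatim}

Because the subtracted constant is independent of $i$ and $j$, the final normalization returns exactly the same $\hat{p}$ as the unshifted computation would in exact arithmetic.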

Efficiency improvements

The previous section gave simple computational formulas for $\hat{p}_i$. When computations of $\hat{p}$ are done infrequently (for example, for profiles, where $\hat{p}$ only needs to be computed once for each column of the profile), those equations are perfectly adequate.

When recomputing $\hat{p}$ frequently, as may be done in a Gibbs sampling program or when training a hidden Markov model, it is better to have a slightly more efficient computation. Since most of the computation time is spent in the lgamma function used for computing the log-Beta functions, the biggest efficiency gains come from avoiding the lgamma computations.


If we assume that the $\vec\alpha$ and $q$ values change less often than the values for $n$ (which is true of almost every application), then it is worthwhile to precompute $\log B(\vec\alpha_j)$, cutting the computation time almost in half.

If the $n_i$ values are mainly small integers (as is common in all the applications we've looked at), then it is worth precomputing $\log \Gamma(\alpha_{ji})$, $\log \Gamma(\alpha_{ji}+1)$, $\log \Gamma(\alpha_{ji}+2)$, and so on, out to some reasonable value. Precomputation should also be done for $\log \Gamma(|\vec\alpha_j|)$, $\log \Gamma(|\vec\alpha_j|+1)$, $\log \Gamma(|\vec\alpha_j|+2)$, and so forth. If all the $n_i$ values are small integers, this precomputation almost eliminates the lgamma function calls.

In some cases it may be worthwhile to build a special-purpose implementation of $\log \Gamma(x)$ that caches all calls in a hash table and does not call lgamma for values of $x$ that it has seen before. Even larger savings can be had when $x$ is close to previously computed values, by using interpolation rather than calling lgamma.
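As a sketch of these ideas (ours, not code from this paper), the component-dependent $\log B(\vec\alpha_j)$ can be computed once, and a memoized $\log \Gamma$ gives much of the benefit of the hash-table cache when the counts are small integers:

\begin{verbatim}
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def lgamma_cached(x):
    # Worthwhile only when the same arguments recur, as they do
    # when small integer n_i are added to fixed alpha_ji values.
    return math.lgamma(x)

def precompute_log_betas(alphas):
    # log B(alpha_j) changes only when the mixture itself changes,
    # so compute it once per component and reuse it for every column.
    return [sum(lgamma_cached(a) for a in aj) - lgamma_cached(sum(aj))
            for aj in alphas]
\end{verbatim}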

Conclusions and Future Research

Dirichlet mixture priors have been demonstrated to be more effective at forming accurate estimates of expected amino acid distributions than substitution-matrix-based methods, pseudocounts, and other such methods. In particular, the method presented in this paper has been shown to fix two primary weaknesses of substitution-matrix-based methods: focusing only on the relative frequency of the amino acids while ignoring the actual number of amino acids observed, and having fixed substitution probabilities for each amino acid. One of the potentially most problematic consequences of these drawbacks is that substitution-matrix-based methods do not produce estimates that are conserved, or mostly conserved, where the evidence is clear that an amino acid is conserved.

The method presented here addresses these issues. Given abundant training data, the estimate produced by these methods is very close to the actual frequencies observed. When little data is available, the amino acids predicted are those that are known to be associated, in different contexts, with the amino acids observed. In particular, when evidence exists that a particular amino acid is conserved at a given position, the expected amino acid estimates reflect this preference.

In database search for homologous sequences, Dirichlet mixtures have been shown to maximize sensitivity without sacrificing specificity. As a result, experiments using Dirichlet mixtures to estimate the expected amino acid distributions in a variety of statistical models for proteins result in fewer false negatives and false positives than when other methods are used.

The methods employed to estimate and use these mixtures have been shown to be firmly based on Bayesian statistics. While no biological knowledge has been introduced into the parameter-estimation process, the mixture priors that result agree with accepted biological understanding.

In order to be able to use these mixtures to find true remote homologs, mixtures should be estimated on alignments containing more distant homologs, rather than estimated from databases where fairly close homologs are aligned, as is the case for both the BLOCKS and HSSP databases. Another key area needing research is the weighting of sequences to remove bias. Previous work has concentrated on relative weighting schemes, but the total weight is also relevant when using Dirichlet mixtures.

Since the method for estimating these mixtures (EM) is sensitive to the initial parameter settings, we are exploring heuristics that enable us to explore the parameter space more effectively and obtain better mixtures. We are also exploring methods to compensate for the assumption that each column is generated independently, which, although it simplifies the math, is without biological basis. However, as the detailed analysis of Karplus (Karplus a; Karplus b) shows, the Dirichlet mixtures already available are close to optimal in their capacity for assisting in computing estimates of amino acid distributions given a single-column context. Thus, further work in this area will perhaps profit by focusing on obtaining information from relationships among the sequences, for instance as revealed in a phylogenetic tree, or in inter-columnar interactions.

Acknowledgments

We gratefully acknowledge the input and suggestions of Stephen Altschul, Tony Fink, Lydia Gregoret, Steven and Jorja Henikoff, and Graeme Mitchison. Special thanks to friends at LAFORIA, Université Pierre et Marie Curie in Paris, and the Biocomputing Group at the European Molecular Biology Laboratory at Heidelberg, who provided workstations, support, and scientific inspiration during the early stages of writing this paper. This work was supported in part by NSF grants CDA, IRI, and BIR; a DOE grant; an ONR grant; an NIH grant; a grant from the Danish Natural Science Research Council; a National Science Foundation Graduate Research Fellowship; and funds granted by the UCSC Division of Natural Sciences. This paper is dedicated to the memory of Tal Grossman, a dear friend and a true mensch.

References

Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. Basic local alignment search tool. JMB.

Altschul, Stephen F. Amino acid substitution matrices from an information theoretic perspective. JMB.

Asai, K., Hayamizu, S., and Onizuka, K. HMM with protein structure grammar. In Proceedings of the Hawaii International Conference on System Sciences, Los Alamitos, CA. IEEE Computer Society Press.

Bailey, Timothy L. and Elkan, Charles. The value of prior knowledge in discovering motifs with MEME. In ISMB, Cambridge, England.

Baldi, P. and Chauvin, Y. Smooth on-line learning algorithms for hidden Markov models. Neural Computation.

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. Adaptive algorithms for modeling and analysis of biological primary sequence information. Technical report, Net-ID, Inc., Cathy Place, Menlo Park, CA.

Barton, G. J. and Sternberg, M. J. Flexible protein sequence patterns: a sensitive method to detect weak structural similarities. JMB.

Berger, J. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Bernardo, J. M. and Smith, A. F. M. Bayesian Theory. John Wiley and Sons, first edition.

Bowie, J. U., Lüthy, R., and Eisenberg, D. A method to identify protein sequences that fold into a known three-dimensional structure. Science.

Brown, M. P., Hughey, R., Krogh, A., Mian, I. S., Sjolander, K., and Haussler, D. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter, L., Searls, D., and Shavlik, J., editors, ISMB, Menlo Park, CA. AAAI/MIT Press.

Bucher, Philipp, Karplus, Kevin, Moeri, Nicolas, and Hofmann, Kay. A flexible motif search technique based on generalized profiles. Computers and Chemistry.

Cardon, L. R. and Stormo, G. D. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. JMB.

Casari, G., Andrade, M., Bork, P., Boyle, J., Daruvar, A., Ouzounis, C., Schneider, R., Tamames, J., Valencia, A., and Sander, C. Scientific correspondence. Nature.

Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol.

Claverie, Jean-Michael. Information enhancement methods for large scale sequence analysis. Computers and Chemistry.

Claverie, Jean-Michael. Some useful statistical properties of position-weight matrices. Computers and Chemistry.

Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B.

Doolittle, R. F. Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, California.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. Wiley, New York.

Gradshteyn, I. S. and Ryzhik, I. M. Table of Integrals, Series, and Products. Academic Press, fourth edition.

Gribskov, M., Devereux, J., and Burgess, R. The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. NAR.

Gribskov, Michael, McLachlan, Andrew D., and Eisenberg, David. Profile analysis: detection of distantly related proteins. PNAS.

Gribskov, M., Lüthy, R., and Eisenberg, D. Profile analysis. Methods in Enzymology.

Haussler, D., Krogh, A., Mian, I. S., and Sjolander, K. Protein modeling using hidden Markov models: analysis of globins. In Proceedings of the Hawaii International Conference on System Sciences, Los Alamitos, CA. IEEE Computer Society Press.

Henikoff, Steven and Henikoff, Jorja G. Automated assembly of protein blocks for database searching. NAR.

Henikoff, Steven and Henikoff, Jorja G. Amino acid substitution matrices from protein blocks. PNAS.

Henikoff, Steven and Henikoff, Jorja G. Position-based sequence weights. JMB.

Henikoff, Steven and Henikoff, Jorja G. Personal communication.

Henikoff, Steven, Wallace, James C., and Brown, Joseph P. Finding protein similarities with nucleotide sequence databases. Methods in Enzymology.

Hughey, Richard. Massively parallel biosequence analysis. Technical Report UCSC-CRL, University of California, Santa Cruz, CA.

Karplus, Kevin (a). Regularizers for estimating distributions of amino acids from small samples. In ISMB, Cambridge, England.

Karplus, Kevin (b). Regularizers for estimating distributions of amino acids from small samples. Technical Report UCSC-CRL, University of California, Santa Cruz. Available via ftp from ftp.cse.ucsc.edu in pub/tr/.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. Hidden Markov models in computational biology: applications to protein modeling. JMB.

Lawrence, C. E. and Reilly, A. A. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins.

Lüthy, R., McLachlan, A. D., and Eisenberg, D. Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins: Structure, Function, and Genetics.

Moncrief, N. D., Kretsinger, R. H., and Goodman, M. Evolution of EF-hand calcium-modulated proteins. I. Relationships based on amino acid sequences. Journal of Molecular Evolution.

Nakayama, S., Moncrief, N. D., and Kretsinger, R. H. Evolution of EF-hand calcium-modulated proteins. II. Domains of several subfamilies have diverse evolutionary histories. Journal of Molecular Evolution.

Nowlan, S. Maximum likelihood competitive learning. In Touretzky, D., editor, Advances in Neural Information Processing Systems. Morgan Kaufmann.

Persechini, A., Moncrief, N. D., and Kretsinger, R. H. The EF-hand family of calcium-modulated proteins. Trends in Neurosciences.

Fleischmann, R. D., et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science.

Sander, C. and Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins.

Santner, T. J. and Duffy, D. E. The Statistical Analysis of Discrete Data. Springer-Verlag, New York.

Sibbald, P. and Argos, P. Weighting aligned protein or nucleic acid sequences to correct for unequal representation. JMB.

Stultz, C. M., White, J. V., and Smith, T. F. Structural analysis based on state-space modeling. Protein Science.

Tatusov, Roman L., Altschul, Stephen F., and Koonin, Eugene V. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. PNAS.

Thompson, Julie D., Higgins, Desmond G., and Gibson, Toby J. (a). Improved sensitivity of profile searches through the use of sequence weights and gap excision. CABIOS.

Thompson, Julie D., Higgins, Desmond G., and Gibson, Toby J. (b). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. NAR.

Wang, Jason T. L., Marr, Thomas G., Shasha, Dennis, Shapiro, Bruce, Chirn, Gung-Wei, and Lee, T. Y. (to appear). Complementary classification approaches for protein sequences. Protein Engineering.

Waterman, M. S. and Perlwitz, M. D. Line geometries for sequence comparisons. Bull. Math. Biol.

White, James V., Stultz, Collin M., and Smith, Temple F. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Mathematical Biosciences.

Tables

Table: Parameters of Mixture Prior Blocks9.
(Columns: the nine components of Blocks9. Rows: the mixture coefficient $q_j$, the value $|\vec\alpha_j|$, and the twenty amino acids A through Y.)

This table contains the parameters defining a nine-component mixture prior estimated on unweighted columns from the Blocks database. The first row gives the mixture coefficient $q_j$ for each component. The second row gives $|\vec\alpha_j| = \sum_i \alpha_{ji}$ for each component; this value reflects how peaked the distribution is around the mean: the higher the value of $|\vec\alpha|$, the lower the variance around the mean. Rows A (alanine) through Y (tyrosine) contain the values of each component's parameters for that amino acid.

This mixture is available via anonymous ftp from our ftp site (ftp.cse.ucsc.edu) and on our web site at http://www.cse.ucsc.edu/research/compbio.

Analysis of the Nine-Component Dirichlet Mixture Prior Blocks9

Table: Preferred amino acids of Blocks9.
(Each row lists, for one component, the amino acids grouped by the ratio $r$ of their frequency in the component to their background frequency, from highest $r$ to lowest.)

Component 1: SAT, CGP, NVM, QHRIKFLDW, EY
Component 2: Y, FW, H, LM, NQICVSR, TPAKDGE
Component 3: QE, KNRSHDTA, MPYG, VLIWCF
Component 4: KR, Q, H, NETMS, PWYALGVCI, DF
Component 5: LM, I, FV, WYCTQ, APHR, KSENDG
Component 6: IV, LM, CTA, F, YSPWN, EQKRDGH
Component 7: D, EN, QHS, KGPTA, RY, MVLFWIC
Component 8: M, IVLFTYCA, WSHQRNK, PEG, D
Component 9: PGW, CHRDE, NQKFYTLAM, SVI

The function used to compute the ratio of the frequency of amino acid $i$ in component $j$ relative to the background frequency predicted by the mixture as a whole is

$$r_{ji} = \frac{\alpha_{ji} / |\vec\alpha_j|}{\sum_k q_k\, \alpha_{ki} / |\vec\alpha_k|} .$$

An analysis of the amino acids favored by each component reveals the following (a sketch computing these ratios follows the list):

1. Component 1 favors small neutral residues.

2. Component 2 favors the aromatics.

3. Component 3 gives high probability to most of the polar residues, except for C, Y, and W. Since cysteine can play two roles, either as a disulfide or as a free thiol, in this component it is apparently appearing in its disulfide role.

4. Component 4 gives high probability to positively charged amino acids, especially K and R, and to Q, favoring residues with long side chains that can function as hydrogen donors.

5. Component 5 gives high probability to residues that are both large and non-polar.

6. Component 6 prefers I and V, aliphatic residues commonly found in beta sheets, and allows substitutions with L and M.

7. Component 7 gives high probability to negatively charged residues, allowing substitutions with certain of the hydrophilic polar residues.

8. Component 8 gives high probability to methionine, but allows substitution with most neutral residues, especially the aliphatics.

9. Component 9 gives high probability to distributions peaked around individual amino acids, especially P, G, W, and C.
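The ratio computation itself is straightforward; a small Python sketch (our own, assuming the same q/alphas representation used in the implementation section):

\begin{verbatim}
def preference_ratios(q, alphas):
    # r_ji = (alpha_ji / |alpha_j|) / (sum_k q_k * alpha_ki / |alpha_k|)
    means = [[ai / sum(aj) for ai in aj] for aj in alphas]
    background = [sum(qk * mk[i] for qk, mk in zip(q, means))
                  for i in range(len(means[0]))]
    return [[m[i] / background[i] for i in range(len(background))]
            for m in means]
\end{verbatim}

Sorting each component's amino acids by $r_{ji}$, largest first, reproduces the orderings shown in the table above.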

Posterior Probability of the Components of Blocks9

Table: The posterior probability of each component of Blocks9, $\mathrm{Prob}(\vec\alpha_j \mid n)$, given increasing numbers of isoleucines and no other amino acids. (Columns: the nine components; rows: the number of isoleucines observed.) Initially, component 6, which favors I and V, is most likely, but as more isoleucines are seen without any valines, component 9, which favors distributions peaked around a single residue, becomes more likely.
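The posterior shift illustrated by this table can be reproduced with the component-posterior formula alone. A toy demonstration in Python (the two-component mixture here is hypothetical, chosen only to show the effect, with one spread-out and one sharply peaked component):

\begin{verbatim}
import math

def log_beta(x):
    return sum(math.lgamma(xi) for xi in x) - math.lgamma(sum(x))

def component_posteriors(n, q, alphas):
    # Prob(alpha_j | n): renormalize q_j * B(n + alpha_j) / B(alpha_j)
    logs = [math.log(qj)
            + log_beta([ni + ai for ni, ai in zip(n, aj)]) - log_beta(aj)
            for qj, aj in zip(q, alphas)]
    shift = max(logs)
    w = [math.exp(l - shift) for l in logs]
    total = sum(w)
    return [x / total for x in w]

q = [0.5, 0.5]
alphas = [[1.0, 1.0, 0.2],    # spread over the first two letters
          [5.0, 0.1, 0.1]]    # peaked on the first letter
for k in (1, 3, 10):          # k observations of the first letter only
    print(k, component_posteriors([k, 0, 0], q, alphas))
\end{verbatim}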

Table: Estimated amino acid probabilities using various methods, given one isoleucine.
(Columns, grouped as pseudocount methods, substitution matrices, and Dirichlet densities: Add-one, Add-share, Blosum, Subst, one-comp, and Blocks9. Rows: amino acids A through Y.)

This table and the three that follow give amino acid probability estimates produced by different methods, given a varying number of isoleucines observed and no other amino acids. The methods used to estimate these probabilities are: Add-one, which adds 1 to each count and then renormalizes; Add-share, which adds a smaller fixed share to each count and renormalizes; Blosum, which does Gribskov average score (Gribskov et al.) using the Blosum matrix (Henikoff and Henikoff); Subst, which does a matrix multiply with an optimized substitution matrix; one-comp, which is a single-component Dirichlet density optimized for the Blocks database (Karplus a); and Blocks9, which is the nine-component Dirichlet mixture given in the Blocks9 parameter table above.

Table: Estimated amino acid probabilities using various methods, given three isoleucines. (Same columns and rows as the preceding table; see its caption for details.)

Table: Estimated amino acid probabilities using various methods, given five isoleucines. (Same columns and rows as the one-isoleucine table; see its caption for details.)

Table: Estimated amino acid probabilities using various methods, given ten isoleucines. (Same columns and rows as the one-isoleucine table; see its caption for details.)

A Appendix

$|x| = \sum_i x_i$, where $x$ is any vector.

$n = (n_1, \ldots, n_{20})$ is a vector of counts from a column in a multiple alignment.

$n_t$ is the $t$-th such observation in the data set.

$|n| = \sum_i n_i$: the number of amino acids observed in a given column of a multiple alignment.

$|n_t| = \sum_i n_{ti}$: the number of amino acids observed in the $t$-th count vector $n_t$.

$p = (p_1, \ldots, p_{20})$, with $\sum_i p_i = 1$: the parameters of the multinomial distributions from which the $n$ are drawn.

$\mathcal{P}$ is the set of all such $p$.

$\vec\alpha = (\alpha_1, \ldots, \alpha_{20})$, with $\alpha_i > 0$: the parameters of a Dirichlet density.

$|\vec\alpha| = \sum_i \alpha_i$: a measure of the peakedness of the Dirichlet density with parameters $\vec\alpha$.

$\vec\alpha_j$: the parameters of the $j$-th component of the Dirichlet mixture.

$\alpha_{ji}$: the value of the $i$-th parameter of the $j$-th component of the Dirichlet mixture.

$\alpha_i$: the value of the $i$-th element of a Dirichlet density.

$q_j = \mathrm{Prob}(\vec\alpha_j)$: the mixture coefficient of the $j$-th component of the mixture.

$\Theta = \{q_1, \ldots, q_l, \vec\alpha_1, \ldots, \vec\alpha_l\}$: all the parameters of the Dirichlet mixture.

$w = (w_1, \ldots, w_{20})$: unconstrained values upon which we do gradient descent during training; after each training cycle, $\alpha_{ji}$ is set to $e^{w_{ji}}$.

$w_{ji}$: the value of the $i$-th parameter of the $j$-th weight vector. (The nomenclature "weights" comes from artificial neural networks.)

$m$: the number of columns from multiple alignments used in training.

$l$: the number of components in the mixture.

$\eta$: the learning rate used to control the size of the step taken during each iteration of gradient descent.

Table: Summary of notation.

$f = -\sum_{t=1}^{m} \log \mathrm{Prob}(n_t \mid \Theta, |n_t|)$: the objective function minimized.

$\Gamma$: the Gamma function, with $\Gamma(n+1) = n!$ for integer $n$.

$\Psi(x) = \frac{\partial \log \Gamma(x)}{\partial x} = \frac{\Gamma'(x)}{\Gamma(x)}$: the Psi function.

$\mathrm{Prob}(n \mid p, |n|) = \Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)}$: the probability of $n$ under the multinomial distribution with parameters $p$.

$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)}$: the probability of $n$ under the Dirichlet density with parameters $\vec\alpha$.

$\mathrm{Prob}(n \mid \Theta, |n|) = \sum_{k=1}^{l} q_k\, \mathrm{Prob}(n \mid \vec\alpha_k, |n|)$: the probability of $n$ given the entire mixture prior.

$\mathrm{Prob}(\vec\alpha_j \mid n) = \frac{q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)}$: shorthand for the posterior probability of the $j$-th component of the mixture, given the vector of counts $n$.

Table: Index to key derivations and definitions.

A.1 Lemma 1: $\mathrm{Prob}(n \mid p, |n|) = \Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)}$

Proof. For a given vector of counts $n$, with $p_i$ being the probability of seeing the $i$-th amino acid and $|n| = \sum_i n_i$, there are

$$\frac{|n|!}{n_1!\, n_2! \cdots n_{20}!}$$

distinct permutations of the amino acids which result in the count vector $n$. If we allow the assumption that each column is generated independently, then each such permutation has probability $\prod_i p_i^{n_i}$. Thus, the probability of a given count vector $n$, given the multinomial parameters $p$, is

$$\mathrm{Prob}(n \mid p, |n|) = \frac{|n|!}{n_1!\, n_2! \cdots n_{20}!} \prod_i p_i^{n_i} = |n|! \prod_i \frac{p_i^{n_i}}{n_i!} .$$

Since we may need to handle real-valued data, such as that obtained from using a weighting scheme on the sequences in the training set, we introduce the Gamma function, the continuous generalization of the integer factorial function:

$$\Gamma(n+1) = n! \,.$$

Substituting the Gamma function, we obtain the equivalent form

$$\mathrm{Prob}(n \mid p, |n|) = \Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)} . \qquad\Box$$
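As a quick sanity check of the lemma, consider a three-observation column containing two alanines and one cysteine, so that $n_A = 2$, $n_C = 1$, and $|n| = 3$:

$$\mathrm{Prob}(n \mid p, |n|) = \frac{\Gamma(4)}{\Gamma(3)\,\Gamma(2)}\, p_A^2\, p_C = \frac{6}{2 \cdot 1}\, p_A^2\, p_C = 3\, p_A^2\, p_C ,$$

which is just the familiar multinomial count of the three orderings AAC, ACA, and CAA.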

A.2 Lemma 2: $\mathrm{Prob}(p \mid \vec\alpha) = \frac{\Gamma(|\vec\alpha|)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i - 1}$

Proof. Under the Dirichlet density with parameters $\vec\alpha$, the probability of the distribution $p$ (where $p_i \geq 0$ and $\sum_i p_i = 1$) is defined as follows:

$$\mathrm{Prob}(p \mid \vec\alpha) = \frac{\prod_i p_i^{\alpha_i - 1}}{\int_{p' \in \mathcal{P}} \prod_i {p'_i}^{\alpha_i - 1}\, dp'} .$$

We introduce two formulas concerning the Beta function: its definition (Gradshteyn and Ryzhik),

$$B(x, y) = \int_0^1 t^{x-1}(1-t)^{y-1}\, dt = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)} ,$$

and the combining formula (Gradshteyn and Ryzhik),

$$\int_0^b t^{x-1}(b-t)^{y-1}\, dt = b^{x+y-1}\, B(x, y) .$$

This allows us to write the integral over all $p$ vectors as a multiple integral, rearrange some terms, and obtain

$$\int_{p \in \mathcal{P}} \prod_i p_i^{\alpha_i - 1}\, dp = B(\alpha_1, \alpha_2)\, B(\alpha_1 + \alpha_2, \alpha_3) \cdots = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(|\vec\alpha|)} .$$

This allows us to now give an explicit definition of the probability of the point $p$ given the Dirichlet density with parameters $\vec\alpha$:

$$\mathrm{Prob}(p \mid \vec\alpha) = \frac{\Gamma(|\vec\alpha|)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i - 1} . \qquad\Box$$

A.3 Lemma 3: $\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)}$

Proof. Since

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \int_{p \in \mathcal{P}} \mathrm{Prob}(n \mid p, |n|)\, \mathrm{Prob}(p \mid \vec\alpha)\, dp ,$$

substituting the results of Lemmas 1 and 2 into this equation, we obtain

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\prod_i \Gamma(n_i+1)\,\Gamma(\alpha_i)} \int_{p \in \mathcal{P}} \prod_i p_i^{\,n_i+\alpha_i-1}\, dp .$$

Pulling out terms not depending on $p$ from inside the integral, and using the result for this integral from Lemma 2, we obtain

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\prod_i \Gamma(n_i+1)\,\Gamma(\alpha_i)} \cdot \frac{\prod_i \Gamma(n_i+\alpha_i)}{\Gamma(|n|+|\vec\alpha|)} .$$

At this point, we can simply rearrange a few terms and obtain the equivalent form

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)} . \qquad\Box$$

A.4 Lemma 4: $\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\Gamma(|\vec\alpha| + |n|)}{\prod_i \Gamma(\alpha_i + n_i)} \prod_i p_i^{\,\alpha_i + n_i - 1}$

Proof. By Bayes' rule, the probability of the distribution $p$, given the Dirichlet density with parameters $\vec\alpha$ and the observed amino acid count vector $n$, is defined

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\mathrm{Prob}(n \mid p, \vec\alpha, |n|)\, \mathrm{Prob}(p \mid \vec\alpha)}{\mathrm{Prob}(n \mid \vec\alpha, |n|)} .$$

However, once the point $p$ is fixed, the probability of $n$ no longer depends on $\vec\alpha$. Hence,

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\mathrm{Prob}(n \mid p, |n|)\, \mathrm{Prob}(p \mid \vec\alpha)}{\mathrm{Prob}(n \mid \vec\alpha, |n|)} .$$

At this point we apply the results from the previous derivations for the quantities $\mathrm{Prob}(n \mid p, |n|)$ (Lemma 1), $\mathrm{Prob}(p \mid \vec\alpha)$ (Lemma 2), and $\mathrm{Prob}(n \mid \vec\alpha, |n|)$ (Lemma 3). This gives us

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\Gamma(|n|+1) \prod_i \frac{p_i^{n_i}}{\Gamma(n_i+1)} \cdot \frac{\Gamma(|\vec\alpha|)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i-1}}{\frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)}} .$$

Most of the terms cancel, and we have

$$\mathrm{Prob}(p \mid \vec\alpha, n) = \frac{\Gamma(|\vec\alpha| + |n|)}{\prod_i \Gamma(\alpha_i + n_i)} \prod_i p_i^{\,\alpha_i + n_i - 1} .$$

Note that this is the expression for a Dirichlet density with parameters $\vec\alpha + n$. This property, that the posterior density over $p$ is from the same family as the prior, characterizes all conjugate priors, and is one of the properties that make Dirichlet densities so attractive. $\Box$

A.5 Lemma 5: $\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}}$

Proof. The derivative of the logarithm of the probability of each individual observation $n$, given the mixture, with respect to $\alpha_{ji}$ is

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \frac{1}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} .$$

Applying the mixture decomposition, this gives us

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \frac{1}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial}{\partial \alpha_{ji}} \sum_{k=1}^{l} q_k\, \mathrm{Prob}(n \mid \vec\alpha_k, |n|) .$$

Since the derivative of $\mathrm{Prob}(n \mid \vec\alpha_k, |n|)$ with respect to $\alpha_{ji}$ is zero for all $k \neq j$, and the mixture coefficients $q_k$ are independent parameters, this yields

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \frac{q_j}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}} .$$

We rearrange the posterior formula somewhat and replace $\frac{q_j}{\mathrm{Prob}(n \mid \Theta, |n|)}$ by its equivalent $\frac{\mathrm{Prob}(\vec\alpha_j \mid n)}{\mathrm{Prob}(n \mid \vec\alpha_j, |n|)}$, obtaining

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{1}{\mathrm{Prob}(n \mid \vec\alpha_j, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}} .$$

Here, again using the fact that $\frac{\partial \log f(x)}{\partial x} = \frac{f'(x)}{f(x)}$, we obtain the final form

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial \alpha_{ji}} = \mathrm{Prob}(\vec\alpha_j \mid n)\, \frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial \alpha_{ji}} . \qquad\Box$$

A.6 Lemma 6: $\frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha, |n|)}{\partial \alpha_i} = \Psi(|\vec\alpha|) - \Psi(|n| + |\vec\alpha|) + \Psi(n_i + \alpha_i) - \Psi(\alpha_i)$

Proof. In this proof we use Lemma 3, giving

$$\mathrm{Prob}(n \mid \vec\alpha, |n|) = \frac{\Gamma(|n|+1)\,\Gamma(|\vec\alpha|)}{\Gamma(|n|+|\vec\alpha|)} \prod_i \frac{\Gamma(n_i+\alpha_i)}{\Gamma(n_i+1)\,\Gamma(\alpha_i)} .$$

Since the derivatives of terms not depending on $\alpha_i$ are zero, we obtain, for a single vector of counts $n$,

$$\frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha, |n|)}{\partial \alpha_i} = \frac{\partial}{\partial \alpha_i} \left( \log \Gamma(|\vec\alpha|) - \log \Gamma(|n|+|\vec\alpha|) + \log \Gamma(n_i+\alpha_i) - \log \Gamma(\alpha_i) \right) .$$

Now, if we substitute the shorthand

$$\Psi(x) = \frac{\partial \log \Gamma(x)}{\partial x} = \frac{\Gamma'(x)}{\Gamma(x)} ,$$

we have

$$\frac{\partial \log \mathrm{Prob}(n \mid \vec\alpha, |n|)}{\partial \alpha_i} = \Psi(|\vec\alpha|) - \Psi(|n|+|\vec\alpha|) + \Psi(n_i+\alpha_i) - \Psi(\alpha_i) . \qquad\Box$$
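This gradient is simple to evaluate numerically. A sketch (ours) in Python, assuming SciPy's digamma for the Psi function:

\begin{verbatim}
from scipy.special import digamma   # Psi(x) = d/dx log Gamma(x)

def dlogprob_dalpha(n, alpha):
    # Gradient of log Prob(n | alpha, |n|) with respect to each
    # alpha_i, in the Psi-function form of the lemma above.
    tot_n, tot_a = sum(n), sum(alpha)
    return [digamma(tot_a) - digamma(tot_n + tot_a)
            + digamma(ni + ai) - digamma(ai)
            for ni, ai in zip(n, alpha)]
\end{verbatim}

Combined with Lemma 5, weighting this vector by $\mathrm{Prob}(\vec\alpha_j \mid n)$ gives the gradient of the mixture log-likelihood with respect to the parameters of component $j$.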

A.7 Lemma 7: $\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \mathrm{Prob}(n \mid \Theta, |n|)}{|Q|}$

Proof. Substituting $\mathrm{Prob}(n \mid \Theta, |n|) = \sum_j q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)$ and replacing $q_j$ by $Q_j / |Q|$, this gives us

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\partial}{\partial Q_i} \sum_j \frac{Q_j}{|Q|}\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|) .$$

As the derivative of a sum is the sum of the derivatives, we can use the standard product rule for differentiation and obtain

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \sum_j \left( \frac{\partial (Q_j / |Q|)}{\partial Q_i}\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|) + \frac{Q_j}{|Q|}\, \frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial Q_i} \right) .$$

Since $\frac{\partial\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{\partial Q_i} = 0$ for all $j$, this gives us

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \sum_j \mathrm{Prob}(n \mid \vec\alpha_j, |n|)\, \frac{\partial (Q_j / |Q|)}{\partial Q_i} .$$

Taking the derivative of the fraction $Q_j / |Q|$ with respect to $Q_i$, we obtain

$$\frac{\partial (Q_j / |Q|)}{\partial Q_i} = \frac{\partial Q_j / \partial Q_i}{|Q|} - \frac{Q_j}{|Q|^2}\, \frac{\partial |Q|}{\partial Q_i} .$$

The first term is zero when $j \neq i$, and is simply $1/|Q|$ when $j = i$. The second term is simply $Q_j / |Q|^2$, since $\partial |Q| / \partial Q_i = 1$. Thus,

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|)}{|Q|} - \sum_j \frac{Q_j}{|Q|^2}\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|) .$$

Here, $q_j = Q_j / |Q|$ allows us to replace $Q_j / |Q|^2$ with $q_j / |Q|$, giving us

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \sum_j q_j\, \mathrm{Prob}(n \mid \vec\alpha_j, |n|)}{|Q|} .$$

At this point we use the mixture decomposition once more and obtain

$$\frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \mathrm{Prob}(n \mid \Theta, |n|)}{|Q|} . \qquad\Box$$

A.8 Lemma 8: $\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{1}{|Q|} \left( \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{q_i} - 1 \right)$

Proof.

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{1}{\mathrm{Prob}(n \mid \Theta, |n|)}\, \frac{\partial\, \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} .$$

Here we use Lemma 7, which allows us to express the derivative with respect to $Q_i$ of the log-likelihood of a single observation $n$ given the mixture as

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|) - \mathrm{Prob}(n \mid \Theta, |n|)}{|Q|\, \mathrm{Prob}(n \mid \Theta, |n|)} = \frac{1}{|Q|} \left( \frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)} - 1 \right) .$$

If we rearrange the posterior formula, we obtain $\frac{\mathrm{Prob}(n \mid \vec\alpha_i, |n|)}{\mathrm{Prob}(n \mid \Theta, |n|)} = \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{q_i}$. This allows us to write

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{1}{|Q|} \left( \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{q_i} - 1 \right) .$$

Now we can use the identity $q_i = Q_i / |Q|$, obtaining the equivalent

$$\frac{\partial \log \mathrm{Prob}(n \mid \Theta, |n|)}{\partial Q_i} = \frac{\mathrm{Prob}(\vec\alpha_i \mid n)}{Q_i} - \frac{1}{|Q|} . \qquad\Box$$
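In an implementation, this lemma translates into a simple gradient for the unconstrained $Q$, summed over the training count vectors. A hedged sketch (ours; the small positive floor keeping $Q$ positive is our own expedient, not part of the derivation):

\begin{verbatim}
def mixture_coefficient_step(Q, posteriors, eta=0.01):
    # d f / d Q_i = -sum_t (Prob(alpha_i | n_t) / Q_i - 1 / |Q|),
    # where f is the negative log-likelihood objective and
    # posteriors[t][i] = Prob(alpha_i | n_t).
    totQ = sum(Q)
    grad = [-sum(post[i] / Q[i] - 1.0 / totQ for post in posteriors)
            for i in range(len(Q))]
    Q_new = [max(Qi - eta * gi, 1e-10) for Qi, gi in zip(Q, grad)]
    totQ_new = sum(Q_new)
    return Q_new, [Qi / totQ_new for Qi in Q_new]  # new Q and q = Q/|Q|
\end{verbatim}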