Sequence alignment in

Ewan Birney

Sanger Centre

Wellcome Trust Genome Campus

Hinxton Cambridge CB SA

England

Email birneysangeracuk

March

Contents

Intro duction

Information ow in biology

Probabilistic Finite State Machines

PFSMs in bioinformatics

Complex FSMs

Previous use of PFSM in bioinformatics

DNA comp osition mo dels

prole hidden Markovmodels

PFSM interpretations of pairwise sequence alignment

Probabilistic mo dels of RNA

Genome Mapping

Metho ds

Other techniques

Dynamite

Intro duction

The use of dynamic programming in bioinformatics

PFSMs and dynamic programming

Finding the maximum likeliho o d path

Finding the total probability of observations

Dynamite

The Dynamite language

formal denition of a dynamic programming recursion

Implementations provided by Dynamite

Viterbi quadratic memory alignment

Viterbi score only linear memory

Forwards score only linear memory

Recursive linear memory alignment

Serial database search

pthreads Database search

OneMo del BioXLG p ort

Software engineering details of Dynamite

Innovations in Dynamite

Sp ecial states

Lab els

Compile time error detection by the Dynamite compiler

An optimiser for dynamic programming

Example Dynamite programs

Dna Blo ck Aligner DBA

Deriving an alignment from a D sup erp osition of protein

structures

Comparing two transmembrane proteins

Other Dynamic Programming to olkits

UNIX pattern matchers

Dong and Searls

Lefebvre

Discussion

GeneWise

Intro duction

The bio chemistry of premRNA splicing

Current computer approaches to predicting splicing patterns

PFSMs in Gene Prediction

Performance of ab initio Gene Prediction programs

Combining Homology with Gene Prediction

Combining Probabilistic Mo dels

GeneWise mo del

Parameterisation

Splice Site Mo dels

Co don emission probabilities

Insertion or Deletion errors

Intron parameterisation

Path scoring

Flanking Regions

Co ding region scoring

Using the GeneWise algorithm

Example of using GeneWise

Evaluation of GeneWise

Other evaluations of GeneWise

Guigo and Agarwal

The Drosophila annotation exp eriment

Discussion of GeneWise

Pfam a protein family database

Intro duction

protein prole HMMs of domains

The Pfam database

Requirements for Pfam as database

The Pfam Database Management System

Triggers on data entry

Pro ductivity to ols

Underlying Sequence database up date

Middleware Layer

Some Example families

The RNA Recognition Motif

Protein complexes

Discussion

Genome

Intro duction

Halfwise

Worm Genome

Comparison to curated worm genes

Indication of annotation errors

GeneWise accuracy in the worm

Comparison to protein Pfam analysis

Human Chromosome

Comparison to curated genes

Co ding density of Human vs Celegans

Discussion

Conclusion

List of Tables

Performance of

Performance of

Performance of Viterbi

Performance of

Guigo and Agarwal assessment

Pfam Database Management System utilities

Pfam Pro ductivitytools

Blast Sensitivity

Pfam across the worm

Introns in the worm

Potential annotation errors

Halfwise accuracy

Pfam across chromosome

Chromosome accuracy

List of Figures

Information ow in Biology

Simple nite state machine

Simple nite state machine parameterised

Non deterministic nite state machine

DNA motif PFSM

Alignment PFSM

Simple vs Complex PFSM

The two state gap mo del

Basic Viterbi recursion

Dynamic Programming implementation

Divide and conquor recursion

Dna Blo ck Aligner DBA

Dna Blo ck Aligner output

Transmembrane matching PFSM

Transmembrane comparison output

Diagram of splicing

Mo del combination

Gene mo del for GeneWise

protein prole HMM

The merging of gene prediction and homology mo dels

GeneWise

GeneWise

An example output of GeneWise

Diagram of the Pfam middleware

The RRM seed alignment

Preface

Many p eople have help ed myscientic career which has led up to this thesis AtCold

Spring Harb or Adrian Krainer and Akila Mayeda were great teachers for a young

student in molecular biology I have also beneted from the timely interjections

of Jim Watson as I have b ounced around institutions At Oxford Iain Campb ell

allowed me to work unhindered for a year which I am very grateful for

The most rewarding asp ects of mywork have come in the last three years Atthe

Sanger CentreIhave b een stimulated byamused by and grown with the researchers

around me in particular Ian Holmes Dan Lawson and Kevin Howe

have made my working life a pleasure Michele Clamp and James Cu were two

friends whos scientic minds not to mention their out of hours companionship has

made life at Hinxton more enjoyable

Finally Richard Durbin has been a wonderful sup ervisor of a p otentially im

p ossible student He has taken some pretty ill formed ideas and formed sensible

research from it he has mo derated my stubb ornness and tolerated my distractions

Finally he has provided good humour and stimulating discussions Many thanks

Richard

Chapter

Intro duction

The cost of sequencing DNA is around p ence a base one example of how inex

pensive data has b ecome in biology Rather than investigating each asp ect of the

biology of an organism piecemeal huge quantities of data can b e gathered at once

For example an entire genome can b e sequenced for an achievable cost The genome

of an organism enco des in principle nearly all the information of its biology So for

an outlay whichis relatively small compared to the worldwide annual exp enditure

on biomedical research a collection of data is being generated that will underpin

nearly all asp ects of human biology

These simple economic facts have started a revolution in biology Biological

science used to be a science of craftsmen pro ducing individual tailored facts that

painted a single picture suddenly part of it is b ecoming a science of large scale data

pro duction At the junction of the new and the old techniques lies a new science

bioinformatics trying to organise the mass of data into large sets of coherent images

that can b e viewed along side the traditional biologists work

Bioinformatics is a science which simply did not exist in its current form ten

years ago Computer scientists and biologists could condently exp ect never to

cross paths in a professional capacity Although there have been many interesting

applications of b oth mathematics and computers to biological problems b efore the

s these have b een in sp ecialised elds with narrow applications such as statistical

genetics or mo deling For the larger questions in biology biolog

ical exp eriments were by far the best way to progress However with the advent

of such a large amount of biological data the biologists are having to use computa

tional and mathematical techniques to try to manipulate the data into interpretable

information Because of this there are quantitative scientists being recruited into

the eld These researchers are often more at home with sciences with well dened

rules and a mathematical foundation they nd that biology has few rules and many

exceptions but tantalizing applications for theories that have been develop ed in

more quantative elds

The pace of change and the clash of cultures makes bioinformatics a wonderful

science Everyone is always learning new asp ects of biology or computer science

dep ending on their background or new technologies which threaten to give us an

other mass of data for some biological problem The p eople who participate can

nd radically new p ersp ectives on old problems simply from the cross p ollination

between the two elds

This thesis is fo cused on one area of bioinformatics using dynamic programming

to analyse sequences to extract useful information Each chapter concentrates on

the generation of one programmatic to ol or resource Each to ol is presented by

examining rst the background and theory of its use and then the programmatic

issues which ensure ecient and useful applications For to ols which pro duce bio

logical results I presentanevaluation of its eectiveness Finally eachchapter ends

with a discussion of the success or failures of the to ol and any future directions

The rest of the intro duction will describ e the background theory that the other

chapters rely on in particular probabilistic nite state machines PFSMs and their

applications to biology will b e throughly examined Chapter describ es a program

ming to olkit Dynamite that automates the programming of these PFSMs Chapter

is a real applicatios of Dynamite in gene prediction where I provide a principled

waytocombine ab initio and similarity based approaches Chapter describ es the

work I did on Pfam a database of protein families and Chapter describ es how I

applied Pfam on a genome wide scale

I hop e you enjoy reading the remainder of the work

Information ow in biology

Figure shows the basic ow of information between the genome top of the

gure and the phenotyp e b ottom of the gure Researchers are interested in the

molecular function that leads to the phenotyp e The genome provides the data

which can b e determined in an automatic fashion This sequence information is an

invaluable resource in its own right providing information for molecular biologists to

design and pro duce new exp eriments Another use of the sequence data is to b e able

to predict the function of the sequence without any additional exp eriments One

role of bioinformatics is to somehow help bridge the gap b etween the sequence data

and the functional data This transformation of genome sequence to the functional

information is understo o d in broad terms in the sense that we know the biological

players but nowhere near well enough to allowa mechanical deduction

The ve basic steps from the genomic sequence to function b eing transcription

premRNA splicing translation protein folding and the nal function of the protein

pro duct mean that there is considerable manipulation of the information in genomic

sequence b efore it is actually pro duces a biological eect This transformation of

information is understo o d extremely well for one step translation For transcription

and premRNA splicing there is a partial understanding of the biological pro cess

which provides this manipulation and for protein folding and the function of the

protein pro duct our understanding is far from ideal

The inabilityto deduce the function of a piece of genomic sequence simply by

understanding the biological machinery which pro cesses it has forced a dierent

approachtotheproblem of deriving functional information from sequence This is

to make arguments based on homology ie that two sequences shared a common

ancestor sequence at some point of evolution and since their divergence have kept

many features the same from sequence to structure to function

By comparing two sequences either genomic DNA mRNA or protein sequences

some inference of whether they really share homology can be gained This infer

ence has to be some typ e of statistical test whichprovides a way of distinguishing

non homologous pairs from homologous pairs Given that a researcher accepts the

inference he or she can then assert that at least some of the protein structure is

shared b etween the two genes In addition in many cases the functional information

can b e transfered There are cases where the transfer of functional information by

homology is plain wrong for example the gamma lens crystallin has a clear homolog

in alcohol dehydrogenase but their functional roles the former to refract light in

the lens and the latter to metab olise alcohol are unrelated However the structural

similarity is clear Biology has simply coopted this enzyme for a dierent role

Thankfully such case are rare and transfer of functional information is in general Genomic DNA

transcription pre−mRNA

premRNA splicing

mRNA

translation

Protein

Protein folding

Protein Structure

function

Glucose Glucose6phosphate

Figure A gure showing the central dogma in biology The top is genomic DNA

going down the page are the main pro cesses which transform the information in the

genomic DNA to the functional asp ects of the organism The exons are in blue

and the introns and intergenic DNA as thin lines The start and stop co dons are

represented as thin vertical lines The magenta rectangle represents a linear protein

sequence and the red circles its three dimensional structure

reliable

Much of bioinformatics is therefore interested in one of the two following prob

lems The rst problem is providing computer programs which mo del the transfor

mation of information from one biological entity to another in other words mo d

eling in a computer one of the steps in gure The second problem is providing

computer programs which mo del the pro cess of evolution b etween two homologous

genes Both these problems can be well describ ed as Finite State Machines which

leads us to the next section

Probabilistic Finite State Machines

Finite state machines FSMs are well known computer science constructs which

have a wide range of uses from the abstract theory computation to a variety of

practical problems including sp eech recognition and pro cess engineering They

are also a good t to biological as biological sequences are rep

resented well by linear chains of letters This section will briey intro duce nite

state machines and the following section will review some of the previous work in

bioinformatics that uses them

A nite state machine has a number of states connected by transitions The

machine starts in one state moves from state to state and stops when it reaches

a particular state The action of moving between the states on the basis of the

transitions pro duces a sequence of observable events such as sounds readings from



a machine or biological sequence data An imp ortant point is that the choice of

the next transition is not dep endent on the previous transition or in some cases

only dep endent on a limited numb er of pro ceeding states One nice feature of nite

state machines is that an innite set of observations can b e generated despite the

nite nature of the machine The machine in Figure could generate ab or aab

or aaabbbbb but not aabbaa indeed the machine will generate any a b string

n m

This machine only distinguishes b etween strings whicharevalid a b or not valid

n m

The most common extension of Finite State Machines used in this eld is to

asso ciate probabilities with the transitions and their asso ciated emissions The ef

fect of this is to assign a probability of certain strings emerging from the machine



for readers with a background in nite state machines I will b e consistently using the Mealy

representation of FSMs with emissions o ccurring on the transitions throughout a b

Start End

a b

Figure A nite state machine which generates a string of as followed by a

string of bs The round circles represents states and the arrows between them

transitions On each transition potentially a letter can be emitted which b ecomes

the observed data The starting and stopping criteria of this state machine are

shown by rectangles

For example gure shows the previous gure but now parameterised with prob

abilities This machine would generate the string aaaaaaaab with a far higher

probability than aabbbbbbbbbb

One nal prop ertyofFSMsisthatitisnotalways the case that one can deduce

the state of the machine by the sequence of observed letters The machine I have

used as in examples so far is one in which the current state of the machine can

be deduced from the sequence of letters such a machine is called a deterministic

nite state machine However it is very easy to construct machines in which this

is not the case An example is shown in gure Many paths can be consistent

with a particular observed string of letters in nondeterministic nite state machines

meaning that it is not clear which state path made a particular observed string

For PFSMs one natural state sequence to consider is the one with the highest

probability of generating observed symb ols This is known as the most likely state

path which is calculated by an algorithm called the Viterbi path The probability

of the Viterbi path is called the Viterbi score This probability is generally not the

same as the total probability considering every path which is sometimes called the

sum of all paths or the Forward score I will describ e the algorithmical details for

calculating these scores and the Viterbi path in the next chapter

Nondeterministic probabilistic nite state machines are the main typ es of ma a b 0.9 0.5

1.0 0.1 0.5 Start End

a b

Figure The nite state machine in gure parameterised with probabilities

Each transition has a probability asso ciated with it including the transition leaving

start which has a probabilityof The probabilities for all the transitions leaving

a particular state sum to one

chines used in bioinformatics They are equivalent to hidden Markov models A

hidden Markov mo del is a mathematical mo del which evolves in some dimension

often time but need not b e via Markov rules and has certain variables whichare

not observable In the case of PFSMs the hidden variables is the path information

through the state machine ie in gure which sequence of states was used to

generated the b letters The two descriptions of the pro cess as probabilistic Finite

State Machine and hidden Markov mo del are entirely equivalent I prefer using

the PFSM formalism as for me the description of the pro cess is b oth more intuitive

and closer to the programmatic implementation

PFSMs in bioinformatics

There are two main typ es of probabilistic nite state machine which are used in

bioinformatics corresp onding to the two typ es of problems to solve The rst has

each transition emit a single letter of an observed sequence This typ e of mo del is

the familiar hidden Markov mo del which is used in sp eech recognition and other

elds In bioinformatics this typ e of mo del includes prole hidden Markov mo dels

for protein domains simple mo dels of protein sequence and gene prediction hidden

Markov mo dels The states represent theorized explanations of the observations

dierent phonemes that make up a word in sp eech recognition and the dierent a 0.9

1.0 Start

0.09 b b 0.01

0.5 0.1 b b 0.9 0.5

End

Figure A non deterministic nite state machine which pro duces the same typ e

of language as the previous nite state machines namely a b but there are two

n n

classes of strings generated one with short b tails coming from the left hand side of

the gure and one with long b tails coming from the right hand side The long b tail

class is rarer than the short b tail class This can b e deduced by the probabilities of

the machine For a particular string of letters one path through either the left or

the right side of the machine will have a larger probability of pro ducing that string

of letters however whichwas used cannot b e known for certain gatc gatc

Gatc Gatc gatC Gatc

start end

Figure A gure showing a simple PFSM which mo dels a small DNA motif

in this case a sp transcription factor binding site with a consensus of GGCG

The rectangular states show the start and end p oints The two lo oping transitions

emit bases at equal probabilities generating random sequence either side of the

motif The other transitions emit bases with a strong G or C bias indicated by

capitalisation This mo del is a very simple mo del more complicated transitions

can mo del more complex pro cesses

biological features egexons and introns in gene prediction mo dels An example is

show in gure which mo dels a motif in a DNA sequence

The second typ e of mo del is one which emits letters from two dierent sequences

of observations Outside of bioinformatics this typ e of mo del is reasonably rare the

only mainstream application is for transducing grammars grammars which convert

one language to another However in bioinformatics this mo del is ubiquitous as it

represents the alignment of one sequence to another Figure shows the common

SmithWaterman typ e PFSM which represents the alignmentoftwo DNA sequences

by a pro cess allowing ane gaps scores

In this alignment typ e PFSM the states represented theorized explanations of

evolution for example the pro cess of inserting an amino acid In the PFSM struc

ture nding the most likely path incorp orates nding the b est p osition of the gap

characters in the sequence ie determining the alignment A:A

X

−:A A:− A:A A:A

Z Y −:A A:−

Seq1: ATGGGTGGT−−−−−GGTT Seq2: ATG−−TGCTGGTATTGTT

State XXXYYXXXXZZZZZXXXX

Figure Figure showing a PFSM of an alignmentof two sequences The PFSM

has transitions which emit a pair of letters one from each sequence p otentially with

gap residues which indicate the absence of a letter

Complex FSMs

The PFSMs describ ed so far have been relatively simple More complex machines

can b e written in the same formalism For example gure shows a mo del of a

bacterial sequence containing a single co ding gene The top panel shows this drawn

out as individual states each transition emitting a single base The lower panel

shows this contracted to a smaller state diagram but with the transitions emitting

more than one base at a time

The lower panel is both a far more compact way of representing a the PFSM

and is also closer to how the implementation in a programming language works

Notice that on the state after the main co don state in brown the transitions which

enter the state have dierent lengths of emissions in one case they are base pair

co dons and in the other case one base pair DNA sequence Having dierent length

emissions leading to the same state is a common o ccurrence in bioinformatics This

is why I prefer the Mealy representation of PFSMs which has the emissions on the

transitions as I have done throughout this work The other representation of PFSMs

is the Mo ore representation which has the emissions o ccurring on the states and is

p erhaps more commonplace The two representations are interconvertible in my

hands Mealy machines are clearer representation of PFSMs than Mo ore machines

for the problems I am interested in

Previous use of PFSM in bioinformatics

There have b een a large numb er of applications of probabilistic nite state machines

in bioinformatics over the last decade This includes b oth standard PFSMs such

as hidden Markov mo dels for mo deling DNA sequence and protein sequence and

alignment PFSMs The following sections go es into more details on sp ecic examples

of PFSMs in bioinformatics

DNA comp osition mo dels

One of the earliest explicit uses of hidden Markov mo dels and hence PFSMs in

bioinformatics was to mo del DNA comp osition Gary Churchill provided a number

of HMMs to deduce prop erties of DNA sequence suchasGCcontent and eukaryotic

iso chores He calculates the marginal distribution of what state with some Start x a

t a a a g a a t t t t g a a a a g ...... t g a a t t x Stop

x XXX x

Start Stop

atg tag taa

tga

Figure Figure showing two representations of the same PFSM which mo dels a

piece of bacterial sequence that contains a single protein co ding gene The colours

are consistent between the two views On the top panel PFSM equivalent to a

zero order HMM is written out with one or zero letters emitted on each transition

x stands for any base and the region indicates that the three state blo ck is

rep eated over all co dons The lower panel is the same PFSM equivalent to a

second order HMM is shown The XXX indicates any triplet each one of which will

have a separate probability

dened mo del pro duced a particular sequence the dened mo dels he uses are

inspired by known biological phenomena such as GC islands in the yeast genome

and iso chore switching in eukaryotes

prole hidden Markov mo dels

Anumb er of groups have employed a typ e of PFSM which mo dels protein sequences

called prolehidden Markov models due to its similarity to prole analysis whichhad

b een previously employed byanumber of groups in bioinformatics Anders Krogh

and colleagues were the rst group to provide a clear denition and implementation

of this sort of mo del They dened a restrictivearchitecture of hidden Markov

mo del which is based around rep etitions of states called Match M Insert I

and Delete D This choice of architecture was delib erately mo deled after previous

work on prole analysis which they acknowledge in the pap er The

rest of the work details how they applied standard HMM techniques to this problem

including parameter training using exp ectation maximisation and using the Viterbi

algorithm to discover the best path through a number of sequences providing a

multiple alignment The pap er also solves a numb er of practical problems in using

this typ e of HMM Firstly they intro duce additional states to mo del the fact that

a conserved region mightbeemb edded inside a protein sequence which is otherwise

unrelated to the HMM These additional states called dummy states are added

at the start and the end of the HMM Secondly they provide a useful statistic

to estimate the signicance of the match of a HMM to a sequence They found

that the likeliho o d of the sequence given the mo del over all paths called Negative

Log Likeliho o d NLL score is strongly correlated with the length of the sequence

Therefore they provided a way of correcting for this correlation rep orting a Zscore

of the numb er of standard deviations away from a windowed length bin of sequences

discarding outliers The nal Z score is used as a statistic and they suggest using

a Zscore of standard deviations as b eing a sensible cuto but suggested caution

in trusting automatic cutos They asses their work by mo deling three families

the globins the protein kinases and the EF hands In each case they were able to

recover nearly all known examples of the domain and nd p otentially novel examples

of the domains in the protein database Also they show that the multiple alignments

whichwere generated by the HMM are sensible

Another strong group in the prole hidden Markov mo del eld is and

colleagues The proleHMM software package written by Sean Eddy called

HMMER has b ecome the most widely used proleHMM package in bioinformatics

A number of imp ortant enhancements to proleHMMs are incorp orated into the

package which makes it ideal for practical use Firstly the problem of estimating

the mo del size in terms of the number of match insert delete no des used is

solved using an exact metho d which estimates the p osterior probability of whether

a column in an alignment should b e mo deled at a match or insert state Secondly

the additional states whichallowa prole HMM to mo del a domain rather than a

entire sequence was extended to allow for multiple copies of a domain to b e mo deled

with additional random sequence b etween the copies This changes the architecture

of the HMM and at the same time two transitions between insert to delete and

delete to insert were dropp ed in the new architecture informally called plan

These changes made HMMER a far b etter package for mo deling protein domains in

sequences Thirdly the statistical rep orting of whether a rep orted match represents

a particular domain was put into a b etter framework Firstly the basic score was

rep orted as a log likeliho o d ratio LLR rather than NLL considering the ratio to an

alternative null mo del Secondly it was assumed that the distribution of random

scores from the Viterbi log likeliho o d score followed an Extreme Value Distribution

EVD This assumption is based on the work on the statistics of pairwise sequence

alignments in which the Extreme Value Distribution has b een shown theoretically

for ungapp ed lo cally alignments and shown empirically to also t well for gapp ed

alignments The tting to an EVD is done by a separate calibration step

sp ecic for each prole HMM The derived exp ectation values evalues for the

numb er of random sequences exp ected to get a particular score or higher work well in

practice allowing a single cuto to b e applied across all prole HMMs automatically

A number of other supp orting pap ers to the basic prole hidden Markov mo d

eling technique have b een published These include derivations of sensible priors

using Dirchlet mixtures for the amino acid distributions in the match and in

sert states justication of the previous prole techniques in terms of large prior

information incorp oration of more motif like training techniques which drops

the numb er of parameters to b e trained in the EM pro cess and novel sequence

weighting mechanisms to derivetheb est mo del for a particular family In ad

dition the success of prole HMMs haveprovided a more detailed study into their

ecacy in entirely automated mo del training and a number of innovations in

the implementation of the algorithms Finally a numb er of databases of prole

HMMs have b een develop ed suchasPfamSMART and Prosite Proles

and TIGRFAM Chapter details mycontribution to the Pfam pro ject

Protein prole HMMs can probably b e seen as the biggest success in the deploy

ment of PFSMs in bioinformatics

PFSM interpretations of pairwise sequence alignment

Pairwise sequence alignment was one of the rst applications in sequence analy

sis develop ed some years ago A number of researchers have used probabilistic

mo dels to interpret pairwise sequence alignment with considerable success Martin

Bishop and Elizab eth Thompson provided one of the earliest probabilistic interpre

tations Their motivation was to place the alignment metho d characterised by

Needleman and Wunschinto a formal mo del Although the pap er do es not mention

Finite State Machines as such in it their metho d for evaluating the likeliho o d of

divergence between two sequences is easily recognisable as the sum over all paths

for a PFSM They show that for tRNA sequences this metho d gives sensible and

interpretable results

A very similar approach was presented almost a decade later by Philip Bucher

and Kay Homann in a pap er whichwas more fo cused on providing a sensitive

search metho d using pairwise alignment Again they recast a standard algorithm

this time the lo cal alignment metho d of Smith Waterman into a probabilistic for

malism One of the main dierences however is that they did not provide an explicit

probabilistic parameterisation of the metho d Instead they reinterpreted the Smith

Waterman scoring scheme in terms of a probabilistic framework their recursions

S

ij

have a heavy use of terms like z where z is the base of the log which is implicit

in the scoring matrix and S is the score at p osition i in one sequence and j in

ij

another sequence In other resp ects the recursions are very similar to a sum over all

paths typ e calculation The power terms are equivalent to probabilities in a more

standard framework but the advantage of this work is that they can use arbitrary

gap penalty settings and incorp orate lo cal alignment behavior and still provide a

sum over all paths The success of their metho d was shown with a number of exam

ples A b etter judgment of their success came from an indep endent assessment of

anumber of metho ds from a separate group in which their metho d was a clear

leader

Jun Zhu and colleagues provided one of the most complete interpretations of

pairwise alignment in a Bayesian framework called the Bayes Aligner

Their intention was to remove the nuisance variables of sp ecifying the parameters

for the substitution matrix and the gap p enalties Both of these cases are hard to

remove as one cannot nd easy integrals which would provide analytical answers

for summing over an entire distribution of parameters For the substitution matrix

they simply summed over a series of dierent matrices they suggested taking ma

trices from a well dened series such as the PAM matrices For gaps they to ok a

non standard approach Rather than sp ecifying gap parameters they considered all

p ossible alignments with up to k gaps k typically With careful manipulations

of the equations of the denitions of the joint probability of seeing two sequences

over all p ossible parameters they provided metho ds to sum over all given matrices

and over all given number of gaps up to some maximum These metho ds are very

similar to the standard sum over all paths recursions The authors provide a num

b er of dierent p osterior distributions in particular the p osterior distribution of the

matched residues over all gap numbers and all gap matrices make for interesting

and visually app ealing pictures One consequence of lo oking at the p osterior distri

bution of probabilities for matched residues is that they show there are a number of

alternative paths b etween well conserved blo cks For example the authors showan

alignment for the twoGTPases in whichtwo clear wellaligned blo cks are apparent

but there are a numb er of alternative routes b etween them The p osterior distribu

tion over the PAM matrix choice is also interesting with a bimo dal distribution for

this example of either or PAM distances suggesting that evolution is not a

homogeneous pro cess across a sequence

The dierence in notation and change towards nding alignments of a xed num

ber of blocks rather than alignments with sp ecic gap parameters make understand

ing the Bayes Aligner as a standard PFSM challenging By examining the recursions

carefully it is clear that the Bayes Aligner has a PFSM similar to that shown in

gure This typ e of machine has b een previously published as the generalised

gap penalty mo del though not with probabilistic parameterisation A b enet

of this architecture is that gaps can contain unaligned but otherwise matched

residues representing regions where residues o ccurs in b oth sequences but they are −:b a:b a:− a:−

A:B M G a:b A:B

−:b

Figure The two state gap mo del common to b oth the generalised gap p enalty

metho d and the Bayes aligner The two states are called Match M or Gap G

Transitions leading to the match state emit two aligned residues indicated by the

AB pairs Transitions leading to the Gap state emit one of three dierent alignment

pairs gapp ed A sequence a gapp ed B sequence b and dual emission of b oth A

and B but not aligned ab

not alignable Both the Bayes aligner and the generalised gap p enalty pap ers

emphasise that this is p ossibly a b etter mo del of protein evolution The use of a

xed numb er of gaps rather than a gap p enalty is harder to interpret in a standard

PFSM framework However they use priors over the gap numbers which are not

equiprobable for each number of gaps but rather take into account the complexity

of the number of dierent p ossible alignments for each gap number this is b est

describ ed in the pap er This prior can b e thought of as a prior over gap

parameters in a more standard PFSM and the recursions provided to sum over all

p ossible gaps b ecome almost identical to approximating the integral over gap pa

rameters by sampling at anumb er of p oints in the gap parameters distribution It

is reassuring to nd a mapping from the techniques used in the Bayes Aligner to

more standard PFSM techniques

Ian Holmes and Richard Durbin published a pap er that investigated alignment

accuracy They built up a framework to investigate alignment accuracy around

a probabilistic mo del of sequence evolution Given this framework they showed

that the natural choice of parameters for PFSM to align the sequences b eing the

probabilities from the mo del which generated the sequences were good parameter

choices They also derived an analytical approximation for the exp ected accuracy

of alignment for a set of parameters emphasising that no alignment pro cess can

be exp ected to be p erfectly accurate Finally they develop ed a novel algorithm to

maximise the accuracy of an alignment as opp osed to maximising the likeliho o d

of an alignment This maximal accuracy alignmentisatyp e of p osterior deco ding

using the p osterior distribution of state lab els of the two sequences The algorithm

has since b een applied to proleHMMs to go o d eect

Probabilistic mo dels of RNA

RNA sequence is commonly used as a biological entity in its own right as well as

being used as an intermediate in the pro cessing of a gene RNA sequences which

are biologically active form sp ecic three dimensional structures which provide the

function of the RNA sequence The pro cess of the one dimensional sequence of RNA

folding into the three dimensional structure is dominated by base pairing interactions

between the RNA bases and the ma jority of those are found in stemlo op structures

in which a series of consecutive bases pair up with another consecutive region of base

pairs pro ducing a free single stranded region b etween the two base paired regions

Two groups published pap ers on applying probabilistic mo dels to RNA sequences

at around the same time Sean Eddy and Richard Durbin develop ed a probabilistic

mo del for these RNA stem lo op structures This mo del was based around the

idea of parsing the RNA stem lo op structures into tree which represented the stem

lo op pairings These trees could mo del stem lo ops with bulges and single stranded

RNA but could only mo del nested stem lo op structures RNA pseudoknots in

which one stem lo op is interp osed b etween another in a non nesting manner could

not b e mo deled This tree structure was then used as a framework in whicha number

of states are placed in particular states which emitted two base paired letters states

which emitted single stranded RNA letters states which emitted additional letters

relativeto the consensus either base paired or not and states which did not emit

letters to mo del absence of a conserved p osition Having built up this mo del they

then detailed algorithms which given a mo del and a sequence could nd the most

likely path through the mo del being the equivalent to the Viterbi algorithm and

the equivalent the BaumWelch exp ectation maximisation technique for estimating

mo del paramters

In the pap er their describ e b oth the metho ds and the application to a number of

problems tRNA detection and the alignment of the structural RNA U sequences

They show that there are considerable b enets in providing a probabilistic mo del of

RNA sequences b oth in the detection of atypical examples of RNA molecules in

which the stem lo op pairing was well conserved although the primary sequence was

not and the deduction of the stem lo op patterns from the RNA sequences directly

In the pap er they also discuss the fact that this covariance mo del represents a typ e of

sto chastic context free grammar PFSMs can b e b e considered a typ e of sto chastic

regular grammar which places this work in the context of language theoretical

mo dels

The other group Sakakibara and colleagues published a very similar pap er in

which the corresp ondance to Sto chastic Context Free Grammars was made more

clear Again they used tRNA sequences to illustrate the p ower of their metho d

showing that with remarkably few sequences to train the mo del they are able to

easily distinguish tRNA sequences from background

Interestingly Eleanor Rivas and Sean Eddy extended this work recently to en

compass certain typ es of pseudoknots In their pap er they use a mo died typ e

of Feynman diagram to enumerate the dierent p ossible combinations during the

dynamic programming recursions Although the algorithm is parameterised on the

basis of exp erimental energies to nd the minimum energy fold they indicate that

a probabilistic parameterisation is p ossible One intriguing feature of this work is

that is was commonly thought that RNA pseudoknot prediction required a higher

grammar than a context free grammar and hence NP complete non deterministic

p olynomial eg only solvable by heuristic or brute force approaches The solution

provided in this pap er indicates that one can solve certain typ es of higher grammars

in p olynomial time

Genome Mapping

An interesting use of a hidden Markov mo del in bioinformatics was in Genome

Mapping Donna Slonim and colleagues develop ed a hidden Markov mo del which

represented the p osition of markers on a genome sequence The attraction of

using a hidden Markov mo del in this case was that there was a natural representation

of the uncertainty of the observations mo deling the p ossibility for errors to o ccur in

the lab oratory work Using the HMM they could derive a likeliho o d for a particular

ordering of markers along the genome consistentwithknown radiation hybrid data

They provided an ecient search routine to try to nd the maximum likeliho o d

set of markers which involves a generation of a sparse map of only a few reliable

markers followed by greedy incorp oration of additional markers

Gene Prediction Metho ds

Representing gene prediction metho ds as PFSMs started in bacterial genomes with

a number of pap ers describing hidden Markov mo dels to nd op en reading frames

in bacteria Anders Krogh and colleagues designed a PFSM similar though

far more complex to the one shown in gure which nds protein op en reading

frames in bacterial DNA

PFSMs for eukaryotic gene nding were a clear application David Kulp and

colleagues develop ed a generalised hidden Markov mo del called Genie The

generalised asp ect of the HMM is that they allow transitions between states to be

anyvariable length this additional freedom allows Genie to accurately mo del exon

length which has a distribution of lengths that is not well tted by a geometric

decay In addition they can use anytyp e of sensor comp onent which will generate

a probability of seeing some observed sequence The attraction is that they can

then use sensors such as neural networks for splice sites One of the problems they

encountered was in how to assess the alternative mo del for this sensor to provide

a likeliho o d ratio They note that nding the alternative mo del for discriminative

sensors such as neural networks is dicult and suggest a way of estimating it by

deducing the implicit alternative mo del from the training criteria of the network

Interestingly Anders Krogh has develop ed a framework in which the training of

neural networks emb edded inside HMMs can be achieved using so called hidden

neural networks HNNs In this case the training of the HNNs provides the

waytointegrate the score coming from the neural network with the generative HMM

typ e mo del

Chris Burge develop ed a fully featured integrated HMM for gene nding called

Genscan Genscan provides a complete mo del of genomic DNA including for

ward and backward strands and multiple genes The HMM also contains the ability

to mo del explicit length distributions which he describ es as a semiMarkov mo del

but do es not include the more general sensor formalism used in Genie Genscan

p erformed signicantly b etter than the next b est program when released sensitivity

of and sp ecicityof vs the next b est b eing and on his test set

and remains one of the b est gene prediction programs currently

Anders Krogh develop ed a HMM gene prediction mo del which has a number

of novel features This intro duced a concept of optmising the labeling of a

sequence rather than the total likeliho o d in using a PFSM which he calls a Class

HMM CHMM The gene mo del has a number of states to mo del exon regions

when summed over all paths the exp ected distribution of the total time sp end in

the states nicely hump ed as exp ected In contrast a Viterbi path provides only

an exp onential decay To deduce a gene prediction from the gene mo del he did

not use a Viterbi deco ding but a metho d which attempts to nd the most probable

lab els which are still consistent with the mo del called b est deco ding As the

parameters to provide this lab eling are not the same as the maximum likeliho o d he

provides a dierent training scheme for the parameters called conditional maximum

likeliho o d He shows that by using conditional maximum likeliho o d training and

the b est deco ding metho d he can improve in particular the accuracy of the gene

prediction mo del without changing the sensitivity accuracy at the base level go es

from to with coverage remaining around on his test set

Other techniques

There are a large number of other techniques which use PFSMs in bioinformatics

Building and interpreting phylogenetic trees have used probabilistic metho ds for a

long time as there are not manyotherways of attacking the problem Je Thorne

Nick Goldman David Jones and colleagues have published a number of pap ers on

combining PFSMs with phylogenetic trees These pap ers show that

there is an increase in the likeliho o d of the tree when a PFSM of the sequence is

included in the mo del Many of the pro cesses in large scale sequencing can b enet

from HMM techniques such as the deco ding of trace data as discrete bases

The following chapters describ e my researchinto PFSMs in bioinformatics

Chapter

Dynamite

Intro duction

Dynamic programming is a ubiquitous algorithm in computer science which has

b een applied to many dierentelds In its broadest sense dynamic programming

is the recursive decomp osition of an optimisation problem into smaller sub problems

until known initial conditions are reached To compute the solution of a problem the

known conditions are used to initialise the rst sub problem and then the recursions

which provide the solution of the next larger sub problem from the solution of the

previous smaller sub problem allow the propagation of this information until the

complete solution is known

Dynamic programming as a technique allows the solution of such problems as

nding the shortest path between two no des in a weighted graph and other

variations of this algorithm This in turn provides a basis for the solution of a

number of well known problems such as nding the minimal string edit cost the

parsing of regular grammars or alternatively stated nding the path through a

Finite State Machines consistent with a given emitted sequence These Finite State

Machines include probabilistic Finite State Machines PFSMs also known as hidden

Markov mo dels where the two main algorithms that o ccur when using HMMs the

Viterbi algorithm and the forwardbackward algorithm are b oth typ es dynamic

programming

Dynamic programming metho ds are used in a wide variety of applied elds from

passive sonar detection through oil exploration techniques sp eech recognition and

biological sequence analysis

The use of dynamic programming in bioinformatics

The application of dynamic programming to biological sequence was an early de

velopment in the eld and continues to be a key technique in the arsenal of

sequence analysis including common algorithms such as nding the optimial lo cal

alignmentoftwo sequences Even as more complex problems were tackled

such as representing family information and recognising structural folds new meth

ods were commonly simple variations of the basic dynamic programming algorithm

develop ed previously for pairwise comparisons of sequences

The intro duction of probabilistic nite state machines in bioinformatics was

inspired by the common use of dynamic programming algorithms to solve sequence

analysis problems A number of dierent metho ds which use PFSMs have been

already develop ed as describ ed in the previous chapter The next section will

outline how PFSMs use dynamic programming

PFSMs and dynamic programming

Dynamic programming techniques are used to solve two of the crucial problems

found in PFSMs rstly nding the optimal parse through the PFSM given the

data called the Viterbi algorithm and secondly nding the likeliho o d that observed

data was generated by the PFSM summing over all p ossible paths often called the

Forwards score Both these metho ds essentially rely on the same feature whichis

the Markov rules of the PFSM the probabilityofmoving to the next state is only

dep endent on the current state of the machine and no other prop erties of howthe

machine got to this state

Finding the maximum likeliho o d path

The Viterbi algorithm nds the most likely path through a probabilistic nite state

machine given observed data Assume that P is the total probabilityof the b est

ki

path which passes through state k and ends at data point i see gure for a

pictorial representation Consider the S p ossible states which could have been

immediate predecessors of the k state which we will call sources of this state

Each source s has to have an oset o in the data dimension o can be zero but

s s

not negative Each ks pair describ es a single transition in the PFSM which will Data: DF H

P(1,i−2)

P(2,i)

States P(3,i−1)

P(4,i−2)

Figure A diagram illustrating the basic Viterbi recursion The pro cess is calcu

lating the b est path that go es through state data p osition i Three p ossible states

shaded could be the sources of the best path passing through this p osition the

b est one is chosen by comparing the total probability of the b est path to each source

times the probabilityof the transition to state The actual best path to each of

these sources indicated by the dashed lines is not need to calculate the b est path

through state Note that transitions from state has an oset of two emitting

two symb ols whereas the transitions from the other two states have an oset of

emitting a single observation

have a probabilitydependentbothonthe nature of the transition and on the data

observations by the transition From the Markov rules the following recursion holds

P max P T d d

io i

ki sk

sio 

s

s

sources s

If o is zero weadopttheconvention that d d is the null string The

s io i

s

Markov rules are b eing used here to discard all the p ossible path information in how

the b est path reached state s at the data p osition i o and only b e interested in

s

its total probability P

sio 

s

Given this recursion rule and the starting and ending conditions the problem

of nding the b est path can b e solved using dynamic programming It starts at the

rst state with the rst data p oint and pro ceeds to apply equation iteratively to

for all states at each data p osition the pro cess is terminated when it has exhausted

the data and has reached the ending state At this p oint the probability of the most

likely path can b e simply read o

The actual state lab els which generated the b est path can b e found in a number

of ways once the probability of the best path is known One can keep a note

for each ki p osition ab out which source was used This is conceptually simple

but requires considerable additional b o okkeeping of the backp ointers for each ki

p osition A more ecient metho d is to work backwards up the path using the

recursion rules to reconstruct the preceeding ki p osition from the correct p osition

of the best path starting from the end point and working This only requires

storing the b est probability for each ki p osition However this storage requirement

is prohibitive for large problems in such cases a divideandconquors metho d which

has a longer running time but considerably less memory requirements can b e used

These metho ds are describ ed in more detail in a later section

The discussion so far has b een for a single sequence b eing compared to a PFSM

When an alignmentstyle PFSM is used two sequences are b eing aligned on the basis

of the PFSM The Viterbi algorithm is identical to that describ ed previously except

that there are two data dimensions i and j which form a matrix of data p oints

The states form an additional third dimension which is generally far smaller than

the sequences of observations

Finding the total probability of observations

Often one wants to know the total probability of seeing some observations regardless

of which state path was taken for a particular PFSM The calculation to nd the

total probabilityover all paths often called the forward score is closely related

to the Viterbi algorithm All it requires in eect is to replace the Maximum in

equation with a sum Consider AP the total probability over all paths to a

ki

particular state k and data observation i Again a recursion rule can b e set up that

provides AP in terms of its predecessors

ki

X

AP AP T di o d

s i

ki sk

sio 

s

sources s

Again this rule can b e applied from the starting conditions to the ending condi

tions using dynamic programming The nal result however is the total probability

over all paths of observing the data given the mo del rather than the probabilityof

the most likely way of pro ducing it

Dynamite

I realised early on in my research that I was likely to be developing a number of

dynamic programming metho ds in this eld The implementation of some of these

metho ds are time consuming to program and debug Iwanted some typ e of to olkit

to help develop these metho ds allowing me to concentrate on the actual biological

problems and not the implementation

Usually programming to olkits are based around libraries of resuable co de How

ever in this case I knew that that a library based solution would not provide ecient

execution of the metho d as the part of the metho d that I wanted to b e exible the

denition of the dynamic programming recursion was the part in which the vast

ma jority of the CPU time is sp ent If the compiler is unable to optimise this part

of the metho d the execution time can suer by an order of magnitude or more

For these reasons I cho ose to build a compiler whichwould work o a high level

description of the dynamic programming problem The compiler would pro duce

C co de for the sp ecic dynamic programming problem This C co de would need

to be linked to a library of supp orting routines and a piece of driving co de where

the programmer actually called the generated dynamic programming routine This

separation of compiler libraries and driving co de is common in other computer

science applications suchasthe yacc compiler system for programming languages

The entire ensemble of language compiler and libraries I called Dynamite

By using a compiled language the generated co de can take advantage of sp e

cic features of the hardware it is using for example for symmetric multipro cessor

machines which use a threading mo del multithreaded co de can be generated or

for distributed memory multipro cessor machines message passing co de can b e gen

erated The separation of the sp ecication of the program from the generation of

ecient co de allows p eople who are only interested in the design of dynamic pro

gramming algorithms to take advantage of large compute hardware without changing

any co de From the other p ersp ective of the computer hardware designer p eople

who are only interested in the ecient execution of the algorithm on sp ecialised

hardware can work on a more general form of the problem b eing condent that

their work can b e applied to many dierent metho ds with a minimum of eort

The Dynamite language

In lo oking at how Dynamite is actually structured it is probably worth reminding

the reader of the basics of dynamic programming as it is actually implemented in a

programming language for this class of problems Figure One has two ob jects

often sequences called the query and the target in Dynamite whichprovide the

twoaxesof a matrix of numb ers This matrix is usually divided up into cells each

cell containing a collection of numb ers that represent the b est score or the summed

score of the problem to the ith p osition in an ob ject compared to the j th p osition

of the other ob ject ending in that state

To calculate a cell one needs to use the

cells from previous ij th p ositions only some cells might b e required

data elements in the query and target

additional resources such as a protein comparison scoring matrix

In cases where wehave to calculate the path the most ecient approachistodo

a traceback through the matrix of numb ers deducing how the b est score was made

When memory limitations prevent allo cating and using the entire matrix in memory

a linear space divide and conquor version of the algorithm must b e employed

formal denition of a dynamic programming recursion

Dynamite is a language like yacc or lex the UNIX to ols to automate compiler

generation and regular expression parsing which generates C co de from a denition

of the problem The Dynamite language must provide the following features

A denition of the dynamic programming recursions suitable for the state

machine one is considering

The abilityto supply the data structures to provide information to calculate

the recursion 540 430

512 567

Figure A diagram explaining dynamic programming as it is usually implemented

for sequence alignment A matrix of cells is laid out one axis for one sequence

the other axis for another sequence The cell represents the b est score in each of

the states for a particular prex of sequence A compared to a particular prex of

sequence B for cases where there is more than one state each cell stores more than

one number Cells are calculated via recursion rules which indicate how the score

for the extension of the alignment to the next cell is mo died usually on the basis

of prop erties of the sequence at that p oint

The interface b etween the denition of the state machine and the data struc

tures to indicate where in the state machine which calculations o ccur

An example dynamite denition in fact for the SmithWaterman algorithm is

written out b elow

include dynah

matrix ProteinSW

query typePROTEIN namequery

target typePROTEIN nametarget

resource typeCOMPMAT namecomp

resource typeint namegap

resource typeint nameext

state MATCH offi offj

calcAAMATCHcompAMINOACIDqueryiAMINOACIDtargetj

source MATCH

calc

endsource

source INSERT

calc

endsource

source DELETE

calc

endsource

source START

calc

endsource

querylabel SEQUENCE

targetlabel SEQUENCE

endstate

state INSERT offi offj

source MATCH

calcgap

endsource

source INSERT

calcext

endsource

querylabel INSERT

targetlabel SEQUENCE

endstate

state DELETE offi offj

source MATCH

calcgap

endsource

source DELETE

calcext

endsource

querylabel SEQUENCE

targetlabel INSERT

endstate

state START special start

querylabel START

targetlabel START

endstate

state END special end

source MATCH

calc

endsource

querylabel END

targetlabel END

endstate

endmatrix

The matrix to endmatrix lines provides the Dynamite denition The rst

lines indicate the data structures whichare used in calculating the recursion The

rst two query and target are the two ob jects which dene the two axes of the

matrix The other resource lines indicate additional data structures needed to do

the calculation in this case a protein comparison matrix and the gap p enalties

The rest of the text is used to dene the nite state machine and its interface

to the data structures Each state is dened by the state to endstate lines and

contains within it the denition of the transitions which end at this state in each

source to endsource blo ck Each source blo ck describ es a single transition from a

particular state to the state it is within

The interface to the data structures is provided by the calc lines which can b e

found in one of two places in a source blo ck which provides the denition of the

calculation for this transition and optionally in the state blo ck which indicates a

calculation that is the same for all transitions which end on this state and is added

to all transitions regardless of where they originated

The calc lines are built from a subset of C which allows the usual C arithmetic

memory dereferencing array deferencing as a pointer reference

structure deferencing the and op erators and function calls This is imple

mented using a small yacc grammar which is fully parsed by the dynamite compiler

Implementations provided by Dynamite

Dynamite pro duces a number of implementations which provide dierent calcula

tions based on the same Finite State Machine denition The rst ones are those

which provide the Viterbi deco ding of the alignment pro cess in either quadratic

section or linear section memory There is also a convenience func

tion which decides whether to call the large or small memory mo del for a particular

query and target ob ject returning back an alignment which shields the user from

having to switchbetween the two implementations

Dynamite also pro duces some functions to provide the score of two ob jects either

by the Viterbi path section or the Forwards algorithm which calculates the

sum over all paths section These implementations work faster and in linear

memory as they do not need to calculate the alignment

A common use for these typ es of comparisons is a database search When dy

namite is made aware of howto lo op over a database of ob jects see section

a number of dierent database searching mo des can be used The serial search

section provides a standard database search As the database search can b e

done in any order the problemisvery amenable to coarse grain parallelization in

which each pair of comparisons required for the database search is packaged o to

a separate execution stream Dynamite can generate a multithreaded implementa

tion suitable for symmetric multi pro cessor machines section for database

searching

Finally as Dynamite is a true compiler not merely a large macro system in the

sense that it completely parses the denition of the state machine into a internal

representation that fully describ es the machine very complicated target machine

architectures can be considered One example of this is the generation of a sys

tem which is compatible with sp ecialised hardware that runs on Digital Signalling

Pro cessing chips section

Viterbi quadratic memory alignment

The easiest implementation of dynamic programming is to calculate an explicit

matrix of numb ers recursively as in gure The nal score can b e read o from

the appropriate END state The alignment which generated the high score can then

b e deduced by recursively inferring which previous state the high scoring path must

have originated from this is commonly called the trace back routine

Viterbi score only linear memory

When the alignment is not required the score of the Viterbi path can be easily

calculated in linear space this is a natural consequence of the fact that only a small

p ortion of the matrix is required to calculate the next p ortion of the matrix The

only issue with this implementation is that as this will be used in the database

searching metho d care should be taken to minimise its execution time For the

Dynamite compiler this means taking care of two issues

The C generated by Dynamite must be suitable for optimisation by the C

compiler Generally this means providing a large inner lo op of execution state

ments some of which can b e done concurrently

The memory layout should be such that concurrent retrievals of information

are close in memory Dynamite ensures that the matrix memory is laid out

optimally

Forwards score only linear memory

The forwards algorithm which sums the probabilityover all paths can also b e run

as an implementation which provides a score Its implementation is almost identical

to the Viterbi metho d but instead of a max at each p osition a sum in probability

space is provided The only minor twist to this is that the matrix numb ers are in

a log space representation the sum function therefore has to move the numb ers to

probability space b efore adding them This can b e done eciently by precomputing

a table of mappings in log space for addition of dierent probabilityvalues

Recursive linear memory alignment

Storing the explicit matrix of numb ers rapidly b ecomes a resource problem since

the memory requirmentgrows as the pro duct of the length of the sequences b eing

compared As DNA sequences can easily b e over p ositions long the explicit

form can quite easily b e imp ossible to calculate on standard workstations

A solution for this problem has been known for some time which involves a

recursive algorithm that calculates the place where highest scoring path crosses the

half way pointinthe length of one of the sequences Figure This can be

calculated in linear memory prop ortional to one of the sequences Once the half

p oint is known the problem can b e split into two problems on each side of the half

p oint These sub problems themselves are also alignment problems which can be

solved by the application of the same metho d ie nding the half point so as to

partition each problem into two smaller sub problems The recursion is terminated

when the remaining matrix is small enough to t into memory explicitly

This implementation is notoriously hard to program The rst problem is that

the recursive alignment routine can only b e called once the start and end p oints of

the alignmentareknown Secondly when the alignment is lo cal or even worse the

alignment has complex b oundary conditions such as a lo op very annoying book

keeping has to employed to ensure that the correct alignment startend points are

used Many Dynamite users were attracted to the to olkit due to the implemen

tation of the linear space alignment which they did not feel condent to program

themselves for non trivial dynamic programming situations

Dynamite solves the linear space alignment in a reasonably standard way The

rst stage is to break a more complex alignment problem whichpotentially involves

lo ops into a series of simple alignment problems which has no lo op see the section

on sp ecial states to see how these come ab out Then these alignments are

solved using a recursive function Both the splitting of the complex alignment to

simple non lo oping alignments and the recursive routine use a concept called the

Figure The rst rounds of the divide and conquer metho d illustrated picto

rially The green line represents the b est alignment and the blackboxes the matrix

calculated at each iteration The red line is the midp ointat which the p osition of

the best path is stored for each state in each cell and then propagated on to the

end of the calculation Each iteration pro duces two sub problems which can then

b e solved using the same metho d until the sub problems are small enough to solve

using conventional quadratic memory

shadow matrix Ashadow matrix is information which is attached to the score of a

particular matrix numb er and is propagated on with the score By using a shadow

matrix the start p oint of the match can b e propagated to the end of the alignmentin

the rst stage of the alignment pro cess In the full recursive alignment the shadow

matrix is used to store the p osition of each path at the halfway point At the end

of the alignment pro cess the p osition of the half waypoint can b e read o from the

end p oints shadowmatrix

This use of a shadow matrix is more exp ensive than the b etter metho d of running

the dynamic programming recursions in two directions one forward to the halfway

line and one backwards from the end p oint to the halfwayline The shadow matrix

metho d has to make the same number of calculations as this forwardbackward

sweeps but incurs the additional exp ense of propagating information in the shadow

matrix The benet is that is a far easier implementation to co de as it do es not

involveinverting the dynamic programming recursions

Serial database search

The ability to generate a database search of one database of ob jects vs another

database of ob jects is a common use of the dynamic programming algorithms In

these cases generally only the score of the match is required and so one can use the

Viterbi or Forwards score metho ds Although the programming of

serially lo oping through a database of ob jects applying one of the scoring functions

seems simple enough robust routines require a little more thought

The database searching routines must not leak memory as they will b e called

multiple times during the search

Ecient database searching requires that a minimum of ob jects remain in

memory during the database search and that ob jects can be retrieved later

on for pro cessing of the alignments

The database searching routines should b e able to either provide a Viterbi or

aforwards score

Error rep orting of problems during the database search should provide an

indication of where in the database search they o ccured and where p ossible

still provide partial results

Coupled with considerations of minimising the execution time of the database

search writing a go o d serial database searching routine is not that trivial Dynamite

provides an immediate robust and fully featured database searching routine for this

reason

pthreads Database search

Database searches are often the time limiting factor for research taking sometimes

around a week to run for some complex mo dels The advent of multipro cessing

boxes provides a way to reduce the actual amount of time taken for a database

search to complete by using multiple pro cessors on one particular search One com

mon mo del for programming multiple pro cessor b oxes has b een pthreads a POSIX

standard in which each execution stream is represented as a dierent thread but

each thread has access to the same memory

The database search is an embarassingly paral lel problem meaning that the

problem can b oth b e easily broken into sub problems and that these sub problems

require minimal communication b etween them This makes the conceptual writing

of a pthreads implementation simple each thread do es a comparison of one ob ject to

another ob ject separately and then places the score in a common results structure

Even though the conceptual programming is quite simple there are many bar

riers to writing pthreaded co de Firstly the pthread syntax whichinvolves calling

sp ecialised library functions to create threads and other structures which allowco

ordination b etween threads has to be understo o d Secondly each thread must not

share memory with other threads which would cause program failure or worse in

correct results Finally the threads must b e co ordinated to allow sequential access

to common resources such as the database streams

Dynamite provides an ecient well designed pthreads implementation

OneMo del BioXLG port

Database searching by dynamic programming is so compute intensive that a number

of companies have designed sp ecialised hardware to accelerate it This hardware is

generally hard co ded with a small numb er of algorithms the lack of exibilityinthe

algorithm means that every development of a new algorithm must b e implemented

by a small team of p eople who understand the hardware

The creation of Dynamite as a programming language that represents dynamic

programming in theory allows the p ossibility of Dynamite generating co de suitable

for the sp ecialised hardware One sp ecialised hardware company Compugen has

develop ed hardware exible enough to run many dierent versions of the Viterbi

score database search metho d Along with this hardware they provide an API to

program the machine for a new algorithm called OneModel This system is in

some ways similar to Dynamite but imp ortantly reads information from simple

arrays of integers and not directly from C structures as the data must b e moved to

the sp ecialised hardware and no general C compiler is available for the hardware

The fact that Dynamite represents the entire denition including the interface

to the user dened calculation metho ds the calc lines in the Dynamite denition

allows Dynamite to generate co de which can use the OneMo del API To do this

Dynamite has to generate Ccode which queries the user dened C co de to extract

the information required for the OneMo del API For example the calc line written

below might b e found in a protein prole HMM matching a protein sequence

source MATCH

calcpositionspecificgapqueryi

insertemissionqueryiAMINOACIDtargetj

endsource

would generate the following co de written as pseudo co de

foreach i position in the query

offset

modelioffset positionspecificgapqueryi

offset

foreach A over amino acids

modelioffsetA insertemissionqueryiA

offset offset

the next calc line stored

Notice that the calc line has been split logically into two separate expressions

The compiler knows that the position specific gap expression is invariant with

resp ect to changes in the target information However the insert emission

query i AMINOACIDtargetj expression uses information from the target in

formation The compiler then has to lo op over every p ossible input of that informa

tion In this case the compiler parses AMINOACIDtargetj as having as its

p ossible values and generates a lo op to iterate over those values storing the infor

mation in the mo del line The generation of this co de requires that the Dynamite

compiler can manipulate these calc lines in a detailed manner

The compiler then has to generate the necessary denition for the sp ecialised

hardware to indicate that for this particular transition in the state machine it needs

to use the numb ers in modelquerypositionaminoacid in target

Software engineering details of Dynamite

Dynamite is a to olkit whichprovides considerable help to the algorithm designer for

implementing new algorithms The aim is not only that this to olkit is a prototyping

environment but also that it provides functions for the nal implementation For

this to b e achievable the co de generated and used by Dynamite must b e sensible

The Dynamite generated co de passes strict ANSI requirements making it p ortable

to any architecture with an ANSI C compiler It has b een tested on a wide variety

of UNIX machines and Windows NT To allow the Dynamite generated co de to b e

used with other libraries a unique prex can be app ended to each external name

preventing namespace clashes when it is linked with other libraries Care has b een

taken to allow new implementations of database searching functions to b e seamlessly

integrated into existing programs which use dynamite with only the bare minimum

of changes

The Dynamite runtime environment is also built with the same considerations

of co de correctness and external namespace protection Parts of the Dynamite run

time environment are hardco ded into the Dynamite compiler but other parts are

made more exible to allow dierent libraries to provide for example database

streaming functionality This allows other userdened typ es to take advantage of

the database search functions generated by the Dynamite compiler without having

to understand the innards of the compiler

Innovations in Dynamite

Dynamite has a number of innovations which extend the paradigm of sequence

alignment algorithms represented as nite state machines

Sp ecial states

Sp ecial states are b est considered in the framework of rep etitive hidden Markov

mo dels A quickintro duction to using Dynamite for rep etitiveHMMsisgiven rst

and then an explanation of sp ecial states in this mo del

When Dynamite is used for a hidden Markov mo del it is b est suited to those hid

den Markov mo dels in which there is a rep etitive series of states which are connected

in a standard pattern Such hidden Markov mo dels include the prole HMM archi

tecture suggested by Anders Krogh and colleagues In sp eech recognition these

sorts of hidden Markov mo dels are called timedep endent hidden Markov mo dels

and in general when people consider using large hidden Markov mo dels it is very

common to constrain the architecture into a lefttoright rep etitive architecture

Although the main state denition of Dynamite can easily satisfy these rep etitive

HMMs the b oundary conditions can b e quite intricate The b oundary conditions are

sometimes particular parameterisations of the start conditions and end conditions

with resp ect to the p osition in the rep etitive nature of the hidden Markov mo del

As b oundary conditions b ecome more complex generally p eople want additional

b oundary states to mo del asp ects of the problem near the b oundary conditions

These additional states b ecome yet more involved when lo oping architectures are

employed In such architectures the rep etitive blo ck of states can be reentered

multiple times The b oundary conditions now apply to b oth the entering and leaving

the rep etitive blo ck at the start and end of the observations and between each

o ccurrence of the blo ck

Dynamite handles all b oundary conditions from simple to complex using sp ecial

states Sp ecial states are states that are only present once in the HMM they are

not rep eated This means that there is a break in symmetry between the query

and target dimensions in Dynamite the sp ecial states are only applicable to the

query dimension Many sp ecial states can o ccur and they can b e connected to b oth

standard states and other sp ecial states see section to see hownull cycles are

avoided By using sp ecial states many dierent complex b oundary conditions can

b e created Examples are given in the dynamite denitions of a variety of dynamic

programming algorithms in App endix B

Lab els

When designing dynamic programming co de for biological applications one might

be changing the dynamic programming recursions whilst still attempting to solve

the same biological problem One annoyance of this is that the data structure

which represents the result of the dynamic programming parse will be dierent

for dierent precise recursion denitions to generate sensible biological results one

needs to endlessly change the co de to that lo oks are the data structures emerging

from the dynamic programming co de

Dynamite provides a way of emb edding the biological interpretation of the dy

namic programming result into the dynamite denition le This is done by dening

two text labels for each transition source blo ck in the dynamite le one lab el for

the query dimension one lab el for the target dimension For convenience you can

also dene a state default for the lab els Dynamite then generates the co de to con

vert the initial description of an alignment path as a state path through the machine

to a set of aligned regions for each dimension with the appropriate lab els

As the programmer is free to reuse the same lab el in many p ositions radically

dierent nite state machines can generate the same set of lab els By having the

downstream co de only read the lab el based alignment one can change actual dy

namic programming co de at will and use just one set of p ostpro cessing co de for

the downstream analysis of the alignment This feature is heavily used in the Ge

neWise algorithm where dierent architectures were tried whilst designing the

algorithm and dierentarchitectures are actually distributed in the Wise pack

age see Chapter The fact that very dierent dynamic programming recursions

are calculating the alignment is completely hidden from the bulk of the GeneWise

co de

Interestingly the concept of lab els b eing attached to the dynamic programming

recursion is more than a programming to ol A recent set of algorithms to cop e

with the mismatchbetween the biological interpretation of a mo del and the mo del

architecture have b een develop ed by Anders Krogh These algorithms use a

concept also called lab elling to indicate the biological interpretation of a state and

using this concept provide a numb er of algorithms to replace the standard training

and path nding techniques with ones that are optimised for making the correct lab el

assignment not the correct state assignment It is interesting to investigate whether

the lab els as dened in Dynamite will have a broader use than just programmatic

convience but also b e involved in the algorithmical asp ects of Dynamite in the future

Compile time error detection by the Dynamite compiler

As the Dynamite language is parsed and loaded into an internal data structure inside

the Dynamite compiler a numb er of errors can b e found at this abstract level These

include of course syntactic errors when the Dynamite denition is incorrect but

can also include semantic checks after the Dynamite source co de has been parsed

These semantic checks ensure the integrity of the language and so catch some errors

at compile time errors which are imp ossible to catch with a lower level language

The semantic checks include

dangling transitions Ensures that each transition starts and ends on a valid state

startEnd Ensures that there is a single start state and a single end state and

that there is at least one path from start to end

null cycles Ensures that there are no null cycles by making sure that each transi

tion advances in at least one data dimension

inappropriate indices The programmer is free to use the i variable as the index

in the query dimension and the j variable as the index in the target dimension

The dynamite compiler detects when i is b eing used to index into the target

data structure or the j into the query data structure This nearly always

indicates an error but it could conceivably have some use so only a warning

is issued

typ e checking of functionmacro calls Each function call in the Dynamite com

piler is typ e checked see b elow

The typ e checking in the dynamite compiler is particularly exible and can

apply to both function calls and macros as used by the C programming language

This makes the typ e checking useful as for p erformance reasons many of the ways

of accessing the data will be via macros which return integers and so mistyping

errors cannot b e detected by the C compiler

An example of this typ e checking is given below Imagine we have declared

HMM MATCHhmmicodon as taking a HMM as the rst argument and a CODON

SEQseqj as returning the co don at p osition co don as the third argument CODON

SEQseqj as returning the base at p osition j The following calc line j and BASE

would b e a valid calculation and b e parsed without error

CODONHMMMATCHhmmiCODONSEQseqi

However the next line would cause a mistyp e error

CODONHMMMATCHhmmiBASESEQseqi

This line generates the following error from the Dynamite compiler

In parsing calc line for state MATCH source MATCH

Mistype in argument of CODONHMMMATCH

wanted CODON got BASE

Warning Error

This is a case where in the C co de the macros which accessed the base or co don

information would return identical results therefore this error can only be picked

up by the Dynamite compiler

Even for a user suchasmyself who has a clear knowledge of Dynamite and how

to use it these compile time checks catchmany problems early on in the development

of an algorithm A particular b enet are cases where the denition would pro duce

an incorrect answer not merely a runtime error These sorts of problems are hard

to trackdown in the logic of the C co de

An optimiser for dynamic programming

As Dynamite is a complete compiler holding a representation of the dynamic pro

gramming co de in an internal data structure there is the p ossibility of rearranging

the data structure to generate co de which executes faster One should b e wary of

such optimisers Dynamite is not generating assembly co de for a sp ecic class of

machines but rather C co de for a generic class of machines By and large it is b et

ter to let the C compilers do the optimisation as they are b oth written by exp erts

in optimisation and also have more knowledge ab out the target machine they are

writing for However Dynamite has more knowledge than the C compiler in how

the co de is actually used In particular it knows that certain data structures are

readonly from the p oint of view of the dynamic programming

I have only implemented one optimisation which exploits the read only nature

of parts of the dynamic programming co de I develop ed a standard common sub

expression analyser which could nd identical calc lines which are executed more

than once in the inner lo op These sub expressions were then analysed as to which

lo op they remain invariant under and the actual expression moved outside the

lo ops they remain invariant under this is called a lo op hoist in the compiler

optimisation eld and is a very basic optimisation In some cases the expressions

which I promoted included complex data structure dereferencing the standard C

compiler could not validly p erform the same optimisation as it do es not know that

that the data structure is read only for the duration of this function

This optimisation showed dramatic results when the C compiler optimisation was

not used sometimes showing a two fold sp eed up However when the C compiler

optimisation was used the eect dropp ed to or less sp eed up Considering the

sophistication of the C compilers I feel that sp eed up is still signicant

Currently there are more exp erienced researchers in compiler optimisation lo ok

ing at the p ossibilities of providing b etter optimisations in the Dynamite compiler

Example Dynamite programs

The next three chapters discuss in detail a number of Dynamite generated algo

rithms I have used Dynamite for many more algorithms than those detailed in

this thesis Here are three other algorithms which were written in Dynamite that

illustrate its exibility

Dna Blo ck Aligner DBA

When aligning a eukaryotic promoter region with the corresp onding region from

a related sp ecies it is clear that standard SmithWaterman alignments of the two

sequences do not reect the known biology of promoters Promoters are known

to contain a number of small motifs which bind DNA binding proteins some as

basal transcription machinery some as transcription factors These motifs can be

found at vastly dierent distances and still function identically When aligning two

homologous promoter or regulatory DNA sequences for example from mouse and

human we should exp ect large intervening sequences to o ccur b etween regions that

are conserved

There were no available programs whichwould easily provide this sort of align

ment information Rather than p ost pro cessing existing algorithms such as BLAST

or SIM we investigated designing our own alignment systems to represent evolu

tion in regulatory regions The architecture we settled on is shown in gure We

presumed that the two sequences would share motifs that could b e interrupted by

small gaps but that between these motifs very large gaps indistinguishable from

background were likely to b e present Because wewere interested in measuring the

level of conservation between dierent sites we split the blo cks into four dierent

typ es of blo cks which matched diering levels of conservation Thus the alignment

pro cess would align the two sequences and also classify the conserved blo cks simul

taneously

The parameterisation of the FSM was done by setting parameters by hand for

the conservation levelsandthetwo gap probabilities in blo cks and b etween blo cks

The parameters chosen gave go o d results on the test cases we used The mo del was

parameterised against a null mo del whichwas as if there were no conserved blo cks in

the two sequences This null mo del could b e applied directly to the scoring scheme

as there are no hidden variables to calculate meaning that we could provide a single

program that would provide the logo dds ratio of the alignment

An example output is shown in gure

Deriving an alignment from a D sup erp osition of protein

structures

When the D structures of two protein sequences are determined by Xray or NMR

techniques it is usually easy to decide visually whether they are related or not

unlikeonly knowing their linear amino acid sequence where it is extremely hard

even with computational help To help understand howtwo structures are related

a common technique is D sup erp osition such as rigid b o dy rotation and translation

to minimise the ro ot mean squared distance of the alpha carb on atoms b etween the A

B

C

D

Figure A gure illustrating the Dna Blo ck Aligner PFSM The conserved blo cks

are on the left hand side of the page each blo ck is parameterised with a dierent

level of exp ected DNA conservation shown as solid transitions Small gaps are

allowed showed as dashed transitions one for each sequence Separating the blo cks

are two states which represented unaligned DNA sequence Each state emits only

for one sequence It is p ossible to pass through the entire PFSM without entering

a conserved blo ck score = 72.52 EM:MMENDOBA 110 ACTCCCTCCACTCTTTCCACCATTCACCACATCCCCCCACACACACTCT A ACTCC T CTCTTTCC CATTC A CCCC AC C ACTC EM:HSKER101 57 ACTCCTTTGCCTCTTTCCG−CATTCCATAACCACCCCAACCCCTACTCC

EM:MMENDOBA 158 ATGGGGACGGAGTTAGGAATAC−CTGGACTCTCACCC A A GGGA GG GTT GG ATAC CTGGA T CA CC EM:HSKER101 106 AC−GGGAGGGGGTTGGGCATACCCTGGATTTCCATCC

EM:MMENDOBA 359 TTAGAGGGTTAAGCGGATGTGGCTAAGGGTGAGTCATCTAGGAGTAAAC C TT GAGGGTTAAGCGGATGTGGCTAAGG TGAGTCATCTAGGAGTAAAC EM:HSKER101 322 TTGGAGGGTTAAGCGGATGTGGCTAAGGCTGAGTCATCTAGGAGTAAAC

EM:MMENDOBA 408 AGGAGCCTTACCTGTAGGAGGGGCCA C A GAG C T CCT T GGAGG GCCA EM:HSKER101 371 AAGAG−CCTTCCTTTGGGAGGAGCCA

EM:MMENDOBA 468 GGGGGGGGGGGGGGGGGGGTTAGCAGGTGCACCTGGAAGAAGATGCCAG A GGG G GGGGG G GT A CAGGTGCAC GG A AA ATGCCAG EM:HSKER101 402 GGGTGTAGGGGGCCCAGAGTGACCAGGTGCACTAGG−AAAAAATGCCAG

EM:MMENDOBA 516 GAGAGGATCAGAAGGAAATCTTGTTGGAAGCTGCTCTTTTGTAAGCA A GAGAGG CAG A GA CTTGTT G AGC CTC TT T GCA EM:HSKER101 451 GAGAGGGCCAG−A−GAGGACTTGTTAGTAGCGACTCACTTCTGGGCA

EM:MMENDOBA 594 GGAATCCAGGAAGGGAG−GGA D GGAATCCAGGAA GGAG GGA EM:HSKER101 561 GGAATCCAGGAAAGGAGGGGA

EM:MMENDOBA 647 GCCTCTGACTGTTCCTGGGACTGGGATGGATTCACTGGAAAACACAAGA B GCCTCTG CT TTCCTGGGAC GGA G TTCACT G A ACA AA A EM:HSKER101 623 GCCTCTGGCTATTCCTGGGACCAGGAAGTTTTCACT−GGA−ACATAACA

EM:MMENDOBA 694 CGT−TTTTCTCATTCCCCCCAC B C T TTT C CA TC CCCCAC

EM:HSKER101 672 CTTTTTTACACA−TC−CCCCAC

Figure An example output of the DBA algorithm aligning an intron from

the keratin gene from mouse and human resp ectively This intron is known to be

conserved b etween the twospecies Each alignment shows a particular pass through

one of the conserved blo cks the intervening unaligned sequence is not shown

two structures

Mapping the D sup erp osition back to the one dimensional sequence alignmentis

not as trivial as it mightseem The sup erp osition often do es not assign every alpha

carb on of one sequence unambiguously to one from the other sequence in particular

in lo op regions where gaps are common Although visually one can usually derive

a sensible lo oking alignment from the D sup erp osition on a large scale one needs

a program

The problem can b e solved by a rather simple Dynamite comparison whichcom

pares two sequences in a manner similar to SmithWaterman but the comparison

metho d is related to distance b etween the carb on alpha atoms in the sup erp osition

In this case it was crucial that Dynamite could handle some arbitrary C typ e calcu

lation in the comparison metho d as this would b e read from a precalculated matrix

that was read in

Although this metho d was very ad ho c the key feature was that I was able

to develop the algorithm within minutes of having the problem describ ed to me

and once we had settled on the algorithm the co de required no debugging This

drastically dropp ed the development time from weeks to under a day This program

was used in the evaluation of the CASP results

Comparing two transmembrane proteins

Transmembrane proteins are known to have dierent residue conservations across the

protein corresp onding to the dierentenvironments of intracellular transmembrane

and extracellular In particular the transmembrane domain almost reverses the

usual amino acid conservation rules with hydrophobics often substituting many

times for other hydrophobics but small p olar amino acids b eing highly conserved

Matching two transmembrane proteins on the basis of standard globular matrices is

clearly not optimal

It is easy to construct a PFSM to represent the matching pro cess b etween two

dierent transmembrane segements as long as one accepts that the length distribu

tion of the dierent regions will be of an exp onential form This PFSM is shown

pictorially in gure Finding parameters for this PFSM is harder as one needs

to provide probabilities for dierentby matrices

To generate these probabilities I to ok a pragmatic approachtotake trusted trans Extracellular

Transmembrane

Intracellular

Figure A transmembrane matching PFSM The four states represent intracel

lular transmembrane extracellular and transmembrane environments The trans

membrane state is duplicated to enforce the top ology rules of transmembrane pro

teins Each state has three lo oping transitions the black representing matching of

two residues the green a gap on sequence and the red a gap on another sequence

The transition leading into the state is a matching residue Many asp ects of this

PFSM is unrealistic including the exp onential length distribution of the transmem

brane regions and the linear as opp osed to the more realistic ane gaps in the

sequence +++++++++++++++++++++++ ++++++ +++++ ++ CB21_MAIZE 1 MAASTMAISSTAMAGTPIKVGSF−GEGRIT−−MRKTV−−GK CB23_LYCES 1 MA−S−MAA−−TASSTTVVKATPFLGQTKNANPLRDVVAMGS ++ + +++ +++++++++++++++++++++++++++++++ +++++++++++++++++++++++++++++++++++++++++ CB21_MAIZE 37 PKVAASGSPWYGPDRVKYLGPFSGEPPSYLTGEFPGDYGWD CB23_LYCES 38 ARFTMSNDLWYGPDRVKYLGPFSAQTPSYLNGEFPGDYGWD +++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++======−−−−−− − CB21_MAIZE 78 TAGLSADPETFAKNRELEVIHSRWAMLGALGCVFPELLS−R CB23_LYCES 79 TAGLSADPEAFAKNRALEVIHGRWAMLGALGCIFPEVLEKW ++++++++++++++++++======−−−−−−−− −−−−−−−−−−−−−−−−−−−−−−−−−−−−−======CB21_MAIZE 118 NGVKFGEAVWFKAGSQIFSEGGLDYLGNPSLIHAQSILAIW CB23_LYCES 120 VKVDFKEPVWFKAGSQIFSDGGLDYLGNPNLVHAQSILAVL −−−−−−−−−−−−−−−−−−−−−−−−−−−−−======++++ +++++++++++++ +++++++ CB21_MAIZE 159 ACQVVLMGAVEGYRIAGGP−LGEVVDPLYPGGS−FDPLGLA CB23_LYCES 161 GFQVVLMGLVEGFRINGLPGVGEGND−LYPGGQYFDPLGLA ======+++++++++++ ++++++++++++++ +++++++++++++++======−−−−−−−− CB21_MAIZE 198 DDPEAFAELKVKELKNGRLAMFSMFGFFVQAIVTGKGPLEN CB23_LYCES 201 DDPTTFAELKVKEIKNGRLAMFSMFGFFVQAIVTGKGPLEN +++++++++++++++======−−−−−−−− −−−−−−−−−−−−−−−−−−−−−−−− CB21_MAIZE 239 LADHIADPVNNNAWAYATNFVPGN CB23_LYCES 242 LLDHLDNPVANNAWVYATKFVPGA

−−−−−−−−−−−−−−−−−−−−−−−−

Figure An example output of the transmembrane aligner The two sequences

are chloroplast sequences from maize and tomato The aligment is shown with

indicating the intracellular environment the transmembrane regions and extra

cellular environment

membrane alignments and make counts of residue matches in the identity

range for each region These counts were then converted into probabilities using a

simple pseudo count addition followed by dividing by the total counts The resulting

matrices were go o d enough to make sensible lo oking alignments for transmembrane

proteins as shown in gure

This metho d was a prototyp e to illustrate that using Dynamite more biological

knowledge could b e integrated with the standard protein comparison metho ds This

small prototyp e did not answer many of the harder questions which the work p oses

such as what is the best architecture of the PFSM exp onential length distribu

tions are particularly inappropiate for the transmembrane segments which require a

minimum numb er of amino acids to span the membrane Other researchers are cur

rently lo oking at integrating protein structure rules with protein alignments using

Dynamite and their work lo oks promising

Other Dynamic Programming to olkits

Dynamite is certainly not the rst nite state machine to olkit and is unlikely to

be the last Here is a brief review of some of the other to olkits available and a

comparison of their features with those of Dynamite

UNIX pattern matchers

There are a number of UNIX pattern matchers which provide implementations of

nite state machines designed for text manipulation These include the UNIX to ol

grep Perl regular expressions and the compiler to ol lex All these metho ds have

exible pattern matching engines optimised for text pro cessing in which generally

anumb er of xed letters are required at certain p oints in the match There is little

or no ability for b eing able to score the matches

The only large application of text pattern matching to bioinformatics is the

collection of Prosite patterns convertible to grep expressions with minor p ost

pro cessing required for end eects This pattern library is useful for a number

of biological applications in particular where there are strict rules due to ligand

binding to sp ecic residues However for most cases probabilistic grammars such

as hidden Markov mo dels are more useful

Dong and Searls

Dong and Searls describ ed a system that allowed sp ecication of quite general dy

namic programming nite state machines in Their application had a

graphical front end in whichthe user could draw out a nite state machine likein

gure The application generated PROLOG co de which then solved the dyna

mite programming using the inbuilt Prolog parsing metho d which is a depth rst

search through the p ossible paths

The application was conceived as a prototyping to ol and an illustration of how

cleanly FSMs represented the alignment pro cess In the pap er they mentioned work

to make the front end also generate ecient compilable programs however this was

not to myknowledge implemented

The educational and prototyping nature of the application is also clear in the

logical design There was no easy way to connect the nite state machine to the

machinery to actually calculate the score at any point a very restrictive set of

functions were provided For example this restriction completely prevented their

system to describ e the rep eated HMMs so prevelant in bioinformatics In contrast

Dynamite allows a large subset of C as the interface between the FSM machine

denition and the calculation In addition improvements in Dynamite such as

sp ecial states and lab elling greatly extend Dynamites usefulness

Lefebvre

A context free grammar generation suite was develop ed by Lefebvre primarily for

RNA folding problems This grammar suite used yacc like rules to indicate

how to build the parser As it uses a context free grammar it embraces a larger

set of grammars than Dynamite Like Dynamite it generates C co de However the

integration of more arbitrary C style functions is hard and it only generates the

alignment parsing co de and not other implementations To its credit it recognises

when it has a grammar which is a regular grammar and generates the faster regular

grammar co de

Discussion

I believe the value of Dynamite can be justied by the uses to which it has been

put The algorithms that are presented in the next chapter include some extremely

large nite state machines without Dynamite I am sure I would still b e debugging

the algorithmical co de The fact that designing a new algorithm takes under an

hour and with the lab elling system an improved algorithm can usually be easily

plugged into an existing system has meantIhave exp erimented with many dierent

variant algorithms to solve particular biological problems Suddenly from sp ending

or months designing and co ding a particular algorithm one can play around

with a numb er of alternatives making or dierentversions in a single day This

freedom allows one to take a dierent approach in the research of new metho ds

I have not only used Dynamite internally but also distributed and encouraged

other p eople to use it There is a considerable dierence b etween the author using

his to olkit and someone else picking it up and using it eectively I would like to

thank those early users who havehelpedmesomuch in this pro cess I b elieve that

Dynamite has a lot to oer researchers other than myself Intriguingly Dynamite

has even b een used in other elds than bioinformatics such as econometrics

The ability for Dynamite to automatically bridge the gap b etween the algorithm

designer and sp ecialised solutions such as pthreads co de or sp ecialised hardware is

a great boon to b oth camps It removes the need for the sp ecialist implementor

to meet the algorithm designer and understand his new algorithm To allow this

technology to be widely used Dynamite is distributed with very few restrictions

to anyone who wants it In particular commercial rms can take the Dynamite

compiler and integrate it with their own proprietary system without any restriction

A go o d example of this ability to separate the implementation from the denition

is the OneMo del p ort of dynamite Recently researchers have provided a message

passing interface MPI co de p ort of dynamite this supp orts parallelisation where

there is not shared memory In addition I am thinking of improving the p erformance

of some of the alignment metho ds using dierent reduced space implementations

By implementing this algorithm in Dynamite it will apply to all dierent

algorithms in my standard package automatically

A number of biologically fo cused researchers have downloaded Dynamite and

started to use it to investigate new algorithms As Dynamite shields them from the

actual implementation and encourages exp erimentation of the algorithm they can

fo cus far more on the biological mo del to use For some years wehave b een using

the same basic biological mo del of protein evolution in sequence alignment with

the restraint of implementation removed howmuch b etter can wedo I am lo oking

forward to seeing the results of these endeavors

Chapter

GeneWise

Intro duction

Most eukaryotic organisms exhibit premRNA splicing In this pro cess the tran

scrib ed RNA from the DNA sequence is pro cessed by a complex protein and RNA

mediated system which excises parts of the mRNA sequence these sequences are

called introns and joins the remaining mRNA sequences in order these sequences

are called exons The resulting mature mRNA sequence is exp orted from the

nucleus where it usually provides the template for the translation of a single protein

sequence

The reasons why eukaryotic organisms indulge in such a complex mechanism

between the genomic DNA sequence and the translatable mature mRNA sequence

are unclear p eople havehyp othesised everything from the fact that this exonintron

structure is b enecial to the evolution of genes through to the idea that introns are

simply selsh elements that b ecame xed in eukaryotic genomes An imp ortant

issue is that the pro cessing of the mRNA provides another level of control of gene

expression one of the most common b eing that a single mRNA can generate more

than one mature pro duct due to alternative splicing Regulation via alternative

splicing is for example resp onsible for a key decision in sex determination pathway

in Drosophila melanogaster

One point in all this is very clear premRNA splicing greatly increases the

computational complexity of understanding an organism from its genomic DNA se

quence alone This chapter presents some new approaches to predicting the splicing

structure of genes Firstly I will review some of the bio chemistry of splicing and the

current approaches to solving the gene structure prediction problem

The bio chemistry of premRNA splicing

The pro cess by which freshly transcrib ed mRNA gets pro cessed to mature mRNA

is one which has b een extensively researched since the s when splicing was dis

covered A number of exp erimental systems in particular in vitro human extracts

and yeast genetics have provided a detailed view of the splicing pro cess The

anatomy of the spliced RNA is shown at the top of gure Each exonintron

b oundary is called a splice site and are named relative to the intron the splice

site also called the donor site b eing at the start of the intron and the splice site

also called the acceptor site at the end of the intron All premRNA splicing go es

through two steps The rst step has the end of the intron attack a Adenosine

residue inside the intron forming an unusual to phospho diester b ond The

branched mRNA structure which this forms is called the lariat and is easily distin

guished bio chemically from other mRNA sp ecies The second step has the free

end of the exon attack the end of the downstream exon This excises the intron

and splices the two exons together

The splicing pro cess can be divided into two distinct phases the recognition

of a particular intron and anking exons dening the correct splice p oints and the

actual bio chemical splicing Sadly the latter phase the bio chemical splicing is far

b etter understo o d than the recognition Some of the ma jor players in the splicing

pro cess are small nuclear rib onuclear particles or snRNPs Each snRNP is a large

particle made from one particular RNA molecule and a numb er of protein molecules

Each snRNP is characterised by the RNA molecule it contains which are called

U U etc leading to U snRNP U snRNP etc The intron is recognised by a

p o orly understo o d set of factors which include SR proteins and other proteins not

asso ciated with snRNPs These proteins and the U and U snRNPs somehow

b oth recognise the splice junctions as b eing valid and also correctly pair up the

and of the intron There is a suggestive partial base paring b etween part of the

U snRNA with the splice site however the role of this base pairing in intron

recognition is unclear This recognition pro cess leaves the U snRNP at the

splice site of the intron and the U snRNP at the branch site Next the trisnRNP

U snRNP which is made from the U U and U snRNP joins the U and 5’SS 3’SS

Branch Site

Commitment Complex

5’ Attack at Branch Site

SplicedS Product

Lariat

Figure A diagram showing some of the ma jor pro cesses in splicing The top

panel shows the exonintron structure The exons are in blue and the line joining

them represents the intron The and splice site are shown and the green sp ot

represents the branch site The next three panels shows the splicing pro cess The

red circles represent protein complexes The rst shows the commitment complex

which is a lo ose asso ciation of proteins forming on the two ends of the intron and

elsewhere The second shows the rst catalytic step whichis the creation of a

linkage to the branch site Finally the second catalytic step is the joining of the

two exons and the release of the intron in the closed lariat form

U snRNPs making a complex which can be fractionated called the spliceosome

There are a number of rearrangements of the complex crucially the asso ciation

between the U and U snRNPs change as the spliceosome catalyses the twosteps

of the reaction Intriguingly the U RNA itself has been implicated in the actual

catalytic pro cess which leads to interesting hyp otheses of the relationship b etween

premRNA splicing and the self splicing introns in particular group I I self splicing

introns

Although the bio chemical catalysis of the splicing pro cess is well worth further

investigation the computationally interesting part is the recognition of the introns

and exons One can think of at least two problems the recognition machinery must

b e able to handle rstly it needs to distinguish real splice sites and hence real exons

and introns from other regions in the mRNA which by chance resemble splice

sites Secondly it needs to be able to pair the correct splice sites together as one

could easily recognise correct splice sites but in fact skip over an exon erroneously

splicing the mRNA These problems are exasp erated when one considers alternative

splicing where a single primary transcript has multiple valid splicing patterns in

either dierent tissues or in many cases the same tissue at the same time

Splice Site Recognition

By lo oking at manyintrons and exons a numb er of sequence features in the mRNA

could b e deduced Firstly the splice sites b oth on the splice site and the splice

site have some conserved residues The strongest feature of these conservation is

that the splice site has a conserved GT dinucleotide and the splice site has a

conserved AG dinucleotide giving rise to the so called GTAG rule However it

is obvious that more information than these dinucleotides are required to sp ecify an

intron indeed when naive or complex approaches for mo deling the and splice

sites are used they are not sucient to distinguish real splice sites from random noise

in the sequences Another bio chemical feature is the branch site adenosine which

is attached to the end of the excised intron Early examination of vetebrate

branch site sequences suggested a weak consensus around the A an adenosine is

always used however further analysis of a number of introns did not conrm this

feature In contrast in S cerevisiae and S pombe strong branchpoint consensi were

dened In human introns additional information must b e used by the recognition

machinery At the splice site there is a clear run of cytosine and uracil nucleotides

in the RNA b efore the splice site and after the branch site these are enco ded as

cytosine and thyamine nucleotides in the DNA sequence This poly pyrimidine

tract seems like a clear signal to lo calise the end of the intron and factors which

are bio chemically implicated in the splicing pro cess such as UAF have a in vitro

anity for p olypyrimidine RNA as do it must be admitted other factors which

are not involved in RNA splicing

U splicing

In recentyears a number of introns have b een found which do not contain the GTAG

dinucleotides but instead haveATACdinucleotides at either end of the intron Are

these introns were spliced with the same machinery Elegant exp eriments showed

that for a number of ATAC introns a parallel splicing system was used with a

new series of snRNP factors each corresp onding to one of the traditional set

of factors ie U for U U for U etc The two systems were named after

U or U snRNPs as it is these snRNPs which have been implicated as the key

comp onents which distinguish the two pathways

On further investigation it b ecame clear that whether an intron is spliced byU

or U dep endent systems is not asso ciated with its dinucleotide splice site endings

but instead on the presence or absence of a branch site consensus U dep endent

splicing has a strong branchpoint signal liketheyeast sequences This emphasised

that the recognition of the splice sites do es not o ccu by straightforward base pairing

of snRNA sp ecies to the splice site consensi The discovery of U splicing

has provided bio chemists with a more amenable splicing system as the U system

seems to b e simpler in terms of the numb er of additional factors it uses

Splicing enhancers

Because it was clear that the splice site consensi were not sucient to explain

splice site recognition a number of groups attempted to nd additional cis acting

motifs in the RNA sequence and trans acting factors whichwould bind the RNA

whichareinvolved in this recognition A group of non snRNP splicing factors were

discovered called SR proteins whichhaveanumb er of similar sequence features

and bio chemical features These proteins as a group havebeenshown to b e necessary

factors for splicing and their concentration can inuence alternative splicing patterns

b oth in vitro and in vivo

Many dierent attempts to classify how these proteins work has led to an un

derstanding that they bind small weakly conserved RNA sequences often found

in exons The binding of these factors in manycases stimulates the splicing of the

upstream intron for example see An analogy has b een made with the tran

scriptional enhancers involved in promoter recognition In b oth cases the sequence

motifs are not enough to uniquely indicate the splice site or promoter The motifs

are often additive in their action and can o ccur in a numb er of orientations relative

to the splice sites

Current computer approaches to predicting splicing patterns

Prediction of genes solely from sequence data has b een a topic of research since

the s The rst computer programs to mo del splicing were based on a

mixture of rules and intuition ab out how to represent and combine the signals from

anumb er of dierent features for example the knowledge of the splice sites and the

fact that protein co ding genes have no stop co dons From these rule based metho ds

came three dierent approaches which attempted to provide a more formal basis of

designing a gene nding program

Neural Networks

A number of articial neural networks were used to characterise genomic features

involved in splicing Soren Brunak and colleagues made a feed forward network

whichtook a windowof base pairs around a potential G nucleotide to classify

splice sites Using this network they discovered a number of errors in the

nucleotide databases

Ed Ub eracher and colleagues also used a neural network to predicted exon fea

tures Rather than providing the network with a direct representation of the

DNA sequence they presented the network with a numb er of features calculated on

a base pair window the most imp ortant of whichwere dierenttyp es of base

pair co don frame measures The network then integrated these signals to provide

an overall exon prediction

Although these metho ds were certainly eective at the tasks they designed for

it is less clear how complete gene predictions can b e made by using neural networks

as presenting all base pairs in a piece of genomic DNA to a network is clearly

impractical The imp ortance of neural networks in gene prediction has diminished

over this decade mainly for this reason

Linear Discriminant Functions

Other machine learning techniques have been used apart from neural networks

Solovyev SalamovandLawrence exploited Linear Discriminant Analysis to solvethe

gene prediction problem Linear Discriminant Analysis nds the b est hyp er

plane in a series of feature dimensions which separates two classes of data In these

cases the authors built up Linear Discriminant functions for many dierent genomic

features such as and splice sites the starting ATG and p olyA addition site

These features in each case were trained by compiling training sets of real features

and pseudofeatures whichwere pieces of random genomic DNA which met some

criteria of b eing a p ossible but not real splice site or p olyA For example for the

splice site they found sequences containing GT whichwere not splice sites

These individual features could be used on their own as feature prediction in

their own right but this would not help p eople get a complete prediction of the

gene To counter this the authors provide a nal dynamic programming based

parsing of a sequence on the basis of these precomputed features This dynamic

programming is of features not of individual bases and so is unlike the dynamic

programming outlined in chapter two of this thesis The score of the entire gene

prediction is the combination of the Linear Discriminant values for the individual

elements in it and the b est parse has b e consistent with a numb er of rules the most

imp ortant one b eing that it do es not contain stop co dons in the translation

PFSMs in Gene Prediction

The use of PFSMs in gene prediction has already b een discussed in the intro duction

chapter section PFSMs are a go o d t to the gene prediction problem

The following sections will briey discuss the dierent PFSM based gene prediction

metho ds with an emphasis on their biological gene mo del

Genie

Genie provides a generalised framework which represents the gene mo del in

which the actual comp onents can be swapp ed in and out The comp onents were

envisaged to be a mixture of simple ones such as th order Markov chains hex

amer frequencies for exon regions and complex ones such as neural networks for

splice sites The integration of neural networks which do not come with an easy

probabilistic interpretation into a probabilistic correct mo del was a challenge To

allow these complex features to b e correctly expressed the framework had to cop e

with features of variable length b eing pro duced by the mo del To allow this the

Markov rules were relaxed to allowthe duration of a particular set of observations

to be taken into account in the denition of the probability of its emission This

extension can b e integrated into all the common place manipulations of the hidden

Markov mo dels at the exp ense of additional running time and memory requirements

Genscan

Genscan provides a very similar framework mo del as Genie do es but rather than

making a clear distinction b etween the framework mo del and its comp onents con

centrates on providing a single integrated mo del for the entire pro cess Like

Genie the Markov rules need to b e relaxed though in Genscans case this is primar

ily to allow the exon length distribution to be more accurately mo deled Genscan

provids a complex representation of splice sites using a mixture b etween a deci

sion tree and p osition sp ecic weight matrices Another imp ortant feature was that

the hidden Markov mo del was parameterised for four dierent GC content levels

In human sequence it has long been known that the gene structures are dierent

in dierent iso chores in euchromatin introns tend to be shorter whereas in het

ro chromatin introns tend to b e longer Euchromatin and hetreo chromatin also have

dierent GC content The reparameterisation on the basis of dierent GC content

provides an eective way of mo deling the observation of dierentintron lengths in

the dierentisochores without requiring the user to dene the iso chore b eforehand

whether the GC content is directly linked to intron length is less obvious

Genscans eectiveness is that it is entirely fo cused on providing a single inte

grated hidden Markov mo del for ab initio gene prediction Because of its simplicity

in approach it has b ecome the gene prediction metho d to beat even though other

metho ds have a roughly similar measurable p erformance

HMMgene

HMMgene is the third ma jor hidden Markov mo del approach to gene prediction

In this case a number of interesting additions to the HMM approachwere applied to

tackle the problem of mo deling exon length distribution The exons are mo deled as

b eing the pro duction of a series of identical states which leads to a broad normal like

length distribution for the exon As each state is providing the same biological

mo del the straight forward Viterbi path is inappropriate the distribution of exon

lengths arising from Viterbi parsing would b e an exp onential decay

HMMgene solves these problems by providing a best parse over biological la

b els where each lab el is the biological interpretation of the state The exon states

share the same lab el making this parse of this mo del not the same as Viterbi

HMMgene provides a non exact metho d Nb est paths to capture the best lab el

parse given particular data One pleasing consequence of this is that as desired the

exp ected length distribution of exons is nicely hump ed

Performance of ab initio Gene Prediction programs

The p erformance of ab initio gene prediction programs is hard to assess The main

problem is that in long pieces of genomic DNA of the typ e one one would like to

using for testing it is hard to know for sure that one has exp erimentally veried all

genes A gene prediction which do es not overlap with exp erimental pro of can easily

be a mistake in the exp erimental verication not a false p ositive Despite these

problems a numb er of assessments of gene prediction accuracy havebeendevelop ed

Guigo and Burset provided one of the earliest assessments which has since

b een followed byanumb er of assessments more relevant to genomic DNA The

conclusion of the large scale studies is that gene prediction programs haveacoverage

of around of exons but at an accuracy of around in other words one can

exp ect around half the exon predictions to b e incorrect in large genomic datasets

Combining Homology with Gene Prediction

The central idea b ehind this work is very simple by combining the fact that one

knows that a gene pro duces a protein which is homologous to another known protein

with the gene structure rules one should b e able to make far b etter gene prediction

metho ds Lo oking at this problem from another p ersp ective it also frees p eople who

are interested in nding proteins which are homologous to their protein of interest

from b eing dep endent on a go o d gene predictor to analyse genomic DNA

Both the pro cess of evolution between two homologous protein sequences and

the gene structure rules are well describ ed by PFSMs This suggested that a single

PFSM which p erforms b oth pro cesses simultaneously should b e p ossible To provide

this PFSM I need a description of the two mo dels and a theory to provide the

combination of the mo dels Intuitively one exp ects that the resulting mo del would

b e a PFSM with similar prop erties to the twocombined mo dels though considerably

larger in size By relying on the other probabilistic mo dels as a starting p oint the

parameterisation could be solved by taking the previous mo dels and making sure

that we had a principled wayofcombining the mo dels

The implementation of the combined probabilistic mo del lo oked daunting as

its sheer complexity along with the size of the sequences which I would be using

suggested a programming nightmare Thankfullyby using Dynamite as a way of im

plementing the algorithm a robust timely implementation of the Viterbi algorithm

could b e provided

Combining Probabilistic Mo dels

By solving the general problem of howtocombine two PFSMs I could then use that

for this sp ecic case The typ e of PFSMs which I wish to combine are two alignment

typ e PFSMs each providing a mapping from one of set of letters to another set of

letters Consider two machines S and T where S maps a sequence of letters from

the alphab et A to sequence of letters from the alphab et B and T maps letters from

B to C The following notation is used to describ e the machines This pro cess is

shown diagrammatically on the top panel of gure

S has states n The transition from state i to state j emitting a nite

s

string of letters a in one sequence and b in another sequence has probability S

ij ab

a is a nite string of letters drawn from the alphab et A with representing a non

emitting spacer b is a single letter drawn from the alphab et B with the additional

string representing a nonemitting spacer There is a requirement for b to be a

single letter to allow the merging pro cess to work Similarly the T machine is dened

with states n and a transition from k to l emitting a single b and any nite

t

length of c letters is T

klbc

We wish to construct the state machine U which will map a sequence of A letters

to a sequence of C letters considering all p ossible sequences of B intermediates We

prop ose that U has n n states each of which can be characterised by a pair of

s t

states in each of the original machines i and j will b e used for states from the S

machine and k and l as indexes from the T machine Thus the transition U

ik jl ac

is the transition from the state i k in U derived from i in S and k in T to the

state j l derived from j and l states resp ectively emitting a in one sequence and

c in another sequence We need to construct the denitions of the probability for

each of these transitions in U in terms of the transitions dened in S and T The

following equations provide that denition

For neither a nor c being

X

U S T

ij ab klbc

ik jl ac

b

For a b eing but not c

X

U T S T

ij

klc ij b klbc

ik jl c

b

For c b eing but not a

X

U S S T

ij a

kl ij ab klb

ik jl a

b

The rst equation sums over all p ossible b intermediates for this transition

using the underlying transitions from the S and T machines The two other cases

are when no letter is generated for one of each sequence In each case the blank could

have been generated from either S or the T machines For the case of generating

a blank in the a stream of letters if it was generated by the S machine then that

means that there is an intermediate b sequence which has to accountforthec string

However if the blank was generated by the T machine then there is no p ossible b

letter and furthermore this case can only o ccur when the transition remains silent

in the S machine index hence the The symmetrical argument applies for c

ij

emitting but a We need not worry ourselves ab out the case when S emits a a

pair and T emits a c pair as this contains no b sequence and so is not p ermitted

The combined machine is shown pictorially on the b ottom panel of

The requirement that the two machines emit at most a single b letter p er tran

sition is so that we can do the summations in equations and over all

b If b was longer than a single letter it would b e p ossible to have S machine pro

duce a b string whichwas out of phase with the b string that the T machine would

pro duce I can see no clean waytoprovide the derivative U machine transitions in

this case One could claim that longer b strings could b e allowed as long as the two

machines were emitting in phase strings but this simply means that there is a

 

new alphab et B where each in phase string is mapp ed to a single b letter

Notice that the U machine represents a sort of mo del pro duct of S and T

This means that for even mo destly sized machines the pro duct will b e quite large

However the combined machine allows us to ignore the identity of intermediate

sequences in standard calculations of for example what is the likeliho o d of two

sequences A and C b eing generated by the combined machine and what is the

most likely path through both S and T The additional bulk of the combined

machine is easily justied when one considers the alternative is listing all p ossible

B sequences which dep ending on the architecture of the machine might in fact b e

innite

This theory has b een built up for a rst order PFSM in the mo del To extend

this theory to higher order PFSMs is easy as one can utilize the fact that any PFSM

can b e represented as a rst order PFSM at the cost of more states in the machine

This provides us with a principled way of combining any two machines However

notice that the expansion of a non rst order machine to a rst order machine

is also a pro duct typ e op eration If one do es want to merge two non rst order

machines the numb er of states required to mo del all the indep endent paths through

eachmachine will rapidly b ecome impractical A BC

A C

Figure A diagram showing the combination of two mo dels The top panel shows

the two PFSMs acting sequentially rst transducing the A letters to B letters and

then transducing the B letters to C letters The lower panel shows the merged

machine Codon

Base Deletions Base insertions

Phase0 Phase1 Phase2

5’SS 3 SS

Central Poly−Py Spacer

Figure The gene mo del used in GeneWise The central black state represents

the co dons there is no explicit start or ending of the co dons Sequencing error

are mo deled as co dons of one or two bases long deletions or four or ve bases

long insertions shown in red There is a separate mo del for each phase of introns

each of which have three states shown in green The transition to the rst state

emits a xed length base pair region representing the SS The transition from

the last state emits a base pair xed length region representing the SS The

p olypyrimidine tract is mo deled as a separate state

GeneWise mo del

The GeneWise mo del was created to be the integration of two separate mo dels a

gene prediction mo del and a protein homology mo del using the ideas outlined in

The genomic sequence is equivalenttotheA sequence the predicted protein

sequence of the gene is the B and the homologous protein sequence to which it is

being compared to is C The aim is to compare genomic sequence directly to the

homologous protein sequence considering all p ossible intermediates of the predicted

protein

To b e able to use the mo del combination theory eectively high order Markov

dep endencies have to b e removed from the two mo dels The protein mo del which

is a probabilistic SmithWaterman mo del is rst order Gene prediction mo dels

however are generally of a higher order and it this mo del which has simplications

to make the merging pro cess achievable There are two approximations made to

make the mo del usable Firstly amino acids which are split byintrons are ignored

This is similar to what happ ens in most gene prediction programs which make

the same approximation to avoid having a high order Markov dep endency in the

PFSM Secondly high order Markov dep endencies in the co ding sequence mo del are

shortened Most gene prediction programs use th order Markov chains to mo del

co ding regions whereas I use a simpler mo del that emits triplets equivalent in some

sense to a nd order Markovchains This has the nice feature that there is a simple

mapping to the letters of the intermediate sequence the predicted protein

The gene prediction mo del is shown in gure It has a single state repre

senting exons which emits a series of indep endent co dons The intron states are

dierentiated by which phase the intron o ccurs in phase b eing the place in the

co don which the intron interrupts For phase and phase introns the fact this

interrupts a co don is ignored and is not scored Each intron is considered to be

made from sections the splice site a central intron section a p olypyrimidine

tract a spacer following the polypyrimidine tract and the splice site As the

and splice sites are considered to b e ungapp ed motifs they can b e represented by

single transitions which emit or base pairs resp ectively

Sequencing error is represented in a naive manner in which the insertion or

deletion of bases is considered to pro duce co dons of one or two in the case of

deletion or four and ve in the case of insertion base pairs In these cases the

typ e of co don is ignored

The homology mo del is delib erately mo deled to be the inclusive mo del of the

SmithWaterman protein alignments and KroghEddy protein prole HMM The

homology mo del has a numb er of rep etitivenodeseach no de b eing states called

Match Insert or Delete see gure The Match State for protein sequences

represents matching a protein residue in the observed data whereas the Insert state

represents inserting an additional protein residue in the observed sequence relative

to the given homology information The Delete state represents a no de which is

skipp ed out silent iehomology information whichdoesnothaveanequivalentin

the observed sequence

This homology mo del can also b e represented in a more compact form as in the

b ottom panel of gure In this form the no de structure of Match Insert Delete

is represented and it is assumed to rep eat for the length of the mo del length b eing

in no des This representation emphasises the similarity to the traditional Smith

Waterman metho d and as a transducer typ e PFSM

Given these two mo dels the combination using the rules outlined ab ove is simple

The combined mo del should have states expanding each homology mo del

state into separate gene nding states This pro cess is shown pictorially in gure

However not all these states are actually required in the comparison as we

know that some transitions are forced to zero This is b ecause it is imp ossible to get

an intermediate protein sequence letter with no genomic DNA sequence in other

words the combination b in the previous notation do es not o ccur Applying this to

equation means that we can removeanumb er of states

Because transitions which emit c are all directed to the Delete state of the

homology mo del this means that all the transitions which are directed to the intron

states of the delete state in the combined mo del have probability zero as i j for

these states The result is that we can remove them This is intuitively correct as

the Delete state mo dels p ositions in the homologous protein sequence which have

no sequence in its protein counterpart With no protein sequence in the predicted

protein there is no p ossibilityof an intron at this p osition Notice that the Delete

state still has transitions to the Match and Insert introns as these transitions do

emit a protein sequence

The interintron transitions can also be removed The intron states all have

transitions which emit a pro ducing genomic sequence with no corresp onding protein B E

M M M

I I I

D D D

Expanded Model N times

B M E

D I

Contracted Model

Figure Two representations of prole Hidden Markov Mo dels The top panel

shows a standard expanded hidden Markov mo del in which each state in the

mo del is shown The central p ortion of the mo del can be rep eated N times The

b ottom panel shows a contracted representation of the mo del where each state rep

resents one of the canonical states in the HMM The transitions b etween the states

are coloured to show how they progress in terms of the mo del and the sequence

Black transitions advance in both the mo del and the sequence blue transitions in

only the mo del and green transitions only in the sequence predicted homologous genomic protein M protein

D I

Merged model

Figure A gure showing the merging pro cess b etween the gene prediction mo del

and the homology mo del The mo dels used have the same outline as those in Figure

and They are written out in the top panel The lower panel shows the

merged mo del The large blackarrows indicate that b etween each set of four states

all p ossible transitions exist This mo del will b e later pruned to a small set of states

see the text for more details

sequence For these transitions the sum in equation is zero as all the transitions

ab b are zero The other term is also zero as movement between dierent

kl

introns implies k l Again this marries well with the observation that one cannot

change from b eing in a Match intron to an Insert intron in the middle of an

intron

The pruned mo del called GeneWise is shown in gure The name

reects the numb er of states and numb er of transitions used in the mo del

Like the HMM gure this gure represents the rep etitive structure of a single no de

which is rep eated for the length of the homology mo del

This mo del is written down as a Dynamite mo del whichisprovided in App endix

B The Dynamite compiler then generated the Viterbi and database searching ver

sions of this algorithm automatically

It was clear from the startofthiswork that GeneWise was an overly am

bitious mo del and probably not useful for practical work Using Dynamite I ex

p erimented with a number of dierent machines and a go o d compromise between

sp eed sensitivity and compromise on the correct mo del was the GeneWise

mo del shown in gure Compared to the GeneWise mo del what I have

removed is the following

The p olypyrimidine states are removed providing a saving of states and

transitions

The MatchInsert information is lost in the introns This means that for

the cases where a Match to Insert or Insert to Match transition o ccurs in

the protein in the same p osition as an intron it will not b e scored correctly

As b oth introns and MatchInsert switches are rare this should not be a

signicant problem

The sequencing error transitions are removed from the MatchInsert and

Delete to Match and Delete to Insert transitions This means that sequencing

error which falls at a switch in the protein state will not b e mo deled and will

probably o ccur a base pair b efore or after the real p osition

Heavy use of the GeneWise mo del has shown excellent results section

presents results on this and has become the workhorse for GeneWise metho ds I 0 1 2 Match

0 1 2

Delete Insert

IntronPy Spacer

5’ 3’

Figure GeneWise Algorithm The dark circles represent states and the

arrows between them transitions Black transitions are standard protein transi

tions red transitions are frameshifting transitions and green transitions are intronic

transitions Introns are each built of three states listed at the b ottom of the gure

made even further reductions in GeneWise in which the dierentphasesofthe

introns were merged into a single state resulting in only states and transitions

Parameterisation

Simplicity was the byword for the parameterisation problem I exp ected most of

the p ower of this metho d to come from the application of accurate protein prole

HMMs to the gene prediction mo del iethe protein homology mo del would force

the gene mo del to take certain parses as the gene prediction was a b etter t to

the protein homology The gene mo del would only be used to provide good edge

detection of exons principly splice sites Therefore the approach was to take the

established prole hidden Markov mo dels from Sean Eddys HMMER package for

the protein homology mo del When we used a single sequence not a prole HMM

we parameterised it as if it was a protein prole HMM providing parameters deduced

from the standard SmithWaterman parameters routinely used For the gene mo del

we wanted to make it simple The approach was to make maximum likeliho o d

estimates of the xed length motifs of the splice sites from known splice sites and

parameterise almost all the rest of the gene mo del as if it was background However

the devil is in the details of parameterisation as the next sections will illustrate

Splice Site Mo dels

Splice sites have a long history of b eing mo deled in some complex ways These range

from neural networks through p osition weight matrices and more complex decision

tree typ e mo dels such as the one used in Genscan As the rst attempt at a splice

mo del we designed a system whichwas a a generative probabilistic mo del so that

we could t it in well with our scoring scheme and b could capture a lot of the

supp osed complexity in splice sites in particular the splice site using a decision

tree structure very similar to Genscans but with more branches

However this metho d did not p erform well p erhaps due to our training system

or p erhaps b ecause we did not require a complex splice site mo del see the path

interpretation detailed b elow Wewere also impressed bywork from

and colleagues that showed b oth that decision tree metho ds could b e considered a

mixture of p osition weight matrices which in many ways simplied their p ossible

training and that these mixtures could not detect that much more information in 0 Match 1 2

Delete Insert

Intron

5’ 3’

Figure GeneWise

the splice sites than simple p osition weight matrices Despite all the mystique

surrounding splice site mo dels it seemed that simpler was b etter We therefore

settled for a very simple ungapp ed mo del of splice sites whichcouldb e estimated

from observed splice sites by simply lining up splice sites on the splice junction and

observing residue counts at each p osition To deduce the probabilities of the mo del

we added a simple pseudo count metho d which represents a single unbiased Dirchlet

prior distribution on the multinomial of residue emission probabilities This simple

mo del outp erformed the old mo del considerably and we have found no reason to

revisit the splice site mo dels since

Co don emission probabilities

The emissions of co dons in the Match and Insert transition in the mo del are due to

three dierent eects

The amino acid distribution of the protein homology mo del

The co don bias of the organism

The substitution of the base pairs due to p ossible sequencing error

We considered this pro cess to be the transformation of the vector of amino

acid probabilities in the homology mo del to the p ossible co dons The co don

probability given a particular amino acid is decomp osed as coming from two p ossi

bilities

There was no substitution error in which case the probability is for co dons

which are not translated to this amino acid or P codonjaminoacid for the

co don bias in this organism

There was a substitution error meaning that the observed co don is actually

a dierent co don in the real DNA sequence In this case we considered every

p ossible single base pair substitution but not double hits inside one co don

Due to every p ossible base b eing substituted in most cases this means that

co dons combine information from a number of dierent amino acid p ositions The

eect of the substitution error therefore was to smudge out the amino acid distribu

tion over a number of dierent co dons mainly the ones which enco ded the amino

acid but also nearby co dons whichare related by a single sequencing error An

upshot of this is that stop co dons do have some small probability asso ciated with

them but this probabilityis greater when the homology p ositions are more likely

to emit amino acids which are a single base substitution away For example strong

trytophan emitting p ositions co don TGG have a relatively large chance of match

ing TAG and TGA stop co dons compared to other p ositions Default parameters

for substitution error was one error in ten thousand base pairs the quoted accuracy

for genome sequencing pro jects

Insertion or Deletion errors

GeneWises approach to handling insertion or deletion errors in the sequencing pro

cess is delib erately naive This is because handling the insertiondeletion errors

correctly is dicult When one considers the theory outlined in it would be

attractive to consider sequecing error to be the action of another machine which

substitutes inserts or deletes bases of the DNA sequence b efore it is used in the

gene prediction pro cess This sadly breaks the rule that the intermediate B se

quence is only emitted as single bases The sequencing error machine would emit

bases the intermediate sequence as single bases but the gene prediction machine

takes triplets or larger strings to form the amino acid sequence This means that I

cannot use the theory in providing the combined machine However I can use it to

guide the design of some of the ad ho c solutions to it The mo del merging theory

suggests that the sequencing error transition should o ccur at every place where DNA

sequence could b e emitted

GeneWise considers sequencing error to be a or base co don The base

comp osition of the deletion or insertion is ignored completely whichis a gross ap

proximation For example if one observes a TTTT in a putative sequencing error a

strong to phenalanine emitting p osition co don TTT is a far more likely p osition to

emit this than Glycine co dons GGN Ignoring the base comp osition was really to

prevent excessive calculation Deriving the p otential probability for base deletions is

relatively easy to do as one is considering only one or two bases and one can build

up a lo ok up table for each p osition of all combinations However one cannot take

this approach for or base pairs as the tables b ecome to o large An alternativeis

to call a function whichwould onthey calculate the probability of or base pair

insertion of particular bases to a probability distribution of amino acids However

such a function would b e called at every cell in the dynamic programming matrix

making it an extremely exp ensive solution

Intron parameterisation

The gene prediction mo del is non standard wehave no nice parameterisation such

as the HMMER HMMs whichwehave for the protein homology Therefore wehad

to parameterise the entire mo del ourselves Again simplicitywas our watch word

and we to ok the entirely simple way of collecting a data set of known genes checking

their accuracy and tabulating counts of particular features These counts were

splice site regions to calculate p osition weight matrices as ab ove

co dons in co ding regions to calculate co ding bias

bases in introns to calculate intron bias

length of introns to calculate transition parameters in introns

The only issue here was deciding on the parameterisation of p olypyrimidine

tract regions These tracts cannot b e dened bio chemically Rather than going for

a full blown training technique we simple chose regions which by eye lo oked like

the p olypyrimidine tract As it happ ens the mo del with the p olypyrimidine tract

is not the most common mo del used so this parameterisation was not imp ortantin

the nal analysis

Path scoring

Having chosen and parameterised the probabilistic mo del one mighthave thought

that all the diculties were over Indeed I exp ected that the Viterbi path through

this combined mo del would provide highly accurate gene prediction using the protein

homology to drive the pro cess Sadly there were a numb er of problems to confront

rst

Flanking Regions

One issue I did not account for at rst were the anking regions of genomic DNA

outside of a homology region that we were interested in predicting These regions

could of course contain genes either as part of the gene we wished to predict which

was outside the region of homology or entirely dierentgenes These genes might

not haveany similarityto the protein homology mo del we were using but they do

have gene features which would score well against the gene prediction p ortion of

the GeneWise mo del These additional fragments often caused mispredictions at

the end of the homology region In particular if the homology region ended near

an intron as introns haveavery broad length distribution GeneWise could ignore

the correct end of the intron and simply cho ose the best start or end splice site

resp ectively for the start or end of the region within the rest of the sequence For

large clones this gave rise to comedy gene predictions which spanned almost the

entire length of the clone They would start with a rst very large intron leaving at

the b est splice site in the clone joining to the homology region and continuing to

near the end of the homology region when another large intron that would jump to

the b est splice site in the remainder of the DNA sequence Unsurprisingly the

biologists who saw these early results were not impressed

The solution is to somehow make the anking regions less attractive to the

homology mo del The most principled way of doing this is to provide anking mo dels

which represent the content of genomic DNA in the absence of the homology mo del

These regions would then score at least as highly as homology gene prediction

mo del in the absence of homology and in general much b etter causing the homology

mo del to be kept in its correct place The most natural way to build the anking

mo dels is to duplicate the gene prediction mo del in the absence of the homology

mo del A practical issue of this is that these regions are sp ecial states in the

dynamite mo del This is what was done in the GeneWise mo del

A ma jor drawback to this approach is that now every genomic DNA will score

well against this mo del even if the genomic DNA do es not contain a gene with

homology to a particular HMM or protein sequence Thus using this mo del for the

detection of the presence of homology requires the path information of where the

most likely path went through the mo del in particular if it crossed into the homology

part of the larger complete mo del of b oth anking regions and homology mo del

Ideally one also wants the likeliho o d score of just the homology p ortion Although

there are ways to computationally achieve this without requiring the calculation

of the entire path it is an additional computational step in an already exp ensive

op eration

For GeneWise we provided the reverse solution by toning down the gene

prediction parts of the mo del so that any p otential b enet of pro ducing a erroneous

intron would b e more than outweighed by the additional p enalties for misaligning a

homology region As GeneWise do es not have a p olypyrimidine tract mo del a

considerable gene prediction signal is removed In addition no intronic bias where

the base comp osition of the intron is dierent to the intergenic DNA sequence was

provided The only remaining gene signals were the actual splice sites and in tests

these were not sucient to cause this error in practical use

Co ding region scoring

Further analysis of the gene prediction errors from GeneWise showed a common

error in mispredicting splice sites internally even with reasonable homology across

this region of the gene Further analysis of this phenomenon showed that in these

regions the homology information was reasonably consistent with a numb er of splice

sites but the gene prediction information weighed heavily towards a particular site

In nearly every case if one had taken the b est homology predicted splice site one

would have made the correct decision One issue to note here is that it is p erfectly

p ossible that we are b eing misled by alternative splicing iethat there are a number

of correct splice sites in this region and that the disagreement between the two

mo dels indicates alternative splicing It is hard to assess this as clean datasets

indicating every p ossible alternative splice transcript of a set of genes have not b een

develop ed yet

Again the problem here seems to b e that the gene prediction information is mis

leading the protein homology information One of the largest eects that provides

this misleading information is the gene mo del of the co ding regions rare co dons

are heavily p enalised and hence the most likely path in ambiguous regions of the

homology matching are dominated by the fact that certain amino acids are b etter

represented in random DNA sequence

Avery empirical solution to this problem is to score the co ding regions as if they

were scored as protein in particular to compare the probabilityof the co don with

the probabilityof the co don o ccurring in a random protein The logic b ehind this

is that one wants to nd the b est homology path considering a random mo del of a

protein co ding gene we call this a synchronous null mo del as one imagines a gene

prediction mo del for the null mo del which is used in sync with the combined mo del

entering and leaving co ding regions at the same places as the combined mo del

The eect of using this synchronous mo del is to change the best scoring path

Empirical evidence shows that this scoring technique do es pro duce b etter results

Using the GeneWise algorithm

How one uses the GeneWise algorithm is not trivial GeneWise is not going to

provide a complete answer for the genomic analysis of a region Indeed if the

homology region do es not extend across the entire region which is the most common

case GeneWise will not even provide a complete gene prediction In defense of

GeneWise when the prediction is provided it is highly accurate

Using GeneWise in a practical manner is going to require the integration of

GeneWise with other to ols Providing an easy route for this integration is the

role of the software which surrounds the GeneWise algorithm GeneWise is the

gure piece of the Wise package a software package which contains most of the

algorithms which I have developp ed The Wise package provides access to the

GeneWise algorithm in two principle ways

as command line programs to b e used in the UNIX paradigm of reading and

creating les p otentially these les b eing pip es into or from other programs

as a callable API Application Programming Interface accessible in either C

or Perl

Both systems provide the results of GeneWise in standard formats such as

EMBL feature table format AceDB interchange les and GFF Gene Feature Find

ing format The app endix C has a copy of the Wise package user do cumentation

which lists fully all the options available in the package and has go o d discussions

ab out howtointegrate GeneWise into a genomic analysis package

Alignment 1 Score 32.55 (Bits)

RRM 1 LFVGNLPPDVTEEDLKDLFS KFGPIV L V+N+P+ + DL+++F+ +FG+I LHVSNIPFRFRDPDLRQMFG QFGKIL HS41P2 11790 ccgtaactctcgcgcccatgGTAAGTC Intron 1 CAGctgaac tatcatctgtgacatgattg<0−−−−−[11850:15370]−0>atgatt gtcttttcccgctccgggtg gtcaca

RRM 27 SIKIVRDIIEKPKETGKSK GFAFVEF +++I+ + SK GF+FV+F DVEIIFN−−−−−−−ERGSK GFGFVTF HS41P2 15389 gggaata gcgtaGTAAGCT Intron 2 CAGgtgtgat atattta aggca<0−−−−−[15425:25099]−0>gtgttct taaactt atctg acgcatc

RRM 53 ESEEDAEKALEALNGKELGGRKLRV E++ DA++A E+L+G++++GRK++V ENSADADRAREKLHGTVVEGRKIEV HS41P2 25121 gaaggggagagatcgaggggcaagg aagcacagcgaatagcttaggatat

gttttacgcggaacccgagctacgg

Figure An example of a alignment from GeneWise This output format is

designed for human readability there are other formats designed to b e parsed The

alignment of the RRM prole HMM against the human sequence HSP is shown

running left to right in a blo ck At each p osition the most probable amino acid in

the HMM match emission is shown matching to the predicted amino acid from the

DNA sequence The co don whichprovides that amino acid is shown b elow running

from top to b ottom Introns are shown as breaks in the alignment The splice site

bases are shown in full but the central p ortion of the intron is omitted for clarity

Example of using GeneWise

A go o d example of using GeneWise in a real life situation is with the clone dJP

a clone on chromosone Preliminary halfwise analysis see Chapter which

uses genewise showed the presence of a gene on the reverse strand with a RNA

Recognition Motif domain internal to it shown in gure Further analysis showed

that it was a member of the SR protein family A small multiple mutliple

alignment of this family was made and from this a proleHMM of the larger gene

family which includes more of the surrounding sequence of the domain sp ecic

for this family Rerunning GeneWise with the extended HMM enlarges the gene

prediction

Evaluation of GeneWise

An ongoing evaluation of GeneWise was necessary to assess how well it works

and how dierent parts of the algorithm contribute to the nal result Assessing

gene prediction algorithms is notoriously tricky principly b ecause gathering go o d

datasets together is painful In addition GeneWise needs a source of homology

information as well as the DNA sequence the assessment should also quantify the

success of the algorithm at diering levels of similarity

A dataset to fulll these criteria was generated in the following manner Itook

each Pfam family and found a human protein whichboth had a genomic sequence

and was spliced The human protein had to be referenced in Swissprot which

indicates a certain amountofmanual curation Using this protein sequence I then

selected ve more protein sequences one from each of the following similarity bands

identity based on the Pfam align

ment In addition we could also use the Pfam HMM as a source of homology

information

One common source of error in GeneWise is misplacing the relative p ositions of

protein insertions relative to introns A protein insertion in the homology mo del

which is close to an intron will b e hard for GeneWise to nd

There are a numb er of criteria wewould like to assess GeneWise on

the accuracy of GeneWise when GeneWise claims that a base pair is part of

a gene how often is this so

the coverage of the gene b oth in absolute terms and in addition in terms of

howfarwe exp ect the homology to cover

What is rep orted for each similarity band are the following gures

the total numb er of base pairs predicted as co ding regions

the accuracy of these co ding region predictions

the number of bases which should have been predicted as co ding and this

represented as a prop ortion of the total base pairs predicted

the coverage of the correct co ding region vs the entire gene

Total Acc Missing Prop Missing Coverage

HMM

Table Table showing p erformance of the standard algorithm

Total Acc Missing Prop Missing Coverage

HMM

Table Table showing p erformance of the standard algorithm

Total Acc Missing Prop Missing Coverage

HMM

Table Table showing p erformance of the algorithm with Viterbi scoring

Total Acc Missing Prop Missing Coverage

HMM

Table Table showing p erformance of the algorithm with standard param

eters

There are a number of p oints to draw from these results Firstly GeneWise is

extremely accurate This should b e exp ected after all GeneWise is using similarity

information to drive the gene prediction pro cess and there is a huge amount of

information in the similarity data In addition this accuracy is maintained down to

really very low similarity measures even at identity one is still getting

accurate co ding sequence prediction

As similarity decreases two things are lost rstly the coverage of the gene drops

to ab out one half of what is achievable from similar sequences Secondly the

amount of missing co ding sequence ieregions predicted as introns but actually

co ding sequence grows dramatically increasing by a factor of ten b etween from

identical sequences to identical sequences This means that at identity

the exons are shorter by ab out

The dierences between the dierent algorithms are marginal There is slight

dierence in the balance b etween CDS prediction by Table compared to

the metho d Table Indeed even the metho d whichhasa very limited

gene mo del has almost identical accuracy levels to the mo del These results

emphasise that it is the quality of the protein homology mo del which is driving the

accuracy of the gene prediction not the gene prediction machinery

Other evaluations of GeneWise

There have b een two other critical evaluations of GeneWise by indep endent re

searchers These evaluations by other parties haveprovided very useful feedbackin

particular in default congurations and in more real life situations

Guigo and Agarwal

Ro deric Guigo and Pankaj Agarwal evaluated GeneWise alongside two other

programs Pro crustes and Genscan GeneWise came out with a p ositive

write up with the one drawback of the computational exp ense of the program The

essential gure gave GeneWise an accuracy of when used with proteins selected

from a BLAST output of less than e very much in line with the results presented

in this chapter They considered programs both in the small DNA sequence case

and in articial large genomic DNA sequence where they embedded known genes

into random DNA sequences The gures for the protein comparison programs

Similarity strong mo derate

Program Sn Sp CC Sn Sp CC

GenScan

GeneWise

Pro crustes

Table Table summarising results from Guigo and Agarwal on GeneWises com

parisons to other algorithms Strong similarity is a protein BLAST hit of under

 

pvalue Mo derate similarity is a protein BLAST hit of under pvalue

GeneWise and Pro crustes remained the same A summary of one of the tables in

the pap er is repro duced here with the authors p ermission

The principle gure they use to assess gene predictions is the correlation co

ecient which is a combination of the sensitivity coverage and the selectivity

accuracy of the metho d Because GeneWise do es not build gene predictions out

side of the homology p ortion of the match the sensitivity drops considerablyasit

do es in my analysis As the primary statistic they quote is the correlation co ecient

p erformance do es not lo ok ideal However examination of the tables summarised

in shows that the accuracy measurement tallies well with the results presented

in this chapter If one considers that GeneWises role is to pro duce highly accurate

but partial gene predictions which should then b e used by further analysis then it

is doing its job admirably

The Drosophila annotation exp eriment

Martin Reese and colleagues used the megabase region around the ADH region

of Drosophila for an assessment of gene prediction to ols called GASP Anumber

of ab initio prediction metho ds and some combined homologyprediction metho ds

were assessed I used here the Halfwise metho d with will be describ ed later in

Chapter but for the purp oses of the comparison it was basically GeneWise

using HMMs from Pfam GeneWise scored surprisingly p o orly with an accuracy

of and a sensitivity of The low sensitivity can b e explained on the basis

that Pfam HMMs were used as the source of homology information and not protein

sequences The p o or accuracy is more worrying

Closer examination of the data revealed a number of reasons why GeneWise

scored with such a low accuracy Firstly I did not remove retroviral proteins from

the Pfam database nor did I use masked DNA sequence This was delib erate as I

wanted to see the behaviour of the search when p otential retroviral proteins could

be matched Although I provided sets b oth with and without retroviral hits to the

assessors only the set with retroviral hits were considered As retroviral genes were

not considered to b e real genes unsurprisingly this caused a large loss in sensitivity

With the hits removed GeneWises sensitivity on a base pair level was raised to

There were still a number of GeneWise predictions which did match up to any

of the standard datasets either the exp erimentally veried genes std or the

broader well annotated set std These gene predictions generally had high bits

score which indicates that there was a strong hit to homologous proteins These

lo ok like real genes to me and I have not b een able to ascertain whether they were

real genes but missannotated or known pseudogenes

Discussion of GeneWise

GeneWises p erformance should b e critically measured on two key criteria Firstly

how accurate is it in ob jective tests The results presented in this chapter along

with the two indep endent assessments show that GeneWise is very accurate This is

not surprising given the added homology information it brings to the problem but

it is go o d to see it indep endently veried This level of accuracy b eing around

means that GeneWise predictions can b e used in an automatic fashion by annotation

systems The second criteria is how useful is it in the practice Here the resp onse

to GeneWise has been less forthcoming GeneWises main strength is pro ducing

accurate gene predictions when there is a close homolog these were in many cases

the sort of genes which human annotators found easy and rewarding to annotate

However with the ever increasing data ow p eople havegrown accustomed to using

GeneWises prediction as a starting p ointto eventually annotating genes in some

p eoples exp erience this has reduced the time to annotate from hours p er gene to

under minutes

GeneWise is now used in many lo cations world wide The principle users are the

Sanger Centre and Celera a private company in the US and it has some devoted

followers at other places A common complaint is that GeneWise is very CPU

exp ensive In manyways I am unap ologetic ab out the cost GeneWise is fo cused on

getting the correct answer not on sp eed In my view sp ending an additional hours

worth of CPU time is a considerably b etter use of resources than an additional hour

of a human annotator However the CPU exp ense of GeneWise do es limit its use in

large scale pro jects Chapter details how I cop ed with this problem at the Sanger

Centre

I b elieve that the key to GeneWise was the analysis of the problem from theoret

ical principles I did not use complicated machine learning techniques nor invented

any radically new algorithms instead I simply lo oked at the problem as b eing the

application of two probabilistic Finite State Machines That view of the problem

suggested a solution based on the principled merging of the two Finite State Ma

chines As we have a coherent theory for how the algorithm works we can easily

add or remove parts of the mo del

Chapter

Pfam a protein family database

Intro duction

Protein sequences are the most conserved pieces of genetic information in genomes

For example the protein sequence of actin has remained conserved at around

identityover the last half billion years of eukaryotic evolution Some more striking

examples are the clear similarities of eukaryotic and prokaryotic rib osomal proteins

indicating a single evolutionary ancestor which one assumes was presentbeforethe

split of prokaryote and eukaryote lineages

Protein sequences are synthesised as linear p olyp eptide chains which then fold

up to provide the protein with a three dimensional shap e see section in the

Intro duction However the folding and sometimes the function of the protein

sequence can often b e split into continuous regions each region able to fold byand

large indep endently These regions are commonly called domains During evolution

dierent domains have b een combined to pro duce proteins of unique functions from

a large set of comp onents This reuse of domains as building blo cks for proteins

is clearly favoured by evolution compared to reinvention the ma jority of proteins

have b een found to be multidomain Over a large timescale one often nds

that the only similarities b etween distantly related proteins are within one or two

domains rather than along entire protein sequences

Of course this being biology not everything conforms to this scheme Firstly

there are quite a few proteins which are conserved as complete units during evolu

tion For example Rib osomal protein L is conserved in eubacteria RL ECOLI

METJA and Eukaryotes RL HUMAN over its entire length Archea RL

These can be considered to be single domain proteins where the domain is never

reused in a dierentcontext so conceptually suchcases t without problem into a

domain orientated approach to proteins Another class to consider includes cases

where the region conserved during evolution is not contiguous or colinear An ex

ample of these are Swap osins in which the homologous p ortions of the sequence

seemed to have b een circularly p ermutated during evolution More commonly

one nds one complete domain inserted into the middle of the sequence of another

domain Although there are a numb er of cases in which discontinuous domains are

present the ma jority of domains are continuous in sequence

A consequence of domain evolution is that if one fo cuses on dening and char

acterising domains one can provide a very informative deco ding of the protein

sequence even if one has never seen a protein with that particular collection of

domains b efore Some of the most impressive predictions using sequence analysis

have come from careful domain analysis of proteins For example Bhattacharya

and colleagues found the DNA cytosine methylase in humans by starting from the

identication of a methyl binding domain Domain analysis has also b een critical

for eective structure prediction In the structure prediction exercise CASP tar

get T was predicted correctly by only one group one of the main dierences

of their analysis compared to other groups was that they considered the domain

structure of the protein there were two copies of one domain b efore attempting to

predict its structure

protein prole HMMs of domains

One question in domain analysis is how to represent a protein domain in a com

putationally tractable manner After some years of work in the eld a rough

consensus of how to do this sensibly is using prole analysis which when

describing these proles as probabilistic mo dels gave rise to prole hidden Markov

models prole HMMs Prole HMMs attempt to mo del protein domains

using the standard HMM paradigm the sequence of amino acids are the observa

tions and the HMM pro duces these observations To greatly simplify the learning

of the form of the HMM to pro duce the amino acids the architecture of the HMM

is constrained to be a rep eated set of states joined in a left to right manner as

shown in gure in the previous chapter This architecture of HMMs was inspired

by the established technique of prole analysis whichprovided a numb er of successes

in analysing protein sequences alb eit using an ad ho c derivation of the parameters

The big improvementwhich prole HMMs provided was a mathematical framework

of how to parameterise and use these proles

The adopting of prole HMMs as a sensible way of mo deling protein domains

has b ecome widespread in bioinformatics

The Pfam database

The acceptance of a standard way of building mo dels of domains has allowed re

searchers to make databases of these mo dels These databases can be used as a

resource for analysing proteins in a tractable manner domain by domain The

number of proteins domains is considered to be p erhaps in total which is

not a large numb er compared to the around protein genes in a single mam

malian genome This size is small enough that a curated database can cover most

domains found in biology Three researchers Erik Sonnhammer Sean Eddy and

Richard Durbin set ab out to pro duce such a database using the prole HMM soft

ware written by Sean Eddy HMMER The database is called Pfam standing for

Protein FAMily and started as system based around at les kept in a Unix

directory structure with a numb er of shell scripts to maintain consistency and use

the database

The basic structure of Pfam has four les for each family The seed le is

a protein multiple alignment of a representative set of protein memb ers These

memb ers are chosen for their ability to pro duce a go o d prole HMM The HMM

le is the proleHMM built using HMMER from the seed alignment The ful l le

is a complete alignment of all memb ers of the domain that can be found with the

proleHMM in the complete protein database aligned with the aid of the HMM

Finally the desc le contains the do cumentation for the family

Requirements for Pfam as database

My primary role in Pfam ended up b eing to provide a stable environment to expand

the collection and work with Pfam As a database there were a number of challenges

to solve

Physical data integrity Pfam had traditionally relied on the users of the

database to know precisely how to manipulate the dierent parts This had

already caused a numb er of problems in p eople inadvertently deleting renam

ingorlosingdata One particular problem was that the implicit building rules

of Pfam that the HMM is built from the seed and the full alignment from

the HMM could be ignored This allowed a number of families to be stored

incorrectly

Sequence data integrity Pfam is a database of multiple alignments built

on top of a sequence database Pfam needed to be kept synchronised with

the underlying sequence database The particular challenge to this was when

the underlying sequence database changed all the sequences in Pfam could

potentially change having kno ckon eects on the seed alignment

Internal data integrity One early rule adopted in Pfam is that no two domains

could overlap on any sequence This rule had to b e implemented by a program

that checked that new data conformed to this rule and there had to b e a way

of ensuring this check o ccurred every time

Interactions with external programs Unlike more standard databases the

main ways of querying and using Pfam is not via adho c structured queries

Instead the HMMs are used to search protein databases and multiple align

ments are viewed and edited in sp ecialised programs This means that the

database has to interact well with these external programs whose input is le

based

Pro ductivity of the database curators As Pfam expands a greater load is

placed on the database curators to maintain the database By providing sys

tems to automate some tasks considerable work can be removed from the

curators allowing curators to b e more pro ductive

The Pfam Database Management System

The rst task with Pfam was to provide stronger physical integrity of the data

The database was very le based strongly suggesting a le orientated solution I

to ok the pragmatic decision to store the data as a series of RCS Revision Source

Utility Description

pfco Check out a family to lo cal area

pfci Check in a family back to the database

pfup date Check out a family without a lo ck

pfab ort Ab ort a checked out family

pnfo Provide information on a family

pfnew Check in a new family to the database

pfmove Rename a pfam family

pfadduser Give another user access to the Pfam database

pfhelp Help on pfam database management system

Table Pfam Database Management System utilities

Control les indexed by using a Unix directory system and a lightweight Berkeley

style database The users would not access this system directly instead a collection

of utilities provided access to the database Table

The pfco utility allows users to checkout a family from the database this

places a lo ck using the inbuilt lo cking facilities of RCS on this family preventing

anyone else from editing the family The user is then be free to make changes to

the family lo cally Once he or she is happy with the family the changes can be

committed back into the database using the pfci utility Alternatively if the user

wants to discard the edits the pfabort utilityremoves the lo ck on the family without

committing anychanges

Tomake a new familythepfnew utility creates a new database index assigns a

new accession numb er and checks in the family

Anumb er of other supp orting utilities are provided pfupdate provides the abil

ity to get a lo cal copy of a family without placing a lo ck on it pnfo provides

information ab out a family directly to the terminal and in a dierent mo de will

provide overview information on the entire database indicating which families had

b een lo cked by which users pfmove gives the ability to rename a family I had

fo olishly decided to make the primary index on the human readable name and not

the pure identier of the family its accession number This was a design error and

why pfmove hadtobewrittento cop e with renaming a family pfadduser gives a

way to add a new user with readwrite p ermissions to the Pfam database

Triggers on data entry

One clear role for the database management system was to force data integrity This

is achieved by running a numb er of programs which ensures the integrity of the data

b efore it can b e checked backinto the database as the database management system

is le based there are not many inherentintegrity checks in the way the database

is structured There are three main integrity checks b efore allowing data in the

database

The le mo dication times have to satisfy the rule that the seed le must b e

older than the HMM le which in turn must b e older than the full le This

rule ensures that the family has b een built correctly

As the database system is le based rules ab out data integrity can not be

directly represented in the database To ensure that the internal structure is

correct a utility called pqcformat was written whichchecked the integrityof

the desc seed and full les

The overlap rule has to b e satised A utility called pqcoverlap was written

which compares the current family to all families in the database agging

overlaps This utility writes a le containing all the overlaps in the family

directory This way the pfci or pfnew scripts can check that both the overlap

check has b een run and that it has rep orted no overlaps

Pro ductivity to ols

When Pfam started most of the asp ects of building families required the user to

rememb er a string of connected commands This b oth to ok a signicant amountof

the curators time and also created a large hindrance for new p eople coming joining

the Pfam group To reduce these problems I develop ed a number of pro ductivity

to ols for the Pfam database

The main to ol is automake table whichremoves the mechanical pro cess

of scanning the HMMER output le to nd the sequence regions that have been

matched building a fasta le of these regions and building then an alignment using

the Fasta le and the HMM In addition automake writes some statistics ab out

to ols Description

automake Takes HMMER output optional manual cutos and builds

new FULL le placing certain information in the DESC le

HMMResults mo dules Ob jects representing the output of a HMMER search used

in automake and the Pfam Web site

Table Pfam Pro ductivitytools

the distribution of scores near the cuto into the Desc le These were previously

maintained by hand and prone to error

A key piece of functionalityfor automake is the abilityto scan HMMER out

put les The HMMER search rep orts a bit score and exp ectation value for each

sequence and a bit score and exp ectation value for each domain by careful manipu

lation of the cuto at b oth the sequence and the domain level b etter p erformance in

nding some domains in particular rep eats could b e managed The HMMResults

set of mo dules encapsulates the parsing of HMMER les and ltering with this two

level threshold on the results for furthering pro cessing

Underlying Sequence database up date

All sequences in alignments must be valid sequences in the underlying sequence

database of Pfam Ab out every months we change the underlying sequence

database to a newer version of a non redundant database set from the Swissprot

sptrembl databases When this o ccurs all the seed alignmentsneedtobechecked

to ensure that the sequences that they contain are consistent with the new database

When sequences change these changes have to b e reected in the alignment and man

ually veried Once the seed alignments are consistent with the database HMMs

need to b e rebuilt and searched against the new database from which the new full

les can b e made

The main problem in all this is that the underlying sequences can change A

sequence can change in a numb er of dierentways

The sequence can be radically dierent from previously given often b ecause

the annotators have recognised a frameshift in the DNA sequence or a gene

misprediction

The sequence mayhave had a large addition or deletion of a region outside the

region of interest for the domain Although the domain sequence in the seed

alignnment therefore is still in the sequence the co ordinates havechanged

The sequence can have some small changes due to the recognition of sequencing

error in the DNA sequence or merging of a numb er of p olymorphic forms of the

sequence only one reference sequence is provided for p olymorphic sequences

in Swissprot

The sequence could have been merged with another logically equivalent se

quence often given rise to small changes but more imp ortantly making it

harder to track in the database as one has to use secondary accession numb ers

to track merges

The sequence can b e deleted all together

These changes in sequence cause ma jor problems in the remapping of the seed

alignments As some seed alignments havechanged and so the HMMs are p otentially

dierent one would like to knowhowmany full matches have b een lost during the

migration however due to changes in sequence and in particular sequence merging

tracking sequences across twoversions of the database is far from trivial

Just b efore I to ok over managing the Pfam database a single researcher sp ent

around an entire month doing the move of Pfam to a new version of the sequence

database Since the number of families in Pfam has increased by a factor of four

it was clear that such lab our intensive work could not b e maintained To alleviate

this I provided an automated system to take the brunt of the work in the database

move

After some less successful attempts I develop ed a robust system in which the

automatic moving pro cess is go o d enough to run unattended Researchers can then

briey examine byeye the families whichhave underlying sequence changes and in

most cases accept the nal result only a few cases require manual intervention

The key part of the system was the ReSeed mo dule which takes a multiple

alignment on one underlying database and attempts to map it to a new database

For each sequence it uses the following rules

nd the new sequence p otentially using secondary accession numb ers

if the old and new sequences are identical accept

if it is p ossible to nd the subsequence used in the alignment in the new

sequence even if at a dierent sequence p osition simply change the startend

points in the new sequence

otherwise align the old and new sequences using the SmithWaterman al

gorithm Using the co ordinates of the old sequence nd the corresp onding

co ordinates in the new sequence

Finally align all the changed fragments back to the unchanged p ortion of the

alignment using the old HMM as a guide

This pro cedure is robust to changes in the new sequence including insertion

and deletions of residues The tolerance towards indels is due to using the Smith

Waterman algorithm to map old domain co ordinates to new co ordinates rather than

xed length matching and following that by using the HMM to guide the alignment

of the new fragmentonto the new alignment

Middleware Layer

As the sophistication of Pfam grew a numb er of scripts were written which accessed

the Pfam database for a variety of reasons for example calculating statistics or

automating the insertion of new structure links into the database This growing

body of scripts b ecame a large burden to write and maintain many of the scripts

had similar co de segments for example to lo op over the entire database More

worryingly when we made changes to the layout of the database it was not clear

which scripts would b e aected and how This cloud of interactions with the Pfam

database was eectively stiing any potential improvements of the actual database

as changes would invariably have kno ck on eects on a numb er of scripts

To counter this I built a middleware layer that shielded the actual database

access from the script writer Instead access to the database was provided as a single

ob ject which hid all the implementation details of the database The metho ds of the

ob ject allowed the retrieval of database entries as ob jects these entries had further

metho ds to retrieve parts of the entry for example the seed and full alignment

Figure shows the main ob jects and how they interrelate The key feature of

the middleware design was sensible separation of the Pfam sp ecic information from

more general annotation information and the actual multiple alignments Access to

the HMM itself was not provided by the middleware layer



The middleware layer is written in Perl and based on Biop erl Biop erl is a

op en software pro ject to pro duce stable ob jects in Perl for bioinformatics For the

Pfam middleware layer I reused a numb er of ob jects from Biop erl in particular the

Sequence ob ject BioSeq and the alignment ob ject BioSimpleAlign

The middleware layer has b een used to supp ort scripts which interacted with

the database and also the web site A fully functional web site is an imp ortant part

of a public bioinformatics resource the web site at the Sanger centre was written

with an eye for extendibility by maintainers over the years with the middleware

layer providing insulation of the html rendering co de from the database access

Some Example families

As well as b eing the database administrator for Pfam in charge of maintaining the

integrity of the database as a system I maintained a numb er of actual families in the

database These ranged from simple families whichI have had a long term interest

in through to dicult families which required many manipulations to maintain their

biological correctness I list mycontributions to the data curation in App endix D

The RNA Recognition Motif

Ihave had a long asso ciation with RNA binding proteins as they were the original

reason I became interested in computational biology One of the principle RNA

binding domains is the RNA Recognition Motif This domain of around

amino acids is one of the most abundant in biology and certainly the most abundant

domain involved in RNA pro cessing Anumb er of crystal structures are known for

this domain and there is a large series of bio chemical exp eriments

into the action of certain family memb ers for example

The seed alignment of the RRM family is shown in gure It is quite large for

a Pfam seed alignment indicating the large numb er of protein sequences required to



Biop erl is co ordinated from httpbiop erlorg In Janurary I b ecame the ocial co ordi

nator for biop erl DB get all acc() returns all accession A single numbers in DB Pfam DB

get Entry by acc()

Entry Bio::SimpleAlign seed() A single A multiple alignment Pfam Entry fulll() with methods to output with specific different formats Pfam methods

ann() Comment Annotation Generic DBLink biological

annotation Reference

Figure A diagram showing the essential ob jects in the Pfam middleware Each

rectangle indicates an ob ject The lines between rectangles indicates a metho d

which pro duces the ob ject BioSimpleAlign comes from the biop erl pro ject Most

metho ds are not shown for simplicity File Edit Colour Sort Picked: (90x103) ------10------20------30------40------50------60------70------80------90------100--- ARP2_PLAFA 364 438 VEVTYLF....STYLVNGQTL..IYS.....N.ISVV....LVILY...... HQKFKETVLGRNSGFGFVSYDNVISAQHAIQFMNG.Y...FVNNKYLKV CABA_MOUSE 77 147 MFVGGL...... SWDTSKKDLKDYFT.....K.FGEV..VDCTIKMD...... PNTGRSRGFGFILFKDSSSVEKVLD.QKE.H...RLDGRVIDP CABA_MOUSE 161 231 IFVGGL...... NPEATEEKIREYFG.....Q.FGEI..EAIELPID...... PKLNKRRGFVFITFKEEDPVKKVLE.KKF.H...TVSGSKCEI CPO_DROME 453 526 LFVSGL...... PMDAKPRELYLLFR.....A.YEGY..EGSLLKV...... TSKNGKTASPVGFVTFHTRAGAEAAKQDLQGVR...FDPDMPQTI CST2_HUMAN 18 89 VFVGNI...... PYEATEEQLKDIFS.....E.VGPV..VSFRLVYD...... RETGKPKGYGFCEYQDQETALSAMRNLNG.R...EFSGRALRV D111_ARATH 281 360 LLLRNMVG.PGQVDDELEDEVGGECA.....K.YGTV..TRVLIFE...... ITEPNFPVHEAVRIFVQFSRPEETTKALVDLDG.R...YFGGRTVRA ELAV_DROME 250 322 LYVSGL...... PKTMTQQELEAIFA.....P.FGAI..ITSRILQN...... AGNDTQTKGVGFIRFDKREEATRAIIALNG.T...TPSSCTDPI ELAV_DROME 404 475 IFIYNL...... APETEEAALWQLFG.....P.FGAV..QSVKIVKD...... PTTNQCKGYGFVSMTNYDEAAMAIRALNG.Y...TMGNRVLQV EWS_HUMAN 363 442 IYVQGL...... NDSVTLDDLADFFK.....Q.CGVV..K.MNKRTG....QPMIHIYLDKETGKPKGDATVSYEDPPTAKAAVEWFDG.K...DFQGSKLKV GBP2_YEAST 124 193 IFVRNL...... TFDCTPEDLKELFG.....T.VGEV..VEADIIT...... SKGHHRGMGTVEFTKNESVQDAISKFDG.A...LFMDRKLMV GBP2_YEAST 221 291 VFIINL...... PYSMNWQSLKDMFK.....E.CGHV..LRADVELD...... FNGFSRGFGSVIYPTEDEMIRAIDTFNG.M...EVEGRVLEV GBP2_YEAST 351 421 IYCSNL...... PFSTARSDLFDLFG.....P.IGKI..NNAELKP...... QENGQPTGVAVVEYENLVDADFCIQKLNN.Y...NYGGCSLQI GR10_BRANA 8 79 CFVGGL...... AWATGDAELERTFS.....Q.FGEV..IDSKIIND...... RETGRSRGFGFVTFKDEKSMKDAIDEMNG.K...ELDGRTITV HUD_HUMAN 48 119 LIVNYL...... PQNMTQEEFRSLFG.....S.IGEI..ESCKLVRD...... KITGQSLGYGFVNYIDPKDAEKAINTLNG.L...RLQTKTIKV IF3X_SCHPO 41 124 VVIEGAP....VVEEAKQQDFFRFLSSKVLAK.IGKVKENGFYMPFE...... EKNGK..KMSLGLVFADFENVDGADLCVQELDGKQ...ILKNHTFVV IF3X_YEAST 79 157 IVVNGAPVIPSAKVPVLKKALTSLFS.....K.AGKV..VNMEFPID...... EATGKTKGFLFVECGSMNDAKKIIKSFHGKR...LDLKHRLFL IF4B_HUMAN 98 168 AFLGNL...... PYDVTEESIKEFFR.....G.LNIS...AVRLPR...... EPSNPERLKGFGYAEFEDLDSLLSALS.LNE.E...SLGNRRIRV LA_DROME 151 225 AYAKGF...... PLDSQISELLDFTA.....N.YDKV..VNLTMRNS...... YDKPTKSYKFKGSIFLTFETKDQAKAFLE.QEK.I...VYKERELLR LA_HUMAN 113 182 VYIKGF...... PTDATLDDIKEWLE.....D.KGQV..LNIQMRR...... TLHKAFKGSIFVVFDSIESAKKFVE.TPG.Q...KYKETDLLI MEI2_SCHPO 197 265 LFVTNL...... PRIVPYATLLELFS.....K.LGDV..KGIDTSSL...... STDGICIVAFFDIRQAIQAAKSLRSQR...FFNDRLLYF MODU_DROME 177 246 VFVTNL...... PNEYLHKDLVALFA.....K.FGRL..SALQRFTN...... LNGNKSVLIAFDTSTGAEAVLQAKPKAL...TLGDNVLSV MODU_DROME 260 326 VVVGLI...... GPNITKDDLKTFFE.....K.VAPV..EAVTISSN...... RLMPRAFVRLASVDDIPKALK.LHS.T...ELFSRFITV MODU_DROME 342 410 LVVENVG....KHESYSSDALEKIFK.....K.FGDV..EEIDVVC...... SKAVLAFVTFKQSDAATKALAQLDG.K...TVNKFEWKL MODU_DROME 422 484 ILVTNL...... TSDATEADLRKVFN.....D.SGEI..ESIIMLG...... QKAVVKFKDDEGFCKSFL.ANE.S...IVNNAPIFI MSSP_HUMAN 31 102 LYIRGL...... PPHTTDQDLVKLCQ.....P.YGKI..VSTKAILD...... KTTNKCKGYGFVDFDSPAAAQKAVSALKA.S...GVQAQKAKQ NAM8_YEAST 165 237 IFVGDL...... APNVTESQLFELFI.....NRYAST..SHAKIVHD...... QVTGMSKGYGFVKFTNSDEQQLALSEMQG.V...FLNGRAIKV NONA_DROME 304 369 LYVGNL...... TNDITDDELREMFK.....P.YGEI..SEIFSNLD...... KNFTFLKVDYHPNAEKAKRALDG.S...MRKGRQLRV NONA_DROME 378 448 LRVSNL...... TPFVSNELLYKSFE.....I.FGPI..ERASITVD...... DRGKHMGEGIVEFAKKSSASACLRMCNE.K...CFFLTASLR NOP3_YEAST 127 190 LFVRPF...... PLDVQESELNEIFG.....P.FGPM..KEVKILN...... GFAFVEFEEAESAAKAIEEVHG.K...SFANQPLEV NOP3_YEAST 202 270 ITMKNL...... PEGCSWQDLKDLAR.....E.NSLE..TTFSSVN...... TRDFDGTGALEFPSEEILVEALERLNN.I...EFRGSVITV NOP4_YEAST 28 98 LFVRSI...... PQDVTDEQLADFFS.....N.FAPI..KHAVVVKD...... TNKRSRGFGFVSFAVEDDTKEALAKARK.T...KFNGHILRV NOP4_YEAST 292 363 VFVRNV...... PYDATEESLAPHFS.....K.FGSV..KYALPVID...... KSTGLAKGTAFVAFKDQYTYNECIKNAPA.A...GSTSLLIGD NSR1_YEAST 170 241 IFVGRL...... SWSIDDEWLKKEFE.....H.IGGV..IGARVIYE...... RGTDRSRGYGYVDFENKSYAEKAIQEMQG.K...EIDGRPINC NSR1_YEAST 269 340 LFLGNL...... SFNADRDAIFELFA.....K.HGEV..VSVRIPTH...... PETEQPKGFGYVQFSNMEDAKKALDALQG.E...YIDNRPVRL NUCL_CHICK 283 352 LFVKNL...... TPTKDYEELRTAIK.....EFFGKK...NLQVSEV...... RIGSSKRFGYVDFLSAEDMDKALQ.LNG.K...KLMGLEIKL PABP_DROME 4 75 LYVGDL...... PQDVNESGLFDKFS.....S.AGPV..LSIRVCRD...... VITRRSLGYAYVNFQQPADAERALDTMNF.D...LVRNKPIRI PABP_DROME 92 162 VFIKNL...... DRAIDNKAIYDTFS.....A.FGNI..LSCKVATD...... EKGNSKGYGFVHFETEEAANTSIDKVNG.M...LLNGKKVYV PABP_DROME 183 254 VYVKNF...... TEDFDDEKLKEFFE.....P.YGKI..TSYKVMS...... KEDGKSKGFGFVAFETTEAAEAAVQALNGKD...MGEGKSLYV PABP_SCHPO 263 333 VYIKNL...... DTEITEQEFSDLFG.....Q.FGEI..TSLSLVKD...... QNDKPRGFGFVNYANHECAQKAVDELND.K...EYKGKKLYV PES4_YEAST 93 164 LFIGDL...... HETVTEETLKGIFK.....K.YPSF..VSAKVCLD...... SVTKKSLGHGYLNFEDKEEAEKAMEELNY.T...KVNGKEIRI PES4_YEAST 305 374 IFIKNL...... PTITTRDDILNFFS.....E.VGPI..KSIYLSN...... ATKVKYLWAFVTYKNSSDSEKAIKRYNN.F...YFRGKKLLV PR24_YEAST 43 111 VLVKNL...... PKSYNQNKVYKYFK.....H.CGPI..IHVDVAD...... SLKKNFRFARIEFARYDGALAAIT.KTH.K...VVGQNEIIV PR24_YEAST 119 190 LWMTNF...... PPSYTQRNIRDLLQ.....D.INVV.ALSIRLPSL...... RFNTSRRFAYIDVTSKEDARYCVEKLNG.L...KIEGYTLVT PR24_YEAST 212 284 IMIRNL.....STELLDENLLRESFE.....G.FGSI..EKINIPAG...... QKEHSFNNCCAFMVFENKDSAERALQ.MNR.S...LLGNREISV PSF_HUMAN 373 443 LSVRNL...... SPYVSNELLEEAFS.....Q.FGPI..ERAVVIVD...... DRGRSTGKGIVEFASKPAARKAFERCSE.G...VFLLTTTPR PTB_HUMAN 61 128 IHIRKL...... PIDVTEGEVISLGL.....P.FGKV..TNLLMLKG...... KNQAFIEMNTEEAANTMVN.YYT.SVTPVLRGQPIYI PTB_HUMAN 186 253 IIVENL...... FYPVTLDVLHQIFS.....K.FGTV....LKIIT...... FTKNNQFQALLQYADPVSAQHAKLSLDG.Q...NIYNACCTL PUB1_YEAST 76 146 LYVGNL...... DKAITEDILKQYFQ.....V.GGPI..ANIKIMID...... KNNKNVNYAFVEYHQSHDANIALQTLNG.K...QIENNIVKI PUB1_YEAST 163 234 LFVGDL...... NVNVDDETLRNAFK.....D.FPSY..LSGHVMWD...... MQTGSSRGYGFVSFTSQDDAQNAMDSMQG.Q...DLNGRPLRI PUB1_YEAST 342 407 AYIGNI...... PHFATEADLIPLFQ.....N.FGFI..LDFKHYPE...... KGCCFIKYDTHEQAAVCIVALAN.F...PFQGRNLRT RB97_DROME 34 104 LFIGGL...... APYTTEENLKLFYG.....Q.WGKV..VDVVVMRD...... AATKRSRGFGFITYTKSLMVDRAQE..NRPH...IIDGKTVEA RN12_YEAST 200 267 IVIKFQ...... GPALTEEEIYSLFR.....R.YGTI....IDIFP...... PTAANNNVAKVRYRSFRGAISAKNCVSG.I...EIHNTVLHI RN15_YEAST 20 91 VYLGSI...... PYDQTEEQILDLCS.....N.VGPV..INLKMMFD...... PQTGRSKGYAFIEFRDLESSASAVRNLNG.Y...QLGSRFLKC RNP1_YEAST 37 109 LYVGNL...... PKNCRKQDLRDLFE.....PNYGKI..TINMLKKK...... PLKKPLKRFAFIEFQEGVNLKKVKEKMNG.K...IFMNEKIVI RO28_NICSY 99 170 LFVGNL...... PYDIDSEGLAQLFQ.....Q.AGVV..EIAEVIYN...... RETDRSRGFGFVTMSTVEEADKAVELYSQ.Y...DLNGRLLTV RO33_NICSY 116 187 LYVGNL...... PFSMTSSQLSEIFA.....E.AGTV..ANVEIVYD...... RVTDRSRGFAFVTMGSVEEAKEAIRLFDG.S...QVGGRTVKV RO33_NICSY 219 290 LYVANL...... SWALTSQGLRDAFA.....D.QPGF..MSAKVIYD...... RSSGRSRGFGFITFSSAEAMNSALDTMNE.V...ELEGRPLRL ROA1_BOVIN 106 176 IFVGGI...... KEDTEEHHLRDYFE.....Q.YGKI..EVIEIMTD...... RGSGKKRGFAFVTFDDHDSVDKIVI.QKY.H...TVNGHNCEV ROC_HUMAN 18 82 VFIGNL.....NTLVVKKSDVEAIFS.....K.YGKI..VGCSVHK...... GFAFVQYVNERNARAAVAGEDG.R...MIAGQVLDI ROF_HUMAN 113 183 VRLRGL...... PFGCTKEEIVQFFS.....G.LEIV.PNGITLPVD...... PEGKITGEAFVQFASQELAEKALG.KHK.E...RIGHRYIEV ROG_HUMAN 10 81 LFIGGL...... NTETNEKALEAVFG.....K.YGRI..VEVLLMKD...... RETNKSRGFAFVTFESPADAKDAARDMNG.K...SLDGKAIKV RT19_ARATH 33 104 LYIGGL...... SPGTDEHSLKDAFS.....S.FNGV..TEARVMTN...... KVTGRSRGYGFVNFISEDSANSAISAMNG.Q...ELNGFNISV RU17_DROME 104 175 LFIARI...... NYDTSESKLRREFE.....F.YGPI..KKIVLIHD...... QESGKPKGYAFIEYEHERDMHAAYKHADG.K...KIDSKRVLV RU1A_HUMAN 12 84 IYINNLNE..KIKKDELKKSLYAIFS.....Q.FGQI..LDILVSR...... SLKMRGQAFVIFKEVSSATNALRSMQG.F...PFYDKPMRI RU1A_HUMAN 210 276 LFLTNL...... PEETNELMLSMLFN.....Q.FPGF..KEVRLVPG...... RHDIAFVEFDNEVQAGAARDALQG.F...KITQNNAMK RU1A_YEAST 229 293 LLIQNL...... PSGTTEQLLSQILG.....N.EALV...EIRLVSV...... RNLAFVEYETVADATKIKNQLGS.T...YKLQNNDVT RU2B_HUMAN 9 81 IYINNMND..KIKKEELKRSLYALFS.....Q.FGHV..VDIVALK...... TMKMRGQAFVIFKELGSSTNALRQLQG.F...PFYGKPMRI RU2B_HUMAN 153 220 LFLNNL...... PEETNEMMLSMLFN.....Q.FPGF..KEVRLVPG...... RHDIAFVEFENDGQAGAARDALQGFK...ITPSHAMKI

SFR2_CHICK 16 87 LKVDNL...... TYRTSPDTLRRVFE.....K.YGRV..GDVYIPRD...... RYTKESRGFAFVRFHDKRDAEDAMDAMDG.A...VLDGRELRV

Figure Figure showing the RRM seed alignment The protein sequences are on

the left hand side of the gure In the alignment columns conserved columns are

shaded

make an eective prole HMM Included in the alignmentareanumber of atypical

RRMs which help to educate the HMM ab out some of the more distant memb ers

of the domain This is a particular problem for this family as some of the most

conserved features of the domain from typical families are the surface facing aromatic

residues involved in RNA binding Despite their strong conservation in typical

families there are a numb er of RRMs which do not have these aromatic residues but

have all the structural core residues and bind RNA for example hnRNPL Without

the inclusion of these atypical RRMs many structually correct domains would be

missed by the HMM

Protein complexes

I have taken an interest in a number of protein complexes One problem of these

common complexes is that often researchers dismiss them as uninteresting and do

not sp end as muchtimechecking their annotation in genome pro jects A common

mistake is to misname proteins due to naming dierences between bacterial and

eukaryotic proteins When the top hit is a bacterial sequence they might give

the bacterial name to a eukaryotic protein although the naming convention for

the complex in eukaryotes is dierent For example in the F F ATP synthase

O 

complex the D subunit in bacteria corresp onds to the Oligomycin sensitive subunit

in Eukaryotes also called the O subunit and there are Delta and Epsilon subunits

in Eukaryotes which corresp onds to the Epsilon subunit in bacteria Just to add

more confusion in Eukaryotes there is a D subunit for Vacuolar ATPases whichare

related to the F F ATPases in some asp ects but not others This means that there

O 

are three p ossible typ es of ATP synthase D subunits all dierent the Bacterial D

subunit the Eukaryotic Delta subunit and the Vacuolar D subunit The confusion

here is inherent in the naming of the protein complexes and there is very little

that can be done now to solve this problem in the general literature Pfam on the

other hand can provide an accurate single annotation automatically there are three

separate Pfam families for these D subunit typ es and the description makes it

clear what the dierent naming conventions mean in dierent organisms

Ihaveworked on the ATP synthase the Signal Recognition Particle and a num

ber of the electron transp ort comp onents The seed alignment construction and

HMM building in these families is trivial in every case the separation of signal

from noise is considerable However tracking down the precise annotation for these

families is a harder task as it requires nding retrieving and reading the relevant

pap ers The reward is knowing that the problems in naming and classication can

b e eliminated as resources like Pfam b ecome more widely used

Discussion

Pfam has grown over while I was managing the database During that time

wehave had an additional four p eople work with the database at dierent times all

of whom b oth had to familarise themselves with the to ols and had the chance of p o

tentially breaking the database by issuing inappropriate commands The database

management system I wrote both scaled well to the increase of data and also pro

tected users from corrupting the data We have not lost track of a single family

in Pfam and when there was some ma jor corruption or mistake in the database

thankfully rare wewere able to recover the data easily by direct manipulation of

the RCS les Scalability and robustness are two key features of a database and I

achieved them in this tailored database management system

The database system relies on RCS les rather than a relational DBMS or a

ob ject based DBMS which would have been a more standard choice One reason

why these solutions is not as appropriate is that some of the main query mechanisms

and views of the database are through sp ecialised software to compare HMMs to

proteins or view multiple alignments If we had implemented the database around a

more traditional system there would have b een additional software to write to view

our data In addition RCS provides for free a full history for each family giving the

security of a reasonably easy wayofbacktracking changes

The middleware layer insulates the access of the database by other programs

from knowing to o much ab out how the database is actually stored on disk This

allows the database management system to change without breaking scripts that

interact with it and makes writing scripts that interact with the database simpler

Evaluating how useful the middleware layer has b een is dicult as we do not have

an alternative In some cases the middleware layer has saved time and other cases

it has prevented quickdevelopment as new features have to b e implemented in the

middleware b efore they can b e actually used



Pfam has b ecome a memb er database of the Interpro consortium Interpro is

a p o oling of the do cumentation resources of Pfam Prints and Prosite along with

the maintenance of a consistent set of matches of these databases along with other

motifdomain databases on Swissprot sptrembl This p o oling of the do cu

mentation eort is imp ortant as do cumentation is one of the most time consuming

parts of these databases mainly b ecause it is almost imp ossible to automate The

Interpro pro ject also encourages the dierent databases to play to their strengths

rather than comp eting to provide all asp ects of motif databases

Pfam provides a way to maintain a large number of multiple alignments robustly

an imp ortant part of that strategy is the prole HMM These prole HMMs are also

useful in their own right to b e used by other analysis to ols The GeneWise algorithms

describ ed in this thesis Chapter can work with proleHMM libraries such as

those generated by Pfam as well as alignments to protein sequences The next

chapter details how using GeneWise Pfam can be applied to eukaryotic genomes

at a large scale

The curation of a database such as Pfam is a satisfying task I have been

able to keepaninterest in a numb er of protein families such as the RRM and EGF

domains over the last years and I have maintained these families with considerable

attention I b elieve that the underlying technology and basis of Pfam can b e applied

across all protein sequences which would place Pfam in a central role to provide

consistent accurate and automatic identication for all protein families



The interpro web pages are at httpwwwebiacukinterpro

Chapter

Genome

Intro duction

Pfam represents a large p ortion of protein families covering b etween of pro

teins in a genome However applying Pfam to eukaryotic genomes is dep endent

on accurate gene prediction to form go o d protein sequences GeneWise Chapter

eliminates the need for go o d gene predictions as it can compare a protein pro

le HMM directly to the genomic sequence By combining GeneWise and Pfam

it should be p ossible to nd and classify around half the genes in a genome in an

entirely automated manner

Such a comparison has two b enets Firstly it has the p otential of nding genes

which other programs have missed or when the gene has already been predicted

improving their gene prediction The accuracy of GeneWise section should

provide high quality predictions Secondly it allows the investigation of the domain

content of genomes without worrying ab out accurate gene prediction nor for that

matter accurate sequencing as GeneWise is tolerant to sequencing errors

The main complication to applying GeneWise on a genome wide scale is the

CPU exp ense of the metho d It takes ab out a second to compare one protein hidden

Markov mo del against bases of genomic sequence A genome of MBase

the size of Celegans against protein prole HMMs in Pfam will take a total

of around days of CPU time Even with large compute farms this is a number

of months and not practical with current resources

In this chapter I showhowI overcame the problem of CPU limitations by using

a heuristic sp eed up Then gene predictions by Pfam and GeneWise of the Celegans

genome Mbases are compared to the curated annotations on that genome This

provides an estimate of the numb er of new genes whichwere missed by the Celegans

annotation pro ject and an estimate of the numb er of domains missed by GeneWise

due to more sensitive protein comparisons Finally I p erformed comparisons to a

p ortion of the human genome the nished Chromosome sequence Mbases

to assess how useful this analysis is against the human genome

Halfwise

The prohibitive computational cost of comparing all prole HMMs to genomic

DNA wasaproblemIovercame byemploying a prelter on which proleHMMs are

used in the GeneWise comparison This prelter selects only a small number of the

HMMs to do a full GeneWise searchwith The combined action of the prelter and

GeneWise is encapsulated in a single Perl script called halfwise By selecting only

a small fraction of the p otential prole HMMs to b e searched against each genomic

sequence fragment I could cut the numb er of prole HMMs down to b etween to

HMMs from meaning the total running time of GeneWise would b e ab out

of the exhaustive search and as long as the prelter was quick enough the

total time reduced byuptotwo orders of magnitude

The prelter I used was to search the DNA sequence with BLASTX against a

protein database which had been designed to capture nearly all the signal present

in the Pfam prole HMMs This database was constructed by taking the Pfam full

alignments and making them non redundant at the identity level removing

close homologs To ensure that this database did capture the vast ma jority of the

Pfam information I constructed this database with varying levels of identity and

then searched a random selection of proteins from each Pfam family against the

database proteins in all scoring whether these families hit or missed anyof

the Pfam families present The results are shown in table

identity seemed to b e the lowest cuto which still provides eectivecoverage

over Pfam with almost no loss on the test sample

id Size of DB Numb er of missing assignments

Table A table showing how the eectiveness of lowering the cuto for non

redundancy eects the numb er of family hits retrievable by the BLAST metho d

Worm Genome

The rst genome sequence to be examined was the worm genome To compare

the entire worm genome against Pfam each sequencing unit generally a cosmid

or a YAC fragment was run through halfwise This do es mean that there will b e

problems with genes which span more than one sequencing unit but as domains

are generally small compared to genes this is not such a great problem All

megabases of the Celegans genome to ok around and half weeks of op eak com

puter time at the Sanger Centre

Halfwise pro duced predictions of protein domains across dierent

families on average around thirteen predictions p er domain though this distribution

is very skewed The predictions had exons on average two and a half exons

per domain predictions were of only one exon with the remaining

having introns

Domain Number Comment

Collagen rep eats copies of the GXY motif in the HMM

GLTT rep eats copies of GLTT rep eats Sp ecic to Celegans

Protein kinase The common serinetheroninetyrosine kinase

EGF like domains Extracellular domain See also Laminin EGF

Zinc nger C DNA binding domain

Fb ox asso ciated domain Expanded domain in worm

Fb ox domain Ubiquitination receptor domain

MATH domain Signal pro cessing domain

Worm chemoreceptor tm Worm sp ecic tm subfamily

WD rep eats Beta prop eller rep eat Around copies p er domain

Transp osase Found in mobile genetic elements

LDL receptor A domain Extracellular domain

Zinc nger CH domain DNA binding domain often multiple domains to a

protein

Ankyrin rep eats Rep eats found in structual proteins

Worm chemoreceptor tm Worm sp ecic tm subfamily

Ig sup er family domains Ubiquitous metozoan domain

Protamine Sp erm histone protein

Ion channel Includes voltage gated channels

Fibronectin typ e I I I domains Extracellular domain

RNA recognition motif Involved in many asp ects of RNA metab olism

Trypsin inhibitor Protease inhibitors common in signal transduction

Reverse Transcriptase Mobile genetic element domain

Ma jor Sp erm Protein domain Found in structual proteins including sp erm tails

Cyto chrome p Involved in metab olism Imp ortant medical impli

cations

DUF Domain of unknown function found mainly in

Celegans

tm Classical tm receptors

laminin EGF sub typ e of EGF domains

BTB domain Protein interaction domain

EB domain Unknown function expanded in C elegans

ABC transp orters Involved in metab olite transp ort

Homeob ox DNA binding domain

Lectin Ctyp e Ctyp e lectin domain

Trypanosome mucins Mucin like coat proteins

Sp ectrin Sp ectrin rep eats found in structual proteins

Table A table showing Pfam families with more than do

main hits in Celegans via halfwise analysis The table shows the

rawcounts for each Pfam HMM This can b e confusing for some

families for example the Collagen HMM represents a single re

p eat which is found many times in a collagen domain The num

b ers here are therefore an overestimate of the numb er of collagen

domains o ccurring in Celegans For other cases such as the

transmembrane family Pfam uses a numberofHMMstocover one

domain meaning that those cases should b e summed See the main

text for more details

Table shows the domains with or more hits in the Celegans genome Inter

preting this table requires some knowledge ab out the dierent Pfam families Some

Pfam families represent a rep eated motif which occurs many times sequentially

These families generally do not conform to the usual idea of a protein domain

in the sense of a single indep endently folding unit but instead form nonglobular

extended protein structures Collagen is a clear example of this and the GLTT

rep eat is also likely to b e another such nonglobular rep eat The numb ers therefore

in this table are overestimates ideally one would like to quote number of contin

uous stretches of these rep eats but as GeneWise cannot distinguish b etween gaps

between HMMs matches due to protein sequence and gaps due to intergenic regions

this is hard to deduce accurately Other families are unfairly represented in table

as they are split over a number of Pfam families There are two examples of

this in the table The transmembrane family is split over dierent Pfam families

tm and tm The tm and tm families were created sp ecically tm

to mo del the divergentchemoreceptors found in Celegans which are clearly trans

membrane receptors but cannot b e found with the conventional tm HMMThe

total number considering all three HMMs is making them the second largest

single family in Celegans when one ignores the structural rep eats of collagen and

GLTT rep eats The largest family is the EGF like domains which are split into the

EGF domain and the laminin EGF domain in Pfam making a grand total of

domains

It is annoying that the numb ers in cannot b e used to give precise answers to

questions such as what is the largest worm family and which families have had the

most expansion However it is hard to represent the complexity of the biological

domains to provide an automatic answer to such queries As any conclusion one

draws is qualitative in any case the lack of automatic reliable numb ers for each

domain is not such an issue

The gures in table broadly agree with other genome wide studies of Celegans

protein content In particular the expansion of the Zinc Finger C domain

and the transmembrane domain proteins have b oth b een previously noted

It is interesting to wonder whether selective family expansion is a general feature

to eukaryotic evolution or whether the worm is unique in its indulgence of sp ecic

family expansions Only the increase in the numb er of complete or mostly complete

eukaryotic genomes will give us this answer to ols such as Pfam and GeneWise will

greatly help obtaining these answers when the data is available

Comparison to curated worm genes

Thereisan active database group which maintains curated gene structures on the

worm genome This database contains a set of gene structures which usually were

the result of an ab initio gene prediction program genender followed bymanual ex

amination and curation by a skilled annotator In addition the program estgenome

was used to match cDNA and EST sequences back to the genomic DNA These

Introns Conrmed introns

Curated Genes

Halfwise Genes

Intersection

Table A table showing the numb er of dierentintrons b oth conrmed and not

conrmed in three dierent datasets Curated genes in Celegans Halfwise predic

tions and the intersection of the two datasets The curated genes and conrmed

introns are only from the Cambridge p ortion of the database as this information is

not available from the St Louis p ortion of the data

matches pro duced a numb er of conrmed intron features whichareintrons which

have an EST or cDNA crossing the splice b oundaries

Comparison of the worm gene dataset with the halfwise predictions provides a

useful feedbacktoboththeworm annotators ab out a number of potential mispredic

tions and myself for the accuracy of GeneWise There are three p ossible explanations

for discrepancies b etween the halfwise predictions and the curated gene set Firstly

halfwise could have mispredicted the gene Secondly the annotators could have

mispredicted the gene Thirdly there could be another genomic feature such as a

pseudogene or transp oson which explains the halfwise prediction but do es not indi

cate the presence of another real gene As we exp ect nearly all the conrmed introns

to be correct marking where conrmed introns disagree with halfwise predictions

will provides a go o d measure of halfwises accuracy

As table shows the vast ma jority of halfwise predictions are consistent with

the curated genes with of halfwise introns b eing identically placed in the cu

rated set and the halfwise predictions The numb ers of those which are conrmed by

EST is shown in the second column It is interesting to note that a higher p ercent

ageofintrons are conrmed in the curated genes than the intersection of the

curated genes and halfwise This indicates that the halfwise predictions are

o ccurring in areas of the genes which generally have less EST conrmation As the

EST sequences are derived from and sequence reads on attempted full length

cDNA clones this is not surprising as halfwise predictions are more likely to b e in

the central co ding p ortion and ESTs at the ends

Number Percentage

Consistent

Mo dications

Pseudogenes

Rep eats

Novel

Total

Table A table showing the distribution of exons predicted by halfwise b etween

annotated exons dierent exons from annotations but inside an annotated gene

Mo dications exons inside pseudogenes exons inside the rep eats and nally exons

not overlapping with any of the other genomic features

Indication of annotation errors

The halfwise analysis pro duced a numb er of gene predictions in regions where there

were no curated genes at all In some cases these overlapp ed with known pseudo

genes as GeneWise is only matching a domain and is also tolerant to sequencing

error detecting domains in pseudogenes is a reasonably common o ccurrence

Table shows the distribution of halfwise exons between annotated genes

annotated pseudogenes and other genomic regions Exons inside annotated genes

are split into two classes identical to annotated exons and those contained by

annotated exons but implying a dierent gene structure It is interesting to

note that lo oking at the halfwise statistics on the exon level gives a more favourable

view of GeneWise rather than comparisons at an intron level which is presented in

the next section This is b ecause there are a large numb er of GeneWise predictions

of single domains inside large exons for example a set of Zinc Finger rep eats which

improve the annotated exon count The other of exons lie outside of annotated

exons The predominant number here of the do not lie in pseudogenes

or rep eats indicating p ossible new genes or extensions of existing genes

GeneWise accuracy in the worm

The gures ab ove do not allow us to assess how many of the halfwise predictions

which are not consistent with the curated gene predictions are due to annotation

error or halfwise misprediction Table gives an indication of this

Total conrmed introns Consistent within Inconsistent

Halfwise

Table A table showing the accuracy of GeneWise when used in genomic scans

against the worm The total conrmed introns are conrmed introns whichoverlap

in some manner with the halfwise gene predictions The consistent column provides

the numb er of those which are also predicted by halfwise The nal column indicates

the inconsistentintrons

The conrmed introns which disagree with halfwise are mainly errors by

halfwise There are a number of cases which are due to other eects for example

a gene nested in the intron of another gene so there are two overlapping correct

introns Even allowing for a generous of the introns as b eing due to such

eects the results are not close to the sort of accuracy gures quoted in section

When examined visually many of the cases were due to short introns both

in the halfwise predictions and the worm genes The worm has a large number of

short introns of around base pairs and unfortunately it is hard for GeneWise to

distinguish b etween short introns and protein insertions relative to the proleHMM

The parameterisation which provides eective prediction of short introns also allows

the erroneous prediction of an intron to avoid making a long protein insertion

Assessment of the accuracy of intron placement by GeneWise is fair indication

of its eectiveness for nding annotation errors in a curated database However the

accuracy of GeneWise at a base pair or exon level is more relevant to its eectiveness

for gene identication purp oses where the aim is to nd exons with a high condence

to design primers The results in the previous section indicate that GeneWises exon

accuracy is far b etter than its intron accuracy however it is less easy to classify exons

as conrmed in a large scale automatic manner

Comparison to protein Pfam analysis

The worm database holds the results of a search of the curated worm proteins

sequences against Pfam Comparisons of the Pfam matches on the protein set

against the GeneWise matches provides an indication ab out the loss of sensitivity

to nding domains by using Halfwise on genomic DNA compared to the protein

analysis

This analysis is complicated byseveral factors Firstlyaspresented ab ove there

are a numb er of halfwise predictions whichdonotoverlap with known genes As we

know where these lie it is easy enough to account for these The second problem is

that the protein analysis pro duces domains in the p eptide co ordinates It is dicult

to compare these matches with the halfwise matches on the genomic sequence in a

true domain by domain fashion

The protein database has Pfam domains which is not that larger than

the predictions from halfwise analysis Of those predictions we know that

only do not overlap in anyway with predicted genes and the numb er of exons

in disagreement with curated genes are small enough for us to ignore This gives

a loss of sensitivity of around which I consider to b e quite go o d One technical

reason for why there is this loss of sensitivity is that GeneWise can only predict genes

when the loglikeliho o d ratio against an genomic DNA sequence mo del is ab ove zero

As some Pfam families have cutosofnegative log likeliho o d scores which is p ossible

using HMMER software these domains are imp ossible for GeneWise to predict

Human Chromosome

Recently Chromosome was nished with only small gaps interrupting the

otherwise continuous megabases of human DNA This contains the largest sin

gle contig of DNA sequence known Mb and the largest amount of contiguous

human DNA To appreciate the scale of the human genome these MB represent

approximately of the entire genome

Halfwise analysis was p erformed against the individual clones sequences that

make up the contiguous sequence as with Celegans In total domains were

predicted from dierent Pfam families with exons The exon numb ers are

broadly in agreementwiththeworm gures with somewhat more exons p er domain

around compared to worms and but the concentration average number of

domains per family is considerably lower domains per family compared to

in the worm

There are no real surprises in the domain comp osition of human as exp ected

there is a numb er of extracellular mosaic domains are found such as the EGFlike

domains and the Ig sup erfamily The large number of zinc nger CH domains

is also exp ected There are a number of gene clusters on chromosome such as

Domain Number Comment

Ig sup er family domains Ubiquitous metozoan domain

Zinc nger CH DNA binding domain

EGF like domains Extracellular mo dular domain

Protein kinase domain The common serinetheroninetyrosine kinase

Kelch domain Beta prop eller rep eat Around copies p er do

main

Mito chondrial Carrier protein Solute transp ort

Crystallin Lens protein

SH domains Signalling domain

Cyto chrome p Involved in metab olism

GProtein eector RhoGAP Signalling protein

Cadherin Cell adhesion domains

PI PI kinase Signalling domain

Myosin head Structual protein

Clathrin rep eats Vesicle transp ort

Sushi SCR Extracellular domain

Filament Structual domain

Trypsin Peptidase domain

transmembrane Classical tm proteins

PH domain Signalling protein

Zinc Finger CHC Ring Protein protein interactions

SH Signalling mo dular domain

Bromo domain Acetyllysine chromatin binding domain

Actin Structual domain

Glutathione Stransferases

BTB Protein interaction domain

protamine P Sp erm histone

Table A table showing Pfam families on Chromosome with more than

domains via halfwise analysis

Total Exons Consistent Mo dications Pseudogene Novel

Halfwise

Table A table showing the distribution of exons predicted by halfwise b etween

annotated exons exons overlapping an annotations but inside an annotated gene

Mo dications exons inside pseudogenes and novel exons The reference dataset

were the GD tagged gene structures from ace

the immunoglobulin cluster a glutathione S transferase cluster and a crystallin

cluster which explain the high numb er of these domains

Comparison to curated genes

As with the worm chromosome has an active database curation group and these

results can be compared to their predictions This comparison is hamp ered by a

numb er of practical issues Firstly it is harder for technical reasons to get consistent

co ordinate systems of both the curated genes and the halfwise predictions as the

clone assembly of chromosome is still somewhat in ux I was only able to

compare a little under half of the detected exons of Secondly I was

unable to get a set of conrmed intron features in the co ordinate system I needed

Finally the curators have b een using halfwise predictions to guide their exp eriment

design somewhat and so the comparison is not entirely unbiased

The comparison of halfwise predictions to curated genes is shown in table

The most striking gure in this list are the number of entirely novel exons

making of the exons used in this comparison Examination of some of these

exons showthemtobeinoraroundgeneswhich are b eing actively curated however

the correct data structures have not b een entered to allow them to b e automatically

compared to the halfwise predictions This inability to p erform sensible comparisons

is annoying but commonplace in this eld For exons whichdooverlap with anno

tated genes around a quarter are at disagreement with the annotations As for the

novel domains a large numb er of these disagreements are due to misrepresentations

of the genes in the database so it is hard to really assess the eectiveness of the

metho d

The chromosome annotation also provided PFAM analysis on the protein

genes a total of predicted proteins have PFAM hits from Pfam families

These gures indicate that the halfwise analysis mayhave found more protein genes

in chromosome Because of the dicultly in interpreting the gene lo cations

the precise mismatch cannot be indicated For example in the lo cus all the Ig

domains comprising the V regions were not annotated as p eptides but the halfwise

analysis predicts them

Co ding density of Human vs Celegans

Using the two comparisons of these large streches of genomic data with halfwise pro

vides a way to compare co ding density in the two genomes There are a considerable

numb er of p otential caveats to this analysis Is Pfam biased towards domains in one

of the genomes Do the genomes have dierentoverall domain comp osition in terms

of the number of domains per gene Are there more pseudogenes in one genome

Assuming that none of these biases are signicant this analysis suggests that there

is one Pfam domain p er kilobases in worm compared to one Pfam domain every

kilobases in human ab out seven times the co ding density This number is about

what is exp ected Extrap olating to the whole human genome this would suggest

around genes which is consistent with other estimates

Discussion

The two comparisons presented in this chapter show that by using Halfwise Pfam

can be compared directly to genomic sequence using the GeneWise algorithm As

GeneWise provides accurate gene prediction and is robust towards errors this means

that Halfwise can provide automatic annotation of genomic sequence One area for

improvement is the detection of small introns in inveterbate sequences such as the

worm It is likely that to b e able to detect these new approaches in parameterisation

are required which try to provide the best gene predictions and not the highest

likeliho o d of the parse The Class HMM and conditional maximum likeliho o d

metho ds of Anders Krogh provide a go o d framework to derive this parameterisation

The domain comp osition of the dierent organisms by the Halfwise analysis is

as exp ected from the domain comp osition found for their proteomes In general

there are a number of common domains in each organism some which seem to be

sp ecic to an organism such as the FBoxinworm and some which are ubiquitous

in eukaryotes such as protein kinases It is likely that as more metazoan genomes

are analysed trends in how and why domains are expanded in sp ecic lineages will

b ecome clearer currently with only two metazoan genomes with large unbiased

genome coverage human and worm available it is hard to draw conclusions ab out

domain evolution in metozoans

The systematic halfwise analysis drew attention to a numb er of p ossible annota

tion errors or oversights in the two genomes Although the accuracy of the GeneWise

predictions in the worm is not enough to condently exp ect the halfwise prediction

to be correct p ossible missannotations can be quite easily sp otted as most of the

errors by halfwise are small intron predictions This analysis provides the annota

tors with a systematic way of nding curation oversights As the human genome

do es not suer from this small intron prediction problem the accuracy in human

is susp ected to be far higher The same approach of using Pfam and GeneWise is

b eing used to annotate the Drosophila genome and genomes such as Asp ergillus

The combination of quality Pfam proleHMMs with the GeneWise algorithm

in this halfwise analysis provides a useful to ol for genome analysis As the number

of sequenced genomes increase providing more automated to ols to provide a b etter

rst pass annotation of genomes is crucial This work shows that such to ols are

p ossible to develop use and apply on a large scale

Chapter

Conclusion

The stream of DNA sequence data emerging from the sequencing pro jects has caused

arevolution in some asp ects of biology One ma jor challenge is interpreting the data

on a large scale to connect it with existing hyp othesis driven biological research This

thesis presents a numb er of to ols to make that p ossible

The fo cus of the thesis has been the GeneWise algorithm to compare protein

information directly to genomic DNA sequence with the gene prediction o ccurring

at the same time as the protein comparison To write this algorithm I to ok the

standp oint that b oth problems have already b een well solved the task was to com

bine them together so that the intermedaite predicted gene sequence was implicit

rather than b eing required up front This was done byworking out how to merge in

general mo dels and applying the solution to this sp ecic case This solution relies on

the established metho ds b eing well describ ed as probabilistic nite state machines

I view the ability to conceive of algorithms such as GeneWise as a vindication for

using formal probabilistic mo dels to describ e protein and DNA sequences with

out the previous work on these mo dels it would have been dicult to design this

algorithm from scratch

GeneWise is accurate and can be used on a large scale There are still many

challenges in really using an algorithm such as GeneWise eectively one clear p oint

is that the maximum likeliho o d path is unlikely to b e the path one wants to rep ort

However it is with algorithms such as GeneWise that accurate gene prediction can

be achieved

GeneWises eectiveness is greatly increased when it is used with Pfam Pfam

provides a set of prole hidden Markov mo dels which cover around two thirds of

known proteins I have help ed the Pfam database considerably by writing a tailor

made database management system and supp orting software for Pfam As well as

designing the software Ihave b een an active curator on the database and involved

in many asp ects of its daytoday running such as the Web Site

Bioinformatics is fo cused on providing useful computational techniques for bio

logical data One asp ect of bioinformatics which often gets overlo oked is that for

a software solution to really have a wide impact it must b e able to work in a large

scale manner robustly and must b e maintainable often by p eople who did not write

the original program Chapters and presented some of my work in this area

of software engineering in bioinformatics The language Dynamite freed me from

worrying ab out the implementation of dynamic programming solutions to PFSMs

The Pfam database provided the source of proleHMMs for the GeneWise analysis

I have b een involved in other work in the area of software engineering in bioinfor

matics here it is hard to p oint to sp ecic results from this work and harder still to

write ab out it from a research p ersp ective However I am a committed promoter of

stronger software engineering techniques in the eld and I b elieve this asp ect of to ol

development is crucial to the success of bioinformatics as a whole

Many of the asp ects of this work are b eing drawn together in my new work

as part of the Ensembl pro ject Ensembl is a pro ject to develop a stable software

system to provide automatic analysis of metozoan genomes starting with the human

genome in I have learned a considerable amount from my implementation

of the Pfam database management system and applied the knowledge to this new

work In addition algorithms such as GeneWise will b e invaluable for the automatic

accurate prediction of genes We can also exp ect to have to write a numb er of new

algorithms for example for eective genomic to genomic comparisons In cases

which require dynamic programming Dynamite will greatly shorten our time to

pro duce solutions

The amount of data which we will exp ect to pro cess over the next ten years is

vast I b elieve that deriving information from this data is only achievable by taking

principled theoretical approaches and applying them to real life data in a robust

engineering environment

Bibliography

P Agarwal and D J States Comparative accuracy of metho ds for protein

sequence similarity search Bioinformatics

S F Altschul and W Gish Lo cal alignment statistics Methods in Enzymology

S F Altschul T L Madden A A Schaer J Zhang Z Zhang W Miller and

D J Lipman Gapp ed BLAST and PSIBLAST a new generation of protein

database search programs Nucleic Acids Research

T L Bailey and M Gribskov The megaprior heuristic for discovering protein

sequence patterns In States et al pages

G J Barton and M J Sternb erg Flexible protein sequence patterns a sensitive

metho d to detect weak structural similarities Journal of Molecular Biology

G J Barton and M J E Sternb erg A strategy for the rapid multiple alignment

of protein sequences Journal of Molecular Biology

A Bateman E Birney R Durbin S R Eddy K L Howe and E L L

Sonnhammer The pfam protein families database Nucleic Acids Research

S K Bhattacharya S Ramchandani N Cervoni and M Szyf A mammalian

protein with sp ecic demethylase activity for mCpG DNA Nature

E Birney and R Durbin Dynamite a exible co de generating language for

dynamic programming metho ds used in sequence comparison In Gaasterland

et al pages

E Birney S Kumar and A R Krainer Analysis of the RNArecognition

motif and RS and RGG domains conservation in metazoan premRNA splicing

factors Nucleic Acids Research pages

M J Bishop and E A Thompson Maximum likeliho o d alignment of DNA

sequences Journal of Molecular Biology

M Boro dovsky and J McIninch GENMARK parallel gene recognition for

b oth DNA strands Computers and Chemistry

J U Bowie R Luthy and D Eisenberg A metho d to identify protein se

quences that fold into a known threedimensional structure Science

S Brunak J EngelbrechtandSKnudsen Prediction of human mRNA donor

and acceptor sites from the DNA sequence Journal of Molecular Biology

R Bruskiewich Personal communication

P Bucher and K Hofmann A sequence similarity search algorithm based on

a probabilistic interpretation of an alignment scoring system In States et al

pages

C Burge and S Karlin Prediction of complete gene structures in human

genomic DNA Journal of Molecular Biology

C B Burge R A Padgett and P A Sharp Evolutionary fates and origins

of utyp e introns Molecular Cel l

M Burset and R Guigo Evaluation of gene structure prediction programs

Genomics

K C Burtis The regulation of sex determination and sexually dimorphic

dierentiation in drosophila Current Opinion in Cel l Biology

J F Caceres T Misteli G R Screaton D L Sp ector and A R Krainer Role

of the mo dular domains of sr proteins in subnuclear lo calization and alternative

splicing sp ecicity Journal of Computational Biology pages

G A Churchill Sto chastic mo dels for heterogeneous DNA sequences Bul letin

of Mathematical Biology

G A Churchill and B Lazareva Bayesian restoration of a hidden markov

chain with applications to dna sequencing Journal of Computational Biology

pages

N D Clarke and J M Berg Zinc ngers in caenorhab ditis elegans nding

families and probing pathways Science pages

E W Dijkstra A note on two problems in connection with graphs Numerische

Mathematics

J Ding M K Hayashi Y Zhang L Manche A R Krainer and R M Xu

Crystal structure of the twoRRM domain of hnRNP A UP complexed

with singlestranded telomeric DNA Genes and Development pages

S Dong and D B Searls Gene structure prediction by linguistic metho ds

Genomics

S R Eddy Hidden Markov mo dels Current Opinion in Structural Biology

S R Eddy prolehidden Markov mo dels Bioinformatics

S R Eddy and R Durbin RNA sequence analysis using covariance mo dels

Nucleic Acids Research

S R Eddy G J Mitchison and R Durbin Maximum discrimination hidden

Markov mo dels of sequence consensus Journal of Computational Biology pages

S R Eddy G J Mitchison and R Durbin Maximum discrimination hid

den Markov mo dels of sequence consensus Journal of Computational Biology

The C elegans sequencing consortium Genome sequence of the nemato de C

elegans A platform for investigating biology Science pages

T Gaasterland P Karp K Karplus C Ouzounis C Sander and A Valen

cia editors Proceedings of the Fifth International Conference on Intel ligent

Systems for Molecular Biology Menlo Park CA AAAI Press

M S Gelfand A A Mironov and PAPevzner Gene recognition via spliced

sequence alignment Proceedings of the National Academy of Sciences of the

USA

N Goldman J L Thorne and D T Jones Using evolutionary trees in pro

tein secondary structure prediction and other comparative sequence analyses

Journal of Molecular Biology

O Gotoh An improved algorithm for matching biological sequences Journal

of Molecular Biology

M Gribskov A D McLachlan and D Eisenb erg Prole analysis detection

of distantly related proteins Proceedings of the National Academy of Sciences

of the USA

J A Grice R Hughey and D Sp eck Reduced space sequence alignment

Computer Applications in the Biosciences

W N Grundy T L BaileyCPElkan and M E Baker Metameme motif

based hidden markov mo dels of protein families Computer Applications in the

Biosciences pages

R Guigo and PAgarwal manuscript in preparation

N Handa O Nureki K Kurimoto I Kim H Sakamoto Y Shimura Y Muto

Y and S Yokoyama Structural basis for recognition of the tra mRNA precursor

by the sexlethal protein Nature

K Hofmann PBucher L Falquet and A Bairo ch The prosite database its

status in Nucleic Acids Research

I Holmes and R Durbin Dynamic programming alignment accuracy Journal

of Computational Biology

D S Horowitz and A R Krainer Mechanisms for selecting splice sites in

mammalian premrna splicing Trends in Genetics

T J P Hubbard RMScoverage graphs A qualitative metho d for comparing

threedimensional protein structure predictions Proteins

D T Jones W R Taylor and J M Thornton A mutation data matrix for

transmembrane proteins FEBBs Letters

D H Kil and F B Shin Pattern recognition and prediction with applications

to signal characterization AIP Press

A M Krecic and M S Swanson hnrnp complexes comp osition structure

and function Current Opinion in Cel l Biology

A Krogh Hidden Markov mo dels for lab eled sequences In Proceedings of the

th IAPR International Conference on Pattern Recognition pages

Los Alamitos CA IEEE Computer So ciety Press

A Krogh Two metho ds for improving p erformance of a HMM and their ap

plication for gene nding In Gaasterland et al pages

A Krogh I S Mian IS and D Haussler A hidden markov mo del that nds

genes in e coli dna Nucleic Acids Research pages

D Kulp D Haussler M G Reese and F H Eeckman A generalized hidden

Markov mo del for the recognition of human genes in DNA In States et al

pages

F Lefebvre An optimized parsing algorithm well suited to RNA folding In

Rawlings et al pages

P Lio J L Thorne N Goldman and D T Jones Passml combining evolu

tionary inference and protein secondary structure prediction Bioinformatics

H X Liu M Zhang and A R Krainer Identication of functional exonic

splicing enhancer motifs recognized by individual sr proteins Genes and De

velopment

R Luthy A D McLachlan and D Eisenberg Secondary structurebased

proles use of structureconserving scoring tables in searching protein sequence

databases for structural similarities Proteins

A Mayeda and A R Krainer Mammalian in vitro splicing assays Methods

in Molecular Biology

R Mott Maximum likeliho o d estimation of the statistical distribution of

SmithWaterman lo cal sequence similarity scores Bul letin of Mathematical

Biology

GENOME a program to align spliced dna sequences to unspliced R Mott EST

genomic dna Computer Applications in the Biosciences pages

A G Murzin and A Bateman Distant homology recognition using structual

classication of proteins Proteins

E W Myers and W Miller Optimal alignments in linear space Computer

Applications in the Biosciences

K Nagai C Oubridge T H Jessen J Li and PREvans Crystal structure of

the RNAbinding domain of the U small nuclear rib onucleoprotein A Nature

S B Needleman and C D Wunsch A general metho d applicable to the search

for similarities in the amino acid sequence of two proteins Journal of Molecular

Biology

C Oubridge N Ito PREvans PR C H Teo and K Nagai Crystal structure

at a resolution of the RNAbinding domain of the UA spliceosomal protein

complexed with an RNA hairpin Nature

J Park K Karplus C Barrett R Hughey D Haussler T Hubbard and

C Chothia Sequence comparisons using multiple sequences detect three times

as many remote homologues as pairwise metho ds Journal of Molecular Biology

C PPonting and R B Russell Swap osins circular p ermutations within genes

enco ding sap osin homologues Trends in Biochemistry

C Rawlings D Clark R Altman L Hunter T Lengauer and S Wo dak ed

itors Proceedings of the Third International Conference on Intel ligent Systems

for Molecular Biology Menlo Park CA AAAI Press

S K Riis and A Krogh Hidden neural networks a framework for HMMNN

hybrids In Proceedings of ICASSP pages New York

IEEE

Eleanor Rivas and Sean Eddy A dynamic programming algorithm for rna struc

ture prediction including pseudoknots Journal of Molecular Biology

H Rob ertson Two large families of chemoreceptor genes in the nemato des

caenorhab ditis elegans and caenorhab ditis briggsae reveal extensive gene du

plication diversication movement and intron loss Genome Research pages

Y Sakakibara M Brown R Hughey I Saira Mian Kimmen Sjolander R C

Underwo o d and D Haussler Sto chastic contextfree grammars for tRNA mo d

eling Nucleic Acids Research

D ScherlyW Bo elens N A Dathan W J van Venro oij and I W Matta j

Ma jor determinants of the sp ecicity of interaction between small nuclear ri

bonucleoproteins ua and ub and their cognate rnas Nature

J Schultz R R Copley T Do erks C PPonting and P Bork Smart aweb

based to ol for the study of genetically mobile domains Nucleic Acids Research

pages

K Sjolander K Karplus M Brown R Hughey A Krogh I S Mian and

D Haussler Dirichlet mixtures a metho d for improved detection of weak

but signicant protein sequence homology Computer Applications in the Bio

sciences

D Slonim L Kruglyak L Stein and E Lander Building human genome maps

with radiation hybrids Journal of Computational Biology

T F Smith and M S Waterman Identication of common molecular subse

quences Journal of Molecular Biology

E L L Sonnhammer S R Eddy and R Durbin Pfam a comprehensive

database of protein domain families based on seed alignments Proteins

R Staden Measurements of the eects that co ding for a protein has on a DNA

sequence and their use for nding genes Nucleic Acids Research

D J States P Agarwal T Gaasterland L Hunter and R F Smith editors

Proceedings of the Fourth International Conference on Intel ligent Systems for

Molecular Biology Menlo Park CA AAAI Press

R Tacke and J L Manley Determinants of sr protein sp ecicity Current

Opinion in Cel l Biology

S A Teichmann J Park and C Chothia Structural assignments to the

mycoplasma genitalium proteins show extensive gene duplications and domain

rearrangements Proceedings of the National Academy of Sciences of the USA

J D Thompson D G Higgins and T J Gibson Improved sensitivity of

prole searches through the use of sequence weights and gap excision Computer

Applications in the Biosciences

J L Thorne N Goldman and D T Jones Combining protein evolution and

secondary structure Molecular Biology and Evolution

E C Ub erbacher and R J Mural Lo cating proteinco ding regions in human

dna sequences by a multiple sensorneural network approach Proceedings of

the National Academy of Sciences of the USA

A A Salamov V V Solovyev The GeneFinder computer to ols for analysis

of human and mo del organisms genome sequences In Gaasterland et al

pages

A A Salamov V V Solovyev and C B Lawrence Identication of human

gene structure using linear discriminant functions and dynamic programming

In Rawlings et al pages

J Zhu J Liu and C Lawrence Bayesian adaptive alignment and inference

In Gaasterland et al pages

J Zhu J S Liu and C E Lawrence Bayesian adaptive sequence alignment

algorithms Bioinformatics