<<

View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by PubMed Central

http://genomebiology.com/2000/1/2/reports/4012.1

Meeting report

Gene prediction: the end of the beginning comment Colin Semple

Address: Department of Medical Sciences, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK. E-mail: [email protected]

Published: 28 July 2000 reviews Genome Biology 2000, 1(2):reports4012.1–4012.3 The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2000/1/2/reports/4012 © GenomeBiology.com (Print ISSN 1465-6906; Online ISSN 1465-6914)

Reducing genomes to reports A report from the conference entitled Genome Based All ab initio programs have to balance sensi- Structure Determination, Hinxton, UK, 1-2 June, 2000, tivity against accuracy. It is often only possible to detect all organised by the European Institute (EBI). the real exons present in a sequence at the expense of detect- ing many false ones. Alternatively, one may accept only pre- dictions scoring above a more stringent threshold but lose The draft sequence of the will become avail- those real exons that have lower scores. The trick is to try and able later this year. For some time now it has been accepted increase accuracy without any large loss of sensitivity; this deposited research that this will mark a beginning rather than an end. A vast can be done by comparing the prediction with additional, amount of work will remain to be done, from detailing independent evidence. For example, one may increase confi- sequence polymorphisms to discovering the complexities of dence in a predicted coding exon by detecting the presence of the transcriptome - the totality of sequences transcribed - a sequence within it which codes for a known protein domain. and, ultimately, the proteome - all the proteins encoded by The patterns made in DNA sequences by such domains may the genome. All of this work will, to a greater or lesser extent, be detected using probabilistic models known as hidden depend on all the genes having been correctly identified. It Markov models (HMMs). Predictions for exons that contain will be necessary to document not only the coding exons of non-coding sequence or untranslated regions (UTRs) can be refereed research each gene but also non-coding exonic sequence and regula- refined by comparison with ESTs (expressed sequence tags) - tory sequences. As this conference made clear, however, the sequences representing fragments of mRNA sequence that production of genomic sequence has outstripped our ability include coding and/or UTR sequence. The latest generation to reliably predict such features computationally. of gene prediction programs take advantage of such similar- ity-based approaches to complement ab initio predictions. Traditionally, gene prediction programs that rely only on the For example, Ed Uberbacher (Oak Ridge National Labora- statistical qualities of exons have been referred to as per- tory, USA) unveiled the latest incarnation of the Grail forming ab initio predictions (from the Latin: from the program - GrailEXP [http://grail.lsd.ornl.gov/] - which uses beginning). Ab initio prediction of coding sequences is an EST data to find both UTR boundaries and short exons, typi- interactions undeniable success by the standards of the machine-learning cally problematic areas for ab initio predictions. Anders field, and most of the widely used gene prediction Krogh (Centre for Biological Sequence Analysis, Denmark) programs belong to this class of . It is impressive also showed improvements in accuracy for his gene predic- that the statistical analysis of raw genomic sequence can tion program HMMgene [http://www.cbs.dtu.dk/services/ detect around 77-98% of the genes present, which was the HMMgene/] when similarity-based evidence was incorpo- range of sensitivity reported at the conference. This is, rated. however, little consolation to the bench biologist, who wants information the complete sequences of all genes present, with some cer- Regulatory regions in the human genome are estimated to tainty about the accuracy of the predictions involved. As occupy ten times the sequence length of coding sequences. (European Bioinformatics Institute, UK) put it, Prediction of regulatory sequences remains troublesome as what looks impressive to the computer scientist is often they are invariably short sequences matching a rather vague simply wrong to the biologist. consensus pattern that arise frequently by chance in genomic sequence. Thomas Werner (GSF - National 2 Genome Biology Vol 1 No 2 Semple

Research Centre for Environment and Health, Germany) Making ambiguity clear outlined a novel approach to circumventing the low Because ab initio prediction is far from perfect, and adding sequence conservation of functionally equivalent promoters. other evidence can improve gene prediction, there have been He treats regions as clusters of small, locally con- several efforts to develop graphical interfaces for the com- served sequence motifs or ‘modules’. It would seem that parison of results. These interfaces allow simultaneous there are specific restrictions on the spacing and ordering of examination of the plethora of results generated by gene pre- modules within a promoter region, imposed by the require- diction programs along with sequence similarities. The idea ments of the regulatory protein complexes that bind there. is to show explicitly where evidence from different sources is His program PromoterInspector [http://genomatix.gsf. contradictory or in agreement. Human intervention then de/free_services/] uses such restrictions to increase the takes the form of ‘polishing’ annotation: making decisions accuracy of promoter prediction, achieving impressive about the reliability of predicted features and designing results over large genomic sequences such as human chro- experiments to support or refute them. A graphical interface mosome 22, where it reached levels of specificity of more called Artemis [http://www.sanger.ac.uk/Software/Artemis/], than 98% (that is, it generated less than 2% false positives developed in the Sanger Centre (poster presented by Kim when compared to the published annotation). As with gene Rutherford), is designed to allow users to edit the features prediction, however, there is a trade-off with sensitivity, and displayed. Artemis has been used extensively for annotation only around 33% of known promoters were found. Michael of Sanger Centre projects up to 4 Mb in size. These have Zhang (Cold Spring Harbor Laboratory, USA) has begun the been mainly pathogen genomes, but the program is now task of incorporating regulatory motif detection into his gene being used in the annotation of the genome of the fission prediction software MZEF. As with many other programs, yeast Schizosaccharomyces pombe. Various attempts are MZEF is most successful when predicting internal, coding, being made to automate the polishing process and reduce exons, so the idea is to tailor new programs to other classes the amount of human intervention necessary. Richard of exon. For instance, models of initial exons could incorpo- Durbin (Sanger Centre, UK) described a new program called rate upstream promoter regions and final exon models could GAZE, which is an offshoot of his ACEDB database software. include poly(A) addition sites. GAZE integrates evidence from multiple sources to come up with graphical representations of gene predictions. We were Comparative may be the unexploded bomb in gene also introduced to Ensemble [http://www.ensembl.org/] by structure prediction, capable of sweeping away many of the Ewan Birney. It takes exons predicted ab initio that are con- ambiguities in human gene predictions. Regions of sequence firmed via similarity results and assembles the exons into conserved between species can reveal novel coding predicted genes that are presented graphically. In this way it sequences and, more importantly, non-coding features that provides a ‘base line’ annotation for many of the fragmen- could not otherwise be detected. Mikhail Gelfand (Centre for tary, unfinished human genomic sequences in the EMBL Biotechnology, Russia) outlined strategies for finding regu- database. Around 2.9 Gb of the existing draft and finished latory regions by comparison of bacterial species. He showed sequence of the human genome have been processed by how the discovery of regulatory elements for heat-shock Ensemble and the results are freely available from the protein genes allowed the detection of such elements at Ensemble website. Ensemble gene identifiers will remain other loci and consequently the detection of novel co-regu- stable during rearrangements and extension of draft lated genes that may be involved in the heat-shock response. sequences on the way to a definitive human genome The mouse genome sequence will be available within a year sequence, which is at least three years away. Similar ‘indus- or two and will doubtless provide a popular resource for trial scale’ analyses are being run using the Genome Channel comparisons with human sequence, particularly in detecting [http://grail.lsd.ornl.gov/tools/channel/], according to Ed regulatory elements and refining exon boundaries. But more Uberbacher; the genome channel is an analysis pipeline pro- than one speaker concluded that comparisons between cessing draft human genome sequence with a different com- mouse and human sequences may not be as instructive as bination of programs from Ensemble. those between human and other species. For instance, Roderic Guigo (Institut Municipal de Investigacio Medica, Spain) found that chicken sequences may be more helpful in Automatic for the people predicting coding exons, as they show good conservation in The Cold Spring Harbor Genome conference earlier this year coding regions but diverge substantially elsewhere. Conser- saw the creation of Genesweep [http://www.ensembl.org/ vation between mammals seems to be more widespread, and .html], a ‘gene sweepstake’ where participants can so creates less clear distinctions between conserved and bet on the final number of human genes that will be found. divergent sequences. (Pennsylvania State Uni- The spread of bets reflects the current uncertainty among versity, USA) presented a new program, PipMaker workers in human genomics, ranging from 27,462 up to [http://bio.cse.psu.edu/pipmaker/] (Percentage Identity 200,000. Interestingly, the mean is currently 62,598, much Plot Maker) for graphically viewing sequence conservation lower than the ballpark figure of 100,000 we have all become along genomic sequences. accustomed to. When the winner of Genesweep is announced comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2000/1/2/reports/4012.3 D. methods, but 16% of ab initio , the fruit fly gene prediction and, it would seem, the more genome, and it is hoped that GO will become a tational humans involved, the better. real exons were not predicted at all computationally and only 20% of predicted gene structures were correct. So, computa- tional techniques give us invaluable clues to gene but structure, in the end it will be the addition of work by bench scien- tists that will provide the moves have full begun to establish a picture. Distributed Sequence Anno- With this in tation System (DAS) mind, [http://stein.cshl.org/das/] to democra- tize genome annotation. The idea is ‘reference to designate a server’ central sequence which data stores for the servers’ essential maintained elsewhere by genome, a range of different mapping groups. and multiple Researchers and interested in a given ‘annotation region of the genome would use a web browser-like application to download and integrate different features from servers of their choice. Thus, a reliable central annotation can be maintained in parallel with a diver- sity of less confidently predicted features that may or may not turn out to be useful. It seems inevitable that human beings will have to take on the annotation final once the machines tasks have had a of first pass. One gene way prediction and or another, human intervention compu is still essential in Perhaps the real measure of the success of computational gene Perhaps the real measure of the success of biologists’ the into integration successful their in is predictions toolbox. (Sanger Centre) described the strategy for annotation of the first sequence, completed human of chromosome chromosome predictions HGP/Chr22/]. Here the strength of computational 22 [http://www.sanger.ac.uk/ - their sensitivity - was exploited to generate a set of candidate exons. These candidates could then be used point as the for starting laboratory work sequences. For chromosome 22, 94% of genes were found and to discover the at least partially actual predicted by mRNA and completed human chromosomes. (EBI) Michael stressed Ashburner the importance annotation across of species and consistency described the Gene in Ontology genome (GO) project [http://www.geneontology.org/]. attempt GO to rigorously is describe all an according the to the genes molecular in functions of a their genome products the biological and processes and cellular components with which they are associated. The classification and standardized ter- minology of GO were melanogaster used in the annotation community-curated of entity, the providing central vocabulary for annotation. a democratic but in 2003, what will be the reward for the rest of us? Once there Once us? of rest the for reward the be will what 2003, in is a complete set of known and predicted genes in which we have high confidence, it is to be made publicly available via the internet, and the way these data influenced by are the experiences and presented software taken from other will be genome projects. Various speakers at the meeting discussed archiving genomic annotation data for projects in the plant Arabidopsis thaliana