Genome Informatics Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Comparative and Functional Genomics
Comp Funct Genom 2001; 2: 376–383.
DOI: 10.1002/cfg.114
Copyright © 2001 John Wiley & Sons, Ltd.

Meeting Highlights

Joint Cold Spring Harbor Laboratory and Wellcome Trust conference – genome informatics
Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. August 8–12, 2001

Jo Wixon1*, Jennifer Ashurst2 and Jo Dicks3
1 HGMP-RC, Hinxton, Cambridge, UK
2 Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
3 John Innes Centre, Norwich Research Park, Colney, Norwich, UK

*Correspondence to: HGMP-RC, Hinxton, Cambridge, CB10 1SB, UK. E-mail: [email protected]

Published online: 26 October 2001

Annotation pipelines

Ewan Birney (EBI, UK) opened the meeting with a presentation on the Ensembl gene building system (http://www.ensembl.org). The first part of the process is Exonerate (Guy Slater), which is the ‘basic guts’ of BLAST; it allows multiplexing and makes a lazy evaluation of the path between HSPs to rapidly build gapped alignments. This was designed for ESTs, but is now being applied to mouse whole genome shotgun data. The next stage is Pmatch (Richard Durbin), a hyper-fast protein-based exact matcher (similar to Jim Kent’s BLAT). This finds exact 14mer substrings by building a table of non-overlapping 5mer matches and using pairs of consecutive 5mer matches as seeds. A targeted gene build uses Pmatch to match all known human proteins to the entire genome, and is then refined using Genome Wise. Genome Wise (Ewan Birney) uses information such as EST alignments to the genome to build gene models (Figure 1). It can reconcile overlapping alignments and, in the absence of splices in a match, uses ‘tunnelling’, which extends the ORF where possible (until it finds a stop codon).

Figure 1. An illustration of the abilities of Genome Wise

Tom Casavant (University of Iowa, USA) described CAEPA, an online, bulk EST sequence processing and annotation pipeline. The group’s main aim is gene discovery using EST data; so far they have annotated and submitted ~500 000 ESTs. They have produced an automated initial annotation pipeline, and provide support for serial subtraction, which includes clustering to detect novelty, and large scale BLAST for synteny assessment. Their EST processing pipeline consists of an EST preparation stage, with a vector and contaminant screen and a repeat masker and low complexity screen. A local annotation tool then looks for 3′ and 5′ EST features; this is followed by clustering and then radiation hybrid mapping. To make this accessible to groups running small projects, they have produced a web ‘front end’ with a data drop point, and e-mail return of results from custom BLAST searches of internal datasets. Users can choose which parts of the pipeline they wish to take advantage of and can make other selections, such as which libraries of repeats to screen against.

Ron Chen (DoubleTwist, USA) described their strategy for prioritising predicted human genes for experimental validation. Their high-throughput approach starts with the ab initio gene prediction programs GenScan, FGENESH and GrailEXP. To this data they add protein homology data and results from comparisons to RefSeq, UniGene and their own gene index. These three sources of evidence are then considered by Gene Squasher, which makes locus predictions. It constructs exons by assembling overlapping candidates from the three methods and assembles them into genes if there is a common gene model or other supporting evidence. All the data goes into their Agave XML format and they have a genomic viewer, which handles the data, showing the three types of exon data and the final predictions. Their current locus prediction counts are: ab initio, 39 000; protein homology, 31 000; and gene index match, 68 000. Checking for overlaps brings the total down to ~98 000. Half of these have only gene index match data, and only 10% are detected by all three approaches. 30.2% (~29 500) are high confidence, with two lines of evidence.

William Fitzhugh (Whitehead Institute, USA) presented Calhoun, a comprehensive system for genome sequence annotation. A crucial part of this approach is that it has a platform to tie the existing stand-alone genome annotation tools together and keep track of results. It has a tool for viewing results and can manage and run a large number of jobs at any given time. Calhoun has an Oracle relational database, a Java sequence browser, web-based tools accessible by the public and an analysis pipeline. The database schema is huge, with a vast array of entities, from sequence, to taxa, to mapping data. They use SQL to ask biologically relevant questions. The system uses an analysis queue table to organise analysis jobs, which can keep track of all the jobs done on a particular sequence. Web ‘front end’ tools include a gene search, the ability to visualise the data, and a combined physical and genetic map. Their current work is centred on Neurospora crassa, but they hope to scale up to a larger eukaryote genome next.

Inna Dubchak (LBNL, USA) described a pipeline for real-time comparative analysis of the human and mouse genomes (http://pipeline.lbl.gov), which is called Godzilla, because it is going to be huge. They download mouse data daily from GenBank and pass it through Masker Aid and a first round homology search using BLAT, to determine which sequences are then compared to the April freeze of the Golden Path human annotation, also using BLAT. The next stage is called AVID; this is a global alignment engine which can handle a combination of finished and draft sequence. The final alignments are viewed using VISTA. The pipeline has been used to discover conserved sequence fragments, which they have seen in coding and non-coding regions. It has also found overlapping contigs in GenBank and detected mis-assemblies. Their data is soon to be included in the UCSC genome viewer.

Michelle Clamp (EBI, UK) spoke about the Ensembl analysis pipeline. They currently have 4.3 Gb of data (480 000 sequence reads) to analyse. Sixteen types of analysis are used, which requires a lot of organisation. They have automated job submission, tracking programs, a set-up to retry failed jobs and access to large, file-based databases. They have a cluster with 2 Terabytes of storage and a farm of 320 machines, each with a 60 Gb local drive. New entries go immediately into tracking, which determines which jobs need doing and in what order, then forms a job and sends it to the farm. Communication between the farm and the cluster is very complex, so they keep as much as possible on the local drives (hence their large size), such as BLAST databases, binaries and runtime files. The data is fetched from or written directly to the databases. They use BioPerl (which Michelle likes very much) and MySQL. This is an open source project; the software and data are freely available. The entire system is portable and there are currently more than 10 remote installations of the website.

GANESH, a sequence analysis and display package, was presented by Holger Hummerich (Imperial College, UK). This is a user-friendly tool aimed at researchers interested in small regions, such as those hunting for disease genes (http://zebrafish.doc.ic.ac.uk/Ganesh/ganesh.html). It is dynamic and updates daily, allows selective use (such as choosing to view only new data), and can display data at varying levels of resolution, from clone contigs up to BLAST alignments. The system downloads the relevant data from GenBank and runs several analysis programs, including Repeat Masker, GenScan and Pfam. BLAST searches are performed across a wide range of databases. This open source tool uses a MySQL relational database and has a Java front-end display. The display shows the sequence across the centre of the window, with data for one strand above and the other strand below. Features such as ESTs, SNPs, repeats, exons and promoters are marked on the display in different colours and users can add their own annotation.

TIGR gene indices (http://www.tigr.org/tdb/tgi.shtml) have now been assembled for 13 animal species, 11 plant species, 12 protists and 5 fungi. Dan Lee (TIGR, USA) described how an index is built: they take ESTs from GenBank and their own databases and remove vector, poly A tails, mitochondrial and ribosomal sequences etc. They then harvest expressed transcripts from GenBank cds entries and reduce the redundancy amongst these before comparing to the ESTs to build clusters. Each cluster is assembled into a tentative consensus ...

... reduce the high risk of mis-assembly that exists with such incomplete and repeat rich data. They now plan to improve this tool and to take on a pilot project using Caenorhabditis briggsae data.

David Jaffe (Whitehead Institute, USA) described the Arachne whole genome shotgun assembler, which we have already reported on in the highlights of the CSHL Genome Sequencing and Biology meeting in issue 2(4) of Comparative and Functional Genomics.

Colin Semple (University of Edinburgh, UK) presented a computational comparison of human genomic sequence assemblies for a region of chromosome 4. His group has a contig of 58 BAC/PAC clones covering a 5.8 Mb region linked to manic depression. These have been used to order 107 STS markers across the region. They used this high-resolution physical mapping data to assess the coverage of the region by the UCSC, NCBI and Celera genome assemblies. The principal observation was that there was significant variation between the three assemblies.
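As a rough illustration of how an ordered set of STS markers could be used to score an assembly, the Python sketch below computes two simple measures: the fraction of mapped markers found in an assembly, and the fraction of the found markers that lie in map order (in either orientation). The report does not say how the coverage assessment was actually carried out, so both measures, the function names and the toy marker lists are assumptions made purely for illustration; they are not the group's method or data.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence


def marker_coverage(map_order: Sequence[str], assembly_order: Sequence[str]) -> float:
    """Fraction of physically mapped markers present anywhere in the assembly."""
    present = set(assembly_order)
    return sum(1 for marker in map_order if marker in present) / len(map_order)


def _longest_increasing_run(values: Sequence[int]) -> int:
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails: List[int] = []
    for value in values:
        i = bisect_left(tails, value)
        if i == len(tails):
            tails.append(value)
        else:
            tails[i] = value
    return len(tails)


def order_concordance(map_order: Sequence[str], assembly_order: Sequence[str]) -> float:
    """Fraction of the markers found in the assembly that agree with map order.

    The assembly may be reported on either strand, so both orientations
    are tried and the better one is used.
    """
    rank = {marker: i for i, marker in enumerate(map_order)}
    ranks = [rank[m] for m in assembly_order if m in rank]
    if not ranks:
        return 0.0
    forward = _longest_increasing_run(ranks)
    reverse = _longest_increasing_run([-r for r in ranks])
    return max(forward, reverse) / len(ranks)


# Hypothetical toy data: marker names and assemblies are invented for the example.
MAP_ORDER = ["STS1", "STS2", "STS3", "STS4", "STS5"]
ASSEMBLIES: Dict[str, List[str]] = {
    "assembly_A": ["STS1", "STS2", "STS3", "STS4", "STS5"],  # complete, in order
    "assembly_B": ["STS1", "STS3", "STS2", "STS5"],          # one missing, one swap
    "assembly_C": ["STS5", "STS4", "STS2", "STS1"],          # one missing, reversed
}

if __name__ == "__main__":
    for name, order in ASSEMBLIES.items():
        print(name, marker_coverage(MAP_ORDER, order), order_concordance(MAP_ORDER, order))
```

Trying both orientations means that an assembly which is simply reverse-complemented relative to the map is not penalised; only missing markers and genuine rearrangements lower the scores.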