Workflow-Based Systematic Design of High Throughput Genome Annotation

Imperial College London Department of Computing Workflow-based systematic design of high throughput genome annotation Xikun Wu Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London, September 2009 1 Abstract The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intra- cellular protozoan parasites of man and livestock. E. tenella is one of seven species that infect the domestic chicken and cause the intestinal disease coccidiosis which is economy important for poultry industry. E. tenella is highly pathogenic and is often used as a model species for the Eimeria biology studies. In this PhD thesis, a comprehensive annotation system named as \WAGA" (Workflow-based Automatically Genome Annotation) was built and applied to the E. tenella genome. InforSense KDE, and its BioSense plug-in (products of the InforSense Company), were the core softwares used to build the workflows. Workflows were made by integrating individual bioinformatics tools into a single platform. Each workflow was designed to provide a standalone service for a particular task. Three major workflows were developed based on the genomic resources currently available for E. tenella. These were of ESTs-based gene construction, HMM-based gene prediction and protein-based annotation. Finally, a combining workflow was built to sit above the individual ones to generate a set of automatic annotations using all of the available information. The overall system and its three major components were deployed as web servers that are fully tuneable and reusable for end users. WAGA does not require users to have programming skills or knowledge of the underlying algorithms or mechanisms of its low level components. E. tenella was the target genome here and all the results obtained were displayed by GBrowse. A sample of the results is selected for experimental validation. For evaluation purpose, WAGA was also applied to another Apicomplexa parasite, Plasmodium falciparum, the causative agent of human malaria, which has been extensively annotated. The results obtained were compared with gene predictions of PHAT, a gene ¯nder designed for and used in the P. falciparum genome project. 2 Contents Abstract 2 Acknowledgements 17 1 Introduction 20 1.1 Eimeria and apicomplexan parasites . 20 1.2 The developmental life cycle of E. tenella ...................... 23 1.3 Di®erential gene expression between developmental life-cycle stages . 25 1.3.1 E. tenella ................................... 25 1.3.2 Other apicomplexan parasites . 26 1.4 Apicomplexan genome projects . 27 1.4.1 Overview . 27 1.4.2 Individual organism databases . 29 1.4.3 ApiDoTS . 30 1.4.4 ApiDB . 30 1.5 The E. tenella genome project . 31 1.5.1 Importance of E. tenella genome annotation . 31 1.5.2 International E. tenella consortium . 32 3 1.5.3 E. tenella genomic resources . 33 1.5.3.1 Genome assembly . 33 1.5.3.2 ESTs and ORESTES . 34 1.5.3.3 Genome annotation . 36 1.5.3.4 Analysis of repetitive DNA . 37 1.5.3.5 Gene families encoding E. tenella surface antigens (EtSAGs) . 37 1.5.3.6 Extranuclear genome of Eimeria spp. 39 1.6 Bioinformatics for genome analysis . 41 1.6.1 Individual bioinformatics tools . 41 1.6.1.1 BLAST . 42 1.6.1.2 FASTA . 44 1.6.2 Open Bioinformatics Foundation . 45 1.6.3 ENSEMBL . 46 1.6.4 Genomics Uni¯ed Schema . 47 1.6.5 Generic Model Organism Database . 48 1.6.6 GeneDB . 49 1.7 Workflow techniques . 50 1.7.1 Workflow concept . 50 1.7.2 Workflow in bioinformatics . 51 1.7.3 Workflow packages . 53 1.8 InforSense Knowledge Discovery Environment and its BioSense plug-in . 55 1.9 Objectives and summary of thesis . 56 4 2 Construction of gene models based on expression sequence data 57 2.1 Introduction . 57 2.2 ESTs and ORESTES resources for E. tenella .................... 58 2.3 Step One: ESTs mapping to the genome assembly . 60 2.3.1 ESTs mapping tools . 60 2.3.2 Sim4 . 61 2.3.3 Pre-Sim4 BLASTN . 61 2.3.4 The \ESTs mapping" node . 63 2.4 Step Two: Selection of mapped results . 65 2.4.1 Selection criteria . 67 2.4.2 The \mapping selecting" node . 68 2.5 Step 3: Merging of selected results . 70 2.5.1 Third party software \cluster merge" . 71 2.5.2 Merging transcripts of unknown polarity . 71 2.5.3 The \gene merging" node . 73 2.6 The \ESTs-based Gene Construction" workflow . 73 2.6.1 Workflow architecture . 73 2.6.2 Workflow parameters . 73 2.7 Results for E. tenella ................................. 74 2.7.1 Results of ESTs mapping . 76 2.7.2 Results of gene models building . 78 2.8 Summary . 81 5 3 HMM-based gene prediction 82 3.1 Introduction . 82 3.2 The problem of gene prediction . 82 3.3 HMM algorithm . 83 3.3.1 De¯nition of the HMM . 84 3.3.2 Decoding . 84 3.3.3 Evaluating . 86 3.3.4 Learning . 87 3.3.5 Structure of HMM in gene ¯nding . 88 3.4 Training set . 88 3.4.1 Source data . 89 3.4.2 BLASTCLUST . 90 3.4.3 The \redundancy clean" node . 90 3.5 SNAP services . 91 3.5.1 SNAP training . 91 3.5.2 SNAP predict . 92 3.6 GlimmerHMM services . 94 3.6.1 GlimmerHMM training . 94 3.6.2 GlimmerHMM predict . 94 3.7 GC content . 95 3.8 The \HMM-based Gene Prediction" workflow . 96 3.8.1 Workflow architecture . 97 6 3.8.2 Workflow parameters . 97 3.9 Prediction by trained models . 98 3.10 Results for E. tenella ................................. 99 3.10.1 Practical strategy . 99 3.10.2 E. tenella training set . 101 3.10.3 E. tenella predictions . 102 3.11 Summary . 104 4 Protein-based annotation 106 4.1 Introduction . 106 4.2 Protein Databases . 107 4.3 BLASTX utility . 108 4.4 MSPcrunch . 109 4.5 The \chunk blastx" service node . 111 4.6 The \Protein-based Annotation" workflow . 112 4.7 Results for E. tenella ................................. 112 4.8 Summary . 115 5 Combining all evidences to produce the WAGA annotation 117 5.1 Introduction . 117 5.2 Evidence Modeler . 117 5.3 The \Combine Evidences" workflow . 119 5.4 WAGA annotation of E. tenella ........................... 120 5.4.1 Results . 120 7 5.4.2 Discussions . 122 5.4.2.1 Comparing with other apicomplexan genome annotations . 122 5.4.2.2 Comparing with other E. tenella resources . 123 5.5 Summary . 124 6 Results visualisation and workflows deployment 125 6.1 Introduction . 125 6.2 Artemis . ..

Load more