Assembling and Annotating the Genome of the Nematode Caenorhabditis doughertyi

Sinduja Chandrasekaran
MSc. Thesis
University of Helsinki, Department of Computer Science
Helsinki, 12/01/16

HELSINGIN YLIOPISTO / HELSINGFORS UNIVERSITET / UNIVERSITY OF HELSINKI

Faculty/Section: Faculty of Science
Department: Department of Computer Science
Author: Sinduja Chandrasekaran
Title: Assembling and annotating the genome of the nematode Caenorhabditis doughertyi
Subject: Bioinformatics
Level: MSc. Thesis
Month and year: January 2016
Number of pages: 51+9 pages

Abstract

Next-generation sequencing (NGS) technologies and advances in bioinformatics methodology have led to the start and success of many genomics projects. One such project is the Caenorhabditis Genome Project, which aims to generate draft genomes for all known but not yet sequenced Caenorhabditis species. Apart from C. elegans, the model organism behind discoveries such as the molecular mechanism of cell death and RNA interference, little is known about the other species in this genus. This hinders our understanding of the evolution of C. elegans and its distinct characteristics. This project is therefore an initiative to understand the Caenorhabditis genus. The aim of my project was to sequence and annotate the genome of Caenorhabditis doughertyi as part of the Caenorhabditis Genome Project. C. doughertyi is the sister species of C. brenneri, which is known for its high level of polymorphism among eukaryotes. It was first found in Kerala, India, by M.-A. Félix in 2007 and has both male and female adults. Sequencing C. doughertyi would pave the way for understanding the evolution of the high diversity levels observed in C. brenneri.
The raw data consisted of two paired-end libraries with insert sizes of 300 bp and 500 bp and a read length of 125 bp. Read quality was ensured through quality control measures such as trimming of adapter sequences, error correction, and removal of DNA from non-target organisms. The reads were then assembled using multiple assemblers, and the ABySS assembly was selected as the best based on assembly metrics such as N50 and biological metrics such as CEGMA. The draft genome was then annotated using the MAKER pipeline, and orthologs were identified using OrthoMCL. The resulting draft genome can aid preliminary comparative genomic analyses with other species in the genus. Further work may focus on improving this draft assembly towards a publication-quality genome sequence for the species.

ACM Computing Classification System (CCS):
Life and medical sciences -> Computational biology
Life and medical sciences -> Bioinformatics

Keywords: Nematode, Caenorhabditis doughertyi, Genome Assembly, Annotation

Contents

Abbreviations
1 Introduction
  1.1 Caenorhabditis Genome Project
  1.2 Caenorhabditis brenneri
  1.3 Aims and objectives
2 Methodology
  2.1 Quality control
    2.1.1 FastQC
    2.1.2 Adapter trimming
    2.1.3 Error correction
    2.1.4 Blobology
      2.1.4.1 Bowtie
    2.1.5 Filtering reads
    2.1.6 Insert size estimation
    2.1.7 Blobology after contamination removal
    2.1.8 BLAST against Caenorhabditis briggsae
  2.2 Prior to final assembly
    2.2.1 Estimating k-mer size
  2.3 Final assembly
    2.3.1 SPAdes
    2.3.2 Ray
    2.3.3 Velvet
    2.3.4 CLC
    2.3.5 ABySS (Assembly by Short Sequences)
    2.3.6 Trinity
  2.4 Comparison of assemblies
    2.4.1 Assembly statistics
    2.4.2 Biological metrics
      2.4.2.1 Transcript content
      2.4.2.2 Protein content
      2.4.2.3 Core Eukaryotic Gene Mapping Approach (CEGMA)
    2.4.3 Statistical metrics
      2.4.3.1 Assembly Likelihood Evaluation (ALE)
      2.4.3.2 Recognition of Errors in Assembly using Paired Reads (REAPER)
  2.5 Annotation
    2.5.1 MAKER pipeline
      2.5.1.1 Repeat masking
      2.5.1.2 Gene Prediction
    2.5.2 Augustus
    2.5.3 OrthoMCL
3 Results
  3.1 Read data
  3.2 Quality control
    3.2.1 FastQC
    3.2.2 Adapter trimming
    3.2.3 Error correction
    3.2.4 Contamination removal
    3.2.5 Insert size estimation
    3.2.6 BLAST against Caenorhabditis briggsae
    3.2.7 Estimating k-mer size
  3.3 Comparison of assembly
    3.3.1 Assembly statistics comparison
    3.3.2 Biological metrics comparison
