Assembling and Annotating the Genome of the Nematode Caenorhabditis doughertyi

Sinduja Chandrasekaran

MSc Thesis
University of Helsinki, Department of Computer Science
Helsinki, 12/01/16

Faculty/Section: Faculty of Science
Department: Department of Computer Science
Author: Sinduja Chandrasekaran
Title: Assembling and annotating the genome of the nematode Caenorhabditis doughertyi
Subject: Bioinformatics
Level: MSc Thesis
Month and year: January 2016
Number of pages: 51+9 pages

Abstract

Next-generation sequencing (NGS) technologies and advances in bioinformatics methods have made many genomics projects possible. One such project is the Caenorhabditis Genome Project, which aims to generate draft genomes for all known Caenorhabditis species that have not yet been sequenced. Except for C. elegans, the model organism behind discoveries such as the molecular mechanism of cell death and RNA interference, little is known about the other species in this genus. This hinders our understanding of the evolution of C. elegans and its distinct characteristics. The project is therefore an initiative to understand the Caenorhabditis genus as a whole.

The aim of my project was to sequence and annotate the genome of Caenorhabditis doughertyi as part of the Caenorhabditis Genome Project. C. doughertyi is the sister species of C. brenneri, which is known for its high level of polymorphism among eukaryotes. It was first found in Kerala, India, by M.-A. Félix in 2007 and has both male and female adults. Sequencing C. doughertyi would pave the way for understanding the evolution of the high diversity levels observed in C. brenneri.
The raw data consisted of two paired-end libraries with insert sizes of 300 bp and 500 bp and a read length of 125 bp. Read quality was ensured through quality control measures: trimming of adapter sequences, error correction, and removal of DNA from non-target organisms. The reads were then assembled with multiple assemblers, and the ABySS assembly was selected as the best on the basis of assembly metrics such as N50 and biological measures such as CEGMA. The draft genome was then annotated using the MAKER pipeline, and orthologs were identified using OrthoMCL. The resulting draft genome can support preliminary comparative genomic analyses with other species in the genus. Further work may focus on improving this draft assembly towards a publication-quality genome sequence for the species.

ACM Computing Classification System (CCS):
Life and medical sciences -> Computational biology
Life and medical sciences -> Bioinformatics

Keywords: Nematode, Caenorhabditis doughertyi, Genome Assembly, Annotation

Contents

Abbreviations
1 Introduction
  1.1 Caenorhabditis Genome Project
  1.2 Caenorhabditis brenneri
  1.3 Aims and objectives
2 Methodology
  2.1 Quality control
    2.1.1 FastQC
    2.1.2 Adapter trimming
    2.1.3 Error correction
    2.1.4 Blobology
      2.1.4.1 Bowtie
    2.1.5 Filtering reads
    2.1.6 Insert size estimation
    2.1.7 Blobology after contamination removal
    2.1.8 BLAST against Caenorhabditis briggsae
  2.2 Prior to final assembly
    2.2.1 Estimating k-mer size
  2.3 Final assembly
    2.3.1 SPAdes
    2.3.2 Ray
    2.3.3 Velvet
    2.3.4 CLC
    2.3.5 ABySS (Assembly by Short Sequences)
    2.3.6 Trinity
  2.4 Comparison of assemblies
    2.4.1 Assembly statistics
    2.4.2 Biological metrics
      2.4.2.1 Transcript content
      2.4.2.2 Protein content
      2.4.2.3 Core Eukaryotic Gene Mapping Approach (CEGMA)
    2.4.3 Statistical metrics
      2.4.3.1 Assembly Likelihood Evaluation (ALE)
      2.4.3.2 Recognition of Errors in Assembly using Paired Reads (REAPER)
  2.5 Annotation
    2.5.1 Maker pipeline
      2.5.1.1 Repeat masking
      2.5.1.2 Gene Prediction
    2.5.2 Augustus
    2.5.3 OrthoMCL
3 Results
  3.1 Read data
  3.2 Quality control
    3.2.1 FastQC
    3.2.2 Adapter trimming
    3.2.3 Error correction
    3.2.4 Contamination removal
    3.2.5 Insert size estimation
    3.2.6 BLAST against Caenorhabditis briggsae
    3.2.7 Estimating k-mer size
  3.3 Comparison of assembly
    3.3.1 Assembly statistics comparison
    3.3.2 Biological metrics comparison
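
The abstract ranks the candidate assemblies partly by N50. As a rough illustration of what that statistic measures, the sketch below computes N50 from a list of contig lengths using the standard definition: the largest length L such that contigs of length L or more account for at least half of the total assembly length. This is not code from the thesis; the function name `n50` and the example lengths are assumptions made purely for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths (in bp), not data from the thesis.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # prints 70
```

A higher N50 indicates a more contiguous assembly but says nothing about its correctness, which is presumably why the abstract pairs it with a biological measure such as CEGMA.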
