Assembling and annotating the genome of the nematode Caenorhabditis doughertyi

Sinduja Chandrasekaran

Helsinki 12/01/16

MSc. Thesis

University of Helsinki

Department of Computer Science

HELSINGIN YLIOPISTO, HELSINGFORS UNIVERSITET, UNIVERSITY OF HELSINKI

Faculty/Section: Faculty of Science
Department: Department of Computer Science
Author: Sinduja Chandrasekaran
Title: Assembling and annotating the genome of nematode Caenorhabditis doughertyi
Subject: Bioinformatics
Level: MSc. Thesis
Month and year: January 2016
Number of pages: 51+9 pages

Abstract

NGS technologies and the advancement of bioinformatics methodologies have led to the start and success of many projects in genomics. One such project is the Caenorhabditis genome project, aimed at generating draft genomes for all the known and non-sequenced Caenorhabditis species. Except for C. elegans, the model organism responsible for discoveries such as the molecular mechanism of cell death and RNA interference, not much is known about other species in this genus. This hinders our understanding of the evolution of C. elegans and its distinct characteristics. This project is therefore an initiative to understand the Caenorhabditis genus.

The aim of my project was to assemble and annotate the genome of Caenorhabditis doughertyi, as a part of the Caenorhabditis Genome Project. C. doughertyi is the sister species of C. brenneri, which is known for its high level of polymorphism among eukaryotes. It was first found in Kerala, India by MA Felix in 2007 and consists of both male and female adults. The genome of C. doughertyi would pave the way for understanding the evolution of the high diversity levels observed in C. brenneri. The raw data consisted of two paired-end libraries with insert sizes of 300 and 500 bp and read lengths of 125 bp. Read quality was ensured by quality control measures such as trimming of adapter sequences, error correction and removal of DNA from non-target organisms. The reads were then assembled using multiple assemblers, and the ABySS assembly was selected as the best based on metrics such as N50 and biological measures such as CEGMA. The draft genome was then annotated using the MAKER pipeline and orthologs were identified using OrthoMCL.

The obtained draft genome can aid in preliminary comparative genomic analyses with other species in the genus. Further work may focus on improving the quality of this draft assembly towards a publication quality genome sequence for this species.

ACM Computing Classification System (CCS): Life and medical sciences -> Computational biology Life and medical sciences -> Bioinformatics

Keywords: Nematode, Caenorhabditis doughertyi, Genome Assembly, Annotation

Contents

Abbreviations

1 Introduction
1.1 Caenorhabditis Genome Project
1.2 Caenorhabditis brenneri
1.3 Aims and objectives

2 Methodology
2.1 Quality control
2.1.1 FastQC
2.1.2 Adapter trimming
2.1.3 Error correction
2.1.4 Blobology
2.1.4.1 Bowtie
2.1.5 Filtering reads
2.1.6 Insert size estimation
2.1.7 Blobology after contamination removal
2.1.8 BLAST against Caenorhabditis briggsae
2.2 Prior to final assembly
2.2.1 Estimating k-mer size
2.3 Final assembly
2.3.1 SPAdes
2.3.2 Ray
2.3.3 Velvet
2.3.4 CLC
2.3.5 ABySS (Assembly by Short Sequences)
2.3.6 Trinity
2.4 Comparison of assemblies
2.4.1 Assembly statistics
2.4.2 Biological metrics
2.4.2.1 Transcript content
2.4.2.2 Protein content
2.4.2.3 Core Eukaryotic Gene Mapping Approach (CEGMA)
2.4.3 Statistical metrics
2.4.3.1 Assembly Likelihood Evaluation (ALE)
2.4.3.2 Recognition of Errors in Assembly using Paired Reads (REAPR)
2.5 Annotation
2.5.1 Maker pipeline
2.5.1.1 Repeat masking
2.5.1.2 Gene Prediction
2.5.2 Augustus
2.5.3 OrthoMCL

3 Results
3.1 Read data
3.2 Quality control
3.2.1 FastQC
3.2.2 Adapter trimming
3.2.3 Error correction
3.2.4 Contamination removal
3.2.5 Insert size estimation
3.2.6 BLAST against Caenorhabditis briggsae
3.2.7 Estimating k-mer size
3.3 Comparison of assembly
3.3.1 Assembly statistics comparison
3.3.2 Biological metrics comparison
3.4 Gapclosing
3.5 Comparing with the genome of Caenorhabditis brenneri
3.6 Annotation
3.6.1 OrthoMCL

4 Discussion
4.1 Quality control
4.2 Assembly
4.3 Annotation
4.4 Conclusion

5 Acknowledgements

References

Appendix

Abbreviations

NGS Next Generation Sequencing

CEGMA Core Eukaryotic Gene Mapping Approach

QC Quality Control

TAGC Taxon Annotated GC Coverage

ABySS Assembly By Short Sequences

HMM Hidden Markov Model

ALE Assembly Likelihood Evaluation

REAPR Recognition of Errors in Assembly using Paired Reads

HSMM Hidden Semi-Markov Model

1 Introduction

Nematodes, whose Greek-derived name means thread-like, are among the most abundant animals and are capable of adapting to and existing in varying environments, from polar regions to hot springs and from fresh water to salt water [Cog05].

Although the average genome size of a nematode is approximately 50-250 Mb, estimates range from as small as 30 Mb to almost the size of a mammalian genome (~2100 Mb) [Cog05]. Their gene compositions differ greatly because of gene gain and gene loss. The former may result from horizontal gene transfer and gene duplication, while the latter may be due to rapid evolutionary change or gene deletion [RSS13].

The phylum Nematoda consists of three main classes: Chromadoria, Enoplia and Dorylaimia. It comprises both plant and animal parasites as well as free-living worms. Nearly 25,000 species are known in this phylum and nearly one million are estimated to exist. Parasitic nematodes are one of the focuses of nematode research, as they are responsible for diseases such as ascariasis, elephantiasis and river blindness [Bla11]. These nematodes are not only responsible for diseases in millions of people and animals but also cause major destruction in plants; most plant-parasitic nematode species affect the roots of their hosts, reducing their ability to absorb nutrients and water [LB02].

The genus Caenorhabditis belongs to the order Rhabditida. It comprises species which are highly similar in morphology but diverse in ecology. Almost all Caenorhabditis species can survive by feeding on Escherichia coli, though little is known about their nutrient source in a natural environment [KW06]. A well-known species in this genus is C. elegans or “The Worm.” This model organism, with its simple and easy-to-observe life cycle, transparent body, small size and undemanding growth conditions, has been the source of many discoveries such as RNA interference, the use of green fluorescent protein in expression mapping and the mechanism of programmed cell death [GAP+08][EH15][Gri05]. This hermaphrodite species also has the advantage that it does not suffer from inbreeding. C. elegans is the only species to have its complete neural network mapped [Bla11] and was also the first multicellular organism to have its complete genome sequenced. Its genome sequence, published in 1998, was produced using Sanger sequencing and took nearly 5 years to complete [Seq98].

Sanger sequencing has been one of the most accurate and trusted methods, but it is used less frequently because it is slow and expensive. In recent times, next generation sequencing (NGS) technologies have revolutionized genomic studies by increasing speed and reducing cost and manpower. For example, a human genome, which took 10 years and nearly $3 billion to complete using Sanger sequencing in the Human Genome Project [Hum], can now be sequenced in a day for about $1000 using Illumina technology [His]. NGS technologies include Ion Torrent, 454, SOLiD, Illumina and Pacific Biosciences.

Illumina technology uses a sequencing-by-synthesis methodology, in which a fluorescent signal is detected each time a deoxynucleotide triphosphate (dNTP) pairs correctly with the DNA strand. The emission wavelength of the signal and its intensity are used to determine the base call. The methodology can produce both single- and paired-end libraries. It is useful in de novo sequencing and structural variant detection [Seq].

However, NGS technologies have a particular drawback in comparison to Sanger sequencing in terms of their read length. The longer reads of Sanger sequencing lead to improved scaffolding, i.e. improved structural accuracy of a genome. Another major drawback is the errors present in the sequencing data such as mismatches, indels and homopolymer errors depending on the sequencing technology used. For instance, 454 sequencing technologies are more susceptible to indels (insertions and deletions) whereas Illumina sequencing technologies are more prone to single nucleotide substitution errors. These can severely limit the production of an accurate genome and make it extremely difficult to assemble repeat regions and heterozygous sequences in the case of de novo assembly. Thus, quality control is one of the most crucial steps for assembling a genome.

1.1 Caenorhabditis Genome Project

While C. elegans has been extensively studied as a model organism, relatively little is known about other species in the Caenorhabditis genus. As a consequence, many biological questions about C. elegans remain unanswered, for instance which factors shape its genomic composition, repeat content, gene structure and sequence diversity [Cae].

There have been a small number of comparative genomic studies on the Caenorhabditis genus. One such study compared the genomes of C. briggsae and C. elegans; it found that the latter had a higher repeat content and a greater fraction of introns (non-coding sequences) than exons (coding regions of the genome). It was also interesting to note that synteny and chromosomal arrangement were conserved between the species despite their long divergence time [HMB+07]. Thus, it is evident that sequencing further genomes can lead to many interesting questions and discoveries.

The Caenorhabditis Genome Project was initiated to produce draft genomes of the non-sequenced Caenorhabditis species. More than 40 species in the genus are currently available for research and are yet to be sequenced. Apart from these, a number of Caenorhabditis genomes have been sequenced and are available, including C. briggsae, C. brenneri, C. remanei, C. angaria and C. tropicalis. The massive advancements in sequencing are a major factor in generating genomes efficiently and at a faster rate for the project [Cae].

The datasets for each species are generated in a standard format. They are generated as paired-end datasets, i.e. sequencing is carried out from both ends of the DNA molecule. An advantage of paired-end data is the ability to generate libraries with distinct insert sizes. The DNA to be sequenced is inserted between adapters, and the length of this DNA sequence is termed the insert size. The dataset usually contains two insert libraries with sizes of approximately 350 and 550 bases. The dataset also includes stranded RNA sequencing (RNA-Seq) data. In stranded RNA-Seq data, only the sense strand is sequenced, which makes it possible to discern the strand from which the RNA was transcribed and is thus useful for identifying non-annotated genes and non-coding RNA. The genome sequence, genes and annotation are made available in the interim through BADGER [EJB13], a web-based genome browser for storing, searching and visualising genomic data. Complete and stable annotations are submitted to WormBase [CHA+05], a database that serves as the platform to share information on C. elegans and other nematodes for research and educational purposes [Cae].

1.2 Caenorhabditis brenneri

C. brenneri are small, free-living worms that feed mainly on bacteria and other microorganisms. They are mostly found in tropical regions such as Sumatra, India and Costa Rica [Caea]. The species has a genome size of about 135 Mb, which is roughly 25-35 % larger than the genomes of C. briggsae (108 Mb) and C. elegans (100.4 Mb) [FWT+15].

The C. brenneri genome has very high within-species diversity compared to other members of the genus. It has an extremely high level of polymorphism, about 150 times that in humans (Figure 1). This is attributed to mutations accumulating in a species with a very large population size [DCTC13]. A further interesting aspect of the species is that it is morphologically highly similar to C. remanei, even though high divergence is observed within the species [DCTC13].

Figure 1: Nucleotide diversity between species calculated using mean pairwise differences at synonymous sites. C. brenneri is shown to have a high level of diversity, particularly with respect to humans [DCTC13].

These species raise many interesting questions about their morphological similarity and their ability to maintain high allelic variability, making them valuable organisms for evolutionary studies of gene regulation, mutation rate and so on [DCTC13]. Though research has been carried out on C. brenneri and a draft genome is available (WormBase WS230 release), a high-quality final genome of C. brenneri is not yet available.

The draft genome of C. doughertyi could play a pivotal role in answering many evolutionary questions which have arisen whilst studying C. brenneri. It could show whether the high level of polymorphism is present in C. doughertyi as well. If so, it could help to infer the evolutionary background behind the high allelic variability and gene regulation. Though unlikely, if the high diversity were absent in C. doughertyi, it would raise questions on “what makes C. brenneri ideal to host the high level of variability within species?”

1.3 Aims and objectives

The aim of this study was to generate a draft genome assembly and annotation of C. doughertyi as a part of the Caenorhabditis Genome Project. C. doughertyi, otherwise known as C. sp. 10 (JU1333), is the sister species of C. brenneri (Figure 2), and relatively little is known about it. The isolates were collected by MA Felix from Kerala, India in 2007 and consist of both male and female individuals, unlike others in its group such as C. elegans and C. briggsae, which are hermaphrodites.

Figure 2: Phylogenetic tree of Caenorhabditis species. In the elegans supergroup, C. brenneri and Sp10 or C. doughertyi can be clearly seen as sister species [KFA+11].

The raw data were generated on an Illumina HiSeq 2500. The data were paired-end and comprised two libraries with insert sizes of 300 and 500 bases. Initially, quality control was carried out on the raw sequence data to improve read quality. Following this, the reads were assembled using multiple assemblers (Velvet, Ray, ABySS, CLC and SPAdes), and the best assembly was chosen using metrics such as N50 and span of the assembly, as well as biological measures such as the Core Eukaryotic Gene Mapping Approach (CEGMA) and the presence of mitochondrial genes and 18S RNA. The CEGMA pipeline uses a set of core eukaryotic genes that is conserved in almost all eukaryotes and helps in assessing the completeness of a genome assembly. Similarly, 18S RNA and mitochondrial genes were considered for assessing completeness, as they are highly conserved in the phylum Nematoda. Annotation was then carried out on the final assembled draft genome.

2 Methodology

Genome assembly is the process of reconstructing a genome sequence from the reads obtained after fragmenting and sequencing the DNA. The initial step in sequencing is library preparation, in which the DNA is converted into libraries suitable for the instrument. In this step, the genome of interest is fragmented and adapter sequences are added to the DNA fragments. Adapter sequences are short, known DNA sequences that are added to enable PCR amplification.

Sequencing is the process of determining the order of nucleotides in the amplified DNA. Each sequenced stretch of nucleotides is termed a read; there are two types of reads: single-end and paired-end. In single-end sequencing, the fragment is sequenced from one end only, while in paired-end sequencing a second round of sequencing is carried out from the opposite end. The latter is useful in de novo sequencing, while the former is preferred for RNA and ChIP sequencing.

Following the initial quality control steps, adapter trimming and error correction, the reads are assembled. Most tools use a De Bruijn graph for assembly. In this method, the reads are divided into k-mers, which are DNA sequences of k bases. For example, the sequence ATGCT can be divided into the 3-mers ATG, TGC and GCT. Each k-mer is split into its two overlapping (k-1)-mers, which become nodes of the graph, and an edge is drawn between them. Paths through overlapping (k-1)-mers are then joined to form the assembled sequence. The sequences assembled from overlapping reads are termed contigs; contigs are then ordered and joined across gaps to form the scaffolds of the assembly.
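The ATGCT example above can be written out as a small sketch. This is only an illustration of the graph construction on a single sequence; real assemblers build the graph from millions of reads and handle errors, reverse complements and repeats, none of which is attempted here. The function names are illustrative.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(seq, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, one edge per k-mer."""
    graph = defaultdict(list)
    for kmer in kmers(seq, k):
        # Edge from the prefix (k-1)-mer to the suffix (k-1)-mer.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(kmers("ATGCT", 3))      # ['ATG', 'TGC', 'GCT']
print(de_bruijn("ATGCT", 3))  # {'AT': ['TG'], 'TG': ['GC'], 'GC': ['CT']}
```

Following the edges AT, TG, GC, CT and appending the last base of each node spells out ATGCT again, which is the sense in which overlapping (k-1)-mers are joined into a sequence.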

2.1 Quality control

2.1.1 FastQC

The process of assembling a genome begins with identifying whether there is any bias in the raw data, either introduced during sequencing or present in the initial DNA sample. FastQC is a tool that helps in finding bias produced during and before sequencing.

Figure 3: Example of FastQC output for raw data showing an error in k-mer content. The peaks represent the top 6 k-mers that are imbalanced (specified in the top right). The left column shows the measures that were considered. The exclamation mark (yellow) marks measures with a warning, the cross (red) marks failed measures and the tick (green) marks measures that passed.

The raw reads were run through FastQC to generate a quality control report. The report gives us feedback in sections such as adapter content, k-mer content, GC content, per base sequence content, over-represented sequences and others, which it tags as good, bad or gives a warning based on its parameters for good quality reads as shown in Figure 3 [And]. FastQC version 0.11.2 was used.

2.1.2 Adapter trimming

One of the preliminary steps in quality control is the removal of the adapter sequence from the sequenced DNA. This reduces the complexity and proves beneficial in assembling the genome.

Cutadapt is a tool that removes adapter sequences from reads. The output consists of the trimmed reads together with those reads that were not trimmed. Its algorithm is based on semiglobal alignment, in which the sequences can shift relative to one another and penalties are applied only in the region where they overlap. Cutadapt counts each insertion, deletion and mismatch as one error. It first finds all overlaps between the read and the adapter sequence, as shown in Figure 4, and calculates an alignment score for each. It then discards all alignments whose error rate exceeds the specified maximum and, of the remaining ones, keeps the alignment with the maximum score. If several alignments share the maximum score, the one with the lower error rate is chosen [Mar11]. Cutadapt version 1.9 was used to remove the adapter sequence from the reads.

Figure 4: Working of cutadapt: option -a is used to remove 3' adapters, while option -b is used to remove adapters that may occur at either the 3' or the 5' end [Mar11].
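The selection rule described above (discard overlaps above the error rate, keep the highest score, break ties by fewer errors) can be sketched for the 3' case. This is a deliberately simplified illustration and not Cutadapt's implementation: it counts mismatches only and ignores insertions and deletions, the score is simply the number of matching bases, and the parameter names and default values are assumptions chosen for the example.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Remove a 3' adapter from read using mismatch-only overlap scoring."""
    best_key = None    # (matches, -errors) of the best overlap so far
    best_start = None  # read position where that overlap begins
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # remaining overlaps are too short to be trusted
        window = read[start:start + overlap]
        errors = sum(1 for a, b in zip(window, adapter) if a != b)
        if errors > max_error_rate * overlap:
            continue  # error rate above the allowed maximum
        key = (overlap - errors, -errors)
        if best_key is None or key > best_key:
            best_key, best_start = key, start
    return read if best_start is None else read[:best_start]


adapter = "AGATCGGAAGAGC"                  # example adapter prefix
read = "ACGTACGTTT" + "AGATCGGAAG"         # insert followed by partial adapter
print(trim_3prime_adapter(read, adapter))  # ACGTACGTTT
```

A read without any acceptable overlap is returned unchanged, which corresponds to the untrimmed reads in the tool's output.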

2.1.3 Error correction

Sequencing errors such as insertions, deletions and mismatches make the process of genome assembly complicated. They can produce inaccurate genome assembly with false connection between scaffolds.

Quake is a tool used to identify and correct these errors in the reads. The Quake algorithm uses k-mer coverage together with the quality values generated during sequencing. K-mer coverage is the number of times a k-mer occurs in the reads. Quake uses the distribution of k-mer coverage to find a cutoff separating trusted from erroneous k-mers. Reads containing erroneous k-mers are corrected using maximum likelihood, with a model that incorporates the quality values and the rate of miscalling nucleotides. Corrections are carried out until all the k-mers become reliable [KSS10]. Figure 5 explains the error correction through a tree model. Quake version 0.3 was used to error correct the reads.

Figure 5: The figure describes the error correction procedure as a tree. The branches indicate the corrections and the nodes show the corrected reads. Each node has a likelihood associated with it and the algorithm visits every node in the order of decreasing likelihood [KSS10].
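The first step of this procedure, separating trusted from untrusted k-mers by coverage, can be sketched as follows. This is a minimal illustration rather than Quake itself: it uses plain counts instead of quality-weighted counts, the cutoff is supplied by hand rather than estimated from the coverage distribution, and the likelihood-based correction search is not shown. The reads and function names are invented for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, cutoff):
    """Split k-mers into trusted (count >= cutoff) and untrusted sets."""
    trusted = {kmer for kmer, n in counts.items() if n >= cutoff}
    untrusted = set(counts) - trusted
    return trusted, untrusted


# Toy data: the last read carries a single substitution (C -> G).
reads = ["ATGCA", "ATGCA", "ATGCA", "ATGGA"]
trusted, untrusted = split_by_cutoff(kmer_counts(reads, 3), cutoff=2)
print(sorted(trusted))    # ['ATG', 'GCA', 'TGC']
print(sorted(untrusted))  # ['GGA', 'TGG']
```

The two untrusted k-mers both overlap the substituted base, which is how low-coverage k-mers localise the position that needs correcting.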

2.1.4 Blobology

Non-target genomes present along with the genome of interest will reduce the quality of the assembly produced. The presence of multiple genomes may lead to GC bias due to the varying GC content between them. In order to avoid this, one of the crucial steps in genome assembly is removal of information obtained from the non-target genomes.

Figure 6: A) Blobology plot of Caenorhabditis sp 5. The third one (right) is a combination of the first two libraries. Size of the blobs represent the size of the contigs and each colour represents a particular taxon. The unannotated ones are in grey. B) Blobology plot after the removal of the non-target genomes [KJK+13].

The blobology pipeline helps in this process by creating a Taxon Annotated GC Coverage (TAGC) plot from coverage and relative GC content, as shown in Figure 6. The differences in coverage and GC content between species help to separate the non-target genomes from the genome of interest. The pipeline requires a BLAST [AGM+90] output file, an alignment file (BAM format) and an assembly file as inputs, and it outputs a text file which can be used to generate the TAGC plot.

The pipeline works by initially creating an assembly file to avoid dealing with millions of short reads, and reduces complexity for calculating numerical and biological metrics. The preliminary assembly file is created using CLC assembler, following which, the raw reads are mapped back to the assembly using bowtie2 to generate the BAM output. Finally, the best species hit is identified using megaBLAST, by comparing contigs from the assembly to the nt database. The average GC content from the assembly, coverage per contig from the BAM output and taxon information from BLAST output are thus used to produce an output file which can be visualised as a TAGC plot. TAGC plot is a GC content vs coverage plot that is used to display the different genomes present in the data as separate blobs as shown in Figure 6 [KJK+13][Lae].
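The way the three inputs are combined into one row per contig can be sketched as below. This is a hypothetical illustration, not the blobology scripts: the contig names, coverage values and taxon labels are invented, and in the real pipeline they would be parsed from the assembly FASTA, the BAM file and the BLAST output respectively.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tagc_rows(contigs, coverage, taxa):
    """Combine per-contig length, GC, coverage and taxon into plot rows."""
    rows = []
    for name, seq in contigs.items():
        rows.append((
            name,
            len(seq),
            round(gc_fraction(seq), 3),
            coverage.get(name, 0.0),           # mean read coverage
            taxa.get(name, "Not annotated"),   # best BLAST hit, if any
        ))
    return rows


# Invented example values, for illustration only.
contigs = {"contig_1": "ATGCGC", "contig_2": "ATATNNAT"}
coverage = {"contig_1": 42.0, "contig_2": 310.5}
taxa = {"contig_1": "Nematoda"}
for row in tagc_rows(contigs, coverage, taxa):
    print(row)
```

Plotting the GC column against the coverage column, coloured by the taxon column, gives the blobs of a TAGC plot.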

2.1.4.1 Bowtie

The Bowtie alignment program is used to align short DNA reads to larger genomes. It indexes the reference genome using the Burrows-Wheeler transform, a reversible permutation of the text that is also used in data compression. Bowtie introduces two techniques for aligning short reads: a backtracking algorithm and double indexing. The former permits mismatches and thus produces high-quality alignments, while the latter prevents excessive backtracking [LTPS09].
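The Burrows-Wheeler transform itself is easy to state on a short string. The sketch below uses the naive definition (sort all rotations, take the last column) and a naive inverse; it is meant only to show what the transform is. Bowtie builds a compressed index on top of the transform and does not construct it this way, and the `$` terminator is a common convention assumed here.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive version)."""
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("ACAACG")
print(transformed)               # GC$AAAC
print(inverse_bwt(transformed))  # ACAACG
```

The transform groups identical characters that share a following context, which is what makes it useful for compression, and its reversibility is what allows the original genome to be searched through the index.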

2.1.5 Filtering reads

To filter the reads that were either not classified as nematode or not annotated in the TAGC plot, a BLAST search of those reads was performed against the nt database. The most commonly observed non-target genome was bacterial. To filter out the bacterial reads, the genome of the Escherichia coli strain was obtained from NCBI and the C. doughertyi reads were mapped to it using Bowtie2. The reads mapping to the E. coli strain were then removed.

2.1.6 Insert size estimation

A Perl script by Dr. Kumar from Prof. Blaxter's lab was used to verify the insert sizes of both libraries.

2.1.7 Blobology after contamination removal

Following the removal of the non-target genomes, the blobology plot was regenerated from both libraries combined to confirm that the non-target genomes had been completely removed.

2.1.8 BLAST against Caenorhabditis briggsae

To identify the unannotated reads, a BLAST search was performed against the genome of C. briggsae. To analyse the span of the reads remaining after annotation with C. briggsae, a blobology plot was generated using the BLAST result.

2.2 Prior to final assembly

2.2.1 Estimating k-mer size

To generate assemblies, it is important to choose an appropriate k-mer size: a small k makes reconstruction of the assembly difficult, while a large k not only requires a large amount of space but also reduces the chance of finding the k-1 overlaps needed to build the De Bruijn graph. KmerGenie is a tool that generates histograms of k-mer abundances to help the user choose a suitable value [CM14].

Abundance for a particular k-mer is the number of times it occurs in the multiset of k-mers (e.g. for k = 3 and seq = ATGTA, the multiset is {ATG, TGT, GTA}). Figure 7 depicts abundance histograms for different values of k on a human chromosome. The KmerGenie algorithm considers a parameter ε indicating unique k-mers sampled. Using a hash function, all the k-mers are divided into ε bins. The abundance histogram is then built by scaling k-mers for a given abundance of ε and calculating the k-mer count. Following this, KmerGenie fits a generative model to this histogram to identify the unique k-mers which are error free, otherwise termed genomic k-mers. Finally, the value of k which yields the highest number of genomic k-mers is chosen [CM14]. KmerGenie version 1.6982 was used.

Figure 7: The abundance histogram for chr14 (human) with k value 21, 41 and 81. It can be clearly seen that k = 21 has a higher abundance in comparison to k = 81 and k = 41 [CM14].
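An abundance histogram of the kind shown in Figure 7 can be computed exactly for small inputs, as sketched below. This is not KmerGenie's method: the sketch counts every k-mer, whereas the tool works on a sample and then fits a model to the histogram. The reads are invented, with the first one taken from the ATGTA example above.

```python
from collections import Counter


def abundance_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers having it."""
    abundance = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return dict(sorted(Counter(abundance.values()).items()))


print(abundance_histogram(["ATGTA"], 3))           # {1: 3}
print(abundance_histogram(["ATGTA", "ATGTT"], 3))  # {1: 2, 2: 2}
```

In the second call ATG and TGT each occur twice while GTA and GTT occur once, so two distinct k-mers have abundance 2 and two have abundance 1.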

2.3 Final assembly

The final assembly of the reads was carried out using five different assemblers and the best assembly was selected using statistical and biological metrics. The five assemblers are SPAdes, Ray, CLC, Velvet and ABySS.

2.3.1 SPAdes

SPAdes assembler performs a four-stage process in assembling reads. The first stage is the De Bruijn graph simplification wherein multi-size De Bruijn graph is used that allows back tracking of graph operations. The remaining three stages are k-bimer adjustment, constructing paired assembly graph and contig construction [BNA+12].

2.3.2 Ray

The Ray assembler uses local coverage distributions of k-mers during parallel assembly and is hence generally suitable for de novo assembly [BRG+12].

2.3.3 Velvet

The Velvet assembler is suitable for high-coverage short-read data. To create the assembly, a De Bruijn graph is constructed, after which errors are corrected by merging sequences which occur consecutively. Sequences sharing local overlaps are then separated to remove repeats [ZB08].

2.3.4 CLC

The CLC assembler first identifies unique words of a specified length and then builds a De Bruijn graph whose nodes are those unique words. This graph is then simplified to resolve repeat regions with the help of significantly overlapping reads. Scaffolding is carried out by calculating the distances between contigs. The gaps between contigs are then closed, starting with those separated by the shortest distance. Contigs within a scaffold are connected by runs of N and reported [Ass].

2.3.5 ABySS (Assembly by Short Sequences)

ABySS is a parallelized sequence assembler. Like other assemblers, it forms a simplified De Bruijn graph. The assembly is then carried out in two steps: first, contigs are extended until they reach an end or cannot be extended further because of insufficient coverage; second, the remaining ambiguities are resolved using paired-end information. The workflow of the assembler is depicted in Figure 8 [SWJ+09].

Figure 8: ABySS assembler workflow. 1-2: The assembler initially identifies unique k-mers from the reads and constructs a De Bruijn graph. 3-4: The branches containing read errors are trimmed and the graph is further simplified using information from overlapping reads. 5-6: k-mers (nodes on the graph) that overlap by k-1 bases are joined. Reads that map back to single and multiple contigs help to infer fragment size distribution and inter-contig distance, respectively [SWJ+09].

2.3.6 Trinity assembly

RNA sequence data of the genome was initially trimmed for adapters using cutadapt. The transcriptome assembly of the RNA sequence data was generated through Trinity [GHY+11].

Trinity combines three software modules: Inchworm, Chrysalis and Butterfly. Inchworm is responsible for producing unique transcript contigs from the reads. Chrysalis then clusters the contigs and constructs a De Bruijn graph for every cluster. Finally, Butterfly traces paths through the graphs in parallel to produce the complete transcripts. Figure 9 shows the working of Trinity using the three programs [GHY+11].

Figure 9: Working of Trinity using Inchworm, Chrysalis and Butterfly. a) Inchworm is shown to derive unique transcripts to form the De Bruijn graph. b) Chrysalis then simplifies the graph, identifying overlapping reads with k-1 bases. c) Butterfly traces back along the graph to identify the transcript [GHY+11].

2.4 Comparison of assemblies

2.4.1 Assembly statistics

Scaffold_stats.pl is a Perl program developed by Dr. Kumar from Prof. Blaxter's lab. It reports assembly statistics such as N50, Num N50, span and GC content. N50 is defined as the length for which contigs or scaffolds of at least this length contain 50% of the entire assembly. It can be considered a length-weighted median of the contig or scaffold lengths [RG14]. Num N50 specifies the number of contigs that have a length greater than or equal to the N50 value.
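The two definitions can be made concrete with a short sketch. This follows the definitions given above and is not the code of Scaffold_stats.pl, whose implementation is not reproduced here; the contig lengths are invented for the example.

```python
def n50_stats(lengths):
    """Return (N50, Num N50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            n50 = length  # shortest contig needed to cover half the span
            break
    num_n50 = sum(1 for length in ordered if length >= n50)
    return n50, num_n50


lengths = [80, 70, 50, 40, 30, 20, 10]  # span = 300
print(n50_stats(lengths))               # (70, 2)
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 300-base span, so N50 is 70 and Num N50 is 2, even though the ordinary median length is 40.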

2.4.2 Biological metrics

2.4.2.1 Transcript content

The transcript content present in the genome was measured with a 70% cutoff [HWL+07]. To verify this, the assembly from each assembler was aligned against the transcriptome generated by Trinity using BLAT (BLAST-Like Alignment Tool). BLAT works similarly to BLAST, searching for segments of the database and query sequences whose matches are close to a particular cutoff. These are then extended in both directions to produce the output with the highest score. BLAT is considerably faster because it indexes the non-overlapping k-mers of the genome; the overlapping k-mers of the query are then searched against this index to produce the results.

2.4.2.2 Protein content

The protein content present in the genome was measured with a 70% cutoff. To verify this, a BLAST database was generated for each of the assemblies produced by the five assemblers. These were then compared to the proteome of C. briggsae using TBLASTN, and the protein content and span were determined.

2.4.2.3 Core Eukaryotic Gene Mapping Approach (CEGMA)

The Core Eukaryotic Gene Mapping Approach (CEGMA) was run on the assemblies to identify the percentage of core proteins present in each. It identifies orthologs of essential proteins (458 core proteins). The procedure takes the essential proteins from six model organisms (Homo sapiens, Drosophila melanogaster, Arabidopsis thaliana, C. elegans, Saccharomyces cerevisiae and Schizosaccharomyces pombe) and uses TBLASTN to find the regions of the given genome that are similar to them. Using GeneWise, HMMER (Hidden Markov Modeller) [SSMW06] and geneid, the gene structures are then filtered as shown in Figure 10. This process is repeated until the final gene structures are obtained [PBK07].

Figure 10: Workflow of CEGMA. Protein profiles are first built from the core proteins using hidden Markov models (HMM), after which they are compared with the genome of interest using GeneWise. The initial gene structures produced are then further analysed using protein alignments and the HMM to produce the final gene structures [Adapted from PBK07].

2.4.3 Statistical metrics

2.4.3.1 Assembly Likelihood Evaluation (ALE)

Assembly Likelihood Evaluation (ALE) is a statistical methodology used to measure the consistency of an assembly, given the read data. The ALE statistical measure was used to compare the assemblies generated. The ALE score does not require a reference genome; it is computed from the assembly and the reads mapped to it. It is a logarithmic value which signifies the probability of an assembly being correct. By Bayes' rule, the probability is given by,

P(S|R) = [P(R|S) × P(S)]/Z

where

Z – proportionality constant

P(R|S) – probability that the set of reads (R) was generated from an assembly (S)

P(S) – probability of assembly S without reads

P(S) is calculated using the k-mer distribution, and P(R|S) can be computed using read quality, mate-pair orientation, insert length and sequencing depth. Figure 11 presents the factors that contribute toward the final score. The ALE score can also be used for comparing different assemblies of the same read data; a higher score is suggestive of a better assembly [CEFW13].
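Taking logarithms of Bayes' rule separates the factors. This shows why the score is a (negative) logarithmic value, and why two assemblies of the same reads can be compared without knowing Z. The lines below are a worked restatement under the assumption that Z depends only on the reads; S_1 and S_2 are illustrative names for two candidate assemblies.

```latex
\begin{align}
\log P(S \mid R) &= \log P(R \mid S) + \log P(S) - \log Z \\
\log P(S_1 \mid R) - \log P(S_2 \mid R)
  &= \bigl[\log P(R \mid S_1) + \log P(S_1)\bigr]
   - \bigl[\log P(R \mid S_2) + \log P(S_2)\bigr]
\end{align}
```

The term log Z cancels in the difference, so only the relative order of the scores matters when ranking assemblies.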

Figure 11. The factors contributing to the total ALE score. The assembly and the reads produce the alignment file. This helps in generating four factors responsible for the ALE score: the k-mer score from the assembly, the placement score obtained by comparing the mapped base with the probabilistically calculated read position, the insert score based on the length distribution of the inserts, and the depth score, which takes the GC content bias into account [Adapted from CEFW13].

2.4.3.2 Recognition of Errors in Assembly using Paired Reads (REAPR)

The Recognition of Errors in Assembly using Paired Reads (REAPR) pipeline is used to assess the accuracy of an assembly and to identify inconsistencies in it. The assemblies were run through the REAPR pipeline as specified in Figure 12 to identify the best possible assembly. The input to the pipeline is a BAM file of high-quality read alignments and an assembly in FASTA format. It begins by calculating statistical measures such as depth of coverage and fragment length. Depth of coverage indicates the number of times a region of the genome has been sequenced. A fragment, as considered by the REAPR pipeline, is the region that lies between the ends of a read pair. This is followed by estimating statistical measures such as read depth, inner fragment coverage, error in inner fragment coverage, amount of soft clipping and fragment coverage distribution (FCD) error. At every base, REAPR builds an FCD plot that indicates the number of fragments mapping to that particular base; the area between the observed and the ideal fragment coverage distribution is known as the FCD error. This is helpful in identifying errors and in scoring each base. Soft-clipped bases are those at the ends of a read that are kept in the alignment record but are not part of the alignment itself. REAPR then outputs the errors and warnings found in the assembly. It also generates a new assembly by breaking the original at the locations containing errors [HKS+13].

Figure 12. The REAPR pipeline above shows the different steps involved in calculating the REAPR score and identifying errors. This involves calculating read coverage, fragment coverage and FCD error from the BAM file. This aids in calculating the score of each of the bases [HKS+13].
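The difference between read depth and fragment coverage can be made concrete with a small Python sketch. It is not REAPR code: the function name, the half-open interval convention and the toy read pairs are hypothetical, and a fragment is assumed here to span from the outer start of the first read to the outer end of the second read.

```python
def per_base_coverage(intervals, length):
    """Count how many half-open intervals [start, end) cover each base."""
    coverage = [0] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            coverage[pos] += 1
    return coverage


# Hypothetical read pairs on a 12 bp contig, given as (read1, read2) intervals
pairs = [((0, 3), (7, 10)), ((2, 5), (9, 12))]
reads = [read for pair in pairs for read in pair]
fragments = [(read1[0], read2[1]) for read1, read2 in pairs]

read_depth = per_base_coverage(reads, 12)
fragment_coverage = per_base_coverage(fragments, 12)
print(read_depth)         # [1, 1, 2, 1, 1, 0, 0, 1, 1, 2, 1, 1]
print(fragment_coverage)  # [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
```

Positions 5 and 6 are not covered by any read but are still spanned by both fragments, which is why fragment coverage can support a region where read depth alone cannot.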

2.5 Annotation

2.5.1 MAKER pipeline

The MAKER annotation pipeline divides the process into five main steps: compute, filter/cluster, polish, synthesis and annotate.

Compute: The compute step involves masking of repeats and carrying out alignments that aid in the annotation process [CKR+08]. The software tools used to carry out these steps are the following:

2.5.1.1 Repeat masking

A significant portion of the genome consists of repeats. They can be classified as low-complexity repeats and interspersed repeats. The former consist of tandem repeats and the latter of transposons and retrotransposons. Low-complexity repeats can produce statistically significant but spurious alignments to low-complexity protein regions. Similarly, the interspersed repeats create problems for ab initio gene finders because they contain protein-coding genes of their own. This makes masking of repeats a crucial step in generating an annotation.

The MAKER pipeline carries out this process in two steps. Initially, repeats that match the records in the RepBase library are identified. Species-specific repeat libraries can also be supplied. This step is carried out by a program called RepeatMasker.

To identify repeats that eluded RepeatMasker, RepeatRunner is used. It identifies transposable elements using its protein database. Because protein sequences diverge more slowly than nucleotide sequences, it can identify highly diverged repeats. During the analysis, complex repeats are hard-masked, that is, the repeats are replaced with the letter “N”. Likewise, simple repeats are soft-masked, that is, the repeats are converted to lower case [TC02].
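Hard- and soft-masking as described above can be illustrated with a short Python sketch. The function name mask, the interval coordinates and the toy sequence are hypothetical; in practice the intervals would come from the repeat-finding tools.

```python
def mask(sequence, intervals, hard):
    """Mask half-open intervals: 'N' if hard, lower case if soft."""
    bases = list(sequence.upper())
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, len(bases))):
            bases[pos] = "N" if hard else bases[pos].lower()
    return "".join(bases)


seq = "ACGTATATATGGCCTTAGC"  # hypothetical sequence
print(mask(seq, [(4, 10)], hard=False))  # ACGTatatatGGCCTTAGC
print(mask(seq, [(12, 16)], hard=True))  # ACGTATATATGGNNNNAGC
```

Soft-masking keeps the underlying bases available to downstream tools, whereas hard-masking removes them from consideration entirely.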

2.5.1.2 Gene prediction

The MAKER pipeline then runs ab initio gene predictors, which use mathematical models of intron/exon structure to generate gene models.

• Semi-HMM-based Nucleic Acid Parser (SNAP)

The Semi-HMM-based Nucleic Acid Parser (SNAP) gene finder creates mathematical models of the genomic DNA using Hidden Markov Models (HMM) as shown in Figure 13. It consists of six intron states which help in predicting genes with stop codons. SNAP considers each strand individually, thus allowing genes on opposite strands to overlap. The key advantages of SNAP are its parameter file, which can be altered to create an HMM suited to the genomic features of interest, and the flexibility to fix the length of the weight matrices and to store them in an array or a decision tree. Introns can be given an explicit length distribution up to a certain distance, although this is computationally demanding [Kor04].

Figure 13: Hidden Markov model of SNAP. Shape=states; arrows=transitions; Es=single-exon gene; Ei=initial exon; Et=terminal exon [Kor04].

• GeneMark-ES

The GeneMark-ES gene prediction software begins by initializing parameters for the Hidden Semi-Markov Model (HSMM). The HSMM has hidden states for initial, internal and terminal exons, introns and single-exon genes. The software then divides the genomic sequence into coding and non-coding regions and labels them accordingly. These labelled segments are then used as a training set to re-estimate the parameters of the HSMM. The process is repeated until convergence as shown in Figure 14 [Lom05].

Figure 14: Workflow of GeneMark-ES. It initially takes a genomic sequence and runs genemark.hmm to identify genes. The model parameters are then re-estimated after the training set has been updated through the HMM. This procedure is repeated until convergence, when the final predicted gene list is provided [Adapted from Lom05].

Filter and cluster: The filter and cluster step removes marginal predictions and alignments on the basis of scores and percentage identities. These criteria can be modified by the user. Following the filtering step, the data are clustered. Clustering has two main objectives: to group diverse computational data that support the same gene, and to provide continuous evidence.

Polish: In the polish step, the BLAST hits are realigned using Exonerate to attain higher accuracy at exon boundaries; Exonerate also provides information on splice donors and acceptors. The threshold for the BLAST hits in this step can be modified by the user.

Synthesis: In the synthesis step, information from the polished data and the clustered EST and protein alignments is taken to provide evidence for annotation.

Annotate: Finally during the annotate step, MAKER combines the evidence provided in the previous step with information from gene predictors to give a complete annotation [CKR+08].

The entire process is described through the flowchart in Figure 15.

Figure 15: Flowchart depicting the MAKER pipeline. The MAKER pipeline begins with masking the repeats of the genomic sequence. Gene identification is carried out on the masked genome. The BLAST results are filtered based on scores, and the data obtained are then clustered. These data, along with the gene identification results, are used to produce the annotation [Adapted from CKR+08].

2.5.2 AUGUSTUS

AUGUSTUS is used for ab initio gene prediction in eukaryotes. It is based on a generalized HMM. Introns and exons are considered as states of the model, each thought to generate DNA sequences with specified emission probabilities. The model comprises 47 states, of which 23 model the reverse strand, while the remaining 24 model the forward strand of DNA. The model describes probabilistically the coding and non-coding regions of the genome, the different regions of a gene, and the length distribution of introns. AUGUSTUS outputs predicted genes for the forward strand, the reverse strand or both, for a given input DNA sequence [SSWM04].

2.5.3 OrthoMCL

In genome annotation, identifying orthologous groups plays a crucial role and helps in comparative genomics and in studying gene evolution. OrthoMCL is a methodology used for finding orthologous groups in eukaryotes (Figure 16). Initially, an all-against-all BLASTP search of the protein sequences from the genomes considered is performed, following which orthologs between pairs of genomes are found as reciprocal best similarity pairs. For every ortholog identified, recent paralogs are found as sequences within the same genome that are more similar to each other than to any sequence from another genome. The ortholog and paralog relationships are then represented as a graph, with the nodes representing the protein sequences and the edges the relations between them. A weighting procedure is then carried out on the graph to account for the higher similarity observed among recent paralogs in comparison to orthologs. The graph is represented as a similarity matrix to which the MCL algorithm is applied. This helps in removing paralogs that are not recent and more distantly related orthologs, and finally produces clusters of recent paralogs and orthologs [LSR03].

Figure 16: Methodology/Algorithm used in OrthoMCL. An initial all-against-all BLAST of the proteomes of interest helps in identifying similarity between and within the species. This produces a similarity matrix. Clustering is carried out on the matrix to identify orthologous groups of the species of interest [Adapted from LSR03].
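The reciprocal best similarity step can be sketched in Python as follows. This covers only the pairing of putative orthologs between two genomes; paralog detection, edge weighting and MCL clustering are not shown. The function names, protein identifiers and scores are hypothetical, and higher scores are assumed to mean greater similarity.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_pairs(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between proteins of genome A and genome B
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90, ("a2", "b2"): 150, ("a3", "b1"): 120}
b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 151, ("b1", "a3"): 119}
print(reciprocal_best_pairs(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

Protein a3 has b1 as its best hit, but b1 prefers a1, so the pair (a3, b1) is not reciprocal and is left out.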

3 Results

3.1 Read data

The initial read data for C. doughertyi were generated as paired-end reads on an Illumina HiSeq. They consist of forward and reverse read files for libraries with insert sizes of 300 and 500 bp. The reads have a length of 125 bp each. The raw read information is provided in Table 1.

Library                      Type  Instrument  Insert size  Length  No. of reads  No. of bases
forward_300_1.sanfastq.gz    PE    HiSeq       300          125     39 276 434    4 909 554 250
reverse_300_2.sanfastq.gz
forward_500_1.sanfastq.gz    PE    HiSeq       500          125     36 004 149    4 500 518 625
reverse_500_2.sanfastq.gz

Table 1: The raw read data for C. doughertyi.

3.2 Quality control

3.2.1 FastQC

The reads in the libraries with 300 and 500 bp insert sizes were subjected to quality control using the FastQC tool. The report generated showed warnings for per base sequence content and sequence length distribution. An error was reported for per sequence GC content, a module that checks whether the GC content of the reads follows a normal distribution. Figure 17 clearly depicts the difference between the expected GC content (blue curve) and the observed GC content (red curve), not only in the peak region, but also in the downward curve at about 54-59% mean GC content. This indicates the presence of non-target genome.

Figure 17: Per sequence GC content of C. doughertyi.
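The quantity plotted in a per sequence GC content graph can be sketched in Python as a histogram of the GC percentage of each read. This is not FastQC code: the function names and toy reads are hypothetical, and ignoring N bases is a choice made for this sketch.

```python
def gc_percent(read):
    """GC content of one read as a percentage, ignoring uncalled bases."""
    called = [base for base in read.upper() if base in "ACGT"]
    if not called:
        return 0.0
    gc_count = sum(base in "GC" for base in called)
    return 100.0 * gc_count / len(called)


def gc_histogram(reads):
    """Count reads per rounded GC percentage (indices 0 to 100)."""
    counts = [0] * 101
    for read in reads:
        counts[round(gc_percent(read))] += 1
    return counts


# Hypothetical toy reads; real reads would come from the FASTQ files
toy_reads = ["ACGT", "GGCC", "ATAT", "GCNA"]
histogram = gc_histogram(toy_reads)
print([(gc, n) for gc, n in enumerate(histogram) if n])
```

A single-species library is expected to give one roughly bell-shaped peak; an additional shoulder or peak at a different GC percentage, as seen at about 54-59%, suggests reads from another organism.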

An error was also reported for the k-mer content of the reads. FastQC identifies k-mers (7-mers) that show positional bias and presents the distribution of the six most biased k-mers in a graph. Figure 18 shows these six k-mers, which are clearly over-represented along almost the entire length of the read. The strongest over-representation is observed at read positions 11-19. This clearly indicates the need for adapter trimming and error correction before assembly.

Figure 18: K-mer distribution of the top six biased 7-mers in C. doughertyi.

3.2.2 Adapter trimming

Trimming of adapters is one of the crucial steps to ensure a good assembly. The tool used for trimming was Cutadapt. The adapter sequence used was the TruSeq universal adapter.

Adapter sequence: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

3.2.3 Error correction

The k-mer content error reported by FastQC can be corrected using Quake, an error correction tool. The tool requires uniform coverage. A k-mer size of 19 was used for error correction.

Library                      No. of reads   No. of bases    No. of reads   No. of bases    %
                             (before)       (before)        (after)        (after)
forward_300_1.sanfastq.gz    39 276 434     4 909 554 250   37 981 251     4 690 785 110   95.5
reverse_300_2.sanfastq.gz
forward_500_1.sanfastq.gz    36 004 149     4 500 518 625   33 863 414     4 180 396 266   92.8
reverse_500_2.sanfastq.gz

Table 2: Read data after trimming and error correction (before = before trimming and error correction; after = after trimming and error correction).

Following error correction and adapter trimming of the C. doughertyi raw data, about 4.5% and 7% of the bases (about 3% and 6% of the reads) were removed from the read libraries with insert sizes of 300 and 500 bp, respectively, as presented in Table 2.
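The percentages in Table 2 can be recomputed from the read and base counts with a few lines of Python. The counts are taken from Table 2; the dictionary layout and the function name percent_retained are only illustrative.

```python
# (before, after) counts for each library, taken from Table 2
libraries = {
    "300 bp": {"reads": (39_276_434, 37_981_251),
               "bases": (4_909_554_250, 4_690_785_110)},
    "500 bp": {"reads": (36_004_149, 33_863_414),
               "bases": (4_500_518_625, 4_180_396_266)},
}


def percent_retained(before, after):
    """Percentage of the original count that remains after filtering."""
    return 100.0 * after / before


for name, counts in libraries.items():
    reads_kept = percent_retained(*counts["reads"])
    bases_kept = percent_retained(*counts["bases"])
    print(f"{name}: {reads_kept:.1f}% of reads, {bases_kept:.1f}% of bases retained")
```

The base-level values reproduce the last column of Table 2 to within rounding, which suggests that the column reports retained bases rather than retained reads.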

3.2.4 Contamination removal

To remove the non-target DNA from the genome, the Taxon Annotated GC Coverage (TAGC) plot was generated (Figure 19).

Figure 19: Blobology plot of the genome to verify the presence of non-target genomes (GC content vs coverage). The non-target genomes listed by the plot include Proteobacteria (pink blob), Chordata (green blob), Streptophyta (yellow blob), Arthropoda (brown blob) and Hemichordata (grey blob). The Proteobacteria can be clearly seen as a pink peak in the span region of the plot. a) blobology for insert size of 300 bp; b) blobology for insert size of 500 bp.

The non-target taxa listed in the blobology plot, such as Chordata, Arthropoda, Hemichordata, Streptophyta and Ascomycota, were identified to be nematode sequences through BLAST search. This could also be inferred from the similar coverage observed for all the non-target phyla except Proteobacteria. They were presented as non-target genomes in the plot despite being nematode sequences, because only the best BLAST hit is considered when generating the TAGC plot and because of mis-annotation in databases.

The main non-target genome observed was that of Proteobacteria. These sequences were identified to be E. coli and removed with the help of the TAGC plot. The 300 and 500 bp insert-size libraries were then merged and the TAGC plot was generated again (Figure 20) to ensure the absence of non-target genomes.

Figure 20: Blobology plot after removing contaminants and merging the 300 and 500 bp insert sizes. The orange blob represents nematodes and the light grey blob the non-annotated regions. The other blobs indicated (green, pink, yellow, brown and dark grey) were also found to be nematodes. The blob observed at a coverage of about 5000 and a GC content of 0.25 was found to be the mitochondrial genome.

3.2.5 Insert size estimation

The insert sizes were estimated for the reads after contamination removal, using a Perl script by Dr Sujai Kumar from Prof. Blaxter's lab. The script requires the mapping file (SAM/BAM) of the interleaved reads. The insert size is estimated from the read pair information, which distinguishes pairs in which both reads are mapped, only one read is mapped, or neither read is mapped. The results showed insert sizes of around 300 and 500 bp without any irregularities, as shown in Figure 21.

Figure 21: Insert size estimation of the libraries. a) Histogram of the 300 bp insert size library; b) histogram of the 500 bp insert size library. [The histograms on the left show read pairs mapping in the FR orientation, i.e. the end mapping to the smaller coordinate is on the forward strand, and the histograms on the right show the RF orientation, i.e. the end mapping to the smaller coordinate is on the reverse strand.]
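One way to estimate insert sizes from a SAM file is sketched below in Python. It is not the Perl script used in this work: it simply collects the absolute template length (the TLEN column of the SAM format) once per pair in which both reads are mapped, using the standard SAM flag bits. The function name and the toy SAM records are hypothetical.

```python
from statistics import median


def insert_sizes(sam_lines):
    """Collect |TLEN| once per read pair in which both reads are mapped."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        paired = bool(flag & 0x1)
        read_unmapped = bool(flag & 0x4)
        mate_unmapped = bool(flag & 0x8)
        first_in_pair = bool(flag & 0x40)
        if paired and not read_unmapped and not mate_unmapped \
                and first_in_pair and tlen != 0:
            sizes.append(abs(tlen))
    return sizes


# Hypothetical SAM records (QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL)
toy_sam = [
    "@SQ\tSN:ctg1\tLN:5000",
    "r1\t99\tctg1\t100\t60\t125M\t=\t275\t300\t*\t*",
    "r1\t147\tctg1\t275\t60\t125M\t=\t100\t-300\t*\t*",
    "r2\t83\tctg1\t700\t60\t125M\t=\t510\t-315\t*\t*",
    "r3\t73\tctg1\t900\t60\t125M\t=\t900\t0\t*\t*",
]
sizes = insert_sizes(toy_sam)
print(sizes, median(sizes))  # [300, 315] 307.5
```

Counting only the first read of each pair avoids using the same fragment twice, and pairs with an unmapped mate (r3 in the toy data) do not contribute to the estimate.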

3.2.6 BLAST against Caenorhabditis briggsae

Following contamination removal, the genome was compared to C. briggsae using TBLASTN and the blobology plot was generated. The plot (Figure 22) showed an increase in the span of the nematode sequences from 90.94 to 111.69 Mb, thus helping to annotate a part of the non-annotated region. The decrease in the span peak (grey peak) of the non-annotated regions can be clearly seen in the plot.

Figure 22: Blobology plot following BLAST against C. briggsae. The orange blob represents nematodes and the light grey blob the non-annotated regions. The other blobs indicated (green, pink, yellow, brown and dark grey) were also found to be nematodes.

3.2.7 Estimating k-mer size

The KmerGenie tool was used to identify the optimum k-mer size. It gave an optimum k value of 111, as shown in Figure 23.

Figure 23: KmerGenie report predicting the optimum k-mer size. The optimum (intersection of the red dotted lines) lies between k = 100 and k = 120, where the maximum number of genomic k-mers is observed.

3.3 Comparison of assemblies

The genome assemblies generated by the assemblers Ray, SPAdes, Velvet, ABySS and CLC were then compared using various parameters to identify the best assembly.

3.3.1 Assembly statistics comparison

Metric                  Ray          ABySS         SPAdes        Velvet        CLC
No. of scaffolds        41 566       18 529        14 286        20 610        11 535
Span of scaffolds (bp)  68 908 845   142 916 954   129 560 989   138 339 916   125 639 977
Min (bp)                443          200           200           221           200
Mean (bp)               1 657        7 713         9 069         6 712         10 892
N50                     2 808        67 490        36 818        84 284        37 665
Num N50                 6 714        553           890           441           840
GC                      0.3710       0.3740        0.3740        0.3730        0.3700

Table 3: Statistical parameters computed for scaffolds longer than 200 bp.

Metric                  Ray          ABySS         SPAdes        Velvet        CLC
No. of scaffolds        18 171       7 309         8 347         5 121         7 426
Span of scaffolds (bp)  53 426 790   138 768 940   126 704 667   132 983 113   123 753 414
Min (bp)                1 000        1 000         1 000         1 000         1 000
Mean (bp)               2 940        18 986        15 179        25 968        16 664
N50                     3 752        71 009        37 996        88 720        38 617
Num N50                 4 328        523           852           410           815
GC                      0.3690       0.3740        0.3730        0.3730        0.3730

Table 4: Statistical parameters computed for scaffolds longer than 1000 bp.

The scaffolds of the five assemblers were compared using statistical metrics such as N50, GC content, and the number and span of scaffolds, as shown in Tables 3 and 4. For the longer scaffolds (Table 4), Velvet produced the fewest scaffolds (5 121), followed by ABySS (7 309), but the span of the scaffolds was higher in ABySS than in Velvet. When all scaffolds longer than 200 bp are considered (Table 3), ABySS had fewer scaffolds than Velvet despite its higher span. The N50 value was also notably higher for Velvet and ABySS than for SPAdes and CLC, while Ray gave a very poor N50 value for both the small and the long scaffold sets (Tables 3 and 4).

Metric                  Ray          ABySS         SPAdes        Velvet        CLC
Longest contig (bp)     23 243       383 555       417 672       123 575       257 293
No. of contigs          52 884       26 192        14 530        48 121        18 908
Span of contigs (bp)    64 858 328   142 353 954   129 542 973   134 037 996   124 710 448
Min (bp)                443          110           200           100           112
Mean (bp)               1 226        5 435         8 915         2 785         6 595
N50                     1 437        37 381        35 143        8 262         23 563
Num N50                 14 285       966           932           4 077         1 302
GC                      0.3710       0.3740        0.3740        0.3730        0.3730

Table 5: Statistical metrics for contigs longer than 100 bp.

As seen from Table 5, SPAdes had the longest contig, closely followed by ABySS. The longest contig in Ray was comparatively short, and a similar result was obtained for the span of the contigs. However, the number of contigs was highest in Ray, indicating the presence of many short contigs in the assembly. Though ABySS had fewer contigs than Velvet, it had the highest span of all five assemblers. The N50 value for contigs was very low in Ray and Velvet, while ABySS and SPAdes had similar high values, followed by CLC.

Metric        Ray         ABySS     SPAdes   Velvet      CLC
No. of N's    11 318      7 663     244      28 311      7 499
Span of N's   4 050 517   563 000   18 016   4 242 792   927 338
N50           419         101       178      177         193

Table 6: Statistical metrics for the N's present in the scaffolds.

The number of N's was very low in the SPAdes assembly, as shown in Table 6, because SPAdes has an inbuilt gap closer that reduces the number of N's. ABySS and CLC also had relatively few N's, while Velvet had the highest number of N's of the five assemblers.
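Gap statistics of this kind can be computed from scaffold sequences with a short Python sketch. It assumes that "No. of N's" counts runs of consecutive N characters (gaps) and that "Span of N's" is the total number of N bases; this interpretation, the function name gap_stats and the toy scaffolds are assumptions made for illustration.

```python
import re


def gap_stats(sequences):
    """Return (number of N runs, total number of N bases) over all sequences."""
    runs = [match.group()
            for sequence in sequences
            for match in re.finditer(r"[Nn]+", sequence)]
    return len(runs), sum(len(run) for run in runs)


# Hypothetical toy scaffolds
toy_scaffolds = ["ACGTNNNNACGT", "NNACGTTNNNA"]
print(gap_stats(toy_scaffolds))  # (3, 9)
```

A gap closer that fills a gap completely lowers both values, whereas partially filling a gap lowers only the span.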

The statistical metrics were good for four of the assemblers: ABySS, SPAdes, CLC and Velvet. ABySS gave consistently better results, followed by Velvet and SPAdes. To validate these results, the assemblies were then compared using biological metrics.

3.3.2 Biological metrics comparison

The biological metrics used for comparing the assemblies were: presence of at least 70% transcript and protein content and presence of 18S RNA and mitochondrial genes in the genome (as they are conserved within the phylum of Nematoda), CEGMA scores for verifying the presence of core genes conserved in species, ALE- and REAPR scores to check for the quality of assembly.

Ray assembly had very poor transcript content while the rest were exceptionally good, with ABySS assembly having a near complete transcriptome (Table 7). The protein content was not as high as transcript content because the proteome of C. briggsae was used for comparison unlike in transcript content where the RNA assembly of C. doughertyi was used. Despite that, the protein content was good in all the assemblers except Ray. ABySS and Velvet had the highest protein content closely followed by SPAdes and CLC.

Interesting results were obtained when checking for the presence of mitochondrial genes and 18S RNA. Mitochondrial genes were present in all the assemblies except that of Velvet, which otherwise seemed to have given high-quality results. Similarly, 18S RNA was absent in the CLC assembly, which also gave good results, while the Ray assembly, which did not show promising results, contained both 18S RNA and mitochondrial genes. CEGMA-partial and CEGMA-complete scores indicated that almost all the assemblies except Ray may contain all the core genes. The CEGMA-partial value was high for SPAdes but was marginally lower for CEGMA-complete, showing that a few of the core genes may be split across contigs. Velvet showed high values for both complete and partial, leading to the inference that almost all the core genes were present in individual contigs.

The REAPR (I) score was highest for SPAdes, followed by CLC and ABySS, while Velvet and Ray did not give good results, indicating that the percentage of error-free bases calculated by REAPR may be low in these assemblies. REAPR (II) gave a perfect score for ABySS, followed by Velvet and SPAdes. This may be because REAPR (II) is expressed relative to the best value among the assemblies rather than as a percentage. The ALE score was lowest for the Ray assembly and, among the other four, lowest for CLC; ABySS had the highest ALE score.

Ray showed mediocre results in almost every measure, making it a poor assembly. This failure may also be due to the settings used while running the assembler. The other four assemblers gave good results in most of the measures. Nonetheless, the CLC and Velvet assemblies were not considered the best because of the absence of 18S RNA and mitochondrial genes, respectively. Though SPAdes showed good results consistently, the ABySS assembly showed the best results of the five in many measures, such as transcript content, protein content and span of assembly, making it the best assembly.

Metric                          Ray             ABySS           SPAdes          Velvet          CLC
Transcript - 70% complete (%)   64.63           99.14           98.07           98.90           98.15
Protein - 70% complete (%)      40.39           79.97           78.52           79.28           78.51
Mitochondria (C. briggsae)      Yes             Yes             Yes             No              Yes
18S RNA                         Yes             Yes             Yes             Yes             No
CEGMA-partial (%)               54.84           98.79           99.60           99.19           98.79
CEGMA-complete (%)              35.48           96.77           95.97           98.39           95.56
REAPR score (II)                0.02            1.00            0.77            0.91            0.62
REAPR score (I)                 0.44            0.74            0.87            0.58            0.79
ALE score                       -4 839 617 748  -1 226 736 836  -1 365 093 242  -1 555 011 447  -1 787 008 847

Table 7: Comparison of assemblies using biological metrics.

REAPR score (I) = percentage of error-free bases × (broken N50 / N50)

REAPR score (II) = (no. of error-free bases / best no. of error-free bases) × [(broken N50 / best broken N50)^2 / (N50 / best N50)]

Error-free bases are the bases of the assembly that REAPR judges to be accurate. The broken N50 is the N50 value recalculated after the assembly has been broken at the error-prone regions.
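The two score definitions can be written as small Python functions. This is a sketch of the formulas as stated here, not REAPR's own code; the function and parameter names are hypothetical, the input values are invented for illustration, and the error-free bases in score (I) are assumed to be expressed as a fraction between 0 and 1.

```python
def reapr_score_i(error_free_fraction, broken_n50, n50):
    """Score (I): error-free fraction scaled by broken N50 / N50."""
    return error_free_fraction * broken_n50 / n50


def reapr_score_ii(error_free_bases, broken_n50, n50,
                   best_error_free_bases, best_broken_n50, best_n50):
    """Score (II): each term is taken relative to the best assembly."""
    relative_bases = error_free_bases / best_error_free_bases
    relative_broken = broken_n50 / best_broken_n50
    relative_n50 = n50 / best_n50
    return relative_bases * (relative_broken ** 2 / relative_n50)


# Hypothetical values, not results from this project
print(reapr_score_i(0.9, 50_000, 60_000))
print(reapr_score_ii(120e6, 50_000, 60_000, 120e6, 50_000, 60_000))
```

An assembly that is itself the best in every term receives a score (II) of exactly 1, which is how a value of 1.00 can arise for one of the assemblies being compared.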

3.4 Gapclosing

As mentioned earlier, ABySS was chosen as the best assembly based on statistical parameters and biological metrics. Gap closing was carried out on the final assembly generated by ABySS, as the assembler does not include a gap-closing step. The gap closer from SOAPdenovo was used; it closes the gap regions in the scaffolds using paired-read information. Statistical measures were then calculated for the genome before and after gap closing.

Metric                  ABySS before gap closing   ABySS after gap closing
Longest scaffold (bp)   498 614                    498 601
No. of scaffolds        18 529                     18 529
Span of scaffold (bp)   142 916 954                142 857 150
Min (bp)                200                        200
Mean (bp)               7 713                      7 709
N50                     67 490                     67 476
Num N50                 553                        553
GC                      0.374                      0.374

Table 8: Assembly statistics for scaffolds longer than 200 bp.

Metric                  ABySS before gap closing   ABySS after gap closing
No. of scaffolds        7 309                      7 293
Span of scaffold (bp)   138 768 940                138 701 491
Min (bp)                1 000                      1 000
Mean (bp)               18 986                     19 018
N50                     71 009                     71 009
Num N50                 523                        523
GC                      0.374                      0.374

Table 9: Assembly statistics for scaffolds longer than 1000 bp.

Metric                  ABySS before gap closing   ABySS after gap closing
Longest contig (bp)     383 555                    484 136
No. of contigs          26 552                     20 654
Span of contigs (bp)    142 323 397                142 704 054
Min (bp)                101                        102
Mean (bp)               5 360                      6 909
N50                     37 291                     61 041
Num N50                 967                        603
GC                      0.374                      0.37

Table 10: Assembly statistics for contigs longer than 100 bp.

Metric        ABySS before gap closing   ABySS after gap closing
N50           99                         107
No. of N's    8 088                      2 141
Span of N's   583 162                    152 040

Table 11: Assembly statistics for N's in the assembly.

The assembly statistics comparison shows a marginal decrease in the N50 value for scaffolds longer than 200 bp (Table 8), while for the longer scaffolds (>1000 bp) the N50 value was the same in both assemblies (Table 9). The N50 value of the contigs was much higher after gap closing (Table 10) because gap closing generates longer contigs. This was also evident from the increase in the longest contig length and the span of contigs after gap closing, as shown in Table 10. The mean scaffold length also increased for the longer scaffolds (Table 9) but decreased slightly when all scaffolds longer than 200 bp were considered (Table 8). This may be due to the decrease in the number of scaffolds longer than 1000 bp after gap closing. The statistics for the N's in Table 11 also show a significant decrease in the number of N's, as expected.

Metric                 ABySS before gap closing   ABySS after gap closing
ALE score              -1 226 736 836             -1 177 381 466
CEGMA – complete (%)   96.77                      96.37
CEGMA – partial (%)    96.79                      96.39

Table 12: Biological parameters of assembly before and after gap closing.

To further ensure that the assembly was better after gap closing, certain biological parameters like CEGMA and ALE were considered. As seen in Table 12, the ALE score was better for the assembly after gap closing. For CEGMA-partial and CEGMA-complete, there was a marginal decrease in the percentage, which may have been due to the loss of a single gene when the gaps were closed.

3.5 Comparison with the genome of C. brenneri

The gap-filled ABySS assembly of C. doughertyi was then compared with the draft genome of C. brenneri to check the quality of the draft genome.

Figure 24: Comparing the genomes of C. doughertyi and C. brenneri. The blue line indicates the assembled genome of C. doughertyi and the red line marks the genome of C. brenneri. The steeper curve indicates the assembly of genome into fewer parts.

In Figure 24 (left), it can be seen that the scaffolds of C. brenneri had a higher cumulative length than those of C. doughertyi. It can also be seen that the span of scaffolds in C. doughertyi was much lower than that of C. brenneri. In Figure 24 (middle), it can be seen that the length and span of the contigs of C. doughertyi are better than those of C. brenneri until just above 1.0e+08, following which the contigs of C. brenneri have a higher contig length. In Figure 24 (right), the number of N's is much higher in C. brenneri and almost negligible in C. doughertyi. The long length of contigs and scaffolds, decreased span and high levels of N in C. brenneri may be due to the absence of scaffolding.

3.6 Annotation

Number of genes                 27 113
Longest transcript (bp)         54 299
Median transcript length (bp)   1 641
Median exon length (bp)         863
Median intron length (bp)       2 255

Table 13: Annotation statistics of C. doughertyi.

Table 13 lists the annotation statistics for the genes predicted by AUGUSTUS, taking the genome and the MAKER pipeline predictions as input. The number of genes predicted for C. doughertyi is lower than the number of genes in C. brenneri (30 667). The median transcript length is higher than that of C. brenneri (1154 bp). Similarly, the median length of an exon is also higher than that of C. brenneri (654 bp) [Bre].

3.6.1 OrthoMCL

OrthoMCL generated 17630 orthologous clusters. From Figure 25, it is clear that nearly half the genes present in C. doughertyi are present in C. brenneri, C. briggsae and C. elegans. It can also be seen that C. brenneri and C. doughertyi share the maximum number of exclusive genes between them, which is an expected result as these are sister species. Furthermore, C. doughertyi, C. elegans and C. brenneri share four times the number of genes shared by C. doughertyi, C. brenneri and C. briggsae.

Figure 25: Venn diagram depicting the orthologous clusters generated by OrthoMCL.

4 Discussion

The aim of the project was to generate a draft genome and annotation of the nematode C. doughertyi as a part of the Caenorhabditis Genome Project.

4.1 Quality control

The FastQC report presents evidence that the data contain non-target genomes. This is indicated by the marked difference between the theoretically calculated GC content distribution and the one determined from the raw data of C. doughertyi. The reports also show the bias observed in certain k-mers during sequencing, confirming the need for trimming of the data. The blobology plot showed that the genome of C. doughertyi contained about 10 Mb of proteobacterial genome. The proteobacterium was identified to be a strain of E. coli. The presence of E. coli in the genome was expected, as it is one of the main sources of food for the Caenorhabditis species.

After the removal of the E. coli genome, the blobology plot showed the absence of non-target genomes but presented major regions that were not annotated by the BLAST search against nt (~34 Mb). The size of the non-annotated regions dropped to nearly 13 Mb on performing a BLAST search against C. briggsae. The increase in the span of the nematode content (~90 to ~111 Mb) following the BLAST search indicates that some of the unannotated DNA sequences in C. doughertyi were also observed in C. briggsae. The raw data could thus be considered relatively clean, as only about 5% and 7% of the bases in the two read libraries were lost during trimming and error correction and the only non-target genome observed was that of Proteobacteria.

4.2 Assembly

In the final assembly, although Ray gave mediocre results, the other four assemblers produced good results, suggesting that the raw data of C. doughertyi are of good quality. The ABySS assembler produced one of the best assemblies, with the highest span, N50, CEGMA, ALE and REAPR scores. The high number of N's in the ABySS assembly was reduced by gap closing, which also increased the N50 value of the assembly.
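The span, N50 and number of N's mentioned above can be computed directly from the assembled sequences. The Python sketch below uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the span); the function names and the toy sequences are assumptions, and the thesis results were obtained with the tools named in the text, not with this code.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the sum to half the span.
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(sequences):
    """Return span, N50 and number of N bases for a list of sequences."""
    lengths = [len(seq) for seq in sequences]
    return {
        "span": sum(lengths),
        "n50": n50(lengths),
        "n_count": sum(seq.upper().count("N") for seq in sequences),
    }


# Hypothetical toy scaffolds; the first one contains a gap of four N's.
scaffolds = ["ACGTNNNNAC", "ACGT", "AC"]
print(assembly_stats(scaffolds))
print(n50([100, 80, 50, 30, 20, 10, 10]))
```

Gap closing replaces N's with bases, so it lowers `n_count`; N50 changes only where closing gaps alters sequence lengths or joins sequences.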

The draft assembly of C. doughertyi produced by ABySS, when compared with the draft genome of C. brenneri, showed that the scaffolds of C. doughertyi were better than those of C. brenneri. A similar pattern was observed for contigs, but only up to a certain level, beyond which the contigs of C. brenneri were longer. The better scaffolds in C. doughertyi may be explained by the large number of N's introduced into the C. brenneri assembly by scaffolding.

4.3 Annotation

The annotation of the draft genome of C. doughertyi gave results comparable to those for its sister genome, C. brenneri. Introns were longer than exons; the long introns that flank the exons serve the purpose of splice-site recognition. It was also observed that the GC % was higher in the cases where long introns were observed [ADH+12]. OrthoMCL generated 17630 clusters, and the largest number of genes was present in all four species: C. brenneri, C. briggsae, C. elegans and C. doughertyi. About 1700 genes were present only in the two sister species C. brenneri and C. doughertyi. Interestingly, C. elegans shared more genes with C. brenneri and C. doughertyi than C. briggsae did, even though the evolutionary distance between the sister species and C. elegans is larger than that between the sister species and C. briggsae. This may be due to the high quality of the C. elegans genome. However, firm conclusions cannot be drawn from these results, as some of the predicted genes may be false positives.

Accurate gene prediction requires further research on C. doughertyi and a high-quality genome. Such a genome could be obtained with longer reads, which contribute to better prediction accuracy. Sequencers such as those from Pacific Biosciences can produce reads longer than 10 kb. The major limitations of these platforms are their high cost and the large samples required to produce accurate reads. If these limitations are overcome, either with better error correction tools or with cheaper sequencing techniques, long reads could be useful in producing many high-quality genomes.
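The comparison of exon and intron lengths can be reproduced from the exon coordinates of each annotated transcript. The Python sketch below assumes 1-based inclusive coordinates, as used in GFF3 files; the function names and the example coordinates are hypothetical and are not taken from the C. doughertyi annotation.

```python
def exon_lengths(exons):
    """Return exon lengths for (start, end) pairs, 1-based and inclusive."""
    return [end - start + 1 for start, end in sorted(exons)]


def intron_lengths(exons):
    """Return the lengths of the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


def mean(values):
    """Return the arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical transcript with three exons.
exons = [(101, 200), (501, 620), (1001, 1100)]
print(exon_lengths(exons), intron_lengths(exons))
print(mean(intron_lengths(exons)) > mean(exon_lengths(exons)))
```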

4.4 Conclusion

C. doughertyi is the sister species of C. brenneri; both species were isolated from tropical regions and are male-female species. The project was successful in generating a draft genome and annotation of C. doughertyi. The genome of C. doughertyi could be useful for comparative genomic studies with C. brenneri and for understanding the gene loss observed in the species. The annotation could also play a crucial role in understanding the hyperdiverse nature of C. brenneri and its evolution. Finally, it could be useful for comparative genomic studies with other neighboring species and for understanding their evolutionary uniqueness.

5 Acknowledgements

I would like to express my sincere gratitude to Prof. Mark Blaxter from the University of Edinburgh for giving me the opportunity to carry out my master's thesis in his lab. I would also like to thank Georgios Koutsovoulos, Dominik Laetsch, Sujai Kumar, Reuben Nowell and other lab members for their guidance and for making my stay an instructive and joyful one.

I must thank Prof. Veli Makinen for being extremely patient and helpful during the entire process of my thesis. I would like to thank Heidi Ilomaki, University of Helsinki and Erasmus for providing me with financial assistance and making my stay a peaceful one.

Last but not least, I would like to thank my pillars of support, my parents, my sister and my friends Anuroop and Sudar, for always encouraging me and standing by my side.

References

ADH+12 Amit, M.; Donyo, M.; Hollander, D.; Goren, A.; Kim, E.; Gelfman, S.; Lev-Maor, G.; Burstein, D.; Schwartz, S.; Postolsky, B.; Pupko, T. and Ast, G. (2012). Differential GC Content between Exons and Introns Establishes Distinct Strategies of Splice-Site Recognition, Cell Reports 1 : 543-556.

And Andrews, S. FastQC A Quality Control tool for High Throughput Sequence Data, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

AGM+90 Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool, Journal of Molecular Biology 215(3) : 403-410.

Bla11 Blaxter, M. (2011). Nematodes: The Worm and Its Relatives, PLoS Biology 9 : e1001050.

BNA+12 Bankevich, A.; Nurk, S.; Antipov, D.; Gurevich, A. A.; Dvorkin, M.; Kulikov, A. S.; Lesin, V. M.; Nikolenko, S. I.; Pham, S.; Prjibelski, A. D.; Pyshkin, A. V.; Sirotkin, A. V.; Vyahhi, N.; Tesler, G.; Alekseyev, M. A. and Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing, Journal of Computational Biology 19 : 455-477.

Bre C. brenneri, https://www.wormbase.org/species/c_brenneri/gene#0-9f-3f, Accessed date: 2015-10-11

BRG+12 Boisvert, S.; Raymond, F.; Godzaridis, E.; Laviolette, F. and Corbeil, J. (2012). Ray Meta: scalable de novo metagenome assembly and profiling., Genome biology 13 : R122.

Cae Caenorhabditis Genomes Project, http://caenorhabditis.bio.ed.ac.uk/, Accessed date: 2015-08-30.

Caea Caenorhabditis brenneri, http://metazoa.ensembl.org/Caenorhabditis_brenneri/Info/Index, Accessed date: 2015-08-30.

CEFW13 Clark, S. C.; Egan, R.; Frazier, P. I. and Wang, Z. (2013). ALE: A generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies, Bioinformatics 29 : 435-443.

CHA+05 Chen, N.; Harris, T. W.; Antoshechkin, I.; Bastiani, C.; Bieri, T.; Blasiar, D.; Bradnam, K.; Canaran, P.; Chan, J.; Chen, C.-K.; Chen, W. J.; Cunningham, F.; Davis, P.; Kenny, E.; Kishore, R.; Lawson, D.; Lee, R.; Muller, H.-M.; Nakamura, C.; Pai, S.; Ozersky, P.; Petcherski, A.; Rogers, A.; Sabo, A.; Schwarz, E. M.; Auken, K. V.; Wang, Q.; Durbin, R.; Spieth, J.; Sternberg, P. W. and Stein, L. D. (2005). WormBase : a comprehensive data resource for Caenorhabditis biology and genomics, Nucleic Acids Research 33 : 383-389.

CKR+08 Cantarel, B. L.; Korf, I.; Robb, S. M. C.; Parra, G.; Ross, E.; Moore, B.; Holt, C.; Alvarado, A. S. and Yandell, M. (2008). MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes, Genome Research 18 : 188-196.

CM14 Chikhi, R. and Medvedev, P. (2014). Informed and automated k-mer size selection for genome assembly, Bioinformatics 30 : 31-37.

Cog05 Coghlan, A. (2005). Nematode genome evolution, WormBook : 1-15.

DCTC13 Dey, A.; Chan, C. K. W.; Thomas, C. G. and Cutter, A. D. (2013). Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri., Proceedings of the National Academy of Sciences of the United States of America 110 : 11056-60.

EH15 Ellis, H. M. and Horvitz, H. (2015). Genetic control of programmed cell death in the nematode C. elegans, Cell 44 : 817-829.

EJB13 Elsworth, B.; Jones, M. and Blaxter, M. (2013). Badger - an accessible genome exploration environment, Bioinformatics 29 : 2788-2789.

FWT+15 Fierst, J. L.; Willis, J. H.; Thomas, C. G.; Wang, W.; Reynolds, R. M.; Ahearne, T. E.; Cutter, A. D. and Phillips, P. C. (2015). Reproductive Mode and the Evolution of Genome Size and Structure in Caenorhabditis Nematodes, PLOS Genetics 11 : e1005323.

GAP+08 Green, R. A.; Audhya, A.; Pozniakovsky, A.; Dammermann, A.; Pemble, H.; Monen, J.; Portier, N.; Hyman, A.; Desai, A. and Oegema, K. (2008). Expression and Imaging of Fluorescent Proteins in the C. elegans Gonad and Early Embryo, Methods in Cell Biology 85 : 179-218.

GHY+11 Grabherr, M. G.; Haas, B. J.; Yassour, M.; Levin, J. Z.; Thompson, D. A.; Amit, I.; Adiconis, X.; Fan, L.; Raychowdhury, R.; Zeng, Q.; Chen, Z.; Mauceli, E.; Hacohen, N.; Gnirke, A.; Rhind, N.; di Palma, F.; Birren, B. W.; Nusbaum, C.; Lindblad-Toh, K.; Friedman, N. and Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome, Nat Biotech 29 : 644-652.

Gri05 Grishok, A. (2005). RNAi mechanisms in Caenorhabditis elegans, FEBS Letters 579 : 5932-5939.

His Hiseq X Ten And Hiseq X Five Systems, http://www.illumina.com/systems/hiseq-x-sequencing-system.html, Accessed date: 2015-08-30.

HKS+13 Hunt, M.; Kikuchi, T.; Sanders, M.; Newbold, C.; Berriman, M. and Otto, T. D. (2013). REAPR: a universal tool for genome assembly evaluation, Genome Biology 14 : R47.

HMB+07 Hillier, L. W.; Miller, R. D.; Baird, S. E.; Chinwalla, A.; Fulton, L. A.; Koboldt, D. C. and Waterston, R. H. (2007). Comparison of C. elegans and C. briggsae Genome Sequences Reveals Extensive Conservation of Chromosome Organization and Synteny, PLoS Biology 5 : e167.

Hum The Human Genome Project Completion: Frequently Asked Questions (2010), https://www.genome.gov/11006943, Accessed date: 2015-08-30.

HWL+07 He, H.; Wang, J.; Liu, T.; Liu, X. S.; Li, T.; Wang, Y.; Qian, Z.; Zheng, H.; Zhu, X.; Wu, T.; Shi, B.; Deng, W.; Zhou, W.; Skogerbø, G. and Chen, R. (2007). Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray, Genome Research 17 : 1471-1477.

KFA+11 Kiontke, K. C.; Félix, M.-A.; Ailion, M.; Rockman, M. V.; Braendle, C.; Pénigault, J.- B. and Fitch, D. H. (2011). A phylogeny and molecular barcodes for Caenorhabditis, with numerous new species from rotting fruits, BMC Evolutionary Biology 11 : 339.

KSS10 Kelley, D. R.; Schatz, M. C. and Salzberg, S. L. (2010). Quake: quality-aware detection and correction of sequencing errors., Genome biology 11 : R116.

KJK+13 Kumar, S.; Jones, M.; Koutsovoulos, G.; Clarke, M. and Blaxter, M. (2013). Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots, Frontiers in Genetics 4 : 1-12.

Kor04 Korf, I. (2004). Gene finding in novel genomes., BMC bioinformatics 5 : 59.

KW06 Kiontke, K. and Walter, S. (2006). Ecology of Caenorhabditis species, WormBook : 1-14.

Lae Laetsch, D. R. DRL/blobtools-light, https://github.com/DRL/blobtools-light.

LB02 Lambert, K. and Bekal, S. (2002). Introduction to Plant-Parasitic Nematodes, The Plant Health Instructor .

Lom05 Lomsadze, A. (2005). Gene identification in novel eukaryotic genomes by self-training algorithm, Nucleic Acids Research 33 : 6494-6506.

LSR03 Li, L.; Stoeckert, C. J. J. and Roos, D. S. (2003). OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes, Genome Research 13 : 2178-2189.

LTPS09 Langmead, B.; Trapnell, C.; Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome biology 10 : R25.

Mar11 Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads, EMBnet.journal 17 : 10.

PBK07 Parra, G.; Bradnam, K. and Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes, Bioinformatics 23 : 1061-1067.

RG14 Rogers, J. and Gibbs, R. A. (2014). Comparative primate genomics: emerging patterns of genome content and dynamics, Nature Reviews Genetics 15 : 347-359.

RSS13 Rödelsperger, C.; Streit, A. and Sommer, R. (2013). Structure, Function and Evolution of The Nematode Genome, eLS .

Seq Sequencing By Synthesis (SBS) Technology, http://www.illumina.com/technology/next-generation-sequencing/sequencing-technology.html, Accessed date: 2015-08-30.

Seq98 The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology, Science (New York, N.Y.) 282 : 2012-2018.

SSWM04 Stanke, M.; Steinkamp, R.; Waack, S. and Morgenstern, B. (2004). AUGUSTUS: a web server for gene finding in eukaryotes, Nucleic Acids Research 32 : W309-W312.

SSMW06 Stanke, M.; Schoffmann, O.; Morgenstern, B. and Waack, S. (2006). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources, BMC Bioinformatics 7 : 62.

SWJ+09 Simpson, J. T.; Wong, K.; Jackman, S. D.; Schein, J. E.; Jones, S. J. M. and Birol, I. (2009). ABySS: A parallel assembler for short read sequence data, Genome Research 19 : 1117-1123.

TC02 Tarailo-Graovac, M. and Chen, N. (2002). Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. In: (Ed.), Current Protocols in Bioinformatics, John Wiley & Sons, Inc..

ZB08 Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Research 18 : 821-9.

Appendix

Commands used:

Cutadapt: cutadapt -a adapter_sequence -q 30 -f fastq -o output

-a: used to remove 3' adapter

-q: quality cutoff used to trim low-quality ends from reads

-f: input file format

-o: output file name

Quake: quake.py -f file_list -k 19 -p

-f: text file listing the names of the read files

-k: length of the k-mer

-p: number of processors

Bowtie:

Assembling the genome: clc_assembler -o -q --cpus -v

-o: output file

-q: reads (forward and reverse if paired end)

--cpus: number of processors

-v: verbose

Mapping

Creating index: bowtie2-build -assembled genome -basename for index

Mapping: bowtie2 -x --very-fast-local -k -p -U

-x: the base-name of the index created for the reference genome

-k: number of alignments to report for each read; k=1 was used

-p: number of processors

-U: forward and reverse reads

Blast: blastn -task megablast -query -db -evalue -outfmt -out -max_target_seqs -num_threads

blastn: nucleotide blast

-task : megablast for highly similar sequences

-query: sequence for which blast needs to be carried out, assembled genome in our case

-db: database to blast the query against, nt database chosen

-evalue: 1e-5

-outfmt: 6 qseqid staxids std

-out: output filename

-max_target_seqs: number of hits to be considered from blast search, we chose 1

-num_threads: number of processors
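With `-outfmt '6 qseqid staxids std'` each hit is written as one tab-separated line: the query name, the taxon identifier(s) of the subject, and then the twelve standard columns (which begin with the query name again). The Python sketch below shows one way such a table could be read to assign a taxon identifier to each contig. The function name, the subject names and the toy hits are hypothetical; taxon identifiers 562 (E. coli) and 6239 (C. elegans) are used only as examples.

```python
import csv
import io

# The twelve columns that BLAST+ writes for the 'std' keyword.
STD = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
       "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
# Column order for -outfmt '6 qseqid staxids std'; the first column is
# named "query" here because 'std' repeats qseqid.
COLUMNS = ["query", "staxids"] + STD


def best_taxid_per_contig(handle, max_evalue=1e-5):
    """Map each contig to the taxid of its highest-scoring accepted hit."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        record = dict(zip(COLUMNS, row))
        if float(record["evalue"]) > max_evalue:
            continue  # hit is weaker than the chosen e-value threshold
        score = float(record["bitscore"])
        contig = record["query"]
        if contig not in best or score > best[contig][1]:
            best[contig] = (record["staxids"], score)
    return {contig: taxid for contig, (taxid, _) in best.items()}


# Hypothetical hits; the third one fails the e-value threshold.
toy = io.StringIO(
    "contig_1\t562\tcontig_1\tsubjA\t99.0\t500\t5\t0\t1\t500\t1\t500\t0.0\t900\n"
    "contig_2\t6239\tcontig_2\tsubjB\t85.0\t300\t45\t0\t1\t300\t10\t309\t1e-50\t250\n"
    "contig_3\t562\tcontig_3\tsubjC\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n"
)
taxids = best_taxid_per_contig(toy)
print(taxids)
```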

Taxon-annotated GC-coverage plot: gc_cov_annotate_eval.pl --evalue --blasttaxid --assembly --bam --taxdump

--blasttaxid: blast output file

--assembly: preliminary assembly generated

--bam: mapping files generated

--taxdump: directory containing the NCBI taxonomy dump

KmerGenie: kmergenie readfile_list -l -s -t

readfile_list: file containing names of read files

-l: smallest k-mer length tested

-s: interval between consecutive k-mer lengths

-t: number of processors

SPAdes: spades.py -o spades --only-assembler --careful -t 32 -m 450 --pe1-1 --pe1-2 --pe2-1 --pe2-2

-o: output directory

--only-assembler: run only assembly module

--careful: reduces errors such as indels and mismatches

-t: number of threads

-m: memory limit (GB)

--pe1-1, --pe1-2: paired end library one

--pe2-1, --pe2-2: paired end library two

Ray: mpirun -n 16 Ray -k 111 -o ray_k111 -amos -p

mpirun: runs Ray under the Message Passing Interface (MPI)

-n: number of cores

-k: length of the kmer

-o: output directory

-amos: an amos file is created, it contains information about read positions on the contig

-p: paired end reads

Velvet: velveth velvet_k111 111 -fastq -shortPaired -fastq -shortPaired2

velveth -output directory -k-mer length -file format -type of read -file format2 -type of read2

velvetg velvet_k111 -shortMatePaired no -exp_cov auto -cov_cutoff auto

velvetg -output directory -if mate paired library or not

-exp_cov auto: estimates the expected coverage as the length-weighted median contig coverage

-cov_cutoff auto: sets the coverage cutoff to half the exp_cov

CLC: clc_assembler -o -e lib_dist -p fb ss 180 250 -q read1_name -i -p fb ss 180 250 -q -i read2_name --cpus 16 -v

-o: output name

-p: specifying the paired read type

fb: orientation of reads (the first read is the forward read and the second one is the reverse read)

ss: the distance between paired reads is measured from start to start, i.e. the read lengths plus the length of the sequence between the reads

-q: read file name

-i: interleave, when the first and second member are present as separate files

--cpus: number of processors

-v: verbose

ABySS: mpirun -np 32 abyss-pe k=111 name=abyss_k111 lib='pe1 pe2' pe1= pe2=

mpirun -np: using MPI, number of cores 32

abyss-pe: paired end data

k: k-mer length

name: name of directory

lib: names of the libraries

pe1: forward and reverse reads of insert size 300

pe2: forward and reverse reads of insert size 500

Trinity: Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 32 --max_memory 50G

--seqType: read format (fq = fastq)

--left: forward read

--right: reverse read

--CPU: number of processors

--max_memory: memory limit

CEGMA: cegma -g -T 16

-g: genome assembly

-T: number of processors

ALE: ALE -the mapping file (BAM) -genome assembly -output directory

REAPR:

Selects scaffolds of length 1000 bp or more: fastaqual_select.pl -f -l 1000 > output.fasta

-f: genome assembly(fasta)

-l: setting up the minimum length of scaffolds
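The length filter applied in this step can be sketched in Python as follows. This is not the code of fastaqual_select.pl; it only illustrates keeping sequences of at least a given length. The function names and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_min_length(records, min_length=1000):
    """Keep only the records whose sequence has at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Hypothetical input: one scaffold of 1200 bp and one of 4 bp.
fasta_lines = [">scaffold_1", "A" * 600, "C" * 600, ">scaffold_2", "ACGT"]
kept = select_min_length(read_fasta(fasta_lines))
print([name for name, _ in kept])
```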

Checks for errors in the contig names of the assembly: reapr facheck output.fasta

output.fasta: the genome assembly from the previous step

Mapping of reads to the assembly: reapr smaltmap -n 32

Calculates the REAPR score: reapr pipeline - -BAM -

RepeatMasker:

Creating database for the pipeline: RepeatModeler/BuildDatabase -name doughertyi -engine ncbi -genome assembly

-engine: search engine used to create the database

-name: name of database created

Models repetitive DNA

RepeatModeler/RepeatModeler -engine ncbi -pa 32 -database doughertyi

-pa : number of processors

Identifying repeats from the RepeatMasker database: repeatmasker-4-0-3/util/queryRepeatDatabase.pl -species rhabditida > Rhabditida.repeatmasker

-species: species of interest

Combining modeled and queried repeats: cat Rhabditida.repeatmasker consensi.fa.classified > all_repeats

Masking the repeats: repeatmasker-4-0-3/RepeatMasker -lib all_repeats genome_assembly

-lib: all the repeats that need to be masked

SNAP:

Run CEGMA: cegma -g -T 32

Convert gff format to zff: cegma2zff output_cegma.gff masked_genomes

Breaking genome.ann and genome.dna to have one gene per sequence: fathom genome.ann genome.dna -categorize 1000

-categorize: up to 1000 bp on either side of the gene

Converting uni genes to the plus strand: fathom -export 1000 -plus uni.ann uni.dna

Estimation of parameters: forge export.ann export.dna

export.ann: gene structure on the plus strand

export.dna: DNA of the plus strand

Building an HMM model: hmm-assembler.pl -genome -directory created for parameters > hmm_output

GeneMark-ES: gm_es.pl --BP OFF -max_nnn 500 -min_contig 1000 masked_genome_sequence

--BP OFF: switches off the branch point submodel

-max_nnn: splits the NNN strings in input having length greater than max_nnn

-min_contig: minimum contig length

Maker:

Run maker (creates config files): maker

The outputs of the gene-finding software are given as input to maker by editing the config files, and maker is run again: mpiexec -n 40 maker

Combining gff files produced by maker: gff3_merge -d logfile created by maker

Combining fasta files (protein, transcript): fasta_merge -d logfile created by maker

AUGUSTUS:

Converting from maker gff to zff format: maker2zff genome.maker.output/genome.all.gff

Converting from zff to gff3 format and adding the column absent in the output: zff2gff3.pl genome.ann | perl -plne 's/\t(\S+)$/\t\.\t$1/' > augustus.train.gff3
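The Perl substitution in this command inserts a "." placeholder column before the last tab-separated field of every line. An equivalent Python sketch is shown below for readers less familiar with Perl; the function name and the example record are hypothetical.

```python
import re


def add_placeholder_column(line):
    """Insert a '.' column before the last tab-separated field of a line."""
    # Same pattern as the Perl one-liner: a tab followed by a final field
    # that contains no whitespace.
    return re.sub(r"\t(\S+)$", r"\t.\t\1", line.rstrip("\n"))


# Hypothetical record with eight tab-separated fields.
record = "scaffold1\tmaker\tCDS\t100\t200\t.\t+\tgene1"
fixed = add_placeholder_column(record)
print(fixed.split("\t"))
```

Lines whose last field contains whitespace are left unchanged, as in the Perl version.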

Simplifies headers to remove special characters: simplifyFastaHeaders.pl -Trinity.fasta formatted -genome_formatted.fasta -augustus/header.map

Running augustus: autoAug.pl --species=CDT --genome=masked_genome --trainingset=augustus.train.gff3 --cdna=genome.fasta --useexisting -v --singleCPU --optrounds=3 &>augustus.out.txt

--species: name of the species

--genome: fasta file of the genome

--trainingset: training genes in gff3 format (it can also be in fasta)

--cdna: cDNA sequences in fasta format

--useexisting: change the parameters in the configuration file if they already exist

-v: verbose

--singleCPU: run the program without any parallel execution

--optrounds: number of optimization rounds

OrthoMCL:

Install schema: orthomclInstallSchema orthomcl.config

Adjusting fasta files to the required format (C. brenneri, C. briggsae, C. elegans, C. doughertyi):

orthomclAdjustFasta cdo doughertyi_prot.fasta 1

orthomclAdjustFasta cel elegans_prot.fasta 1

orthomclAdjustFasta cbn brenneri_prot.fasta 1

orthomclAdjustFasta cbg briggsae_prot.fasta 1

Filter poor quality sequences: orthomclFilterFasta compliantfasta/ 100 20

orthomclFilterFasta -directory containing proteomes -minimum length of sequence -maximum percentage of stop codons
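The effect of the two thresholds can be sketched in Python. This is an approximation of the filtering rule and not the OrthoMCL implementation: it assumes that stop codons appear as "*" in the protein sequence and that a protein is rejected when it is shorter than the minimum length or when its percentage of stops exceeds the maximum. The function name and the toy sequences are hypothetical.

```python
def is_good_protein(seq, min_length=100, max_percent_stops=20):
    """Return True if a protein passes the length and stop-codon filters."""
    if len(seq) < min_length:
        return False
    # Assumption: stop codons are written as '*' in the protein sequence.
    percent_stops = 100.0 * seq.count("*") / len(seq)
    return percent_stops <= max_percent_stops


# Hypothetical proteins: long enough, too short, and too many stops.
proteins = {
    "long_clean": "M" * 100,
    "too_short": "M" * 99,
    "many_stops": "M" * 70 + "*" * 30,
}
good = [name for name, seq in proteins.items() if is_good_protein(seq)]
print(good)
```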

All vs All BLAST

Creating a BLAST database from the filtered protein sequences: makeblastdb -in goodProteins.fasta -dbtype prot -out prot_db

-in: input

-dbtype: database type

-out: prefix for the blast database produced

BLAST: blastp -task blastp -query goodProteins.fasta -db prot_db -evalue 1e-5 -outfmt 6 -out prot_blast_1e-5 -max_target_seqs 5 -num_threads 32

-task: blast type

-query: filtered sequences used as query

-db: name of created blast database to blast against

-evalue: expect value threshold for reporting hits

-outfmt: output format

-out: output name

-max_target_seqs: maximum number of hits to be present in the output

-num_threads: number of processors

Adjust the format to load into the OrthoMCL database and compute percentage match: orthomclBlastParser prot_blast_1e-5 compliantfasta/ >>SimilarSequences.txt

orthomclBlastParser -blast_output -directory containing proteome file >>adjusted_blastoutput.txt

Load the adjusted blast output into the database: orthomclLoadBlast orthomcl.config adjusted_blastoutput.txt

Identify the protein pairs (Ortholog, Paralog, CoOrtholog): orthomclPairs orthomcl.config orthomcl_pairs.log cleanup=no

cleanup: whether to keep or drop intermediate tables

Create the directory containing the protein pairs and the mclInput file: orthomclDumpPairsFiles orthomcl.config

MCL clustering: mcl mclInput --abc -I 1.5 -o mclOutput

--abc: indicates that the input is in label (abc) format

-I: inflation value for clustering

-o: output name

Clustering into groups: orthomclMclToGroups -prefix for the group id 1000 output_group.txt