Comparative Gene Expression Analysis in Emiliania huxleyi, Isochrysis galbana, and Gephyrocapsa oceanica

Arun Gopinath

In Partial Fulfillment of the Master of Computer Science

California State University San Marcos

July, 2015

Abstract

Comparative genomics is a field of biological research in which the genomic sequences of different species are compared. Comparative genomics provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved among species, as well as genes that give each organism its unique characteristics. Genome annotations of newly sequenced species initially rely primarily on ab initio gene predictions and alignment of reference transcripts of related species; however, the quality of gene models is greatly improved when incorporating same species transcriptomes.

In this study, we started with the re-annotation of E. huxleyi, G. oceanica and I. galbana based on de novo assembly of transcriptomes. The de novo assembly tool TRINITY was used to assemble the transcriptomes of each individual sister species, followed by a pipeline based on the Program to Assemble Spliced Alignments (PASA), the ab initio gene predictors AUGUSTUS and SNAP, EvidenceModeler (EVM) and MAKER2. The pipeline was validated by re-annotating E. huxleyi, for which a good first-generation JGI annotation is available. The revised E. huxleyi annotation describes 35,582 genes compared to 30,499 for the first-generation annotation. More than 95% of the previously annotated E. huxleyi genes mapped onto the 35,582 re-annotated E. huxleyi genes. The re-annotation pipeline predicted 52,680 G. oceanica genes compared to 28,441 for the previous annotation pipeline. The revised I. galbana annotation has 18,712 genes compared to 13,148 genes for the previous annotation.

Once the genomes of the sister species were re-annotated, we turned to the comparative study. Various PASA, EVM and BLAST tools were used to determine the genome size, the number of genes predicted, and the number of shared genes among the sister species. The revised E. huxleyi annotation shares 31,734 orthologous genes with G. oceanica and 23,392 orthologous genes with I. galbana. The comparative study between G. oceanica and I. galbana found 15,358 common genes, with an additional 20,800 G. oceanica genes sharing considerable similarity to the common G. oceanica-I. galbana genes. Further comparative study found 2,642 unique I. galbana genes, 3,137 unique E. huxleyi genes and 6,959 unique G. oceanica genes.


Acknowledgements

It is my pleasure to extend my deepest gratitude to my committee members, Dr. Xiaoyu Zhang, Dr. Betsy Read and Dr. Ahmad Hadaegh, for sharing their wisdom during the course of this study. I am extremely thankful to them for their time and valuable suggestions to better this study.

I am forever thankful to my wife, Kinga, for listening, offering me advice and supporting me through this entire process. My profoundest feelings of love to my sons, Milan and Emil, for the important roles they play in my life.

Finally, I want to acknowledge with gratitude the support of my parents and sister, who have always encouraged me to look forward to new beginnings rather than regret missed opportunities.


Table of Contents

1. INTRODUCTION
2. BACKGROUND
2.1. TRANSCRIPTOMES
2.2. SPLICED ALIGNMENT
2.3. TRANSCRIPTOME ASSEMBLY
2.4. TRANSCRIPTOME ALIGNMENT
2.5. GENOME ANNOTATION
3. ARCHITECTURAL MODEL
3.1. ANNOTATION PIPELINE
3.2. COMPARISON PIPELINE
4. IMPLEMENTATION AND ANALYSIS
4.1. DATA PREPARATION
4.1.1. Input Data Source
4.1.2. Data Trimming
4.2. TRANSCRIPT ALIGNMENT
4.2.1. PASA Pipeline
4.2.2. Table 3: PASA Annotation Comparison
4.3. TRAINING DATA PREPARATION
4.3.1. Input Data
4.3.2. ORF
4.3.3. Extracting Unique Complete ORF
4.3.4. Candidate Gene Structure Identification
4.4. AB INITIO TRAINING AND PREDICTION
4.5. REFERENCE PROTEIN ALIGNMENT GENERATION
4.6. COMBINING EVIDENCES
4.7. GENE CLUSTERING
4.8. VALIDATING WORKFLOW
4.9. COMPARATIVE STUDY
4.9.1. Reference E. huxleyi
4.9.2. Reference G. oceanica
4.9.3. Reference I. galbana
4.9.4. Three way comparison with individual species as Reference
4.9.5. Overlap for three species
5. CONCLUSION
REFERENCES
APPENDIX 1: INPUT DATA PREPARATION
APPENDIX 2: PASA
APPENDIX 3: TRAINING SET GENERATION
APPENDIX 4: GENE PREDICTIONS
APPENDIX 5: REFERENCE PROTEIN ALIGNMENT
APPENDIX 6: EVM
APPENDIX 7: CLUSTERING


LIST OF FIGURES

Figure 1: RNA Sequencing.
Figure 2: Spliced Alignment.
Figure 3: Reference Based vs De Novo Assembly.
Figure 4: Build a Graph Representing Alternative Splicing Events.
Figure 5: De novo Assembly. (a) Generating substrings.
Figure 6: PASA Pipeline.
Figure 7: Sample Annotation Pipeline.
Figure 8: Annotation Pipeline.
Figure 9: Comparison Pipeline.
Figure 10: Finding Unique Genes for a species.
Figure 11: Workflow for Generating Training Set.
Figure 12: G. oceanica on E. huxleyi and I. galbana on E. huxleyi.
Figure 13: E. huxleyi on G. oceanica and I. galbana on G. oceanica
Figure 14: E. huxleyi on I. galbana Genes and G. oceanica on I. galbana
Figure 15: Common Genes based on (a) E. huxleyi, (b) G. oceanica and (c) I. galbana.
Figure 16: Overall common genes.

LIST OF TABLES

Table 1: TRINITY Transcriptome Assembly
Table 2: Seqclean Results
Table 3: PASA Annotation Comparison
Table 4: Comparison of PASA annotation results and JGI Annotations of E. huxleyi
Table 5: Comparison of PASA annotation results and previous MAKER Gene Annotations of G. oceanica
Table 6: Comparison of PASA annotation results and previous MAKER Gene Annotations of I. galbana
Table 7: Side by side PASA output comparison
Table 8: Unique Complete ORF's Extracted
Table 9: ORF nr-hits
Table 10: Clustered nr Hits for Gene Predictors
Table 11: AUGUSTUS Gene Prediction
Table 12: SNAP Gene Prediction
Table 13: EVM Predicted Genes
Table 14: Final Predicted Genes
Table 15: Workflow Validation: Inputs to EVM


1. Introduction

Phytoplankton, a term derived from the Greek words phyto (plant) and plankton (drift), are microscopic organisms that live in aquatic environments. Like plants on land, phytoplankton capture sunlight and convert it into energy through photosynthesis.

Coccolithophores are single-celled eukaryotic phytoplankton that live in the upper regions of the ocean, where sunlight is plentiful, and E. huxleyi often forms a major part of the phytoplankton biomass. One of the main differences between coccolithophores and other oceanic phytoplankton is that coccolithophores have an exoskeleton made of calcite platelets known as coccoliths.

Coccolithophores are of particular interest in research that focuses on global climate change because, as ocean acidity increases, their coccoliths may become even more important as a carbon sink [2]. Emiliania huxleyi (E. huxleyi) is the most abundant coccolithophore and blooms every year across large areas of the North Atlantic [1]. E. huxleyi is considered the model coccolithophorid because of the ease with which it can be cultured in the laboratory and the availability of its genome sequence [23]. E. huxleyi produces unusual lipids that do not degrade in sediments and are of great value to earth scientists as a tool for estimating past sea surface temperatures [6]. Among the hundreds of other phytoplankton species, the closely related species Isochrysis galbana (I. galbana) and Gephyrocapsa oceanica (G. oceanica) were chosen for this comparative study. Despite the close relationship among the species, E. huxleyi and G. oceanica calcify but I. galbana does not.

Comparative genomics focuses on the study of similarities and differences between the sequenced genomes of different species. With advances in sequencing technology, more and more complete genomes are becoming available; genomic data, in the form of sequence data, accumulate at an exponential rate, doubling approximately every year and a half [5]. The whole genome of a species allows for global views, and multiple complete genomes increase predictive power [4]. Comparative genomics is a powerful tool for studying evolutionary relationships among organisms, helping to identify genes that are conserved among species as well as genes that give each organism its unique characteristics. The main strength of the comparative approach is that it views the same object, such as a whole genome, in different ways.

The aim of this research is to carry out a comparative genomics study of three closely related coccolithophorids, Emiliania huxleyi, Gephyrocapsa oceanica and Isochrysis galbana, to gain a deeper understanding of the similarities and differences between them. By annotating and comparing the genomes, we aim to facilitate the identification of genes and proteins involved in biomineralization and coccolithogenesis.

In this study, Solexa short transcriptomic RNA-sequencing reads from four different growth conditions were assembled with TRINITY¹ and processed through a customized pipeline to generate an improved annotation of the respective genomes. The newly annotated genomes were then compared. The transcriptomic data were obtained by extracting RNA from cultures of each of the three species grown in artificial seawater under the following conditions:

Condition 1: 0 millimolar calcium
Condition 2: 9 millimolar calcium
Condition 3: 0 millimolar calcium with a spike of sodium carbonate
Condition 4: 9 millimolar calcium with a spike of sodium carbonate

We assembled the RNA transcripts of the three sister species using PASA² (Program to Assemble Spliced Alignments), followed by ab initio predictions using SNAP³ and AUGUSTUS⁴.

1 Trinity is a de novo RNA sequence assembler (http://trinityrnaseq.github.io/)
2 PASA is a Program to Assemble Spliced Alignments (http://pasapipeline.github.io/)

MAKER2 was used to generate protein alignments based on hits with the RefSeq⁵ database. The various types of gene evidence were then combined using the EVM⁶ (EvidenceModeler) program to generate the final annotated catalog of genes for each species. This study represents a genome-wide comparison of three closely related species and focuses on the assembly of the RNA transcripts, gene annotation and, finally, genome content analysis.

This research makes the following contributions:

• Illustrates an architectural model that reflects the implementation of our work

• Provides a framework for annotation that identifies more complete gene structures

• Demonstrates improved gene annotation for each respective species

• Validates that the improved gene annotation has enabled the identification of unique genes present in each species.

The remainder of the thesis is organized as follows: Chapter 2 examines the related work. The architectural model is presented in Chapter 3. Implementation and results are described in Chapter 4. Finally, a discussion and an outline for future work are presented in Chapter 5.

2. Background

Once the genome of a species has been sequenced, it needs to be annotated to be meaningful. Annotation is the process of attaching pertinent information to the sequences in the genome. In genome annotation, the locations of genes and all coding regions in the genome are determined, followed by identification of the functions of the genes, the physical characteristics of the genes and the overall metabolic profile of the species. Genome annotation can be performed automatically or manually. In the automatic process, computer programs scan the genome and statistical methods are used to identify coding regions. Manual annotation of a genome is done by humans and is a very long and difficult process, as the genomes of species can span hundreds if not thousands of kilobases with tens of thousands of genes. Recent advances in sequencing technologies have made it possible to sequence new genomes in less time and at reduced cost, prompting automatic annotation to take the lead, with limited manual annotation [7]. Because genome annotation is resource intensive and time consuming, improving the quality of the annotation is an ongoing process.

3 SNAP is a general purpose gene finding program (http://korflab.ucdavis.edu/software.html)
4 AUGUSTUS is a gene predictor for eukaryotes (http://nar.oxfordjournals.org/content/32/suppl_2/W309.full)
5 The Reference Sequence (RefSeq) collection provides a comprehensive, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins (http://www.ncbi.nlm.nih.gov/refseq/about/)
6 EvidenceModeler software combines ab initio gene predictions and protein and transcript alignments into weighted consensus gene structures

Comparative and functional genomic studies and evolutionary genetics rely on quality genome annotation to assign and modify the functions of genes and other elements such as regulatory regions. Genome annotations initially rely on automatic processes such as ab initio gene prediction using statistical methods and alignment/homology of transcripts from other species; however, the quality of the annotations can be greatly improved by incorporating same-species transcriptomes [7].

2.1. Transcriptomes

The transcriptome is the complete set of transcripts in a given population of cells. The transcriptome of a eukaryotic species is much smaller than its genome. The genome is the complete collection of DNA, including genes and intergenic regions. Different cells of an organism carry the same copy of the genome, but only the expressed genes determine which parts of the genome are used to code for proteins; for example, the transcriptome of a skin cell reflects the parts of the genome that are activated to make skin cells. RNA sequencing promises a comprehensive analysis of the transcriptome and allows for complete annotation of all genes and isoforms for a sample in a given condition [8].

Figure 1: RNA Sequencing. RNA sequencing experiments involve making a collection of cDNA fragments. The fragments are then flanked by specific constant sequences known as adaptors that are necessary for sequencing. Adaptors are used to link the ends of two DNA molecules that have different sequences at their ends. This collection, referred to as a library, is then sequenced on a high-throughput sequencer, which produces millions of short sequence reads that correspond to individual cDNA fragments.

Figure 1 depicts the flow of RNA sequencing. Different computational methodologies exist to find genes. In this study, we assembled the short reads from the high-throughput sequencers and aligned the assembled sequences to the reference genome. The alignment program searches for matches with the reference genome to create gene models that include exons and introns. This method of creating gene models and predicting genes is valued because it is based on evidence from RNA transcripts. It also allows existing annotations to be reviewed and updated based on the transcript alignments.

2.2. Spliced Alignment

Figure 2: Spliced Alignment. Unspliced reads lie within exactly one exon, whereas spliced reads (in blue) span two or more exons.

In an ungapped alignment, every short read lies within exactly one exon, as shown by the grey blocks in Figure 2.

Spliced reads, shown in blue in Figure 2, may span two or more exons. Reads that span exons pose a more challenging problem for alignment programs because the exon-intron boundaries must be determined correctly. Hence, spliced reads are often aligned to the genome in a two-step approach in which the initial read alignments are analyzed to discover exon junctions; these junctions are then used to guide the final alignment [10]. Additionally, an alignment program may also take into consideration the density of the multiple transcripts for correct alignment.


2.3. Transcriptome Assembly

Transcriptome assembly, or reconstruction, is the process of identifying transcripts and isoforms in a specific population of cells. There are broadly two methods of transcriptome assembly [9]:

a. Reference genome based
b. De novo transcriptome assembly

Figure 3: Reference Based vs De Novo Assembly. In the reference-based method, reads are mapped back to a reference genome. In a de novo assembly strategy, reads are compared to each other to reconstruct expressed isoforms without the need for a reference genome.

In reference-based transcriptome assembly, the sequence reads are mapped to the reference genome of the species under study or of a closely related species. A critical step in assembling short RNA reads is the spliced alignment of reads to the genome [10]. This is followed by building a graph representing alternative splicing events, as shown in Figure 4. The next step is to traverse the graph to assemble variants. Transcripts can be predicted from the splicing graph using different approaches, such as maximum likelihood⁷ or maximum parsimony⁸.
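To make the splicing-graph idea concrete, the following minimal sketch enumerates every start-to-end path in a small directed acyclic graph whose nodes are exons. The exon names and the graph itself are hypothetical, and exhaustive enumeration is used only for illustration; it is not the maximum likelihood or maximum parsimony selection mentioned above.

```python
from typing import Dict, List


def enumerate_isoforms(graph: Dict[str, List[str]], start: str, end: str) -> List[List[str]]:
    """Return every path from start to end in a directed acyclic splice graph."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == end:
            paths.append(path)
            continue
        # Extend the partial path along each outgoing exon-exon connection.
        for successor in graph.get(node, []):
            stack.append(path + [successor])
    return sorted(paths)


# Hypothetical gene with four exons; exon2 is optional (exon skipping).
splice_graph = {
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon3"],
    "exon3": ["exon4"],
    "exon4": [],
}

isoforms = enumerate_isoforms(splice_graph, "exon1", "exon4")
for isoform in isoforms:
    print(" -> ".join(isoform))
```

Each printed path corresponds to one candidate splice variant; a real assembler would then rank or prune these candidates using read support.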

Figure 4: Build a Graph Representing Alternative Splicing Events. The nodes represent the exons and the directed edges represent the different connections between exons. A path from start to end can include different exons, representing different splice patterns.

In the de novo assembly method, the transcriptome is assembled without the aid of a reference genome. This is often the preferred method for studying non-model organisms because it is easier than building the genome, and reference-based assembly is not possible without an existing reference genome. The first step in the de novo assembly process involves generating all substrings of length k from the reads.
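The substring-generation step can be sketched in a few lines. The reads and the value of k below are made up for illustration and are far shorter than real sequencing data.

```python
from collections import Counter
from typing import Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """Return all substrings of length k from a read, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count how often each k-mer occurs across all reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Two hypothetical overlapping reads.
reads = ["ATGGCG", "GGCGTA"]
kmer_counts = count_kmers(reads, 3)
print(kmers(reads[0], 3))
print(kmer_counts.most_common(2))
```

k-mers that occur in more than one read (here GGC and GCG) are the ones that let an assembler join reads together.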

7 A method of estimating the parameters of a statistical model which makes the known likelihood distribution a maximum (https://en.wikipedia.org/wiki/Maximum_likelihood)
8 The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. (http://www.icp.ucl.ac.be/~opperd/private/parsimony.html)

Figure 5: De novo Assembly. (a) Generating substrings. The reads are split into shorter substrings called k-mers for use in the de Bruijn graph. (b) De Bruijn graph. The k-mers are represented by the nodes and the overlaps between them by the edges. Walking along the edges of the graph allows one to construct the assembly.

After the substrings are generated, a de Bruijn graph⁹ is created and traversed to assemble the isoforms.
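A minimal sketch of graph construction and traversal follows. It uses one common formulation in which each k-mer becomes an edge between its (k-1)-base prefix and suffix, which is an assumption and differs slightly from the node labelling in Figure 5. The traversal simply follows unbranched edges; it is not TRINITY's algorithm, and the reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            if suffix not in graph[prefix]:
                graph[prefix].append(suffix)
    return dict(graph)


def walk_unbranched(graph: Dict[str, List[str]], start: str) -> str:
    """Extend a contig from start while the current node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ATGGCG", "GGCGTA"]  # hypothetical overlapping reads
graph = build_de_bruijn(reads, 4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

Branch points in the graph, where a node has more than one successor, are where alternative isoforms diverge; this sketch stops at the first branch rather than resolving it.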

In the absence of a reference genome, de novo assembly is the method of choice. Although genome alignment is a robust way of characterizing transcript sequences, it is at a disadvantage when there are structural alterations of mRNA transcripts, such as alternative splicing [11].

9 A graph whose nodes are sequences of symbols from some alphabet and whose edges indicate the sequences which might overlap (https://en.wikipedia.org/wiki/De_Bruijn_graph)

In this study, TRINITY-assembled transcripts were used as input to the PASA alignment program to create the alignment assembly. TRINITY is a de novo transcriptome assembler that recovers more full-length transcripts and has sensitivity similar to that of reference-based aligners [12]. TRINITY uses three independent software modules: Inchworm divides the reads and assembles initial contigs, Chrysalis groups related contigs into de Bruijn graphs, and Butterfly collapses the graphs and aligns the reads to them [13].

2.4. Transcriptome Alignment

Transcribed sequences are aligned with the reference genome or reference transcriptome to delineate the gene structures via gapped alignments [3]. In our study, we used PASA (Program to Assemble Spliced Alignments). PASA is considered the gold standard for gene structure resolution via spliced alignment.

Figure 6: PASA Pipeline. Transcripts are aligned to the genome with a minimum of 95% identity and a minimum of 90% of the transcript length aligned. Alignment is performed by GMAP, BLAT, or both. PASA also allows for alignment comparison and updates to the existing annotation.

The PASA pipeline can automatically incorporate transcript alignments into gene structure annotation. The PASA pipeline is shown in Figure 6. In addition to annotation comparison, PASA also suggests exon modifications, alternate splice isoform additions, gene merges, gene splits and new genes. The PASA algorithm finds the maximal assembly of compatible alignments, to provide the maximum evidence supporting gene structures.


2.5. Genome Annotation

Genome annotation broadly consists of:

- Assigning gene names

- Functional characteristics of gene products

- Physical characteristics of genes

- Metabolic profile of the species

Figure 7: Sample Annotation Pipeline. The region of DNA sequence from the START codon to the STOP codon is known as an open reading frame (ORF). Annotation involves identification of the ORFs followed by generation of the protein sequences. The protein sequences are searched for homology with known proteins in the public nr database¹⁰ and pertinent information is attached to the genes.

Annotation can be performed automatically, manually, or using a combination of both. In the automatic process, a computer program makes the decisions. The advantage of the automatic process is that annotation is quick and easy; the disadvantage is that the quality of the annotation may be compromised. In the manual annotation process, humans make the decisions; although the result is of the highest quality, the process is time-consuming and tedious. Due to the rate at which whole genomes are sequenced these days, the annotation process is mostly automatic with manual curation.

10 The nr database is compiled by the NCBI (National Center for Biotechnology Information) as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF (http://www.matrixscience.com/help/seq_db_setup_nr.html)

Annotation can be structural or functional. Structural annotation in its simplest form consists of identifying the genes in the genomic DNA. There are two main ways to define a gene structure: by prediction and by sequence similarity. With gene prediction, algorithms are developed to identify structure based on nucleotide sequence and composition. In the sequence similarity technique, transcripts of the species are aligned with the genomic sequence of the same or a related species to identify motifs and domains. In the absence of a related species, transcripts are searched with BLAST against the non-redundant database to identify proteins.
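As a simple illustration of structure identification from nucleotide sequence alone, the sketch below scans the three forward reading frames for open reading frames that run from an ATG start codon to the first in-frame stop codon. It is a toy under stated assumptions: it ignores the reverse strand and introns, uses the standard stop codons, and the example sequence is made up. It is far simpler than predictors such as AUGUSTUS or SNAP.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for each forward-strand ORF, 0-based, end exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, seq[start:end]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


example = "CCATGGCTTGAAATGAAACCCTAGG"  # hypothetical sequence
orfs = find_orfs(example)
for start, end, orf in orfs:
    print(start, end, orf)
```

Each reported ORF could then be translated and searched against a protein database, as described in the Figure 7 caption.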

In functional annotation, descriptive names for the gene products are established with as much specificity as the evidence supports. Functional annotation consists of attaching biological information, such as biochemical and biological function, to genomic elements [22].


3. Architectural Model

3.1. Annotation pipeline

Figure 8: Annotation Pipeline. A customized pipeline using PASA, MAKER, AUGUSTUS, SNAP and EVM was developed to re-annotate E. huxleyi, G. oceanica and I. galbana.

Figure 8 shows the detailed workflow of the re-annotation pipeline. To check the validity of the re-annotation pipeline, Emiliania huxleyi transcripts were used as the first input to the pipeline, because a high-confidence JGI-generated annotation of E. huxleyi was already available. To validate the workflow, we expected more than 95% of the previously annotated E. huxleyi proteins to be present in the re-annotated E. huxleyi gene set.

For each sister species, the input to the first step was the TRINITY-assembled transcripts. The reads from all four conditions were used as input. We chose a pipeline based on PASA (Program to Assemble Spliced Alignments) because it is the gold standard for gene structure resolution (introns and exons via spliced alignment) and provides evidence for alternative splicing, polyadenylation sites, and 5' and 3' untranslated regions. The PASA pipeline automatically incorporates transcript alignments into gene structure annotations [3].

The PASA pipeline was used to build the transcript database after aligning the transcripts with the respective reference genome, followed by alignment of assemblies. The alignment was carried out using the GMAP alignment tool within the PASA pipeline. The pipeline generated alignment assemblies along with the gene structures for each of the three species. A comparison tool provided within PASA was used to compare the PASA-generated gene structures with the existing E. huxleyi annotation. PASA also provides a tool for generating the training set that was used to train the ab initio gene predictors SNAP and AUGUSTUS.

MAKER2 was used to generate protein alignments against the UniProt¹¹ reference sequence database. The genome file for each species was used as the input to MAKER2 and searched against the UniProt database.

Finally, all the gene evidence generated by MAKER2, AUGUSTUS, SNAP and PASA was combined using the EVM¹² (EvidenceModeler) tool to generate the annotation.

11 Universal Protein Resource (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC540024/)

3.2. Comparison pipeline

Figure 9: Comparison Pipeline. The blastp tool was used to extract the homologous genes between two species.

In the second part of the study, the annotated genes of each species were compared against each other to find the genes common to any two species and to all three species. The blastp tool included in the BLAST software suite was used with a strict expect value of 10e-6 to blast the gene set of one species against that of another.
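The filtering of blastp results by expect value can be sketched as follows. The sketch assumes the hits were saved in the standard 12-column tabular BLAST+ format (`-outfmt 6`), where the query id is column 1 and the e-value is column 11; the text does not state which output format was actually used. The gene ids and hit values are hypothetical, and the cutoff is written exactly as it appears above.

```python
from typing import Set

EVALUE_CUTOFF = 10e-6  # as written in the text; adjust if a different threshold was intended


def homologous_queries(tabular: str, cutoff: float = EVALUE_CUTOFF) -> Set[str]:
    """Return query ids with at least one hit whose e-value is at or below the cutoff."""
    hits: Set[str] = set()
    for line in tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= cutoff:
            hits.add(query_id)
    return hits


# Hypothetical E. huxleyi queries against a G. oceanica protein set.
sample_rows = [
    ["ehux_0001", "goce_0042", "87.5", "200", "25", "0", "1", "200", "5", "204", "1e-80", "350"],
    ["ehux_0002", "goce_0100", "35.0", "60", "39", "0", "10", "69", "3", "62", "0.5", "28.1"],
    ["ehux_0003", "goce_0007", "62.1", "150", "57", "0", "1", "150", "1", "150", "3e-20", "120"],
]
sample = "\n".join("\t".join(row) for row in sample_rows)

shared = homologous_queries(sample)
print(sorted(shared))
```

Collecting query ids into a set means a gene with several significant hits is counted once, which is what the later set operations need.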

12 Evidence modeler: http://evidencemodeler.github.io/

Figure 10: Finding Unique Genes for a species. To find unique genes for E. huxleyi, blastp was used with E. huxleyi as the query and each of the other species as the reference. A set intersection was performed on the E. huxleyi ids from the two blastp results to find E. huxleyi genes that align to both G. oceanica and I. galbana. A similar method was applied to G. oceanica and I. galbana.

In this part of the study, unique genes were extracted for each species using the workflow outlined in Figure 10. First, for a given species, genes were identified that had a hit in both other species. The resulting three-way common genes were then used to extract the genes common to only two species. The three-way and two-way common genes were subtracted from the total genes for the species to extract the unique genes.
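The set arithmetic described above can be written directly with Python sets. This is a minimal sketch; the gene ids are hypothetical and the function name is chosen here for illustration.

```python
from typing import Dict, Set


def partition_genes(all_genes: Set[str], hits_vs_a: Set[str], hits_vs_b: Set[str]) -> Dict[str, Set[str]]:
    """Split one species' genes by whether they hit species A, species B, both, or neither."""
    three_way = hits_vs_a & hits_vs_b   # hit in both other species
    only_a = hits_vs_a - hits_vs_b      # shared with A only
    only_b = hits_vs_b - hits_vs_a      # shared with B only
    unique = all_genes - three_way - only_a - only_b
    return {"three_way": three_way, "only_a": only_a, "only_b": only_b, "unique": unique}


# Hypothetical E. huxleyi gene ids and blastp hit sets.
ehux_genes = {"g1", "g2", "g3", "g4", "g5"}
hits_vs_goce = {"g1", "g2", "g3"}
hits_vs_igal = {"g2", "g3", "g4"}

parts = partition_genes(ehux_genes, hits_vs_goce, hits_vs_igal)
for name, genes in parts.items():
    print(name, sorted(genes))
```

The four groups are disjoint and together cover every gene of the query species, so their sizes should sum to the total gene count.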


4. Implementation and Analysis

This section discusses in detail the steps of the pipeline outlined in the architecture and the analysis of each step of the pipeline.

4.1. Data Preparation

4.1.1. Input Data Source

TRINITY-assembled transcripts for each species were used as the initial input for this study. The inputs for each species were provided in two files based on the transcript sequences expressed in four different growth conditions. The files were named _contigs and _singlets for each species. The contigs file contains the transcripts expressed in all conditions. The singlets file contains the transcripts expressed in one or more, but not all four, conditions. The input files were combined into one file and duplicate headers in the singlets file were renamed. The commands used to generate the statistics are listed in Appendix 1, and Table 1 below shows the statistics.

Contigs + Singlets       E. huxleyi   G. oceanica   I. galbana
No. of Transcripts       99523        191145        53044
Longest transcript       18018        22373         27385
Shortest transcript      201          201           201
Mean transcript size     1058         896           1404
N50 length               1507         1392          2167

Table 1: TRINITY Transcriptome Assembly
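The summary statistics reported in Table 1 can be computed from a list of transcript lengths as in the sketch below. This is not the Appendix 1 command set; it is a minimal illustration in which N50 is taken relative to half of the total assembled length, and the lengths are made up.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length of the sequence at which the running total, longest first, reaches half the total."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty positive input


def assembly_stats(lengths: List[int]) -> Dict[str, float]:
    """Collect the statistics listed in Table 1 for one assembly."""
    return {
        "count": len(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "mean": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }


transcript_lengths = [2, 3, 4, 5, 6, 10]  # hypothetical lengths
stats = assembly_stats(transcript_lengths)
print(stats)
```

In practice the lengths would be read from the combined contigs and singlets FASTA file for each species.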

4.1.2. Data Trimming

The input transcripts for each species were cleaned using the seqclean utility scripts. SeqClean is a tool for validation and trimming of DNA sequences from a flat file such as a FASTA-formatted file [14, 15, 16]. A vector file was downloaded from the NCBI website [16]. The tool processes the input file and filters its content based on criteria such as the percentage of undetermined bases (sequences are discarded when undetermined bases exceed 3% of the clear range), polyA tail removal, overall low complexity, homology to vectors, and homology to other contaminants or unwanted sequences. The seqclean program was run with its default parameters.

Sequence cleaning        E. huxleyi   G. oceanica   I. galbana
Sequences analyzed       99523        191145        53044
Valid                    99515        191115        53043
Trashed                  8            30            1
Trashed by ‘shortq’      2            17            0
Trashed by ‘dust’        6            13            1

Table 2: Seqclean Results

4.2. Transcript Alignment

The next step was transcript re-construction from the alignment of the assembled transcripts to the genome using the PASA pipeline.

4.2.1. PASA Pipeline

PASA also performed sequence cleaning by identifying polyadenylation sites and low-quality sequences. The PASA pipeline first aligns the TRINITY assemblies to the reference genome. An alignment is considered valid when it has at least 95% identity and at least 90% of the transcript length aligns. Although these criteria are configurable, the defaults were used here. After the transcripts were aligned with the alignment tool GMAP¹³, each alignment was validated by requiring the [GT, GC]/AG consensus donor/acceptor splice sites at all introns [3].
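The validity criteria above can be expressed as a simple filter. This sketch is not PASA's code; the `Alignment` record, its field names and the example values are hypothetical, and only the thresholds (95% identity, 90% of transcript length, GT or GC donor with AG acceptor) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

VALID_DONORS = {"GT", "GC"}
VALID_ACCEPTOR = "AG"


@dataclass
class Alignment:
    transcript_id: str
    percent_identity: float
    percent_aligned: float  # share of the transcript length that aligns
    # One (donor, acceptor) dinucleotide pair per intron in the alignment.
    intron_boundaries: List[Tuple[str, str]] = field(default_factory=list)


def is_valid(aln: Alignment, min_identity: float = 95.0, min_aligned: float = 90.0) -> bool:
    """Apply the identity, coverage and consensus splice-site checks."""
    if aln.percent_identity < min_identity or aln.percent_aligned < min_aligned:
        return False
    return all(
        donor.upper() in VALID_DONORS and acceptor.upper() == VALID_ACCEPTOR
        for donor, acceptor in aln.intron_boundaries
    )


alignments = [
    Alignment("t1", 99.1, 97.0, [("GT", "AG"), ("GC", "AG")]),
    Alignment("t2", 93.0, 99.0, [("GT", "AG")]),   # identity too low
    Alignment("t3", 98.0, 95.0, [("AT", "AC")]),   # non-consensus splice site
    Alignment("t4", 96.0, 91.0),                    # single exon, no introns
]
valid_ids = [aln.transcript_id for aln in alignments if is_valid(aln)]
print(valid_ids)
```

A single-exon alignment has no introns to check, so it passes the splice-site test as long as it meets the identity and coverage thresholds.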

13 GMAP is a standalone program for mapping and aligning cDNA sequences to a genome (http://www.ncbi.nlm.nih.gov/pubmed/15728110)

The alignment was followed by clustering of overlapping sequences into groups, or clusters. Each cluster was subjected to the PASA algorithm to generate a set of unique, maximal alignment assemblies. The assemblies within each cluster were further divided into sub-clusters representing conflicting complete or partial transcripts likely corresponding to the same gene (i.e. alternative or anomalous splicing isoforms of the same gene). These groups or clusters are called PASA assemblies, and they were further clustered to combine transcripts into longer assemblies. The sub-clusters required the same spliced orientation and at least 50% overlap with a neighboring alignment. The sub-clusters represent the genes in the PASA assembly and were used to perform the comparative study with the previous annotation.

PASA allows different alignment tools to be used within the pipeline; for example, GMAP and/or BLAT are supported for the transcript alignment step. We used the GMAP alignment tool in our pipeline because it is specifically designed to align spliced transcripts to the genome. BLAT is not suitable for this analysis as it is not designed for aligning spliced transcripts. We used the cleaned TRINITY assemblies as the PASA input, and the PASA alignment program generated assemblies, gene structures and descriptions of the sub-clusters.
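The sub-clustering rule described above (same spliced orientation and at least 50% overlap with a neighboring alignment) can be illustrated with a greedy sketch. Several details are assumptions made here and may differ from PASA's implementation: overlap is measured as a fraction of the shorter of the two assemblies, coordinates are inclusive, and each assembly is compared only with the previous one in sorted order. The assembly ids and coordinates are hypothetical.

```python
from typing import List, Tuple

Assembly = Tuple[str, int, int, str]  # (id, start, end, strand), inclusive coordinates


def overlap_fraction(a: Assembly, b: Assembly) -> float:
    """Overlap length divided by the length of the shorter assembly."""
    overlap = min(a[2], b[2]) - max(a[1], b[1]) + 1
    if overlap <= 0:
        return 0.0
    shorter = min(a[2] - a[1] + 1, b[2] - b[1] + 1)
    return overlap / shorter


def sub_cluster(assemblies: List[Assembly], min_overlap: float = 0.5) -> List[List[str]]:
    """Greedily chain neighboring same-strand assemblies that overlap enough."""
    clusters: List[List[Assembly]] = []
    for asm in sorted(assemblies, key=lambda x: (x[3], x[1], x[2])):
        previous = clusters[-1][-1] if clusters else None
        if (
            previous is not None
            and previous[3] == asm[3]
            and overlap_fraction(previous, asm) >= min_overlap
        ):
            clusters[-1].append(asm)
        else:
            clusters.append([asm])
    return [[asm[0] for asm in cluster] for cluster in clusters]


assemblies = [
    ("asm1", 100, 500, "+"),
    ("asm2", 250, 700, "+"),  # overlaps asm1 by more than half
    ("asm3", 650, 900, "+"),  # overlaps asm2 only slightly
    ("asm4", 320, 680, "-"),  # opposite orientation
]
genes = sub_cluster(assemblies)
print(genes)
```

Each resulting group plays the role of one sub-cluster, that is, one putative gene.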

The file alignAssembly.config was used with the default parameter settings for the pipeline. The command for running the PASA pipeline is given in Appendix 2. PASA assemblies are stored in a MySQL database, where they can readily be used by the PASA web interface tool for statistical analysis. The PASA statistics obtained for the E. huxleyi, G. oceanica, and I. galbana annotations are presented in Table 3 below.


Transcripts or Assemblies                          E. huxleyi   G. oceanica   I. galbana
Total transcript sequences                         99,515       191,115       53,043
Full length cDNAs                                  0            0             0
Partial cDNAs (ESTs)                               99,515       191,115       53,043
Number transcripts with any alignment              92,643       117,942       49,216
Valid gmap alignments                              55,040       65,699        36,601
Total Valid alignments                             55,040       65,699        36,601
Valid Full Length-cDNA alignments                  0            0             0
Valid EST alignments                               55,040       65,699        36,601
Number of assemblies                               45,253       48,884        30,592
Number of sub-clusters (genes)                     34,066       32,758        24,918
Number of Full Length-containing assemblies        0            0             0
Number of non-Full Length-containing assemblies    45,253       48,884        30,592

4.2.2. Table 3: PASA Annotation Comparison

PASA tools were also used to compare the E. huxleyi annotation generated herein with the previous JGI annotation (Table 4). The configuration file annotcompare.config was updated for each species with the MySQL database name used in the PASA alignment assembly. The other parameters in the file were left at their default values for the comparison. The commands for comparing the PASA alignments with the existing annotations are listed in Appendix 2.


| Overview | Previous Annotation | PASA Pipeline |
|---|---|---|
| Predicted genes | 30,449 | 34,066 |
| Annotated Transcripts | | |
| Transcripts with start and stop codon | 26,756 | 29,181 |
| Transcripts missing start or stop codon | 3,693 | 4,885 |
| Single exon transcripts | 7,734 (25.4%) | 7,372 (22.5%) |
| Transcript N50 length (footnote 14) | 1,368 | 1,394 |
| Exons | | |
| Total number of exons | 110,697 | 124,366 |
| Average exon length | 293.4 | 297.36 |
| 3' UTR (Untranslated Region) | | |
| Total transcripts with 3' UTR | 10,591 | 17,403 |
| 5' UTR (Untranslated Region) | | |
| Total transcripts with 5' UTR | 9,854 | 16,053 |
| Intron | | |
| Total introns | 80,237 | 91,654 |
| Average intron length | 240.38 | 221.69 |
| Total intron sequence length | 19,256,880 | 20,318,775 |

Table 4: Comparison of PASA annotation results and JGI (footnote 15) annotations of E. huxleyi

14 The N50 of an assembly is a weighted median of the lengths of the sequences it contains, equal to the length of the longest sequence s such that the sum of the lengths of sequences greater than or equal in length to s is greater than or equal to half the length of the genome being assembled.

15 Joint Genome Institute (http://jgi.doe.gov/)

| Overview | Previous Annotation | PASA Pipeline |
|---|---|---|
| Predicted genes | 28,441 | 32,758 |
| Annotated Transcripts | | |
| Transcripts with start and stop codon | NA | 29,415 |
| Transcripts missing start or stop codon | NA | 3,343 |
| Single exon transcripts | 12,595 (44.28%) | 12,537 (41.32%) |
| Transcript N50 length | 1,392 | 1,462 |
| Exons | | |
| Total number of exons | 66,336 | 77,845 |
| Average exon length | 523.64 | 491.59 |
| 3' UTR | | |
| Total transcripts with 3' UTR | 2,803 | 7,890 |
| 5' UTR | | |
| Total transcripts with 5' UTR | 2,737 | 7,063 |
| Intron | | |
| Total introns | 37,895 | 47,502 |
| Average intron length | 244.34 | 222.94 |
| Total intron sequence length | 9,259,264 | 10,590,095 |

Table 5: Comparison of PASA annotation results and previous MAKER gene annotations of G. oceanica


| Overview | Previous Annotation | PASA Pipeline |
|---|---|---|
| Predicted genes | 13,148 | 24,917 |
| Annotated Transcripts | | |
| Transcripts with start and stop codon | NA | 14,608 |
| Transcripts missing start or stop codon | NA | 10,309 |
| Single exon transcripts | 3,333 (25.35%) | 3,275 (21.56%) |
| Transcript N50 length | 2,167 | 2,140 |
| Exons | | |
| Total number of exons | 36,689 | 44,719 |
| Average exon length | 481.96 | 490.93 |
| 3' UTR | | |
| Total transcripts with 3' UTR | NA | 8,749 |
| 5' UTR | | |
| Total transcripts with 5' UTR | NA | 8,448 |
| Intron | | |
| Total introns | 23,541 | 29,531 |
| Average intron length | 309.94 | 296.78 |
| Total intron sequence length | 7,296,297 | 8,764,210 |

Table 6: Comparison of PASA annotation results and previous MAKER gene annotations of I. galbana


| Statistic | E. huxleyi | G. oceanica | I. galbana |
|---|---|---|---|
| Transcripts with start and stop codon | 29,181 | 29,415 | 14,608 |
| Single exon transcripts | 7,372 | 12,537 | 3,275 |
| Transcript N50 length | 1,394 | 1,462 | 2,140 |
| Total number of exons | 124,366 | 77,845 | 44,719 |
| Average exon length | 297.36 | 491.59 | 490.93 |
| Total transcripts with 3' UTR | 17,403 | 7,890 | 8,749 |
| Total transcripts with 5' UTR | 16,053 | 7,063 | 8,448 |
| Total introns | 91,654 | 47,502 | 29,531 |
| Average intron length | 221.69 | 222.94 | 296.78 |

Table 7: Side by side PASA output comparison

Tables 4, 5 and 6 compare the key statistics of the previous annotations with the PASA comparison results. First, PASA identified more genes in each species: 3,617 more for E. huxleyi, 4,317 more for G. oceanica and 11,769 more for I. galbana. Second, the PASA predicted proteins appear to be more complete: 29,181 E. huxleyi transcripts contain both a START and a STOP codon, compared with 26,756 in the previous annotation. The previous annotations of G. oceanica and I. galbana did not report START or STOP codons, whereas the PASA results describe 29,415 transcripts for G. oceanica and 14,608 for I. galbana containing both. Table 7 shows that PASA predicted considerably more transcripts with 5' UTR sequences for E. huxleyi (16,053) than for G. oceanica (7,063) or I. galbana (8,448).
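The transcript N50 reported in Tables 4 to 7 follows the definition given in footnote 14. A minimal Python sketch is shown below, assuming the reference total is the summed length of the transcripts themselves rather than a genome length; the function name is illustrative and not part of PASA.

```python
def n50(lengths):
    """Length of the longest sequence s such that sequences at least as long
    as s together cover half or more of the total length."""
    if not lengths:
        raise ValueError("n50 requires at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For example, the lengths 1, 2, 3, 4 and 5 sum to 15; the two longest sequences cover 9, which is at least half of 15, so the N50 is 4.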


4.3. Training Data Preparation

Figure 11: Workflow for Generating Training Set. Unique complete ORFs with a minimum protein length of 300 were extracted using a PASA utility and blasted against the non-redundant database. The homologous gene structures were extracted and checked for redundancy and gene feature overlap.

4.3.1. Input Data

A subset of the PASA generated alignment assemblies was used to extract protein coding regions, which served as a high quality data set for training the ab initio gene predictors AUGUSTUS and SNAP. A training data set is essentially a hypothesis about the typical structure of the transcript produced by a gene. In this study, we extracted unique complete ORFs from the PASA assemblies as the input to the ab initio predictors.

4.3.2. ORF

The region of a DNA sequence from the START codon to the STOP codon is referred to as an open reading frame (ORF). The gene finding process starts by looking for ORFs. An ORF usually starts with a START codon and terminates with one of the three terminating codons: TAA, TAG, or TGA. Depending on the strand and the starting position, the same DNA sequence can be read in six reading frames, each of which yields a different protein sequence. A tool provided by the PASA pipeline, the Transdecoder module, was used to find likely coding regions within transcripts.
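To make the six reading frames concrete, the following Python sketch scans both strands of a sequence for complete ORFs. It is only an illustration and not the Transdecoder algorithm: it assumes ATG as the START codon, and the function names and the `min_codons` parameter are hypothetical.

```python
START = "ATG"                     # assumed START codon (standard genetic code)
STOPS = {"TAA", "TAG", "TGA"}     # terminating codons named in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=300):
    """Return (strand, frame, start, end) for each complete ORF.

    Coordinates are 0-based, half-open, on the scanned strand, and include
    the stop codon. min_codons counts codons from START up to, but not
    including, the stop codon (i.e. the protein length in amino acids).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon == START:
                        orf_start = i          # open a new ORF
                elif codon in STOPS:
                    if (i - orf_start) // 3 >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3))
                    orf_start = None           # close the ORF
    return orfs
```

The default of 300 codons matches the minimum protein length used in this study; ORFs that reach the end of the sequence without a stop codon are not reported, since only complete ORFs are of interest here.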

Transdecoder identifies candidate coding sequences based on the following criteria:

1. Transcript sequences are scanned for ORFs of a minimum length. In our study, a minimum length of 300 amino acids was used as the criterion for finding coding regions.

2. A log-likelihood score. Likelihood ratio test is a statistical test used to compare the

goodness of fit of two models (model under test vs alternate model). The test is based

on the likelihood ratio, which expresses how many times more likely the data are

under one model than the other. When the logarithm of the likelihood ratio is used,

the statistic is known as a log-likelihood score. Transdecoder calculates the score

based on how likely the sequence under test is a coding sequence (vs a random

sequence). A score of more than 0 indicates a likely coding sequence.
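The idea behind a log-likelihood score can be sketched in a few lines of Python. This is a deliberately simplified codon-frequency model and not the model Transdecoder actually uses; the two frequency tables are hypothetical inputs that would have to be estimated from known coding and non-coding sequence.

```python
import math


def log_likelihood_score(seq, coding_freq, background_freq):
    """Sum of per-codon log ratios: log(P(codon | coding) / P(codon | background)).

    A positive total means the sequence is more likely under the coding
    model than under the background model.
    """
    usable = len(seq) - len(seq) % 3   # ignore a trailing partial codon
    score = 0.0
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        score += math.log(coding_freq[codon] / background_freq[codon])
    return score
```

A sequence dominated by codons that are frequent in coding regions accumulates positive terms and ends with a score above 0, while a sequence of codons that are rare in coding regions ends below 0.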

4.3.3. Extracting Unique Complete ORF

Transdecoder was run on the generated PASA assemblies, and the resulting output file was scanned for unique complete ORFs. Non-overlapping ORFs were extracted as the initial data set.

The process of extracting the unique ORFs is outlined in Appendix 3.

| Species | Number of unique ORFs extracted |
|---|---|
| E. huxleyi | 7,239 |
| G. oceanica | 17,113 |
| I. galbana | 13,135 |

Table 8: Unique Complete ORFs Extracted


4.3.4. Candidate Gene Structure Identification

The complete ORFs extracted by Transdecoder were then blasted against the non-redundant (nr) protein database at GenBank. This study selected ORFs with an expect value less than or equal to 0.000001 and an alignment coverage of at least 65%. Gene structures extracted using this method can be confidently used for training [7]. The extracted gene structures were clustered using CD-HIT (footnote 16) to remove redundancy. A PASA tool that extracts the best ORFs with a protein length of at least 300 amino acids was used to further filter the training set. Finally, the extracted training set was checked for overlapping genes using BEDTools.
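The selection of nr hits by expect value and coverage can be sketched as follows. The sketch assumes the tabular BLAST output layout given in Appendix 3 (`qseqid sseqid evalue bitscore qlen length`) and assumes that coverage means alignment length divided by query length; the function name is hypothetical.

```python
def select_training_orfs(lines, max_evalue=1e-6, min_coverage=0.65):
    """Return the set of query ids with at least one hit passing both cutoffs.

    Each non-comment line is expected to hold six tab-separated fields:
    qseqid, sseqid, evalue, bitscore, qlen, length.
    """
    selected = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' comment lines of -outfmt 7
        qseqid, _sseqid, evalue, _bitscore, qlen, length = line.split("\t")
        coverage = int(length) / int(qlen)  # assumed definition of coverage
        if float(evalue) <= max_evalue and coverage >= min_coverage:
            selected.add(qseqid)
    return selected
```

The returned ids could then be used to subset the Transdecoder gff3 file, as done with gff3_subset.pl in Appendix 3.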

| Species | nr hits (= number of gene models) |
|---|---|
| E. huxleyi | 7,830 |
| G. oceanica | 11,653 |
| I. galbana | 7,014 |

Table 9: ORF nr-hits

The nr database hits shown in Table 9 were further filtered by clustering the genes with the CD-HIT program to remove redundancy. A gff3 (footnote 17) file for the nr hits was extracted and the corresponding protein sequences were generated using the EVM tool. The protein fasta files were clustered using CD-HIT (Table 10), and the resulting gff3 files were used as the training set for the ab initio gene predictors.

16 CD-HIT (Cluster Database at High Identity with Tolerance) is a program for clustering and comparing protein or nucleotide sequences (http://www.bioinformatics.org/cd-hit/)

17 The general feature format (GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences (https://en.wikipedia.org/wiki/General_feature_format)

| Species | Clustered nr hits |
|---|---|
| E. huxleyi | 7,446 |
| G. oceanica | 6,743 |
| I. galbana | 4,495 |

Table 10: Clustered nr Hits for Gene Predictors

4.4. Ab Initio Training and Prediction

The training set of gene structures in gff3 format was used as the input to the gene predictors AUGUSTUS and SNAP. At least 200 gene structures are needed for AUGUSTUS training. Generally speaking, the larger the training set the better the prediction; however, there is little improvement when more than 1000 gene structures are used [17]. AUGUSTUS is already trained for many species. To run AUGUSTUS on a new species, the algorithm must be trained on the characteristics of that particular species. The Markov chain transition probabilities for coding and non-coding regions and the meta parameters for each species are created during the AUGUSTUS training process and can be found in the AUGUSTUS config folder.

AUGUSTUS manual training was performed and gene prediction species models were created for each of the three species, referred to as ehux, geph and iso. Gene prediction was performed thereafter; the results are presented in Table 11. The commands are listed in Appendix 4.

| Species | Genes predicted |
|---|---|
| E. huxleyi | 49,111 |
| G. oceanica | 54,354 |
| I. galbana | 21,639 |

Table 11: AUGUSTUS Gene Prediction


The other ab initio prediction tool used was SNAP. The Semi-HMM-based (footnote 18) Nucleic Acid Parser is a general purpose gene predictor used for both eukaryotic and prokaryotic genomes. SNAP requires either a precompiled HMM model file or an HMM file created for each new species to be used in the prediction stage. The training file in gff3 format was converted to the format required by SNAP. The steps to create an HMM model for SNAP are listed in the Appendix and the results obtained are presented in Table 12.

| Species | Genes predicted |
|---|---|
| E. huxleyi | 76,582 |
| G. oceanica | 104,440 |
| I. galbana | 29,697 |

Table 12: SNAP Gene Prediction

4.5. Reference Protein Alignment Generation

Reference protein alignments were generated using MAKER2 by aligning each sister species genome against the UniProt reference protein database. This technique finds the parts of the genome that align with known protein sequences in the reference database. The alignments resulting from the run were formatted for EvidenceModeler (EVM) input. The MAKER2 commands are listed in Appendix 5.

4.6. Combining Evidences

PASA transcript assemblies, ab initio predictions from AUGUSTUS and reference protein alignments were combined using the EVM package, which provides a flexible framework for combining diverse evidence types into a single gene structure [18]. The genome file in fasta format and the protein alignment, transcript alignment and prediction files, all in general feature format (gff3), are used as input to EVM, along with a weight file that contains a numeric weight to be applied to each type of evidence. The weights can be configured intuitively based on the quality of the evidence: transcript alignments are typically weighted heaviest, followed by protein alignments and then gene predictions. In our study we gave the highest weight of 10 to the PASA assemblies, followed by a weight of 5 for the reference protein alignments and a weight of 1 for the gene predictor AUGUSTUS. SNAP was not used in the re-annotation pipeline because it generated too many false positive predictions. Re-annotation pipeline validation and selection is described in section 4.8.

18 A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network (https://en.wikipedia.org/wiki/Hidden_Markov_model)
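The weighting scheme above can be written out as an EVM weights file, which is a tab-delimited list of evidence class, evidence source and weight. The Python sketch below only formats such a file; the source labels (`pasa_assemblies`, `protein2genome`, `augustus`) are assumptions and would need to match the second column of the gff3 files actually supplied to EVM.

```python
# Weights used in this study; the source labels are hypothetical placeholders.
WEIGHTS = [
    ("TRANSCRIPT", "pasa_assemblies", 10),      # PASA transcript assemblies
    ("PROTEIN", "protein2genome", 5),           # reference protein alignments
    ("ABINITIO_PREDICTION", "augustus", 1),     # AUGUSTUS gene predictions
]


def format_evm_weights(entries):
    """Render (class, source, weight) tuples as tab-delimited weight-file lines."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in entries)


if __name__ == "__main__":
    # Print the file content; redirect to a file to use it with EVM.
    print(format_evm_weights(WEIGHTS), end="")
```

Keeping the weights in one small table makes it straightforward to rerun the combination step with a different balance of evidence, for example when testing whether to include SNAP.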

| Species | EVM predicted genes | Previous genes |
|---|---|---|
| E. huxleyi | 46,705 | 30,446 |
| G. oceanica | 53,480 | 28,441 |
| I. galbana | 18,875 | 13,148 |

Table 13: EVM Predicted Genes

Table 13 shows the genes predicted when all of the evidence was combined using EVM. The predicted genes were clustered in the next step to generate the final gff3 file for each species.

4.7. Gene Clustering

EvidenceModeler output was clustered using the CD-HIT program to remove any redundancy and generate the final gff3 file for each species. This was followed by generating the protein and genes files.

| Species | Previous genes | Clustered (re-annotated) |
|---|---|---|
| E. huxleyi | 30,446 | 35,582 |
| G. oceanica | 28,441 | 52,680 |
| I. galbana | 13,148 | 18,712 |

Table 14: Final Predicted Genes


4.8. Validating Workflow

To validate the re-annotation workflow, the previous high confidence JGI annotation of Emiliania huxleyi was compared against the re-annotated files. The 30,499 E. huxleyi proteins annotated by JGI were blasted against the re-annotated proteins with an expect value of 0.000001. The intermediate methods before the final EVM predictions were adjusted in order to maximize the number of previous JGI E. huxleyi proteins with hits among the re-annotated E. huxleyi proteins. The expectation for the re-annotation pipeline was that at least 95% of the JGI annotated E. huxleyi proteins would show homology to re-annotated E. huxleyi proteins. Once the workflow methods had been adjusted based on the E. huxleyi re-annotation, the other two species were re-annotated using the same workflow and techniques. Table 15 shows the various methods that were used to generate the E. huxleyi re-annotation.

| Method | Training set | Inputs to EVM | # Hit | F1-score |
|---|---|---|---|---|
| 1 | Based on unique complete ORFs | PASA transcripts + RefSeq alignments + AUGUSTUS | 29,000 | 0.87 |
| 2 | Based on unique complete ORFs | PASA transcripts + RefSeq alignments + AUGUSTUS + SNAP | 28,698 | 0.87 |
| 3 | After filtering training set to remove overlapping features | PASA transcripts + RefSeq alignments + AUGUSTUS | 28,978 | 0.89 |
| 4 | After filtering training set to remove overlapping features | PASA transcripts + RefSeq alignments + AUGUSTUS + SNAP | 29,624 | 0.78 |

Table 15: Workflow Validation: Inputs to EVM

The re-annotation pipeline was evaluated using the F1 score calculated for each method listed in Table 15. The F1 score measures the accuracy of a method using the precision p and the recall r. Precision is the ratio of true positives (TP) to all predicted positives (TP + FP). Recall is the ratio of true positives to all actual positives (TP + FN) [19]. The F1 score, or balanced metric, is the harmonic mean of precision and recall:

$$F_1 = \frac{2\,p\,r}{p + r} = \frac{2\,TP}{2\,TP + FP + FN}$$

The F1 metric weights recall and precision equally, and a good method will maximize both precision and recall simultaneously [19, 20]. Moderately good performance on both precision and recall is favored over extremely good performance on one and poor performance on the other [19, 20, 21].

Based on the F1 score column in Table 15, method #3 was chosen as the best workflow for

re-annotating the other two sister species.
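The F1 calculation can be sketched in Python as follows. The function name and the counts in the usage note are illustrative only and are not taken from the validation runs in Table 15.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 90 true positives, 10 false positives and 30 false negatives, precision is 0.90 and recall is 0.75, giving an F1 of about 0.82, which equals 2TP / (2TP + FP + FN) = 180 / 220.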

4.9. Comparative study

This part of the study compared the three species using the BLAST suite of programs.

4.9.1. Reference E. huxleyi

Figure 12: G. oceanica on E. huxleyi and I. galbana on E. huxleyi. (a) Venn diagram illustrating the number of G. oceanica homologous proteins mapped to E. huxleyi. (b) Venn diagram illustrating the number of I. galbana homologous proteins mapped to E. huxleyi.

Figure 12 shows the number of G. oceanica and I. galbana genes that map to E. huxleyi. blastp was used with an expect value of 0.000001, with E. huxleyi as the reference and the other two sister species as the query. 76.6% of G. oceanica genes map to E. huxleyi and 74.4% of I. galbana genes map to E. huxleyi.


4.9.2. Reference G. oceanica

Figure 13: E. huxleyi on G. oceanica and I. galbana on G. oceanica. (a) Venn diagram illustrating the number of E. huxleyi homologous proteins mapped to G. oceanica. (b) Venn diagram illustrating the number of I. galbana homologous proteins mapped to G. oceanica.

Figure 13 shows that 89.3% of E. huxleyi genes map to G. oceanica, whereas 82.0% of I. galbana genes map to G. oceanica. The intersections of common genes in Figure 13 differ from those in Figure 12 because G. oceanica is used as the reference in Figure 13, whereas it was used as the query in Figure 12.

4.9.3. Reference I. galbana

Figure 14: E. huxleyi on I. galbana and G. oceanica on I. galbana. (a) Venn diagram illustrating the number of E. huxleyi homologous proteins mapped to I. galbana. (b) Venn diagram illustrating the number of G. oceanica homologous proteins mapped to I. galbana.

66.1% of E. huxleyi genes map to I. galbana, whereas 68.6% of G. oceanica genes map to I. galbana. As expected, the intersection region of common genes differs from that in Figure 12 and Figure 13 because in this comparison I. galbana is used as the reference. G. oceanica has more shared genes with I. galbana; this could be attributed to the larger number of genes in G. oceanica.

4.9.4. Three way comparison with individual species as Reference

In this part of the study, each species was blasted against each of the other two species using blastp, and the common gene ids were extracted using a set intersection. This method was used to find how many genes of a particular species fall within the three-species intersection.
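The set intersection step can be sketched in Python as below. The sketch assumes the blastp results are in the standard 12-column tabular format, where the expect value is the 11th field; the source does not state the output format used, and the function names are hypothetical.

```python
def query_ids_with_hits(lines, max_evalue=1e-6):
    """Collect query ids that have at least one hit at or below max_evalue.

    Assumes standard 12-column tabular BLAST output: query id in column 1,
    expect value in column 11.
    """
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            ids.add(fields[0])
    return ids


def shared_with_both(hits_vs_first, hits_vs_second, max_evalue=1e-6):
    """Gene ids of the query species that hit both other species."""
    return (query_ids_with_hits(hits_vs_first, max_evalue)
            & query_ids_with_hits(hits_vs_second, max_evalue))
```

For example, running this with E. huxleyi as the query against G. oceanica and against I. galbana would give the E. huxleyi genes that fall in the three-species intersection.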

Figure 15: Common genes based on (a) E. huxleyi, (b) G. oceanica and (c) I. galbana. Venn diagrams showing the common genes when each species in turn is used as the reference.

Figure 15 (a) is based on E. huxleyi genes mapping to G. oceanica and I. galbana, and indicates that 22,875 genes in E. huxleyi show homology to genes in G. oceanica and I. galbana. Similarly, Figure 15 (b) illustrates that 28,714 genes in G. oceanica show homology to genes in E. huxleyi and I. galbana. Finally, Figure 15 (c) shows that 13,223 genes in I. galbana show homology to genes in E. huxleyi and G. oceanica. As expected from the genome sizes, the largest number of common genes was obtained when G. oceanica was used as the reference.


4.9.5. Overlap for three species

Figure 16: Overall common genes. The middle part of the Venn diagram shows the common genes. 28,714 G. oceanica genes map to both E. huxleyi and I. galbana genes. 22,875 E. huxleyi genes map to both G. oceanica and I. galbana genes. 13,223 I. galbana genes map to both E. huxleyi and G. oceanica genes.

In this part of the study, we combined the individual diagrams in Figure 15 to generate overlapping regions that indicate which genes are unique to each species and which are shared between two or three species. As shown in Figure 16, 28,714 G. oceanica genes have similarity to genes in both E. huxleyi and I. galbana. Additionally, 9,563 G. oceanica genes have similarity to 8,913 E. huxleyi genes in the E. huxleyi-G. oceanica overlap. G. oceanica also has 7,444 genes that share similarity with 2,135 I. galbana genes. The study also found that 8.8% of E. huxleyi genes, 13.2% of G. oceanica genes and 14.1% of I. galbana genes are unique. G. oceanica has the most genes in the three-species overlap, followed by E. huxleyi and I. galbana. We also found that I. galbana contributes more genes to the E. huxleyi-I. galbana overlap than E. huxleyi does, even though I. galbana has fewer predicted genes in total.


5. Conclusion

In this study we started with the JGI annotated E. huxleyi and the MAKER annotated G. oceanica and I. galbana. We attempted to improve the annotations by making use of the newly available TRINITY assembled transcripts. We used a customized pipeline involving PASA, MAKER2, AUGUSTUS, SNAP, the BLAST suite of programs, CD-HIT and EvidenceModeler to assemble, align and re-annotate the sister species. Our first dataset for the pipeline was the E. huxleyi TRINITY assembled transcripts, assembled under four different growth conditions. We chose E. huxleyi as the first dataset because the existing annotation for this species is of higher confidence than those of the other two sister species. We sought to show that our customized pipeline was adequate by re-annotating a well annotated species and comparing the results with the existing annotation.

A pipeline for re-annotation was presented. The re-annotation produced improved results compared with the existing annotation, predicting more complete gene structures. The re-annotated proteins of E. huxleyi contain 96.7% of the previously annotated E. huxleyi proteins. 35,582 E. huxleyi genes were predicted, compared to 30,499 in the previous JGI annotation. 52,680 G. oceanica genes were predicted, compared to 28,441 in the previous MAKER2 annotation. For I. galbana, an improvement of 43% over the previous MAKER annotation was obtained using the re-annotation pipeline. The re-annotation pipeline predicted 2,425 additional E. huxleyi genes with both START and STOP codons, a substantial improvement over the JGI annotation. The workflow is a good starting point for further in-depth study; it can be adjusted, and its default parameters changed, to suit different transcriptome data sets. Incorporating transcriptome sequence into the genome annotation has improved the first generation annotation and has set the stage for further comparative and functional studies.


The comparative study found 31,788 E. huxleyi genes that map to G. oceanica and 23,532 E. huxleyi genes that map to I. galbana. The ids of the common genes were extracted for further comparative studies. With G. oceanica as the reference and the other two species as queries, 89% of E. huxleyi genes map to G. oceanica compared with 82% of I. galbana genes, which suggests that E. huxleyi and G. oceanica may be more closely related to each other than to I. galbana.

The improved annotations of these closely related sister species provide a valuable resource for comparative and functional studies. With the revised annotations, further studies can be performed to identify genes that are associated with calcification. Furthermore, the revised annotations allow for in-depth studies on the molecular functions of the proteins.


References

1. Young, J. R. (n.d.). Emiliania huxleyi and other coccolithophores. Paleontology

Department, The Natural History Museum, London. [Cited: June 25, 2015]. Retrieved

from http://protozoa.uga.edu/portal/coccolithophores.html

2. Ehux “Tree of Life” alga sequenced. (June 13, 2013). [Cited: June 25, 2015]. Retrieved

from http://www.algaeindustrymagazine.com/ehux-tree-of-life-alga-sequenced/

3. Haas, B. J., Delcher, A. L., Mount, S. M., Wortman, J. R., Smith, R. K., Hannick, L. I.,

et al. (2003). Improving the Arabidopsis genome annotation using maximal transcript

alignment assemblies. Nucleic Acids Res, 31 (19), pp. 5654–5666.

4. Kellis, M., Patterson, N., Birren, B., Berger, B. & Lander, E. S. (2004). Methods in

comparative genomics: genome correspondence, gene identification and regulatory motif

discovery. Journal of computational biology : a journal of computational molecular cell

biology, 2–3, pp. 319–55.

5. Moret, B. M. E., Miller, W. C., Pevzner, P.A. & Sankoff, D. (n.d.). Computational

Challenges in Comparative Genomics. A Tutorial. [Cited: July 2, 2015]. Retrieved

from http://www.researchgate.net/publication/242480927_Computational_Challenges_in

_Comparative_Genomics_A_Tutorial

6. Emiliania huxleyi. (n.d.). In Wikipedia. [Cited: June 28, 2015]. Retrieved

from https://en.wikipedia.org/wiki/Emiliania_huxleyi

7. Eckalbar, W. L., Hutchins, E. D., Markov, G. J., Allen, A. N., Corneveaux, J. J.,

Lindblad-Toh, K., Di Palma, F., Alföldi, J., Huentelman, M. J., Kusumi, K. (2014).

Genome reannotation of the lizard Anolis carolinensis based on 14 adult and embryonic

deep transcriptomes. BMC Genomics, 14 (49).


8. Garber, M., Grabherr, M. G., Guttman, M. & Trapnell, C. (2011). Computational

methods for transcriptome annotation and quantification using RNA-seq. Nature

Methods, 8 (6), pp. 469-477.

9. Lu, B., Zeng, Z. & Shi, T. (2013). Comparative study of de novo assembly and genome-

guided assembly strategies for transcriptome reconstruction based on RNA-Seq. Sci

China Life Sci, 56 (2), pp. 143-155.

10. Engström, P. G., Steijger, T., Sipos, B., Grant, G. R., Kahles, A., RGASP Consortium,

Gunnar, R., Goldman, N., Hubbard, T. J., Harrow, J., Guigó, R., Bertone, P. (2013).

Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods,

10, (12), pp. 1185-1191.

11. De_novo_transcriptome_assembly (n.d). In Wikipedia. [Cited: July 5, 2015]. Retrieved

from https://en.wikipedia.org/wiki/De_novo_transcriptome_assembly

12. Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., Di Palma, F., Birren, B. W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29, pp. 644–652.

13. Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J.,

Couger, M. B., Eccles, D., Li, B., Lieber, M., MacManes, M. D., Ott, M., Orvis, J.,

Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C. N., Henschel,

R., LeDuc, R. D., Friedman, N., Regev, A. (2013). De novo transcript sequence


reconstruction from RNA-Seq: reference generation and analysis with TRINITY. Nat

Protoc, 8 (8), pp. 1494-512.

14. Tae, H., Ryu, D., Sureshchandra, S. & Choi, J. (2012). ESTclean: a cleaning tool for next-gen transcriptome shotgun sequencing. BMC Bioinformatics, 13.

15. Sequence cleaner program. (July 22, 2010). [Cited: July 6, 2015]. Retrieved

from http://sourceforge.net/projects/seqclean/files/

16. Univec database. (n.d.) [Cited: July 3, 2015]. Retrieved

from http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/

17. AUGUSTUS training. (January 14, 2011). [Cited: July 8, 2015]. Retrieved

from http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=Augustus.Augustus/

18. Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., White, O., Buell,

C. R. & Wortman, J. (2008). Automated eukaryotic gene structure annotation using

EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology, 9

(1).

19. Mean F Score. (June 1, 2015). [Cited: July 12, 2015]. Retrieved

from https://www.kaggle.com/wiki/MeanFScore

20. Precision and recall. (June 13, 2015). [Cited: July 12, 2015]. Retrieved

from https://en.wikipedia.org/wiki/Precision_and_recall

21. Powers, David M W (2011). Evaluation: From Precision, Recall and F-Measure to ROC,

Informedness, Markedness & Correlation.

22. Genome Project. (June 10, 2015). [Cited: July 22, 2015]. Retrieved

from https://en.wikipedia.org/wiki/Genome_project#Genome_annotation

23. Zhang, Xiaoyu (2013). Characterization of Small Silencing RNAs in Emiliania Huxleyi


Appendix 1: Input Data Preparation

Generate Unique TRINITY Transcript headers
Input: Fasta format file
1. ehux_all_Trinity.fasta.cap.singlets
2. geph_all_Trinity.fasta.cap.singlets
3. iso_all_Trinity.fasta.cap.singlets
Command(s)/Steps: fasta-unique-names -r
Notes: The first word of the transcript header in the contigs and singlets files must be unique, as required by the PASA pipeline. A few entries in the singlets file were not unique and had to be renamed.

Clean the transcript assembly
Input: 1.
Command(s)/Steps: seqclean
Notes: Create one file for each species containing the transcripts from all the different conditions.

Generate Input TRINITY assembly statistics
Input: Fasta format transcript or genome file
1. ehux_assy
2. geph_assy
3. iso_assy
Command(s)/Steps: assemblathon_stats.pl
Notes: The input TRINITY assemblies were run through the script assemblathon_stats.pl to establish the baseline assembly statistics to be compared with the PASA assembly in the re-annotation pipeline.


Appendix 2: PASA

Run PASA Alignment Pipeline
Input:
1. alignAssembly.config
2.
3. *.clean file transcript assembly in fasta
4. Transcript assembly in fasta format
Command(s)/Steps: Launch_PASA_pipeline.pl -c -C -R -g -t -T -u --ALIGNERS gmap --CPU 2
Notes: Edit alignAssembly.config
MYSQLDB=PASA2_Ehux_reannotation_step3
validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=<__MIN_PERCENT_ALIGNED__>
validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=<__MIN_AVG_PER_ID__>
#script subcluster_builder.dbi
subcluster_builder.dbi:-m=50

Compare previous annotation with PASA generated assembly
Input: 1. 2. 3. 4.
Command(s)/Steps:
1. Load your preexisting protein-coding gene annotations
Load_Current_Gene_Annotations.dbi -c -g -P
2. Perform an annotation comparison and generate an updated gene set
Launch_PASA_pipeline.pl -c -A -g -t
Notes: Make sure the database name is the same in alignAssembly.config and annotCompare.config.

Validate gff3 file for PASA comparison
Input: 1. <species_previous_annotation_in_gff3>
Command(s)/Steps: pasa_gff3_validator.pl
Notes: This command should return an empty string.

Create a gff3 file from JGI gtf file for PASA pipeline
Input: 1.
Command(s)/Steps:
1. Open the gtf file
2. Add " around transcript_id values using vi commands
3. Add transcript_id for start_codon and stop_codon using vi commands
4. Convert gtf to gff3
gtf_to_gff3_format.pl > output.gff3
5. Open the gff3 file
6. Edit the ID= cds field and add start and stop positions
ID=cds.NNN => ID=start_stop_cds.NNN
7. Remove mRNA without a CDS feature
Notes: Ehux was annotated using the JGI pipeline. PASA expects gff3 in a different format.

Create a gff3 file from MAKER gtf file for PASA pipeline
Input: 1.
Command(s)/Steps:
1. Format the MAKER gtf files to standard gtf files
maker2eval_gtf > formatted.gtf
2. Convert the gtf to gff3 format as required by the PASA comparison pipeline
gtf_to_gff3_format.pl > formatted.gff3
3. Edit the ID= cds field and add start and stop positions
ID=cds.NNN => ID=start_stop_cds.NNN
4. Validate gff3 file using #7
Notes: Geph and Iso were annotated using the MAKER pipeline.


Appendix 3: Training Set Generation

Extract ORFs from PASA assemblies
Input:
1. <{pasadb}.assemblies.fasta>
2. <{pasadb}.pasa_assemblies.gff3>
Command(s)/Steps: pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta --pasa_transcripts_gff3
Notes: The inputs are generated by the PASA alignment pipeline #5.

Extract the ORFs with type 'complete'
Input: 1. <{pasadb}.assemblies.fasta.transdecoder.gff3>
Command(s)/Steps:
1. Extract the features with type="complete"
extract_FL_transdecoder_entries.pl > type_complete_ids.txt
2. Open type_complete_ids.txt
3. Verify that no duplicates are present using the vi command
:sort u
Notes: The input file is generated by Transdecoder in #10.

Extract the protein sequences for the "complete" features
Input:
1. <type_complete_ids.txt>
2. <{pasadb}.assemblies.fasta.transdecoder.pep>
Command(s)/Steps: extract_reads < > type_complete_proteins.fasta
Notes: Input 1 is from #11 and input 2 is from #10.


Blast the unique complete ORFs against the nr database
Input: 1. 2.
Command(s)/Steps: blastp -query -db -outfmt "7 qseqid sseqid evalue bitscore qlen length" -max_target_seqs 2 -evalue 0.000001 -num_threads 30 > nr_match_for_Training.o
Notes: Search the complete proteins against the non-redundant protein database and identify those ORFs that have good database matches. Such entries can be confidently used for training ab initio gene predictors.

Extract gff3 after BLASTing the unique complete ORFs against the nr database

Input: 1. 2. <{pasadb}.assemblies.fasta.transdecoder.genome.gff3> 3.

Command(s)/Steps:
1. Open the file in vi editor and delete all commented lines using vi
   :g/# /d
2. Keep only the first word
   :%s/\(asmbl_[0-9]*|m.[0-9]*\).*/\1
3. Sort keeping only the unique ones
   :sort u
4. Index the Transdecoder generated gff3 file
   index_gff3_files.pl
5. Generate the gff3 file for training ab initio predictors
   gff3_subset.pl .inx > unique_complete_nrHits.gff3
6. Extract the protein file
   gff3_file_to_proteins.pl unique_complete_nrHits.gff3 > unique_complete_nrHits.gff3.fasta
7. Cluster to remove redundancy
   cd-hit -i unique_complete_nrHits.gff3.fasta -o geph_unique_complete_nrHits.gff3.fasta.cd-hit -T 20
8. Extract the unique non-redundant gff3
   a) Copy clustered file to ids
   b) Open ids
   c) Remove all lines except headers
   d) Keep only the gene ids
   e) Sort the ids
   f) Create index file
   g) Extract gff3
      gff3_subset.pl ids unique_complete_nrHits.gff3.inx > final_TS_for_gene_predictor.gff3
   h) Extract the best ORFs
      pasa_asmbls_to_training_set.extract_reference_orfs.pl final_TS_for_gene_predictor.gff3 > best_TS.gff3

Notes: Input file is generated from the blast result from #13.
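The first three steps of this procedure (dropping comment lines, keeping the query ID, and removing duplicates) are performed interactively in vi. A scripted equivalent might look like the sketch below. It assumes the tab-separated column order requested by the blastp `-outfmt` string (qseqid, sseqid, evalue, bitscore, qlen, length); `hit_query_ids` is a hypothetical helper, and the e-value check merely repeats the cutoff that blastp already applied.

```python
def hit_query_ids(blast_lines, max_evalue=1e-6):
    """Collect the unique query IDs that have at least one reported hit.

    Assumption: input is BLAST tabular output with comment lines ("-outfmt 7")
    and columns qseqid, sseqid, evalue, bitscore, qlen, length.
    """
    ids = set()
    for raw in blast_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment lines carry no hit information
        fields = line.split("\t")
        if float(fields[2]) <= max_evalue:
            ids.add(fields[0])
    return sorted(ids)


if __name__ == "__main__":
    demo = [
        "# Query: asmbl_1|m.1",
        "asmbl_1|m.1\tgi|1\t1e-50\t200\t300\t290",
        "# Query: asmbl_2|m.2",
        "# 0 hits found",
    ]
    print(hit_query_ids(demo))
```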


Check for feature overlaps in final training set gff3

Input: 1.

Command(s)/Steps:
1. Copy the input file to a temporary file t
2. Remove all lines except gene feature lines
   :g!/\tgene\t/d
3. Check if there are any overlaps
   intersectBed -c -a t -b t > o
4. Open the output file from step 3. The last word of each line is the overlap count. Because the file is intersected with itself, every gene overlaps itself and the count should be 1. If it is more than 1, remove that gene from the gff3. Check that the number of lines ending in 1 equals the total number of lines.
   :%s/\t1\n//gn
   The reported count should be equal to the total number of lines.

Notes: The input is from #14. Ab initio predictors do not work well if there are overlapping features in the training set.
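The logic of the self-intersection check can be illustrated without bedtools. The sketch below is not a replacement for `intersectBed`; it is a quadratic-time illustration that assumes 1-based inclusive GFF3 coordinates and ignores strand. The function names are hypothetical.

```python
def overlap_counts(genes):
    """Count, for each gene, how many genes on the same sequence overlap it.

    genes: list of (seqid, start, end) tuples with inclusive coordinates.
    Every gene overlaps itself, so a clean training set yields all ones.
    """
    counts = []
    for seq_a, start_a, end_a in genes:
        n = sum(
            1
            for seq_b, start_b, end_b in genes
            if seq_a == seq_b and start_a <= end_b and start_b <= end_a
        )
        counts.append(n)
    return counts


def overlapping_genes(genes):
    """Return the genes whose overlap count exceeds 1."""
    return [g for g, n in zip(genes, overlap_counts(genes)) if n > 1]


if __name__ == "__main__":
    demo = [("s1", 1, 100), ("s1", 90, 200), ("s1", 300, 400)]
    print(overlap_counts(demo))
```

The quadratic comparison is adequate for a training set of a few thousand genes; for whole-genome annotations a sorted sweep or bedtools itself is the better choice.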


Appendix 4: Gene Predictions

AUGUSTUS training and prediction

Input: 1. <best_TS.gff3> 2.

Command(s)/Steps:
1. Set the environment variable
   export AUGUSTUS_CONFIG_PATH=/home/gopin001/prog/augustus.2.7/config/
2. Rename the species folder if it already exists
   mv /home/gopin001/prog/augustus.2.7/config/species/species /home/gopin001/prog/augustus.2.7/config/species/species_
3. Run the auto training or the manual training
   a) Auto training
      autoAugTrain.pl --genome --trainingset= --species=
   b) Manual training
      a) Convert GFF3 file to Genbank format file
         gff2gbSmallDNA.pl
      b) Remove problematic loci from genes.raw.gb
         etraining --species=generic --stopCodonExcludedFromCDS=false genes.raw.gb 2> train.err
      c) Filter out the problematic genes
         cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
         filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
         grep -c "LOCUS" genes.raw.gb genes.gb
      d) Split gene structure set into training and test set
         randomSplit.pl genes.gb 100
      e) Create a meta parameters file for your species
         new_species.pl --species=
      f) Edit species_parameters.cfg and set stopCodonExcludedFromCDS to false
      g) Make an initial training
         etraining --species= genes.gb.train
      h) Make a first try and predict the genes in genes.gb.test ab initio
         augustus --species= genes.gb.test | tee firsttest.out
      i) Test. Predict genes (small subset)
         augustus --species=ehux --predictionStart=1 --predictionEnd=500000 > test_augustus.abinitio.gff
4. Create the auto predict shell commands
   autoAugPred.pl --genome= --species=<>
5. Run the auto prediction
   a) Serial execution
      Run each shell command created by the above step.
   b) Parallel execution
      Run each shell script in a screen session or use setsid.
      setsid ./aug1
      ..
      setsid ./aug20
6. Join the output files to generate the final AUGUSTUS prediction file
   cat *out | join_aug_pred.pl > geph_aug.gff

Notes: Manual training is much faster than auto training. Both methods were used. Auto training can take days to finish. Make sure that the species is not already defined in the AUGUSTUS config folder.
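The training and test split in the manual training procedure holds out a fixed number of gene structures (100 here) for evaluation and trains on the rest. The sketch below illustrates that idea on a generic list of records; it is not the implementation of `randomSplit.pl`, and the function name and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def random_split(records, n_test, seed=0):
    """Split records into (train, test) with exactly n_test held-out records.

    The input list is not modified. A fixed seed makes the split repeatable.
    """
    if n_test < 0 or n_test > len(records):
        raise ValueError("n_test must be between 0 and the number of records")
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    loci = [f"locus_{i}" for i in range(1, 11)]
    train, test = random_split(loci, 3)
    print(len(train), "training records,", len(test), "test records")
```

Keeping the test genes out of `etraining` is what makes the accuracy reported by the first ab initio run on the test file meaningful.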


SNAP training and prediction

Input: 1. <best_TS.gff3> 2.

Command(s)/Steps:
1. Create training set using SNAP. Convert the gff3 file.
   perl ./gff3_to_zff.pl < > speciesSNAP.ann
2. Extract the scaffold names to speciesSNAP.txt
   grep ">" speciesSNAP.ann > speciesSNAP.txt
3. Edit speciesSNAP.txt to remove '>' from each line
4. Remove duplicates from speciesSNAP.txt
   :sort u
5. Extract the scaffolds from the genome file
   perl ./fasta_sort.pl speciesSNAP.txt < > speciesSNAP_scaf.fa
6. Validate
   fathom speciesSNAP.ann speciesSNAP_scaf.fa -validate
7. Categorize
   fathom speciesSNAP.ann speciesSNAP_scaf.fa -categorize 1000 > categorize.log 2>&1
   fathom uni.ann uni. -export 1000 -plus > uni-plus.log 2>&1
8. Create param folder
   mkdir params
   cd params
   forge ../export.ann ../export.dna > ../forge.log 2>&1
   cd ..
9. Generate the hmm file
   hmm-assembler.pl speciesSNAP params/ > speciesSNAP.hmm
10. SNAP prediction
   a) Predict gene structures and store in SNAP zff file
      snap speciesSNAP.hmm > speciesSNAP.zff
   b) Convert the zff file to gff3
      zff2gff3.pl speciesSNAP.zff > speciesSNAP.gff3
   c) Convert to standard gff3
      SNAP_to_GFF3.pl speciesSNAP.gff3 > speciesSNAP_Predictions.gff3

Notes: gff3_to_zff.pl has been modified from the original to generate unique genes. "exon" has been replaced with "CDS" to look for coding regions only.
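Collecting the scaffold names from the annotation file (the grep, the removal of '>', and the de-duplication) can be combined into a single pass. This is a minimal sketch, assuming a ZFF annotation in which each sequence block starts with a `>name` header line; `scaffold_names` is a hypothetical helper.

```python
def scaffold_names(ann_lines):
    """Return the unique scaffold names of a ZFF annotation, in first-seen order.

    Assumption: each block starts with a header of the form ">scaffold_name".
    """
    seen = set()
    names = []
    for line in ann_lines:
        if not line.startswith(">"):
            continue  # feature lines such as "Einit 201 300 gene1"
        fields = line[1:].split()
        if fields and fields[0] not in seen:
            seen.add(fields[0])
            names.append(fields[0])
    return names


if __name__ == "__main__":
    demo = [">scaffold_2", "Einit 201 300 g1", ">scaffold_1", "Esngl 10 90 g2"]
    print(scaffold_names(demo))
```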


Appendix 5: Reference Protein Alignment

REFSEQ Protein alignments using MAKER2

Input: 1. 2.

Command(s)/Steps:
1. Generate the MAKER control files
   maker -CTL
2. Leave the maker_exe.ctl at default values
3. Edit maker_opts.ctl
   nano maker_opts.ctl
   genome=/home/gopin001/fasta/march2015/species_genome.fasta
   #-----Protein Homology Evidence
   protein=/w2/xyzhang/data/uniprot_sprot.fasta
   #-----Repeat Masking (leave values blank to skip repeat masking)
   #-----Gene Prediction
   protein2genome=1
4. Run MAKER
   mpirun -n 40 maker
5. Gather all of the GFFs, no DNA at the end
   gff3_merge -n -d species_genome_master_datastore_index.log -o species_maker_annotations.gff

Notes: Align the genome file with the uniprot reference to find known protein alignments.
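The edits to maker_opts.ctl can also be applied programmatically when the same settings are needed for several genomes. The sketch below assumes control-file lines of the form `key=value #comment`; `set_ctl_option` is a hypothetical helper and not part of MAKER.

```python
def set_ctl_option(lines, key, value):
    """Return a copy of the control-file lines with key set to value.

    Assumption: each option sits on its own line as "key=value #comment".
    The trailing comment, if any, is preserved; other lines are untouched.
    """
    updated = []
    for line in lines:
        if line.startswith(key + "="):
            _, sep, comment = line.partition("#")
            line = f"{key}={value}" + (f" #{comment}" if sep else "")
        updated.append(line)
    return updated


if __name__ == "__main__":
    ctl = [
        "genome= #genome sequence",
        "protein= #protein sequence file",
        "protein2genome=0 #infer predictions from protein homology",
    ]
    ctl = set_ctl_option(ctl, "protein2genome", "1")
    print("\n".join(ctl))
```

Matching on `key=` rather than on the bare key keeps `protein` from also rewriting `protein2genome`.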


Appendix 6: EVM

Format the MAKER2 Protein alignments for EVM

Input: 1.

Command(s)/Steps:
1. Split the file into smaller files
   split -l 500000
2. Open each file and edit as follows
   a) Remove lines that do not match the EVM specification using vi
      :g!/\tmatch_part\t/d
   b) Remove the Parent field
      :%s/;Parent=[0-9,a-z,A-Z,:,_,.]*//
   c) Rename the match_part
      :%s/\tmatch_part\t/\tnucleotide_to_protein_match\t/

Notes: EVM requires input files in a certain format.
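The three vi edits above can be expressed as one filter, which avoids splitting the file first. The following is a minimal sketch under the assumption that the input is nine-column GFF3; `maker_to_evm_protein` is a hypothetical name, and the character class for the Parent value mirrors the one in the vi command.

```python
import re

# Same character set as the vi substitution: digits, letters, ':', '_', '.', ','.
_PARENT_FIELD = re.compile(r";Parent=[0-9A-Za-z:_.,]*")


def maker_to_evm_protein(gff_lines):
    """Keep match_part lines, drop their Parent field, and rename the feature.

    Lines that are not nine-column match_part records are discarded.
    """
    converted = []
    for raw in gff_lines:
        cols = raw.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "match_part":
            continue
        cols[2] = "nucleotide_to_protein_match"
        cols[8] = _PARENT_FIELD.sub("", cols[8])
        converted.append("\t".join(cols))
    return converted


if __name__ == "__main__":
    demo = [
        "s1\tprotein2genome\tprotein_match\t10\t90\t.\t+\t.\tID=x:hit:1",
        "s1\tprotein2genome\tmatch_part\t10\t50\t.\t+\t.\tID=x:hsp:1;Parent=x:hit:1;Target=P1 1 14",
    ]
    print("\n".join(maker_to_evm_protein(demo)))
```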

Format the AUGUSTUS and SNAP output for EVM

Input: 1. 2.

Command(s)/Steps:
1. Open the AUGUSTUS prediction file
   a) Open the file and, using vi commands, do the following
   b) Add ID= before gene id
      :%s/\([.]*\tAUGUSTUS\tgene\t[0-9]*\t[0-9]*\t[0-9,.]*\t[+,-]\t[0-9,.]\t\)\(.*\)/\1ID=\2
   c) Edit the transcript line to add ID= and Parent= fields
      :%s/\([*]*\tAUGUSTUS\ttranscript\t[0-9]*\t[0-9]*\t[0-9,.]*\t[+,-]\t[0-9,.]\t\)\([a-z,0-9]*\)\(.*\)/\1ID=\2\3;Parent=\2
   d) Edit the stop_codon line to add the Parent= field
      :%s/\([*]*\tAUGUSTUS\tstop_codon\t[0-9]*\t[0-9]*\t[.]\t[+,-]\t[0-9,.]\t\)transcript_id "\([a-z,0-9,.]*\)\(.*\)/\1Parent=\2
   e) Edit the CDS line to add ID= and Parent= fields
      :%s/\([*]*\tAUGUSTUS\tCDS\t[0-9]*\t[0-9]*\t[0-9,.]*\t[+,-]\t[0-9,.]\t\)transcript_id "\([a-z,0-9,.]*\)\(.*\)/\1ID=\2.cds;Parent=\2
   f) Edit the start_codon line to add the Parent= field
      :%s/\([*]*\tAUGUSTUS\tstart_codon\t[0-9]*\t[0-9]*\t[.]\t[+,-]\t[0-9,.]\t\)transcript_id "\([a-z,0-9,.]*\)\(.*\)/\1Parent=\2
   g) Convert the AUGUSTUS gff3 to EVM gff3
      augustus_to_GFF3.pl species_aug_gene_predictions.gff > species_aug_formatted.gff3
2. Concatenate the AUGUSTUS and SNAP predictions
   cat species_aug_formatted.gff3 > species_predictions.gff3
3. Validate the formatted file
   gff3_gene_prediction_file_validator.pl species_predictions.gff3

Notes: EVM requires the gene prediction file to be in a certain format.
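The attribute rewrites performed by the vi substitutions can be summarized in a short function, which may be easier to audit than the regular expressions. This sketch assumes AUGUSTUS output in which gene and transcript lines carry a bare identifier (for example `g1` and `g1.t1`) and the other features carry `transcript_id "g1.t1"; gene_id "g1";`. `to_gff3_attributes` is a hypothetical helper and does not replace `augustus_to_GFF3.pl`.

```python
import re

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def to_gff3_attributes(line: str) -> str:
    """Rewrite column 9 of one AUGUSTUS line into ID=/Parent= form."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[1] != "AUGUSTUS":
        return line.rstrip("\n")
    feature, attr = cols[2], cols[8]
    if feature == "gene":
        cols[8] = f"ID={attr}"
    elif feature == "transcript":
        gene_id = attr.split(".")[0]  # "g1.t1" belongs to gene "g1"
        cols[8] = f"ID={attr};Parent={gene_id}"
    elif feature in ("CDS", "start_codon", "stop_codon"):
        match = _TRANSCRIPT_ID.search(attr)
        if match:
            tid = match.group(1)
            if feature == "CDS":
                cols[8] = f"ID={tid}.cds;Parent={tid}"
            else:
                cols[8] = f"Parent={tid}"
    return "\t".join(cols)


if __name__ == "__main__":
    cds = 's1\tAUGUSTUS\tCDS\t1\t90\t0.9\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";'
    print(to_gff3_attributes(cds))
```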

Combining evidence using EvidenceModeler (EVM)

Input: 1. 2. 3. 4. 5.

Command(s)/Steps:
1. Partition the inputs
   partition_EVM_inputs.pl --genome --gene_predictions --protein_alignments --transcript_alignments --segmentSize 100000 --overlapSize 10000 --partition_listing partitions_list.out
2. Generate the EVM command set
   write_EVM_commands.pl --genome --weights --gene_predictions --protein_alignments --transcript_alignments --output_file_name evm.out --partitions partitions_list.out > commands.list
3. Run the commands
   a) Serial execution
      execute_EVM_commands.pl commands.list | tee run.log
   b) Parallel execution
      run_EVM_commands_parallel.pl commands.list
4. Combine the partitions
   recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
5. Convert to GFF3 format
   convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output evm.out --genome
6. Stitch all the evm.out.gff3 files from each folder
   combinegff3.pl --partitions partitions_list.out --output evm.gff3

Notes: The parallel execution script was created to break the command set into smaller sets and execute them in parallel. Make sure that the weights file in step 2 has full path information.
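The notes describe a custom script that breaks the command list into smaller sets and runs them in parallel. Its source is not given, so the following is only a sketch of the general approach: commands are dealt round-robin into parts, and each part is executed sequentially in its own worker thread. All function names and the default of four parts are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_commands(commands, n_parts):
    """Deal commands round-robin into at most n_parts non-empty lists."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for i, cmd in enumerate(commands):
        parts[i % n_parts].append(cmd)
    return [part for part in parts if part]


def run_part(part):
    """Run the commands of one part in order and return their exit codes."""
    return [subprocess.run(cmd, shell=True).returncode for cmd in part]


def run_parallel(commands, n_parts=4):
    """Run the parts concurrently; exit codes are grouped by part."""
    parts = split_commands(commands, n_parts)
    if not parts:
        return []
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        results = list(pool.map(run_part, parts))
    return [code for part_codes in results for code in part_codes]


if __name__ == "__main__":
    # Assumption: commands.list holds one shell command per line.
    with open("commands.list") as handle:
        cmds = [line.strip() for line in handle if line.strip()]
    codes = run_parallel(cmds, n_parts=4)
    print(sum(1 for c in codes if c != 0), "commands failed")
```

Threads suffice here because the work happens in the child processes; a non-zero exit code identifies partitions that need to be rerun before the outputs are recombined.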


Appendix 7: Clustering

Trimming the final output using CDHIT

Input: 1. 2.

Command(s)/Steps:
1. Extract the protein file
   gff3_file_to_proteins.pl evm.gff3 > species_evm_prot.fasta
2. Cluster using CDHIT
   cd-hit -i species_evm_prot.fasta -o ehux_evm_prot.fasta.cd-hit -T 20
3. Extract the gff3
   a) Copy the cdhit output to ids
   b) Open ids and remove all lines except headers using vi
      :g!/>/d
   c) Keep only the gene ids in each line using vi commands
      :%s/\(.*\)\(evm.TU.scaffold_[0-9]*.[0-9]*\)/\2
      :%s/\(evm.TU.scaffold_[0-9]*.[0-9]*\)\( .*\)/\1
   d) Create index file
      index_gff3_files_by_gene.pl evm.gff3
   e) Extract final gff3
      gff3_subset.pl ids evm.gff3.inx > species_final.gff3
4. Extract the genes file
   gff3_file_to_proteins.pl species_final.gff3 gene > species_final_genes.fasta
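The header clean-up that reduces the cd-hit output to a list of gene ids reduces to one regular-expression search per header. The sketch below assumes that every retained FASTA header contains a gene id of the form `evm.TU.scaffold_N.M`, as in the vi patterns; the example header layout and the name `gene_ids_from_headers` are illustrative assumptions.

```python
import re

# Gene identifiers as targeted by the vi substitutions, with the dots escaped.
_GENE_ID = re.compile(r"evm\.TU\.scaffold_[0-9]+\.[0-9]+")


def gene_ids_from_headers(fasta_lines):
    """Return the unique gene ids found in FASTA header lines, in file order."""
    seen = set()
    ids = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        match = _GENE_ID.search(line)
        if match and match.group(0) not in seen:
            seen.add(match.group(0))
            ids.append(match.group(0))
    return ids


if __name__ == "__main__":
    demo = [">evm.model.scaffold_12.3 evm.TU.scaffold_12.3 scaffold_12:100-900(+)", "MKV"]
    print(gene_ids_from_headers(demo))
```

Escaping the dots avoids the looser matching of the vi patterns, where an unescaped `.` accepts any character.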
