Comparative Gene Expression Analysis in Emiliania huxleyi, Isochrysis
galbana, and Gephyrocapsa oceanica
Arun Gopinath
In Partial Fulfillment of the Master of Computer Science
California State University San Marcos
July, 2015
Abstract
Comparative genomics is a field of biological research in which the genomic sequences of different species are compared. Comparative genomics provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved among species, as well as genes that give each organism its unique characteristics. Genome annotations of newly sequenced species initially rely primarily on ab initio gene predictions and alignment of reference transcripts of related species; however, the quality of gene models is greatly improved when incorporating same species transcriptomes.
In this study, we began by re-annotating the E. huxleyi, G. oceanica and I. galbana genomes based on de novo assembly of their transcriptomes. The de novo assembly tool TRINITY was used to assemble the transcriptome of each sister species, followed by a pipeline based on the Program to Assemble Spliced Alignments (PASA), the ab initio gene predictors AUGUSTUS and SNAP, EvidenceModeler (EVM) and MAKER2. The pipeline was validated by re-annotating E. huxleyi, for which a good first-generation JGI annotation is available. The revised E. huxleyi annotation describes 35,582 genes compared to 30,499 for the first-generation annotation. More than 95% of the previously annotated E. huxleyi genes mapped onto the 35,582 re-annotated E. huxleyi genes. The re-annotation pipeline predicted 52,680 G. oceanica genes compared to 28,441 for the previous annotation pipeline. The revised I. galbana annotation has 18,712 genes compared to 13,148 genes for the previous annotation.
Once the genomes of the sister species had been re-annotated, we turned to the comparative study. PASA, EVM and BLAST tools were used to determine the genome size, the number of genes predicted, and the number of genes shared among the sister species. The revised E. huxleyi annotation shares 31,734 orthologous genes with G. oceanica and 23,392 orthologous genes with I. galbana. The comparison between G. oceanica and I. galbana found 15,358 common genes, with an additional 20,800 G. oceanica genes sharing considerable similarity to the common G. oceanica-I. galbana genes. Further comparison found 2,642 unique I. galbana genes, 3,137 unique E. huxleyi genes and 6,959 unique G. oceanica genes.
Acknowledgements
It is my pleasure to extend my deepest gratitude to my committee members, Dr. Xiaoyu
Zhang, Dr. Betsy Read and Dr. Ahmad Hadaegh for sharing their wisdom during the course of this study. I am extremely thankful to them for their time and valuable suggestions to better this study.
I am forever thankful to my wife, Kinga, for listening, offering me advice and supporting me through this entire process. My profoundest feelings of love to my sons, Milan and Emil, for the important roles they play in my life.
Finally, I want to acknowledge with gratitude the support of my parents and sister, who have always encouraged me to look forward to new beginnings rather than regret missed opportunities.
Table of Contents
1. INTRODUCTION
2. BACKGROUND
  2.1. TRANSCRIPTOMES
  2.2. SPLICED ALIGNMENT
  2.3. TRANSCRIPTOME ASSEMBLY
  2.4. TRANSCRIPTOME ALIGNMENT
  2.5. GENOME ANNOTATION
3. ARCHITECTURAL MODEL
  3.1. ANNOTATION PIPELINE
  3.2. COMPARISON PIPELINE
4. IMPLEMENTATION AND ANALYSIS
  4.1. DATA PREPARATION
    4.1.1. Input Data Source
    4.1.2. Data Trimming
  4.2. TRANSCRIPT ALIGNMENT
    4.2.1. PASA Pipeline
    4.2.2. Table 3: PASA Annotation Comparison
  4.3. TRAINING DATA PREPARATION
    4.3.1. Input Data
    4.3.2. ORF
    4.3.3. Extracting Unique Complete ORF
    4.3.4. Candidate Gene Structure Identification
  4.4. AB INITIO TRAINING AND PREDICTION
  4.5. REFERENCE PROTEIN ALIGNMENT GENERATION
  4.6. COMBINING EVIDENCES
  4.7. GENE CLUSTERING
  4.8. VALIDATING WORKFLOW
  4.9. COMPARATIVE STUDY
    4.9.1. Reference E. huxleyi
    4.9.2. Reference G. oceanica
    4.9.3. Reference I. galbana
    4.9.4. Three way comparison with individual species as Reference
    4.9.5. Overlap for three species
5. CONCLUSION
REFERENCES
APPENDIX 1: INPUT DATA PREPARATION
APPENDIX 2: PASA
APPENDIX 3: TRAINING SET GENERATION
APPENDIX 4: GENE PREDICTIONS
APPENDIX 5: REFERENCE PROTEIN ALIGNMENT
APPENDIX 6: EVM
APPENDIX 7: CLUSTERING
LIST OF FIGURES
Figure 1: RNA Sequencing.
Figure 2: Spliced Alignment.
Figure 3: Reference Based vs De Novo Assembly.
Figure 4: Build a Graph Representing Alternative Splicing Events.
Figure 5: De novo Assembly. (a) Generating substrings.
Figure 6: PASA Pipeline.
Figure 7: Sample Annotation Pipeline.
Figure 8: Annotation Pipeline.
Figure 9: Comparison Pipeline.
Figure 10: Finding Unique Genes for a species.
Figure 11: Workflow for Generating Training Set.
Figure 12: G. oceanica on E. huxleyi and I. galbana on E. huxleyi.
Figure 13: E. huxleyi on G. oceanica and I. galbana on G. oceanica
Figure 14: E. huxleyi on I. galbana Genes and G. oceanica on I. galbana
Figure 15: Common Genes based on (a) E. huxleyi, (b) G. oceanica and (c) I. galbana.
Figure 16: Overall common genes.
LIST OF TABLES
Table 1: TRINITY Transcriptome Assembly
Table 2: Seqclean Results
Table 3: PASA Annotation Comparison
Table 4: Comparison of PASA annotation results and JGI Annotations of E. huxleyi
Table 5: Comparison of PASA annotation results and previous MAKER Gene Annotations of G. oceanica
Table 6: Comparison of PASA annotation results and previous MAKER Gene Annotations of I. galbana
Table 7: Side by side PASA output comparison
Table 8: Unique Complete ORF's Extracted
Table 9: ORF nr-hits
Table 10: Clustered nr Hits for Gene Predictors
Table 11: AUGUSTUS Gene Prediction
Table 12: SNAP Gene Prediction
Table 13: EVM Predicted Genes
Table 14: Final Predicted Genes
Table 15: Workflow Validation: Inputs to EVM
1. Introduction
Phytoplankton, a name derived from the Greek words phyto (plant) and plankton (drift), are microscopic organisms that live in aquatic environments. Like plants on land, phytoplankton capture sunlight and convert it into energy through photosynthesis.
Coccolithophores are single-celled eukaryotic phytoplankton that live in the upper regions of the ocean where sunlight is plentiful, and E. huxleyi often forms a major part of the phytoplankton biomass. One of the main differences between coccolithophores and other oceanic phytoplankton is that coccolithophores have an exoskeleton made of calcite platelets known as coccoliths.
Coccolithophores are of particular interest in research on global climate change because, as ocean acidity increases, their coccoliths may become even more important as a carbon sink [2]. Emiliania huxleyi (E. huxleyi) is the most abundant coccolithophore and blooms every year across large areas of the North Atlantic [1]. E. huxleyi is considered the model coccolithophorid because of the ease with which it can be cultured in the laboratory and the availability of its genome sequence [23]. E. huxleyi produces unusual lipids that do not degrade in sediments and are of great value to earth scientists as a tool for estimating past sea surface temperatures [6]. Among the hundreds of other phytoplankton species, the closely related species Isochrysis galbana (I. galbana) and Gephyrocapsa oceanica (G. oceanica) were chosen for this comparative study. Despite their close relationship, E. huxleyi and G. oceanica calcify, whereas I. galbana does not.
Comparative genomics focuses on the study of similarities and differences between the sequenced genomes of different species. With advances in sequencing technology, more and more complete genomes are becoming available; genomic data, in the form of sequence data, accumulates at an exponential rate, doubling approximately every year and a half [5]. The whole genome of a species allows for a global view, and multiple complete genomes increase predictive power [4]. Comparative genomics is a powerful tool for studying evolutionary relationships among organisms, helping to identify genes that are conserved among species as well as genes that give each organism its unique characteristics. The main strength of the comparative approach is that it views the same object, such as a whole genome, in different ways.
The aim of this research is to carry out a comparative genomics study of three closely related coccolithophorids, Emiliania huxleyi, Gephyrocapsa oceanica and Isochrysis galbana, to gain a deeper understanding of the similarities and differences between them. By annotating and comparing the genomes, we aim to facilitate the identification of genes and proteins involved in biomineralization and coccolithogenesis.
In this study, TRINITY¹-assembled Solexa short RNA-seq reads from four different growth conditions were processed through a customized pipeline to generate an improved annotation of the respective genomes. The newly annotated genomes were then compared. The transcriptomic data were obtained by extracting RNA from cultures of each of the three species grown in artificial seawater under the following conditions:
Condition 1: 0 millimolar calcium
Condition 2: 9 millimolar calcium
Condition 3: 0 millimolar calcium with a spike of sodium carbonate
Condition 4: 9 millimolar calcium with a spike of sodium carbonate
We assembled the RNA transcripts of the three sister species using the PASA² (Program to Assemble Spliced Alignments) program, followed by ab initio predictions using SNAP³ and AUGUSTUS⁴. MAKER2 was used to generate protein alignments based on hits against the RefSeq⁵ UniProt database. The various types of gene evidence were then combined using the EVM⁶ (EvidenceModeler) program to generate the final annotated catalog of genes for each species. This study represents a genome-wide comparison of three closely related species and focuses on the assembly of the RNA transcripts, gene annotation and, finally, genome content analysis.
¹ Trinity is a de novo RNA sequence assembler (http://trinityrnaseq.github.io/)
² PASA is a Program to Assemble Spliced Alignments (http://pasapipeline.github.io/)
This research makes the following contributions:
• Illustrates an architectural model that reflects the implementation of our work
• Provides a framework for annotation that identifies more complete gene structures
• Demonstrates improved gene annotation for each respective species
• Validates that the improved gene annotation has enabled the identification of unique
genes present in each species.
The remainder of the thesis is organized as follows: Chapter 2 examines the related work.
The architectural model is presented in Chapter 3. Implementation and results are described in
Chapter 4. Finally, a discussion and an outline of future work are presented in Chapter 5.
2. Background
Once the genome of a species has been sequenced, it needs to be annotated to be meaningful. Annotation is the process of attaching pertinent information to the sequences in the genome. In genome annotation, the locations of genes and all coding regions in the genome are determined, followed by identification of the functions of the genes, the physical characteristics of the genes and the overall metabolic profile of the species. Genome annotation can be performed automatically or manually. In the automatic process, computer programs scan the genome and statistical methods are used to identify coding regions. Manual annotation is done by humans and is a very long and difficult process, as genomes can span hundreds if not thousands of kilobases and contain tens of thousands of genes. Recent advances in sequencing technologies have made it possible to sequence new genomes in less time and at reduced cost, prompting automatic annotation to take the lead, with limited manual annotation [7]. Since genome annotation is an ongoing, resource-intensive and time-consuming process, improving the quality of the annotation is also an ongoing process.
³ SNAP is a general purpose gene finding program (http://korflab.ucdavis.edu/software.html)
⁴ AUGUSTUS is a gene predictor for eukaryotes (http://nar.oxfordjournals.org/content/32/suppl_2/W309.full)
⁵ The Reference Sequence (RefSeq) collection provides a comprehensive, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins (http://www.ncbi.nlm.nih.gov/refseq/about/)
⁶ EvidenceModeler software combines ab initio gene predictions and protein and transcript alignments into weighted consensus gene structures
Comparative and functional genomic studies and evolutionary genetics rely on high-quality genome annotation to assign and refine the functions of genes and other elements such as regulatory regions. Genome annotations initially rely on automatic processes such as ab initio gene prediction using statistical methods and alignment/homology of transcripts from other species; however, the quality of the annotations can be greatly improved by incorporating same-species transcriptomes [7].
2.1. Transcriptomes
The transcriptome is the complete set of transcripts in a given population of cells. The transcriptome of a eukaryotic species is much smaller than its genome. The genome is the complete collection of DNA, including genes and intergenic regions. Different cells of an organism carry the same copy of the genome, but only the expressed genes determine which parts of the genome are used to code for proteins; for example, the transcriptome of skin cells reflects the parts of the genome that are activated to make skin cells. RNA sequencing promises a comprehensive analysis of the transcriptome and allows for complete annotation of all genes and isoforms for a sample in a given condition [8].
Figure 1: RNA Sequencing. RNA sequencing experiments involve making a collection of cDNA fragments. The fragments are then flanked by specific constant sequences known as adapters that are necessary for sequencing. Adapters are used to link the ends of two DNA molecules that have different sequences at their ends. This collection, referred to as a library, is then sequenced on a high-throughput sequencer, which produces millions of short sequence reads that correspond to individual cDNA fragments.
Figure 1 depicts the flow of RNA sequencing. Different computational methodologies exist to find genes. In this study, we assembled the short reads from the high-throughput sequencers and aligned the assembled sequences to the reference genome. The alignment program searches for matches with the reference genome to create gene models that include exons and introns. This method of creating gene models and predicting genes is valued because it is based on evidence from RNA transcripts. It also allows existing annotations to be reviewed and updated based on the transcript alignments.
2.2. Spliced Alignment
Figure 2: Spliced Alignment. Un-spliced reads lie within exactly one exon, whereas spliced reads (in blue) span two or more exons.
In an ungapped alignment, each short read lies within exactly one exon, as shown by the grey blocks in Figure 2.
Spliced reads, shown in blue in Figure 2, may span two or more exons. Reads that span exons pose a more challenging problem for alignment programs because exon-intron boundaries must be determined correctly. Hence, spliced reads are often aligned to the genome in a two-step approach in which the initial read alignments are analyzed to discover exon junctions; these junctions are then used to guide the final alignment [10]. Additionally, alignment programs may take into consideration the density of the multiple transcripts to achieve a correct alignment.
2.3. Transcriptome Assembly
Transcriptome assembly, or reconstruction, is the process of identifying transcripts and isoforms in a specific population of cells. There are broadly two methods of transcriptome assembly [9]:
a. Reference genome based
b. De novo transcriptome assembly
Figure 3: Reference Based vs De Novo Assembly. In the reference-based method, reads are mapped back to a reference genome. In a de novo assembly strategy, reads are compared to each other to reconstruct expressed isoforms without the need for a reference genome.
In reference-based transcriptome assembly, the sequence reads are mapped to the reference genome of the species under study or of a closely related species. A critical step in assembling short RNA reads is the spliced alignment of reads to the genome [10]. This is followed by building a graph representing alternative splicing events, as shown in Figure 4. The next step is to traverse the graph to assemble variants. Transcripts can be predicted from the splicing graph using different approaches, such as maximum likelihood⁷ or maximum parsimony⁸.
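As a rough illustration of traversing a splicing graph to assemble variants, the sketch below enumerates every path from a first exon to a last exon in a small directed graph. The exon names and the graph are hypothetical, and the function `enumerate_isoforms` is introduced only for this example; real assemblers score or select paths (for instance by maximum likelihood or parsimony) rather than listing them all.

```python
def enumerate_isoforms(graph, start, end):
    """Return every path from start to end in a directed acyclic splice graph."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        # Extend the current path along each outgoing exon-exon connection.
        for nxt in graph.get(node, []):
            stack.append((nxt, path + [nxt]))
    return sorted(paths)


# Hypothetical gene with four exons; exon2 and exon3 are optionally included.
splice_graph = {
    "exon1": ["exon2", "exon3", "exon4"],
    "exon2": ["exon3", "exon4"],
    "exon3": ["exon4"],
}
isoforms = enumerate_isoforms(splice_graph, "exon1", "exon4")
for isoform in isoforms:
    print(" -> ".join(isoform))
```

Each printed path corresponds to one candidate splice pattern; in this toy graph there are four of them.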
Figure 4: Build a Graph Representing Alternative Splicing Events. The nodes represent the exons and the directed edges represent the different connections between exons. A path from start to end can include different exons, representing different splice patterns.
In de novo assembly, the transcriptome is assembled without the aid of a reference genome. This is often the preferred method for studying non-model organisms, as it is easier than building the genome, and reference-based assembly is not possible without an existing reference genome. The first step in de novo assembly involves generating all substrings of length k from the reads.
⁷ A method of estimating the parameters of a statistical model which makes the known likelihood distribution a maximum (https://en.wikipedia.org/wiki/Maximum_likelihood)
⁸ The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. (http://www.icp.ucl.ac.be/~opperd/private/parsimony.html)
Figure 5: De novo Assembly. (a) Generating substrings. The reads are split into shorter substrings called k-mers for use in the De Bruijn graph. (b) De Bruijn graph. The k-mers are represented by the nodes and the overlaps between them by the edges. Walking along the edges of the graph allows one to construct the assembly.
After the substrings are generated, a De Bruijn graph⁹ is created and the graph is traversed to assemble the isoforms.
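The two steps just described, k-mer generation and graph traversal, can be sketched on a toy example. This is a minimal illustration and not TRINITY's implementation: it assumes error-free reads, uses the common formulation in which nodes are (k-1)-mers and each k-mer contributes one edge, and only follows unambiguous paths. The reads, the value of k and the function names are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all substrings of length k from a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from start and spell out the contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy overlapping reads
graph = de_bruijn(reads, k=4)
contig = walk(graph, "ATG")
print(contig)
```

On real data the graph branches wherever isoforms or sequencing errors diverge, which is why assemblers need additional logic to resolve paths.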
In the absence of a reference genome, de novo assembly is the method of choice. Although alignment to the genome is a robust way of characterizing transcript sequences, it has a disadvantage when mRNA transcripts are structurally altered, for example by alternative splicing [11].
⁹ A graph whose nodes are sequences of symbols from some alphabet and whose edges indicate the sequences which might overlap (https://en.wikipedia.org/wiki/De_Bruijn_graph)
In this study, TRINITY-assembled transcripts were used as input to the PASA alignment program to create the alignment assemblies. TRINITY is a de novo transcriptome assembler that recovers more full-length transcripts and has sensitivity similar to that of reference-based aligners [12]. TRINITY uses three independent software modules: Inchworm divides the reads and assembles initial contigs, Chrysalis groups related contigs into de Bruijn graphs, and Butterfly collapses the graphs and aligns the reads to them [13].
2.4. Transcriptome Alignment
Transcribed sequences are aligned with the reference genome or reference transcriptome to
delineate the gene structures via gapped alignments [3]. In our study, we used PASA (Program to
Assemble Spliced Alignments). PASA is considered the gold standard for gene structure
resolution via the spliced alignment.
Figure 6: PASA Pipeline. Transcripts are aligned to the genome with a minimum of 95% identity and a minimum of 90% of the transcript length aligned. Alignment is performed by GMAP, BLAT, or both. PASA also allows for alignment comparison and updates to the existing annotation.
The PASA pipeline can automatically incorporate transcript alignments into gene structure annotations. The PASA pipeline is shown in Figure 6. In addition to annotation comparison, PASA also suggests exon modifications, alternative splice isoform additions, gene merges, gene splits and new genes. The PASA algorithm finds the maximal assembly of compatible alignments to provide the maximum evidence supporting gene structures.
2.5. Genome Annotation
Genome annotation broadly consists of:
- Assigning gene names
- Functional characteristics of gene products
- Physical characteristics of genes
- Metabolic profile of the species
Figure 7: Sample Annotation Pipeline. The region of DNA sequence from a START codon to a STOP codon is known as an open reading frame (ORF). Annotation involves identification of the ORFs followed by generation of the protein sequences. The protein sequences are searched for homology with known proteins in the public nr database¹⁰ and pertinent information is attached to the genes.
Annotation can be performed automatically, manually, or using a combination of both. In the automatic process, a computer program makes the decisions. The advantage of the automatic process is that annotation is quick and easy; the disadvantage is that the quality of the annotation may be compromised. In manual annotation, humans make the decisions, and although the result is of the highest quality, the process is time-consuming and tedious. Given the rate at which whole genomes are sequenced these days, the annotation process is mostly automatic, with manual curation.
¹⁰ The nr database is compiled by the NCBI (National Center for Biotechnology Information) as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF (http://www.matrixscience.com/help/seq_db_setup_nr.html)
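The open reading frame defined in the caption of Figure 7 can be illustrated with a minimal sketch that scans the three forward reading frames for a START codon followed in frame by a STOP codon. This is only an illustration under simplifying assumptions (forward strand only, standard stop codons, no introns); the function name, the `min_codons` parameter and the example sequence are hypothetical and are not part of the annotation pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for ATG..stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first START codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # An in-frame STOP codon closes the ORF (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return sorted(orfs)


orfs = find_orfs("CCATGGCTTAAGG")
print(orfs)
```

Coordinates are 0-based with an exclusive end, following Python slicing conventions.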
Annotation can be structural or functional. Structural annotation in its simplest form consists of identifying the genes in the genomic DNA. There are two main ways to define a gene structure: by prediction and by sequence similarity. With gene prediction, algorithms identify structure based on nucleotide sequence and composition. With the sequence similarity technique, transcripts of the species are aligned with the genomic sequence of the same or a related species to identify motifs and domains. In the absence of a related species, transcripts are searched with BLAST against the non-redundant database to identify proteins.
In functional annotation, descriptive names for the gene products are established with as much specificity as the evidence supports. Functional annotation consists of attaching biological information, such as biochemical function and biological function, to genomic elements [22].
3. Architectural Model
3.1. Annotation pipeline
Figure 8: Annotation Pipeline. A customized pipeline using PASA, MAKER, AUGUSTUS, SNAP and EVM was developed to re-annotate E. huxleyi, G. oceanica and I. galbana.
Figure 8 shows the detailed workflow of the re-annotation pipeline. To check the validity of the re-annotation pipeline, Emiliania huxleyi transcripts were used as the first input to the pipeline, as a high-confidence JGI annotation of E. huxleyi was already available. To validate the workflow, we expected more than 95% of the previously annotated E. huxleyi proteins to be present in the re-annotated E. huxleyi.
For each sister species, the input to the first step was the TRINITY-assembled transcripts. The reads from all four conditions were used as input. We chose a pipeline based on PASA (Program to Assemble Spliced Alignments) because it is the gold standard for gene structure resolution (introns and exons via spliced alignment) and provides evidence for alternative splicing, polyadenylation sites, and 5' and 3' untranslated regions. The PASA pipeline automatically incorporates transcript alignments into gene structure annotations [3].
The PASA pipeline was used to build the transcript database after aligning the transcripts with the respective reference genome, followed by alignment of assemblies. The alignment was carried out using the GMAP alignment tool within the PASA pipeline. The pipeline generated alignment assemblies along with gene structures for each of the three species. A comparison tool provided with PASA was used to compare the PASA-generated gene structures with the existing E. huxleyi annotation. PASA also provides a tool for generating the training set that was used to train ab initio gene predictors such as SNAP and AUGUSTUS.
MAKER2 was used to generate protein alignments against the reference sequence UniProt¹¹ database. The genome file for each species, in FASTA format, was used as the input to MAKER2 and searched against the UniProt database.
Finally, all the gene evidence generated by MAKER2, AUGUSTUS, SNAP and PASA was combined using the EVM¹² (EvidenceModeler) tool to generate the annotation.
¹¹ Universal Protein Resource (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC540024/)
3.2. Comparison pipeline
Figure 9: Comparison Pipeline. The blastp tool was used to extract the homologous genes between two species.
In the second part of the study, the annotated genes of each species were compared against each other to find the genes common to any two species and to all three species. The blastp tool included in the BLAST software suite was used with a strict expect value of 10e-6 to blast the gene set of one species against that of another.
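To make the expect-value filter concrete, the sketch below collects, for each query gene, the subject genes whose hit passes the threshold. It assumes the blastp results are available in the 12-column tab-separated layout (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the thesis does not state which output format was used, so this layout, the gene ids and the function name are assumptions for illustration only.

```python
import csv
import io


def hits_below_evalue(tabular_text, max_evalue=10e-6):
    """Map each query id to the set of subject ids with e-value <= max_evalue."""
    # The default threshold is written as in the text; numerically 10e-6 is 1e-5.
    hits = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue <= max_evalue:
            hits.setdefault(query, set()).add(subject)
    return hits


# Hypothetical blastp output for two E. huxleyi queries against G. oceanica.
example = (
    "ehux_1\tgoce_7\t88.0\t200\t24\t0\t1\t200\t1\t200\t1e-90\t310\n"
    "ehux_2\tgoce_9\t31.0\t60\t40\t1\t5\t64\t9\t68\t0.002\t35\n"
)
hits = hits_below_evalue(example)
print(hits)
```

Only the keys of the resulting dictionary (the query ids with at least one accepted hit) are needed for the gene-set comparisons that follow.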
¹² EvidenceModeler: http://evidencemodeler.github.io/
Figure 10: Finding Unique Genes for a species. To find unique genes for E. huxleyi, blastp was used with E. huxleyi as the query and each of the other species as the reference. A set intersection was performed on the E. huxleyi ids from the two blastp results to find E. huxleyi genes that align to both G. oceanica and I. galbana. A similar method was applied to G. oceanica and I. galbana.
In this part of the study, unique genes were extracted for each species using the workflow outlined in Figure 10. First, for each species, the genes that had a hit in both other species were identified. The resulting genes common to all three species were then used to extract the genes shared with only one other species. The three-way and two-way common genes were subtracted from the total genes of the species to obtain the unique genes.
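The set operations in this workflow can be sketched as follows. The gene ids are hypothetical and the function `partition_genes` is introduced only for illustration; its inputs stand for the ids of one species' genes that had a blastp hit in each of the other two species.

```python
def partition_genes(all_genes, hits_to_b, hits_to_c):
    """Split one species' genes by the other species in which they have hits."""
    three_way = hits_to_b & hits_to_c   # hit in both other species
    only_b = hits_to_b - three_way      # shared with species B only
    only_c = hits_to_c - three_way      # shared with species C only
    unique = all_genes - three_way - only_b - only_c
    return three_way, only_b, only_c, unique


# Hypothetical E. huxleyi gene ids and their hits in the two other species.
all_ehux = {"e1", "e2", "e3", "e4", "e5"}
hits_goce = {"e1", "e2", "e3"}
hits_igal = {"e2", "e3", "e4"}

three_way, only_goce, only_igal, unique = partition_genes(
    all_ehux, hits_goce, hits_igal
)
print(sorted(three_way), sorted(only_goce), sorted(only_igal), sorted(unique))
```

The four resulting sets are disjoint and together cover every gene of the query species, which gives a simple consistency check on the counts.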
4. Implementation and Analysis
This section discusses in detail the steps of the pipeline outlined in the architecture and the
analysis of each step of the pipeline.
4.1. Data Preparation
4.1.1. Input Data Source
TRINITY assembled transcripts for each species were used as the initial input for this study.
Inputs for each species were provided in two files based on expressed transcript sequences in
four different growth conditions. The files were named _contigs and _singlets for each species.
The contigs file contains the expressed transcripts in all different conditions. The singlets file
contains the expressed transcripts in one or more but not all four conditions. The input files were
combined into one file and duplicate headers in singlets file were renamed. The commands to
generate the statistics are listed in Appendix 1, and Table 1 below shows the statistics.
Contigs + Singlets     E. huxleyi   G. oceanica   I. galbana
No. of Transcripts     99523        191145        53044
Longest transcript     18018        22373         27385
Shortest transcript    201          201           201
Mean transcript size   1058         896           1404
N50 length             1507         1392          2167
Table 1: TRINITY Transcriptome Assembly
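For readers unfamiliar with the N50 length reported in Table 1, the following minimal sketch computes it from a list of sequence lengths. It assumes the statistic is taken relative to the total length of the assembled transcripts themselves; the example lengths are made up, and the commands actually used for Table 1 are those listed in Appendix 1.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up transcript lengths
print(n50(example_lengths))
```

Because longer sequences are counted first, N50 is less sensitive to large numbers of very short transcripts than the mean transcript size.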
4.1.2. Data Trimming
The input transcripts for each species were cleaned using the seqclean utility scripts. SeqClean is a tool for validation and trimming of DNA sequences from a flat file such as a FASTA-formatted file [14, 15, 16]. A vector file was downloaded from the NCBI website [16]. The tool processes the input file and filters its content based on criteria such as the percentage of undetermined bases (a sequence is flagged when undetermined bases exceed 3% of the clear range), polyA tail removal, overall low-complexity analysis, and homology to vectors, other contaminants or unwanted sequences. Seqclean was run with its default parameters.
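One of the criteria above, the percentage of undetermined bases, is simple enough to sketch. The code below is not seqclean's implementation; it is a hypothetical stand-in that treats any base other than A, C, G or T as undetermined and applies the 3% threshold to the whole sequence rather than to a trimmed clear range.

```python
def percent_undetermined(seq):
    """Percentage of bases in seq that are not A, C, G or T."""
    if not seq:
        return 0.0
    unknown = sum(1 for base in seq.upper() if base not in "ACGT")
    return 100.0 * unknown / len(seq)


def keep_sequence(seq, max_percent_n=3.0):
    """Keep a sequence unless undetermined bases exceed the threshold."""
    return percent_undetermined(seq) <= max_percent_n


# Two made-up transcripts of 100 bases: t1 has no N, t2 has 4% N.
records = {"t1": "ACGT" * 25, "t2": "ACGT" * 24 + "NNNN"}
valid = {name for name, seq in records.items() if keep_sequence(seq)}
print(sorted(valid))
```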
Sequence cleaning      E. huxleyi   G. oceanica   I. galbana
Sequences analyzed     99523        191145        53044
Valid                  99515        191115        53043
Trashed                8            30            1
Trashed by ‘shortq’    2            17            0
Trashed by ‘dust’      6            13            1
Table 2: Seqclean Results
4.2. Transcript Alignment
The next step was transcript re-construction from the alignment of the assembled transcripts to the genome using the PASA pipeline.
4.2.1. PASA Pipeline
PASA also performed sequence cleaning by identifying polyadenylation sites and low-quality sequences. The PASA pipeline first aligns the TRINITY assemblies to the reference genome. An alignment is considered valid when there is at least 95% identity and at least 90% of the transcript length aligns. Although these criteria are configurable, the defaults were used here. After the transcripts were aligned using the alignment tool GMAP¹³, each alignment was validated by requiring the [GT, GC]/AG consensus donor/acceptor splice sites at all introns [3].
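The three validation criteria just described can be expressed as a small predicate. This is a sketch of the rule as stated in the text, not PASA's code: the function name and its parameters are hypothetical, and each intron is assumed to be given as its genomic sequence on the transcribed strand.

```python
CANONICAL_DONORS = {"GT", "GC"}
CANONICAL_ACCEPTOR = "AG"


def is_valid_alignment(percent_identity, aligned_length, transcript_length,
                       intron_seqs, min_identity=95.0, min_aligned=90.0):
    """Apply the identity, coverage and splice-site checks described in the text."""
    coverage = 100.0 * aligned_length / transcript_length
    if percent_identity < min_identity or coverage < min_aligned:
        return False
    for intron in intron_seqs:
        intron = intron.upper()
        # Every intron must start with GT or GC and end with AG.
        if intron[:2] not in CANONICAL_DONORS or intron[-2:] != CANONICAL_ACCEPTOR:
            return False
    return True


# Made-up alignments: a passing one, one with low coverage, one with bad splice sites.
passes = is_valid_alignment(97.2, 950, 1000, ["GTAAGTCTTTCAG"])
low_coverage = is_valid_alignment(97.2, 850, 1000, ["GTAAGTCTTTCAG"])
bad_splice = is_valid_alignment(99.0, 990, 1000, ["ATAAGTCTTTCAC"])
print(passes, low_coverage, bad_splice)
```

An alignment without introns passes the splice-site check trivially, so single-exon transcripts are judged on identity and coverage alone in this sketch.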
¹³ GMAP is a standalone program for mapping and aligning cDNA sequences to a genome (http://www.ncbi.nlm.nih.gov/pubmed/15728110)
The alignment was followed by clustering of overlapping sequences into groups, or clusters. Each cluster was subjected to the PASA algorithm to generate a set of unique, maximal alignment assemblies. The assemblies within each cluster were further divided into sub-clusters representing conflicting complete or partial transcripts likely corresponding to the same gene (i.e. alternative or anomalous splicing isoforms of the same gene). These groups or clusters are called PASA assemblies, and they were further clustered to combine transcripts into longer assemblies. The sub-clusters required the same spliced orientation and at least 50% overlap with a neighboring alignment. They represent the genes in the PASA assembly and were used to perform the comparative study with the previous annotation. PASA allows different alignment tools to be used within the pipeline; for example, GMAP and/or BLAT are supported in the transcript alignment step. We used GMAP in our pipeline because it is specifically designed to align spliced transcripts to the genome. BLAT is not suitable for this analysis as it is not designed for aligning spliced transcripts. We used the cleaned TRINITY assemblies as the PASA input, and the PASA alignment program generated assemblies, gene structures and descriptions of sub-clusters.
The file alignAssembly.config was used with the default parameter settings for the pipeline. The command for running the PASA pipeline is given in Appendix 2. PASA assemblies are stored in a MySQL database, where they can readily be used by the PASA web interface tool for statistical analysis. The PASA statistics obtained for the E. huxleyi, G. oceanica, and I. galbana annotations are presented in Table 3 below.
Transcripts or Assemblies E. huxleyi G. oceanica I. galbana
Total transcript sequences 99,515 191,115 53,043
Full length cDNAs 0 0 0
Partial cDNAs (ESTs) 99,515 191,115 53,043
Number transcripts with any alignment 92,643 117,942 49,216
Valid gmap alignments 55,040 65,699 36,601
Total Valid alignments 55,040 65,699 36,601
Valid Full Length-cDNA alignments 0 0 0
Valid EST alignments 55,040 65,699 36,601
Number of assemblies 45,253 48,884 30,592
Number of sub-clusters (genes) 34,066 32,758 24,918
Number of Full Length-containing assemblies 0 0 0
Number of non-Full Length-containing assemblies 45,253 48,884 30,592
Table 3: PASA statistics
4.2.2. PASA Annotation Comparison
PASA tools were also used to compare the E. huxleyi annotation generated herein with the previous JGI annotation (Table 4). The configuration file annotcompare.config was updated for each species with the correct MySQL database names used in the PASA alignment assembly.
The parameters in the file were not changed, and default values were used for the comparison.
The commands for the comparison of PASA alignments and existing annotations are listed in
Appendix 2.
Overview Previous Annotation PASA Pipeline
Predicted genes 30,449 34,066
Annotated Transcripts
Transcripts with start and stop codon 26,756 29,181
Transcripts missing start or stop codon 3,693 4,885
Single exon transcripts 7,734 (25.4%) 7,372 (22.5%)
Transcript N50¹⁴ length 1,368 1,394
Exons
Total number of exons 110,697 124,366
Average exon length 293.4 297.36
3' UTR (Untranslated Region)
Total transcripts with 3' UTR 10,591 17,403
5' UTR (Untranslated Region)
Total transcripts with 5' UTR 9,854 16,053
Intron
Total Introns 80,237 91,654
Average Intron length 240.38 221.69
Total Intron Sequence length 19,256,880 20,318,775
Table 4: Comparison of PASA annotation results and JGI¹⁵ Annotations of E. huxleyi
14 The N50 of an assembly is a weighted median of the lengths of the sequences it contains, equal to the length of the longest sequence s such that the sum of the lengths of sequences greater than or equal in length to s is greater than or equal to half the total length of all sequences in the assembly.
15 Joint Genome Institute (http://jgi.doe.gov/)
Overview Previous Annotation PASA Pipeline
Predicted genes 28,441 32,758
Annotated Transcripts
Transcripts with start and stop codon NA 29,415
Transcripts missing start or stop codon NA 3,343
Single exon transcripts 12,595 (44.28%) 12,537 (41.32%)
Transcript N50 length 1,392 1,462
Exons
Total number of exons 66,336 77,845
Average exon length 523.64 491.59
3' UTR
Total transcripts with 3' UTR 2,803 7,890
5' UTR
Total transcripts with 5' UTR 2,737 7,063
Intron
Total Introns 37,895 47,502
Average Intron length 244.34 222.94
Total Intron Sequence length 9,259,264 10,590,095
Table 5: Comparison of PASA annotation results and previous MAKER Gene Annotations of G. oceanica
Overview Previous Annotation PASA Pipeline
Predicted genes 13,148 24,917
Annotated Transcripts
Transcripts with start and stop codon NA 14,608
Transcripts missing start or stop codon NA 10,309
Single exon transcripts 3,333 (25.35%) 3,275 (21.56%)
Transcript N50 length 2,167 2,140
Exons
Total number of exons 36,689 44,719
Average exon length 481.96 490.93
3' UTR
Total transcripts with 3'UTR NA 8,749
5' UTR
Total transcripts with 5'UTR NA 8,448
Intron
Total Introns 23,541 29,531
Average Intron length 309.94 296.78
Total Intron Sequence length 7,296,297 8,764,210
Table 6: Comparison of PASA annotation results and previous MAKER Gene Annotations of I. galbana
E. huxleyi G. oceanica I. galbana
Transcripts with start and stop codon 29,181 29,415 14,608
Single exon transcripts 7,372 12,537 3,275
Transcript N50 length 1,394 1,462 2,140
Total number of exons 124,366 77,845 44,719
Average exon length 297.36 491.59 490.93
Total transcripts with 3'UTR 17,403 7,890 8,749
Total transcripts with 5'UTR 16,053 7,063 8,448
Total Introns 91,654 47,502 29,531
Average Intron length 221.69 222.94 296.78
Table 7: Side-by-side PASA output comparison
Tables 4, 5, and 6 compare the key statistics of the previous annotations with the
PASA comparison results. First, PASA identified more genes in each species: 3,617 more genes for E. huxleyi, 4,317 more genes for G. oceanica, and 11,769 more genes for I. galbana. Second, PASA-predicted proteins appear to be more complete: 29,181 transcripts with both START and STOP codons vs
26,756 for E. huxleyi. The previous annotations did not have any START or STOP codons predicted for G. oceanica or I. galbana, whereas the PASA results describe 29,415 genes for G. oceanica and 14,608 genes for I. galbana containing both START and STOP codons. Table 7 shows that PASA predicted considerably more genes with 5' UTR sequences for E. huxleyi than for G. oceanica or
I. galbana.
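The transcript N50 length reported in Tables 4 to 7 can be computed from a list of sequence lengths. The following is a minimal sketch of the standard definition (the length of the longest sequence such that sequences at least that long account for half of the total length); the function name `n50` is ours and is not part of PASA.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For the lengths 2, 3, 4, 5, and 6 (total 20), the two longest sequences already cover 11 bases, so the N50 is 5.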
4.3. Training Data Preparation
Figure 11: Workflow for Generating Training Set. Unique complete ORFs with a minimum protein length of 300 were extracted using a PASA utility and blasted against the non-redundant database. The homologous gene structures were extracted and checked for redundancy and gene feature overlap.
4.3.1. Input Data
A subset of the PASA-generated alignment assemblies was used to extract protein coding regions
for generating a high quality data set for training the ab initio gene predictors
AUGUSTUS and SNAP. A training data set is essentially a hypothesis about the typical structure of the transcript produced by a gene. In this study, we extracted unique complete ORFs from the PASA assemblies as input to the ab initio predictors.
4.3.2. ORF
The region of a DNA sequence from the START codon to the STOP codon is referred to as an
open reading frame (ORF). The gene finding process starts by looking for an ORF. An ORF usually starts with a START codon and terminates with one of the three terminating codons: TAA, TAG,
or TGA. Depending on the starting position and strand, the same DNA sequence can be read in
six possible reading frames, each of which can yield a different protein sequence. TransDecoder, a tool
provided with the PASA pipeline, was used to find likely coding regions within transcripts.
TransDecoder identifies candidate coding sequences based on the following criteria:
1. Transcript sequences are scanned for ORFs of a minimum length. In our study, a
minimum length of 300 amino acids was used as the criterion for finding coding
regions.
2. A log-likelihood score. A likelihood ratio test is a statistical test used to compare the
goodness of fit of two models (the model under test vs an alternative model). The test is based
on the likelihood ratio, which expresses how many times more likely the data are
under one model than the other. When the logarithm of the likelihood ratio is used,
the statistic is known as a log-likelihood score. TransDecoder calculates the score
based on how likely the sequence under test is to be a coding sequence (vs a random
sequence). A score greater than 0 indicates a likely coding sequence.
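The log-likelihood idea in criterion 2 can be sketched as follows. This is a simplified illustration, not TransDecoder's implementation: it assumes a codon-frequency model for coding sequence and a uniform background model, and all function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codons(seq):
    """Split a sequence into in-frame codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_frequencies(training_seqs, pseudocount=1.0):
    """Estimate codon frequencies from known coding sequences (with smoothing)."""
    counts = Counter(c for s in training_seqs for c in codons(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def log_likelihood_score(seq, coding_freq, background_freq):
    """Sum of log(P_coding / P_background) over codons; > 0 favors the coding model."""
    return sum(
        math.log(coding_freq[c] / background_freq[c])
        for c in codons(seq)
        if c in coding_freq  # skip codons containing ambiguous bases such as N
    )
```

A sequence whose codons are common in the training set scores above 0, while a sequence built from codons that are rare in the training set scores below 0.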
4.3.3. Extracting Unique Complete ORF
TransDecoder was run on the generated PASA assemblies, and the resulting output file was scanned for unique complete ORFs. Non-overlapping ORFs were extracted as the initial data set.
The process of extracting the unique ORFs is outlined in Appendix 3.
Species Number of unique ORFs extracted
E. huxleyi 7,239
G. oceanica 17,113
I. galbana 13,135
Table 8: Unique Complete ORFs Extracted
4.3.4. Candidate Gene Structure Identification
The complete ORFs extracted by TransDecoder were then blasted against the non-redundant (nr) protein database at GenBank. This study selected ORFs with an expect value less than or equal to 0.000001 and an alignment coverage equal to or greater than 65%. Gene structures extracted using this method can be confidently used for training [7]. The extracted gene structures were clustered using CD-HIT¹⁶ to remove redundancy. A PASA tool that extracts the
best ORFs with a protein length greater than or equal to 300 amino acids was used to further filter
the training set. Finally, the extracted training set was checked for overlapping genes using
BEDTools.
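The hit selection described above (expect value at most 0.000001 and coverage of at least 65%) could be applied to tabular BLAST output along these lines. This is a sketch under stated assumptions: the input is assumed to be tab-separated with the columns query id, subject id, e-value, query start, query end, and query length, and coverage is assumed to mean the aligned fraction of the query; neither detail is specified in the text, and `filter_hits` is a hypothetical helper.

```python
import csv


def filter_hits(lines, max_evalue=1e-6, min_coverage=0.65):
    """Return the ids of queries with at least one hit passing both thresholds.

    Assumed columns per row: qseqid, sseqid, evalue, qstart, qend, qlen.
    """
    kept = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 6 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        qseqid, _sseqid, evalue, qstart, qend, qlen = row[:6]
        # Assumption: coverage is the aligned query span over the query length.
        coverage = (abs(int(qend) - int(qstart)) + 1) / int(qlen)
        if float(evalue) <= max_evalue and coverage >= min_coverage:
            kept.add(qseqid)
    return kept
```

The function accepts any iterable of lines, so it can be given an open file handle or a list of strings.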
Species nr hits (number of gene models)
E. huxleyi 7,830
G. oceanica 11,653
I. galbana 7,014
Table 9: ORF nr-hits
The nr database hits shown in Table 9 were further filtered to cluster the genes and
remove redundancy using the CD-HIT program. A gff3¹⁷ file for the nr hits was extracted, and the corresponding protein sequences were generated using the EVM tool. The protein fasta files were
extracted and clustered using CD-HIT (Table 10), and the resulting gff3 files were used as a
training set for the ab initio gene predictors.
16 CD-HIT (Cluster Database at High Identity with Tolerance) is a program for clustering and comparing protein or nucleotide sequences (http://www.bioinformatics.org/cd-hit/)
17 The general feature format (GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences. (https://en.wikipedia.org/wiki/General_feature_format)
Species Clustered nr hits
E. huxleyi 7,446
G. oceanica 6,743
I. galbana 4,495
Table 10: Clustered nr Hits for Gene Predictors
4.4. Ab Initio Training and Prediction
The training set of gene structures in gff3 format was used as input to the gene predictors
AUGUSTUS and SNAP. As a minimum for good performance, at least 200 gene structures
are needed for AUGUSTUS training. Generally speaking, the larger the training set the better the
prediction; however, there is little improvement when more than 1,000 gene structures are used
[17]. AUGUSTUS is already trained for many species. To run AUGUSTUS on a new species, the algorithm must be optimized, or trained, based on the characteristics of that particular species. The
Markov chain transition probabilities for coding and non-coding regions and the meta parameters for each species are created during the AUGUSTUS training process and can be found under the
AUGUSTUS config folder.
Manual AUGUSTUS training was performed, and gene prediction species models were created for each of the three species, referred to as ehux, geph, and iso. Gene prediction was performed thereafter, the results of which are presented in Table 11. The commands are listed in Appendix 4.
Species Genes predicted
E. huxleyi 49,111
G. oceanica 54,354
I. galbana 21,639
Table 11: AUGUSTUS Gene Prediction
Another ab initio prediction tool used was SNAP. SNAP (Semi-HMM¹⁸-based Nucleic Acid Parser)
is a general purpose gene predictor used for both eukaryotic and prokaryotic genomes. SNAP
requires either a precompiled HMM model file or an HMM file created for each new species to be
used in the prediction stage. The training gff3 file was converted to the format required by
SNAP. The steps to create an HMM model for SNAP are listed in Appendix 4, and the results
obtained are presented in Table 12 below.
Species Genes predicted
E. huxleyi 76,582
G. oceanica 104,440
I. galbana 29,697
Table 12: SNAP Gene Prediction
4.5. Reference Protein Alignment Generation
Reference protein alignments were generated using MAKER2 by aligning each sister species genome against the UniProt reference protein database. The alignments resulting from the run were formatted for EvidenceModeler (EVM) input. This technique tries to find parts of the genome that align with known protein sequences in the reference database. The MAKER2 commands are listed in Appendix 5.
4.6. Combining Evidences
PASA transcript assemblies, ab initio predictions from AUGUSTUS, and reference protein alignments were combined using the EVM package, which provides a flexible framework for combining diverse evidence types into a single gene structure [18]. The genome file in fasta format and the protein alignment, transcript alignment, and prediction files, all in general feature format
(gff3), are used as input to EVM along with the weight file, which contains a numeric weight
value to be applied to each type of evidence. The weights can be configured intuitively based on
the quality of the evidence; transcript alignments are typically weighted heaviest, followed by protein alignments and then gene predictions. In our study we gave the highest weight, 10, to the
PASA assemblies, followed by a weight of 5 for the reference protein alignments and a weight of 1 for the gene predictor AUGUSTUS. SNAP was not used in the re-annotation pipeline because it generated too many false positive predictions. Re-annotation pipeline validation and selection is described in section 4.8.
18 A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network (https://en.wikipedia.org/wiki/Hidden_Markov_model)
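The weighting scheme described above (10 for PASA assemblies, 5 for protein alignments, 1 for AUGUSTUS) could be written to an EVM weights file, which lists an evidence class, an evidence source, and a weight on each line. The sketch below only builds the file content; the source labels in the second column are hypothetical placeholders and would need to match the source column of the gff3 files actually passed to EVM.

```python
def evm_weights(entries):
    """Render (evidence class, source label, weight) triples as tab-separated lines."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in entries)


# Weights taken from the text; source labels are illustrative assumptions.
WEIGHTS = evm_weights([
    ("TRANSCRIPT", "pasa_assemblies", 10),           # PASA transcript assemblies
    ("PROTEIN", "maker_protein_alignments", 5),      # reference protein alignments
    ("ABINITIO_PREDICTION", "AUGUSTUS", 1),          # ab initio gene predictions
])
```

Writing `WEIGHTS` to a file (for example with `open("weights.txt", "w")`) would give a three-line weights file reflecting the priorities used in this study.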
Species EVM predicted genes Previous genes
E. huxleyi 46,705 30,446
G. oceanica 53,480 28,441
I. galbana 18,875 13,148
Table 13: EVM Predicted Genes
Table 13 shows the predicted genes when all of the evidence was combined using EVM.
The predicted genes were clustered in the next step to generate the final gff3 file for each
species.
4.7. Gene Clustering
EvidenceModeler output was clustered using the CD-HIT program to remove any redundancy and generate the final gff3 file for each species. This was followed by generating the protein and gene files.
Species Previous genes Clustered (Re-annotated)
E. huxleyi 30,446 35,582
G. oceanica 28,441 52,680
I. galbana 13,148 18,712
Table 14: Final Predicted Genes
4.8. Validating Workflow
To validate the re-annotation workflow, the previous high confidence JGI annotation of
Emiliania huxleyi was compared against the re-annotated files. The 30,499 E. huxleyi proteins annotated by JGI were blasted against the re-annotated proteins with an expect value of
0.000001. The intermediate methods before the final EVM predictions were adjusted in order to maximize the number of previous JGI E. huxleyi proteins that hit the re-annotated E. huxleyi proteins.
The expectation for the re-annotation pipeline was to find homology for at least 95% of the JGI annotated E. huxleyi proteins among the re-annotated E. huxleyi proteins. Once the workflow methods were adjusted based on the E. huxleyi re-annotation, the other two species were re-annotated using the same workflow and techniques. Table 15 shows the various methods that were used to generate the E. huxleyi re-annotation.
Method Inputs to EVM # Hits F1 score
Based on unique complete ORFs
1 PASA transcripts + RefSeq Alignments + AUGUSTUS 29,000 0.87
2 PASA transcripts + RefSeq Alignments + AUGUSTUS + SNAP 28,698 0.87
After filtering training set to remove overlapping features
3 PASA transcripts + RefSeq Alignments + AUGUSTUS 28,978 0.89
4 PASA transcripts + RefSeq Alignments + AUGUSTUS + SNAP 29,624 0.78
Table 15: Workflow Validation: Inputs to EVM
The re-annotation pipeline evaluation was based on the F1 score calculated for each method
listed in Table 15. The F1 score measures the accuracy of a method using the precision p and
recall r. Precision is the ratio of true positives (TP) to all predicted positives (TP + FP). Recall is
the ratio of true positives to all actual positives (TP + FN) [19]. The F1 score, or balanced metric, is
the harmonic mean of precision and recall:
$$F_1 = \frac{2\,(p \cdot r)}{p + r} = \frac{2\,TP}{2\,TP + FP + FN}$$
The F1 metric weights recall and precision equally, and a good method will maximize both
precision and recall simultaneously [19, 20]. Moderately good performance on both precision
and recall is favored over extremely good performance on one and poor performance on the other [19, 20, 21].
Based on the F1 score column in Table 15, method #3 was chosen as the best workflow for
re-annotating the other two sister species.
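As a small worked illustration of the F1 score used for this selection, the sketch below computes it from counts of true positives, false positives, and false negatives. How these counts were derived from the BLAST hits is not detailed in the text, so the example values are purely illustrative and `f1_score` is our own helper name.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With 90 true positives, 10 false positives, and 30 false negatives, precision is 0.90 and recall is 0.75, giving an F1 score of about 0.818, which equals 2TP / (2TP + FP + FN) = 180 / 220.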
4.9. Comparative study
This part of the study focused on a comparison of the three species using the BLAST
suite of programs.
4.9.1. Reference E. huxleyi
Figure 12: G. oceanica on E. huxleyi and I. galbana on E. huxleyi. (a) Venn diagram illustrating the number of G. oceanica homologous proteins mapped to E. huxleyi (b) Venn diagram illustrating the number of I. galbana homologous proteins mapped to E. huxleyi
Figure 12 shows the number of G. oceanica and I. galbana genes that map to E. huxleyi.
blastp was used with an expect value of 0.000001, with E. huxleyi as the reference and the other two sister
species as the query. 76.6% of G. oceanica genes map onto E. huxleyi, and 74.4% of I. galbana
genes map onto E. huxleyi.
4.9.2. Reference G. oceanica
Figure 13: E. huxleyi on G. oceanica and I. galbana on G. oceanica (a) Venn diagram illustrating the number of E. huxleyi homologous proteins mapped to G. oceanica (b) Venn diagram illustrating the number of I. galbana homologous proteins mapped to G. oceanica
Figure 13 shows that 89.3% of E. huxleyi genes map onto G. oceanica, vs 82.0% of I.
galbana genes. Figure 13 has different intersections of
common genes compared to Figure 12 because in Figure 13 G. oceanica is used as the reference,
whereas in Figure 12 G. oceanica was used as the query.
4.9.3. Reference I. galbana
Figure 14: E. huxleyi on I. galbana and G. oceanica on I. galbana (a) Venn diagram illustrating the number of E. huxleyi homologous proteins mapped to I. galbana (b) Venn diagram illustrating the number of G. oceanica homologous proteins mapped to I. galbana
66.1% of E. huxleyi genes map onto I. galbana, vs 68.6% of G. oceanica genes.
As expected, the intersection region of common genes differs from that in Figure 12
and Figure 13 because in this comparison I. galbana is used as the reference. G. oceanica has
more shared genes with I. galbana; this could be attributed to the larger number of genes in G.
oceanica.
4.9.4. Three way comparison with individual species as Reference
In this part of the study, each species was blasted against each of the other two species using blastp, and the common gene ids were extracted using a set intersection function. This method was used to find how many genes of a particular species fall in the three-species intersection.
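The set intersection step can be sketched as follows. This is an illustration rather than the script used in the study: hits are assumed to be available as (query id, subject id, e-value) tuples parsed from blastp output, and the helper names are hypothetical.

```python
def mapped_queries(hits, max_evalue=1e-6):
    """Ids of query genes with at least one hit at or below the e-value cutoff."""
    return {query for query, _subject, evalue in hits if evalue <= max_evalue}


def percent_mapped(mapped, total_genes):
    """Percentage of a species' genes that map onto the reference."""
    return 100.0 * len(mapped) / total_genes


def three_way_common(hits_vs_first, hits_vs_second, max_evalue=1e-6):
    """Genes of one species that have hits in both of the other two species."""
    return mapped_queries(hits_vs_first, max_evalue) & mapped_queries(hits_vs_second, max_evalue)
```

Running `three_way_common` once per species, with that species as the query against each of the other two, yields the per-species counts of genes in the three-species intersection.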
Figure 15: Common Genes based on (a) E. huxleyi, (b) G. oceanica and (c) I. galbana. Venn diagram showing the common genes when each species in turn is used as the reference.
Figure 15 (a) is based on E. huxleyi genes mapping onto G. oceanica and I. galbana.
Figure 15 (a) indicates that there are 22,875 genes in E. huxleyi that show homology to
genes in G. oceanica and I. galbana. Similarly, Figure 15 (b) illustrates that there are 28,714
genes in G. oceanica that show homology to genes in E. huxleyi and I. galbana. Finally,
Figure 15 (c) shows that there are 13,223 genes in I. galbana that show homology to genes in E.
huxleyi and G. oceanica. As expected based on the genome sizes, the largest number of common genes
was found when G. oceanica was used as the reference.
4.9.5. Overlap for three species
Figure 16: Overall common genes. The middle part of the Venn diagram shows the common genes. 28,714 G. oceanica genes are mapped to both E. huxleyi and I. galbana genes. 22,875 E. huxleyi genes are mapped to both G. oceanica and I. galbana genes. 13,223 I. galbana genes are mapped to E. huxleyi and G. oceanica genes.
In this part of the study, we combined the individual diagrams in Figure 15 to generate
overlapped regions that give an idea of which genes are unique to each species and which genes are
shared between two and three species. As shown in Figure 16, there are 28,714 G. oceanica genes
that have similarities with E. huxleyi-G. oceanica-I. galbana genes. Additionally, 9,563
G. oceanica genes have similarities to 8,913 E. huxleyi-G. oceanica common genes. G. oceanica also has 7,444 genes that share similarities with 2,135 I. galbana genes. The study also found that
8.8% of E. huxleyi genes are unique, 13.2% of G. oceanica genes are unique, and 14.1% of
I. galbana genes are unique. G. oceanica has the most genes that share similarities with all three
E. huxleyi-G. oceanica-I. galbana genes, followed by E. huxleyi and I. galbana. We also found that
I. galbana has more genes that share similarities with E. huxleyi-I. galbana genes, even though
I. galbana has fewer total predicted genes than E. huxleyi.
5. Conclusion
In this study we started with the JGI annotated E. huxleyi and the MAKER annotated G. oceanica
and I. galbana. We attempted to improve the annotations by making use of the newly available
TRINITY assembled transcripts. We used a customized pipeline involving PASA, MAKER2,
AUGUSTUS, SNAP, the BLAST suite of programs, CD-HIT, and EvidenceModeler to assemble,
align, and re-annotate the sister species. Our first dataset for the pipeline was the E. huxleyi TRINITY
assembled transcripts, assembled under four different growth conditions. We chose E. huxleyi as
the first dataset because our existing annotation for this species is of higher confidence than those of the other two sister species. We sought to validate our customized pipeline by re-annotating a species with a known good annotation and comparing the results with the existing annotation.
A pipeline for re-annotation was presented. The re-annotation produced improved results compared with the existing annotation, predicting more complete gene structures. The re-annotated proteins of E. huxleyi contain 96.7% of the previously annotated E. huxleyi proteins.
35,582 E. huxleyi genes were predicted, compared to 30,499 in the previous JGI annotation.
52,680 G. oceanica genes were predicted, compared to 28,441 predicted by the previous MAKER2 annotation. For I. galbana, an improvement of 43% over the previous MAKER annotation was obtained using the re-annotation pipeline. The re-annotation pipeline predicted 2,425 additional E. huxleyi genes with both START and STOP codons, a significant improvement over the JGI annotation. The workflow is a good starting point for further in-depth study. It can be adjusted, and its default parameters changed, to suit different transcriptome data sets.
Incorporation of transcriptome sequence into the genome annotation has improved the first generation annotation and has set the stage for further comparative and functional studies.
The comparative study found 31,788 E. huxleyi genes that map to G. oceanica and 23,532 E. huxleyi genes that map to I. galbana. The ids of the common genes were extracted for further comparative studies. With G. oceanica as the reference and the other two species as the query, 89% of E. huxleyi genes map to G. oceanica vs 82% of I. galbana genes, which suggests that E. huxleyi and G. oceanica may be more closely related to each other than to
I. galbana.
Improved annotations of the closely related sister species provide a valuable resource for comparative and functional studies. With the revised annotations, further studies can be performed to identify genes that are associated with calcification. Furthermore, the revised annotations allow for in-depth studies of the molecular functions of the proteins.
References
1. Young, J. R. (n.d.). Emiliania huxleyi and other coccolithophores. Paleontology
Department, The Natural History Museum, London. [Cited: June 25, 2015]. Retrieved
from http://protozoa.uga.edu/portal/coccolithophores.html
2. Ehux “Tree of Life” alga sequenced. (June 13, 2013). [Cited: June 25, 2015]. Retrieved
from http://www.algaeindustrymagazine.com/ehux-tree-of-life-alga-sequenced/
3. Haas, B. J., Delcher, A. L., Mount, S. M., Wortman, J. R., Smith, R. K., Hannick, L. I.,
et al. (2003). Improving the Arabidopsis genome annotation using maximal transcript
alignment assemblies. Nucleic Acids Res, 31 (19), pp. 5654–5666.
4. Kellis, M., Patterson, N., Birren, B., Berger, B. & Lander, E. S. (2004). Methods in
comparative genomics: genome correspondence, gene identification and regulatory motif
discovery. Journal of computational biology : a journal of computational molecular cell
biology, 2–3, pp. 319–55.
5. Moret, B. M. E., Miller, W. C., Pevzner, P.A. & Sankoff, D. (n.d.). Computational
Challenges in Comparative Genomics. A Tutorial. [Cited: July 2, 2015]. Retrieved
from http://www.researchgate.net/publication/242480927_Computational_Challenges_in
_Comparative_Genomics_A_Tutorial
6. Emiliania huxleyi. (n.d.). In Wikipedia. [Cited: June 28, 2015]. Retrieved
from https://en.wikipedia.org/wiki/Emiliania_huxleyi
7. Eckalbar, W. L., Hutchins, E. D., Markov, G. J., Allen, A. N., Corneveaux, J. J.,
Lindblad-Toh, K., Di Palma, F., Alföldi, J., Huentelman, M. J., Kusumi, K. (2014).
Genome reannotation of the lizard Anolis carolinensis based on 14 adult and embryonic
deep transcriptomes. BMC Genomics, 14 (49).
8. Garber, M., Grabherr, M. G., Guttman, M. & Trapnell, C. (2011). Computational
methods for transcriptome annotation and quantification using RNA-seq. Nature
Methods, 8 (6), pp. 469-477.
9. Lu, B., Zeng, Z. & Shi, T. (2013). Comparative study of de novo assembly and genome-
guided assembly strategies for transcriptome reconstruction based on RNA-Seq. Sci
China Life Sci, 56 (2), pp. 143-155.
10. Engström, P. G., Steijger, T., Sipos, B., Grant, G. R., Kahles, A., RGASP Consortium,
Gunnar, R., Goldman, N., Hubbard, T. J., Harrow, J., Guigó, R., Bertone, P. (2013).
Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods,
10 (12), pp. 1185-1191.
11. De_novo_transcriptome_assembly (n.d.). In Wikipedia. [Cited: July 5, 2015]. Retrieved
from https://en.wikipedia.org/wiki/De_novo_transcriptome_assembly
12. Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I.,
Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N.,
Gnirke, A., Rhind, N., Di Palma, F., Birren, B. W., Nusbaum, C., Lindblad-Toh, K.,
Friedman, N., Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data
without a reference genome. Nature Biotechnology, 29, pp. 644–652.
13. Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J.,
Couger, M. B., Eccles, D., Li, B., Lieber, M., MacManes, M. D., Ott, M., Orvis, J.,
Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C. N., Henschel,
R., LeDuc, R. D., Friedman, N., Regev, A. (2013). De novo transcript sequence
reconstruction from RNA-Seq: reference generation and analysis with TRINITY. Nat
Protoc, 8 (8), pp. 1494-512.
14. Tae, H., Ryu, D., Sureshchandra, S. & Choi, J. (2012). ESTclean: a cleaning tool for
next-gen transcriptome shotgun sequencing. BMC Bioinformatics, 13.
15. Sequence cleaner program. (July 22, 2010). [Cited: July 6, 2015]. Retrieved
from http://sourceforge.net/projects/seqclean/files/
16. Univec database. (n.d.) [Cited: July 3, 2015]. Retrieved
from http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/
17. AUGUSTUS training. (January 14, 2011). [Cited: July 8, 2015]. Retrieved
from http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=Augustus.Augustus/
18. Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., White, O., Buell,
C. R. & Wortman, J. (2008). Automated eukaryotic gene structure annotation using
EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology, 9
(1).
19. Mean F Score. (June 1, 2015). [Cited: July 12, 2015]. Retrieved
from https://www.kaggle.com/wiki/MeanFScore
20. Precision and recall. (June 13, 2015). [Cited: July 12, 2015]. Retrieved
from https://en.wikipedia.org/wiki/Precision_and_recall
21. Powers, David M W (2011). Evaluation: From Precision, Recall and F-Measure to ROC,
Informedness, Markedness & Correlation.
22. Genome Project. (June 10, 2015). [Cited: July 22, 2015]. Retrieved
from https://en.wikipedia.org/wiki/Genome_project#Genome_annotation
23. Zhang, Xiaoyu (2013). Characterization of Small Silencing RNAs in Emiliania Huxleyi
Appendix 1: Input Data Preparation
Generate Unique TRINITY Transcript headers Input
Clean the transcript assembly Input
Generate Input TRINITY assembly statistics Input
Appendix 2: PASA
Run PASA Alignment Pipeline Input
Compare previous annotation with PASA generated assembly Input
Compare previous annotation with PASA generated assembly -g
Validate gff3 file for PASA comparison Input
Create a gff3 file from JGI gtf file for PASA pipeline Input
Create a gff3 file from MAKER gtf file for PASA pipeline Input
2. Convert the gtf to gff3 format as required by the PASA comparison pipeline
gtf_to_gff3_format.pl
Appendix 3: Training Set Generation
Extract ORFs from PASA assemblies Input
Extract the ORF’s with type ‘complete’ Input
Extract the protein sequences for the “complete” features Input
Blast the unique complete ORF’s against the nr database Input
Extract gff3 after the BLASTing unique complete ORF’s with nr database Input
Check for feature overlaps in final Training set gff3 Input
Appendix 4: Gene Predictions
AUGUSTUS training and prediction Input
new_species.pl --species=
..
setsid ./aug20
6. Join the output files to generate the final AUGUSTUS prediction file
cat *out | join_aug_pred.pl > geph_aug.gff
Notes
Manual training is much faster than auto training. Both methods were used. Auto training can take days to finish. Make sure that there is no defined species in the AUGUSTUS config folder.
SNAP training and prediction Input
hmm-assembler.pl speciesSNAP params/ > speciesSNAP.hmm
10. SNAP prediction
a) Predict gene structures and store in SNAP zff file
snap speciesSNAP.hmm
Appendix 5: Reference Protein Alignment
REFSEQ Protein alignments using MAKER2
Input
Command(s)/Steps
1. Generate the MAKER control files
maker -CTL
2. Leave maker_exe.ctl at its default values
3. Edit maker_opts.ctl
nano maker_opts.ctl
genome=/home/gopin001/fasta/march2015/species_genome.fasta
#-----Protein Homology Evidence
protein=/w2/xyzhang/data/uniprot_sprot.fasta
#-----Repeat Masking (leave values blank to skip repeat masking)
#-----Gene Prediction
protein2genome=1
4. Run MAKER
mpirun -n 40 maker
5. Gather all of the GFFs, with no DNA at the end
gff3_merge -n -d species_genome_master_datastore_index.log -o species_maker_annotations.gff
Notes
Align the genome file with the UniProt reference to find known protein alignments.
Appendix 6: EVM
Format the MAKER2 Protein alignments for EVM Input
Format the AUGUSTUS and SNAP output for EVM Input
:%s/\([*]*\tAUGUSTUS\tCDS\t[0-9]*\t[0-9]*\t[0-9,.]*\t[+,-]\t[0-9,.]\t\)transcript_id "\([a-z,0-9,.]*\)\(.*\)/\1ID=\2.cds;Parent=\2
f) Edit the start_codon line to add ID= and Parent= fields
:%s/\([*]*\tAUGUSTUS\tstart_codon\t[0-9]*\t[0-9]*\t[.]\t[+,-]\t[0-9,.]\t\)transcript_id "\([a-z,0-9,.]*\)\(.*\)/\1Parent=\2
g) Convert the AUGUSTUS gff3 to EVM gff3
augustus_to_GFF3.pl species_aug_gene_predictions.gff > species_aug_formatted.gff3
2. Concatenate the AUGUSTUS and SNAP
cat species_aug_formatted.gff3
Combining evidence using EvidenceModeler (EVM) Input
2. Generate the EVM Command Set
write_EVM_commands.pl --genome
Appendix 7: Clustering
Trimming the final output using the CDHIT Input