Downloaded from the NCBI Website [16]

Comparative Gene Expression Analysis in Emiliania huxleyi, Isochrysis galbana, and Gephyrocapsa oceanica

Arun Gopinath

In Partial Fulfillment of the Master of Computer Science
California State University San Marcos
July, 2015

Abstract

Comparative genomics is a field of biological research in which the genomic sequences of different species are compared. It provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved among species as well as genes that give each organism its unique characteristics. Genome annotations of newly sequenced species initially rely primarily on ab initio gene predictions and alignment of reference transcripts from related species; however, the quality of gene models improves greatly when same-species transcriptomes are incorporated. In this study, we started with the re-annotation of E. huxleyi, G. oceanica and I. galbana based on de novo assembly of transcriptomes. The de novo assembly tool TRINITY was used to assemble the transcriptome of each sister species, followed by a pipeline based on the Program to Assemble Spliced Alignments (PASA), the ab initio gene predictors AUGUSTUS and SNAP, EvidenceModeler (EVM) and MAKER2. The pipeline was validated by re-annotating E. huxleyi, for which a good first-generation JGI annotation exists. The revised E. huxleyi annotation describes 35,582 genes compared to 30,499 for the first-generation annotation. More than 95% of the previously annotated E. huxleyi genes mapped onto the 35,582 re-annotated E. huxleyi genes. The re-annotation pipeline predicted 52,680 G. oceanica genes compared to 28,441 for the previous annotation pipeline. The revised I. galbana annotation has 18,712 genes compared to 13,148 genes for the previous annotation. Once we had re-annotated the genomes of the sister species, we focused the research on the comparative study.
Various PASA, EVM and BLAST tools were used to determine the genome size, number of genes predicted, and number of shared genes among the sister species. The revised E. huxleyi annotation shares 31,734 orthologous genes with G. oceanica and 23,392 orthologous genes with I. galbana. A comparative study between G. oceanica and I. galbana found 15,358 common genes, with an additional 20,800 G. oceanica genes sharing considerable similarity to the common G. oceanica-I. galbana genes. Further comparative study found 2,642 unique I. galbana genes, 3,137 unique E. huxleyi genes and 6,959 unique G. oceanica genes.

Acknowledgements

It is my pleasure to extend my deepest gratitude to my committee members, Dr. Xiaoyu Zhang, Dr. Betsy Read and Dr. Ahmad Hadaegh, for sharing their wisdom during the course of this study. I am extremely thankful to them for their time and valuable suggestions to better this study. I am forever thankful to my wife, Kinga, for listening, offering me advice and supporting me through this entire process. My profoundest feelings of love go to my sons, Milan and Emil, for the important roles they play in my life. Finally, I want to acknowledge with gratitude the support of my parents and sister, who have always encouraged me to look forward to new beginnings rather than regret missed opportunities.

Table of Contents

1. INTRODUCTION
2. BACKGROUND
  2.1. TRANSCRIPTOMES
  2.2. SPLICED ALIGNMENT
  2.3. TRANSCRIPTOME ASSEMBLY
  2.4. TRANSCRIPTOME ALIGNMENT
  2.5. GENOME ANNOTATION
3. ARCHITECTURAL MODEL
  3.1. ANNOTATION PIPELINE
  3.2. COMPARISON PIPELINE
4. IMPLEMENTATION AND ANALYSIS
  4.1. DATA PREPARATION
    4.1.1. Input Data Source
    4.1.2. Data Trimming
  4.2. TRANSCRIPT ALIGNMENT
    4.2.1. PASA Pipeline
    4.2.2. Table 3: PASA Annotation Comparison
  4.3. TRAINING DATA PREPARATION
    4.3.1. Input Data
    4.3.2. ORF
    4.3.3. Extracting Unique Complete ORF
    4.3.4. Candidate Gene Structure Identification
  4.4. AB INITIO TRAINING AND PREDICTION
  4.5. REFERENCE PROTEIN ALIGNMENT GENERATION
  4.6. COMBINING EVIDENCES
  4.7. GENE CLUSTERING
  4.8. VALIDATING WORKFLOW
  4.9. COMPARATIVE STUDY
    4.9.1. Reference E. huxleyi
    4.9.2. Reference G. oceanica
    4.9.3. Reference I. galbana
    4.9.4. Three way comparison with individual species as Reference
    4.9.5. Overlap for three species
5. CONCLUSION
REFERENCES
APPENDIX 1: INPUT DATA PREPARATION
APPENDIX 2: PASA
APPENDIX 3: TRAINING SET GENERATION
APPENDIX 4: GENE PREDICTIONS
APPENDIX 5: REFERENCE PROTEIN ALIGNMENT
APPENDIX 6: EVM
APPENDIX 7: CLUSTERING

LIST OF FIGURES

Figure 1: RNA Sequencing
Figure 2: Spliced Alignment
Figure 3: Reference Based vs De Novo Assembly
