Analysis of Gene Co-Expression Networks of Two Coccolithophore Species
In affiliation with California State University, San Marcos
In partial fulfillment of the requirements for the degree of Master of Computer Science
By Nitesh Balasaheb Sabankar
Summer 2018

ABSTRACT

Emiliania huxleyi (E. huxleyi) and Gephyrocapsa oceanica (G. oceanica) are among the most abundant species of coccolithophores in the ocean. Both species produce coccoliths, making it feasible to use comparative genomics [1, 2]. Coccolithophores play an important role in the oceanic carbon cycle through calcification of coccoliths and photosynthesis. The main objective of the project was to study and compare the two sister coccolithophores, E. huxleyi and G. oceanica, using differential gene co-expression network analysis. A gene co-expression network consists of nodes, which correspond to genes, and edges, which correspond to co-expression relationships between genes. The direction and type of the co-expression relationships are not determined in gene co-expression networks. Gene co-expression networks allow the identification of candidate biomarkers and therapeutic targets. Such networks enable inferences about diseases and the system-level functionality of genes. These inferences are helpful in identifying gene characteristics [4, 5]. In this project, RNA-Seq data from E. huxleyi and G. oceanica were compared, and the genes were divided into groups that are co-expressed in the two species, called 'modules'. These modules were then compared with external traits to find out how the modules and traits are related. Functional enrichment analysis was also performed on these modules to identify significantly enriched genes in the modules of interest. After the analysis, lipid metabolism genes and biomineralization genes were related to the 12 modules that are highly preserved in both data sets, and biological functions for these genes were obtained.
Similarly, lists of genes that can be related to the set of biomineralization genes were obtained, which can greatly help in studying the biomineralization process in detail in the two species.

ACKNOWLEDGEMENT

I would first like to express my sincere gratitude and thanks to Dr. Xiaoyu Zhang, without whose constant support and advice this project would not have been possible. I would also like to thank Dr. Ahmad Hadaegh and Dr. Betsy Read for taking time out of their busy schedules to guide me through the entire process. I am deeply indebted to all of them. My sincere appreciation also goes out to the other faculty members and staff of the Computer Science department, my fellow students, and my family members, without whose support and motivation I would not have been able to complete this project.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
1 INTRODUCTION
2 BACKGROUND
  Gene co-expression networks
  Software for co-expression network analysis
  Weighted gene co-expression network analysis (WGCNA)
3 ARCHITECTURE
  Input Data
  Data pre-processing
    Low-count filtering
    Log-transforming data
    Normalization
  Co-expression network construction
    Soft-thresholding power selection
    Adjacency matrix construction
    Topological Overlap Matrix based network construction
  Hierarchical Clustering and module identification
  Assessing module preservation
  Module-trait relationship
  Functional enrichment analysis
4 IMPLEMENTATION AND RESULTS
  Input Data sets
  Data pre-processing
  Evaluating correlation between the datasets
  Co-expression network construction
    Adjacency matrix construction
  Topological Overlap Matrix based network construction
    Scaling of Topological Overlap Matrices
  Hierarchical clustering and module assignment
    Imposing unmerged modules
    Imposing merged modules
  Assessing module preservation
  Relating modules to external information
    Module-trait relationship
    Functional Enrichment Analysis
    Relating modules to biomineralization genes
    Relating modules to lipid metabolism genes
5 CONCLUSION AND FUTURE WORK
6 REFERENCES
7 APPENDICES
  APPENDIX 1 – GO analysis table for unmerged E. huxleyi data modules
  APPENDIX 2 – Table relating unmerged modules to biomineralization genes
  APPENDIX 3 – Table relating unmerged modules to lipid metabolism genes
  APPENDIX 4 – GO analysis table for merged E. huxleyi data modules
  APPENDIX 5 – Table relating merged modules to biomineralization genes
  APPENDIX 6 – Table relating merged modules to lipid metabolism genes

1 INTRODUCTION

G. oceanica and E. huxleyi are types of phytoplankton, microscopic organisms found mostly in the upper layers of the ocean, where there is sufficient sunlight. Like plants, phytoplankton derive energy through photosynthesis [6]. Coccolithophores are spherical cells about 5-100 micrometers across, enclosed by calcareous plates called coccoliths. They are among the most important micro-algae and are the third-most prominent group of phytoplankton [7]. Coccolithophores are exclusively marine organisms and, like other phytoplankton, are found in abundance in those parts of the ocean that receive sufficient sunlight [7]. Some coccolithophores differ from other oceanic phytoplankton in that they have a distinctive outer sphere of calcite plates known as coccoliths. Because of these unique properties, research on coccolithophores has emerged as a major area of interest among scientists who study global climate change. E. huxleyi is one of the most abundant species of coccolithophores. The species is named after Cesare Emiliani and Thomas Huxley, two scientists who discovered coccoliths embedded in the sediments at the bottom of the ocean [8]. The coccoliths of E. huxleyi are generally transparent and colorless, and are made of calcite, which refracts light very efficiently in water. In this study the focus