<<

A cross-class analysis of learning-related transcriptional profiles

by

Yiru Sheng

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Bioinformatics

Guelph, Ontario, Canada

© Yiru Sheng, September, 2020 ABSTRACT

A CROSS-CLASS ANALYSIS OF LEARNING-RELATED TRANSCRIPTIONAL PROFILES

Yiru Sheng Advisor(s): University of Guelph, 2020 Dr. Ayesha Ali Dr. Andreas Heyland

Learning and memory, fundamental neuronal processes, are found in various species.

Memory consolidation is essential to transform short-term memory into a long-lasting form. However, whether mechanisms underlying memory consolidation are shared across distant groups remains controversial. Here, meta-analysis, using publicly available RNA Sequencing data, was conducted to identify the similarity or differences of mechanisms underlying memory consolidation across five classes of . I clustered

3091 orthologous gene groups across classes; among them I identified shared learning mechanisms across multiple classes related to cell cycling control, cell signalling, transcription and translation regulations and general cellular processes. I also found some targets for specific learning conditions in this study. This is the first study employing both meta-analysis and comparative transcriptomic analysis of gene expression in memory consolidation across distant species, helping to frame a structure and a database for further analysis.

iii

ACKNOWLEDGEMENTS

First and foremost, I have to thank my supervisors, Dr. Ayesha Ali and Dr. Andreas Heyland, for their support, patience, and understanding over past one year. Without their assistance and dedicated involvement in every step throughout the process, this thesis would have never been accomplished. I would also like to thank my committee member, Dr. Frédéric Laberge for his assistance and suggestion on my thesis.

I am also grateful to Vern Lewis for sharing expertise, and sincere and valuable guidance to me.

I take this opportunity to express gratitude to all of the faculty members of Department of

Integrative Biology and Department of Mathematics & Statistics for their guidance on the process of experiments and IT support on the work environment.

Finally, I would like to thank my family and friends for providing me with unfailing support and continuous encouragement through the thesis. This accomplishment would not have been possible without them.

iv

TABLE OF CONTENTS

Abstract ...... ii Acknowledgements ...... iii Table of Contents ...... iv List of Tables ...... vi List of Figures ...... vii List of Symbols, Abbreviations or Nomenclature ...... viii 1 Introduction and Background ...... 1 1.1 The evolution and mechanisms in memory ...... 1 1.1.1 Learning and memory-related studies from an evolutionary perspective ...... 1 1.1.2 Memory formation and consolidation ...... 3 1.2 Strategies in comparative analysis across distant species ...... 5 1.2.1 A meta-analysis of comparative transcriptomic data ...... 5 1.2.2 Orthologous gene group searching and Gene expression analyzing ...... 6 1.3 Contributions ...... 8 2 Methodology ...... 9 2.1 Datasets selection ...... 9 2.2 RNA-Seq data processing pipeline ...... 11 2.2.1 Set up processing environment and download experimental data ...... 11 2.2.2 De novo Transcriptome assembly ...... 11 2.2.3 Gene expression quantification ...... 12 2.2.4 Orthologs searching and annotation ...... 13 2.3 Meta-data differential expression analysis ...... 14 2.3.1 Exploratory differential expression analysis ...... 15 2.3.2 Targeted differential expression analysis ...... 17 3 Results ...... 17 3.1 Variance across datasets using standardized processes ...... 17 3.2 Interclass searching for cross-class orthologous gene groups ...... 20 3.3 Differential gene expression analysis ...... 21 3.3.1 Two-step clustering testing the dependency between expression levels and the class of species ...... 21 3.3.2 The most significant differential expressed contrast for each dataset ...... 23

v

3.3.3 Gene Ontology enrichment analysis ...... 25 3.3.4 Matching with the most relevant mammalian learning-associated gene list ...... 31 4 Discussion ...... 35 4.1 Orthologs searching across distant species and meta-data database construction ...... 35 4.2 Shared pathways underlying memory consolidation ...... 36 4.3 Validation of the expression patterns of learning-associated genes in mammals ...... 38 4.4 Potential mechanisms underlying specific training paradigms ...... 38 4.5 Integration of meta-analysis and comparative transcriptomic analysis ...... 40 4.6 Performance comparison of two methods in differential expression analysis ...... 41 4.7 Methodological limitation ...... 43 5 Conclusion ...... 43 References ...... 45 Appendix ...... 51

vi

LIST OF TABLES Table 2.1 Summary of 14 used studies...... 10 Table 3.1 Summary of transcriptome data after de novo assembly ...... 19 Table 3.2 DEcOGGs that included more than one filtered expression level (fold-change relative to control) ...... 23 Table 3.3 DEcOGGs and the source class enriching in 6 labelled GO clusters ...... 28 Table 3.4 Thirteen labelled GO clusters shared across source classes ...... 30 Table 3.5 The Learning-cOGGs found in Mammalian studies filtered by larger than 2-fold- change cut-offs...... 32

Table A. 1 Details of DEcOGGs in cluster_2 found from the global method ...... 53 Table A. 2 Details of DEGs found from the combined method ...... 55

vii

LIST OF FIGURES

Figure 1.1 Learning (associative learning) in the evolution of animals...... 2 Figure 1.2 A model of memory systems of relationships among brain regions in ...... 4 Figure 2.1 The flowchart of RNA-Seq raw data analysis pipeline ...... 11 Figure 2.2 The flowchart of two exploratory differential expression analysis methods ...... 15 Figure 3.1 Distribution of top 8 Level-1 GO term functional classification from all Mammalia sequence annotations ...... 21 Figure 3.2 The size and components of three clusters from two-step clustering ...... 22 Figure 3.3 Differential expression analysis results for each dataset ...... 24 Figure 3.4 Heatmap of the count of DEGs from all datasets grouping by the class of species and cOGG ...... 25 Figure 3.5 The heatmaps of GO enrichment analysis for two differential expression analysis methods ...... 27 Figure 3.6 The heatmap of the differential expression level of 40 Learning-cOGGs in Amphibia, Actinopteri, and Chromadorea studies ...... 34

Figure A. 1 Distribution of counts in cOGGs ...... 51 Figure A. 2 PCA plots ...... 51 Figure A. 3 Hierarchical clustering of classes for Gene Ontology enrichment matrices ...... 52 Figure A. 4 Hierarchical clustering of species in Amphibia, Actinopteri, and Chromadorea for the differential expression level of 40 Learning-cOGGs...... 67

viii

LIST OF SYMBOLS, ABBREVIATIONS OR NOMENCLATURE

STM Short-term memory LTM Long-term memory α-CaMKII α subunit of calcium/calmodulin dependent protein kinase II Arc Activity-regulated cytoskeleton-associated protein FMRP Fragile X mental retardation protein CRE cAMP response element CREB cAMP response element-binding protein RNA-Seq RNA sequencing OGGs Orthologous gene groups RSEM RNA-seq by Expectation-Maximization TPM Transcripts per million MCL Markov clustering cOGGs Cross-class orthologous gene groups GO Gene Ontology DEcOGGs Differentially expressed cOGGs DEGs Differentially expressed genes BP Biological processes CC Cellular component MF Molecular function Myo15 Unconventional myosin-XV Nmna1 Nicotinamide/nicotinic acid mononucleotide adenylyltransferase 1 Ubp30 Ubiquitin carboxyl-terminal hydrolase 30 Rcaf1 Respirasome complex assembly factor 1 TRAP-seq Translating ribosome affinity purification RNA sequencing Grp75 Stress-70 protein, mitochondrial Stim2 Stromal interaction molecule 2 Learning-cOGGs cOGGs in the list of most relevant mammalian learning-associated genes IEG Immediate early gene GPI Glycosylphosphatidylinositol cAMP Cyclic AMP CDKs Cyclin-dependent protein serine/threonine kinase activity Cdk5 Cyclin-dependent-like kinase 5 Rrp5 Protein RRP5 homolog VMP1 Vacuole membrane protein 1 Egr3 Early growth response protein 3 LTP Long-term potentiation Csad Cysteine sulfinic acid decarboxylase Fkbp4 Peptidyl-prolyl cis-trans isomerase FKBP4

ix

Uck2 Uridine-cytidine kinase 2 Ubb Polyubiquitin-B Syt4 Synaptotagmin-4

1 Introduction and Background

Learning and memory are key aspects of cognitive function and mechanisms underlying these processes are an important field of inquiry in neuroscience research. Learning and memory formation allows organisms to respond to environmental change in order to survive. These two fundamental processes are found in almost all species, but little is known about similarities and differences of learning and memory mechanisms across distant taxonomic groups. To address this question, I conducted a comparative meta-transcriptome study that first established orthologous relationships of genes (transcripts) among taxonomic groups. Expression levels of these orthologous transcripts were then compared between learning conditions and control. This analysis allowed us to obtain mechanisms associated with learning and memory across phylogenetically diverse and invertebrate animals. This is the first study employing both meta-analysis and cross-species comparative analysis of brain gene expression pattern in learning and memory. The process established here can be used to analyze publicly available datasets to analyze the evolution of complex biological processes between distantly related organisms and reconstruct putative evolutionary trajectories.

1.1 The evolution and mechanisms in memory

1.1.1 Learning and memory-related studies from an evolutionary perspective

In this thesis, I compared learning and memory processes between distant taxonomic groups. This analysis provided me with insights into the similarities and differences in these mechanisms and allowed me draw inferences about the evolution of learning and memory processes in animals. The core theory of this comparison is the general-process view, which assumes that learning processes

1

are basically the same in all animals that exhibit some form of learning. It has been pointed out that across all vertebrates, learning can be explained by the general processes of associative learning (Macphail, 1987; MAGPHAIL & Bolhuis, 2001). The existing ability to learn was attributed to associative learning in the extant representatives of the Metazoan groups, in particular in which appeared during the Cambrian explosion (Ginsburg & Jablonka, 2010). The learning mechanisms of some descendants in this group have been studied (Figure 1) , including

Mammalia, Amphibia and Actinopteri from Chordata, Insecta from Arthropoda, and Chromadorea from Nematoda.

Figure 1 Learning (associative learning) in the evolution of animals. The phylogenetic tree shows the lineage relationship among some species and different levels of categories (Ginsburg & Jablonka, 2010). The emergence of associative learning also as the beginning of the Cambrian explosion is marked with the thick strip. Some important taxonomy classifications in revealing the history of learning evolution are annotated with the angled font. The classes, as the most focused classification level in this study, are highlighted with blue, and the second main level- subphylum, includes vertebrate (groups in the gray box) and invertebrate (groups in the white box). The species, also interested in this study, were bolded with common names in parentheses.

Recently, researchers have analyzed transcriptional changes in the brain in response to a diversity of learning activities. In mammals, different learning tasks affecting brain plasticity share common mechanisms (Torrealba, Madrid, Contreras, & Gómez, 2017). However, it is unclear whether most processes involved in learning are shared across a broader taxonomy range.

2

1.1.2 Memory formation and consolidation

The identification of mechanisms responsible for long-term memory formation and consolidation has been a challenging goal of neuroscience for a long time. From the synaptic plasticity and memory theory, memory is the change of the strength of neural connections as postaqusition of the environment knowledge (Martin, Grimwood, & Morris, 2000). A widely recognized general process was learned information acquisition, comprised of short-term memory (STM), long-term memory (LTM) formation (or memory consolidation), memory recall, and reconsolidation (Stern

& Alberini, 2013). STM and LTM were defined by the duration of the memory: STM, the ephemeral memory lasting a few seconds or minutes; and LTM, the enduring memory lasting hours or days. STM and LTM are separate, but related mechanisms. STM is stored as a short-term firing pattern in neurons; when the firing pattern vanishes, the memory will not exist (Sweatt, 2010). The formation of LTM starts from the transformation of STM to a stable form (Davis & Squire, 1984), also regarded as the memory consolidation, and requires some repetitions.

From the perspective of experimental operation, repetition requires spaced multiple time stimulus

(E. R. Kandel, 2001). During this transformation period, some processes result in changes in response to the specific stimulus using the well-formed memory system (Labar & Cabeza, 2006;

Nadel & Hardt, 2011) in the brain (Figure 2). LTM requires relatively more complex molecular and cellular mechanisms than STM. Memory consolidation is also time dependent. During this consolidation period, memory trace is formed in the brain, building synaptic connections within a set of neurons (Gallistel & Matzel, 2013). Neuroprotection is important to maintain the memory by protecting the neuron networks. Neuroprotection has been targeted for the treatment of neurodegenerative disorder diseases because the destruction of neuron networks makes the storage of new memories impossible (Jahn, 2013). Other processes that occur in this memory

3

transformation time include: 1) changes linked to certain cell signalling pathways, 2) changes at transcription and translation levels, and 3) changes responsible for structural changes in the brain, such as the quantitative and structural changes of dendritic spines.

Figure 2 A model of memory systems of relationships among brain regions in vertebrates. This frame is based on the frames from Labar and Cabeza (2006); Nadel and Hardt (2011). Brain regions (in boxes) as well as some target functions interested in this study (annotated around the region). The direct and indirect connections are shown with solid and dashed arrows, respectively.

The changes of signalling pathway function as an upstream regulation in response to neurotransmitters, such as glutamate, can impact the synaptic plasticity and memory function. How transcription and translation impact memory consolidation correlates with de novo gene expression

(Abel, Martin, Bartsch, & Kandel, 1998) and de novo protein synthesis (Davis & Squire, 1984).

The involvement of de novo gene expression is supported by observations of blocked LTM due to inhibitors that function in transcription. In activated synapses, previous studies have found processes of protein synthesis, including α subunit of calcium/calmodulin dependent protein kinase

4

II (α-CaMKII), Activity-regulated cytoskeleton-associated protein (Arc), and fragile X mental retardation protein (FMRP) (Weiler et al., 1997; Y. Yin, Edelman, & Vanderklish, 2002). An example of regulation to structural changes is cAMP response element (CRE)-binding protein

(CREB) and its related signalling pathway (Silva, Kogan, Frankland, & Kida, 1998; Xia et al.,

2009), first defined in Aplysia (Kaang, Kandel, & Grant, 1993). CREB has been shown to affect the formation of LTM across distinct species including flies (Perazzona, 2004; Sakai, Tamura,

Kitamoto, & Kidokoro, 2004; J. C. P. Yin et al., 1994), and mice (Bourtchuladze et al., 1994;

Pandey, Mittal, & Silva, 2000). After the activation of the cAMP-dependent protein kinase and the Ras-MEK-ERK/MAPK cascade, the CREB-mediated transcription results in structural changes of activated neurons (Barco & Marie, 2011; Levenson, 2004).

1.2 Strategies in comparative analysis across distant species

1.2.1 A meta-analysis of comparative transcriptomic data

In recent decades, the development of RNA sequencing (RNA-Seq) techniques has provided efficient methods to analyze whole transcriptomes, allowing for a full detection of the transcriptomic response of organisms under various conditions. Numerous studies have applied these high-throughput techniques to characterize associated learning mechanisms, while comparing the transcriptomic profiles from stimulated and naïve animals’ responses to single learning paradigms for specific species, such as mouse (Bero et al., 2014; Chen et al., 2017; Rao-

Ruiz et al., 2019), rat (Pardo et al., 2017; Willeman et al., 2018), toad (Lewis, Laberge, & Heyland,

2020), swordtail fish (Cui, Delclos, Schumer, & Rosenthal, 2017), zebrafish (Huang, Butler, &

Lubin, 2019), honey bee (McNeill, Kapheim, Brockmann, McGill, & Robinson, 2016), fruit fly

(Crocker, Guan, Murphy, & Murthy, 2016; Jones, Nixon, Chubak, & Kramer, 2018; Widmer,

5

Bilican, Bruggmann, & Sprecher, 2018), and roundworm (Camacho et al., 2018). Another study used transcriptome profiling data to make comparisons across species in order to explore the variations in sensory and cognitive specializations, such as the facial recognition of a unique species in paper wasps (Berens, Tibbetts, & Toth, 2017). To reveal characterizations of the evolution of mechanisms, recent studies have also applied wide cross-species RNA-Seq comparisons to discover conserved features, such as the shared mechanisms associated with monogamy across vertebrates (Young et al., 2019), and the conserved role for Ferroportins in nickel hyperaccumulation across five genera in plants (De La Torre et al., 2018).

The integration of meta-analysis and comparative transcriptome analysis has been applied in some studies to make full utilization of previous data in public repositories. A meta-transcriptomic data analysis of rice identified genes conferring tolerance to environmental stresses (Buti et al., 2019).

Similarly, a recent plant study integrated cross-species meta-analysis and machine-learning to identify microalgae targeted pathways that respond to salt stress (Panahi, Frahadian, Dums, &

Hejazi, 2019). However, to our knowledge, no cross-species comparative transcriptome studies have investigated the presence of conserved mechanisms associated with learning and memory across multiple distant species. In this study, I developed a wide cross-species meta-analysis of learning and memory with a bioinformatics pipeline. This investigation may have applications in revealing the evolutionary patterns of learning and memory for achieving a better understanding of brain activities.

1.2.2 Orthologous gene group searching and Gene expression analyzing

Until recently, comparative biology relied solely on morphological characters; but now, functional information of mRNA sequences is widely used in comparisons across species. While there is a significant similarity (>70%) between invertebrate and vertebrate genomes (Prachumwat & Li,

6

2008), it is not wise to directly compare sequence similarity and ignore function variations across species. However, interpretation of sequences from evolutionary and functional perspectives has seen great progress, primarily due to annotation advances in global genome sequence databases from a diverse range of organisms.

Based on the distinct evolutionary processes, Walter Fitch, in the 1970s, defined orthologs and paralogs from homologs (Fitch, 1970). Orthologs are clusters of homologous genes across species related through speciation of a single gene in the last common ancestor, whereas, paralogs are clusters of homologous genes across species related through gene duplication but have unique functions. Although the clustering of orthologous gene groups (OGGs) is based on the sequence similarity, the translated protein sequences representing the function of genes are used to inform the alignment (Li Li, Stoeckert, & Roos, 2003). The initial goal of clustering OGGs was to investigate the forces and mechanisms of the evolutionary process and to group functionally equivalent genes. It is beneficial to group unknown genes with known genes from other species, resulting in genes from non-model organisms that can then be compared to well-annotated model species. In summary, clustered OGGs is one of the most essential parts of cross-species comparative transcriptome analysis. In this thesis, the sets of differentially expressed genes across species were clustered by OGGs.

The differential meta-analysis of RNA-Seq data is not straightforward because the method used to combine expression data from multiple studies can strongly impact the results of the meta-analysis.

Two main strategies of analysis are 1) global differential analysis, in which raw data are jointly normalized across studies; and 2) combination of individual differential analyses, in which differential expression levels of every gene would be calculated in each dataset (or study) before being combined together. The global differential analysis assumes the per-gene dispersion is the

7

same across studies, while the combination method estimates the per-gene mean and dispersion separately for each study. Rau, Marot, and Jaffrézic (2014) compared several methods from these two strategies and demonstrated that when inter-study variability is low, the results are very close to those of a global GLM analysis with a study effect. When the inter-study variability is high, the combination of individual analyses tends to outperform the global analysis; however, the global method is still beneficial to conclude the general transcriptional patterns. This thesis examines results from both global differential analysis and from combining differential analysis from individual studies; however, other biological and technical effects across studies needed to be accounted for as well. In the present study, I first demonstrated that the global method on per-OGG expression data could be adapted to the differential meta-analysis of RNA-seq data and compared with the combined method on per-gene expression data.

1.3 Contributions

The main contributions of this thesis are: 1) development of a framework for comparing differentially expressed learning-related genes across distantly-related species; 2) design of a pipeline to standardize re-assembly of transcriptomes from raw data of published studies; 3) construction of a database with transcriptome data tagged with meta-data of study design features;

4) analysis of orthologous gene groups and comparison of differential gene expression patterns; and 5) identification of shared mechanisms across classes related to memory consolidation and to memory-type specific pathways.

Section 2 details how the meta-analysis was conducted. Section 3 presents the analysis results.

Section 4 discusses the results within the context of learning mechanisms and learning-related pathways. Finally, Section 5 provides conclusions and directions for future work.

8

2 Methodology

2.1 Datasets selection

An internet search on PubMed (https://pubmed.ncbi.nlm.nih.gov/) was conducted in April of 2019 using the keywords: “RNA Sequencing”, “learning (OR memory)” and “gene expression (OR transcriptome)”. A list of additional papers was provided by Vern Lewis and Dr. Andreas Heyland because of their expertise. Search results were filtered based on 1) access to raw data, and 2) study design. Papers without access to the raw data or with relatively complicated study design, or with distinct species object were discarded from further analysis. In total, 41 papers were identified, of which 22 provided access to the raw data. Of the remaining 22 papers, 2 had complicated study design and 2 focused on distinct species; the addition of one paper happened in September of 2019 using the same keywords above on the RNA-Seq public repository- the Sequence Read Archive

(Leinonen, Sugawara, Shumway, & International Nucleotide Sequence Database, 2011). The final list of datasets contained 15 datasets from 14 studies, including ten strains and 1-3 datasets per species (common name); the details of 14 studies are presented in Table 2.1.

9

Table 2.1 Summary of 14 used studies. Project Paper title Type of memory Reference mouse_1 Early remodelling of the neocortex upon episodic memory encoding fear memory Bero et al.

(2014) mouse_2 Mapping Gene Expression in Excitatory Neurons during Hippocampal hippocampal-dependent Chen et al.

Late-Phase Long-Term Potentiation memory (2017) mouse_3 Engram-specific transcriptome profiling of contextual memory fear memory Rao-Ruiz et

consolidation. al. (2019) rat_1 IGF-I Gene Therapy in Aging Rats Modulates Hippocampal Genes spatial memory Pardo et al.

Relevant to Memory Function (2017) rat_2 The PKC-β selective inhibitor, Enzastaurin, impairs memory in middle- spatial reference memory Willeman et

aged rats al. (2018) toad Temporal profile of brain gene expression associated with learning in an prey catching memory Lewis et al.

anuran amphibian (2020) swordtail Early social learning triggers neurogenomic expression changes in a the early experience of Cui et al.

swordtail fish social exposure memory (2017) zebrafish Telencephalon transcriptome analysis of chronically stressed adult stress experience memory Huang et al.

zebrafish (2019) wasp_1 Cognitive specialization for learning faces is associated with shifts in the pattern & face memory Berens et al.

brain transcriptome of a social wasp (2017) wasp_2 Cognitive specialization for learning faces is associated with shifts in the pattern & face memory Berens et al.

brain transcriptome of a social wasp (2017) bee Brain regions and molecular pathways responding to food reward type reward quality memory McNeill et al.

and value in honeybees (2016) fly_1 Cell-Type-Specific Transcriptome Analysis in the Drosophila aversive olfactory memory Crocker et al.

Mushroom Body Reveals Memory-Related Changes in Gene Expression (2016) fly_2 Regulators of Long-Term Memory Revealed by Mushroom Body- appetitive olfactory Widmer et al.

Specific Gene Expression Profiling in Drosophila melanogaster memory (2018) fly_3 Mushroom Body Specific Transcriptome Analysis Reveals Dynamic courtship memory Jones et al.

Regulation of Learning and Memory Genes After Acquisition of Long- (2018) Term Courtship Memory in Drosophila roundwor The Memory of Environmental Chemical Exposure in C. elegans Is ancestor chemical Camacho et

m Dependent on the Jumonji Demethylases jmjd-2 and jmjd-3/utx-1 exposure memory al. (2018)

10

2.2 RNA-Seq data processing pipeline

2.2.1 Set up processing environment and download experimental data

Datasets containing multiple runs were downloaded from the Sequence Read Archive (Leinonen,

Sugawara, et al., 2011) through a list of FTP links, obtained from the European Nucleotide

Archive (Leinonen, Akhtar, et al., 2011), using the Aspera connect. The recommended pipeline

(Figure 3) setup can be obtained from the GitHub repository (https://github.com/).

Figure 3 The flowchart of RNA-Seq raw data analysis pipeline. The key software used in most step were listed (italic font in brackets) after the description of steps.

2.2.2 De novo Transcriptome assembly

Raw data and cleaned data quality checks were performed using the FastQC

(http://www.bioinformatics.babraham.ac.uk/projects/fastqc) and MultiQC tool (Ewels,

11

Magnusson, Lundin, & Käller, 2016). RNA-Seq data quality control using Trim Galore! (v 0.6.4) included low-quality base calls (using default Phred score 20) trimming, adaptor auto-detection and trimming, and, the last step, short reads (sequences) filtering (the default minimum length cut- off was 20 bp).

To reduce the number of input files and decrease the computational complexity and processing time, data normalization was performed using Trinity (v2.8.6) separately from and before de novo assembly. The targeted maximum coverage for reads was 50, while the Kmer size and the maximum percent of the mean for the standard deviation of Kmer coverage across reads were both using the default value, 25 and 200, respectively. De novo assembly with Trinity (v2.2.0) was carried out in Galaxy (v2.9.1) (http://usegalaxy.org) using default parameter settings. Assemblies were assessed by basic contig N50 statistics, which specifies the length of the shortest contig that can cover 50% of the total genome length.

2.2.3 Gene expression quantification

Transcript mapping and quantifying back to the assembly were performed using the plugin in

Trinity. RNA-seq by Expectation-Maximization (RSEM) (Grabherr et al., 2011) was used to estimate the contigs abundance in an alignment-based algorithm with corrections for transcript lengths. Briefly, RESM assumes that each sequence read and read length are observed (one end for single reads and both ends for paired reads) and are generated according to a directed acyclic graph that includes parent sequence, length, start position and orientation. A Bayesian version of the expectation-maximization algorithm is used to obtain maximum likelihood estimates of the model parameters, transcript fractions and posterior mean estimates of abundances. The resulting gene expression matrices were then normalized to Transcripts Per Kilobase Million (TPM) to make them comparable across samples. The original TPM matrix generated from the Trinity 12

plugin included all gene names in each dataset. For each dataset, the expression matrix was transformed into a long format with gene names, sample names and TPM values.

2.2.4 Orthologs searching and annotation

For decreasing the complexity and processing time of computing, all 15 datasets were split into 5 taxonomy groups by class - Mammalia (including mouse and rat), Amphibia (including toad),

Actinopteri (including zebrafish and swordtail fish), Insecta (including honeybee, fruit fly and two species of paper wasp), and Chromadorea (including roundworm). Orthologs searching was performed at two levels- clustering intraclass and clustering interclass. The general process involved 1) in-parallel orthologs searching within each class; 2) filtering clustered genes from those OGGs that contained orthologous genes from all species in one class; 3) integrating filtered genes from 5 classes for interclass orthologs searching; and 4) filtering clustered genes from those

OGGs that contained orthologous genes from all classes.

TransDecoder (v 5.5.0) (Haas et al., 2013) was used to predict coding regions from transcripts, and consequently provided the amino acid sequences used in further analysis. Orthologous gene groups were clustered using OrthoMCL (S. Fischer et al., 2011). In this tool, all-versus-all Blast search with Opening Reading Frames provided the score of pairwise sequence similarity. Blast used the default setting except for E-value cut-off, which was different based on the diversity of species (or higher-level taxonomy groups). Then all matching pairs were filtered with “percent match length” (>= 50% left) and linked across or within the objects (species or class) for identifying the orthologs or in-paralogs pairs. Finally, the clustering of orthologs gene groups used

Markov Clustering (MCL) algorithms.

After the interclass orthologs searching and filtering, the cross-class OGGs were obtained, defined as cOGGs. I extracted all clustered Mammalia sequences for annotation. Annotation of Mammalia

13

proteins (using BlastP) and transcripts (using BlastX) were against the UniProt database obtained from Trinotate (v 3.2.1) (http://trinotate.github.io). The E-value cut-off for both were 10-3; BlastP only kept the most common target, while BlastX returned the top 5 hits; other parameters were set as defaults except for the parallel threads using the maximum cores. Protein sequences were also searched against a protein profile HMM database. Trinotate generated the final report. After annotation, only the most frequent gene symbol from BlastX and annotated with Gene Ontology

(GO) terms in biological processes for each cOGG was selected as the representative annotation identifier. When duplicate identifiers occurred, the second most frequent gene symbol was applied to the cOGG with the most frequency of its second-most gene symbol.

2.3 Meta-data differential expression analysis

The master local database was constructed by the integration of annotation data, meta-data about source studies, and multiple expression matrices. I used both exploratory (Figure 4) and targeted ways to investigate the similarity or the difference of LTM induced mechanisms across datasets, species and taxonomy groups.

14

Figure 4 The flowchart of two exploratory differential expression analysis methods. Both methods started from the input of multiple expression matrixes from processed datasets and ended with the GO enrichment analysis (remarked with thick line). Global comparison integrated data before differential expression analysis (highlighted with dark gray), while segregated comparison performed differential expression analysis for each study before combination (highlighted with light gray). DEGs, differentially expressed genes.

2.3.1 Exploratory differential expression analysis

Two methods of differential expression analysis were used to find learning-related cOGGs, herein referred to as the global and combined methods.

Global comparison finding differential expressed orthologous gene groups

I simplified the expression data by reconstructing the treatments into binary form (Learning and

Control) and averaged all the TPM values from the same cOGG for each study and treatment form.

The samples defined as Learning included all the treatments that the animals have experienced learning stimulus; for the temporal transcriptome profiles, all post-stimuli timepoints except the immediate sacrificed were grouped in the Learning treatment. The expression level, represented as the fold-change of learning against control, was calculated to show the up- or down-regulated cOGGs expression changes. Finally, two-step clustering analysis

(https://www.ibm.com/support/knowledgecenter/SSLVMB_24.0.0/spss/base/idh_twostep_main.

15

html), based on the likelihood distance measure, and the means procedure in SPSS were used to group cOGGs by expression level (continuous) and class to which the species of the dataset belongs (categorical), sorted into 3 clusters.

The GO enrichment analysis and visualization were performed on only the one cluster (cluster_2) containing all five classes of species. Expression levels were subsequently filtered to remove those with average absolute fold-change less than 2. The cOGGs that contained at least one filtered expression level from all datasets were defined as differentially expressed cOGGs (DEcOGGs).

The GO enrichment analysis in the biological processes was filtered by P-value (adjusted

Benjamini-Hochberg) using a 0.1 significance threshold. Then visualization was performed with the R package called “goSTAG”(Bennett & Bushel, 2017), using the following parameters: both enriched GO terms and the taxonomy groups, class of species clustering used “Euclidean” distance method, “complete” clustering method, and 0.5 distance threshold.

The combination from segregated comparison finding differential expressed genes across datasets

In the combination method, I used a R package called “edgeR” (McCarthy, Chen, & Smyth, 2012;

Robinson, McCarthy, & Smyth, 2010) to identify the differential expressed genes (DEGs) in each dataset, except for the zebrafish dataset because of no biological replicate. After making pairwise comparisons of each learning treatment against control, only the learning-vs-control contrast pair that provided the most DEGs was used. DEGs with the logarithm of fold-change to base 2 equal to 0 and false discovery rate adjusted p-value > 0.05 were filtered out. Annotation to these DEGs proceeded by using the representative cOGG annotation identifiers for GO enrichment analysis, thereby finding the shared biological processes across different levels.

16

2.3.2 Targeted differential expression analysis

The targeted learning associated gene list from mammalian studies was obtained from a previous study (Lewis et al., 2020) and used for targeted differential expression analysis. I retrieved cOGGs by the gene symbol through the learning gene list. Average cOGG expression levels were calculated using the same method used for global differential expression analysis. Heatmaps of the expression levels were constructed with “Euclidean” distance method, “average” clustering method for both learning-related genes and datasets using an R package, “pheatmap”

(https://cran.r-project.org/web/packages/pheatmap/index.html).

3 Results

3.1 Variance across datasets using standardized processes

All 15 RNA-Seq raw datasets (Table 2.1) were processed separately for de novo assembly before performing gene expression analysis. Detailed assembly information of the 15 datasets is provided in (Table 3.1). Due to the sampling inconsistency across study designs, the total number of base pairs included in each assembly showed more than 40 times difference in coverage, ranging from

18.35 to 732.02 mega base pairs. There was up to 76 times difference in the number of transcripts that comprised the N50 statistic over all 391 base pairs across datasets, ranging from 18,646 transcripts (roundworm) to 1,409,479 transcripts (toad). The mouse_2 dataset contained the largest number of base pairs (~732 Mb) while the toad dataset contained the largest number of transcript pairs (~1,410 k).

The roundworm dataset profiles had the smallest library size. One reason might be that one of roundworm’s most significant advantages is its simplicity in genome organization, the length of

17

which is only about 100 M base pairs (Hodgkin, Plasterk, & Waterston, 1995). Thus, to meet the desired depth of sequencing, it would require fewer base pairs than mammals. In the source study

(Camacho et al., 2018), researchers still obtained ~350 million reads in total from 12 samples.

Another reason was the purpose of the experiment. The source study aimed to compare the gene expression patterns under various conditions by mapping reads to the reference genome; thus, short single reads (50-bp) were required in quantifying the coding transcriptome of roundworm in order to minimize reading across splice junctions. However, in the de novo assembly-based processing pipeline, these short reads would hardly be assembled to high-quality transcripts passing the quality control, resulting in a few transcripts. Similarly, a few outputs of assembled transcripts in the mouse_1 dataset resulted from the short single-read sequencing design (Bero et al., 2014).

Intraclass orthologs searching used different restrictions (E-value threshold) in each class based on the evolutionary diversity across species in each class. In Actinopteri and Mammalia, a stringent

E-value (1 x 10-5) was applied in the alignment of sequences from closely related species (such as mouse and rat, zebrafish and swordtail fish). Considering the diversity of species in Insecta, a less restrictive E-value (1 x 10-3) was applied in orthologs searching by clustering 84% of input sequences on average. Most datasets had up to 70% genes clustered, expect for mouse_2 and mouse_3 (68% and 61% respectively). These two studies used the single-cell (dentate gyrus) transcriptome profiles rather than using tissues (regions of the brain or whole brain). Single-cell transcriptome profiling is good for investigating the exact activated mechanisms underlying learning, avoiding the noise of gene expression from other brain cells responding to uninterested biological processes. Compared to brain transcriptome profiles in orthologs searching, single-cell transcriptome datasets provide less orthologous genes clustered, but those genes might be specific to learning and memory.

18

Table 3.1 Summary of transcriptome data after de novo assembly Length of Total of Total number Clustered Clustered Total base Class Dataset N50 contig Samples transcripts of predicted intraclass interclass pairs (Mb) (Kb) (k) proteins (k) (k) (k) Mammalia mouse_1 22.37 0.99 8 33.1 12.6 10.6 6.9 mouse_2 732.02 1.03 36 1,038.7 130.9 89.3 51.2 mouse_3 461.68 0.53 38 966.6 120.8 74.0 27.0 rat_1 84.27 1.12 8 124.8 25.5 21.6 15.4 rat_2 503.49 1.97 17 584.9 165.6 130.9 76.3 Amphibia toad 663.52 0.53 12 1,409.5 150.1 150.1 130.2 Actinopteri swordtail 177.58 1.08 12 249.6 73.8 52.9 33.9 zebrafish 410.32 1.10 2 590.7 91.1 71.0 45.9 Insecta wasp_1 303.51 2.95 17 259.8 115.0 106.2 48.7 wasp_2 315.25 3.16 16 249.6 122.0 113.2 51.7 bee 251.18 0.98 39 375.3 111.8 86.1 23.3 fly_1 135.88 1.28 176 176.0 45.0 37.1 14.4 fly_2 234.52 0.43 64 583.5 56.3 42.1 4.6 fly_3 170.85 2.39 24 148.0 60.4 52.5 21.4 Chromadorea roundworm 18.35 1.75 12 18.6 11.2 11.2 7.3 * The counts of clustered proteins for each species by different ranges were represented as “Clustered intraclass” and “Clustered interclass” (Figure A. 1) respectively.

19

3.2 Interclass searching for cross-class orthologous gene groups

Interclass searching across the 5 classes generated 38,595 OGGs from the 558,179 (94% of total

593,804 input protein sequences) clustered sequences, among which 3,019 cOGGs were identified.

The class Insecta was involved in 2,166 (72% of total) cOGGs. However, the clustered sequences from Mammalia, which were involved in the second largest proportion of cOGGs were used as representatives for subsequent annotation (53,253 OGGs). Mammals are widely studied, and a well-developed and high-quality annotation database is accessible.

The level-1 GO annotation of Mammalia sequences (Figure 5) provided a general idea of potential functions of orthologous genes from the cross-class cOGGs. Across all levels of GO terms in mammalian sequences annotation, there were 8 molecular functions, 20 biological processes, and

14 cellular components enriched. Most GO-annotated mammalian sequences were clustered to metabolic and cellular processes.

20

Figure 5 Distribution of top 8 Level-1 GO term functional classification from all Mammalia sequence annotations. X-axis, counts of gene symbols from annotated mammalian sequences. Y-axis, GO terms and their accession numbers, ordered by three main categories- cellular component (CC), biological process (BP), and molecular function (MF).

3.3 Differential gene expression analysis

3.3.1 Two-step clustering testing the dependency between expression levels and the class of

species

Three clusters generated by two-step clustering were similar in size but distinct in composition

(Figure 6). Cluster_1 and cluster_3 were comprised of expression levels from only one class of species, Insecta and Mammalia respectively. On the other hand, cluster_2 contained expression levels from all five classes though Insecta and Mammalia only took tiny portions in cluster_2

(Table A. 1).

21

Figure 6 The size and components of three clusters from two-step clustering. A) The pie chart shows the percentages of each cluster in total expression levels. B) The heatmap shows the component of clusters by class. Cell colour, from blue to red and the numbers shown correspond to the number of expression levels of datasets clustered in each cluster grouped by class.

Exploration of cluster_2 by costumed differential expression analysis provided 165 DEcOGGs related to learning. Most of DEcOGGs were class-specific, even dataset-specific, whereas only four cOGGs found more than one filtered expression levels (fold-change >2) (Table 3.2).

Only in the DEcOGG (represented by its gene symbol) Myo15, Amphibia and Actinopteri shared nearly the same expression level, while the expression levels in the other three DEcOGGs varied between classes (or datasets). In the DEcOGG Nmna1, Actinopteri and Mammalia expressed with different directions of regulation and the average fold-change found in mouse_3 was 9 times larger than that found in the swordtail data. While the direction was that same for the Ubp30 and Rcaf1, the respective sizes of the average fold-change in the mouse_3 dataset were approximately 64 times and 10 times that found in the swordtail dataset.

22

Other DEcOGGs were mainly from Mammalia (72 DEcOGGs), whereas 53 from Actinopteri, 21 from Chromadorea, 10 from Amphibia, and 4 from Insecta.

Table 3.2 DEcOGGs that included more than one filtered expression level (fold-change relative to control) Gene Symbol Dataset Class Fold-change Myo15 toad Amphibia -2.10 zebrafish Actinopteri -2.00 Nmna1 swordtail Actinopteri -2.08 mouse_3 Mammalia 18.13 Ubp30 swordtail Actinopteri 2.12 mouse_3 Mammalia 128.00 Rcaf1 swordtail Actinopteri 2.35 mouse_3 Mammalia 21.59

3.3.2 The most significant differential expressed contrast for each dataset

Similar to the results from orthologs searching, the results of differential expressed analysis were varied across datasets (Figure 7). Most datasets did not have many DEGs, even using the most different contrast pair (Table A. 2). Consistent with the study design in using single-cell transcriptome profiles, the mouse_2 dataset provided a significantly greater number of DEGs.

From the source study, another reason for it could be the application of the Translating Ribosome

Affinity Purification RNA sequencing (TRAP-seq), which could detect both greater numbers of

DEGs and greater magnitudes of differential expression levels than only traditional RNA-Seq.

23

Figure 7 Differential expression analysis results for each dataset. The count of up-regulated was shown in black bars, while the down-regulated was in white bars. The bottom colour bars showed 5 class of species.

Combining results from all datasets and grouping them by cOGG, I obtained 744 cOGGs from

1119 DEGs. Unlike the few amounts of DEcOGGs that contained more than one class of species, combined differential expression analysis provided more cOGGs which contained DEGs from more than one class of species (Figure 8).

Most shared cOGGs that contained DEGs from more than one class of species contained the DEGs provided from Mammalia, Actinopteri and Insecta. None of thirty-one shared cOGGs included

DEGs from Amphibia; similarly, only one shared cOGG contained one DEG from Chromadorea.

This could be explained by the few counts of DEGs detected in Amphibia and Chromadorea. The cOGG (represented by its gene symbol), Grp75, contained the most DEGs, which were differentially expressed in four classes, mostly from Insecta.

24

3 0 1 6 1 Grp75 6 5 0 0 1 0 Cam2b 1 0 0 5 0 Enpl 5 2 0 4 0 0 Macf1 4 0 1 1 0 Scrb1 1 0 0 4 0 Cryab 4 0 0 1 4 0 Est5a 0 0 3 2 0 Flnb 3 3 0 2 0 0 M4k4 1 0 3 1 0 Myh7b 4 0 1 0 0 Ndrg3 2 3 0 0 1 0 Brd3 2 0 2 0 0 Clk3 1 2 0 0 2 0 Dnjb5 2 0 2 0 0 If4g1 3 0 1 0 0 Ki13b 0 3 0 0 1 0 Odo1 2 0 2 0 0 Peli1 2 0 0 1 0 Arp3 2 0 0 1 0 At1a3 2 0 0 1 0 Chd4 0 0 1 2 0 Hs74l 2 0 0 1 0 If4a2 2 0 1 0 0 Kinh 2 0 1 0 0 P85b 2 0 1 0 0 Pde4d 2 0 0 1 0 Rrp5 2 0 0 1 0 Stb5l 2 0 0 1 0 Tba4a 0 0 1 2 0 Ubp40 1 0 0 2 0 Vpp3

Insecta Amphibia Mammalia Actinopteri Chromadorea

Figure 8 Heatmap of the count of DEGs from all datasets grouping by the class of species and cOGG. X- axis, class of species. Y-axis, thirty-one cOGGs, represented by their gene symbols, that contained DEGs from more than one class of species. The order of Y-axis was sorted by the decreasing count of total DEGs in each cOGG.

Most cOGGs that contained at least one DEG were class-specific, consistent with the results of global differential expression analysis. Although most DEGs were grouped in the cOGGs, in Stim2, all DEGs were from one class- Mammalia. There were 655 cOGGs (88% of total) that were exclusively differentially expressed in one class of species. Furthermore, among these class- specific cOGGs, more than 80% cOGGs only included one DEG.

3.3.3 Gene Ontology enrichment analysis

Consistent with the differential expression analysis, GO enrichment analysis was mainly class- specific by both the global and combined methods (Figure 9). Although GO enrichment analysis

25

was performed on a higher-level functional clustering, most of GO term clusters were still class specific.

For global method (Figure 9 A), using the mouse reference GO annotations, I identified 9 of 22

DEcOGGs from Mammalia, 1 of 4 DEcOGGs from Amphibia, 14 of 35 DEcOGGs from

Actinopteri, 2 of 3 DEcOGGs from Insecta, and 7 of 8 DEcOGGs from Chromadorea mapping to all-level GO terms under biological processes. The GO enrichment analysis provided 6 of 24 clusters from these mapped DEcOGGs, which contained more than 10 GO terms. Six GO clusters were class-specific for three classes- Actinopteri, Chromadorea, and Insecta, and related to regulation in structural changes of neurons and cellular processes (Table 3.3).

26

A B Insecta Amphibia Mammalia Actinopteri Chromadorea Insecta Amphibia Mammalia Actinopteri Chromadorea

posttranslational protein targeting to endoplasmic reticulum membrane

coenzyme A biosynthetic process

positive regulation of proteasomal ubiquitin−dependent pro...

cAMP biosynthetic process negative regulation of cyclin−dependent protein serine/thre...

UMP salvage

positive regulation of transcription by RNA polymerase II

negative regulation of translation negative regulation of proteasomal ubiquitin−dependent pr...

negative regulation of cyclin−dependent protein serine/threonine kinase activity

fatty acid beta−oxidation positive regulation of transcription elongation from RNA polymerase II promoter

ubiquinone biosynthetic process negative regulation of transcription by RNA polymerase II

negative regulation of transcription, DNA−templated GPI anchor biosynthetic process negative regulation of neuron projection development

positive regulation of cyclin−dependent protein serine/threonine kinase activity UMP salvage

Global DEcOGG analysis: DEcOGGs from Cluster_2 Segregated DEG analysis: combination of DEGs

Figure 9 The heatmaps of GO enrichment analysis for two differential expression analysis methods. A) GO enrichment of DEcOGGs from cluster_2 by global differential expressed DEcOGGs analysis, including 24 clusters of enriched GO terms. B) GO enrichment of DEGs from all datasets by combined DEG analysis, including 65 clusters of enriched GO terms. X-axis, class of species (Figure A. 3; Figure A. 4). Y-axis, clusters of enriched GO terms from biological process category, the representative GO term from the most of paths to the root of the subtree. The clusters with label contained at least 10 GO terms. The redder a cell was in the heatmap, the more significant the enrichment of the GO terms was. Gray cells indicate that no DEcOGG or DEG was found under these GO terms.

27

Table 3.3 DEcOGGs and the source class enriching in 6 labelled GO clusters

Category GO Description Class DEcOGGs

Cellular metabolic GO:0044206 UMP salvage Actinopteri Csad, Fkbp4, Vmp1, process Egr3, Vamp2, Isca1, Uck2

Structural changes GO:0010977 Negative regulation of neuron projection Actinopteri Fkbp4, Egr3, Vamp2, development Uck2

Cellular catabolic GO:0006635 Fatty acid beta-oxidation Chromadorea Acox1, Ncam1, Gipc1, process Ubb

Cellular catabolic GO:0032435 Negative regulation of proteasomal Chromadorea Acox1, Ncam1, Syt4, process ubiquitin-dependent protein catabolic Mpp5, Gipc1, Ubb process

Structural changes GO:0045736 Negative regulation of cyclin-dependent Insecta Fas, Plk1 protein serine/threonine kinase activity

Cellular catabolic GO:0032436 Positive regulation of proteasomal Insecta Fas, Plk1 process ubiquitin-dependent protein catabolic process

28

For the combined method (Figure 9 B), with more input DEGs, the number of GO clusters increased to 65, thirteen of which contained more than ten biological processes related to GO terms.

Intriguingly, although most of GO terms in one cluster were class-specific, all 13 labelled clusters were shared across more than two classes of Insecta, Mammalia, and Actinopteri. These shared

GO clusters revealed potentially shared mechanisms in structural changes, cellular biosynthetic process, cellular metabolic process, signalling, transcription, and translation (Table 3.4).

29

Table 3.4 Thirteen labelled GO clusters shared across source classes

Category GO Description Class Structural GO:0045737 Positive regulation of cyclin-dependent protein serine/threonine Insecta, Mammalia, Actinopteri changes kinase activity Structural GO:0045736 Negative regulation of cyclin-dependent protein serine/threonine Insecta, Mammalia, Actinopteri changes kinase activity Cellular GO:0006744 Ubiquinone biosynthetic process Mammalia, Actinopteri biosynthetic process Cellular GO:0044206 UMP salvage Insecta, Mammalia, Actinopteri metabolic process Cellular GO:0015937 Coenzyme A biosynthetic process Mammalia, Actinopteri metabolic process Signalling GO:0006506 GPI anchor biosynthetic process Insecta, Mammalia, Actinopteri Signalling GO:0006171 Camp biosynthetic process Insecta, Mammalia, Actinopteri Signalling GO:0006620 Posttranslational protein targeting to endoplasmic reticulum Mammalia, Actinopteri membrane Transcription GO:0045892 Negative regulation of transcription, DNA-templated Insecta, Mammalia, Actinopteri Transcription GO:0000122 Negative regulation of transcription by RNA polymerase II Insecta, Mammalia, Actinopteri Transcription GO:0032968 Positive regulation of transcription elongation from RNA Insecta, Mammalia, Actinopteri polymerase II promoter Transcription GO:0045944 Positive regulation of transcription by RNA polymerase II Insecta, Mammalia, Actinopteri Translation GO:0017148 Negative regulation of translation Insecta, Mammalia, Actinopteri

30

3.3.4 Matching with the most relevant mammalian learning-associated gene list

Here, I found 40 cOGGs annotated with common gene names in the list of the most relevant mammalian learning-associated genes. I defined these cOGGs in the list of most relevant mammalian learning-associated genes, as Learning-cOGGs. In mammalian studies, only a few

Learning-cOGGs from one dataset, mouse_3, were differentially expressed, listed in Table 3.5, while in other mammalian datasets, the expression pattern of these Learning-cOGGs did not significantly differ between the learning and control conditions. Detected genes were categorized in apoptosis, general learning-related mechanisms, and structural-related mechanisms, which indicates that these changes of gene expression are essential during learning and memory.

31

Table 3.5 The Learning-cOGGs found in Mammalian studies filtered by larger than 2 fold-change cut-offs.

Gene Name Fold-change Learning gene group Gene full name

Rrp5 5.81 Apoptotic Protein RRP5 homolog

S39a7 2.99 General learning-related Zinc transporter SLC39A7

Nac1 36.22 General learning-related Sodium/calcium exchanger 1

Nadap 2.57 General learning-related Kanadaptin

S13a5 18.80 General learning-related Solute carrier family 13 member 5

S22a5 22.50 General learning-related Solute carrier family 22 member 5

S2542 3.32 General learning-related Mitochondrial coenzyme A transporter SLC25A42

S2546 2.05 General learning-related Solute carrier family 25 member 46

Ctnnd2 2.07 Structural Catenin delta-2

Grik5 2.07 Structural Glutamate receptor ionotropic, kainate 5

Iqec2 3.59 Structural IQ motif and SEC7 domain-containing protein 2

L1cam 2.17 Structural Neural cell adhesion molecule L1

Tmod1 2.15 Structural Tropomodulin-1

Ulk1 4.28 Structural Serine/threonine-protein kinase ULK1

32

In the exploratory global analysis, I found that Actinopteri, Amphibia, and Chromadorea showed similar expression patterns in terms of the fold-change values because these three groups only clustered into cluster_2 (Figure 9 B). Thus, I compared the Learning-cOGGs expression patterns in those three classes and found that species from Actinopteri clustered together, while the other two classes separated into two different clusters. However, diverse expression patterns were found for some Learning-cOGGs between two species in Actinopteri (Figure 10).

33

Class Class 3 Sc6a6 Actinopteri Kpcg 3.48 Argi2 2 Amphibia Faf2 Chromadorea Ghc1 1 Pdc6i Grik5 Learning gene group Ctnnd2 0 Nac1 Apoptotic Arhgf Ephaa −1 General learning−related Ulk1 Guidance S36a4 −2 S2546 IEG Baip3 2.17 S2542 −3 Structural L1cam Ctnb1 Mftc Cmc1 S4a7 S35b2 S26a6 Tmod1 2.2 Znt1 Kcc2d Iqec2 Slit3 S35a2 Nadap Arhg7 Rrp5 Snp25 S22a5 S13a5 S39a7 Kkcc1 −2.38 Egr3 Scmc3 Mcat Learning gene groupLearning gene swordtail zebrafish roundworm toad

Figure 10 The heatmap of the differential expression level of 40 Learning-cOGGs in Amphibia, Actinopteri, and Chromadorea studies. X-axis, dataset names, labelled with the class of species; Y-axis, forty Learning- cOGGs represented as annotated gene symbols. The genes in Apoptotic group are involved in programmed cell death; the genes in General learning-related group are relevant to mammalian learning but do not fit into other categories; the genes in Guidance group are involved in axon guidance; the genes in immediate early gene (IEG) group are categorized as immediate-early genes; the genes in Structural group are involved in cytoskeletal modification. Red cells with number are differentially upregulated, while dark gray cells with numbers are differentially downregulated (fold-change of averaged expression value under learning conditions against control larger than 2).

34

4 Discussion

In this study, I explored the hypothesis on learning-related mechanisms that shared mechanisms underlie the general process of memory consolidation across distant species. Core mechanisms would be shared because all species must undergo a general process by which short-term memory transforms into long-term memory. However, different learning-induced mechanisms results from various aspects, such as the effect of aging and the types of memory. In this study, I found shared mechanisms across Insecta, Mammalia, Actinopteri, including changes in the structure of brain regions, signalling pathways, and in the transcription, translation and metabolic processes.

Intriguingly, specific learning-related mechanisms were mostly explained by diverse learning paradigms in this study. Not surprisingly, similar expression patterns were found for the most relevant mammalian learning-associated genes in this study, compared with the results in Lewis et al. (2020) study.

4.1 Orthologs searching across distant species and meta-data database

construction

The sequences and expression values associated with the 3,019 cOGGs set the foundation of differential expression analysis. Ortholog searching identified genes mainly from cellular and metabolic processes, which agrees with the previous finding that most consensus pathways across species are related to cellular and metabolic processes (Jiménez, Mitchell, & Sgouros, 2003).

The expression patterns are mostly similar in OGGs in a dataset that can be grouped. Through a microarray study, Grigoryev et al. (2004) showed that expression pattern and biologically significant fold-change in the gene-expression level were both similar in genes from the same

OGGs. 35

4.2 Shared pathways underlying memory consolidation

Shared pathways across five classes might indicate some shared mechanisms in response to learning and memory, particularly in memory consolidation. These mechanisms are related to cell cycling of neurons, neural connection, transcription and translational regulations and general cellular mechanisms. Some of the shared pathways identified in the differential expression analysis have been implicated in the learning process in the literature.

Among the four DEcOGGs (Myo15, Nmna1, Ubp30, Rcaf1) found to include more than one filtered expression level, two (Nmna1, Ubp30) have been previously linked to learning and memory in control of cell cycling of neurons. Increased Nmna1 has been found to reduce neuronal apoptosis functioning as a neuroprotective role in rodent (Rossi et al., 2018; Yang, Liu, & Chu,

2020; Zhao et al., 2011). The average fold-change found in mouse_3 was higher in the learning group, compared to the controls, which supports the theory that Nmna1 facilitates neuroprotection in learning. However, smaller average fold-change and decreased expression pattern that I found in swordtail has not been discussed in the literature.

Ubp30 has been described in flies and human cells with a negative regulation of mitophagy such that reduced Ubp30 activity is associated with enhanced mitochondrial degradation in neurons

(Bingol et al., 2014; Fivenson et al., 2017; Han, Wang, Kang, & Fang, 2020; Lou et al., 2020).

However, I found that the average fold-changes in both swordtail and mouse_3 datasets were higher in the learning groups compared to the controls, contrary to the theory. This finding might indicate that during learning, in particular memory consolidation, Ubp30 might be activated to avoid the neurodegeneration by inhibition of degrading mitochondria.

From the combined method, the identified DEGs were clustered according to their GO functional annotation under the biological process category. The cellular signalling related GO clusters that

36

I identified in this study, regulating the neuron connections, includes biosynthetic process of the glycosylphosphatidylinositol (GPI) anchor protein, biosynthetic process of the cyclic AMP

(cAMP), and the process of posttranslational protein targeting to endoplasmic reticulum membrane.

Several enriched GO terms found in the Mammalian group agree with the important role of GPI anchor proteins related to neuronal development and synapse regulation by mediate interactions between neurons and between neurons and other cells in the mouse nervous system (Mizuno et al.,

2007; Tan, Leshchyns'ka, & Sytnyk, 2017). Several studies suggest cAMP signalling within neurons plays an important role in the formation of long-term memories in the hippocampus (Eric

R Kandel, 2012; Lee, 2015; Ma, Abel, & Hernandez, 2009). Interestingly, no study has revealed the involvement of post-translational proteins targeting to endoplasmic reticulum membrane in learning and memory.

I see the sharing clusters of enriched GO terms related to the regulation at the transcription level.

The essential role of transcriptional regulation includes the generation of not only mRNAs that function in protein synthesis, but also noncoding RNA that affects gene expression (Alberini &

Kandel, 2015). Also, I see the enriched biological processes related to translational regulation.

Translation controlling, particularly around the synapse, participates in the regulation of synaptic plasticity and memory consolidation by synaptic transmission (Richter & Klann, 2009).

From the combined method in differential expression analysis, the functional enrichment analysis highlights shared pathways in the regulation of cyclin-dependent protein serine/threonine kinase activity (CDKs). The CDK family facilities the regulation of cell-cycle in mammals

(Satyanarayana & Kaldis, 2009). Among them, Cdk5 was identified to play an important role in learning and memory by the involvement of the regulation related to neurotransmission and synaptic plasticity (Cheung, Fu, & Ip, 2006; A. Fischer, Sananbenesi, Pang, Lu, & Tsai, 2005).

37

4.3 Validation of the expression patterns of learning-associated genes in

mammals

The expression patterns of differentially expressed mammalian learning-related genes in response to contextual memory consolidation in the mouse_3 dataset are consistent with Lewis et al. (2020) in the aspect of direction. All 14 differentially expressed Learning-cOGGs were up-regulated same as the dierection of those in Lewis’s results. Among them, the only gene that might involve in the regulation of apoptosis found up-regulated in a mammalian dataset (mouse_3), which has been found with the involvement in pre-ribosome assembly (Turowski & Tollervey, 2015). By increasing the number of ribosomes, the efficiency of transcription and translation would also be increased in neurons. The higher expressions of structure-related genes in the learning groups compared to the controls, which supports the theory that during memory consolidation, neurons change the cell structures to enhance connections in the neuron system (Cavallaro, Schreurs, Zhao,

D'Agata, & Alkon, 2001).

However, I could see the absence of many known mammalian learning-associated genes that should have been detected to be activated in Learning conditions in mammalian datasets in this study, compared to the list of all mammalian learning-associated genes in Lewis’s study. The reason for the absence might be the averaging of expression values across multiple treatments or the filtering of orthologs searching. In fact, the averaging may also explain the absence of

Learning-cOGGs detected from other mammalian datasets.

4.4 Potential mechanisms underlying specific training paradigms

Although I see most DEcOGGs and DEGs differentially expressed in specifically classes, the dominant effect might be the learning paradigms rather than the phytogenic relationship.

38

For example, in Table 3.3, I see Actinopteri specific GO clusters- UMP salvage and negative regulation of neuron projection development. Among the DEcOGGs clustered in these two GO clusters, Csad was detected to be differentially expressed in the swordtail dataset that tested a type of cognitive memory. The expression change of Csad in a cognitive conditioning study was also detected in rats (Jung et al., 2019).

Detected from the same swordtail dataset, some DEcOGGs might be related to the early-life memory, when a learning stimulus performed on young swordtail fish and the behavior changes happened in their adulthood. For example, the early-life memory response gene, Fkbp4 was detected to be differentially expressed. Similar result was found in a human study revealing the effect of childhood stressed memory in increasing the risk of developing stress-related behavioural disorders in adulthood (Klengel et al., 2013). Also, the detection of DEcOGG, Uck2 in this swordtail dataset might be due to the childhood learning stimulus, which is supported by the detection of differentially expressed Uck2 in a rat study related to early-life memory (Thiriet et al.,

2008).

Learning from the environment for the roundworm might cause the expression change of Ubb, which was identified with differential expression pattern under a running-related environmental enrichment condition (Grégoire et al., 2018).

However, detected from the same roundworm dataset, Syt4, related to neurotransmission or ion conductance, might specific to response to chronic-stress learning condition that the ancestor was treated with a long-term chemical exposure. This kind of chronic-stress learning condition would finally result in passive avoidance behaviour changes and expression changes of Syt4 , which was also identified in rodents (O’Sullivan et al., 2007; J. Wang et al., 2019).

39

4.5 Integration of meta-analysis and comparative transcriptomic analysis

The workflow of processing RNA-Seq raw data in this study is consistent with the widely-used non-model organisms transcriptome profiling processes so that reference genome would be not needed for both model and non-model animals (Berens et al., 2017; Lewis et al., 2020; Ponniah,

Thimmapuram, Bhide, Kalavacharla, & Manoharan, 2017; X.-W. Wang et al., 2011).

Standardization and normalization were both treated carefully to make datasets from different studies and different species comparable.

Unlike traditional meta-analysis of transcriptome data, I reanalyzed the public transcriptome profiles related to learning and memory in order to enforce a standardized processing workflow starting from the RNA-Seq raw data. Without standardization, observed differences due to data processing may be falsely attributed to learning mechanisms. By performing de novo assembly to all datasets, the comparative transcriptomics analysis here could avoid the variance in alignment resulting from different selections of reference genomes, and broaden the range of usable species for comparison. For example, in three mouse datasets, two different reference mouse genomes were used, mouse mm9 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.18/) and mouse mm10 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.20/). Besides, here I used a novel annotation method whereby OGGs were annotated by rat and mouse. This is beneficial for the downstream analysis, such as GO annotation and enrichment. During this downstream analysis, the consistency of annotated species in the selection of background database of a species mostly provided with model species, could avoid missed or mistaken matches between input queries and mapping subjects. The occurrence of mismatch errors might due to the inconsistent matching between function and gene identifier across species.

40

The normalization step was essential for the differential expression analysis. Study specific sequencing depth and class specific expression baselines can both affects the scales of quantification of reads aligning to genes. I applied two exploratory methods in differential expression analysis to investigate the expression patterns in response to learning conditions by study (dataset).

4.6 Performance comparison of two methods in differential expression

analysis

I have explored two integration methods in differential expression analysis. The global method averaged expression values on a per-cOGG and per-dataset basis; multiple treatments were also averaged when classified into a general learning group, or a general control group. This is a simplified way to find the most learning-related gene sets.

However, it was too general in simply averaging across treatments into just Learning and Control.

Averaging expression patterns would cover up the significant difference of some gene expression changes in specific treatment conditions. For example, in the temporal study of toad, time- depended genes, such as immediate early genes, were significantly up-regulated soon after the last learning stimulus and tapered to baseline (Lewis et al., 2020). In addition, the statistical significance of expression pattern changes is hard to calculate in this general method due to the inter-dataset variability. Another two examples of the absence of previously detected mammalian learning-associated genes also results from the covering of expression changes for specific condtions by general patterns. One is the transmembrane protein VMP1, which was only detected differentially expressed in the dataset from the Actinopteri group. VMP1 has been previously found to play an important role in the nucleation step in mammals, but in this study was not

41

identified to be differentially expressed in any dataset in the Mammalia group (Nobili, Cavallucci,

& D’Amelio, 2016). Another example is Egr3, functioning as a transcriptional regulator, which is essential in long-term potentiation (LTP) and in hippocampal and amygdala dependent learning and memory; however, it was not detected to be widely shared across classes (Lin Li et al., 2007).

Although batch effects correction, in which statistical testing was performed for cOGGs expression patterns to identify the real signals against a complex and noise-existing background

(Goh, Wang, & Wong, 2017), was tried in the global method to calculate the statistical significance of changes in expression regulated by learning conditions, the expression levels (fold-change relative to control) were still clustered separately only by treatments, indicating that the expression levels were affected by other factors (Figure A. 2).

Segregated differential expression analysis was based on traditional workflow- analyze individual datasets separately and combined the results for meta-analysis. This method is more suitable to detect individual genes and each contrast, resulting in more DEGs in total than the global way.

Batch effects fixed by the R package “edgeR” exist only across biological replicates of each treatment and each dataset. However, it is still biased because I only selected one contrast that provided the most DEGs. However, regulated by learning, genes might change expression patterns under different conditions, such as time after stimulus. Especially in a temporal study (Lewis et al., 2020), some genes related to structural modification early (2-4h after learning) in actin cytoskeletal organization, axon guidance, adhesion, and synapse assembly, while later (24 h after learning), genes related to neurite genesis and apoptosis show expression changes in response to learning.

42

4.7 Methodological limitation

Although learning and memory have been an important research field in neuroscience, not much public RNA-Seq raw data can be obtained from open-access databases. This limitation in availability of datasets resulted in a heterogeneous set of datasets being included in the meta- analysis. For example, some studies used single-cell transcriptome vs whole brain transcriptome or obtained RNA-Seq data vs a combination of RNA-Seq and TRAP-Seq data. Due to the small sample size, these factors, as well as type of learning or other design factors (e.g., the age of animals when they were treated with stimulus) were not accounted for in the analysis.

Besides, other factors during learning did not take into account, such as the types of learning, which might also lead to different mechanisms. Limited learning related studies of multiple training paradigms available for some species, such as Bombina orientalis, Polistes fuscatus, and

Polistes metricus, resulted in deficient test for other learning-associated factors, such as the age of animals learning.

In this study, the hypothesis was that every animal would share the core mechanisms in learning and memory, which indicates that it is impossible to find a control of species that cannot learn or memory. This study could only provide some targets which might be related to learning and memory based on the expression data analysis. The further study could provide the solid proof for the target mechanisms detected in this study by inhibiting any of them to see if the memory consolidation destroyed.

5 Conclusion

This study firstly constructed a meta-transcriptome database across five classes of species

(Mammalia, Amphibia, Actinopteri, Insecta and Chromadorea) in investigating mechanisms

43

related to learning and memory. Standardized processing workflow and normalization from two methods in differential expression analysis enabled a meta RNA-Seq comparative analysis across distant species. The combination of intra-class and inter-class orthologs searching identified 3,019 cOGGs (based on homology relationships) from all five classes, mainly related to cellular and metabolic processes. The set of 3019 cOGGs along with their associated sequences and expression values, provides a rich dataset for studying cross-species learning and memory mechanisms. This meta-dataset and the framework established through this thesis facilitate comparisons at any of the inter-class, inter-species or inter-study (dataset) levels. I found not only shared mechanisms across classes related to memory consolidation but also some memory-type specific pathways.

With this framework established, further study can be easily augmented by the addition of available datasets. Future work could better explore the effect of aging in learning, standardization choices, and time after stimulus. Finally, while in this study the interest lay in finding commonalities across species, understanding the differences is just as interesting and this framework allows one to do that too.

44

REFERENCES

Abel, T., Martin, K. C., Bartsch, D., & Kandel, E. R. (1998). Memory suppressor genes: inhibitory constraints on the storage of long-term memory. Science, 279(5349), 338-341.

Alberini, C. M., & Kandel, E. R. (2015). The regulation of transcription in memory consolidation. Cold Spring Harbor perspectives in biology, 7(1), a021741.

Barco, A., & Marie, H. (2011). Genetic approaches to investigate the role of CREB in neuronal plasticity and memory. Molecular neurobiology, 44(3), 330-349.

Bennett, B. D., & Bushel, P. R. (2017). goSTAG: gene ontology subtrees to tag and annotate genes within a set. Source Code for Biology and Medicine, 12(1), 6. doi:10.1186/s13029-017-0066-1

Berens, A. J., Tibbetts, E. A., & Toth, A. L. (2017). Cognitive specialization for learning faces is associated with shifts in the brain transcriptome of a social wasp. The Journal of Experimental Biology, 220(12), 2149- 2153. doi:10.1242/jeb.155200

Bero, A. W., Meng, J., Cho, S., Shen, A. H., Canter, R. G., Ericsson, M., & Tsai, L.-H. (2014). Early remodeling of the neocortex upon episodic memory encoding. Proceedings of the National Academy of Sciences, 111(32), 11852. doi:10.1073/pnas.1408378111

Bingol, B., Tea, J. S., Phu, L., Reichelt, M., Bakalarski, C. E., Song, Q., . . . Sheng, M. (2014). The mitochondrial deubiquitinase USP30 opposes parkin-mediated mitophagy. Nature, 510(7505), 370-375.

Bourtchuladze, R., Frenguelli, B., Blendy, J., Cioffi, D., Schutz, G., & Silva, A. J. (1994). Deficient long-term memory in mice with a targeted mutation of the cAMP-responsive element-binding protein. Cell, 79(1), 59- 68. doi:10.1016/0092-8674(94)90400-6

Buti, M., Baldoni, E., Formentin, E., Milc, J., Frugis, G., Lo Schiavo, F., . . . Francia, E. (2019). A Meta-Analysis of Comparative Transcriptomic Data Reveals a Set of Key Genes Involved in the Tolerance to Abiotic Stresses in Rice. International journal of molecular sciences, 20(22), 5662. doi:10.3390/ijms20225662

Camacho, J., Truong, L., Kurt, Z., Chen, Y.-W., Morselli, M., Gutierrez, G., . . . Allard, P. (2018). The Memory of Environmental Chemical Exposure in C. elegans Is Dependent on the Jumonji Demethylases jmjd-2 and jmjd-3/utx-1. Cell reports, 23(8), 2392-2404. doi:10.1016/j.celrep.2018.04.078

Cavallaro, S., Schreurs, B. G., Zhao, W., D'Agata, V., & Alkon, D. L. (2001). Gene expression profiles during long- term memory consolidation. 13(9), 1809-1815. doi:10.1046/j.0953-816x.2001.01543.x

Chen, P. B., Kawaguchi, R., Blum, C., Achiro, J. M., Coppola, G., O'Dell, T. J., & Martin, K. C. (2017). Mapping Gene Expression in Excitatory Neurons during Hippocampal Late-Phase Long-Term Potentiation. Frontiers in molecular neuroscience, 10, 39-39. doi:10.3389/fnmol.2017.00039

Cheung, Z. H., Fu, A. K., & Ip, N. Y. (2006). Synaptic roles of Cdk5: implications in higher cognitive functions and neurodegenerative diseases. Neuron, 50(1), 13-18.

Crocker, A., Guan, X.-J., Murphy, C. T., & Murthy, M. (2016). Cell-Type-Specific Transcriptome Analysis in the Drosophila Mushroom Body Reveals Memory-Related Changes in Gene Expression. Cell reports, 15(7), 1580-1596. doi:10.1016/j.celrep.2016.04.046

45

Cui, R., Delclos, P. J., Schumer, M., & Rosenthal, G. G. (2017). Early social learning triggers neurogenomic expression changes in a swordtail fish. Proceedings. Biological sciences, 284(1854), 20170701. doi:10.1098/rspb.2017.0701

Davis, H. P., & Squire, L. R. (1984). Protein synthesis and memory: a review. Psychological bulletin, 96(3), 518.

De La Torre, V. S. G., Majorel-Loulergue, C., Gonzalez, D. A., Soubigou-Taconnat, L., Rigaill, G. J., Pillon, Y., . . . Merlot, S. (2018). Wide cross-species RNA-Seq comparison reveals a highly conserved role for Ferroportins in nickel hyperaccumulation in plants. Cold Spring Harbor Laboratory. Retrieved from https://dx.doi.org/10.1101/420729

Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047-3048. doi:10.1093/bioinformatics/btw354

Fischer, A., Sananbenesi, F., Pang, P. T., Lu, B., & Tsai, L.-H. (2005). Opposing roles of transient and prolonged expression of p25 in synaptic plasticity and hippocampus-dependent memory. Neuron, 48(5), 825-838.

Fischer, S., Brunk, B. P., Chen, F., Gao, X., Harb, O. S., Iodice, J. B., . . . Stoeckert, C. J., Jr. (2011). Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Current protocols in bioinformatics, Chapter 6, Unit-6.12.19. doi:10.1002/0471250953.bi0612s35

Fitch, W. M. (1970). Distinguishing homologous from analogous proteins. Systematic , 19(2), 99-113.

Fivenson, E. M., Lautrup, S., Sun, N., Scheibye-Knudsen, M., Stevnsner, T., Nilsen, H., . . . Fang, E. F. (2017). Mitophagy in neurodegeneration and aging. Neurochemistry international, 109, 202-209.

Gallistel, C. R., & Matzel, L. D. (2013). The neuroscience of learning: beyond the Hebbian synapse. Annual review of psychology, 64, 169-200.

Ginsburg, S., & Jablonka, E. (2010). The evolution of associative learning: A factor in the Cambrian explosion. 266(1), 11-20. doi:10.1016/j.jtbi.2010.06.017

Goh, W. W. B., Wang, W., & Wong, L. (2017). Why batch effects matter in omics data, and how to avoid them. Trends in biotechnology, 35(6), 498-507.

Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., . . . Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29(7), 644-652. doi:10.1038/nbt.1883

Grégoire, C.-A., Tobin, S., Goldenstein, B. L., Samarut, É., Leclerc, A., Aumont, A., . . . Fernandes, K. J. (2018). RNA-sequencing reveals unique transcriptional signatures of running and running-independent environmental enrichment in the adult mouse dentate gyrus. Frontiers in molecular neuroscience, 11, 126.

Grigoryev, D. N., Ma, S.-F., Irizarry, R. A., Ye, S. Q., Quackenbush, J., & Garcia, J. G. (2004). Orthologous gene- expression profiling in multi-species models: search for candidate genes. Genome biology, 5(5), R34.

Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., . . . Regev, A. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8(8), 1494-1512. doi:10.1038/nprot.2013.084

Han, Y., Wang, N., Kang, J., & Fang, Y. (2020). β-Asarone improves learning and memory in Aβ 1-42-induced Alzheimer’s disease rats by regulating PINK1-Parkin-mediated mitophagy. Metabolic Brain Disease, 1-9.

Hodgkin, J., Plasterk, R. H., & Waterston, R. H. (1995). The elegans and its genome. Science, 270(5235), 410-414. doi:10.1126/science.270.5235.410

46

Huang, V., Butler, A., & Lubin, F. (2019). Telencephalon transcriptome analysis of chronically stressed adult zebrafish. Scientific Reports, 9, 1379. doi:10.1038/s41598-018-37761-7

Jahn, H. (2013). Memory loss in Alzheimer's disease. Dialogues in clinical neuroscience, 15(4), 445-454. Retrieved from https://pubmed.ncbi.nlm.nih.gov/24459411 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3898682/

Jiménez, J. L., Mitchell, M. P., & Sgouros, J. G. (2003). Microarray analysis of orthologous genes: conservation of the translational machinery across species at the sequence and expression level. Genome biology, 4(1), R4- R4. doi:10.1186/gb-2002-4-1-r4

Jones, S. G., Nixon, K. C. J., Chubak, M. C., & Kramer, J. M. (2018). Mushroom Body Specific Transcriptome Analysis Reveals Dynamic Regulation of Learning and Memory Genes After Acquisition of Long-Term Courtship Memory in Drosophila. G3 (Bethesda, Md.), 8(11), 3433-3446. doi:10.1534/g3.118.200560

Jung, S. H., Hatcher-Solis, C., Moore, R., Bechmann, N., Harshman, S., Martin, J., & Jankord, R. (2019). Noninvasive brain stimulation enhances memory acquisition and is associated with synaptoneurosome modification in the rat hippocampus. Eneuro, 6(6).

Kaang, B.-K., Kandel, E. R., & Grant, S. G. N. (1993). Activation of cAMP-Responsive genes by stimuli that produce long-term facilitation in aplysia sensory neurons. 10(3), 427-435. doi:10.1016/0896- 6273(93)90331-k

Kandel, E. R. (2001). The Molecular Biology of Memory Storage: A Dialogue Between Genes and Synapses. Science, 294(5544), 1030-1038. doi:10.1126/science.1067020

Kandel, E. R. (2012). The molecular biology of memory: cAMP, PKA, CRE, CREB-1, CREB-2, and CPEB. Molecular brain, 5(1), 1-12.

Klengel, T., Mehta, D., Anacker, C., Rex-Haffner, M., Pruessner, J. C., Pariante, C. M., . . . Bradley, B. (2013). Allele-specific FKBP5 DNA demethylation mediates gene–childhood trauma interactions. Nature neuroscience, 16(1), 33-41.

Labar, K. S., & Cabeza, R. (2006). Cognitive neuroscience of emotional memory. Nature Reviews Neuroscience, 7(1), 54-64. doi:10.1038/nrn1825

Lee, D. (2015). Global and local missions of cAMP signaling in neural plasticity, learning, and memory. Frontiers in pharmacology, 6, 161-161. doi:10.3389/fphar.2015.00161

Leinonen, R., Akhtar, R., Birney, E., Bower, L., Cerdeno-Tárraga, A., Cheng, Y., . . . Cochrane, G. (2011). The European Nucleotide Archive. Nucleic acids research, 39(Database issue), D28-D31. doi:10.1093/nar/gkq967

Leinonen, R., Sugawara, H., Shumway, M., & International Nucleotide Sequence Database, C. (2011). The sequence read archive. Nucleic acids research, 39(Database issue), D19-D21. doi:10.1093/nar/gkq1019

Levenson, J. M. (2004). A Bioinformatics Analysis of Memory Consolidation Reveals Involvement of the Transcription Factor c-Rel. 24(16), 3933-3943. doi:10.1523/jneurosci.5646-03.2004

Lewis, V., Laberge, F., & Heyland, A. (2020). Temporal Profile of Brain Gene Expression After Prey Catching Conditioning in an Anuran Amphibian. Frontiers in neuroscience, 13, 1407-1407. doi:10.3389/fnins.2019.01407

47

Li, L., Stoeckert, C. J., & Roos, D. S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome research, 13(9), 2178-2189.

Li, L., Yun, S. H., Keblesh, J., Trommer, B. L., Xiong, H., Radulovic, J., & Tourtellotte, W. G. (2007). Egr3, a synaptic activity regulated transcription factor that is essential for learning and memory. Molecular and Cellular Neuroscience, 35(1), 76-88.

Lou, G., Palikaras, K., Lautrup, S., Scheibye-Knudsen, M., Tavernarakis, N., & Fang, E. F. (2020). Mitophagy and neuroprotection. Trends in Molecular Medicine, 26(1), 8-20.

Ma, N., Abel, T., & Hernandez, P. J. (2009). Exchange protein activated by cAMP enhances long-term memory formation independent of protein kinase A. Learning & memory (Cold Spring Harbor, N.Y.), 16(6), 367- 370. doi:10.1101/lm.1231009

Macphail, E. M. (1987). The comparative psychology of intelligence. Behavioral and Brain Sciences, 10(4), 645- 656.

MAGPHAIL, E. M., & Bolhuis, J. J. (2001). The evolution of intelligence: adaptive specializations versus general process. Biological Reviews, 76(3), 341-364.

Martin, S. J., Grimwood, P. D., & Morris, R. G. (2000). Synaptic plasticity and memory: an evaluation of the hypothesis. Annual Review of Neuroscience, 23(1), 649-711.

McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic acids research, 40(10), 4288-4297.

McNeill, M. S., Kapheim, K. M., Brockmann, A., McGill, T. A. W., & Robinson, G. E. (2016). Brain regions and molecular pathways responding to food reward type and value in honey bees. Genes, Brain and Behavior, 15(3), 305-317. doi:10.1111/gbb.12275

Mizuno, K., Antunes-Martins, A., Ris, L., Peters, M., Godaux, E., & Giese, K. (2007). Calcium/calmodulin kinase kinase β has a male-specific role in memory formation. Neuroscience, 145(2), 393-402.

Nadel, L., & Hardt, O. (2011). Update on Memory Systems and Processes. Neuropsychopharmacology, 36(1), 251- 273. doi:10.1038/npp.2010.169

Nobili, A., Cavallucci, V., & D’Amelio, M. (2016). Role of autophagy in brain sculpture: physiological and pathological implications. In Autophagy Networks in Inflammation (pp. 203-234): Springer.

O’Sullivan, N. C., McGettigan, P. A., Sheridan, G. K., Pickering, M., Conboy, L., O’Connor, J. J., . . . Murphy, K. J. (2007). Temporal change in gene expression in the rat dentate gyrus following passive avoidance learning. Journal of neurochemistry, 101(4), 1085-1098.

Panahi, B., Frahadian, M., Dums, J. T., & Hejazi, M. A. (2019). Integration of Cross Species RNA-seq Meta- Analysis and Machine-Learning Models Identifies the Most Important Salt Stress–Responsive Pathways in Microalga Dunaliella. Frontiers in Genetics, 10(752). doi:10.3389/fgene.2019.00752

Pandey, S. C., Mittal, N., & Silva, A. J. (2000). Blockade of cyclic AMP-responsive element DNA binding in the brain of CREBΔ/α mutant mice. 11(11), 2577-2579. doi:10.1097/00001756-200008030-00045

Pardo, J., Abba, M. C., Lacunza, E., Ogundele, O. M., Paiva, I., Morel, G. R., . . . Goya, R. G. (2017). IGF-I Gene Therapy in Aging Rats Modulates Hippocampal Genes Relevant to Memory Function. The Journals of Gerontology: Series A, 73(4), 459-467. doi:10.1093/gerona/glx125

48

Perazzona, B. (2004). The Role of cAMP Response Element-Binding Protein in Drosophila Long-Term Memory. 24(40), 8823-8828. doi:10.1523/jneurosci.4542-03.2004

Ponniah, S. K., Thimmapuram, J., Bhide, K., Kalavacharla, V., & Manoharan, M. (2017). Comparative analysis of the root transcriptomes of cultivated sweetpotato (Ipomoea batatas [L.] Lam) and its wild ancestor (Ipomoea trifida [Kunth] G. Don). BMC Plant Biology, 17(1). doi:10.1186/s12870-016-0950-x

Prachumwat, A., & Li, W.-H. (2008). Gene number expansion and contraction in vertebrate genomes with respect to invertebrate genomes. Genome research, 18(2), 221-232. doi:10.1101/gr.7046608

Rao-Ruiz, P., Couey, J. J., Marcelo, I. M., Bouwkamp, C. G., Slump, D. E., Matos, M. R., . . . Kushner, S. A. (2019). Engram-specific transcriptome profiling of contextual memory consolidation. Nature Communications, 10(1), 2232. doi:10.1038/s41467-019-09960-x

Rau, A., Marot, G., & Jaffrézic, F. (2014). Differential meta-analysis of RNA-seq data from multiple studies. BMC Bioinformatics, 15(1), 91. doi:10.1186/1471-2105-15-91

Richter, J. D., & Klann, E. (2009). Making synaptic plasticity and memory last: mechanisms of translational regulation. Genes & development, 23(1), 1-11.

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

Rossi, F., Geiszler, P. C., Meng, W., Barron, M. R., Prior, M., Herd-Smith, A., . . . Pardon, M.-C. (2018). NAD- biosynthetic enzyme NMNAT1 reduces early behavioral impairment in the htau mouse model of tauopathy. Behavioural brain research, 339, 140-152.

Sakai, T., Tamura, T., Kitamoto, T., & Kidokoro, Y. (2004). A clock gene, period, plays a key role in long-term memory formation in Drosophila. 101(45), 16058-16063. doi:10.1073/pnas.0401472101

Satyanarayana, A., & Kaldis, P. (2009). Mammalian cell-cycle regulation: several Cdks, numerous cyclins and diverse compensatory mechanisms. Oncogene, 28(33), 2925-2939.

Silva, A. J., Kogan, J. H., Frankland, P. W., & Kida, S. (1998). CREB AND MEMORY. Annual Review of Neuroscience, 21(1), 127-148. doi:10.1146/annurev.neuro.21.1.127

Stern, S. A., & Alberini, C. M. (2013). Mechanisms of memory enhancement. Wiley interdisciplinary reviews. Systems biology and medicine, 5(1), 37-53. doi:10.1002/wsbm.1196

Sweatt, J. D. (2010). Chapter 1 - Introduction: The Basics of Psychological Learning and Memory Theory. In J. D. Sweatt (Ed.), Mechanisms of Memory (Second Edition) (pp. 1-23). London: Academic Press.

Tan, R. P. A., Leshchyns'ka, I., & Sytnyk, V. (2017). Glycosylphosphatidylinositol-Anchored Immunoglobulin Superfamily Cell Adhesion Molecules and Their Role in Neuronal Development and Synapse Regulation. Frontiers in molecular neuroscience, 10, 378-378. doi:10.3389/fnmol.2017.00378

Thiriet, N., Amar, L., Toussay, X., Lardeux, V., Ladenheim, B., Becker, K. G., . . . Jaber, M. (2008). Environmental enrichment during adolescence regulates gene expression in the striatum of mice. Brain research, 1222, 31- 41.

Torrealba, F., Madrid, C., Contreras, M., & Gómez, K. (2017). Plasticity in the Interoceptive System. In R. von Bernhardi, J. Eugenín, & K. J. Muller (Eds.), The Plastic Brain (pp. 59-74). Cham: Springer International Publishing.

49

Turowski, T. W., & Tollervey, D. (2015). Cotranscriptional events in eukaryotic ribosome synthesis. Wiley Interdiscip Rev RNA, 6(1), 129-139. doi:10.1002/wrna.1263

Wang, J., Chen, Y., Zhang, C., Xiang, Z., Ding, J., & Han, X. (2019). Learning and memory deficits and alzheimer's disease-like changes in mice after chronic exposure to microcystin-LR. Journal of hazardous materials, 373, 504-518.

Wang, X.-W., Luan, J.-B., Li, J.-M., Su, Y.-L., Xia, J., & Liu, S.-S. (2011). Transcriptome analysis and comparison reveal divergence between two invasive whitefly cryptic species. BMC Genomics, 12(1), 458. doi:10.1186/1471-2164-12-458

Weiler, I. J., Irwin, S. A., Klintsova, A. Y., Spencer, C. M., Brazelton, A. D., Miyashiro, K., . . . Greenough, W. T. (1997). Fragile X mental retardation protein is translated near synapses in response to neurotransmitter activation. Proc Natl Acad Sci U S A, 94(10), 5395-5400. doi:10.1073/pnas.94.10.5395

Widmer, Y. F., Bilican, A., Bruggmann, R., & Sprecher, S. G. (2018). Regulators of Long-Term Memory Revealed by Mushroom Body-Specific Gene Expression Profiling in Drosophila melanogaster. Genetics, 209(4), 1167-1181. doi:10.1534/genetics.118.301106

Willeman, M. N., Mennenga, S. E., Siniard, A. L., Corneveaux, J. J., De Both, M., Hewitt, L. T., . . . Huentelman, M. J. (2018). The PKC-β selective inhibitor, Enzastaurin, impairs memory in middle-aged rats. PloS one, 13(6), e0198256-e0198256. doi:10.1371/journal.pone.0198256

Xia, M., Huang, R., Guo, V., Southall, N., Cho, M. H., Inglese, J., . . . Nirenberg, M. (2009). Identification of compounds that potentiate CREB signaling as possible enhancers of long-term memory. Proc Natl Acad Sci U S A, 106(7), 2412-2417. doi:10.1073/pnas.0813020106

Yang, Z. Y., Liu, J., & Chu, H. C. (2020). Effect of NMDAR-NMNAT1/2 pathway on neuronal cell damage and cognitive impairment of sevoflurane-induced aged rats. Neurol Res, 42(2), 108-117. doi:10.1080/01616412.2019.1710393

Yin, J. C. P., Wallach, J. S., Del Vecchio, M., Wilder, E. L., Zhou, H., Quinn, W. G., & Tully, T. (1994). Induction of a dominant negative CREB transgene specifically blocks long-term memory in Drosophila. 79(1), 49-58. doi:10.1016/0092-8674(94)90399-9

Yin, Y., Edelman, G. M., & Vanderklish, P. W. (2002). The brain-derived neurotrophic factor enhances synthesis of Arc in synaptoneurosomes. Proc Natl Acad Sci U S A, 99(4), 2368-2373. doi:10.1073/pnas.042693699

Young, R. L., Ferkin, M. H., Ockendon-Powell, N. F., Orr, V. N., Phelps, S. M., Pogány, Á., . . . Trainor, B. C. (2019). Conserved transcriptomic profiles underpin monogamy across vertebrates. Proceedings of the National Academy of Sciences, 116(4), 1331-1336.

Zhao, H., Zhang, J. Y., Yang, Z. C., Liu, M., Gang, B. Z., & Zhao, Q. J. (2011). Nicotinamide mononucleotide adenylyltransferase 1 gene NMNAT1 regulates neuronal dendrite and axon morphogenesis in vitro. Chin Med J (Engl), 124(20), 3373-3377.

50

APPENDIX

Figure A. 1 Distribution of counts in cOGGs. A) The proportion of each taxonomy group; B) The logarithm of total counts on base 2 in cOGGs

Figure A. 2 PCA plots. A) Raw expression matrix of 3091 cOGGs, grouped by treatments. B) The expression matrix processed with batch effects removal by R package “limma”

51

Figure A. 3 Hierarchical clustering of classes for Gene Ontology enrichment matrices. Numbers at the nodes indicate support for the clustering. Bootstrap probability (BP) values and approximately unbiased (AU) probability values (p-values) (Efron et al., 1996; Shimodaira, 2002, 2004) were both calculated with same distance measurement (“Euclidean”) and clustering method (“complete”).

52

Table A. 1 Details of DEcOGGs in cluster_2 found from the global method Gene symbol Class Dataset fold-change Abcf2 Actinopteri swordtail 2.821043 Anm5 Actinopteri swordtail 2.144737 Argi2 Actinopteri swordtail 3.480224 Arip4 Actinopteri swordtail 2.246719 At2l1 Actinopteri swordtail -2.151858 Atad3 Actinopteri swordtail 3.301075 Best3 Actinopteri zebrafish 2.666667 Brd3 Actinopteri swordtail 2.210775 Cert Actinopteri swordtail 2.019558 Csad Actinopteri swordtail 14.10382 Dcor Actinopteri swordtail 7.571026 Egr3 Actinopteri swordtail -2.383721 Fkbp4 Actinopteri swordtail 5.021377 Gcsp Actinopteri swordtail 5.646047 Gna1 Actinopteri swordtail 2.492424 Isca1 Actinopteri swordtail 2.04811 Jip1 Actinopteri swordtail 2.107143 Kinh Actinopteri swordtail 2.176423 Nmna1 Actinopteri swordtail -2.075611 Pcy2 Actinopteri swordtail 2.812287 Pric2 Actinopteri swordtail 2.326651 Rbms1 Actinopteri swordtail 2.03876 Rcaf1 Actinopteri swordtail 2.353053 Rgf1a Actinopteri swordtail 2.012216 S2542 Actinopteri swordtail 2.171446 Sesn1 Actinopteri swordtail 2.148466 Snrk Actinopteri swordtail 2.892097 St32c Actinopteri swordtail 2.070513 Syn2 Actinopteri swordtail 2.679104 Tinal Actinopteri swordtail 2.286778 Ubp40 Actinopteri swordtail 2.363128 Uck2 Actinopteri swordtail 2.513357 Vamp2 Actinopteri swordtail 2.333333 Vmp1 Actinopteri swordtail 2.251259 Znt1 Actinopteri swordtail 2.198856 Caf1b Amphibia toad 3.066667 Orc2 Amphibia toad 2.045977

53

Rl24 Amphibia toad 4.607843 Titin Amphibia toad -2.25 Acox1 Chromadorea roundworm -2.366667 Catf Chromadorea roundworm -2.0369 Gipc1 Chromadorea roundworm -11.40952 Mpp5 Chromadorea roundworm -2.201258 Ncam1 Chromadorea roundworm -2.460317 Syt4 Chromadorea roundworm 2.363636 Ttbk1 Chromadorea roundworm -13.95475 Ubb Chromadorea roundworm -31.33333 F10a1 Insecta fly_2 -26.25 Fas Insecta fly_3 -20.41525 Plk1 Insecta fly_3 -31.8 Bag2 Mammalia mouse_3 108.5 Dhrs4 Mammalia mouse_3 216 Erbb3 Mammalia mouse_3 104 Gtpb6 Mammalia mouse_3 18.02941 Ita7 Mammalia mouse_3 23.25 Lph Mammalia rat_2 -77.3125 Mcm3 Mammalia mouse_3 46.66667 Mcm4 Mammalia mouse_3 28.4 Nac1 Mammalia mouse_3 36.22222 Nmna1 Mammalia mouse_3 18.13333 Noc2l Mammalia mouse_3 17.28 Nu155 Mammalia mouse_3 48.66667 Psf1 Mammalia mouse_3 59.2 Qsox1 Mammalia mouse_3 72 Rab30 Mammalia mouse_3 41.5 Rcaf1 Mammalia mouse_3 21.59322 Rl35 Mammalia mouse_3 114 Rs15 Mammalia mouse_3 -72 S13a5 Mammalia mouse_3 18.8 S22a5 Mammalia mouse_3 22.5 Smg8 Mammalia mouse_3 56.72727 Zchc4 Mammalia mouse_3 35.71429

54

Table A. 2 Details of DEGs found from the combined method Gene Gene Function in Class Dataset logFC P- symbo Biological Value l Processes g1528_INdme_PRJNA302146_DN76608_c0_g1 Arp3 Arp2/3 Insecta fly_1 1.06 1.76E- complex- 03 mediated actin nucleation g1528_RDmus_PRJNA316996_DN111502_c3_g Arp3 Arp2/3 Mammalia mouse_2 -0.19 1.21E- 1 complex- 05 mediated actin nucleation g1528_RDmus_PRJNA316996_DN322379_c0_g Arp3 Arp2/3 Mammalia mouse_2 -0.23 3.18E- 1 complex- 04 mediated actin nucleation g640_RDmus_PRJNA316996_DN587279_c0_g5 At1a3 adult Mammalia mouse_2 0.18 2.24E- locomotory 07 behavior g640_RDmus_PRJNA316996_DN454090_c0_g1 At1a3 adult Mammalia mouse_2 -0.15 1.88E- locomotory 03 behavior g640_IName_PRJNA541005_DN206937_c3_g2 At1a3 adult Insecta bee -0.30 1.64E- locomotory 04 behavior g493_INpme_PRJNA287152_DN40085_c1_g1 Brd3 chromatin Insecta wasp_2 -0.83 3.21E- organization 05 g493_RDmus_PRJNA316996_DN116479_c1_g1 Brd3 chromatin Mammalia mouse_2 0.65 2.68E- organization 03 g493_RDmus_PRJNA316996_DN120552_c3_g2 Brd3 chromatin Mammalia mouse_2 0.49 3.48E- organization 04 g493_RDmus_PRJNA316996_DN120552_c3_g3 Brd3 chromatin Mammalia mouse_2 0.51 3.47E- organization 07

55

g4072_RDmus_PRJNA316996_DN587180_c0_g Cam2b calcium- Mammalia mouse_2 4.08 5.55E- 4 mediated 06 signaling g4072_RDmus_PRJNA316996_DN587180_c0_g Cam2b calcium- Mammalia mouse_2 1.19 1.25E- 1 mediated 05 signaling g4072_RDmus_PRJNA316996_DN587180_c0_g Cam2b calcium- Mammalia mouse_2 0.98 1.59E- 3 mediated 13 signaling g4072_RDmus_PRJNA316996_DN587180_c0_g Cam2b calcium- Mammalia mouse_2 5.52 5.86E- 6 mediated 05 signaling g4072_RDmus_PRJNA316996_DN587180_c0_g Cam2b calcium- Mammalia mouse_2 -2.34 1.49E- 7 mediated 21 signaling g4072_IName_PRJNA541005_DN162820_c0_g Cam2b calcium- Insecta bee -0.22 4.61E- 1 mediated 05 signaling g396_INdme_PRJNA302146_DN71415_c0_g1 Chd4 chromatin Insecta fly_1 3.58 4.52E- organization 07 g396_RDmus_PRJNA316996_DN124133_c0_g1 Chd4 chromatin Mammalia mouse_2 0.36 2.29E- organization 03 g396_RDmus_PRJNA316996_DN325837_c0_g1 Chd4 chromatin Mammalia mouse_2 0.38 3.27E- organization 03 g373_RDmus_PRJNA316996_DN122652_c14_g Clk3 peptidyl-tyrosine Mammalia mouse_2 -0.72 2.15E- 2 phosphorylation 08 g373_FIswo_PRJNA381689_DN635_c0_g1 Clk3 peptidyl-tyrosine Actinopteri swordtail -0.45 1.32E- phosphorylation 03 g373_FIswo_PRJNA381689_DN1135_c0_g1 Clk3 peptidyl-tyrosine Actinopteri swordtail -0.48 5.35E- phosphorylation 04 g373_RDmus_PRJNA529794_DN101049_c0_g1 Clk3 peptidyl-tyrosine Mammalia mouse_3 5.17 1.55E- phosphorylation 04

56

g3094_INpme_PRJNA287152_DN41847_c0_g1 Cryab apoptotic Insecta wasp_2 -1.15 3.02E- process involved 04 in morphogenesis g3094_INpme_PRJNA287152_DN38185_c0_g1 Cryab apoptotic Insecta wasp_2 -0.73 4.79E- process involved 04 in morphogenesis g3094_RDmus_PRJNA316996_DN119776_c1_g Cryab apoptotic Mammalia mouse_2 0.68 4.84E- 1 process involved 07 in morphogenesis g3094_IName_PRJNA541005_DN130709_c0_g Cryab apoptotic Insecta bee 0.92 3.67E- 1 process involved 06 in morphogenesis g3094_IName_PRJNA541005_DN205528_c0_g Cryab apoptotic Insecta bee 1.90 2.84E- 1 process involved 08 in morphogenesis g2132_INpme_PRJNA287152_DN27723_c0_g1 Dnjb5 chaperone Insecta wasp_2 -1.09 8.19E- cofactor- 21 dependent protein refolding g2132_RDmus_PRJNA316996_DN117176_c1_g Dnjb5 chaperone Mammalia mouse_2 0.30 3.28E- 1 cofactor- 03 dependent protein refolding g2132_RDmus_PRJNA316996_DN453986_c0_g Dnjb5 chaperone Mammalia mouse_2 0.38 2.48E- 1 cofactor- 04 dependent protein refolding

57

g2132_IName_PRJNA541005_DN138194_c0_g Dnjb5 chaperone Insecta bee 1.13 9.62E- 2 cofactor- 10 dependent protein refolding g808_INpme_PRJNA287152_DN40128_c1_g4 Enpl actin rod Insecta wasp_2 -0.88 4.28E- assembly 11 g808_INpme_PRJNA287152_DN41341_c2_g2 Enpl actin rod Insecta wasp_2 -0.76 6.76E- assembly 06 g808_RDmus_PRJNA316996_DN320243_c1_g1 Enpl actin rod Mammalia mouse_2 0.19 7.89E- assembly 08 g808_IName_PRJNA541005_DN225248_c0_g1 Enpl actin rod Insecta bee 1.86 5.31E- assembly 17 g808_IName_PRJNA541005_DN175067_c0_g1 Enpl actin rod Insecta bee 2.07 1.73E- assembly 23 g808_IName_PRJNA541005_DN156858_c0_g1 Enpl actin rod Insecta bee 0.71 5.50E- assembly 04 g146_INpme_PRJNA287152_DN39059_c1_g4 Est5a lipid catabolic Insecta wasp_2 -0.84 3.76E- process 08 g146_INpme_PRJNA287152_DN39059_c1_g5 Est5a lipid catabolic Insecta wasp_2 -0.88 1.66E- process 07 g146_INdme_PRJNA302146_DN70562_c0_g1 Est5a lipid catabolic Insecta fly_1 2.91 3.54E- process 05 g146_INdme_PRJNA302146_DN76421_c1_g1 Est5a lipid catabolic Insecta fly_1 3.91 2.48E- process 10 g146_FIswo_PRJNA381689_DN97_c0_g1 Est5a lipid catabolic Actinopteri swordtail -0.51 4.47E- process 04 g112_INdme_PRJNA302146_DN79034_c0_g2 Flnb actin Insecta fly_1 3.22 5.37E- cytoskeleton 06 organization g112_INdme_PRJNA302146_DN79034_c0_g3 Flnb actin Insecta fly_1 2.46 1.35E- cytoskeleton 03 organization

58

g112_FIswo_PRJNA381689_DN5_c0_g1 Flnb actin Actinopteri swordtail 0.69 1.69E- cytoskeleton 04 organization g112_FIswo_PRJNA381689_DN9554_c0_g1 Flnb actin Actinopteri swordtail 0.98 3.64E- cytoskeleton 07 organization g112_FIswo_PRJNA381689_DN144941_c0_g1 Flnb actin Actinopteri swordtail 0.93 2.68E- cytoskeleton 06 organization g85_INpme_PRJNA287152_DN41286_c1_g3 Grp75 cellular response Insecta wasp_2 -0.66 1.67E- to interleukin-1 04 g85_INpme_PRJNA287152_DN38944_c1_g1 Grp75 cellular response Insecta wasp_2 -1.15 3.60E- to interleukin-1 15 g85_INpme_PRJNA287152_DN37347_c1_g1 Grp75 cellular response Insecta wasp_2 -0.65 1.12E- to interleukin-1 05 g85_RDmus_PRJNA316996_DN122773_c6_g2 Grp75 cellular response Mammalia mouse_2 -4.89 6.37E- to interleukin-1 58 g85_RDmus_PRJNA316996_DN720964_c0_g1 Grp75 cellular response Mammalia mouse_2 -0.14 2.89E- to interleukin-1 05 g85_RDmus_PRJNA316996_DN122773_c6_g1 Grp75 cellular response Mammalia mouse_2 0.56 3.48E- to interleukin-1 09 g85_FIswo_PRJNA381689_DN146475_c0_g1 Grp75 cellular response Actinopteri swordtail 0.47 4.47E- to interleukin-1 04 g85_RWcel_PRJNA450614_DN9800_c0_g1 Grp75 cellular response Chromadore roundwor -2.44 7.94E- to interleukin-1 a m 07 g85_IName_PRJNA541005_DN184327_c1_g1 Grp75 cellular response Insecta bee 0.84 1.09E- to interleukin-1 09 g85_IName_PRJNA541005_DN194301_c0_g2 Grp75 cellular response Insecta bee 1.78 6.64E- to interleukin-1 08 g85_IName_PRJNA541005_DN184295_c0_g2 Grp75 cellular response Insecta bee 1.00 1.59E- to interleukin-1 12 g1635_INpme_PRJNA287152_DN41806_c5_g1 Hs74l protein folding Insecta wasp_2 -0.95 9.60E- 08

59

g1635_FIswo_PRJNA381689_DN114069_c0_g1 Hs74l protein folding Actinopteri swordtail 0.62 3.48E- 04 g1635_IName_PRJNA541005_DN206413_c2_g Hs74l protein folding Insecta bee 0.95 1.53E- 1 12 g976_INpme_PRJNA287152_DN32340_c4_g1 If4a2 cellular response Insecta wasp_2 0.37 1.03E- to leukemia 04 inhibitory factor g976_RDmus_PRJNA316996_DN452698_c2_g1 If4a2 cellular response Mammalia mouse_2 -0.19 3.81E- to leukemia 03 inhibitory factor g976_RDmus_PRJNA316996_DN453428_c0_g1 If4a2 cellular response Mammalia mouse_2 -0.15 1.79E- to leukemia 03 inhibitory factor g442_RDmus_PRJNA316996_DN119065_c0_g2 If4g1 behavioral fear Mammalia mouse_2 -0.80 1.20E- response 04 g442_RDmus_PRJNA316996_DN119065_c0_g1 If4g1 behavioral fear Mammalia mouse_2 0.50 1.33E- response 04 g442_FIswo_PRJNA381689_DN8279_c0_g1 If4g1 behavioral fear Actinopteri swordtail 1.03 4.39E- response 09 g442_FIswo_PRJNA381689_DN3868_c0_g1 If4g1 behavioral fear Actinopteri swordtail 1.09 6.87E- response 07 g16_RDmus_PRJNA316996_DN109046_c1_g1 Ki13b microtubule- Mammalia mouse_2 0.32 6.81E- based movement 05 g16_RDmus_PRJNA316996_DN112909_c0_g2 Ki13b microtubule- Mammalia mouse_2 1.43 1.22E- based movement 04 g16_RDmus_PRJNA316996_DN111477_c2_g1 Ki13b microtubule- Mammalia mouse_2 0.20 2.73E- based movement 04 g16_FIswo_PRJNA381689_DN8789_c0_g1 Ki13b microtubule- Actinopteri swordtail 1.95 2.20E- based movement 04 g676_RDmus_PRJNA316996_DN320055_c0_g1 Kinh anterograde Mammalia mouse_2 0.15 1.97E- axonal protein 03 transport

60

g676_RDmus_PRJNA316996_DN195271_c0_g1 Kinh anterograde Mammalia mouse_2 0.26 7.16E- axonal protein 09 transport g676_FIswo_PRJNA381689_DN1822_c0_g1 Kinh anterograde Actinopteri swordtail 1.16 6.92E- axonal protein 06 transport g129_RDmus_PRJNA316996_DN115294_c1_g4 M4k4 activation of Mammalia mouse_2 -2.48 4.16E- protein kinase 03 activity g129_RDmus_PRJNA316996_DN83175_c0_g1 M4k4 activation of Mammalia mouse_2 0.43 6.24E- protein kinase 10 activity g129_RDmus_PRJNA316996_DN122554_c1_g1 M4k4 activation of Mammalia mouse_2 0.75 2.21E- protein kinase 07 activity g129_FIswo_PRJNA381689_DN88342_c0_g1 M4k4 activation of Actinopteri swordtail -0.61 1.45E- protein kinase 03 activity g129_FIswo_PRJNA381689_DN209_c0_g1 M4k4 activation of Actinopteri swordtail -0.45 7.58E- protein kinase 04 activity g61_RDmus_PRJNA316996_DN118241_c0_g3 Macf1 establishment or Mammalia mouse_2 0.65 1.25E- maintenance of 04 cell polarity g61_RDmus_PRJNA316996_DN118241_c0_g2 Macf1 establishment or Mammalia mouse_2 0.39 1.18E- maintenance of 03 cell polarity g61_FIswo_PRJNA381689_DN2388_c0_g1 Macf1 establishment or Actinopteri swordtail 1.07 2.32E- maintenance of 15 cell polarity g61_FIswo_PRJNA381689_DN176346_c0_g1 Macf1 establishment or Actinopteri swordtail 1.33 1.84E- maintenance of 08 cell polarity

61

g61_FIswo_PRJNA381689_DN721_c0_g1 Macf1 establishment or Actinopteri swordtail -1.10 4.07E- maintenance of 13 cell polarity g61_FIswo_PRJNA381689_DN4135_c0_g1 Macf1 establishment or Actinopteri swordtail 0.94 1.29E- maintenance of 11 cell polarity g50_INpme_PRJNA287152_DN40008_c3_g1 Myh7b actin Insecta wasp_2 0.90 1.59E- cytoskeleton 04 organization g50_RDmus_PRJNA316996_DN121569_c1_g2 Myh7b actin Mammalia mouse_2 0.29 3.09E- cytoskeleton 03 organization g50_FIswo_PRJNA381689_DN682_c0_g3 Myh7b actin Actinopteri swordtail -0.61 3.58E- cytoskeleton 06 organization g50_FIswo_PRJNA381689_DN781_c2_g1 Myh7b actin Actinopteri swordtail 0.93 7.97E- cytoskeleton 04 organization g50_FIswo_PRJNA381689_DN762_c0_g1 Myh7b actin Actinopteri swordtail 0.86 1.72E- cytoskeleton 04 organization g1794_RDmus_PRJNA316996_DN319855_c0_g Ndrg3 signal Mammalia mouse_2 9.58 7.87E- 1 transduction 30 g1794_RDmus_PRJNA316996_DN319855_c0_g Ndrg3 signal Mammalia mouse_2 -4.99 3.11E- 4 transduction 10 g1794_RDmus_PRJNA316996_DN319855_c0_g Ndrg3 signal Mammalia mouse_2 7.37 5.39E- 3 transduction 81 g1794_RDmus_PRJNA316996_DN319855_c0_g Ndrg3 signal Mammalia mouse_2 -8.52 6.94E- 2 transduction 20 g1794_FIswo_PRJNA381689_DN6844_c0_g1 Ndrg3 signal Actinopteri swordtail 0.68 1.37E- transduction 03

62

g120_RDmus_PRJNA316996_DN112448_c0_g1 Odo1 2-oxoglutarate Mammalia mouse_2 0.99 3.37E- metabolic 04 process g120_RDmus_PRJNA316996_DN115159_c0_g1 Odo1 2-oxoglutarate Mammalia mouse_2 0.65 2.06E- metabolic 05 process g120_RDmus_PRJNA316996_DN115159_c0_g2 Odo1 2-oxoglutarate Mammalia mouse_2 -6.10 2.18E- metabolic 28 process g120_INdme_PRJNA475804_DN59680_c1_g2 Odo1 2-oxoglutarate Insecta fly_3 -2.58 1.31E- metabolic 05 process g1337_RDmus_PRJNA316996_DN108901_c0_g P85b cellular glucose Mammalia mouse_2 -3.83 7.90E- 2 homeostasis 07 g3317_RDmus_PRJNA316996_DN108901_c0_g P85b cellular glucose Mammalia mouse_2 -3.83 7.90E- 2 homeostasis 07 g1337_FIswo_PRJNA381689_DN67_c1_g1 P85b cellular glucose Actinopteri swordtail 0.61 8.27E- homeostasis 05 g133_RDmus_PRJNA316996_DN119158_c2_g4 Pde4d adenylate Mammalia mouse_2 0.31 3.94E- cyclase- 03 activating adrenergic receptor signaling pathway involved in positive regulation of heart rate g133_RDmus_PRJNA316996_DN98519_c0_g1 Pde4d adenylate Mammalia mouse_2 0.61 7.23E- cyclase- 04 activating adrenergic

63

receptor signaling pathway involved in positive regulation of heart rate g133_FIswo_PRJNA381689_DN1382_c0_g1 Pde4d adenylate Actinopteri swordtail -0.91 1.65E- cyclase- 04 activating adrenergic receptor signaling pathway involved in positive regulation of heart rate g334_RDmus_PRJNA316996_DN76580_c0_g1 Peli1 interleukin-1- Mammalia mouse_2 0.35 9.19E- mediated 04 signaling pathway g334_RDmus_PRJNA316996_DN117671_c0_g2 Peli1 interleukin-1- Mammalia mouse_2 -3.58 6.22E- mediated 06 signaling pathway g3366_FIswo_PRJNA381689_DN4273_c0_g1 Peli1 interleukin-1- Actinopteri swordtail 0.80 6.27E- mediated 05 signaling pathway g334_FIswo_PRJNA381689_DN2687_c0_g1 Peli1 interleukin-1- Actinopteri swordtail 0.89 3.91E- mediated 04

64

signaling pathway g1027_INpme_PRJNA287152_DN40071_c2_g2 Rrp5 mRNA Insecta wasp_2 0.50 1.62E- processing 04 g1027_RDmus_PRJNA316996_DN123191_c2_g Rrp5 mRNA Mammalia mouse_2 -0.70 1.85E- 3 processing 04 g1027_RDmus_PRJNA529794_DN597388_c0_g Rrp5 mRNA Mammalia mouse_3 6.08 3.52E- 1 processing 04 g1788_INdme_PRJNA302146_DN75320_c0_g1 Scrb1 adhesion of Insecta fly_1 2.21 2.47E- symbiont to host 04 g1788_RDmus_PRJNA316996_DN14139_c0_g1 Scrb1 adhesion of Mammalia mouse_2 -3.30 1.98E- symbiont to host 06 g1788_RDmus_PRJNA316996_DN14139_c0_g3 Scrb1 adhesion of Mammalia mouse_2 2.35 3.63E- symbiont to host 05 g1788_RDmus_PRJNA316996_DN14139_c0_g5 Scrb1 adhesion of Mammalia mouse_2 -3.04 2.27E- symbiont to host 04 g1788_FIswo_PRJNA381689_DN4047_c0_g1 Scrb1 adhesion of Actinopteri swordtail 0.95 3.41E- symbiont to host 06 g1788_RDmus_PRJNA529794_DN77528_c0_g1 Scrb1 adhesion of Mammalia mouse_3 6.31 1.32E- symbiont to host 04 g207_RDmus_PRJNA316996_DN121201_c1_g1 Stb5l exocytosis Mammalia mouse_2 -0.30 3.02E- 03 g207_RDmus_PRJNA529794_DN93164_c1_g2 Stb5l exocytosis Mammalia mouse_3 6.26 7.81E- 05 g207_IName_PRJNA541005_DN201708_c2_g2 Stb5l exocytosis Insecta bee -0.42 6.44E- 04 g788_RDmus_PRJNA316996_DN322184_c0_g3 Tba4a microtubule Mammalia mouse_2 0.17 1.26E- cytoskeleton 05 organization g788_RDmus_PRJNA316996_DN46705_c3_g2 Tba4a microtubule Mammalia mouse_2 0.17 1.67E- cytoskeleton 05 organization

65

g788_IName_PRJNA541005_DN203746_c4_g2 Tba4a microtubule Insecta bee 0.48 3.02E- cytoskeleton 04 organization g560_INdme_PRJNA302146_DN74885_c0_g1 Ubp40 protein Insecta fly_1 2.79 7.40E- deubiquitination 06 g560_INdme_PRJNA302146_DN75296_c1_g2 Ubp40 protein Insecta fly_1 1.99 2.46E- deubiquitination 04 g560_FIswo_PRJNA381689_DN1131_c0_g1 Ubp40 protein Actinopteri swordtail 1.68 1.20E- deubiquitination 06 g174_INdme_PRJNA302146_DN78331_c0_g1 Vpp3 apoptotic Insecta fly_1 1.59 1.48E- process 03 g174_INdme_PRJNA302146_DN73747_c0_g1 Vpp3 apoptotic Insecta fly_1 4.99 6.51E- process 18 g174_RDmus_PRJNA316996_DN123539_c0_g1 Vpp3 apoptotic Mammalia mouse_2 0.25 1.83E- process 05

66

Cluster dendrogram with p−values (%)

au bp 10.2 edge # 66 40 98 53 2

9.0 1 Height toad swordtail zebrafish roundworm

Distance: euclidean Cluster method: average

Figure A. 4 Hierarchical clustering of species in Amphibia, Actinopteri, and Chromadorea for the differential expression level of 40 Learning-cOGGs. Numbers at the nodes indicate support for the clustering. Bootstrap probability (BP) values and approximately unbiased (AU) probability values (p-values) were both calculated with same distance measurement (“Euclidean”) and clustering method (“average”).

67