A Data Mining Approach for Detecting Evolutionary Divergence in Transcriptomic Data

by
Owen Zeno Woody

A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Doctor of Philosophy in Biology

Waterloo, Ontario, Canada, 2019
© Owen Zeno Woody 2019

Examining Committee Membership

The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.

External Examiner: Dr. Teresa Crease, Professor, Department of Integrative Biology, University of Guelph
Supervisor(s): Dr. Brendan J. McConkey, Associate Professor, Department of Biology, University of Waterloo
Internal Member: Dr. Kirsten M. Müller, Professor, Department of Biology, University of Waterloo
Internal-external Member: Dr. Dan Brown, Professor, Cheriton School of Computer Science, University of Waterloo
Other Member(s): Dr. Josh D. Neufeld, Professor, Department of Biology, University of Waterloo

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Abstract

It has become common to produce genome sequences for organisms of scientific or popular interest. Although these genome projects provide insight into the gene and protein complements of a species, including their evolutionary relationships, it remains challenging to determine gene regulatory behavior from genome sequence alone. It has also become common to produce “expression atlas” transcriptomic data sets. These atlases employ high-throughput transcript assays to survey an assortment of tissues, developmental states, and responses to stimuli, each of which may elicit or inhibit the transcription of genes. Although genomic and transcriptomic data sets are both routinely collected, they are seldom analyzed in tandem.

Here I present a novel approach to combining these complementary data with a software package called BranchOut. BranchOut uses genomic information to construct gene family phylogenies, and then attempts to map gene expression activity onto each phylogeny to allow estimation of ancestral expression states. This allows the identification of specific innovations due to gene duplications that resulted in fundamental diversification in the roles of otherwise closely related genes. As a proof of concept, the BranchOut technique is first applied to a tangible small-scale example in Apis mellifera. Subsequently, the power of BranchOut to analyze complete genomes is shown for two mammalian genomes, Sus scrofa and Bos taurus. The transcriptomic data sets for these two mammals employ microarray and RNAseq platforms, respectively, for expression analysis, demonstrating BranchOut’s applicability to both future and historic expression atlases. Potential refinements to the approach are also discussed.

Acknowledgements

I would like to thank my supervisor, Brendan McConkey, for his patience, kindness and insight throughout the preparation of this work. I’d also like to thank my committee members for their suggestions and support over many years: Dan Brown, Kirsten Müller, and Josh Neufeld. It was thanks to your mentorship that this was all possible. I am very grateful for the support I received through both NSERC and OGS for this project.
Table of Contents

List of Figures
List of Tables
1 Introduction – Sporks, Soups and Sausages
2 Preprocessing Methodology for Genomic and Transcriptomic Data
3 BranchOut Software Specifications
4 Small-scale Application Involving the Honeybee, Apis mellifera
5 Application of BranchOut to Sus scrofa Microarray Expression Atlas
6 Application of BranchOut to High-Throughput Sequencing: Bos taurus Data Set
7 Future Directions and Conclusion
Bibliography
Appendices

List of Figures

Figure 3.1: An example of an “all conditions” heatmap from BranchOut
Figure 3.2: A reconstruction block produced by BranchOut
Figure 3.3: An example state assignment diagram
Figure 3.4: An example BranchOut single-condition reconstruction
Figure 4.1: Heatmap showing gene expression behavior for the yellow gene family in Apis mellifera
Figure 4.2: MCLUST results for the yellow protein family in Apis mellifera
Figure 4.3: MrBayes tree for the yellow protein family in Apis mellifera
Figure 4.4: Hypothetical expression states of ancestral yellow gene family members from two Apis mellifera tissues
Figure 5.1a & b: Map of Expression Clustering Assignments for the Disintegrin protein family in Sus scrofa
Figure 5.2: BranchOut reconstruction of the Disintegrin protein family in the prefrontal cortex
Figure 5.3: BranchOut reconstruction of the Disintegrin protein family in a blood sample
Figure 5.4: BranchOut reconstruction of the Disintegrin protein family in the testis
Figure 5.5a & b: Map of Expression Clustering Assignments for the ABC transporter protein family in Sus scrofa
Figure 5.6: BranchOut reconstruction of the ABC transporter protein family in the ileum
Figure 5.7: Screenshot of Ensembl exon/intron model for selected ABC-transporters
Figure 5.8a & b: Map of Expression Clustering Assignments for the Cytochrome P450 protein family in Sus scrofa
Figure 5.9: BranchOut reconstruction of the cytochrome P450 protein family in the cortex of the kidney
Figure 6.1a & b: Map of Expression Clustering Assignments for the WW-domain protein family in Bos taurus
Figure 6.2: BranchOut reconstruction of the WW protein family in an ovarian follicle sample
Figure 6.3: BranchOut reconstruction of the WW protein family in a lactating mammary gland
Figure 6.4a & b: Map of Expression Clustering Assignments for the Tubulin FtsZ family: GTPase domain protein family in Bos taurus
Figure 6.5: BranchOut reconstruction of the Tubulin protein family in the temporal cortex
Figure 6.6: BranchOut reconstruction of the Tubulin protein family in the supraspinatus
Figure 6.7a & b: Map of Expression Clustering Assignments for the Ligand binding domain of nuclear hormone receptor protein family in Bos taurus
Figure 6.8: BranchOut reconstruction of the ligand binding domain of nuclear hormone receptor family in the rumen
Figure 6.9: BranchOut reconstruction of the ligand binding domain of nuclear hormone receptor family in the pituitary gland
Figure 6.10: BranchOut reconstruction of the ligand binding domain of nuclear hormone receptor family in the salivary gland
Figure 6.11: BranchOut reconstruction of the ligand binding domain of nuclear hormone receptor family in the infundibulum
Figure 7.1: Pairwise comparisons of BranchOut scores with varying input sources

List of Tables

Table 5.1: Sus scrofa tissues with many high-scoring BranchOut signal scores and a summary of findings
Table 5.2: Rank-ordered list of Sus scrofa tissues that contained a large number of high-scoring BranchOut reconstruction signals
Table 5.3: Rank-ordered list of Sus scrofa gene families that contained a large number of high-scoring BranchOut reconstruction signals
Table 6.1: Bos taurus tissues with many high-scoring BranchOut signal scores and a summary of findings
Table 6.2: Rank-ordered list of Bos taurus tissues that contained a large number of high-scoring BranchOut reconstruction signals
Table 6.3: Rank-ordered list of Bos taurus gene families that contained a large number of high-scoring BranchOut reconstruction signals

Chapter 1: Introduction – Sporks, Soups and Sausages

The work presented in this thesis addresses a specific aspect of evolutionary biology: the evolution of novel function at the gene family level. Biological organisms are capable of remarkably complex activities, and all this functionality is somehow encoded in the organism’s genomic complement. In order to discuss the evolution of function – broadly, how an organism can become capable of “new” activities – it is important first to establish a working understanding of a few key concepts. First, I will describe the features and activities of a gene that make it a functional unit in the cell. Next, I will cover the means by which new genes can arise, and what it means for a gene to be part of a family. Lastly, I will describe what I define as a “novel function”, and how novelty can be introduced into the genome through the process of gene origination and duplication.

1.1 Defining the gene as a functional unit

This work will focus on genes as fundamental units of inheritance and evolution. There are some intriguing examples of inheritable adaptation that do not require changes at the genetic level (RNA interference, for example (Spracklin, Fields et al. 2017)), but they will lie outside the scope of this discussion. The “Central Dogma of Biology” – that DNA genes encode (messenger) RNA molecules, which are in turn translated to form proteins – remains a powerful explanatory tool, and my research will make a similar assumption about the central role genes play in biological activity and inheritance.

A gene is a string of information in the genome with two main parts: the coding sequence and its regulatory control (MacCarthy, Bergman 2007).
The coding sequence of a gene provides a blueprint for the construction of an active biomolecule – typically a protein, but sometimes an RNA molecule with catalytic properties (ribozyme). Because the number of building blocks for these biomolecules is limited (4 nucleotides or approximately 20 amino acids), the sequence and arrangement of these components is the fundamental characteristic that determines a biomolecule’s role. Consequently, when the coding sequence is the focus, the Central Dogma can be restated as follows: (nucleotide) sequence directs (biomolecule) structure, which in turn directs function. Changes to the coding sequence can fundamentally alter the structure of the encoded biomolecule and may impact the biological capability of the resulting product. A first draft for the definition of “gene function” might stop there.
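
The restated Central Dogma above (sequence directs structure) can be made concrete with a small sketch. The Python fragment below is illustrative only and is not part of BranchOut: the names `CODON_TABLE` and `translate` are hypothetical, the table holds only the handful of standard-genetic-code codons needed for the toy sequences, and the sequences themselves are invented for the example. It shows that one single-nucleotide change can leave the encoded product unchanged while another alters it.

```python
# Partial codon table (standard genetic code); only the codons used below.
# A real analysis would use a complete table, e.g. from a bioinformatics library.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GAG": "E",  # glutamate
    "GAA": "E",  # glutamate (synonymous with GAG)
    "GTG": "V",  # valine
    "AAA": "K",  # lysine
    "TAA": "*",  # stop
}

def translate(coding_sequence):
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(coding_sequence) - 2, 3):
        amino_acid = CODON_TABLE[coding_sequence[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Toy sequences (hypothetical), differing from the reference by one nucleotide each.
reference = "ATGGAGAAATAA"
synonymous = "ATGGAAAAATAA"  # GAG -> GAA: same amino acid
missense = "ATGGTGAAATAA"    # GAG -> GTG: glutamate replaced by valine

for name, seq in [("reference", reference),
                  ("synonymous", synonymous),
                  ("missense", missense)]:
    print(name, translate(seq))
```

The synonymous change leaves the peptide identical to the reference, whereas the missense change substitutes one residue. This is the sense in which changes to the coding sequence may, but need not, affect the resulting product.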
