Download The
Total Page:16
File Type:pdf, Size:1020Kb
THE INTERPRETATION OF GENE COEXPRESSION IN SYSTEMS BIOLOGY by Marjan Farahbod MSc, Chalmers University of Technology, 2013 BSc, University of Tehran, 2010 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) December 2019 Ó Marjan Farahbod, 2019 The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled: The interpretation of gene coexpression in systems biology submitted by Marjan Farahbod in partial fulfillment of the requirements for the degree of Doctor of philosophy in Bioinformatics Examining Committee: Dr. Paul Pavlidis Supervisor Dr. Alexandre Bouchard Côté Supervisory Committee Member Dr. Joerg Gsponer Supervisory Committee Member Dr. Marco A. Marra University Examiner Dr. T. Michael Underhill University Examiner Additional Supervisory Committee Members: Dr. Wyeth Wasserman Supervisory Committee Member Supervisory Committee Member ii Abstract One of the key features of transcriptomic data is the similarity of expression patterns among groups of genes, referred to as coexpression. It has been shown that coexpressed genes tend to share similar functions. Based on this, a common assumption is that gene coexpression is a result of transcriptional regulation and therefore, regulatory relationships could be inferred from coexpression. However, success in inferring such relationships has been limited and there are questions about the source and interpretation of coexpression. Here I explore coexpression as an observed signal from the data, examine its source and assess its relevance for inferring regulatory relationships. In chapter 2 I studied differential coexpression, which refers to the alteration of gene coexpression between biological conditions. It is commonly assumed that differential coexpression can reveal rewiring of transcription regulatory networks, specifically among the genes that maintain their average expression level between the conditions. However, I show that to a large extent and in contrast to this common assumption, differential coexpression is more parsimoniously explained by changes in average expression levels. This finding demonstrates limitations for inference of regulatory rewiring from coexpression and poses questions for the underlying causes of the observed coexpression. In Chapter 3, I studied cellular composition variation among bulk tissue samples as a source of variance and the observed coexpression. I found that for most genes, differences in expression levels across cell types account for a large fraction of their variance and as a result genes with similar cell-type expression profiles appear to be coexpressed. Finally, I showed that this coexpression dominates the underlying intra-cell- type coexpression and also has the two prominent features of coexpression in the bulk tissue: reproducibility and biological relevance. Through my studies, I was able to provide an explanation for much of the observed coexpression in the bulk tissue and shed light on its resolution and limitation for inference of regulatory relationships. I also studied coexpression in single-nucleus data and show that some of the observed coexpression in it is likely to be attributed to the transcriptional regulation, which could be a subject for future studies. iii Lay Summary Based on the cell’s state and context, functional units of the cellular machinery are generated from the genome through a process called gene expression. These units work in collaborating groups to perform various cellular functions. Coexpression is a process in which a group of genes are expressed coordinately, likely due to their required collaboration in a cellular process. Therefore, understanding which genes, and why and when they are coexpressed, is extremely helpful for untangling the complexity of cellular mechanisms. Coexpression analysis can roughly tell us which genes are being expressed simultaneously. While this type of analysis has potential for increasing our understanding of cellular functions, there are certain issues with the interpretation of the observed signal. Here I address these issues by examining the signal from a data analytics perspective to delineate its limits and pitfalls in interpretation, in order to ultimately present alternative approaches for improvement. iv Preface All the work presented in this dissertation was conducted in the Michael Smith Laboratories at the University of British Columbia. I was solely responsible for conducting the research and the primary author of every chapter and corresponding publications. My supervisor, Paul Pavlidis, contributed ideas, supervision as well as text, and editorial suggestions for all chapters. A version of chapter 2 has been published [1]. I was the lead investigator, responsible for all major areas of concept formation, data collection and analysis, as well as manuscript composition. Paul Pavlidis was the supervisory author on this project and was involved throughout the project in concept formation and manuscript composition and writing. At the time of writing, a version of Chapter 3 has been made available online [2] and is currently under review for publication. I was the lead investigator, responsible for all major areas of concept formation, data collection and analysis, as well as manuscript composition. Paul Pavlidis was the supervisory author on this project and was involved throughout the project in concept formation and manuscript composition and writing. [1] Farahbod,M. and Pavlidis,P. (2019) Differential coexpression in human tissues and the confounding effect of mean expression levels. Bioinformatics, 35, 55–61. [2] Farahbod,M. and Pavlidis,P. (2019) Untangling the effects of cellular composition on coexpression analysis. BioRxiv doi: https://doi.org/10.1101/735951 v Table of Contents Abstract ................................................................................................................................ iii Lay Summary ...................................................................................................................... iv Preface ................................................................................................................................... v Table of Contents ................................................................................................................ vi List of Tables ........................................................................................................................ x List of Figures ...................................................................................................................... xi List of Supplementary Material ...................................................................................... xiii List of Abbreviations ......................................................................................................... xv Acknowledgements .......................................................................................................... xvii Dedication ........................................................................................................................ xviii Chapter 1: Introduction ...................................................................................................... 1 1.1 Codifying gene function .................................................................................... 4 1.2 Regulation of transcription ................................................................................ 5 1.3 Transcriptomic data ........................................................................................... 8 1.3.1 Platforms and specifics .............................................................................. 8 1.3.2 Noise and confounding factors in expression data .................................... 9 1.4 Coexpression analysis ...................................................................................... 11 1.4.1 Gene clustering in transcriptomic data .................................................... 12 1.4.2 Coexpression networks ............................................................................ 14 1.4.2.1 Similarity measure ............................................................................... 14 1.4.2.2 Network characteristics and interpretation .......................................... 15 1.4.3 Reproducibility and conservation of coexpression .................................. 18 vi 1.4.4 Coexpression in expression data analysis: overview ............................... 19 1.4.5 Coexpression in expression data analysis: applications ........................... 23 1.5 Coexpression for function prediction ............................................................... 25 1.5.1 Methods and frameworks ......................................................................... 27 1.5.2 Evaluation and shortcomings ................................................................... 28 1.6 Inferring regulatory relationships from transcriptomic data ............................ 31 1.6.1 Challenges and complexities .................................................................... 31 1.6.2 Inference and validation ........................................................................... 33 1.7 Research contribution .....................................................................................