Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous
UNIVERSITY OF CINCINNATI

Date: 22-Oct-2009

I, Johannes M Freudenberg, hereby submit this original work as part of the requirements for the degree of:
Doctor of Philosophy in Biomedical Engineering

It is entitled:
Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous Context Selection Using High-Throughput Gene Expression Data

Student Signature: Johannes M Freudenberg

This work and its defense approved by:
Committee Chair: Mario Medvedovic, PhD
Bruce Aronow, PhD
Michael Wagner, PhD
Jaroslaw Meller, PhD

11/18/2009

Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous Context Selection Using High-Throughput Gene Expression Data

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biomedical Engineering of the Colleges of Engineering and Medicine

by Johannes M. Freudenberg
M.S., 2004, Leipzig University, Leipzig, Germany (German Diplom degree)

Committee Chair: Mario Medvedovic, Ph.D.

Abstract

Applying clustering algorithms to identify groups of co-expressed genes is an important step in the analysis of high-throughput genomics data in order to elucidate affected biological pathways and transcriptional regulatory mechanisms. As these data become ever more abundant, integration with both existing biological knowledge and other experimental data becomes as crucial as the ability to perform such analysis in a meaningful but virtually unsupervised fashion. Clustering analysis often relies on ad hoc methods such as k-means or hierarchical clustering with Euclidean distance, but model-based methods such as the Bayesian Infinite Mixtures approach have been shown to produce better, more reproducible results.
Further improvements have been accomplished by context-specific gene clustering algorithms designed to determine groups of co-expressed genes within a given subset of biological samples, termed a context. The complementary problem of finding differentially co-expressed genes given two or more contexts has been addressed, but existing approaches rely on the a priori definition of contexts and have not been used to facilitate the clustering of biological samples. Here we describe a new computational method that uses Bayesian infinite mixture models to cluster genes while simultaneously using the concept of differential co-expression as a unique similarity measure to find groups of similar samples. We compute a novel per-gene differential co-expression score (DCS) that is reproducible and biologically meaningful. To evaluate, annotate, and display clustering results, we present the integrated software package CLEAN, which contains functionality for performing Clustering Enrichment Analysis, a method to functionally annotate clustering results and to assign a novel gene-specific functional coherence score. We apply our method to a number of simulated datasets, comparing it to other commonly used clustering algorithms, and we re-analyze several breast cancer studies. We find that our unsupervised method determines patient groupings highly predictive of clinically relevant factors such as estrogen receptor status, tumor grade, and disease-specific survival. Integrating these data with computationally and literature-derived information by applying CLEAN to the corresponding clusterings as well as the DCS signature substantiates these findings.
Our results demonstrate the range of applications our methodology provides. It offers a comprehensive analysis tool to study gene co-expression and differential co-expression patterns specific to the biological conditions of interest while simultaneously determining subsets of such biological conditions using a unique similarity measure that is complementary to existing methods. It allows us to further our understanding of highly complex diseases such as breast cancer, and it has the potential to greatly facilitate research in many other, not yet as intensively studied areas.

Preface

I would like to thank everyone who helped and supported me to reach this milestone on my Bioinformatics journey. I especially thank my advisor and mentor Dr. Mario Medvedovic for all of his support, insight, and encouragement. Discussions and conversations with him are always inspiring and enlightening, and I am deeply grateful for the opportunity to work with him. I also thank Dr. Bruce Aronow, Dr. Jarek Meller, and Dr. Michael Wagner for serving on my thesis committee, but more importantly for their expertise, advice, and support over the past years. Thanks and gratitude also to all my collaborators, especially Dr. Siva Sivaganesan, whose expertise in Bayesian Statistics is invaluable for this dissertation, Monika Ray for trusting me and challenging my ways, and Xiangdong Liu, Vineet Joshi, and Zhen Hu. I thank my fellow travelers Jing Chen, Sivakumar Gowrisankar, Ranga Chandra Gudivada, Rachana Jain, and Mukta Phatak for leading the way. I want to also thank my parents for their love and support (and for sending care packages across the ocean) and my daughter Anouk for reminding me every day what's really important in life. Finally, I wish to thank my partner, my friend, my love Un Kyong Ho: This is for you.

Table of Contents

Abstract
Preface
List of Tables
List of Figures
List of Algorithms
Chapter I – Background
    Introduction
    Clustering
    Gaussian finite and infinite mixture models
    Context-specific clustering
    Differential Co-expression
    Outline of this dissertation
Chapter II – Semiparametric Bayesian model for unsupervised differential co-expression analysis identifies novel molecular subtypes
    Introduction
    Results
        Context-specific infinite mixture model
        Recovery of simulated contexts
        Identifying molecular subtypes in breast cancer gene expression data
        Differentially co-expressed genes
        Comparison to other outcome predictors
        Reproducibility of differential co-expression signature across independent datasets
        Meta-analysis based on the differential co-expression signature
    Discussion and Conclusions
    Methods
        Infinite mixture model
        Differential co-expression score
        Simulation study
        Breast cancer studies
Chapter III: CLEAN – CLustering Enrichment ANalysis
    Background
    Results
        Comparing clustering results using the CLEAN score
        Reproducibility and the comparison with cluster-wide