Multipartite Graph Algorithms for the Analysis of Heterogeneous Data
University of Tennessee, Knoxville
TRACE: Tennessee Research and Creative Exchange

Doctoral Dissertations
Graduate School

12-2015

Multipartite Graph Algorithms for the Analysis of Heterogeneous Data

Charles Alexander Phillips
University of Tennessee - Knoxville, [email protected]

Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss

Part of the Computational Biology Commons

Recommended Citation
Phillips, Charles Alexander, "Multipartite Graph Algorithms for the Analysis of Heterogeneous Data." PhD diss., University of Tennessee, 2015.
https://trace.tennessee.edu/utk_graddiss/3600

This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected].

To the Graduate Council:

I am submitting herewith a dissertation written by Charles Alexander Phillips entitled "Multipartite Graph Algorithms for the Analysis of Heterogeneous Data." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Science.

Michael A. Langston, Major Professor

We have read this dissertation and recommend its acceptance:

Bruce J. MacLennan, Brynn H. Voy, David J. Icove

Accepted for the Council:
Carolyn R. Hodges
Vice Provost and Dean of the Graduate School

(Original signatures are on file with official student records.)

Multipartite Graph Algorithms for the Analysis of Heterogeneous Data

A Dissertation Presented for the Doctor of Philosophy Degree
The University of Tennessee, Knoxville

Charles Alexander Phillips
December 2015

Copyright © 2015 by Charles A. Phillips
All rights reserved.
Acknowledgements

In the course of my education and research I have had the good fortune to cross paths with many bright and dedicated people. I extend my gratitude to many of them here, although I am certain the list is not complete.

First I would like to thank my advisor, Dr. Michael A. Langston, for his guidance, patience and above all the example he sets for high standards in scientific research and work ethics. My special thanks go out to those who served on my dissertation committee: Drs. David Icove, Bruce MacLennan, Lynne Parker and Brynn Voy.

Former and present students I have worked with as part of Dr. Langston’s research team here at the University of Tennessee include John Eblen, Ron Hagan, Jeremy Jay, Jordan Lefebvre, Allan Lu, Sudhir Naswa, Clinton Nolan, Andy Perkins, Gary Rogers, Kai Wang, Dinesh Weerapurage and Yun Zhang. Research collaborators include Erich Baker, Jason Bubier, Elissa Chesler, Frank Dehne, Dan Goldowitz, Mike Miles and Aaron Wolen. My thanks go to Suzanne Baktash for her encouragement, helpfulness and support. Other professors at UT with whom I have been fortunate to collaborate include Drs. Arnold Saxton and Meg Staton.

My appreciation goes to instructors at Moberly Area Community College, where I completed my associate degree, and Columbia College, where I did my bachelor’s degree, for helping to fan the spark of my interest in computer science and encouraging me to pursue a graduate degree. These instructors include David Heise, Yihsiang Liow, David Pence and Lawrence West.

And last but not least, my gratitude and appreciation go to my family: my sister Lisa, brothers Chris and Mike, stepmother Sylvia, and especially my father, Alex, whose support and encouragement through the years have been beyond price.

Abstract

The explosive growth in the rate of data generation in recent years threatens to outpace the growth in computer power, motivating the need for new, scalable algorithms and big data analytic techniques.
No field may be more emblematic of this data deluge than the life sciences, where technologies such as high-throughput mRNA arrays and next generation genome sequencing are routinely used to generate datasets of extreme scale. Data from experiments in genomics, transcriptomics, metabolomics and proteomics are continuously being added to existing repositories. A goal of exploratory analysis of such omics data is to illuminate the functions and relationships of biomolecules within an organism. This dissertation describes the design, implementation and application of graph algorithms, with the goal of seeking dense structure in data derived from omics experiments in order to detect latent associations between often heterogeneous entities, such as genes, diseases and phenotypes. Exact combinatorial solutions are developed and implemented, rather than relying on approximations or heuristics, even when problems are exceedingly large and/or difficult. Datasets on which the algorithms are applied include time series transcriptomic data from an experiment on the developing mouse cerebellum, gene expression data measuring acute ethanol response in the prefrontal cortex, and the analysis of a predicted protein-protein interaction network. A bipartite graph model is used to integrate heterogeneous data types, such as genes with phenotypes and microbes with mouse strains. The techniques are then extended to a multipartite algorithm to enumerate dense substructure in multipartite graphs, constructed using data from three or more heterogeneous sources, with applications to functional genomics. Several new theoretical results are given regarding multipartite graphs and the multipartite enumeration algorithm. In all cases, practical implementations are demonstrated to expand the frontier of computational feasibility.

Table of Contents

Chapter 1 Introduction and Background
  1.1 Definitions, Notation and Preliminaries
  1.2 Omics Data
  1.3 Constructing Graphs from High-Throughput Data
    1.3.1 Similarity Metrics
    1.3.2 Thresholding
  1.4 The Quest for Dense Subgraphs
    1.4.1 Maximum Clique
    1.4.2 Maximal Clique Enumeration
    1.4.3 The Paraclique Algorithm
Chapter 2 Algorithms for General Graphs
  2.1 Ethanol Responsive Gene Networks in the Prefrontal Cortex
    2.1.1 Paraclique and Network Analysis
    2.1.2 Functional Analysis
    2.1.3 Combining Transcriptomic and Phenotype Data
    2.1.4 QTL Analysis
    2.1.5 Maximal Clique Enumeration
  2.2 Time Series Analysis of the Developing Mouse Cerebellum
    2.2.1 Data Description
    2.2.2 Paraclique Method
    2.2.3 Paraclique Results
  2.3 A Custom Algorithm for Protein-Protein Interaction Prediction
    2.3.1 Motivation
    2.3.2 Algorithm
    2.3.3 Results
  2.4 Maximum Clique Enumeration
    2.4.1 Background
    2.4.2 Results and Discussion
      2.4.2.1 Algorithms
      2.4.2.2 Basic Backtracking
      2.4.2.3 Finding a Single Maximum Clique
      2.4.2.4 Intelligent Backtracking
      2.4.2.5 Parameterized Enumeration
      2.4.2.6 Maximum Clique Covers