Using Weighted Set Cover to Identify Biologically Significant Motifs

A thesis presented to
the faculty of
the Russ College of Engineering and Technology of Ohio University

In partial fulfillment
of the requirements for the degree
Master of Science

Robert J.M. Schmidt
December 2015

© 2015 Robert J.M. Schmidt. All Rights Reserved.

This thesis titled
Using Weighted Set Cover to Identify Biologically Significant Motifs
by
ROBERT J.M. SCHMIDT
has been approved for
the School of Electrical Engineering and Computer Science
and the Russ College of Engineering and Technology by

Lonnie R. Welch
Stuckey Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

ABSTRACT

SCHMIDT, ROBERT J.M., M.S., December 2015, Computer Science
Using Weighted Set Cover to Identify Biologically Significant Motifs
Director of Thesis: Lonnie R. Welch

One of the greatest challenges of mankind is understanding how living organisms operate, and a key step towards understanding this challenge is identifying how genes are regulated. Promoter regions play a key role in the regulation of genes via sequences of DNA base pairs known as transcription factor binding sites. When a transcription factor binding site is activated, the genes associated with the transcription factor binding site are transcribed, the first step towards creating proteins. The identification of transcription factor binding sites has come a long way with the advancements of next generation sequencing technologies and projects like ENCODE, but still relies on motif discovery algorithms to pinpoint the exact binding sites. In this thesis, the motif discovery problem is explored and a novel method based on weighted set cover is presented to identify the minimal set of motifs, with objective functions, that discriminately cover a set of DNA sequences.
The results show that some motif set cover methods can more accurately identify biologically significant motifs over simply selecting the top scoring motifs. However, the weighted set cover algorithms did not perform exceptionally well when compared to standard selection methods, which is attributed to the use of a discriminative motif discovery application. Detailed results can be found at http://motifpipeline.com.

ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Lonnie Welch, as an advisor and teacher throughout graduate school, for providing solid direction in my research, countless ideas and suggestions, continuous support, and for his work in obtaining the Choose Ohio First for Bioinformatics scholarships. Without Dr. Welch this thesis would not be possible. I would like to thank Dr. Frank Drews for listening to countless research presentations, providing useful guidance and numerous helpful suggestions along the way. I would like to thank Dr. David Juedes for teaching me the majority of what I know about the hardness of problems and the different tools available for solving computationally difficult problems. I also want to thank Dr. Sonsoles de Lacalle for all of her hard work on the rat estrous cycle project, allowing me to work with and analyze her data, and providing crucial insight into the biology behind the numbers. I also want to give a huge thanks to my friend Rami Al-Ouran, who has given me countless hours of his time, and provided me with so much help and guidance throughout graduate school. I also want to give a huge thanks to Xiaoyu Liang for all of her help throughout graduate school and for always answering my questions. I also want to thank Richard Wolfe, Jeffery Jones, Yichao Li, Ashwini Naik, and the whole bioinformatics lab for all of their help and support throughout graduate school. I also want to give a big thanks to Choose Ohio First for Bioinformatics for help funding me throughout graduate school.
Finally, I want to thank my love, Kasia, for always being by my side.

TABLE OF CONTENTS

Abstract
Acknowledgments
List of Tables
List of Figures
Chapter 1: Introduction
    1.1 Background
    1.2 Problem Statement
Chapter 2: Literature Review
    2.1 Motif Discovery
        2.1.1 Non-discriminative Motif Discovery Algorithms
        2.1.2 Discriminative Motif Discovery Algorithms
    2.2 Set Cover
        2.2.1 Set Cover Problem
        2.2.2 Hitting Set Problem
Chapter 3: Algorithmic Approaches
    3.1 Set Cover Methods
        3.1.1 Weighted Greedy Set Cover
        3.1.2 Weighted Relaxed Greedy Set Cover
        3.1.3 Weighted Modified Greedy Approach Set Cover
        3.1.4 Weighted Hill Climbing Set Cover
        3.1.5 Weighted Random Set Cover
        3.1.6 Weighted Simulated Annealing Set Cover
        3.1.7 Integer Linear Programming Formulation
        3.1.8 Linear Programming using Branch and Cut
        3.1.9 Linear Programming Relaxation with Randomized Rounding
    3.2 Filter Methods
        3.2.1 Greedy Removal Method
    3.3 Weight Schemes
        3.3.1 Solution Based Weight Schemes
        3.3.2 Greedy Based Weight Schemes
    3.4 Metrics
        3.4.1 Comparisons
        3.4.2 Classification Metrics
        3.4.3 Ranking
    3.5 Baseline Motif Selection Methods
Chapter 4: Case Studies
    4.1 ENCODE Case Study
    4.2 Rat Estrous Cycle Case Study
    4.3 Brugia Malayi Case Study
Chapter 5: Evaluation of Algorithms
    5.1 Weighted Greedy Set Cover
        5.1.1 Results
        5.1.2 Discussion
    5.2 Weighted Relaxed Greedy Set Cover
        5.2.1 Results
        5.2.2 Discussion
    5.3 Weighted Modified Greedy Set Cover
        5.3.1 Results
