Graph Modification Problems and Their Applications to Genomic Research
Total Page:16
File Type:pdf, Size:1020Kb
Sackler Faculty of Exact Sciences, School of Computer Science Graph Modification Problems and their Applications to Genomic Research THESIS SUBMITTED FOR THE DEGREE OF “DOCTOR OF PHILOSOPHY” by Roded Sharan The work on this thesis has been carried out under the supervision of Prof. Ron Shamir Submitted to the Senate of Tel-Aviv University August 2002 2 Abstract Edge modification problems call for making small changes to the edge set of an input graph in order to obtain a graph with a desired property. These problems play an important role in computer science and have applications in several fields, including molecular biology. In many application areas a graph is used to model experimental data, and then edge modifications correspond to correcting errors in the data: Adding an edge corrects a false negative error, and deleting an edge corrects a false positive error. This thesis deals with theoretical and practical modification problems. We first study the complexity and approximability of edge modification problems on some structured classes of graphs. We show that most of the studied problems are compu- tationally hard, but some have efficient solutions when restricting the degrees in the input graph. We then give a polynomial approximation algorithm for the classical minimum fill-in problem which has applications in numerical algebra. We provide fast algorithms for recognizing certain properties on dynamically changing graphs, with applications to physical mapping of DNA. We study a graph sandwich problem arising in phylogeny reconstruction and devise an efficient algorithm for it. Finally, we develop a new clustering algorithm which combines probabilistic and graph the- oretic reasoning. The algorithm was implemented and we report on its successful application in a variety of gene expression experiments as well as other biological problems. 3 4 Contents 1 Introduction 11 1.1 MotivationandBackground . 11 1.2 SummaryofResults ........................... 16 1.3 Preliminaries . 20 1.3.1 Definitions............................. 20 1.3.2 GraphClasses........................... 21 2 Complexity Analysis 23 2.1 Introduction................................ 23 2.2 BasicResults ............................... 25 2.3 NP-HardModificationProblems. 27 2.3.1 ChainGraphs........................... 27 2.3.2 ChordalGraphs.......................... 28 2.3.3 AT-FreeGraphs.......................... 29 2.3.4 ClusterGraphs .......................... 31 2.3.5 AGeneralNP-HardnessResult . 34 2.4 PolynomialAlgorithms .......................... 36 2.4.1 2-Cluster Deletion . 36 2.4.2 BoundedDegreeGraphs . 37 2.5 Approximating 2-Cluster Editing . 38 5 6 CONTENTS 2.6 Inapproximability Results . 39 3 Approximating the Minimum Fill-In 43 3.1 Introduction................................ 43 3.2 Preliminaries . 46 3.3 Improvements to the Partition Algorithm . 49 3.4 TheApproximationAlgorithm. 51 3.5 BoundedDegreeGraphs .. .. .. 53 3.6 ReducingtheKernelSize.. .. .. 54 3.7 An Approximation Algorithm for Chain Completion . 59 4 Dynamic Recognition Algorithms 61 4.1 Background ................................ 62 4.2 ProperIntervalGraphRecognition . 63 4.2.1 Introduction............................ 63 4.2.2 Preliminaries . 65 4.2.3 TheDataStructure. .. .. 67 4.2.4 AVertex-IncrementalAlgorithm. 69 4.2.5 AnEdge-IncrementalAlgorithm . 75 4.2.6 A Fully Dynamic Algorithm . 78 4.2.7 Maintaining the Connected Components . 82 4.2.8 TheLowerBounds ........................ 83 4.3 CographRecognition ........................... 86 4.3.1 Introduction............................ 86 4.3.2 Preliminaries . 87 4.3.3 AReduction............................ 88 4.3.4 Cographs ............................. 88 4.3.5 ThresholdGraphs......................... 94 CONTENTS 7 4.3.6 Trivially Perfect Graphs . 95 5 Incomplete Directed Perfect Phylogeny 99 5.1 Introduction................................100 5.2 Preliminaries . 104 5.3 Characterizations of Explainable Binary Matrices . 106 5.3.1 ForbiddenSubgraphCharacterization . 106 5.3.2 Forbidden Submatrix Characterizations . 108 5.4 AlgorithmsforSolvingIDP . .110 5.4.1 AlgorithmA............................111 5.4.2 AlgorithmB............................115 5.4.3 GreedyApproachFails . .117 5.5 Determining the Generality of the Solution . 117 5.6 An Application to Biological Data . 125 6 Clustering Gene Expression Data 127 6.1 Introduction................................128 6.2 BiologicalBackground . .129 6.2.1 cDNAMicroarrays . .. .. .130 6.2.2 Oligonucleotide Microarrays . 130 6.2.3 Oligonucleotide Fingerprinting . 131 6.3 MathematicalFormulationsandBackground . 132 6.3.1 AssessmentofSolutions . .134 6.4 ApproachestoClustering. .136 6.4.1 Hierarchical Clustering . 137 6.4.2 K-Means..............................138 6.4.3 HCS ................................139 6.4.4 CAST ...............................141 8 CONTENTS 6.4.5 Self Organizing Maps . 141 6.5 TheCLICKClusteringAlgorithm . .143 6.5.1 The Probabilistic Framework . 144 6.5.2 The Basic CLICK Algorithm . 146 6.5.3 Computing a Minimum Cut . 149 6.5.4 The Full Algorithm . 150 6.5.5 HandlingLargeandPartialDatasets . 153 6.5.6 FingerprintDataEnhancements . .155 6.5.7 Implementation and Simulation Results . 156 6.5.8 Limitations of CLICK . 157 6.6 Applications to Biological Data . 161 6.6.1 GeneExpression .........................161 6.6.2 cDNA oligo-fingerprints . 163 6.6.3 ProteinClasses ..........................165 6.6.4 ABlindTest ...........................167 6.7 Application to Ataxia-Telangiectasia . 168 6.7.1 The Ataxia-Telangiectasia Disease . 168 6.7.2 Experimental Design and Data Preprocessing . 170 6.7.3 Tissue Clustering . 170 6.7.4 GeneClustering..........................171 6.7.5 Discussion.............................174 6.8 IdentifyingRegulatoryMotifs . .175 6.9 Tissueclassification............................177 6.10 The EXPANDER Clustering and Visualization Tool . 182 6.10.1 ClusteringMethods. .182 6.10.2 Matrix Visualizations . 182 6.10.3 Clustering Visualizations . 184 CONTENTS 9 6.10.4 FunctionalEnrichment . .185 10 CONTENTS Chapter 1 Introduction In this chapter we introduce graph modification problems, and provide background on previous studies on such problems. We summarize the results of the thesis, and close with preliminaries and basic definitions on graph theoretic notions. 1.1 Motivation and Background Edge modification problems on graphs play an important role in computer science and have applications in several fields, including molecular biology. This thesis consists of two main parts: The first, theoretical part studies the complexity and approximability of edge modification problems. The second, applied part highlights the applications of such problems to genomic research. Problem definition: Edge modification problems call for making small changes to the edge set of an input graph in order to obtain a graph with a desired property. They include completion, deletion and editing problems. Let Π be a graph property. In the Π-Editing problem the input is a graph G =(V, E), and the goal is to find a minimum set F V V such that G′ =(V, E F ) satisfies Π, where E F denotes ⊂ × △ △ the symmetric difference between E and F , i.e., E F (E F ) (F E). In the △ ≡ \ ∪ \ Π-Deletion problem only edge deletions are permitted, i.e., F E. The problem ⊆ is equivalent to finding a maximum subgraph of G with property Π. In the Π- Completion problem one is only allowed to add edges, i.e., F E = . Equivalently, ∩ ∅ we seek a minimum supergraph of G with property Π. 11 12 CHAPTER 1. INTRODUCTION Motivation: Graph modification problems are fundamental in graph theory. Already in 1979, Garey and Johnson mentioned 18 different types of vertex and edge modification problems [69, Section A1.2]. Edge modification problems have applications in several fields, including molecular biology and numerical algebra. In many application areas a graph is used to model experimental data, and then edge modifications correspond to correcting errors in the data: Adding an edge corrects a false negative error, and deleting an edge corrects a false positive error. We summarize below some of these applications. Definitions of the graph classes are given in Section 1.3. Interval modification problems have important applications in physical mapping of DNA (see [22, 33, 80, 84]). Since direct sequencing of large DNA molecules is currently infeasible, they are first cut into smaller fragments. In this process the order of the fragments is lost, and a major problem is to reconstruct it. One way to reconstruct the order is to test for any two fragments whether they overlap, and use this information for deducing the fragments’ order. One can model the resulting problem as follows: Construct a graph G whose vertices correspond to fragments and there is an edge between two vertices if and only if their corresponding fragments overlap. Ideally, G would be an interval graph and the reconstruction problem would translate into that of finding a realization for G. However, experimental data is error-prone and, hence, G is only close to being an interval graph. Depending on the technology used and the kind of experimental errors, completion, deletion and editing problem arise, both for interval graphs and for unit interval graphs. The chordal completion problem, also called the minimum fill-in problem, arises when numerically performing a Gaussian elimination on a sparse symmetric positive- definite matrix [164]. Since the time of the computation and its storage needs depend on the sparseness of the matrix, it is desirable to find an elimination order