ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks
Dan Gusfield
With contributions from Charles H. Langley, Yun S. Song, and Yufeng Wu

The MIT Press
Cambridge, Massachusetts
London, England

The metal tree sculpture on the cover was created by Shawnie Briggs of Winters, California.

Contents

Preface

1 Introduction
  1.1 Combinatorial Genomes and the Grand Challenge
  1.2 Networks
    1.2.1 Recombination and Networks
    1.2.2 Why Networks?
      1.2.2.1 Pedigrees
      1.2.2.2 Back to Sequences
      1.2.2.3 Adding in Mutation and Recombination
      1.2.2.4 Transmission Paths Form (Parts of) a Genealogical Network
      1.2.2.5 Genealogical Networks Relate Sequences, Not People
  1.3 The Central Thesis of the Book
  1.4 Fundamental Definitions
    1.4.1 Mutation, Infinite Sites, Perfect Characters, and Binary Sequences
      1.4.1.1 The Infinite-Sites Model
      1.4.1.2 SNPs
      1.4.1.3 Perfect Characters and Homoplasy
      1.4.1.4 Perfection Is an Abstraction
      1.4.1.5 And, We Can Often Incorporate Homoplasy
  1.5 The Observed Data
  1.6 Graph Definitions
    1.6.1 Trees and DAGs: The Most Central Graphs in This Book
      1.6.1.1 Trees and Subtrees
      1.6.1.2 Directed Acyclic Graphs (DAGs)
    1.6.2 Genealogical Networks and ARGs: First Definitions
  1.7 The Book

2 Trees First
  2.1 Rooted Perfect Phylogeny
    2.1.1 Alternative Definitions
    2.1.2 The Perfect-Phylogeny Problem and Solution
      2.1.2.1 An Alternate Statement of the Perfect-Phylogeny Theorem
  2.2 Known, Non-Zero, Roots
  2.3 Root-Unknown Perfect Phylogeny
    2.3.1 The Four-Gametes Theorem
      2.3.1.1 An Alternate Reduction
    2.3.2 Uniqueness of an Undirected Perfect Phylogeny
  2.4 Splits-Equivalence
    2.4.1 The Existence Problem
  2.5 General References on Phylogenetic Trees

3 A Deeper Look at Recombination
  3.1 Biological/Physical Context
    3.1.1 Crossing Over
    3.1.2 Double-Crossover, Gene-Conversion, and Multiple-Crossover Recombination
    3.1.3 The Physical Context of SNP Sequences
  3.2 Algorithmic/Mathematical Context
    3.2.1 Representing a History of Recombinations and Mutations
    3.2.2 Ancestral Recombination Graphs (ARGs)
    3.2.3 Formal Definitions for a Genealogical Network and an Ancestral Recombination Graph
      3.2.3.1 Where Are the Meioses?
      3.2.3.2 Specializing Genealogical Networks to Ancestral Recombination Graphs
      3.2.3.3 Double-Crossover Recombination Models Back and Recurrent Mutation
      3.2.3.4 Additional Helpful, but Not Limiting, Assumptions
    3.2.4 There Is No ARG-Feasibility Problem
  3.3 Recombination Minimization: The Core Problem
  3.4 Why Do We Care about Rmin(M) and MinARGs?
    3.4.1 A Robust Literature
    3.4.2 Networks Replacing Trees
  3.5 ARGs Beyond Plants and Animals
  3.6 Mind the Gap
    3.6.1 The Gap Between Total Recombinations and Rmin(M)
    3.6.2 Undetectable Recombination
    3.6.3 Back to the Gap. What’s in it?
    3.6.4 A Modest Proposal: Locally Inconsequential Recombination

4 Exploiting Recombination
  4.1 Haplotypes and Genotypes
  4.2 Genetic Mapping by Linkage Analysis
    4.2.1 Recombinant Gametes
      4.2.1.1 Recombination Fractions
      4.2.1.2 Deducing the Linear Order and Rough Locations
      4.2.1.3 Complications
      4.2.1.4 How Could Sturtevant Recognize Recombinant Gametes?
      4.2.1.5 Sturtevant’s Success
      4.2.1.6 Linkage Mapping: In Summary
  4.3 Signatures of Positive Selection
    4.3.1 Positive Selection
    4.3.2 Identity by Descent Without Recombination
    4.3.3 Recombination and Haplotype Length: The Recombination Clock
  4.4 Idealized Introduction to Association Mapping
    4.4.1 Association Mapping and ARGs
    4.4.2 Association Mapping of a Simple-Mendelian Trait
    4.4.3 Association Mapping: In Summary

5 First Bounds
  5.1 Introduction to Bounds
  5.2 Lower Bounds on Rmin
    5.2.1 First Observations
    5.2.2 The HK Bound
      5.2.2.1 The Root-Known Version of the HK Bound
      5.2.2.2 Computing the HK Bound
      5.2.2.3 Another Characterization of the HK Bound
    5.2.3 The Haplotype Lower Bound
      5.2.3.1 Removing Fully-Compatible Characters Usually Boosts the Haplotype Bound
    5.2.4 The Composite Bound Method
      5.2.4.1 Computing and Using the Composite Bound
      5.2.4.2 The Composite Bound Greatly Boosts the Haplotype Bound
      5.2.4.3 Applying the Haplotype Bound to Subsets of Sites: A Major Improvement
    5.2.5 The Optimal RecMin Bound
    5.2.6 Program RecMin
    5.2.7 HapBound
      5.2.7.1 Finding S*(I) by Integer Linear Programming
      5.2.7.2 Program HapBound: Speedups and Extensions
      5.2.7.3 HapBound Can Often Compute Larger Bounds Than the Optimal RecMin Bound
      5.2.7.4 Self-Derivability and Its Use
    5.2.8 Lower Bounds for the LPL and ADH Datasets
      5.2.8.1 The Human LPL Data in More Detail
    5.2.9 The Rank Bound
    5.2.10 Lower Bounds and Gene Conversion
      5.2.10.1 Solving the Gene-Conversion Coverage Problem: The Special Case of Unbounded Tract Lengths
      5.2.10.2 Improving the Bounds
      5.2.10.3 An ILP Formulation for Bounded t
    5.2.11 More Lower Bounds Later
  5.3 A Sharp Upper Bound on Rmin(M)

6 Fundamental Structure
  6.1 Incompatibility and Conflict Graphs
  6.2 Connected Components
    6.2.1 A (Hopefully) Motivating Preview
    6.2.2 Back to the Formal Treatment
    6.2.3 The Key Structural Result
    6.2.4 Adding New Sequences
    6.2.5 The Results Apply in Many Contexts
  6.3 Fast Identification of Connected Components
    6.3.1 Computing All the L(c) Values in O(nm) Total Time
    6.3.2 Identifying a Conflicting Pair (L(c̃), c) or (c, c̃) in Constant Time
    6.3.3 The Case of the Incompatibility Graph

7 First Uses of Structure
  7.1 Connected-Component Lower Bound
    7.1.1 A Root-Known Version of the Connected-Component Lower Bound
    7.1.2 Use of the Connected-Component Lower Bound in the Composite-Bound Method
    7.1.3 Extensions to Other Biological Phenomena
  7.2 Full Decomposition
    7.2.1 Cycles That Share Edges Form Blobs in ARGs
    7.2.2 The Backbone Tree of an ARG
    7.2.3 The Full-Decomposition Theorem and Its Reverse
  7.3 Proof of Full Decomposition
    7.3.1 The Superstates of M and the Matrix W
    7.3.2 Using T(M) to Create a Fully Decomposed ARG for M
    7.3.3 Completion of the Proof
    7.3.4 Summary of the Procedure for Finding a Fully Decomposed ARG for M
    7.3.5 Invariant Properties of Fully Decomposed ARGs
      7.3.5.1 A Major Invariant
      7.3.5.2 The Number of Recombination Nodes in a Fully Decomposed ARG
    7.3.6 The Reverse of the Full-Decomposition Theorem
    7.3.7 The Root-Known Full-Decomposition Theorem
  7.4 The Utility of Full Decomposition
    7.4.1 Decomposing the MinARG Problem
    7.4.2 Building the “Most Treelike” ARGs
    7.4.3 Is There Always a Fully Decomposed MinARG?
  7.5 Broader Implications and Applications
    7.5.1 Superstates Succeed and Succeed
    7.5.2 Superstates Play a General Role

8 Galled Trees
  8.1 Introduction to Galled Trees
    8.1.1 Galled Trees: A Biologically and Algorithmically Motivated Structural Restriction
    8.1.2 Motivation for Galled Trees
    8.1.3 The Galled-Tree Problems
  8.2 First Major Results
  8.3 Efficient Algorithm
    8.3.1 The Algorithm
      8.3.1.1 Testing Whether a Single Recombination Cycle Suffices
  8.4 Essential Uniqueness
  8.5 Root-Unknown Algorithm
    8.5.1 The Root-Unknown Problem Reduces to the Root-Known Problem
    8.5.2 A Faster Solution to the Root-Unknown Problem
      8.5.2.1 Correctness and Time Analysis
  8.6 Extensions
    8.6.1 Multiple Crossovers Model Complex Biological Phenomena
  8.7 Galled Trees and Lower Bounds
  8.8 Further Speed-ups
    8.8.1 The Key Observations
    8.8.2 Summarizing the Improved Time Bound
  8.9 Further Limitations
  8.10 A Concise Condition
    8.10.1 The Root-Known Case
      8.10.1.1 Proof of Theorem 8.3.4
    8.10.2 Extension to Multiple Galls
  8.11 Character Removal
  8.12 Back Mutation

9 General Construction
  9.1 Construction by Destruction
    9.1.1 The Perfect-Phylogeny Case with the All-Zero Ancestral Sequence