
Scalable, High-Performance Forward Time Population Genetic Simulation

A dissertation submitted to the

Division of Research and Advanced Studies of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the Department of Electrical Engineering and Computer Science of the College of Engineering and Applied Sciences

March 26, 2018

by

Patrick P. Putnam, BSCS, University of Cincinnati, 2008

Thesis Advisor and Committee Chair: Dr. Philip A. Wilsey

Abstract

Forward-time population genetic simulators are computational tools used in the study of population genetics. Simulations aim to evolve the genetic state of a population relative to a set of genetic models that reflect the processes that occur in nature under various configurations. Often, these simulations are limited to evolutionary scales that can be represented within the memory space of, and feasibly computed on, a standard workstation computer. This presents a general challenge of how to represent the genetics of a population so that evolutionary scenarios of sufficient scale can be performed on a memory constrained system. In addition, as the evolutionary scales increase, so too does the computational time necessary to complete the simulation.

This work considers the general problems of scale and performance as they relate to forward-time population genetic simulation. It explores the representation of a population from the perspective of a graph. Improved memory utilization and computational performance are achieved through the use of a binary adjacency matrix representation of the graph. The use of this representation is generally uncommon in forward-time population genetic simulation.

Further improvements are made to the performance of the simulator through the use of parallel computation. This work considers a forward-time population genetic simulation from both a task- and a data-parallel perspective. Each of these perspectives presents certain challenges and offers different levels of performance gains. The utilization of the binary adjacency matrix representation enables each of these parallel approaches to be achieved.

Finally, although scale and performance improvements are enabled through the use of a binary adjacency matrix representation of the graph, it does have limits in forward-time population genetic simulation. These limits are related to the density of the graph. This work offers a situation where this representation would not be beneficial.


Acknowledgements

I would like to express my deepest gratitude to my advisor, Dr. Philip A. Wilsey. His continued support and guidance throughout the course of my studies have been immeasurable. He has been an invaluable asset over the years and has demonstrated an overwhelming level of patience as I worked towards completing my education while advancing my career. I would also like to thank my committee member and manager, Dr. Ge Zhang, without whom I would not have started down this topic of study. Finally, I would like to thank my committee members Dr. Fred Beyette, Dr. Yizong Cheng, and Dr. Karen Davis for their valuable time and involvement in my journey.

Contents

1 Introduction
1.1 Background
1.2 Computer Simulations
1.3 This work
2 Population Genetics
2.1 Change in a Population
2.1.1 Mutation
2.1.2 Recombination
2.1.3 Selection
3 Base design of a Forward-Time Population Genetic Simulation
3.1 Graph Interpretation of a Population
3.1.1 Genome Sequence Representation
3.1.2 Adjacency List Representation
3.1.3 Adjacency Matrix Representation
3.2 Genetic Models
3.2.1 Mutation Model
3.2.2 Recombination
3.2.3 Phenotype Evaluation
3.2.4 Fitness
3.2.5 Selection
3.3 Results
3.3.1 Evolutionary Scenarios
3.3.2 Simulators and Workstation Configuration
3.3.3 Memory Utilization Comparison
3.3.4 Medium Scale Simulation
3.3.5 Large Scale Simulation
4 Task Parallelism in Forward-Time Population Genetic Simulation
4.1 Task Parallelism
4.1.1 Partitioning
4.1.2 Batch Model
4.1.3 Pipeline Model
4.2 Results
4.2.1 Simulator and Hardware Configuration
4.2.2 Evolutionary Scenarios
4.2.3 Small Evolutionary Scale
4.2.4 Medium Evolutionary Scale
4.2.5 Large Evolutionary Scale
5 Data Parallelism in Forward-Time Population Genetic Simulation
5.1 Graphics Processing Units
5.1.1 GPU vs CPU
5.1.2 Kernels
5.1.3 GPU Challenge
5.2 Data Parallel Life Cycle Models
5.2.1 Random Number Generation
5.2.2 Recombination
5.2.3 Orchestration
5.3 Results
5.3.1 Evolutionary Scenario
5.3.2 Small Scale Simulation
5.3.3 Medium Scale Simulation
5.3.4 Large Scale Simulation
5.3.5 Extra Large Scale Simulation
6 Limit for use
6.1 based simulations
7 Conclusion

List of Figures

2.1 Example of genetic crossover
3.1 High-level abstraction of a population as a graph
3.2 Small Scale Simulation Comparison
3.3 Medium Scale Simulation Comparison
3.4 Large Scale Simulation Comparison
4.1 Batch Model Concept
4.2 Pipeline Model Concept
4.3 QTLSimMT - Small Scale; Neutral
4.4 QTLSimMT - Small Scale; Selected
4.5 QTLSimMT - Medium Scale; Neutral
4.6 QTLSimMT - Medium Scale; Selected
4.7 QTLSimMT - Large Scale; Neutral
4.8 QTLSimMT - Large Scale; Selected
5.1 QTLSimCUDA - Small scale simulation performance results
5.2 QTLSimCUDA - Medium scale simulation performance results
5.3 QTLSimCUDA - Large scale simulation performance results
5.4 QTLSimCUDA - Extra large scale simulation performance results
6.1 graph

List of Tables

3.1 Simulation scales
3.2 Hardware configuration - Workstation #1
3.3 Breakdown of memory footprint of genome graph representations
3.4 Memory comparison of Genome Graph representations
4.1 Hardware Configuration - Workstation #2
4.2 Common Population Configuration
4.3 Evolutionary scales and threading configuration
5.1 GPU specification
5.2 Extra Large evolutionary scale

List of Algorithms

1 Outline of a FTPGS
2 Mutation Algorithm
3 Unordered Adjacency Recombination Algorithm

Chapter 1

Introduction

1.1 Background

Populations of organisms are complex systems. Loosely defined, a population is a collection of organisms that are grouped together, typically by their physical location. Changes in the environment where a population resides help to govern the cyclic process of a population's survival. That is, as the environment changes, individuals in a population are forced to adapt. Those individuals who are better suited for the current environment, or are better fit, are more likely to produce offspring that will survive in the next generation. Offspring inherit from their parents the attributes, or traits, that have enabled them to survive. As each new generation matures and procreates, the population continues to survive, changing with the environment.

The phenotype is the outward manifestation of the internal genetic structure, or genome, of an individual as it interacts with the environment. In other words, the phenotype is a collection of observed traits. Traits may be associated with specific heritable units of genetic material called genes.

Offspring inherit one copy of each gene from each of their parents. Naturally occurring processes, such as recombination and mutation, may alter the genetic material of a parent's gene. This may lead to an offspring inheriting a variant of their parent's gene. Even so, there are relatively few genetic variations between a human child's genome and those of its parents [1]. At the population level, however, there may exist multiple forms of the same gene. Each specific form of a gene is referred to as an allele. Alleles are a basic unit of study in Population Genetics.

The field of Population Genetics is generally interested in studying the genetic landscape of a population of organisms over time. In some respects, it is the variations at the genetic level of a population that determine its ability to survive. The idea being that a variation causes change in the traits of an individual. The changes may either increase, decrease, or not affect the ability of the individual to survive. In essence, events occurring at the micro-scale have an impact on the macro-scale. An active area of research attempts to quantify the role of genetic variation in determining a trait, and how the variations spread through a population over time. In effect, researchers in this field study a population's distribution of genetic variation frequency, or allele frequency, in order to better understand the interconnection between its micro- and macro-scales.

Studies of this nature involve having to collect and analyze genetic material, as well as measuring various traits from many individuals across multiple generations of a population. Although technological advances have significantly improved the data collection process in terms of both accuracy and speed, nature still imposes a delay between generations of a population. Furthermore, large sample sizes are often necessary before statistical inferences about the distribution of alleles in the population can be made. In effect, time and the volume of data present significant challenges in the field of Population Genetics. As a result, populations of fast reproducing model organisms are often studied. Genetic models based upon evidence from these populations are often constructed and then considered in the context of populations of other organisms. Translating genetic models between organisms is not always mathematically tractable. Computer simulations are used in Population Genetics to evaluate mathematically intractable genetic models.

1.2 Computer Simulations

The synthetic data sets produced by computer simulations may be used to predict properties of a population that result from mathematically intractable models. In addition, the ability to produce large-volume data sets aids in assessing the validity of statistical models, drawing statistical inferences from them, and performing power analysis [2]. Over the years, many simulators have been developed [3]. It is not uncommon to find simulators that overlap in their set of genetic models, but provide different levels of configurability, analytic capabilities, and general performance [4].

Population Genetic Simulators [5] fall into two classes: coalescent based and forward-time. Coalescent based simulators start from a current population and work backward through time to derive a common ancestral population using retrospective stochastic models. These simulators enable large populations of organisms to be efficiently explored. Often, though, incorporating new genetic models into this simulation style can be challenging.

Forward-time simulators take the opposite approach by starting from an ancestral population and constructing future generations following the abstract process found in nature. The association with the natural process often allows for a simpler, more general design to be used in the simulation process. This often enables novel genetic models to be more easily incorporated into an evolutionary scenario. These simulations also offer the ability to produce a more fine-grained view of a population.

The granularity offered in forward-time simulations increases both the memory and the run-time needed to perform a simulation. In general, the memory requirement of a simulation is proportional to the size of the population and the size of the genome being simulated. This has an expected cascading effect on the overall performance of a simulation. As a result, relatively small increases to the scale of an evolutionary scenario can test the limits of what is an acceptable amount of resources for a simulation. Unfortunately, the scalability of a simulator is often not a primary concern when a simulator is developed. This was a motivating observation behind this work.

1.3 This work

Often, Population Genetic researchers are not interested in the computational aspects of a simulation. Rather, they are motivated by what the simulation enables them to learn, and rightfully so. A side-effect of this, however, is that they develop simulators that are often designed for specific evolutionary scenarios using sub-optimal data structures and algorithms. Although the simulator may work for the intended scenario, increasing the evolutionary scale tends to result in requiring significantly more computational resources and processing time.

To account for this problem, several forward-time simulation frameworks [6–8] have been developed that provide generalized algorithms and generalized data structures for common objects and natural processes that describe a population. These frameworks can be customized and are highly configurable, allowing researchers with limited programming experience to develop a simulator for the specific evolutionary scenario they are studying without having to re-invent the wheel. Researchers often rely upon the framework to handle issues related to the scalability of their simulator.

Although frameworks provide functionality intended to enable the same goal, they differ in their approach of how best to reach that goal. In a general sense, frameworks provide different abstractions of both the macro- and micro-scales of a population. From a computational perspective, the representation of a population at the micro-scale plays a more significant role in the scalability of a simulator than the macro-scale. That is, an individual's genome forms the basis of any population simulation. Its in silico representation influences the algorithmic design of the genetic models, and any computational optimization that may be possible.

One area that is often underutilized in forward-time population genetic simulation is parallel processing. This underutilization is unfortunate, as nature is an inherently parallel system. It stands to reason, then, that a system intended to mirror nature should also be inherently parallel. Indeed, simulation frameworks such as simuPOP [6] do enable the use of multiple threads. However, there are some stated situations under which multiple threads cannot be used. As part of this work, a deeper exploration into the parallel opportunities available in a forward-time population genetic simulation is conducted. Parallelism is investigated from the perspective of a task-level parallel solution leveraging multiple CPU threads. It is also considered from a data parallel perspective using a GPU.

This work focuses on the development of forward-time population genetic simulators that are efficient and scalable. Chapter 2 provides a general overview of the genetics of a population and the natural processes modeled in a forward-time population genetic simulation. Chapter 3 expands upon the design of a forward-time population genetic simulator and the algorithmic impact of the graph-based representation. It provides a baseline comparison with the representations provided in the existing simulation frameworks. Chapter 4 looks at using task parallelism for a population genetic simulation. Chapter 5 explores the use of a GPU to take advantage of data parallelism within a simulation. Chapter 6 explores some of the limitations that result from the simulator design followed. Chapter 7 concludes the work.

Chapter 2

Population Genetics

The genome of an individual is defined by a set of chromosomes. Each chromosome is a double-stranded chain of nucleic acids, or nucleotides. That is, a chromosome is two sequences of genetic material joined together. However, due to the molecular structure of the nucleic acids, only complementary nucleic acids may join with one another. Therefore, the two sequences of a chromosome are reverse complements of one another. This natural pairing of nucleotides allows a chromosome to be expressed as a single sequence of base pairs, or bases, instead of two sequences of nucleotides.

The length of each chromosome varies, with some being tens of thousands of nucleotides and others being hundreds of millions. The genome of an organism is the collection of a set of chromosomes. It may consist of billions of nucleotides.

Encoded within each chromosome is a set of genes. A gene is a sub-sequence of genetic material that serves a functional purpose in the organism. It is the genes of an organism that govern the state of a trait. Variant forms of a gene are called alleles, and refer to a specific state of the chromosomal sequence for the gene. From a scale perspective, the human genome has roughly 6.4 billion bases spread over 46 chromosomes encoding an estimated 20,000 genes.

Although a genome may be defined in terms of independent chromosomes, some organisms possess multiple copies of homologous chromosomes. Humans, for instance, are diploid organisms, meaning that we possess 2 sets of chromosomes, with each parent contributing one set of their chromosomes to their offspring. In other words, the human genome is more accurately described as 22 pairs of homologous chromosomes and a pair of sex chromosomes. From a simulation perspective, it is convenient to abstract an organism in terms of single sets of chromosomes, or haploid genomes. That is, organisms that have multiple copies of chromosomes, such as humans, can be viewed as having multiple haploid genomes instead of having one set of grouped chromosomes.

2.1 Change in a Population

Naturally occurring processes act to modify the chromosomes a parent passes along to their offspring. Some of these processes occur at the microscopic layer of a population, while others are more macroscopic. Two of the more common microscopic processes are mutation and recombination. A mutation may be considered a transformation that occurs at a single site of a genome. Recombination is a process by which genetic material is exchanged between two homologous chromosomes of a parent's genome. Selection is a more macroscopic process that controls population change by identifying which individuals of a population will mate with one another. A simulation aims to provide algorithmic models for these processes, enabling one to vary the assumptions made about each process.

2.1.1 Mutation

Mutation is a micro-scale process that results in a state change of a genetic location. Historically, there are two mutation models that are commonly considered in Population Genetics. The infinite site mutation model [9] assumes that mutation can occur at an infinite number of genetic sites. Each mutation is considered to always be a novel site. It estimates the mutation rate θ of the population as Equation (2.1), in terms of the effective population size, N_e, and the number of mutations found in a random sequence of the population, µ*. This in turn can be estimated from the length of the sequence, k, and the per base mutation rate, µ. For a human population, the per base mutation rate is on the order of µ = 10^-8 [1].

θ = 4N_e µ*            (2.1a)

µ* = kµ                (2.1b)
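As a concrete illustration of Equation (2.1), take the human per base rate quoted above, µ = 10^-8, together with an assumed sequence length of k = 10^6 bases and an assumed effective population size of N_e = 10^4 (both example values, not estimates from this work):

µ* = kµ = 10^6 × 10^-8 = 10^-2

θ = 4N_e µ* = 4 × 10^4 × 10^-2 = 400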

The infinite allele model [10] considers that a locus, or fixed position of a genome, will undergo mutation. When mutation occurs, a novel allele is formed. This results from an underlying assumption that there are an infinite number of genetic states for each locus. It estimates a lower bound on the number of alleles a population can support as Equation (2.2), where u is the allelic mutation rate.

n = 4N_e u + 1            (2.2)

Both of these models follow a neutral theory of mutation [11–14]. A neutral mutation is a genetic change that does not affect the fitness of the individual. That is, mutation does not change an individual’s ability to survive and reproduce. From a more general perspective, one might consider that each mutation carries with it an effect size, or a quantitative measure of the strength of the mutation. Under the neutral theory, all mutations have a zero effect size.

2.1.2 Recombination

Another process that introduces genetic change to a population is recombination. In nature, this process physically exchanges genetic material between homologous chromosomes during a type of cell division called meiosis. Homologous chromosomes are the maternal and paternal copies of a chromosome that pair together during reproduction. The physical exchange of genetic material is referred to as a crossover.

Figure 2.1: Example of genetic crossover

Figure 2.1 provides a conceptual view of genetic crossover. Two homologous chromosomes align with one another during meiosis. At some point the two cross with one another, and genetic material from one chromosome is effectively swapped with genetic material from the other. The end result is two chromosomes that are a mosaic pattern of paternal and maternal genetic material.

Meiosis results in four daughter cells, each containing a haploid genome of the parent cell. One of these four cells may potentially find its way into the next generation as part of an offspring.

The recombination rate, ρ, varies by organism [15] and may even vary between the different chromosomes of an organism. In humans, if two genetic loci are separated by, on average, 1 million base pairs, then there is about a 1% chance that they will be separated by a crossover event in the next generation. Put another way, there is, on average, 1 crossover event per 100 million base pairs of the human genome.

2.1.3 Selection

The genetics of a population is also influenced by which individuals reproduce to the next generation. Modeling how the individuals of a population, in effect, select a mate is a complex and challenging problem. Often, the modeling of selection is reduced to computing the fitness of each individual, then pairing individuals of the population based upon their respective fitness values.

The fitness of an individual is a quantitative measure of an individual’s reproductive success, or their ability to produce offspring that will survive to maturity. Fitness can be defined in terms of either the phenotype or genotype in a given environment. The genetics of an individual interact with environmental forces to define the phenotype of an individual (Equation (2.3)) [16, 17].

P(henotype) = G(enotype) + E(nvironment) + GE            (2.3)

The phenotype of an individual may consider a set of characteristics, or observable traits. As described earlier, mutation is thought to produce variations in the observed traits of individuals in a population. Each variation carries with it a certain weight, or effect size, for each of the P characteristics considered in the phenotype. It is often assumed that the state of a genetic location has the same effect throughout a population. However, it is not always clear how a single trait may result from the gene-environment interaction. In fact, it is one of the aims of population genetic simulations to provide insight into the interaction between genetic and environmental forces.

Chapter 3

Base design of a Forward-Time Population Genetic Simulation

A forward-time simulation begins with the construction of genetic models describing how a population is expected to change, and with the initialization of an ancestral population. The critical loop of the simulation can be described as a 2-phase process: a child population generation phase and an analytic phase. The first phase begins by selecting parents from the existing population. The parents undergo a reproduction process. This process transforms the input parental genome by evaluating the genetic models, specifically the models for mutation and recombination, to produce a partial genome from each parent. The partial parental genomes are then combined to form the genome of a child. A viable child is then added to the child population.

Once the child population reaches a specific size, or another condition has been satisfied, the analytic phase begins. This phase computes statistics about the child population that can be used to further direct the simulation, or that are simply recorded for future offline processing. Some statistics may require significant computational time to compute for large populations. Therefore, the analytic phase may optionally be performed with each generation.

The cycle repeats using the child generation as the parents of the next generation. This cycle continues until either a desired T-th generation has been created, or another condition has been satisfied. The general algorithm is given in Algorithm 1.

Algorithm 1 Outline of a FTPGS
P ← {}                            ▷ define the set of parents
C ← ∅
for g = 0 to T do                 ▷ for each generation
    while evaluation of C fails do
        p ← σ(P)                  ▷ selection model
        p′ ← γ(p)                 ▷ reproduce p
        if ψ(p′) then             ▷ test viability
            add p′ to C
        end if
    end while
    save statistics about C
    clear P
    swap C and P
end for
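The loop above maps naturally onto code. The following C++ sketch is a minimal, illustrative rendering of Algorithm 1; the types and the select/reproduce/viable/record functions are hypothetical stand-ins for the models σ, γ, and ψ, not the interface of any particular framework.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Genome {};                         // placeholder for the chosen representation
using Population = std::vector<Genome>;

// Hypothetical stand-ins for the models in Algorithm 1.
Genome select_parents(const Population& parents) { return parents.front(); } // sigma
Genome reproduce(const Genome& p) { return p; }   // gamma: mutation + recombination
bool   viable(const Genome&) { return true; }     // psi: viability test
void   record_statistics(const Population&) {}    // analytic phase

void simulate(Population parents, std::size_t N, std::size_t T) {
    Population children;
    children.reserve(N);
    for (std::size_t g = 0; g < T; ++g) {          // for each generation
        children.clear();
        while (children.size() < N) {              // fill the child population
            Genome c = reproduce(select_parents(parents));
            if (viable(c)) children.push_back(std::move(c));
        }
        record_statistics(children);               // optional per-generation analytics
        std::swap(children, parents);              // children become the next parents
    }
}
```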

Both phases perform operations that either inspect or modify the genome of individuals within a population. For example, during the reproduction process parental genomes are effectively copied and modified to form a child genome. Similarly, a common statistic to compute is the allele frequency spectrum. This involves counting the occurrences of each allele present in the population.

From a computational perspective, the data structure used to represent a genome plays a central role in the scalability of a simulation.

3.1 Graph Interpretation of a Population

As described earlier, a population consists of a set of individuals, with each individual possessing a set of haploid genomes. Each haploid genome consists of a set of genetic locations. Each genetic location has a specific genetic state drawn from a set of genetic states. Similarly, each individual may also be described by a set of traits. A population genetic simulation aims to enable an efficient mapping between the genetic set and the trait set.

From an abstract perspective, the elements of each of these sets are unique objects, and a relationship describes how objects from different sets are associated with one another. In this way, a slightly more general conceptualization of a population is as a graph [18] (Figure 3.1). Individuals, haploid genomes, and mutations can be considered vertices. Edges of the graph represent a simple connected relationship, or adjacency, between the objects of different sets.

Figure 3.1: High-level abstraction of a population as a graph

As a matter of convenience, the sub-graph of the population graph formed from the haploid genome vertices and their associated mutations is referred to as the genome graph. This graph represents the current genetic state of an entire population. As such, it forms the foundation upon which a genetic simulation is built.

There are several observations to be made about the genome graph. In general, the vertex set of a graph considers the entire set of possible elements. That is, it abstractly refers to the set that results from combining the individuals and mutations into a single set of elements. This logically enables edges to exist between individuals, as well as edges to exist between mutations.

Although these edges do have meaning, they generally indicate some alternate layer of relatedness.

For example, individuals may be related to one another as siblings, or mutations may be related as having occurred at the same location. However, these types of relationships can be viewed as being independent of a generation’s genome graph. In effect, the genome graph can be reduced to those relationships between individual genomes and the mutations they possess.

A second observation is that the edges of the graph are simple in that they only represent a connected relationship. That is, an edge only represents whether an individual has a mutation or not. This type of binary relationship is trivially represented by a single bit. For example, when the bit is set the relationship exists; otherwise it does not.

A third observation is that the graph is undirected. A directed relationship means that there is an ordered relationship between adjacent vertices. That is, an edge that exists between vertex A and vertex B is different from an edge that exists between vertex B and vertex A. In the case of the genome graph, this would suggest that an edge between an individual and a mutation is different from an edge between a mutation and an individual, or vice versa. That is, an individual could be paired with a mutation, but that same mutation may not be paired with the same individual. This is counterintuitive. If an individual has a given mutation, then that mutation is part of that genome. Similarly, if a mutation is part of a genome, then that genome has that mutation.

Although the genome graph is undirected, it often suffices to physically represent the graph as being directed. That is, the bi-directional nature of each edge is physically represented as a single directed edge, with the reciprocal edge being logically inferred. This has the effect of reducing the physical space by 50%, as only half the edges need be represented. This is advantageous from a simulation perspective, as most simulated models have a directional flow. That is, a simulation may consider how the population of individuals changes over time, in which case it may make more intuitive sense to change the mutations of a genome rather than change which genomes a mutation exists in. For the purposes of this work, a simulation is assumed to progress from the perspective of adding and removing mutations from a genome.

The choice of computational representation of the genome graph greatly influences how one may choose to implement the genetic models of a simulation, as well as the amount of resources a simulation requires. This work presents a relatively uncommon representation of the genome graph that combines some of the advantages of two commonly used representations. The first common representation is based on the traditional view of a genome as a sequence where the state of every genetic location is present. The second assumes that only the state of mutated locations within a genome needs to be represented.

3.1.1 Genome Sequence Representation

The most basic representation of a genome is as an ordered state sequence. In this representation, each position of the genome is expected to have a uniquely defined state. However, the set of states for each position is finite. Traditionally, the state of each position reflects the nucleotide at that position. A DNA sequence has four possible states for each nucleotide: A for adenine, C for cytosine, G for guanine, and T for thymine. In effect, a genome sequence is represented as a string of characters.

This representation may be generalized to enable different character sets to be used. For example, a gene is identified as a well-defined sub-region of the sequence, and an allele is identified by the current state of the subsequence. Although the set of alleles may be considered unbounded as a result of the infinite allele model [10], from a simulation perspective one might choose to represent the alleles using a sufficiently large integer space. This will be discussed in more detail in a later chapter. Indeed, this representation is offered in simulation frameworks such as simuPOP [6] and NEMO [8]. In many respects, this is a simple, flexible representation of a genome. Furthermore, it is intuitive for most researchers as it directly reflects the structure encountered in nature.

Representing every genetic location has its computational advantages and disadvantages. On the plus side, identifying the state of specific genetic locations becomes a trivial, constant time operation. This is advantageous as several of the common steps of a simulation, such as mutation and phenotype evaluation, either transform or evaluate the state of a given genetic location or contiguous sub-region. Having constant time access to each location enables these operations to be efficiently performed.

On the negative side, the memory for a single sequence scales linearly with the size of the genome. That is, a population of genomes generally requires O(NLb) memory, where N is the size of the population, L is the size of a genome, and b is the number of bits used to represent the state of a genetic location.
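To make the scaling concrete, the following back-of-the-envelope sketch evaluates N·L·b for a set of assumed example values (a diploid population of 10^4 individuals, so 2 × 10^4 haploid sequences, a genome of 10^6 locations, and one byte per location state); these numbers are illustrative, not measurements from this work.

```cpp
#include <cstdio>

int main() {
    // Illustrative values only: 2N = 2*10^4 haploid sequences,
    // L = 10^6 genetic locations, b = 8 bits (one byte) per state.
    const double haploid_genomes = 2.0e4;
    const double genome_length   = 1.0e6;
    const double bits_per_state  = 8.0;

    const double bytes = haploid_genomes * genome_length * bits_per_state / 8.0;
    std::printf("sequence representation: %.1f GB\n",
                bytes / (1024.0 * 1024.0 * 1024.0));
    // ~18.6 GB: linear growth in N and L quickly exceeds workstation memory.
    return 0;
}
```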

From a performance perspective, increasing the size of a genome requires more computational time to perform memory movement operations. Intuitively, it takes longer to copy larger blocks, thus increasing the overall runtime. At the hardware level, the latency of a memory copy operation increases with the block size being copied [19]. This results in degrading performance for simulation steps such as recombination. This process aims to copy variable-length, contiguous subsequences from a parental genome to an offspring genome. In other words, it copies variable-length blocks of memory between memory spaces. Thus, reducing the physical representation of a genome not only saves memory, but also helps to mitigate the effects of some hardware-level overheads.

3.1.2 Adjacency List Representation

A general observation about the genetic structure of a population is that there are far fewer variable genetic locations across a population than there are genetic locations within a genome. Indeed, in [20], an estimate for the number of segregation sites, which are generally referred to as mutations throughout this work, for a population under neutral mutation [11, 13, 14] is given by Equation (3.1).

The estimate is given in terms of the per generation mutation rate, µ, and the effective population size, N_e, for a random sample of n sequences. As an example, for a population of n = N = 10^4 and a per generation mutation rate of µ = 10^-2, the expected number of segregation sites is approximately E[S] = 3,916. This is a significant reduction from the projected genome size of 10^6. In effect, by representing only those sites which differ within a population, a simulator stands to reduce its memory footprint by several orders of magnitude relative to a comparable fixed-length representation.

E[S] = 4N_e µ ∑_{i=1}^{n−1} 1/i            (3.1)

A way to leverage this observation is to represent a genome as a list of mutations. More specifically, a common set of mutations can be maintained and each genome can be represented as a variable-length list of references to elements of the mutation set. This is analogous to an adjacency list representation of a graph. Simulation frameworks, such as FWDPP [7] and simuPOP [6], provide the necessary data structures for representing a genome graph in this manner. However, their implementations are noticeably different from one another. These differences will be described in a later section.
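As a quick sanity check of the E[S] = 3,916 figure quoted above, Equation (3.1) can be evaluated directly; a minimal sketch:

```cpp
#include <cstdio>

// Expected number of segregation sites under the neutral model, Equation (3.1).
double expected_segregating_sites(double Ne, double mu, int n) {
    double harmonic = 0.0;
    for (int i = 1; i < n; ++i)      // sum_{i=1}^{n-1} 1/i
        harmonic += 1.0 / i;
    return 4.0 * Ne * mu * harmonic;
}

int main() {
    // Values from the example in the text: Ne = n = 10^4, mu = 10^-2.
    std::printf("E[S] = %.0f\n", expected_segregating_sites(1.0e4, 1.0e-2, 10000));
    return 0;  // prints approximately 3916
}
```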

From a simulation perspective, an adjacency list offers several useful features. An adjacency list aims to represent a minimal edge space for a given vertex of the graph. This is done by only maintaining those edges, or references to adjacent vertices, that are present in the graph. Thus, the physical size of a graph using this representation is |E| · B, where E is the set of edges and B is the size of the edge representation. In the context of this work, the edge set is dynamic and changes with each generation of a simulation. Certainly, this will contribute most to the in-memory size of the graph. It is of interest, though, to provide an estimate for the minimum number of bits necessary to serve as a vertex reference in the genome graph.

The representation of an edge can vary by application, but it is often reduced to a simple integer value that maps to a specific vertex. In the case of the genome graph, the integer is intended to map to a mutation. To obtain an integer for each mutation one may simply enumerate the set of mutations. That is, if there are M mutations in the set, then each mutation can be trivially assigned an index value in the range [0, M), and that index value can be used as a reference. It follows that B = ⌈log₂(M)⌉ bits are sufficient to represent each index of the set. In practice, it is often necessary to overestimate the set by adding padding bits. This helps to avoid potential issues with overflowing an integer when more mutations exist than were expected. In addition, padding bits may be added to enable better memory alignment. A slightly more accurate representation for the number of bits per edge would be B_p = ⌈log₂(M)⌉ + p, where p is a non-negative integer. With the general necessity of padding bits, and the subjectivity in how many to add, one might wonder whether a more compact representation is possible.

3.1.3 Adjacency Matrix Representation

Another method of representing a graph is as an adjacency matrix. In this representation of a graph, the vertices are assigned a row and column of the matrix. The value in each cell is used to indicate an edge between the corresponding vertices.

This work uses a binary adjacency matrix to represent the genome graph as described in [21].

Each row of the matrix represents a haploid genome of the population. Each column of the matrix represents a mutation. Because the edges of the genome graph are simple, the value of each cell is either a 0 or a 1. This enables a row to be reduced to a bit vector with each bit positionally mapped to a column of the matrix.
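To make this layout concrete, the sketch below shows one plausible C++ rendering of such a packed matrix (the class and member names are hypothetical, not the interface of Clotho [21]): each haploid genome is a row of 64-bit blocks, and a mutation's column index selects a block and a bit within it.

```cpp
#include <cstdint>
#include <vector>

// One plausible layout for a binary adjacency matrix: rows are haploid
// genomes, columns are mutations, one bit per (genome, mutation) edge.
class GenomeGraph {
public:
    GenomeGraph(std::size_t rows, std::size_t cols)
        : blocks_per_row_((cols + 63) / 64), bits_(rows * blocks_per_row_, 0) {}

    void add_edge(std::size_t genome, std::size_t mutation) {
        bits_[genome * blocks_per_row_ + mutation / 64] |= 1ULL << (mutation % 64);
    }
    void remove_edge(std::size_t genome, std::size_t mutation) {
        bits_[genome * blocks_per_row_ + mutation / 64] &= ~(1ULL << (mutation % 64));
    }
    bool has_edge(std::size_t genome, std::size_t mutation) const {
        return ((bits_[genome * blocks_per_row_ + mutation / 64]
                 >> (mutation % 64)) & 1ULL) != 0;
    }

private:
    std::size_t blocks_per_row_;       // 64-bit blocks per haploid genome
    std::vector<std::uint64_t> bits_;  // row-major packed matrix
};
```

Setting, clearing, and testing an edge each reduce to a single shift-and-mask on one 64-bit block.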

The use of bit vectors to represent genetic sequences is not uncommon. Indeed, simuPOP [6] can be configured to use bit vectors to simulate bi-allelic loci. The intent is to enable the simulation of alleles for which there are only two possible states: a reference state and a wild state. By reducing each locus to a single bit, one enables larger populations with a few loci, or smaller populations with many loci, to be simulated. The reduction to a bit vector representation has primarily been limited to use cases where the set of mutations or loci is known.

It is generally uncommon to use a binary adjacency matrix graph representation in a population genetic simulation when the set of mutations is dynamic. This is the case for two reasons. The first reason is related to the expected density of the genome graph. The second reason results from having to actively control the growth of the matrix. The following sections expand upon these reasons.

Graph Density

Intuitively, both the adjacency matrix and the adjacency list intend to represent the same graph.

The choice of when to use one or the other generally depends upon the density of the graph. The density of a simple, directed graph is given by Equation (3.2). Graph density is the relationship between the number of edges, |E|, of the graph and the total number of possible edges. For a vertex space, V , where a vertex cannot be adjacent to itself, or there are no self-looping edges, then the total number of possible edges is |V |(|V | − 1). A graph with a density that is close to zero is referred to as being sparse. Conversely, a dense graph is one where D is close to one.

D = |E| / (|V|(|V| − 1))            (3.2)

In the case of the population genome graph, the vertex space is divided into two sets: a set for the individual genomes, and a set for the mutations of the population. However, there are no edges between individuals within a generation, nor are there edges between mutations. Thus, the total number of possible edges for the population is the product of the sizes of the two sets, 2NM, where M is the size of the mutation set. Substituting these values into the previous equation produces Equation (3.3).

D = |E| / (2NM)            (3.3a)

|E| · B′ ≤ 2NM             (3.3b)

2NMD · B′ ≤ 2NM            (3.3c)

D ≤ 1 / B′                 (3.3d)

Consider the total size of the graph representations relative to one another. As shown by Equation (3.3), the adjacency list graph will offer a more compact representation when there is less than 1 edge for every B′ possible edges. Using values from the earlier adjacency list example, with E[S] = 3,916 = M, a population where the average genome has 326 mutations is considered dense, as there is 1 edge for every 12 possible.
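The decision rule in Equation (3.3) is simple to evaluate mechanically. The sketch below plugs in the numbers quoted above, together with an assumed B′ of 32 bits per adjacency list reference (the value of B′ is an assumption for illustration):

```cpp
#include <cstdio>

int main() {
    const double M         = 3916.0;  // mutation columns (E[S] example)
    const double avg_edges = 326.0;   // average mutations per haploid genome
    const double B_prime   = 32.0;    // assumed bits per adjacency list reference

    // Density D = |E| / (2NM); with |E| = 2N * avg_edges the 2N factor cancels.
    const double D = avg_edges / M;
    std::printf("D = %.4f (1 edge per %.0f possible)\n", D, 1.0 / D);
    std::printf("adjacency list smaller? %s (threshold D <= %.4f)\n",
                D <= 1.0 / B_prime ? "yes" : "no", 1.0 / B_prime);
    return 0;  // here D ~ 1/12 > 1/32, so the bit matrix is more compact
}
```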

Controlling Genome Graph Growth

As previously mentioned, there are several forces that result in changes to the set of mutations present in a population. A mutation model may add novel mutations to a population, or it may add or remove associations between individual genomes and existing mutations. Selection may act to maintain the set of advantageous mutations while removing those that are not, as individuals with non-advantageous mutations may not procreate. Recombination may either act to preserve a mutation, or it may result in the loss of a mutation. Over time these processes can lead to some mutations becoming lost from the population, while others become fixed. A lost mutation is one that is no longer present in any of the haploid genomes of the population, whereas a fixed mutation is one that is present in every haploid genome.

The dynamic nature of the genome graph presents a general challenge in population genetic simulation. If left unchecked, the set of mutations may continually grow. This will result in a sparse graph, which would therefore justify the use of an adjacency list based genome graph. However, the lost and fixed mutations present an opportunity for maintaining a denser representation of the genome graph that would enable the use of an adjacency matrix. The observation being that a fixed or lost mutation can be removed from the genome graph.

Indeed, a simulation scenario may be interested in studying the sets of fixed or lost mutations of a population over time. Thus, it may be necessary to retain information pertaining to these mutation sets. However, it is generally not necessary to retain this information in the genome graph. Instead, information retention can be achieved through the use of secondary sets.

In general, as a mutation becomes fixed or lost within a population, it can be removed from the genome graph and added to an implicit global set of mutations that reflects its state within the population. This retains the information about the relationship between a mutation and its presence or absence within the population. In addition, the mutation's vertex can be removed from the genome graph and, in the case of a fixed mutation, any explicit edges can also be removed. This reduces the size of the genome graph.

The ability to remove a vertex from the graph suggests that the physical space that was allocated for the vertex and any edges can be reclaimed by the system. In practice, however, reclaiming this space is generally a computationally intensive task when an adjacency matrix is used to represent the graph. Removing a mutation vertex from the genome graph is equivalent to removing a column from the adjacency matrix. Reclaiming the physical space allocated to this column would effectively re-order the column space of the matrix and require propagating the changes through the matrix. This is further complicated when columns are reduced to single bits in the compact bit vectors that form the rows of the matrix, as is done in this work. These reasons often prevent a binary adjacency matrix from being used with a dynamic graph.

To avoid re-ordering columns in a binary matrix, this work does not remove a vertex from the graph. Instead, a vertex that is associated with a fixed or lost mutation is considered to be a free vertex, meaning that it is no longer actively used in the genome graph. It is therefore free to be reused to represent a new mutation in subsequent generations of the graph. In effect, each new mutation is assigned to a free vertex of the graph, or column of the matrix.

The assignment of a mutation to a vertex is independent of the genetic position of the mutation.

This results in the column space of the matrix being an unordered set of mutations. The unordered sequence representation impacts the implementation of the genetic models.

3.2 Genetic Models

In the previous chapter, several of the natural processes that drive population change were described. This section aims to provide algorithmic models of these processes. It describes how the use of an adjacency matrix as the genetic representation influences the implementation of each algorithm.

3.2.1 Mutation Model

The first natural process to describe is that of mutation. From a graph perspective, the mutation process may result in changes to both the vertex and edge space of the genome graph. That is, if the mutation is novel, then a new vertex will be added to the space as well as a new edge between a genome and the mutation. Conversely, if the mutation is not novel, then only a new edge is added to the graph.

Algorithm 2 provides the base method used in this work. Each mutation that is generated is assigned the next available free vertex of the mutation space, A. Subsequently, a random genome is identified from the population. Finally, a conflict resolution step is performed.

Algorithm 2 Mutation Algorithm
Require: G the current population's genome graph
Require: M the number of mutations to be added to the population
for m = 0 to M do                     ▷ for each expected mutation
    A[F[m]] ← θ()                     ▷ generate a mutation and assign it a vertex
    i ← RNG(0, 2N)                    ▷ generate a random index in [0, 2N−1]
    ResolveConflicts(G, A[F[m]], i)   ▷ set bit F[m] of genome i
end for

The definition of a conflict may vary by simulation. In cases where all new mutations are viewed as being novel, as in the infinite site model, there are no conflicts to resolve. Therefore, an edge is simply added between the new mutation and the genome. However, in other cases a newly generated mutation may occur at a genetic position that may already have been mutated elsewhere in the population. Although multiple mutations may be allowed to exist for a given position, a haploid genome can only exhibit one of the mutations. This would result in a logical conflict. Thus, it becomes necessary to remove any existing mutations for a given genetic position and haploid genome that would result in a logical conflict. This is generally referred to as conflict resolution.

An advantage of the adjacency matrix is that adding and removing mutations are simple updates to existing bit positions. That is, at the sequence level the operations for adding and removing edges are reduced to flipping the state of specific bit positions. Both operations can be done in constant time for known positions.
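A sketch of how the mutation step might look over the packed-row layout shown earlier; the names are again hypothetical, and conflict resolution is reduced to a comment since its definition varies by simulation.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical mutation step over a row-major packed bit matrix: each new
// mutation takes the next free column and is assigned to a random haploid genome.
void mutate(std::vector<std::uint64_t>& bits, std::size_t blocks_per_row,
            std::vector<std::size_t>& free_columns, std::size_t num_mutations,
            std::size_t haploid_genomes, std::mt19937_64& rng) {
    std::uniform_int_distribution<std::size_t> pick_genome(0, haploid_genomes - 1);
    for (std::size_t m = 0; m < num_mutations && !free_columns.empty(); ++m) {
        std::size_t col = free_columns.back();   // next free vertex (matrix column)
        free_columns.pop_back();
        std::size_t row = pick_genome(rng);      // random haploid genome in [0, 2N)
        // Under the infinite site model there are no conflicts to resolve;
        // otherwise, existing mutations at the same genetic position would be
        // cleared here before the new edge is added.
        bits[row * blocks_per_row + col / 64] |= 1ULL << (col % 64);
    }
}
```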

3.2.2 Recombination

The recombination process changes the genetic state of a genome by exchanging genetic material between a pair of chromosomes. An algorithmic abstraction of a recombination process will accept a pair of parental homologous chromosomes as input. It will return a mosaic chromosome with ordered segments from each of the parental chromosomes.

In effect, the process will divide the two input sequences into ordered sets of fragments. The mechanism used to fragment the parental chromosomes may vary. For example, it may be the case that recombination is expected to occur at specific regions with probability greater than that of the rest of the chromosome. More generally, it may be assumed that recombination occurs randomly at a given rate. Algorithmically, either mechanism can be translated to a process that generates a set of chromosomal positions where parental chromosomes are cut to form fragments. The cutting process results in positionally ordered sets of fragments.

A chromosomal crossover model is used to simulate recombination in this work. This model of recombination makes some simplifying assumptions about the structure of homologous chromosomes and the recombination process. Specifically, it assumes that the homologous chromosomes are positionally aligned with one another. It also assumes that recombination occurs at the same randomly generated position on both chromosomes as a result of their alignment. Finally, it assumes that fragments are selected in an alternating pattern from each parental chromosome's set of fragments. That is, if the maternal chromosome is chosen to provide the first fragment, then the paternal chromosome will provide the second. In effect, whichever parental chromosome is selected to provide the first fragment will subsequently provide all of the odd-positioned fragments. Similarly, the other parental chromosome will provide the even positions.

The positional relationship of the order of events enables a crossover process to be described using a counting method. Given the set of positions where crossover occurs, one is able to identify which parental chromosome is to provide the state for each mutation of a genome by counting the number of crossover events that precede a given mutation. In an abstract sense, this counting method is one approach that can be used to classify a mutation as being from either the parent's maternal chromosome or its paternal chromosome.

Algorithm 3 provides the pseudo-code for the recombination model used in this work. It assumes a genetically unordered sequence representation that results from mutations being assigned to free vertices without consideration of their genetic position. It relies on a bit masking technique to select the state of individual bits from each of the input vectors, and combines them together to form the output vector. The bit mask is constructed from the results of a binary classification method that decides which input vector should provide the state of the current bit, based upon the mutation associated with the current bit and the set of recombination events. In this work, the classify step counts the number of recombination events that precede a given mutation. When the count is odd, the state of the second input sequence is used.

Algorithm 3 Unordered Adjacency Recombination Algorithm
Require: P0 and P1 binary adjacency vectors
Require: R recombination events
Require: M vector of mutation locations
O ← []                                        ▷ initialize offspring adjacency vector
B ← (bit block size)
j ← 0
for all bit blocks a ∈ P0, b ∈ P1 do
    hets ← (a ⊕ b)
    hetMask ← 0
    for all set bit positions i ∈ hets do
        hetMask[i] ← classify(M[i + j], R)    ▷ keep state from P1?
    end for
    o ← ((a ∧ ¬hetMask) ∨ (b ∧ hetMask))
    append o to O
    j ← j + B
end for
return O

This algorithm does not make use of the genetic order of mutations, but it does make use of whether a mutation is present in both input sequences or not, that is, whether a mutation is homozygous, or present in both sequences of the parental genome. When a mutation is homozygous in a parental genome, the offspring is expected to inherit the same state for the mutation regardless of how recombination occurs. Therefore, it suffices to classify only those mutations where the two input sequences differ, or are heterozygous. This is a general optimization that is enabled by the positional alignment of the adjacency matrix.
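Under the same packed-row assumptions as the earlier sketches, the core of Algorithm 3 might be rendered in C++ as follows (C++20 for std::countr_zero; classify is the odd/even crossover count described above, and all names are illustrative):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Odd number of crossover events preceding this position -> take the state from P1.
bool classify(double position, const std::vector<double>& crossovers) {
    std::size_t count = 0;
    for (double x : crossovers)
        if (x < position) ++count;            // crossover precedes the mutation
    return count % 2 == 1;
}

// Unordered adjacency recombination (Algorithm 3) over 64-bit blocks.
// positions[c] is the genetic location of the mutation assigned to column c.
std::vector<std::uint64_t> recombine(const std::vector<std::uint64_t>& p0,
                                     const std::vector<std::uint64_t>& p1,
                                     const std::vector<double>& crossovers,
                                     const std::vector<double>& positions) {
    std::vector<std::uint64_t> child(p0.size());
    for (std::size_t blk = 0; blk < p0.size(); ++blk) {
        std::uint64_t hets = p0[blk] ^ p1[blk];   // only heterozygous bits differ
        std::uint64_t mask = 0;                   // bits to take from p1
        for (std::uint64_t rest = hets; rest != 0; rest &= rest - 1) {
            unsigned bit = static_cast<unsigned>(std::countr_zero(rest));
            if (classify(positions[blk * 64 + bit], crossovers))
                mask |= std::uint64_t{1} << bit;
        }
        child[blk] = (p0[blk] & ~mask) | (p1[blk] & mask);  // homozygous bits pass through
    }
    return child;
}
```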

3.2.3 Phenotype Evaluation

Phenotype evaluation is a stage of the life cycle model that is intended to evaluate a measurement for each individual's traits. As described in Equation (2.3), the phenotype of an individual may be the result of their genetic makeup and environmental forces, although how these forces work with or against one another is not well understood in many cases. For the purposes of this work, the environmental force is assumed to be constant.

To model the genetic component of the phenotype, this work assumes that each mutation carries with it some effect on the set of traits. The effect size of a mutation is how much a given mutation shifts a specific trait from its normal state. In this work, effect sizes are random values selected from a normal distribution, N(0, 1). The specific measurement for a given individual's trait is evaluated as a linear summation of the effect sizes for each of the mutations in their genome.

In effect, phenotype evaluation is a two-step process. As each mutation will have an effect size for each simulated trait, an effect size matrix can be defined. The use of the adjacency matrix allows the first step to be defined as a matrix multiplication operation, which produces a relative phenotype for each haploid genome in the population. A second step reduces the two haploid genome phenotypes of each individual into a single phenotype per individual.
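Viewing the first step as a product of the 2N × M bit matrix with an M × P effect size matrix, a direct, unoptimized C++ rendering might look as follows (illustrative names; the packed layout matches the earlier sketches):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// First step of phenotype evaluation: genome bit matrix (rows x M columns,
// packed in 64-bit blocks) times an effect size matrix (M x traits).
// Returns one relative phenotype vector per haploid genome.
std::vector<std::vector<double>> haploid_phenotypes(
        const std::vector<std::uint64_t>& bits, std::size_t blocks_per_row,
        std::size_t rows, const std::vector<std::vector<double>>& effects) {
    const std::size_t traits = effects.empty() ? 0 : effects.front().size();
    std::vector<std::vector<double>> pheno(rows, std::vector<double>(traits, 0.0));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t blk = 0; blk < blocks_per_row; ++blk)
            for (std::uint64_t rest = bits[r * blocks_per_row + blk];
                 rest != 0; rest &= rest - 1) {
                std::size_t m = blk * 64 + std::countr_zero(rest);  // mutation column
                for (std::size_t t = 0; t < traits; ++t)
                    pheno[r][t] += effects[m][t];   // linear sum of effect sizes
            }
    return pheno;
}
```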

3.2.4 Fitness

The fitness of an individual is a general measurement for the probability an individual will reproduce to the next generation. There are many methods by which this probability can be assessed. In this work, it suffices to assume that there exists a method by which an individual’s phenotype can be translated to a probability.

3.2.5 Selection

The selection process identifies those individuals that will reproduce to the next generation. Like fitness, there are many methods by which selection may be defined. This work will assume that individuals are randomly paired with one another according to their fitness. In effect, individuals are selected from a multinomial distribution where fitness provides the probability of being selected.
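Fitness-proportional sampling of this kind maps directly onto C++'s std::discrete_distribution. A minimal sketch, assuming a precomputed fitness value per individual:

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Draw mating pairs with probability proportional to fitness, i.e. a
// multinomial selection scheme over the parent population.
std::vector<std::pair<std::size_t, std::size_t>>
select_pairs(const std::vector<double>& fitness, std::size_t pairs,
             std::mt19937_64& rng) {
    std::discrete_distribution<std::size_t> by_fitness(fitness.begin(), fitness.end());
    std::vector<std::pair<std::size_t, std::size_t>> mates;
    mates.reserve(pairs);
    for (std::size_t i = 0; i < pairs; ++i)
        mates.emplace_back(by_fitness(rng), by_fitness(rng));  // two parents per child
    return mates;
}
```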

3.3 Results

The initial stage of this work was to conduct a performance comparison study to evaluate the relative impact of using the binary adjacency matrix representation. As such, a simulator was developed using the graph-based data structures and genetic models described previously. This simulator was compared with simulators developed using the common simulation frameworks simuPOP [6] and FWDPP [7]. To compare these simulators, two evolutionary scenarios were considered: a neutral scenario and a selected scenario. Each of these scenarios was evaluated at several evolutionary scales. The life cycle runtime of each simulator for each scenario was measured. This section describes the evolutionary scenarios in more detail and provides the comparison results.

22 3.3.1 Evolutionary Scenarios

The goal for each simulator was to evolve a population of N diploid individuals over G generations. Offspring were generated by randomly selecting individuals from the parent population and pairing them together. The reproduction process introduced random recombination events with a recombination rate ρ. Mutations were randomly introduced following an infinite site mutation model at a rate of µ. That is, mutations were only allowed to occur at genetic locations which were unmutated in the population. In effect, each mutation resulted in a new segregation site. The configuration of these parameters is given in Table 3.1.

Parameter                  Medium    Large
Population Size (N)        10^4      10^4
Generations (G)            10^5      10^5
Mutation Rate (µ)          0.01      0.1
Recombination Rate (ρ)     0.01      0.1

Table 3.1: Simulation scales

The first evolutionary scenario considered a neutral mutation model. This mutation model argues that mutations do not change an organism’s fitness. Although this model is debated, the efforts of [11, 13] suggest that many mutations are in fact neutral in nature and may become beneficial over time.

From a simulation perspective, the neutral scenario allows the phenotype evaluation and fitness steps in each generation to be optimized out at compile time. This significantly reduces the computation time required to perform a simulation. The primary intent of this scenario is to establish a lower bound for the runtime of a forward-time population genetic simulation.

Furthermore, as described in [14], this model behaves in an expected manner, with the expected number of mutations given by Equation (3.1). This enables a level of validation to be performed, as other simulators, such as MS [22], also consider this scenario.

The second evolutionary scenario considers that a mutation may indeed influence an individual's fitness. That is, mutations are involved in the selection process. Thus, the phenotype evaluation and fitness steps of the life cycle model are required. For the purposes of the initial performance comparison, however, fitness based selection was not performed. As a result, the phenotype evaluation and fitness steps were performed, but not used to influence the next generation. Furthermore, the phenotype evaluation step counted the number of mutations present in each haploid genome.

Each simulator used a counting algorithm best suited for its genome graph representation. As a result, this scenario allows for a relative upper bound to be established for the different simulators.

3.3.2 Simulators and Workstation Configuration

The four simulators considered in this comparison are listed in Table 3.3. All simulations were performed on a single workstation; its hardware configuration is listed in Table 3.2.

    CPU                Intel Xeon 3.5 GHz
    CPU Cores          6 (Hyperthreading enabled)
    Memory             32 GB
    Operating System   Fedora Linux 20 (64-bit)

Table 3.2: Hardware configuration - Workstation #1

3.3.3 Memory Utilization Comparison

Data structures available in simuPOP [6], FWDPP [7], and Clotho [21] were used to compare the relative memory requirements of the three different genome graph representations. Table 3.3 provides a listing of the different models, the source implementation considered, as well as a theoretical memory footprint based upon each implementation.

Both simuPOP [6] and FWDPP [7] provide an adjacency list based genome graph. However, their implementations differ in the handling of fixed mutations: FWDPP [7] removes them from the graph, while simuPOP [6] allows them to remain. As a result, simuPOP [6] incurs a growing overhead, Ft, as more mutations become fixed over time. Each of these data structures was experimentally tested using simulators developed from each of the simulation frameworks. Each simulator was given the task of evolving a population of N = 10^4 individuals over T = 10^5 generations. A neutral mutation model [11] was used to evolve the populations, and two mutation rates were considered: µ = 0.01 and µ = 0.1. The memory footprint of the final generation's genome graphs was captured.

    Simulator   Genome Graph       Framework      Memory Estimation
    1           Fixed-length       simuPOP [6]    N L b
    2           Adjacency List     FWDPP [7]      |E| Bp + M O
    3           Adjacency List     simuPOP [6]    |E| Bp + N Ft + M O
    4           Adjacency Matrix   Clotho [21]    M (N + O)

    Terms: N - population size; L - genome size; b - bits per element; Bp - bits per reference element; Ft - fixed mutations in generation t, where Ft = Ft-1 + f(t) and F0 = 0; M >= E[S]; O - overhead bits per mutation.

Table 3.3: Breakdown of memory footprint of genome graph representations

The results of the experiments are shown in Table 3.4. As expected, the genome sequence representation (#1) required the most memory. Testing of the µ = 0.1 configuration was skipped for this representation, as it was expected to require upwards of 23 GB of memory, far exceeding the other representations. Of the two adjacency list approaches, FWDPP's [7] offered the smaller memory footprint. Although the additional overhead incurred by simuPOP [6] was expected to increase its utilization, the size of the difference between the two is somewhat surprising.

    Population Size (N)   Mutation Rate (µ)        1        2        3        4
    10,000                0.01                2.3 GB   7.2 MB   185 MB   2.4 MB
    10,000                0.1                      -   350 MB        -    75 MB

Table 3.4: Memory comparison of genome graph representations

Upon further investigation, it was determined that, at the time, the reference value simuPOP [6] stores in its adjacency lists was slightly larger than that of FWDPP [7]. In effect, simuPOP [6] encodes multiple values into a single reference to enable a more efficient lookup in some steps of a simulation. In addition, FWDPP [7] made use of the observation that in small scale simulation scenarios there is a potential for genetic sequences to be duplicated. Rather than allow duplicates to exist, it coalesces the duplicate sequences into a single vertex. Conversely, simuPOP [6] performs no such optimization.

The binary adjacency matrix representation offers the smallest memory footprint of the three representations. This is in part due to the implementation provided in Clotho [21] taking advantage of observations like those made in FWDPP [7]. It also made the observation that the bit vectors could be truncated at the last set bit position, simply inferring that every bit position beyond the end of the vector is unset.

In subsequent versions of the Clotho [21] implementation, these optimizations were removed.

These optimizations are primarily advantageous in small scale simulations. As evolutionary scenarios increase in scale, sequences become increasingly unique, to the point that every sequence in the population is expected to be unique. Removing the optimizations does result in a greater memory footprint; however, for a large scale simulation the memory footprint is expected to be 115 MB. This is still better than a 3x reduction when compared with an adjacency list representation.
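As a back-of-the-envelope check of these figures (assuming Equation (3.1) is the standard infinite site expectation E[S] = θ · Σ_{i=1}^{2N-1} 1/i with θ = 4Nµ, and taking the matrix rows to be the 2 · 10^4 haploid sequences of a diploid population of 10^4): at the large scale, θ = 4 · 10^4 · 0.1 = 4 · 10^3, the harmonic sum is roughly 10.5, and so M ≈ E[S] ≈ 4.2 · 10^4 segregating sites. The matrix then occupies roughly 4.2 · 10^4 · 2 · 10^4 bits ≈ 105 MB before per-mutation overhead, in line with the 115 MB figure, and 350 MB / 115 MB ≈ 3.0x relative to the FWDPP [7] measurement in Table 3.4.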

3.3.4 Medium Scale Simulation

Figure 3.2 shows the runtimes for simulators #2 (FWDPP [7]) and #4 (Clotho [21]) for the two evolutionary scenarios. In both scenarios, the simulator based upon the binary adjacency matrix does outperform the adjacency list based simulator. However, there seems to be a greater amount of relative overhead incurred by #4 when the phenotype evaluation step is performed. That is, the Clotho [21] simulator has a nearly 800% increase in runtime, whereas FWDPP [7] only incurs about a 190% increase.

Figure 3.2: Small Scale Simulation Comparison

Simulators #1 and #3 did not perform well in the neutral scenario. As shown in Figure 3.3, both were more than two orders of magnitude slower than simulators #2 and #4. Both of these simulators were developed using the simuPOP [6] framework. Given these performance results, the earlier memory measurements (Table 3.4), and the expectation that the remaining evolutionary scenarios and scales would further increase the runtime and memory requirements, it was decided not to pursue additional testing with either of these simulators.

Figure 3.3: Medium Scale Simulation Comparison

3.3.5 Large Scale Simulation

An order of magnitude increase in the mutation rate results in an even greater performance gap between simulators #2 and #4. Figure 3.4 provides the performance results for the large scale evolutionary scenarios. In the neutral scenario, the Clotho [21] simulator offers a 46.5x speedup over the FWDPP [7] simulator. Adding in the additional phenotype evaluation and fitness steps shrinks the performance gap, with the Clotho [21] simulator offering a 6.1x speedup.

The scale of the runtime increase experienced by FWDPP [7] is not entirely attributable to the adjacency list representation; rather, it is more likely linked to the specific implementation. At the time, a generic, pointer-based linked list data structure was used in FWDPP [7] as the base list structure. In addition, after each generation a memory cleanup stage was performed to remove lost mutations and unused adjacency lists. These likely combined to produce a significant amount of overhead in each generation.

Figure 3.4: Large Scale Simulation Comparison

Chapter 4

Task Parallelism in Forward-Time Population Genetic Simulation

Modern computing environments offer various levels of parallel processing capabilities. A typical workstation is built around a central processing unit (CPU) that has between 4 and 16 physical cores, if not more. This type of hardware opens the door to substantially improving the relative performance of a simulator. The challenge, though, is identifying how to effectively leverage this capability.

As previously mentioned, one of the primary use cases for simulators is to generate large volumes of in silico data sets. This may involve running a single simulator with multiple configurations.

In situations like this, it is trivial to concurrently execute as many instances of the simulator as resources will allow. This is an example of a Single Program, Multiple Data (SPMD) style of parallelism. Indeed, this model of parallelism does offer several benefits. Perhaps the most obvious is that it is simple to achieve, as it requires no additional simulator development. In addition, it does improve the total runtime needed to evaluate the multiple configurations by taking better advantage of available resources, although what constitutes better utilization of resources is a bit vague.

The basic assumption of this style is that there is a trivial one-to-one mapping between application instance and hardware resources. For instance, if there are 64 simulation scenarios to be run, then 64 CPU cores with sufficient memory will enable all simulations to run concurrently, and they will complete at roughly the same time.

Results from previous chapters indicate that increases in the evolutionary scenario can significantly increase the memory footprint and runtime of a simulation. Using Equation (3.1) again, a mutation rate of µ = 10.0 with an effective population size of N = 10^4 suggests an expected E[S] ≈ 4 · 10^6 mutations in the population. A single simulation of this scale, which considers a genome size that is roughly 1/3 of the human genome, would require tens of gigabytes of memory and days of computation time. While acquiring resources to facilitate an SPMD style of execution may still be feasible at this scale, the hardware configuration begins to be imbalanced, with the per-core memory requirement reaching tens of gigabytes. In effect, one has to buy more time on larger systems to achieve a final goal.

To slow the hardware configuration imbalance, an alternate approach would be to utilize a multi-core system to evaluate a single simulation configuration. The observation is that, in nature, the life cycle of a population is inherently parallel. Individuals of a population grow and reproduce, for all intents and purposes, independently of one another and at the same time. This is by definition a parallel process. It stands to reason, then, that a simulation of this natural process should also be inherently parallel.

Leveraging multiple cores or threads for a single execution is intended to reduce the runtime of that instance while maintaining the same system configuration. This enables a smaller, more manageable hardware configuration to be used. Continuing the earlier SPMD example, a 4-core workstation with 64 GB of memory is able to concurrently run 4 evolutionary scenarios that require at most 16 GB of memory each. Conversely, if the 4 cores were used to evaluate the same evolutionary scenario, and doing so reduced the runtime by a factor of 4, then serially executing the 4 scenarios would still complete in the same total time. Because only a single simulator is running at any given time, using the full resources of the system, the system memory requirement is reduced to that of a single application instance.

Certainly, modern systems do offer a 4-core processor with 64 GB of memory, if not more. However, under SPMD each evolutionary scenario is effectively limited to 16 GB. An increase in scale would require a decrease in the number of parallel instances. As a result, one or more of the cores would sit idle while the rest are working, underutilizing the system as a whole. From this perspective, an SPMD approach is not always a scalable solution.

Unfortunately, relatively easy access to computational resources, combined with not having to modify a simulator, makes SPMD an obvious choice for most researchers. Indeed, cloud-computing services provide many options that enable one to obtain the necessary computational resources if they are not otherwise available. Alternatively, researchers may simply accept the runtime on immediately available resources, even though those resources are being underutilized. In effect, modifying existing open source software is simply not an option for most researchers. As a result, there has been relatively little effort to explore the inherently parallel nature of a forward-time population genetic simulation.

This chapter considers the task parallel opportunities available in a forward-time population genetic simulation. It presents the idea of partitioning a population for parallel execution. Two approaches are described for enabling task level parallelism in a simulator. Finally, performance results are given for the two approaches.

4.1 Task Parallelism

From a software perspective, task level parallel solutions organize work such that it can be distributed across, and simultaneously executed by, a set of parallel processing mechanisms. The Single Program, Multiple Data model of parallelism is an example of task level parallelism. In this model, multiple independent processes of the same application execute in parallel to evaluate a set of input data elements. Although the same program is being simultaneously executed, this does not mean that each program is simultaneously executing the same instruction at the hardware level. Instead, each application follows an execution path relative to its input data. From a broader perspective, this makes SPMD a sub-category of Multiple Instruction, Multiple Data (MIMD) parallelism.

In SPMD, each process is viewed as being completely independent of the others. This means that data is self-contained within the application space of a given process. As described earlier, though, this leads to problems when the application domain is large. It is, therefore, desirable to use multiple processes for a single, shared purpose. This typically results in processes needing to communicate data with one another. Message passing techniques are often employed to communicate data between parallel processes. Depending upon the message communication mechanism, communicating data between processes can incur a noticeable overhead.

Another form of MIMD parallelism uses parallel threads instead of processes. Threads are lightweight processes that exist within the same application space. In other words, a multi-threaded application is a single application that distributes units of work, or tasks, across a set of threads that it maintains. Threads use the application's shared memory space as a more efficient communication medium. What, then, are the tasks that a multi-threaded forward-time population genetic simulator would be able to perform in parallel?

To answer this question one has to revisit the life cycle model from Chapter 3. From one perspective, the models themselves could be considered tasks. One might argue that each genetic model can be assigned to a different thread, and the models can use the shared memory to communicate data between one another. In effect, a producer-consumer model can be followed, where a thread produces the result of one genetic model that another thread will eventually consume.

Unfortunately, this is generally not an efficient use of threading in a forward-time population genetic simulation because of the linear dependency that exists between steps. That is, the reproduction step must complete before phenotype evaluation can occur. This ordered relationship would result in threads being sequentially active, with no work performed in parallel despite having used parallel threads.

Instead, a more efficient use of parallelism would be to observe that a typical forward-time population genetic simulator iteratively builds an offspring population. Each offspring individual is taken through the life cycle model independently of the others. This suggests that producing an offspring represents a unit of work that can be performed in parallel. In effect, if a life cycle model is a task that transforms input data to produce an offspring, then the life cycle model can be mapped over a set of independent input data to produce a set of independent offspring. If a set of independent input data exists, then it can be divided, or partitioned, into independent subsets. Furthermore, the life cycle model can be mapped over each partition to produce an independent set of offspring. The sets of offspring can then be reduced to a single set to form the offspring population.

Mapping the application of a life cycle model across partitions of the input configurations, then reducing the results to form a common population, serves as the basis for achieving a parallel implementation of a forward-time population genetic simulator in this work. From a general flow perspective, each generation begins with several pre-processing steps that prepare the inputs necessary for a life cycle model for each offspring individual. The inputs are then partitioned, and the life cycle model is applied to each partition. The resulting groups of offspring individuals are then reduced to form the offspring population. The next several sections expand upon the idea of partitioning the offspring population and how to map the steps of a life cycle model.

4.1.1 Partitioning

Conceptually, partitioning is straightforward. Given a collection of elements, group the elements to form sub-collections that satisfy a set of criteria. The grouping criteria can be as general as having groups of a certain size, or as specific as organizing elements based upon a common attribute.

As previously mentioned, producing a single offspring from the necessary inputs represents the base unit of work in a forward-time population genetic simulation. Each offspring is produced independently of the others, suggesting that offspring can be generated in parallel. Therefore, if the goal is to produce an offspring population, then partitioning the expected offspring population enables it to be produced in parallel.

In this work, partitioning generally refers to selecting contiguous sub-matrices of the binary adjacency matrix used to define the genome graph. It may also include slicing lists into sub-lists of varying sizes. Both of these operations reduce to defining a list of non-overlapping index ranges relative to the source data structure. Furthermore, as the original data structures are not modified by partitioning, the reduction step becomes as trivial as deleting the index ranges.
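A sketch of this range-based partitioning follows (hypothetical helper name): the index space [0, n) is divided into at most p contiguous, non-overlapping ranges whose sizes differ by at most one.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Split [0, n) into at most p contiguous, non-overlapping [begin, end)
    // ranges; the first n % p ranges carry one extra element.
    std::vector<std::pair<std::size_t, std::size_t>>
    partition_range(std::size_t n, std::size_t p) {
        std::vector<std::pair<std::size_t, std::size_t>> ranges;
        if (p == 0) return ranges;
        for (std::size_t i = 0, begin = 0; i < p && begin < n; ++i) {
            const std::size_t len = n / p + (i < n % p ? 1 : 0);
            ranges.emplace_back(begin, begin + len);
            begin += len;
        }
        return ranges;
    }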

There is a question as to how best to utilize partitioning in practice. Since the offspring population is of known size after the selection step, is it better to apply a static partitioning scheme for the application of the entire life cycle? Or is it better to dynamically partition during each step of the life cycle model?

4.1.2 Batch Model

This model of execution iteratively executes each step of the life cycle model. It assumes that each step will dynamically partition its input relative to the task to be performed. Once partitioned, the task is mapped across each of the partitions, and the outputs are reduced. In effect, a batch of tasks is executed in parallel, with synchronization occurring between each step.

Figure 4.1 provides a high-level breakdown of a 2-step execution.

Figure 4.1: Batch Model Concept

Dynamically partitioning at each step inherently assumes that the partitioning step is efficiently achieved. Certainly, dividing an index range into a list of non-overlapping index ranges is a lightweight step. However, repeatedly constructing these lists does add overhead. Similarly, performing a barrier synchronization between each step also introduces an overhead cost. Therefore, from a performance perspective, reducing the frequency of these steps would be advantageous. One way to achieve this is through the use of a pipeline model of execution.

4.1.3 Pipeline Model

This parallel execution model assumes that a single partitioning step is performed. Subsequently, the life cycle model is applied to each partition as one contiguous flow of steps. This avoids the frequent repartitioning and synchronization steps that were performed in the batch model. Figure 4.2 provides a high-level diagram of the concept.

Figure 4.2: Pipeline Model Concept

In some respects, this is the simpler of the two models in that it abstracts away the concept of having to define a partitioning scheme per model. This simplifies the model development process for researchers who develop their own models.
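The operational difference between the two models can be sketched with standard threads (hypothetical step functions and helper names; a simplification of the actual simulator): the batch model joins its thread pool after every step, while the pipeline model partitions once and joins once.

    #include <cstddef>
    #include <thread>
    #include <utility>
    #include <vector>

    using Range = std::pair<std::size_t, std::size_t>;
    using Step  = void (*)(Range);  // one life cycle step applied to an index range

    // Even split of [0, n) into at most p contiguous ranges (see Section 4.1.1).
    static std::vector<Range> split(std::size_t n, std::size_t p) {
        std::vector<Range> r;
        for (std::size_t i = 0, b = 0; i < p && b < n; ++i) {
            const std::size_t len = n / p + (i < n % p ? 1 : 0);
            r.emplace_back(b, b + len);
            b += len;
        }
        return r;
    }

    // Batch model: re-partition, map, and join (a barrier) at every step.
    void run_batch(const std::vector<Step>& steps, std::size_t n, std::size_t p) {
        for (Step step : steps) {
            std::vector<std::thread> pool;
            for (Range r : split(n, p)) pool.emplace_back(step, r);
            for (auto& t : pool) t.join();  // synchronization between steps
        }
    }

    // Pipeline model: partition once; each thread runs every step over its range.
    void run_pipeline(const std::vector<Step>& steps, std::size_t n, std::size_t p) {
        std::vector<std::thread> pool;
        for (Range r : split(n, p))
            pool.emplace_back([&steps, r] { for (Step s : steps) s(r); });
        for (auto& t : pool) t.join();      // single join per life cycle
    }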

4.2 Results

As part of this work, a multi-threaded forward-time population genetic simulator was developed.

The simulator used in these tests is called QTLSimMT and was developed using the Clotho [21] framework. As the name suggests, it is a multi-threaded simulator aimed at studying quantitative trait loci (QTL) under selection. The simulator was put through a series of performance-based tests to evaluate the relative advantage of using a multi-threaded application, as well as how the batch and pipeline models perform relative to one another.

4.2.1 Simulator and Hardware Configuration

There are two primary differences between QTLSimMT and the simulator used in the previous experiments. First, the phenotype evaluation and fitness steps are implemented as described in Chapter 3. Second, the phenotype evaluation and fitness steps are no longer optimized out via a compilation mode. Instead, the neutrality of a mutation is linked to its effect size. An effect size of E[m, t] = 0 means that mutation m is neutral for trait t. The randomly generated effect sizes in the population are controlled by a Bernoulli random variable. As a result, E[m, t] is given by Equation (4.1). The Bernoulli process is configurable; setting p = 0 results in all effect sizes being zero.

X ~ Bernoulli(p), with Pr(X = 1) = p    (4.1a)

E[m, t] = X · N(µ, σ^2)    (4.1b)

A check of the effect size matrix is performed prior to each phenotype evaluation stage to determine whether any of the mutations are non-neutral. That is, if any of the effect sizes is not equal to zero, then the phenotype evaluation matrix multiplication is performed. Otherwise, all phenotypes are trivially set to the same value. This optimization enables the performance difference between the neutral and selected scenarios to be observed.
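A host-side sketch of this effect size scheme follows (hypothetical names): each entry is a Bernoulli draw gating a normal draw, per Equation (4.1), and a simple scan decides whether the full phenotype evaluation is required.

    #include <cstddef>
    #include <random>
    #include <vector>

    // Effect sizes per Equation (4.1): E[m, t] = X * N(mu, sigma^2), X ~ Bernoulli(p).
    std::vector<double> sample_effect_sizes(std::size_t mutations, double p,
                                            double mu, double sigma,
                                            std::mt19937& rng) {
        std::bernoulli_distribution selected(p);
        std::normal_distribution<double> effect(mu, sigma);
        std::vector<double> e(mutations, 0.0);
        for (double& v : e) {
            if (selected(rng)) v = effect(rng);  // p = 0 leaves every effect at zero
        }
        return e;
    }

    // If every effect size is zero, the phenotype matrix multiplication is skipped.
    bool any_non_neutral(const std::vector<double>& e) {
        for (double v : e) if (v != 0.0) return true;
        return false;
    }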

Finally, all simulations were performed on a new workstation. Its configuration is defined in Table 4.1.

    CPU                Intel i7-3930K
    Physical Cores     6 (Hyperthreading enabled)
    RAM                32 GB
    Operating System   CentOS 7.0 (64-bit)

Table 4.1: Hardware Configuration - Workstation #2

4.2.2 Evolutionary Scenarios

The goal of the simulations in this study is the same as that of the previous chapter. The neutral and selected scenarios are again considered to be the baseline evolutionary scenarios. The base configurations for the two scenarios are given in Table 4.2. A smaller genetic scale has been added to the scales used in the previous study. Also, six thread counts (T) are used to evaluate each of the evolutionary scenarios and genetic scales. Table 4.3 provides the scale configurations. Finally, all combinations of these parameter configurations are evaluated using both the batch model and the pipeline model of task parallelism.

                                           Neutral   Selected
    Population Size (N)                    10^4      10^4
    Generations (G)                        10^5      10^5
    Probability of Selected Mutation (p)   0.0       1.0

Table 4.2: Common Population Configuration

                             Small    Medium   Large
    Mutation Rate (µ)        0.001    0.01     0.1
    Recombination Rate (ρ)   0.001    0.01     0.1
    Thread Counts (T)        1, 2, 4, 6, 8, 12

Table 4.3: Evolutionary scales and threading configuration

4.2.3 Small Evolutionary Scale

Figure 4.3 shows the runtimes for the neutral scenario. Although using multiple threads did increase the relative performance of the batch model, the gains were relatively minor. As a result of the evolutionary scale configuration, the potential performance gain from using multiple threads is low; in effect, the amount of work for each thread is small. The costs associated with repartitioning, synchronization, and other overheads incurred by threading make it difficult to significantly reduce the runtime. Conversely, the pipeline model experienced noticeably better performance improvements. This is a result of its reduced threading overheads.

Figure 4.3: QTLSimMT - Small Scale; Neutral

The runtimes for the selected scenario are provided in Figure 4.4. As indicated by the earlier experiments, the phenotype evaluation stage significantly increases runtime. Interestingly, the best performance for both scenarios was achieved with T = 6. This suggests that leveraging hyperthreading did not provide a significant advantage in this situation.

For both evolutionary scenarios, the pipeline model offers better performance. With T = 12 threads, the pipeline model offers better than a 2.1x speedup over the single-threaded execution in the selected scenario and 2.7x in the neutral scenario.

4.2.4 Medium Evolutionary Scale

An order of magnitude increase in the mutation rate from the small scale further increases the performance benefits of a multi-threaded simulation. Even at this scale, the batch mode struggles to show significant runtime improvements relative to its sequential execution (Figure 4.5). Furthermore, its peak performance was achieved at T = 6, with T = 12 performing noticeably worse. The pipeline mode offers a performance profile similar to the one it exhibited in the small scale experiment.

Figure 4.4: QTLSimMT - Small Scale; Selected

Figure 4.5: QTLSimMT - Medium Scale; Neutral

Both execution models exhibit a strikingly similar performance profile when considering the selected scenario at this scale, as shown in Figure 4.6. The batch mode does start out with a slower runtime for the sequential execution. However, as the number of threads is increased, the relative difference in runtimes between the two models decreases from a 19% gap to a 7% gap at T = 12. This equates to less than a 2 minute difference.

In terms of absolute runtime, though, the pipeline model continues to outperform the batch model.

Figure 4.6: QTLSimMT - Medium Scale; Selected

4.2.5 Large Evolutionary Scale

The neutral scenario produced an interesting result at the large scale. A significant runtime spike was encountered when executing sequentially using the pipeline mode. Figure 4.7 shows this event, with the pipeline execution taking 43% longer than the batch. At this scale, the batch model outperforms the pipeline model except when T = 12.

For the selected scenario, the relative runtime profile is very similar to the previous scale. The batch mode is initially 18% slower than the pipeline. This shrinks as the thread count increases.

At T = 12, the gap is only 4%. The pipeline model again outperforms the batch model.

Figure 4.7: QTLSimMT - Large Scale; Neutral

Figure 4.8: QTLSimMT - Large Scale; Selected

Overall, the pipeline model offered consistently better performance than the batch model. When tasked with evaluating selected scenarios, the sequential batch model experiences an 18-23% longer runtime than the sequential pipeline, though this gap shrinks as the number of threads is increased. This is not the case, however, for the neutral scenario.

Chapter 5

Data Parallelism in Forward-Time Population Genetic Simulation

As the name suggests, data parallelism is rooted in distributing vectors of data across a parallel processing environment. Each processing unit of the environment concurrently performs the same operation on its unit of the data space. This style of processing is referred to as Single Instruction, Multiple Data (SIMD).

A Graphics Processing Unit (GPU) is a commonly available device that provides data parallel computation. As the name suggests, its original purpose was to perform the vector processing tasks that traditionally arise in graphics processing. These operations, however, have more general use cases. Since the early 2000s, GPUs have been used in many application domains, spanning from molecular dynamics to computational finance [23].

Although there are several GPU providers available, this work focuses primarily on the use of a CUDA-enabled NVIDIA GPU. It assumes the use of the CUDA API [24], and terminology will be used as it relates to this computing platform and programming interface.

5.1 Graphics Processing Units

A GPU offers thousands of parallel processing cores and excels at performing Single Instruction, Multiple Data (SIMD) operations. Dedicated versions of these processing units exist as separate physical devices within a host workstation. As such, they maintain a memory space separate from that of the host system. The host and device operate in parallel and communicate data with one another via a communication bus. As the host and device offer different types of processing units, systems that offer a GPU are referred to as heterogeneous systems.

Utilization of these devices is often described as a pipeline. First, a portion of a data set is transferred from host memory to device memory. The host then invokes a kernel, or GPU function, on a specific stream of the GPU. A stream is similar to a First In, First Out (FIFO) queue of operations that the GPU is to perform; in other words, when the host invokes a kernel, it places that kernel's execution into an execution queue. The GPU guarantees that all operations within a stream will be executed serially.

Kernel execution is an asynchronous process allowing the host to continue processing while the device operates. The host initiates a synchronization event when it reaches a point in its execution where it expects to retrieve data from the device. Synchronization events can be performed either at the stream or device level.

5.1.1 GPU vs CPU

In recent GPU generations, devices are able to execute tasks from multiple streams concurrently. Thus, a GPU offers a level of MIMD parallelism in addition to being an inherently SIMD device. Similarly, a multi-core CPU offers MIMD parallelism with SIMD capabilities through extended instruction sets. For example, the Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) [25] instructions available on current generation Intel multi-core processors provide SIMD operations. This may suggest that a GPU and a CPU are similar to one another, as both are MIMD devices that offer SIMD modes of execution. However, they are quite different from one another.

There are many technical differences that separate a GPU from a CPU that are beyond the scope of this work. Of interest here are some of the features that differentiate the parallelization techniques and the relative parallel processing scales offered by the two.

For all intents and purposes, the cores of a CPU are independent of one another. This enables the parallel execution of unique tasks. As discussed in the previous chapter, a multi-threaded application may map the application of a single task across available threads. Each thread logically performs the same set of instructions across separate units of data. This would suggest that data parallel execution is being applied. However, because each core is independent, the instructions are not necessarily executed concurrently on all of the cores.

A current generation multi-core CPU has between 2 and 20 separate cores. Hyperthreading technology [26] enables two threads to be scheduled by each core concurrently. Each core is also able to perform SIMD instructions that operate on words of 128, 256, or 512 bits. Conversely, an NVIDIA CUDA-enabled GPU may offer thousands of smaller cores that operate on 32- or 64-bit words. For example, the NVIDIA GeForce GTX 970 used in this work offers 1664 cores. In effect, if a multi-core processor were able to concurrently perform a SIMD operation across all of its threads, then a top-of-the-line current generation processor may operate upon 2560 bytes of data. Conversely, a GPU may operate upon more than 13,312 bytes.

5.1.2 Kernels

The GPU executes functions called kernels. Kernels are invoked with a specific definition of how many threads to use and how those threads are logically arranged. In effect, the developer instructs the GPU on how to execute a given workload. This definition can be expressed in multiple layers of schedulable units of threads. A thread performs a unit of work on a distinct unit of data. Threads are grouped into units called blocks. Finally, blocks are further grouped as units of a grid. Each of these units can be expressed as a 3D space, with the relative scale of each space limited by the physical limits of the GPU.
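For example, a launch configuration covering a matrix of rows × (words_per_row · 32) bits might be expressed as follows (a generic illustration with placeholder names, not the simulator's actual configuration):

    #include <cuda_runtime.h>

    __global__ void touch_matrix() { /* placeholder for a real kernel */ }

    // Arrange threads so each grid row covers one sequence (bit mask vector)
    // and the x-dimension covers the mutations (bits) of that row.
    void launch_over_matrix(unsigned int rows, unsigned int words_per_row) {
        const unsigned int bits = words_per_row * 32;         // mutations per row
        dim3 block(256, 1, 1);                                // 8 warps per block
        dim3 grid((bits + block.x - 1) / block.x, rows, 1);   // blocks_x by rows
        touch_matrix<<<grid, block>>>();
    }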

As a result of the physical design of a CUDA GPU, threads are also grouped into units called warps. Each warp is composed of a set of threads that will concurrently execute the same instruction. In the current generation of CUDA GPUs, each warp is a group of 32 threads.

Although the computation space is defined in multiple layers, this does not mean all threads will execute concurrently. In some respects, the block and grid definitions can be viewed as layers that enable the GPU to plan the future execution of the kernel across the data space. Thus, if the developer has defined the compute space effectively, then the GPU is expected to be able to schedule tasks such that all cores are always kept busy with work.

5.1.3 GPU Challenge

As a result of these vector based use cases, the physical design of the GPU has been optimized for operations where physically aligned cores are assigned units of work upon data that is physically aligned in memory. The preferred memory alignment is to have vector data contiguously aligned, or coalesced, in memory. Strided memory accesses, where data elements are aligned with a uniform gap between them, are performant but less desirable, as they incur an additional memory access overhead.

Accessing random memory locations represents a worst case scenario for a GPU. When a random access pattern is encountered, the memory requests diverge from one another. The GPU attempts to compensate by unrolling the operations, which serializes the execution of the memory requests. Furthermore, the SIMD design prevents cores from proceeding to the next instruction until the previous one has completed. Thus, a serialized memory request effectively blocks all threads of a thread group, which generally results in poor throughput. Divergence, however, is not limited to memory access.

A core function of any computing device is to conditionally perform operations. From a high-level perspective, conditional statements serve to guide the flow of an algorithm down a specific execution path, or branch. On a sequential machine these have relatively low impact, as the code simply jumps to a different segment of memory and continues executing. A SIMD architecture, however, allows only one code segment to be performed by all cores at a time. Therefore, when cores conditionally jump to different code segments, the GPU will serially apply the code segments; that is, some cores will be blocked until their code segment can be executed.

Experience suggests that it is not uncommon to find developers who simply try to move existing code from a sequential machine to a GPU while discounting the importance of data organization or branch divergence. Subsequently, they are disappointed by the relative performance gains achieved when using the GPU. Furthermore, it can be a non-trivial task to translate an existing algorithm to account for data organization and branch divergence. This makes developing for a GPU a challenge for many. Indeed, as part of this work it was necessary to translate the life cycle models to effectively leverage the SIMD architecture.

5.2 Data Parallel Life Cycle Models

The translation of a life cycle model for a data parallel environment can be a non-trivial task. Much of the difficulty is rooted, again, in the decision of how to represent the genome graph. The use of a binary adjacency matrix to represent the genome graph helps to reduce the translation efforts.

This is primarily due to it being a vector based representation. As such, several of the algorithms previously discussed translate well to kernels.

Several challenges were encountered in this work. The first was in the generation of random numbers; the second was in the general flow of the kernels. The following sections describe the challenges that were encountered, as well as the kernel adjustments that were made.

5.2.1 Random Number Generation

There are two general patterns for generating random numbers: inline generation and offline generation. The inline approach performs random number generation as a step within the algorithm itself; indeed, the algorithms presented earlier in this work follow the inline approach. The offline approach treats random number generation as a separate task that generates a pool of random values. Tasks that require random numbers then select values from this shared pool.

Certainly, both approaches can be used in a GPU environment. One recommendation in the NVIDIA cuRAND library [27] documentation is that generating large batches of random numbers will tend to offer better performance. From a forward-time population genetic simulation perspective, this was interpreted to suggest that following an inline approach may lead to additional performance overheads, as the number of random numbers generated at each step of a simulation is small per generation. Therefore, it was decided to follow an offline approach in the data parallel aspect of this work.

As a result of this decision, each of the life cycle models was decomposed into a two-step process. The first step performs the random number generation for a given model; the second step evaluates the model. From a heterogeneous computing perspective, this offline approach also helps to decouple a model from being performed entirely by one type of processing unit.

Random number generation libraries have been available on host systems for decades. These libraries have undergone continuous development and optimization and provide a robust set of random number distributions. However, the use of host libraries requires communicating the random numbers to the device. Although communication is not cheap, its cost can be hidden through the use of asynchronous operations; that is, through the use of different streams, a GPU can be executing a kernel while concurrently copying data on another. As such, for the purposes of this work it was decided to continue to leverage the host for random number generation tasks, despite the offerings of NVIDIA's cuRAND library [27].

An interesting challenge that the offline approach introduces in a data parallel environment is determining how a thread retrieves the random values it needs from the pool. In a sequential solution this is relatively straightforward, as the pool can be treated like a queue, with each task selecting and removing the first value in the queue. In a parallel environment this presents a data race condition, which can be avoided by pre-generating a list of non-overlapping index ranges. This technique is made possible primarily because the maximum size of the offspring population is generally known after selection, and because the number of random events is generated independently of the state of the population.

The non-overlapping index technique can also be leveraged in a data parallel solution. In effect, each parallel thread will select the random values that fall within a specified index range of the pool. This technique is necessarily used in the data parallel versions of both the recombination and the mutation models.
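The convention can be sketched as follows (hypothetical structure and function names): the host fills one pool per model, and each worker consumes only the values within its pre-assigned, non-overlapping index range, so no locking is required.

    #include <cstddef>
    #include <random>
    #include <vector>

    struct PoolSlice {
        const double* values;   // borrowed view into the shared pool
        std::size_t count;      // number of values reserved for this worker
    };

    // Pre-generate the pool on the host; events per offspring are independent
    // of the population state, so per-worker counts are known up front.
    std::vector<double> fill_pool(std::size_t total, std::mt19937& rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<double> pool(total);
        for (double& v : pool) v = u(rng);
        return pool;
    }

    // Non-overlapping slices: worker i reads pool[offsets[i], offsets[i + 1]).
    PoolSlice slice_for(const std::vector<double>& pool,
                        const std::vector<std::size_t>& offsets, std::size_t i) {
        return { pool.data() + offsets[i], offsets[i + 1] - offsets[i] };
    }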

5.2.2 Recombination

Two modifications were made to the GPU version of the recombination model. The first modification alters the classification method to avoid branch divergence. The second decomposes the recombination into two methods: the first builds a crossover matrix for the offspring generation; the second uses the crossover matrix and the results of the selection process to produce an offspring haploid genome.

Classification Method

Recall that the classification step is used to determine how many recombination events precede a given mutation within an individual. The count determines which haploid genome should contribute its state to the resulting haploid sequence.

In the sequential algorithm, it was first assumed that a list of recombination events, ordered by their genetic location, would be provided as input. The classification process would then simply search the event list to find the position at which the given mutation could be inserted. The search may be implemented as either a binary search or an early terminating linear scan. Unfortunately, both of these counting methods would introduce branch divergence in a SIMD algorithm.

The branch divergence can be avoided, though, by simply scanning through the entire event list for each mutation. This also relaxes the constraint that the recombination event list be in genetically sorted order. Certainly, scanning the entire event list is algorithmically less efficient, resulting in O(mΓ) work, whereas the original version is O(m log Γ). However, the penalty for branch divergence is expected to be greater than the overhead incurred by linearly scanning a list of elements, especially if the list is only tens, or even hundreds, of elements in size.
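In code, the branch-free classifier becomes a full scan that accumulates a comparison result instead of exiting early (a sketch; the parity remark in the final comment reflects the usual crossover convention rather than a detail stated here):

    // Count recombination events that precede a mutation's genetic position.
    // Every thread scans the full event list, so all threads in a warp execute
    // the same instructions regardless of their data (no early exit, no search).
    __host__ __device__
    unsigned int classify(double mutation_pos,
                          const double* event_pos, unsigned int event_count) {
        unsigned int preceding = 0;
        for (unsigned int i = 0; i < event_count; ++i) {
            preceding += (event_pos[i] < mutation_pos) ? 1u : 0u;  // branch-free
        }
        return preceding;  // parity typically selects the contributing strand
    }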

Algorithm decomposition

The decomposition of the recombination model is intended to unroll the nested looping logic of Algorithm 3 for a data parallel environment. Recall that the outer loop is used to iterate over two input bit vectors, selecting a block from each vector. The inner loop iterates over the set bits of the symmetric difference of the current bit blocks. Each set bit of the symmetric difference corresponds to a mutation that is to be classified. The result of the classification is used to select which input block will provide its state to the output block.

By simple observation, the bit mask that is built in each iteration of the inner loop only depends upon the mutation that corresponds to that bit position and the set of recombination events. It is independent of the state of any specific sequence data. Indeed, the symmetric difference of two blocks is used to determine which mutations need to be classified. However, this is primarily done as an optimization for sequential execution. This optimization is valuable to perform as it reduces the amount of bit walking and classification steps performed per generation.

The worst case, though, is when every bit of the symmetric difference is set. This would amount to an individual being heterozygous for every mutation in the population. While this is unlikely to occur in nature, it may arise in a simulation. In that case, every mutation needs to be classified, and having evaluated the symmetric difference provides no benefit.

The more general form of the inner loop is a simple loop over each of the mutations corresponding to bits in the current block. Furthermore, the outer loop iterates over every block of mutations. It follows, then, that these two loops could be trivially linearized into a single loop that iterates over all mutations, classifying each relative to the set of recombination events and building a bit mask vector. This type of simple iteration is trivially performed in a SIMD environment.

As a result, the first method decomposed from the recombination model builds a matrix of bit mask vectors in this manner. For the purposes of this work, this matrix is referred to as a crossover matrix. The first step of this method is for the host to generate a list of recombination events expected to occur for each individual of the offspring population.

The second step invokes a kernel that applies the classifier across all of the mutations relative to the recombination events. Each thread of the kernel is responsible for classifying one mutation; in effect, each thread generates a single bit of the resulting bit mask. The bits produced by each warp of threads are coalesced to form a single bit mask block that is stored in the crossover matrix. A row of blocks is used to evaluate a single bit mask vector, and the kernel is initialized with one block row for each offspring haploid genome, or index range defined in the first step.

The second decomposed method recombines the input haploid genomes of a parent using the bit masks defined in the crossover matrix. The first step is to perform selection on the parent population. The second step invokes a kernel where each thread reads a bit block from each parental haploid genome, and a bit mask from the crossover matrix. The bit blocks are then reduced to a single bit block using the same masking logic defined in the original algorithm.
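The companion kernel then reduces to a pure bitwise merge (a sketch with hypothetical names, assuming the selection step has already gathered the two parental strands for each offspring row):

    // out = bits taken from parental strand a where the mask bit is 0,
    //       and from strand b where the mask bit is 1.
    __global__ void recombine(const unsigned int* parent_a,   // [rows][words_per_row]
                              const unsigned int* parent_b,
                              const unsigned int* crossover,
                              unsigned int* offspring,
                              unsigned int total_words) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < total_words) {
            const unsigned int m = crossover[idx];
            offspring[idx] = (parent_a[idx] & ~m) | (parent_b[idx] & m);
        }
    }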

Decomposing recombination in this manner provides several advantages. First, the kernels for each method are simpler than a single kernel that more closely resembles the sequential algorithm would be. The simplicity of the kernels enables them to take full advantage of the computational power of the GPU. Finally, this separation enables the generation of the crossover matrix on the device to be overlapped with the selection process on the host.

5.2.3 Orchestration

By leveraging both the host and device for specific tasks in this work, it becomes necessary to effectively orchestrate the communication between the two processing elements. Intuitively, overlapping host task and kernel executions helps to reduce the overall runtime of a simulation. Although the life cycle is an inherently linear process, there are several opportunities where tasks can be overlapped.

The general approach is to overlap the second step of one model with the first step of the next. In effect, while a kernel executes on the GPU, the host is free to generate random numbers for the next step. For example, as mentioned in the previous section, while the crossover matrix is being evaluated the host is able to perform the selection model. Similarly, while the recombine kernel is being performed, the host can generate new mutations for the mutation model to consume.

There are also opportunities to overlap the execution of two kernels through the use of different streams. For instance, once the mutation model has completed, the state of the genome graph is effectively static for the remainder of the life cycle model. The genome graph is directly used in both the phenotype evaluation step and the fixation step. These steps, however, are independent of one another and do not change the state of the genome graph. Therefore, they can be trivially executed on different streams.

There are three device level synchronization steps performed within each generation. The first occurs before the recombine kernel is invoked; this is done to ensure that the crossover matrix has been evaluated and that the selected pairs have been communicated to the device. The second synchronization occurs after the mutation model has completed; this ensures that the genome graph is in a static state for the remainder of the generation. The final device synchronization is performed at the end of the generation.
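The per-generation flow might be arranged as follows (a sketch; the kernels and host functions are placeholders and error checking is omitted):

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for the real models (hypothetical).
    __global__ void k_crossover() {}
    __global__ void k_recombine() {}
    __global__ void k_mutate() {}
    __global__ void k_phenotype() {}
    __global__ void k_fixation() {}

    void host_select_parent_pairs() { /* CPU selection model */ }
    void host_generate_new_mutations() { /* CPU mutation generation */ }

    // One generation of the host/device overlap described above (syncs #1-#3).
    void run_generation(cudaStream_t s0, cudaStream_t s1) {
        k_crossover<<<64, 256, 0, s0>>>();
        host_select_parent_pairs();       // host works while the kernel executes
        cudaDeviceSynchronize();          // sync #1: crossover matrix and pairs ready

        k_recombine<<<64, 256, 0, s0>>>();
        host_generate_new_mutations();    // overlapped with the recombine kernel
        k_mutate<<<64, 256, 0, s0>>>();
        cudaDeviceSynchronize();          // sync #2: genome graph is now static

        // Independent readers of the static graph run on different streams.
        k_phenotype<<<64, 256, 0, s0>>>();
        k_fixation<<<64, 256, 0, s1>>>();
        cudaDeviceSynchronize();          // sync #3: end of the generation
    }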

5.3 Results

QTLSimCUDA is the CUDA-enabled sibling simulator to QTLSimMT. It leverages CUDA-enabled devices to handle genetic models that benefit from a data parallel architecture. Both simulators were evaluated using the same base workstation described in Table 4.1. Table 5.1 provides the specifications for the GPU used in this work.

5.3.1 Evolutionary Scenario

The evolutionary scales provided in Table 4.3 were repeated using QTLSimCUDA. The additional evolutionary scale listed in Table 5.2 was tested using both QTLSimMT and QTLSimCUDA. For this scale, QTLSimMT was limited to the T = 12 configuration. This was done primarily because T = 12 enabled the best performance at the previous scale for the multi-threaded simulator, and the order of magnitude increase in mutation rate was going to significantly increase the runtime.

    GPU                  NVIDIA GeForce GTX 970
    GPU Memory           4 GB
    Compute Capability   5.2
    CUDA API version     7.5
    Cores                1664

Table 5.1: GPU specification

                             Extra-large
    Population Size (N)      10^4
    Generations (G)          10^5
    Thread Count (T)         12
    Mutation Rate (µ)        1.0
    Recombination Rate (ρ)   1.0

Table 5.2: Extra Large evolutionary scale

Unlike QTLSimMT and the previous simulators, QTLSimCUDA does not perform any optimization to handle the neutral scenarios. As a result, it always performs a matrix multiplication during the phenotype evaluation step, even when it amounts to a multiplication by a zero matrix. It is therefore expected that QTLSimCUDA may not outperform QTLSimMT for neutral scenarios, especially at the smaller scales.

5.3.2 Small Scale Simulation

QTLSimCUDA does not outperform QTLSimMT in the neutral scenario as indicated by the results in Figure 5.1. In fact, QTLSimCUDA takes more than double the runtime of QTLSimMT. This was expected to occur because QTLSimCUDA is not optimized for the neutral scenario. As a result, the neutral and selected scenarios are expected to perform similarly to one another, which they do. The selected scenario performs slightly faster because there are fewer total mutations in the population as a result of fitness differences within the population.

Although the sequential execution of QTLSimMT is slower than QTLSimCUDA for the selected scenario, using even one additional thread enables the multi-threaded solution to outperform the GPU solution. At this scale, a multi-threaded solution is generally more performant than its GPU counterpart.

Figure 5.1: QTLSimCUDA - Small scale simulation performance results

5.3.3 Medium Scale Simulation

At this scale, the neutral scenario is again more efficiently simulated using a multi-threaded solution, as shown in Figure 5.2. However, the selected scenario offers a different story. In all cases, QTLSimCUDA outperformed QTLSimMT. It offered a 6.1x speedup over the sequential execution, but was only 54% more efficient than the T = 12 execution.

Figure 5.2: QTLSimCUDA - Medium scale simulation performance results

5.3.4 Large Scale Simulation

The large scale simulation scenario is the first at which QTLSimCUDA outperforms QTLSimMT in the neutral scenario, although this is limited to the sequential execution of QTLSimMT. The use of shared memory and the neutral scenario optimization still enable the multi-threaded executions to outperform the GPU version.

QTLSimCUDA, again, offers a significant performance advantage over QTLSimMT in the selected scenario. Compared to the sequential execution of QTLSimMT, QTLSimCUDA offers a 14.4x speedup. Even when QTLSimMT is configured with T = 12, its GPU counterpart is able to offer a 2.8x speedup, as shown by Figure 5.3.

Figure 5.3: QTLSimCUDA - Large scale simulation performance results

5.3.5 Extra Large Scale Simulation

Even at this final scale, QTLSimMT configured with T = 12 is able to offer a 3% performance advantage over QTLSimCUDA in the neutral scenario (Figure 5.4). At this scale, this equates to a 20 minute difference in runtime. Again, the selected scenario is best performed on the GPU as it offers a 4.3x speedup over its multi-threaded sibling. This is a 24.2 hour difference in runtime.

Figure 5.4: QTLSimCUDA - Extra large scale simulation performance results

Chapter 6

Limit for use

This work has focused on the in silico representation of a population in a population genetic simulation. Specifically, it considers a binary adjacency matrix as the base data structure to represent the population genome graph that results from the associations between a population of individual genomes and a set of elements. The initial observation for the use of this data structure is that the population genome graph is a simple association graph. That is, edges simply indicate whether or not two vertices are linked with one another. As such, a single bit is sufficient to provide this information, and taking advantage of this observation may enable a level of space savings.

As discussed in Chapter 3, this compact representation does help to reduce the memory footprint of a simulation. This can enable larger evolutionary scenarios to be considered without requiring more robust hardware environments. However, the compact representation does introduce some additional costs into the performance equation. For example, bit walking the sequence of set bits in each vector requires an additional step to first identify a set bit within a block of bits.

The compact adjacency matrix representation generally results from having considered the representation of a genome from a mutation based perspective. That is, the primary element of interest in the simulation is the set of mutations that are present within a population. As demonstrated, this results in a sufficiently dense graph that enables the adjacency matrix to be of benefit. However, there are evolutionary scenarios that will result in a sparse graph.

6.1 Allele based simulations

This work has attempted to view a genome as an abstract set of elements, with individuals of the population being described by their combination of elements. By abstracting the definition of the element, one enables the set of elements to be defined relative to the simulation scenario. For instance, when simulating an infinite site model [9] one might consider the element to be a mutation, or polymorphism. Conversely, an infinite allele model [10] might consider each element to be an allele, or a variant form of a specific locus. Although these models are related to one another [28], leveraging additional structural information for a locus can help to reduce the scale of the population genome graph.

The infinite allele mutation model views the genome as a finite set of loci. Each time a locus is mutated, a new allele is added to its set of alleles. In effect, this model allows the set of elements to be grouped according to common structural information. Moreover, it enables one to iteratively enumerate the unique combinations of mutations that occur in a population. This amounts to collapsing multiple edges of the genome graph into a single edge.

Figure 6.1: Locus graph

Figure 6.1 provides a general outline of how a locus-based simulator may represent the population genome graph. A global set of loci is defined relative to a reference genome. Instead of tracking each mutation independently of one another, mutations are accumulated to define a unique allele. As mutations occur within a given locus, new alleles are generated and added to the set of alleles for that locus. Since the set of loci is fixed, and each locus can exist in only one state per sequence, an individual's genome can be represented using fixed-length adjacency vectors, where each element is a reference to a specific allele of a given locus. Indeed, this is the base sequence representation offered by the simuPOP [6] simulation framework.

This approach tends to view the population genome graph as being sparse. Recall, the density of a graph refers to the number of edges of the graph, |E|, relative to the maximum number of edges for the graph, and is given by Equation (3.2). The incorporation of additional structural information enables the number of edges to be treated as a constant, with a single edge per locus. The vertex space dynamically grows as new alleles are added to the population. In effect, as the number of alleles increases, the density of the graph decreases.

In a sparse graph with a constant number of edges, it is more efficient to represent an edge by using a reference to the adjacent vertex. Again, this is achieved by enumerating the possible vertex space and using a simple integer to serve as the vertex reference, and the bit length of the integer depends upon the expected maximum size of the vertex space. In a locus-based population genome graph, this would be the expected maximum number of alleles for each locus.

In [10], Kimura and Crow showed that, in the absence of selection, the number of alleles for a locus, n, is approximately given by Equation (2.2), where Ne is the effective population size and u is the allelic mutation rate. Thus, a reference can be defined as an integer with at least Bp = ⌈log2(n)⌉ + p bits, where p is a non-negative number of padding bits. Therefore, if each locus is expected to have n alleles, then each adjacency vector is expected to have at least L · B0 bits, where L is the number of loci and B0 = ⌈log2(n)⌉. How does this compare with the expected space requirements for the binary adjacency matrix described in this work?
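As a small worked sketch of this sizing (hypothetical parameter values; I assume here that Equation (2.2) takes the classic Kimura and Crow [10] form for the effective number of alleles, n ≈ 4·Ne·u + 1):

    #include <cmath>
    #include <cstdio>

    // Bits per locus reference: Bp = ceil(log2(n)) + p padding bits.
    unsigned int bits_per_reference(double n, unsigned int padding) {
        return static_cast<unsigned int>(std::ceil(std::log2(n))) + padding;
    }

    int main() {
        const double Ne = 1e4, u = 1e-3;       // hypothetical configuration
        const double n = 4.0 * Ne * u + 1.0;   // assumed Equation (2.2): ~41 alleles
        const unsigned int L = 1000;           // number of loci (illustrative)

        const unsigned int B0 = bits_per_reference(n, 0);  // ~6 bits per locus
        std::printf("n = %.1f alleles/locus, B0 = %u bits, vector = %u bits\n",
                    n, B0, L * B0);
        return 0;
    }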

Another way to interpret the Bp bits for each locus is as a set of simple edges to multiple mutation vertices. That is, each bit can be used to indicate an association with a specific mutation, as has been done throughout this work. If the number of alleles for a locus, n, can be expressed as a unique combination of B0 mutations, the two graph representations can occupy equivalent memory footprints. There may be some variability as a result of the amount of padding used and whether un-mutated loci are necessarily represented in the graph.

In general, a dense locus based graph is expected to occur when there are few alleles per locus. As shown in [10], this occurs in an infinite allele mutation model when the inverse of the allelic mutation rate is significantly greater than the effective population size, Ne << u^{-1}. This provides a general limit for the effectiveness of the approach described in this work when considering a locus based simulation.

Chapter 7

Conclusion

This work focused on the design of scalable, high-performance forward-time population genetic simulators. It considered the general biological system that a simulator intends to mimic and the computational challenges that system presents. The first challenge is one of scale. A population is comprised of many individuals, and each individual possesses genetic material that makes them unique within the population. The genetic material of each individual contains billions of elements, making a direct in silico translation infeasible as the basis for a scalable simulator.

Instead, an individual's genetic material is reducible to a set of mutations, each of which occurs in one form or another elsewhere in the population. The set of mutations that occur within a population is significantly smaller than any genome. From an abstract perspective, a population is a set of individuals that have a set of mutations. This formulation is a graph. In effect, the scale challenge reduces to finding an in silico representation for a dynamically changing graph. Here, the graph used to represent the genetic material of all individuals in a population is referred to as the genome graph.

It is expected that each individual will have only a select subset of the mutations found in the population. Traditionally, this tends to suggest that an adjacency list representation of the relationships between individuals and mutations would provide a desirable base data structure for the genome graph. However, the average number of mutations per sequence in the population is expected to be great enough to form a sufficiently dense genome graph.

However, only simple relationships exist between individuals and mutations. That is, either the individual has the mutation or they do not. As such, a single bit is sufficient to represent an edge in the graph. If the expected number of mutations per individual is a significant enough portion of the total mutations in the population, then a binary adjacency vector can offer a reduced memory footprint for an individual. Subsequently, a binary adjacency matrix can be used to represent the population.
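A minimal sketch of such a binary adjacency vector follows; it packs one bit per mutation vertex into 64-bit words. The type and member names are hypothetical rather than those of the simulator developed in this work.

#include <cstddef>
#include <cstdint>
#include <vector>

// One row of the binary adjacency matrix: bit j is set when this sequence
// carries mutation j.  Set/test are constant-time bit operations, and
// whole-row set operations reduce to bitwise AND/OR over the packed words.
struct BinarySequence {
    std::vector<std::uint64_t> words;

    explicit BinarySequence(std::size_t num_mutations)
        : words((num_mutations + 63) / 64, 0) {}

    void add(std::size_t j)       { words[j / 64] |= (std::uint64_t{1} << (j % 64)); }
    bool has(std::size_t j) const { return (words[j / 64] >> (j % 64)) & 1u; }
};

Stacking one such row per sequence yields the binary adjacency matrix, which is why population-level operations map naturally onto wide bitwise instructions.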

This work explored the use of the binary adjacency matrix representation in forward-time population genetic simulation. The initial work was to establish a baseline for comparison. As such, a simulator based upon the binary adjacency matrix was produced. A memory comparison was conducted to evaluate its expected and observed memory usage relative to the adjacency list implementations provided by two open source forward-time population genetic simulation frameworks.

It was demonstrated that the binary adjacency matrix was able to offer a 4.5x savings in memory, although this was largely a result of several optimizations being applied. Removing the optimizations resulted in the binary adjacency matrix offering a 3x reduction relative to a comparable adjacency list approach.

The use of a binary adjacency matrix representation offered advantages and disadvantages across the various genetic models encountered in the life cycle of a population. As a result, new implementations of some common genetic models were developed to better leverage the base data structure. These were employed in the simulator used to perform the memory comparison. This simulator was used again to conduct a runtime performance comparison with simulators based upon an adjacency list representation. The binary adjacency matrix representation noticeably outperformed the others, offering at least a 6.1x speedup over its nearest competitor in a large-scale simulation.

This initial work was conducted under the constraint that a simulation should run efficiently on a single processor core. The next step was to relax the single-core constraint and explore the parallelism that is readily available in a forward-time population genetic simulation. Parallelism was explored from both task-parallel and data-parallel perspectives.

First, a task-parallel solution, QTLSimMT, was developed. It is capable of performing a simulation using either a batch or a pipeline mode of execution. These modes distribute partitions of work across the available threads, incurring different amounts of overhead as a result of their respective use cases. In general, the pipeline mode offered better performance as the number of threads increased.

Second, a data-parallel solution, QTLSimCUDA, was also developed. This solution uses a CUDA-enabled GPU to perform the data-parallel operations. The development of this simulator required some modification of the existing genetic models as a result of the SIMD architecture. In addition, some of the optimizations present in QTLSimMT were removed from QTLSimCUDA. Although QTLSimCUDA suffered in the base evolutionary scenarios as a result of not employing these optimizations, it excelled at the larger scales that involved more computation.

Despite the demonstrated space and time savings achieved through a binary adjacency matrix representation of the genome graph, its use as a general representation in forward-time population genetic simulation is limited. This limit is related to the density of the genome graph being simulated. An allele-based simulation may generally produce a sparse graph because its edge space is constant, the result of a fixed set of loci mapping to an expanding set of alleles.

In summary, this work has taken a novel approach to representing the genome graph of a population. It has shown the benefits of the representation in terms of both memory utilization and general runtime performance in a sequential environment, and it extended that study to task- and data-parallel solutions for forward-time population genetic simulation.

Bibliography

[1] Jared C. Roach, Gustavo Glusman, Arian F. A. Smit, Chad D. Huff, Robert Hubley, Paul T. Shannon, Lee Rowen, Krishna P. Pant, Nathan Goodman, Michael Bamshad, Jay Shendure, Radoje Drmanac, Lynn B. Jorde, Leroy Hood, and David J. Galas. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science, 328(5978):636–639, 2010.

[2] Antonio Carvajal-Rodríguez. Simulation of genes and genomes forward in time. Current Genomics, 11(1):58–61, March 2010.

[3] Bo Peng, Huann-Sheng Chen, Leah E. Mechanic, Ben Racine, John Clarke, Lauren Clarke, Elizabeth Gillanders, and Eric J. Feuer. Genetic simulation resources: a website for the registration and discovery of genetic data simulators. Bioinformatics, 29(8):1101–1102, 2013.

[4] Sean Hoban, Giorgio Bertorelle, and Oscar E. Gaggiotti. Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics, February 2012.

[5] Xiguo Yuan, David J. Miller, Junying Zhang, David Herrington, and Yue Wang. An overview of population genetic data simulation. Journal of Computational Biology, 19(1):42–54, January 2012. doi:10.1089/cmb.2010.0188.

[6] Bo Peng and Marek Kimmel. simuPOP: a forward-time population genetics simulation environment. Bioinformatics, 21(18):3686–3687, 2005.

[7] K. R. Thornton. A C++ template library for efficient forward-time population genetic simulation of large populations. arXiv e-prints, January 2014.

[8] Frédéric Guillaume and Jacques Rougemont. Nemo: an evolutionary and population genetics programming framework. Bioinformatics, 22(20):2556–2557, 2006.

[9] M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.

[10] Motoo Kimura and James F. Crow. The number of alleles that can be maintained in a finite population. Genetics, 49:725–738, April 1964.

[11] M. Kimura. Evolutionary rate at the molecular level. Nature, 217:624–626, February 1968.

[12] Motoo Kimura. The neutral theory of molecular evolution. Cambridge University Press, Cambridge; New York, 1983.

[13] Andreas Wagner. Neutralism and selectionism: a network-based reconciliation. Nature Reviews Genetics, 9(12):965–974, 2008.

[14] Fumio Tajima. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595, November 1989.

[15] Beth L. Dumont and Bret A. Payseur. Evolution of the genomic rate of recombination in mammals. Evolution, 62(2):276–294, 2008.

[16] Astrid Dempfle, André Scherag, Rebecca Hein, Lars Beckmann, Jenny Chang-Claude, and Helmut Schäfer. Gene-environment interactions for complex traits: definitions, methodological requirements and challenges. European Journal of Human Genetics, 16(10):1164–1172, 2008.

[17] Tesfaye M. Baye, Tilahun Abebe, and Russell A. Wilke. Genotype–environment interactions and their translational implications. Personalized Medicine, 8(1):59–70, 2011.

[18] Russell Merris. Graph theory. John Wiley, New York, 2001.

[19] K. Vaidyanathan, L. Chai, W. Huang, and D. K. Panda. Efficient asynchronous memory copy operations on multi-core systems and I/OAT. In 2007 IEEE International Conference on Cluster Computing, pages 159–168, September 2007.

[20] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7(2):256–276, 1975.

[21] Patrick P. Putnam, Philip A. Wilsey, and Ge Zhang. Clotho: addressing the scalability of forward time population genetic simulation. BMC Bioinformatics, June 2015.

[22] Richard R. Hudson. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.

[23] Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pages 127–136, New York, NY, USA, 2013. ACM.

[24] NVIDIA. CUDA C Programming Guide, September 2015.

[25] Intel®. Intel® Architecture Instruction Set Extensions and Future Features Programming Reference, October 2017.

[26] William Magro, Paul Petersen, and Sanjiv Shah. Hyper-threading technology: Impact on compute-intensive workloads. Intel Technology Journal, 6(1), 2002.

[27] NVIDIA. CURAND Library, September 2015.

[28] Fumio Tajima. Infinite-allele model and infinite-site model in population genetics. Journal of Genetics, 75(1):27–31, 1996.