
Scalable, High-Performance Forward Time Population Genetic Simulation

A dissertation submitted to the

Division of Research and Advanced Studies of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the Department of Electrical Engineering and Computer Science of the College of Engineering and Applied Sciences

March 26, 2018

by

Patrick P. Putnam, BSCS, University of Cincinnati, 2008

Thesis Advisor and Committee Chair: Dr. Philip A. Wilsey

Abstract

Forward-time population genetic simulators are computational tools used in the study of population genetics. Simulations aim to evolve the genetic state of a population relative to a set of genetic models that reflect the processes that occur in nature under various configurations. Often, these simulations are limited to evolutionary scales that can be represented within the memory space of, and feasibly computed on, a standard workstation computer. This presents a general challenge of how to represent the genetics of a population so that evolutionary scenarios of sufficient scale can be performed on a memory constrained system. In addition, as the evolutionary scales increase, so too does the computational time necessary to complete the simulation.

This work considers the general problems of scale and performance as they relate to forward-time population genetic simulation. It explores the representation of a population from the perspective of a graph. Improved memory utilization and computational performance are achieved through the use of a binary adjacency matrix representation of the graph. The use of this representation is generally uncommon in forward-time population genetic simulation.

Further improvements are made to the performance of the simulator through the use of parallel computation. This work considers a forward-time population genetic simulation from both a task- and a data-parallel perspective. Each of these perspectives presents certain challenges and offers different levels of performance gains. The utilization of the binary adjacency matrix representation enables each of these parallel approaches to be achieved.

Finally, although scale and performance improvements are enabled through the use of a binary adjacency matrix representation of the graph, it does have limits in forward-time population genetic simulation. These limits are related to the density of the graph. This work offers a situation where this representation would not be beneficial.


Acknowledgements

I would like to express my deepest gratitude to my advisor, Dr. Philip A. Wilsey. His continued support and guidance throughout the course of my studies have been immeasurable. He has been an invaluable asset over the years and has demonstrated an overwhelming level of patience as I worked towards completing my education while advancing my career. I would also like to thank my committee member and manager, Dr. Ge Zhang, without whom I would not have started down this topic of study. Finally, I would like to thank my committee members Dr. Fred Beyette, Dr. Yizong Cheng, and Dr. Karen Davis for their valuable time and involvement in my journey.

Contents

1 Introduction
1.1 Background
1.2 Computer Simulations
1.3 This work
2 Population Genetics
2.1 Change in a Population
2.1.1 Mutation
2.1.2 Recombination
2.1.3 Selection
3 Base design of a Forward-Time Population Genetic Simulation
3.1 Graph Interpretation of a Population
3.1.1 Genome Sequence Representation
3.1.2 Adjacency List Representation
3.1.3 Adjacency Matrix Representation
3.2 Genetic Models
3.2.1 Mutation Model
3.2.2 Recombination
3.2.3 Phenotype Evaluation
3.2.4 Fitness
3.2.5 Selection
3.3 Results
3.3.1 Evolutionary Scenarios
3.3.2 Simulators and Workstation Configuration
3.3.3 Memory Utilization Comparison
3.3.4 Medium Scale Simulation
3.3.5 Large Scale Simulation
4 Task Parallelism in Forward-Time Population Genetic Simulation
4.1 Task Parallelism
4.1.1 Partitioning
4.1.2 Batch Model
4.1.3 Pipeline Model
4.2 Results
4.2.1 Simulator and Hardware Configuration
4.2.2 Evolutionary Scenarios
4.2.3 Small Evolutionary Scale
4.2.4 Medium Evolutionary Scale
4.2.5 Large Evolutionary Scale
5 Data Parallelism in Forward-Time Population Genetic Simulation
5.1 Graphics Processing Units
5.1.1 GPU vs CPU
5.1.2 Kernels
5.1.3 GPU Challenge
5.2 Data Parallel Life Cycle Models
5.2.1 Random Number Generation
5.2.2 Recombination
5.2.3 Orchestration
5.3 Results
5.3.1 Evolutionary Scenario
5.3.2 Small Scale Simulation
5.3.3 Medium Scale Simulation
5.3.4 Large Scale Simulation
5.3.5 Extra Large Scale Simulation
6 Limit for use
6.1 based simulations
7 Conclusion

List of Figures

2.1 Example of genetic crossover
3.1 High-level abstraction of a population as a graph
3.2 Small Scale Simulation Comparison
3.3 Medium Scale Simulation Comparison
3.4 Large Scale Simulation Comparison
4.1 Batch Model Concept
4.2 Pipeline Model Concept
4.3 QTLSimMT - Small Scale; Neutral
4.4 QTLSimMT - Small Scale; Selected
4.5 QTLSimMT - Medium Scale; Neutral
4.6 QTLSimMT - Medium Scale; Selected
4.7 QTLSimMT - Large Scale; Neutral
4.8 QTLSimMT - Large Scale; Selected
5.1 QTLSimCUDA - Small scale simulation performance results
5.2 QTLSimCUDA - Medium scale simulation performance results
5.3 QTLSimCUDA - Large scale simulation performance results
5.4 QTLSimCUDA - Extra large scale simulation performance results
6.1 graph

List of Tables

3.1 Simulation scales
3.2 Hardware configuration - Workstation #1
3.3 Breakdown of memory footprint of genome graph representations
3.4 Memory comparison of Genome Graph representations
4.1 Hardware Configuration - Workstation #2
4.2 Common Population Configuration
4.3 Evolutionary scales and threading configuration
5.1 GPU specification
5.2 Extra Large evolutionary scale

List of Algorithms

1 Outline of a FTPGS
2 Mutation Algorithm
3 Unordered Adjacency Recombination Algorithm

Chapter 1

Introduction

1.1 Background

Populations of organisms are complex systems. Loosely defined, a population is a collection of organisms that are grouped together, typically by their physical location. Changes in the environment where a population resides help to govern the cyclic process of a population's survival. That is, as the environment changes, individuals in a population are forced to adapt. Those individuals who are better suited for the current environment, or are better fit, are more likely to produce offspring that will survive in the next generation. Offspring inherit from their parents the attributes, or traits, that have enabled them to survive. As each new generation matures and procreates, the population continues to survive, changing with the environment.

The phenotype is the outward manifestation of the internal genetic structure, or genome, of an individual as it interacts with the environment. In other words, the phenotype is a collection of observed traits. Traits may be associated with specific heritable units of genetic material called genes.

Offspring inherit one copy of each gene from each of their parents. Naturally occurring processes, such as recombination and mutation, may alter the genetic material of a parent's gene. This may lead to an offspring inheriting a variant of their parent's gene. Even so, there are relatively few genetic variations between a human child's genome and those of its parents [1]. At the population level, however, there may exist multiple forms of the same gene. Each specific form of a gene is referred to as an allele. Alleles are a basic unit of study in Population Genetics.

The field of Population Genetics is generally interested in studying the genetic landscape of a population of organisms over time. In some respects, it is the variations at the genetic level of a population that determine its ability to survive. The idea being that a variation causes change in the traits of an individual. The changes may either increase, decrease, or not affect the ability of the individual to survive. In essence, events occurring at the micro-scale have an impact on the macro-scale. An active area of research attempts to quantify the role of genetic variation in determining a trait, and how the variations spread through a population over time. In effect, researchers in this field study a population's distribution of genetic variation frequency, or allele frequency, in order to better understand the interconnection between its micro- and macro-scales.

Studies of this nature involve having to collect and analyze genetic material, as well as measuring various traits from many individuals across multiple generations of a population. Although technological advances have significantly improved the data collection process in terms of both accuracy and speed, nature still imposes a delay between generations of a population. Furthermore, large sample sizes are often necessary before statistical inferences about the distribution of alleles in the population can be made. In effect, time and the volume of data present significant challenges in the field of Population Genetics. As a result, populations of fast reproducing model organisms are often studied. Genetic models based upon evidence from these populations are often constructed and then considered in the context of populations of other organisms. Translating genetic models between organisms is not always mathematically tractable. Computer simulations are used in Population Genetics to evaluate mathematically intractable genetic models.

1.2 Computer Simulations

The synthetic data sets produced by computer simulations may be used to predict properties of a population that result from mathematically intractable models. In addition, the ability to produce large-volume data sets aids in assessing the validity of statistical models, drawing statistical inferences from them, and performing power analysis [2]. Over the years, many simulators have been developed [3]. It is not uncommon to find simulators that overlap in their set of genetic models, but provide different levels of configurability, analytic capabilities, and general performance [4].

Population Genetic Simulators [5] fall into two classes: coalescent based and forward-time. Coalescent based simulators start from a current population and work backward through time to derive a common ancestral population using retrospective stochastic models. These simulators enable large populations of organisms to be efficiently explored. Often, though, incorporating new genetic models into this simulation style can be challenging.

Forward-time simulators take the opposite approach by starting from an ancestral population and constructing future generations following the abstract process found in nature. The association with the natural process often allows for a simpler, more general design to be used in the simulation process. This often enables novel genetic models to be more easily incorporated into an evolutionary scenario. These simulations also offer the ability to produce a more fine-grained view of a population.

The granularity offered in forward-time simulations increases both the memory and the run-time needed to perform a simulation. In general, the memory requirement of a simulation is proportional to the size of the population and the size of the genome being simulated. This has an expected cascading effect on the overall performance of a simulation. As a result, relatively small increases to the scale of an evolutionary scenario can test the limits of what is an acceptable amount of resources for a simulation. Unfortunately, the scalability of a simulator is often not a primary concern when a simulator is developed. This was a motivating observation behind this work.

1.3 This work

Often, Population Genetic researchers are not interested in the computational aspects of a simulation. Rather, they are motivated by what the simulation enables them to learn, and rightfully so. A side-effect of this, however, is that they develop simulators that are often designed for specific evolutionary scenarios using sub-optimal data structures and algorithms. Although the simulator may work for the intended scenario, increasing the evolutionary scale tends to result in requiring significantly more computational resources and processing time.

To account for this problem, several forward-time simulation frameworks [6–8] have been developed that provide generalized algorithms and generalized data structures for common objects and natural processes that describe a population. These frameworks can be customized and are highly configurable, allowing researchers with limited programming experience to develop a simulator for the specific evolutionary scenario they are studying without having to re-invent the wheel. Researchers often rely upon the framework to handle issues related to the scalability of their simulator.

Although frameworks provide functionality intended to enable the same goal, they differ in their approach of how best to reach that goal. In a general sense, frameworks provide different abstractions of both the macro- and micro-scales of a population. From a computational perspective, the representation of a population at the micro-scale plays a more significant role in the scalability of a simulator than the macro-scale. That is, an individual's genome forms the basis of any population simulation. Its in silico representation influences the algorithmic design of the genetic models, and any computational optimization that may be possible.

One area that is often underutilized in forward-time population genetic simulation is parallel processing. This underutilization is unfortunate, as nature is an inherently parallel system. It stands to reason, then, that a system intended to mirror nature should also be inherently parallel. Indeed, simulation frameworks such as simuPOP [6] do enable the use of multiple threads. However, there are some stated situations under which multiple threads cannot be used. As part of this work, a deeper exploration into the parallel opportunities available in a forward-time population genetic simulation is conducted. Parallelism is investigated from the perspective of a task-level parallel solution leveraging multiple CPU threads. It is also considered from a data parallel perspective using a GPU.

This work focuses on the development of forward-time population genetic simulators that are efficient and scalable. Chapter 2 provides a general overview of the genetics of a population and the natural processes modeled in a forward-time population genetic simulation. Chapter 3 expands upon the design of a forward-time population genetic simulator and the algorithmic impact of the graph-based representation. It provides a baseline comparison with the representations provided in the existing simulation frameworks. Chapter 4 looks at using task parallelism for a population genetic simulation. Chapter 5 explores the use of a GPU to take advantage of data parallelism within a simulation. Chapter 6 explores some of the limitations that result from the simulator design followed. Chapter 7 concludes the work.

Chapter 2

Population Genetics

The genome of an individual is defined by a set of chromosomes. Each chromosome is a double-stranded chain of nucleic acids, or nucleotides. That is, a chromosome is two sequences of genetic material joined together. However, due to the molecular structure of the nucleic acids, only complementary nucleic acids may join with one another. Therefore, the two sequences of a chromosome are reverse complements of one another. This natural pairing of nucleotides allows a chromosome to be expressed as a single sequence of base pairs, or bases, instead of two sequences of nucleotides.

The length of each chromosome varies, with some being tens of thousands of nucleotides and others being hundreds of millions. The genome of an organism is the collection of a set of chromosomes. It may consist of billions of nucleotides.

Encoded within each chromosome is a set of genes. A gene is a sub-sequence of genetic material that serves a functional purpose in the organism. It is the genes of an organism that govern the state of a trait. Variant forms of a gene are called alleles, and refer to a specific state of the chromosomal sequence for the gene. From a scale perspective, the human genome has roughly 6.4 billion bases spread over 46 chromosomes encoding an estimated 20,000 genes.

Although a genome may be defined in terms of independent chromosomes, some organisms possess multiple copies of homologous chromosomes. Humans, for instance, are diploid organisms, meaning that we possess 2 sets of chromosomes, with each parent contributing one set of their chromosomes to their offspring. In other words, the human genome is more accurately described as 22 pairs of homologous chromosomes and a pair of sex chromosomes. From a simulation perspective, it is convenient to abstract an organism in terms of single sets of chromosomes, or haploid genomes. That is, organisms that have multiple copies of chromosomes, such as humans, can be viewed as having multiple haploid genomes instead of having one set of grouped chromosomes.

2.1 Change in a Population

Naturally occurring processes act to modify the chromosomes a parent passes along to their offspring. Some of these processes occur at the microscopic layer of a population, while others are more macroscopic. Two of the more common microscopic processes are mutation and recombination. A mutation may be considered a transformation that occurs at a single site of a genome. Recombination is a process by which genetic material is exchanged between two homologous chromosomes of a parent's genome. Selection is a more macroscopic process that controls population change by identifying which individuals of a population will mate with one another. A simulation aims to provide algorithmic models for these processes, enabling one to vary the assumptions made about each process.

2.1.1 Mutation

Mutation is a micro-scale process that results in a state change of a genetic location. Historically, there are two mutation models that are commonly considered in Population Genetics. The infinite site mutation model [9] assumes that mutation can occur at an infinite number of genetic sites. Each mutation is considered to always be a novel site. It estimates the mutation rate θ of the population as Equation (2.1), in terms of the effective population size, N_e, and the number of mutations found in a random sequence of the population, µ*. This in turn can be estimated from the length of the sequence, k, and the per base mutation rate, µ. For a human population, the per base mutation rate is on the order of µ = 10^-8 [1].

θ = 4N_e µ*            (2.1a)

µ* = kµ                (2.1b)
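As a concrete illustration of Equation (2.1), take the human per base rate quoted above, µ = 10^-8, together with an assumed sequence length of k = 10^6 bases and an assumed effective population size of N_e = 10^4 (both example values, not estimates from this work):

µ* = kµ = 10^6 × 10^-8 = 10^-2

θ = 4N_e µ* = 4 × 10^4 × 10^-2 = 400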

The infinite allele model [10] considers that a locus, or fixed position of a genome, will undergo mutation. When mutation occurs, a novel allele is formed. This results from an underlying assumption that there are an infinite number of genetic states for each locus. It estimates a lower bound on the number of alleles a population can support as Equation (2.2), where u is the allelic mutation rate.

n = 4N_e u + 1            (2.2)

Both of these models follow a neutral theory of mutation [11–14]. A neutral mutation is a genetic change that does not affect the fitness of the individual. That is, mutation does not change an individual’s ability to survive and reproduce. From a more general perspective, one might consider that each mutation carries with it an effect size, or a quantitative measure of the strength of the mutation. Under the neutral theory, all mutations have a zero effect size.

2.1.2 Recombination

Another process that introduces genetic change to a population is recombination. In nature, this process physically exchanges genetic material between homologous chromosomes during a type of cell division called meiosis. Homologous chromosomes are the maternal and paternal copies of a chromosome that pair together during reproduction. The physical exchange of genetic material is referred to as a crossover.

Figure 2.1: Example of genetic crossover

Figure 2.1 provides a conceptual view of genetic crossover. Two homologous chromosomes align with one another during meiosis. At some point the two cross with one another, and genetic material from one chromosome is effectively swapped with genetic material from the other. The end result is two chromosomes that are a mosaic pattern of paternal and maternal genetic material.

Meiosis results in four daughter cells, each containing a haploid genome of the parent cell. One of these four cells may potentially find its way into the next generation as part of an offspring.

The recombination rate, ρ, varies by organism [15] and may even vary between the different chromosomes of an organism. In humans, if two genetic loci are separated by, on average, 1 million base pairs, then there is about a 1% chance that they will be separated by a crossover event in the next generation. Put another way, there is, on average, 1 crossover event per 100 million base pairs of the human genome.

2.1.3 Selection

The genetics of a population is also influenced by which individuals reproduce to the next generation. Modeling how the individuals of a population, in effect, select a mate is a complex and challenging problem. Often, the modeling of selection is reduced to computing the fitness of each individual, then pairing individuals of the population based upon their respective fitness values.

The fitness of an individual is a quantitative measure of an individual’s reproductive success, or their ability to produce offspring that will survive to maturity. Fitness can be defined in terms of either the phenotype or genotype in a given environment. The genetics of an individual interact with environmental forces to define the phenotype of an individual (Equation (2.3)) [16, 17].

P(henotype) = G(enotype) + E(nvironment) + GE            (2.3)

The phenotype of an individual may consider a set of characteristics, or observable traits. As described earlier, mutation is thought to produce variations in the observed traits of individuals in a population. Each variation carries with it a certain weight, or effect size, for each of the P characteristics considered in the phenotype. It is often assumed that the state of a genetic location has the same effect throughout a population. However, it is not always clear how a single trait may result from the gene-environment interaction. In fact, it is one of the aims of population genetic simulations to provide insight into the interaction between genetic and environmental forces.

Chapter 3

Base design of a Forward-Time Population Genetic Simulation

A forward-time simulation begins with the construction of genetic models describing how a population is expected to change, and with the initialization of an ancestral population. The critical loop of the simulation can be described as a 2-phase process: a child population generation phase and an analytic phase. The first phase begins by selecting parents from the existing population. The parents undergo a reproduction process. This process transforms the input parental genome by evaluating the genetic models, specifically the models for mutation and recombination, to produce a partial genome from each parent. The partial parental genomes are then combined to form the genome of a child. A viable child is then added to the child population.

Once the child population reaches a specific size, or another condition has been satisfied, the analytic phase begins. This phase computes statistics about the child population that can be used to further direct the simulation, or that are simply recorded for future offline processing. Some statistics may require significant computational time to compute for large populations. Therefore, the analytic phase may optionally be performed with each generation.

The cycle repeats using the child generation as the parents of the next generation. This cycle continues until either a desired T-th generation has been created, or another condition has been satisfied. The general algorithm is given in Algorithm 1.

Algorithm 1 Outline of a FTPGS
P ← {}                            ▷ define the set of parents
C ← ∅
for g = 0 to T do                 ▷ for each generation
    while evaluation of C fails do
        p ← σ(P)                  ▷ selection model
        p′ ← γ(p)                 ▷ reproduce p
        if ψ(p′) then             ▷ test viability
            add p′ to C
        end if
    end while
    save statistics about C
    clear P
    swap C and P
end for
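The loop above maps naturally onto code. The following C++ sketch is a minimal, illustrative rendering of Algorithm 1; the types and the select/reproduce/viable/record functions are hypothetical stand-ins for the models σ, γ, and ψ, not the interface of any particular framework.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Genome {};                         // placeholder for the chosen representation
using Population = std::vector<Genome>;

// Hypothetical stand-ins for the models in Algorithm 1.
Genome select_parents(const Population& parents) { return parents.front(); } // sigma
Genome reproduce(const Genome& p) { return p; }   // gamma: mutation + recombination
bool   viable(const Genome&) { return true; }     // psi: viability test
void   record_statistics(const Population&) {}    // analytic phase

void simulate(Population parents, std::size_t N, std::size_t T) {
    Population children;
    children.reserve(N);
    for (std::size_t g = 0; g < T; ++g) {          // for each generation
        children.clear();
        while (children.size() < N) {              // fill the child population
            Genome c = reproduce(select_parents(parents));
            if (viable(c)) children.push_back(std::move(c));
        }
        record_statistics(children);               // optional per-generation analytics
        std::swap(children, parents);              // children become the next parents
    }
}
```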

Both phases perform operations that either inspect or modify the genome of individuals within a population. For example, during the reproduction process parental genomes are effectively copied and modified to form a child genome. Similarly, a common statistic to compute is the allele frequency spectrum. This involves counting the occurrences of each allele present in the population.

From a computational perspective, the data structure used to represent a genome plays a central role in the scalability of a simulation.

3.1 Graph Interpretation of a Population

As described earlier, a population consists of a set of individuals, with each individual possessing a set of haploid genomes. Each haploid genome consists of a set of genetic locations. Each genetic location has a specific genetic state drawn from a set of genetic states. Similarly, each individual may also be described by a set of traits. A population genetic simulation aims to enable an efficient mapping between the genetic set and the trait set.

From an abstract perspective, the elements of each of these sets are unique objects, and a relationship describes how objects from different sets are associated with one another. In this way, a slightly more general conceptualization of a population is as a graph [18] (Figure 3.1). Individuals, haploid genomes, and mutations can be considered vertices. Edges of the graph represent a simple connected relationship, or adjacency, between the objects of different sets.

Figure 3.1: High-level abstraction of a population as a graph

As a matter of convenience, the sub-graph of the population graph formed from the haploid genome vertices and their associated mutations is referred to as the genome graph. This graph represents the current genetic state of an entire population. As such, it forms the foundation upon which a genetic simulation is built.

There are several observations to be made about the genome graph. In general, the vertex set of a graph considers the entire set of possible elements. That is, it abstractly refers to the set that results from combining the individuals and mutations into a single set of elements. This logically enables edges to exist between individuals, as well as edges to exist between mutations.

Although these edges do have meaning, they generally indicate some alternate layer of relatedness.

For example, individuals may be related to one another as siblings, or mutations may be related as having occurred at the same location. However, these types of relationships can be viewed as being independent of a generation’s genome graph. In effect, the genome graph can be reduced to those relationships between individual genomes and the mutations they possess.

A second observation is that the edges of the graph are simple in that they only represent a connected relationship. That is, an edge only represents whether an individual has a mutation or not. This type of binary relationship is trivially represented by a single bit. For example, when the bit is set the relationship exists; otherwise it does not.

A third observation is that the graph is undirected. A directed relationship means that there is an ordered relationship between adjacent vertices. That is, an edge that exists between vertex A and vertex B is different from an edge that exists between vertex B and vertex A. In the case of the genome graph, this would suggest that an edge between an individual and a mutation is different from an edge between a mutation and an individual, or vice versa. That is, an individual could be paired with a mutation, but that same mutation may not be paired with the same individual. This is counterintuitive. If an individual has a given mutation, then that mutation is part of that genome. Similarly, if a mutation is part of a genome, then that genome has that mutation.

Although the genome graph is undirected, it often suffices to physically represent the graph as being directed. That is, the bi-directional nature of each edge is physically represented as a single directed edge, with the reciprocal edge being logically inferred. This has the effect of reducing the physical space by 50%, as only half the edges need be represented. This is advantageous from a simulation perspective, as most simulated models have a directional flow. That is, a simulation may consider how the population of individuals changes over time, in which case it may make more intuitive sense to change the mutations of a genome rather than change which genomes a mutation exists in. For the purposes of this work, a simulation is assumed to progress from the perspective of adding and removing mutations from a genome.

The choice of computational representation of the genome graph greatly influences how one may choose to implement the genetic models of a simulation, as well as the amount of resources a simulation requires. This work presents a relatively uncommon representation of the genome graph that combines some of the advantages of two commonly used representations. The first common representation is based on the traditional view of a genome as a sequence where the state of every genetic location is present. The second assumes that only the state of mutated locations within a genome needs to be represented.

3.1.1 Genome Sequence Representation

The most basic representation of a genome is as an ordered state sequence. In this representation, each position of the genome is expected to have a uniquely defined state. However, the set of states for each position is finite. Traditionally, the state of each position reflects the nucleotide at that position. A DNA sequence has four possible states for each nucleotide: A for adenine, C for cytosine, G for guanine, and T for thymine. In effect, a genome sequence is represented as a string of characters.

This representation may be generalized to enable different character sets to be used. For example, a gene is identified as a well-defined sub-region of the sequence, and an allele is identified by the current state of the subsequence. Although the set of alleles may be considered unbounded as a result of the infinite allele model [10], from a simulation perspective one might choose to represent the alleles using a sufficiently large integer space. This will be discussed in more detail in a later chapter. Indeed, this representation is offered in simulation frameworks such as simuPOP [6] and NEMO [8]. In many respects, this is a simple, flexible representation of a genome. Furthermore, it is intuitive for most researchers as it directly reflects the structure encountered in nature.

Representing every genetic location has its computational advantages and disadvantages. On the plus side, identifying the state of specific genetic locations becomes a trivial, constant time operation. This is advantageous as several of the common steps of a simulation, such as mutation and phenotype evaluation, either transform or evaluate the state of a given genetic location or contiguous sub-region. Having constant time access to each location enables these operations to be efficiently performed.

On the negative side, the memory for a single sequence scales linearly with the size of the genome. That is, a population of genomes generally requires O(NLb) memory, where N is the size of the population, L is the size of a genome, and b is the number of bits used to represent the state of a genetic location.
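To make the scaling concrete, the following back-of-the-envelope sketch evaluates N·L·b for a set of assumed example values (a diploid population of 10^4 individuals, so 2 × 10^4 haploid sequences, a genome of 10^6 locations, and one byte per location state); these numbers are illustrative, not measurements from this work.

```cpp
#include <cstdio>

int main() {
    // Illustrative values only: 2N = 2*10^4 haploid sequences,
    // L = 10^6 genetic locations, b = 8 bits (one byte) per state.
    const double haploid_genomes = 2.0e4;
    const double genome_length   = 1.0e6;
    const double bits_per_state  = 8.0;

    const double bytes = haploid_genomes * genome_length * bits_per_state / 8.0;
    std::printf("sequence representation: %.1f GB\n",
                bytes / (1024.0 * 1024.0 * 1024.0));
    // ~18.6 GB: linear growth in N and L quickly exceeds workstation memory.
    return 0;
}
```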

From a performance perspective, increasing the size of a genome requires more computational time to perform memory movement operations. Intuitively, it takes longer to copy larger blocks, thus increasing the overall runtime. At the hardware level, the latency of a memory copy operation increases with the block size being copied [19]. This results in degrading performance for simulation steps such as recombination. This process aims to copy variable-length, contiguous subsequences from a parental genome to an offspring genome. In other words, it copies variable-length blocks of memory between memory spaces. Thus, reducing the physical representation of a genome not only saves memory, but also helps to mitigate the effects of some hardware-level overheads.

3.1.2 Adjacency List Representation

A general observation about the genetic structure of a population is that there are far fewer variable genetic locations across a population than there are genetic locations within a genome. Indeed, in [20], an estimate for the number of segregation sites, which are generally referred to as mutations throughout this work, for a population under neutral mutation [11, 13, 14] is given by Equation (3.1).

The estimate is given in terms of the per generation mutation rate, µ, and the effective population size, N_e, for a random sample of n sequences. As an example, for a population of n = N = 10^4 and a per generation mutation rate of µ = 10^-2, the expected number of segregation sites is approximately E[S] = 3,916. This is a significant reduction from the projected genome size of 10^6. In effect, by representing only those sites which differ within a population, a simulator stands to reduce its memory footprint by several orders of magnitude relative to a comparable fixed-length representation.

E[S] = 4N_e µ ∑_{i=1}^{n−1} 1/i            (3.1)

A way to leverage this observation is to represent a genome as a list of mutations. More specifically, a common set of mutations can be maintained and each genome can be represented as a variable-length list of references to elements of the mutation set. This is analogous to an adjacency list representation of a graph. Simulation frameworks, such as FWDPP [7] and simuPOP [6], provide the necessary data structures for representing a genome graph in this manner. However, their implementations are noticeably different from one another. These differences will be described in a later section.
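As a quick sanity check of the E[S] = 3,916 figure quoted above, Equation (3.1) can be evaluated directly; a minimal sketch:

```cpp
#include <cstdio>

// Expected number of segregation sites under the neutral model, Equation (3.1).
double expected_segregating_sites(double Ne, double mu, int n) {
    double harmonic = 0.0;
    for (int i = 1; i < n; ++i)      // sum_{i=1}^{n-1} 1/i
        harmonic += 1.0 / i;
    return 4.0 * Ne * mu * harmonic;
}

int main() {
    // Values from the example in the text: Ne = n = 10^4, mu = 10^-2.
    std::printf("E[S] = %.0f\n", expected_segregating_sites(1.0e4, 1.0e-2, 10000));
    return 0;  // prints approximately 3916
}
```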

From a simulation perspective, an adjacency list offers several useful features. An adjacency list aims to represent a minimal edge space for a given vertex of the graph. This is done by only maintaining those edges, or references to adjacent vertices, that are present in the graph. Thus, the physical size of a graph using this representation is |E| · B, where E is the set of edges and B is the size of the edge representation. In the context of this work, the edge set is dynamic and changes with each generation of a simulation. Certainly, this will contribute most to the in-memory size of the graph. It is of interest, though, to provide an estimate for the minimum number of bits necessary to serve as a vertex reference in the genome graph.

The representation of an edge can vary by application, but it is often reduced to a simple integer value that maps to a specific vertex. In the case of the genome graph, the integer is intended to map to a mutation. To obtain an integer for each mutation one may simply enumerate the set of mutations. That is, if there are M mutations in the set, then each mutation can be trivially assigned an index value in the range [0, M), and that index value can be used as a reference. It follows that B = ⌈log₂(M)⌉ bits are sufficient to represent each index of the set. In practice, it is often necessary to overestimate the set by adding padding bits. This helps to avoid potential issues with overflowing an integer when more mutations exist than were expected. In addition, padding bits may be added to enable better memory alignment. A slightly more accurate representation for the number of bits per edge would be B_p = ⌈log₂(M)⌉ + p, where p is a non-negative integer. With the general necessity of padding bits, and the subjectivity in how many to add, one might wonder whether a more compact representation is possible.

3.1.3 Adjacency Matrix Representation

Another method of representing a graph is as an adjacency matrix. In this representation of a graph, the vertices are assigned a row and column of the matrix. The value in each cell is used to indicate an edge between the corresponding vertices.

This work uses a binary adjacency matrix to represent the genome graph as described in [21].

Each row of the matrix represents a haploid genome of the population. Each column of the matrix represents a mutation. Because the edges of the genome graph are simple, the value of each cell is either a 0 or a 1. This enables a row to be reduced to a bit vector with each bit positionally mapped to a column of the matrix.
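To make this layout concrete, the sketch below shows one plausible C++ rendering of such a packed matrix (the class and member names are hypothetical, not the interface of Clotho [21]): each haploid genome is a row of 64-bit blocks, and a mutation's column index selects a block and a bit within it.

```cpp
#include <cstdint>
#include <vector>

// One plausible layout for a binary adjacency matrix: rows are haploid
// genomes, columns are mutations, one bit per (genome, mutation) edge.
class GenomeGraph {
public:
    GenomeGraph(std::size_t rows, std::size_t cols)
        : blocks_per_row_((cols + 63) / 64), bits_(rows * blocks_per_row_, 0) {}

    void add_edge(std::size_t genome, std::size_t mutation) {
        bits_[genome * blocks_per_row_ + mutation / 64] |= 1ULL << (mutation % 64);
    }
    void remove_edge(std::size_t genome, std::size_t mutation) {
        bits_[genome * blocks_per_row_ + mutation / 64] &= ~(1ULL << (mutation % 64));
    }
    bool has_edge(std::size_t genome, std::size_t mutation) const {
        return ((bits_[genome * blocks_per_row_ + mutation / 64]
                 >> (mutation % 64)) & 1ULL) != 0;
    }

private:
    std::size_t blocks_per_row_;       // 64-bit blocks per haploid genome
    std::vector<std::uint64_t> bits_;  // row-major packed matrix
};
```

Setting, clearing, and testing an edge each reduce to a single shift-and-mask on one 64-bit block.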

The use of bit vectors to represent genetic sequences is not uncommon. Indeed, simuPOP [6] can be configured to use bit vectors to simulate bi-allelic loci. The intent is to enable the simulation of alleles for which there are only two possible states: a reference state and a wild state. By reducing each locus to a single bit, one enables larger populations with a few loci, or smaller populations with many loci, to be simulated. The reduction to a bit vector representation has primarily been limited to use cases where the set of mutations or loci is known.

It is generally uncommon to use a binary adjacency matrix graph representation in a population genetic simulation when the set of mutations is dynamic. This is the case for two reasons. The first reason is related to the expected density of the genome graph. The second reason results from having to actively control the growth of the matrix. The following sections expand upon these reasons.

Graph Density

Intuitively, both the adjacency matrix and the adjacency list intend to represent the same graph.

The choice of when to use one or the other generally depends upon the density of the graph. The density of a simple, directed graph is given by Equation (3.2). Graph density is the relationship between the number of edges, |E|, of the graph and the total number of possible edges. For a vertex space, V , where a vertex cannot be adjacent to itself, or there are no self-looping edges, then the total number of possible edges is |V |(|V | − 1). A graph with a density that is close to zero is referred to as being sparse. Conversely, a dense graph is one where D is close to one.

D = |E| / (|V|(|V| − 1))            (3.2)

In the case of the population genome graph, the vertex space is divided into two sets: a set for the individual genomes, and a set for the mutations of the population. However, there are no edges between individuals within a generation, nor are there edges between mutations. Thus, the total number of possible edges for the population is the product of the sizes of the two sets, 2NM, where M is the size of the mutation set. Substituting these values into the previous equation produces Equation (3.3).

D = |E| / (2NM)            (3.3a)

|E| · B′ ≤ 2NM             (3.3b)

2NMD · B′ ≤ 2NM            (3.3c)

D ≤ 1 / B′                 (3.3d)

Consider the total size of the graph representations relative to one another. As shown by Equation (3.3), the adjacency list graph will offer a more compact representation when there is less than 1 edge for every B′ possible edges. Using values from the earlier adjacency list example, with E[S] = 3,916 = M, a population where the average genome has 326 mutations is considered dense, as there is 1 edge for every 12 possible.
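The decision rule in Equation (3.3) is simple to evaluate mechanically. The sketch below plugs in the numbers quoted above, together with an assumed B′ of 32 bits per adjacency list reference (the value of B′ is an assumption for illustration):

```cpp
#include <cstdio>

int main() {
    const double M         = 3916.0;  // mutation columns (E[S] example)
    const double avg_edges = 326.0;   // average mutations per haploid genome
    const double B_prime   = 32.0;    // assumed bits per adjacency list reference

    // Density D = |E| / (2NM); with |E| = 2N * avg_edges the 2N factor cancels.
    const double D = avg_edges / M;
    std::printf("D = %.4f (1 edge per %.0f possible)\n", D, 1.0 / D);
    std::printf("adjacency list smaller? %s (threshold D <= %.4f)\n",
                D <= 1.0 / B_prime ? "yes" : "no", 1.0 / B_prime);
    return 0;  // here D ~ 1/12 > 1/32, so the bit matrix is more compact
}
```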

Controlling Genome Graph Growth

As previously mentioned, there are several forces that result in changes to the set of mutations present in a population. A mutation model may add novel mutations to a population, or it may add or remove associations between individual genomes and existing mutations. Selection may act to maintain the set of advantageous mutations while removing those that are not, as individuals with non-advantageous mutations may not procreate. Recombination may either act to preserve a mutation, or it may result in the loss of a mutation. Over time these processes can lead to some mutations becoming lost from the population, while others become fixed. A lost mutation is one that is no longer present in any of the haploid genomes of the population, whereas a fixed mutation is one that is present in every haploid genome.

The dynamic nature of the genome graph presents a general challenge in population genetic simulation. If left unchecked, the set of mutations may continually grow. This will result in a sparse graph, which would therefore justify the use of an adjacency list based genome graph. However, the lost and fixed mutations present an opportunity for maintaining a denser representation of the genome graph that would enable the use of an adjacency matrix. The observation being that a fixed or lost mutation can be removed from the genome graph.

Indeed, a simulation scenario may be interested in studying the sets of fixed or lost mutations of a population over time. Thus, it may be necessary to retain information pertaining to these mutation sets. However, it is generally not necessary to retain this information in the genome graph. Instead, information retention can be achieved through the use of secondary sets.

In general, as a mutation becomes fixed or lost within a population, it can be removed from the genome graph and added to an implicit global set of mutations that reflects its state within the population. This retains the information about the relationship between a mutation and its presence or absence within the population. In addition, the mutation's vertex can be removed from the genome graph and, in the case of a fixed mutation, any explicit edges can also be removed. This reduces the size of the genome graph.

The ability to remove a vertex from the graph suggests that the physical space that was allocated for the vertex and any edges can be reclaimed by the system. In practice, however, reclaiming this space is generally a computationally intensive task when an adjacency matrix is used to represent the graph. Removing a mutation vertex from the genome graph is equivalent to removing a column from the adjacency matrix. Reclaiming the physical space allocated to this column would effectively re-order the column space of the matrix and require propagating the changes through the matrix. This is further complicated when columns are reduced to single bits in the compact bit vectors that form the rows of the matrix, as is done in this work. These reasons often prevent a binary adjacency matrix from being used with a dynamic graph.

To avoid re-ordering columns in a binary matrix, this work does not remove a vertex from the graph. Instead, a vertex that is associated with a fixed or lost mutation is considered to be a free vertex, meaning that it is no longer actively used in the genome graph. It is therefore free to be reused to represent a new mutation in subsequent generations of the graph. In effect, each new mutation is assigned to a free vertex of the graph, or column of the matrix.

The assignment of a mutation to a vertex is independent of the genetic position of the mutation.

This results in the column space of the matrix being an unordered set of mutations. The unordered sequence representation impacts the implementation of the genetic models.

3.2 Genetic Models

In the previous chapter, several of the natural processes that drive population change were described. This section aims to provide algorithmic models of these processes. It describes how the use of an adjacency matrix as the genetic representation influences the implementation of each algorithm.

3.2.1 Mutation Model

The first natural process to describe is that of mutation. From a graph perspective, the mutation process may result in changes to both the vertex and edge space of the genome graph. That is, if the mutation is novel, then a new vertex will be added to the space as well as a new edge between a genome and the mutation. Conversely, if the mutation is not novel, then only a new edge is added to the graph.

Algorithm 2 provides the base method used in this work. Each mutation that is generated is assigned the next available free vertex of the mutation space, A. Subsequently, a random genome is identified from the population. Finally, a conflict resolution step is performed.

Algorithm 2 Mutation Algorithm
Require: G the current population's genome graph
Require: M the number of mutations to be added to the population
for m = 0 to M do                     ▷ for each expected mutation
    A[F[m]] ← θ()                     ▷ generate a mutation and assign it a vertex
    i ← RNG(0, 2N)                    ▷ generate a random index in [0, 2N−1]
    ResolveConflicts(G, A[F[m]], i)   ▷ set bit F[m] of genome i
end for

The definition of a conflict may vary by simulation. In cases where all new mutations are viewed as being novel, as in the infinite site model, there are no conflicts to resolve. Therefore, an edge is simply added between the new mutation and the genome. However, in other cases a newly generated mutation may occur at a genetic position that may already have been mutated elsewhere in the population. Although multiple mutations may be allowed to exist for a given position, a haploid genome can only exhibit one of the mutations. This would result in a logical conflict. Thus, it becomes necessary to remove any existing mutations for a given genetic position and haploid genome that would result in a logical conflict. This is generally referred to as conflict resolution.

An advantage of the adjacency matrix is that adding and removing mutations are simple updates to existing bit positions. That is, at the sequence level the operations for adding and removing edges are reduced to flipping the state of specific bit positions. Both operations can be done in constant time for known positions.
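A sketch of how the mutation step might look over the packed-row layout shown earlier; the names are again hypothetical, and conflict resolution is reduced to a comment since its definition varies by simulation.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical mutation step over a row-major packed bit matrix: each new
// mutation takes the next free column and is assigned to a random haploid genome.
void mutate(std::vector<std::uint64_t>& bits, std::size_t blocks_per_row,
            std::vector<std::size_t>& free_columns, std::size_t num_mutations,
            std::size_t haploid_genomes, std::mt19937_64& rng) {
    std::uniform_int_distribution<std::size_t> pick_genome(0, haploid_genomes - 1);
    for (std::size_t m = 0; m < num_mutations && !free_columns.empty(); ++m) {
        std::size_t col = free_columns.back();   // next free vertex (matrix column)
        free_columns.pop_back();
        std::size_t row = pick_genome(rng);      // random haploid genome in [0, 2N)
        // Under the infinite site model there are no conflicts to resolve;
        // otherwise, existing mutations at the same genetic position would be
        // cleared here before the new edge is added.
        bits[row * blocks_per_row + col / 64] |= 1ULL << (col % 64);
    }
}
```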

3.2.2 Recombination

The recombination process changes the genetic state of a genome by exchanging genetic material between a pair of chromosomes. An algorithmic abstraction of a recombination process will accept a pair of parental homologous chromosomes as input. It will return a mosaic chromosome with ordered segments from each of the parental chromosomes.

In effect, the process will divide the two input sequences into ordered sets of fragments. The mechanism used to fragment the parental chromosomes may vary. For example, it may be the case that recombination is expected to occur at specific regions with probability greater than that of the rest of the chromosome. More generally, it may be assumed that recombination occurs randomly at a given rate. Algorithmically, either mechanism can be translated to a process that generates a set of chromosomal positions where parental chromosomes are cut to form fragments. The cutting process results in positionally ordered sets of fragments.

A chromosomal crossover model is used to simulate recombination in this work. This model of recombination makes some simplifying assumptions about the structure of homologous chromosomes and the recombination process. Specifically, it assumes that the homologous chromosomes are positionally aligned with one another. It also assumes that recombination occurs at the same randomly generated position on both chromosomes as a result of their alignment. Finally, it assumes that fragments are selected in an alternating pattern from each parental chromosome's set of fragments. That is, if the maternal chromosome is chosen to provide the first fragment, then the paternal chromosome will provide the second. In effect, whichever parental chromosome is selected to provide the first fragment will subsequently provide all of the odd-positioned fragments. Similarly, the other parental chromosome will provide the even positions.

The positional relationship of the order of events enables a crossover process to be described using a counting method. Given the set of positions where crossover occurs, one is able to identify which parental chromosome is to provide the state for each mutation of a genome by counting the number of crossover events that precede a given mutation. In an abstract sense, this counting method is one approach that can be used to classify a mutation as being from either the parent's maternal chromosome or its paternal chromosome.

Algorithm 3 provides the pseudo-code for the recombination model used in this work. It assumes a genetically unordered sequence representation that results from mutations being assigned to free vertices without consideration of their genetic position. It relies on a bit masking technique to select the state of individual bits from each of the input vectors, and combines them together to form the output vector. The bit mask is constructed from the results of a binary classification method that decides which input vector should provide the state of the current bit, based upon the mutation associated with the current bit and the set of recombination events. In this work, the classify step counts the number of recombination events that precede a given mutation. When the count is odd, the state of the second input sequence is used.

Algorithm 3 Unordered Adjacency Recombination Algorithm
Require: P0 and P1 binary adjacency vectors
Require: R recombination events
Require: M vector of mutation locations
O ← []                                        ▷ initialize offspring adjacency vector
B ← (bit block size)
j ← 0
for all bit blocks a ∈ P0, b ∈ P1 do
    hets ← (a ⊕ b)
    hetMask ← 0
    for all set bit positions i ∈ hets do
        hetMask[i] ← classify(M[i + j], R)    ▷ keep state from P1?
    end for
    o ← ((a ∧ ¬hetMask) ∨ (b ∧ hetMask))
    append o to O
    j ← j + B
end for
return O

This algorithm does not make use of the genetic order of mutations, but it does make use of whether a mutation is present in both input sequences or not, that is, whether a mutation is homozygous, or present in both sequences of the parental genome. When a mutation is homozygous in a parental genome, the offspring is expected to inherit the same state for the mutation regardless of how recombination occurs. Therefore, it suffices to classify only those mutations where the two input sequences differ, or are heterozygous. This is a general optimization that is enabled by the positional alignment of the adjacency matrix.
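Under the same packed-row assumptions as the earlier sketches, the core of Algorithm 3 might be rendered in C++ as follows (C++20 for std::countr_zero; classify is the odd/even crossover count described above, and all names are illustrative):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Odd number of crossover events preceding this position -> take the state from P1.
bool classify(double position, const std::vector<double>& crossovers) {
    std::size_t count = 0;
    for (double x : crossovers)
        if (x < position) ++count;            // crossover precedes the mutation
    return count % 2 == 1;
}

// Unordered adjacency recombination (Algorithm 3) over 64-bit blocks.
// positions[c] is the genetic location of the mutation assigned to column c.
std::vector<std::uint64_t> recombine(const std::vector<std::uint64_t>& p0,
                                     const std::vector<std::uint64_t>& p1,
                                     const std::vector<double>& crossovers,
                                     const std::vector<double>& positions) {
    std::vector<std::uint64_t> child(p0.size());
    for (std::size_t blk = 0; blk < p0.size(); ++blk) {
        std::uint64_t hets = p0[blk] ^ p1[blk];   // only heterozygous bits differ
        std::uint64_t mask = 0;                   // bits to take from p1
        for (std::uint64_t rest = hets; rest != 0; rest &= rest - 1) {
            unsigned bit = static_cast<unsigned>(std::countr_zero(rest));
            if (classify(positions[blk * 64 + bit], crossovers))
                mask |= std::uint64_t{1} << bit;
        }
        child[blk] = (p0[blk] & ~mask) | (p1[blk] & mask);  // homozygous bits pass through
    }
    return child;
}
```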

3.2.3 Phenotype Evaluation

Phenotype evaluation is a stage of the life cycle model that is intended to evaluate a measurement for each individual's traits. As described in Equation (2.3), the phenotype of an individual may be the result of their genetic makeup and environmental forces, although how these forces work with or against one another is not well understood in many cases. For the purposes of this work, the environmental force is assumed to be constant.

To model the genetic component of the phenotype, this work assumes that each mutation carries with it some effect on the set of traits. The effect size of a mutation is how much a given mutation shifts a specific trait from its normal state. In this work, effect sizes are random values selected from a normal distribution, N(0, 1). The specific measurement for a given individual's trait is evaluated as a linear summation of the effect sizes for each of the mutations in their genome.

In effect, phenotype evaluation is a two-step process. As each mutation will have an effect size for each simulated trait, an effect size matrix can be defined. The use of the adjacency matrix allows the first step to be defined as a matrix multiplication operation, which produces a relative phenotype for each haploid genome in the population. A second step reduces the two haploid genome phenotypes of each individual into a single phenotype per individual.
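Viewing the first step as a product of the 2N × M bit matrix with an M × P effect size matrix, a direct, unoptimized C++ rendering might look as follows (illustrative names; the packed layout matches the earlier sketches):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// First step of phenotype evaluation: genome bit matrix (rows x M columns,
// packed in 64-bit blocks) times an effect size matrix (M x traits).
// Returns one relative phenotype vector per haploid genome.
std::vector<std::vector<double>> haploid_phenotypes(
        const std::vector<std::uint64_t>& bits, std::size_t blocks_per_row,
        std::size_t rows, const std::vector<std::vector<double>>& effects) {
    const std::size_t traits = effects.empty() ? 0 : effects.front().size();
    std::vector<std::vector<double>> pheno(rows, std::vector<double>(traits, 0.0));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t blk = 0; blk < blocks_per_row; ++blk)
            for (std::uint64_t rest = bits[r * blocks_per_row + blk];
                 rest != 0; rest &= rest - 1) {
                std::size_t m = blk * 64 + std::countr_zero(rest);  // mutation column
                for (std::size_t t = 0; t < traits; ++t)
                    pheno[r][t] += effects[m][t];   // linear sum of effect sizes
            }
    return pheno;
}
```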

3.2.4 Fitness

The fitness of an individual is a general measurement for the probability an individual will reproduce to the next generation. There are many methods by which this probability can be assessed. In this work, it suffices to assume that there exists a method by which an individual’s phenotype can be translated to a probability.

3.2.5 Selection

The selection process identifies those individuals that will reproduce to the next generation. Like fitness, there are many methods by which selection may be defined. This work will assume that individuals are randomly paired with one another according to their fitness. In effect, individuals are selected from a multinomial distribution where fitness provides the probability of being selected.
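Fitness-proportional sampling of this kind maps directly onto C++'s std::discrete_distribution. A minimal sketch, assuming a precomputed fitness value per individual:

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Draw mating pairs with probability proportional to fitness, i.e. a
// multinomial selection scheme over the parent population.
std::vector<std::pair<std::size_t, std::size_t>>
select_pairs(const std::vector<double>& fitness, std::size_t pairs,
             std::mt19937_64& rng) {
    std::discrete_distribution<std::size_t> by_fitness(fitness.begin(), fitness.end());
    std::vector<std::pair<std::size_t, std::size_t>> mates;
    mates.reserve(pairs);
    for (std::size_t i = 0; i < pairs; ++i)
        mates.emplace_back(by_fitness(rng), by_fitness(rng));  // two parents per child
    return mates;
}
```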

3.3 Results

The initial stage of this work was to conduct a performance comparison study to evaluate the relative impact of using the binary adjacency matrix representation. As such, a simulator was developed using the graph-based data structures and genetic models described previously. This simulator was compared with simulators developed using the common simulation frameworks simuPOP [6] and FWDPP [7]. To compare these simulators, two evolutionary scenarios were considered: a neutral scenario and a selected scenario. Each of these scenarios was evaluated at several evolutionary scales. The life cycle runtime of each simulator for each scenario was measured. This section describes the evolutionary scenarios in more detail and provides the comparison results.

22 3.3.1 Evolutionary Scenarios

The goal for each simulator was to evolve a population of N diploid individuals over G generations. Offspring were generated by randomly selecting individuals from the parent population and pairing them together. The reproduction process introduced random recombination events with a recombination rate ρ. Mutations were randomly introduced following an infinite site mutation model at a rate of µ. That is, mutations were only allowed to occur at genetic locations which were unmutated in the population. In effect, each mutation resulted in a new segregation site. The configuration of these parameters is given in Table 3.1.

Parameter                  Medium    Large
Population Size (N)        10^4      10^4
Generations (G)            10^5      10^5
Mutation Rate (µ)          0.01      0.1
Recombination Rate (ρ)     0.01      0.1

Table 3.1: Simulation scales

The first evolutionary scenario considered a neutral mutation model. This mutation model argues that mutations do not change an organism’s fitness. Although this model is debated, the efforts of [11, 13] suggest that many mutations are in fact neutral in nature and may become beneficial over time.

From a simulation perspective, the neutral scenario allows the phenotype evaluation and fitness steps in each generation to be optimized out at compile time. This significantly reduces the computation time required to perform a simulation. The primary intent of this scenario is to establish a lower bound for the runtime of a forward-time population genetic simulation.

Furthermore, as described in [14], this model behaves in an expected manner, with the expected number of mutations given by Equation (3.1). This enables a level of validation to be performed, as other simulators, such as MS [22], also consider this scenario.

The second evolutionary scenario considers that a mutation may indeed influence an individual's fitness. That is, mutations are involved in the selection process. Thus, the phenotype evaluation and fitness steps of the life cycle model are required. For the purposes of the initial performance comparison, however, fitness based selection was not performed. As a result, the phenotype evaluation and fitness steps were performed, but not used to influence the next generation. Furthermore, the phenotype evaluation step counted the number of mutations present in each haploid genome.

Each simulator used a counting algorithm best suited for its genome graph representation. As a result, this scenario allows for a relative upper bound to be established for the different simulators.

3.3.2 Simulators and Workstation Configuration

The four simulators considered in this comparison are listed in Table 3.3. All simulations were performed on a single workstation; its hardware configuration is listed in Table 3.2.

    CPU                Intel Xeon 3.5 GHz
    CPU Cores          6 (Hyperthreading enabled)
    Memory             32 GB
    Operating System   Fedora Linux 20 (64-bit)

Table 3.2: Hardware configuration - Workstation #1

3.3.3 Memory Utilization Comparison

Data structures available in simuPOP [6], FWDPP [7], and Clotho [21] were used to compare the relative memory requirements of the three different genome graph representations. Table 3.3 provides a listing of the different models, the source implementation considered, as well as a theoretical memory footprint based upon each implementation.

Both simuPOP [6] and FWDPP [7] provide an adjacency list based genome graph. However, their implementations differ in the handling of fixed mutations: FWDPP [7] removes them from the graph, while simuPOP [6] allows them to remain. As a result, simuPOP [6] incurs a growing overhead, Ft, as more mutations become fixed over time. Each of these data structures was experimentally tested using simulators developed from each of the simulation frameworks. Each simulator was given the task of evolving a population of N = 10^4 individuals over T = 10^5 generations. A neutral mutation model [11] was used to evolve the populations, and two mutation rates were considered: µ = 0.01 and µ = 0.1. The memory footprint of the final generation's genome graphs was captured.

    Simulator   Genome Graph       Framework      Memory Estimation
    1           Fixed-length       simuPOP [6]    N L b
    2           Adjacency List     FWDPP [7]      |E| Bp + M O
    3           Adjacency List     simuPOP [6]    |E| Bp + N Ft + M O
    4           Adjacency Matrix   Clotho [21]    M (N + O)

    Terms: N - population size; L - genome size; b - bits per element; Bp - bits per reference element; Ft - fixed mutations in generation t, where Ft = Ft-1 + f(t) and F0 = 0; M >= E[S]; O - overhead bits per mutation.

Table 3.3: Breakdown of memory footprint of genome graph representations

The results of the experiments are shown in Table 3.4. As expected, the genome sequence representation (#1) required the most memory. Testing of the µ = 0.1 configuration was skipped for this representation, as it was expected to require upwards of 23 GB of memory, far exceeding the other representations. Of the two adjacency list approaches, FWDPP's [7] offered the smaller memory footprint. Although the additional overhead incurred by simuPOP [6] was expected to increase its utilization, the size of the difference between the two is somewhat surprising.

    Population Size (N)   Mutation Rate (µ)        1        2        3        4
    10,000                0.01                2.3 GB   7.2 MB   185 MB   2.4 MB
    10,000                0.1                      -   350 MB        -    75 MB

Table 3.4: Memory comparison of genome graph representations

Upon further investigation, it was determined that, at the time, the reference value simuPOP [6] stores in its adjacency lists was slightly larger than that of FWDPP [7]. In effect, simuPOP [6] encodes multiple values into a single reference to enable a more efficient lookup in some steps of a simulation. In addition, FWDPP [7] made use of the observation that in small scale simulation scenarios there is a potential for genetic sequences to be duplicated. Rather than allow duplicates to exist, it coalesces the duplicate sequences into a single vertex. Conversely, simuPOP [6] performs no such optimization.

The binary adjacency matrix representation offers the smallest memory footprint of the three representations. This is in part due to the implementation provided in Clotho [21] taking advantage of observations like those made in FWDPP [7]. It also made the observation that the bit vectors could be truncated at the last set bit position, simply inferring that every bit position beyond the end of the vector is unset.

In subsequent versions of the Clotho [21] implementation, these optimizations were removed.

These optimizations are primarily advantageous in small scale simulations. As evolutionary scenarios increase in scale, sequences become increasingly unique, to the point that every sequence in the population is expected to be unique. Removing the optimizations does result in a greater memory footprint; however, for a large scale simulation the memory footprint is expected to be 115 MB. This is still better than a 3x reduction when compared with an adjacency list representation.
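As a back-of-the-envelope check of these figures (assuming Equation (3.1) is the standard infinite site expectation E[S] = θ · Σ_{i=1}^{2N-1} 1/i with θ = 4Nµ, and taking the matrix rows to be the 2 · 10^4 haploid sequences of a diploid population of 10^4): at the large scale, θ = 4 · 10^4 · 0.1 = 4 · 10^3, the harmonic sum is roughly 10.5, and so M ≈ E[S] ≈ 4.2 · 10^4 segregating sites. The matrix then occupies roughly 4.2 · 10^4 · 2 · 10^4 bits ≈ 105 MB before per-mutation overhead, in line with the 115 MB figure, and 350 MB / 115 MB ≈ 3.0x relative to the FWDPP [7] measurement in Table 3.4.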

3.3.4 Medium Scale Simulation

Figure 3.2 shows the runtimes for simulators #2 (FWDPP [7]) and #4 (Clotho [21]) for the two evolutionary scenarios. In both scenarios, the simulator based upon the binary adjacency matrix does outperform the adjacency list based simulator. However, there seems to be a greater amount of relative overhead incurred by #4 when the phenotype evaluation step is performed. That is, the Clotho [21] simulator has a nearly 800% increase in runtime, whereas FWDPP [7] only incurs about a 190% increase.

Figure 3.2: Small Scale Simulation Comparison

Simulators #1 and #3 did not perform well in the neutral scenario. As shown in Figure 3.3, both were more than two orders of magnitude slower than simulators #2 and #4. Both of these simulators were developed using the simuPOP [6] framework. Given these performance results, the earlier memory measurements (Table 3.4), and the expectation that the remaining evolutionary scenarios and scales would further increase the runtime and memory requirements, it was decided not to pursue additional testing with either of these simulators.

Figure 3.3: Medium Scale Simulation Comparison

3.3.5 Large Scale Simulation

An order of magnitude increase in the mutation rate results in an even greater performance gap between simulators #2 and #4. Figure 3.4 provides the performance results for the large scale evolutionary scenarios. In the neutral scenario, the Clotho [21] simulator offers a 46.5x speedup over the FWDPP [7] simulator. Adding in the additional phenotype evaluation and fitness steps shrinks the performance gap, with the Clotho [21] simulator offering a 6.1x speedup.

The scale of the runtime increase experienced by FWDPP [7] is not entirely attributable to the adjacency list representation; rather, it is more likely linked to the specific implementation. At the time, a generic, pointer-based linked list data structure was used in FWDPP [7] as the base list structure. In addition, after each generation a memory cleanup stage was performed to remove lost mutations and unused adjacency lists. These likely combined to produce a significant amount of overhead in each generation.

Figure 3.4: Large Scale Simulation Comparison

Chapter 4

Task Parallelism in Forward-Time Population Genetic Simulation

Modern computing environments offer various levels of parallel processing capabilities. A typical workstation is built around a central processing unit (CPU) that has between 4 and 16 physical cores, if not more. This type of hardware opens the door to substantially improving the relative performance of a simulator. The challenge, though, is identifying how to effectively leverage this capability.

As previously mentioned, one of the primary use cases for simulators is to generate large volumes of in silico data sets. This may involve running a single simulator with multiple configurations.

In situations like this, it is trivial to concurrently execute as many instances of the simulator as resources will allow. This is an example of a Single Program, Multiple Data (SPMD) style of parallelism. Indeed, this model of parallelism does offer several benefits. Perhaps the most obvious is that it is simple to achieve, as it requires no additional simulator development. In addition, it does improve the total runtime needed to evaluate the multiple configurations by taking better advantage of available resources, although what constitutes better utilization of resources is a bit vague.

The basic assumption of this style is that there is a trivial one-to-one mapping between application instance and hardware resources. For instance, if there are 64 simulation scenarios to be run, then 64 CPU cores with sufficient memory will enable all simulations to run concurrently, and they will complete at roughly the same time.

Results from previous chapters indicate that increases in the evolutionary scenario can significantly increase the memory footprint and runtime of a simulation. Using Equation (3.1) again, a mutation rate of µ = 10.0 with an effective population size of N = 10^4 suggests an expected E[S] ≈ 4 · 10^6 mutations in the population. A single simulation of this scale, which considers a genome size that is roughly 1/3 of the human genome, would require tens of gigabytes of memory and days of computation time. While acquiring resources to facilitate an SPMD style of execution may still be feasible at this scale, the hardware configuration begins to be imbalanced, with the per-core memory requirement reaching tens of gigabytes. In effect, one has to buy more time on larger systems to achieve a final goal.

To slow the hardware configuration imbalance, an alternate approach would be to utilize a multi-core system to evaluate a single simulation configuration. The observation is that, in nature, the life cycle of a population is inherently parallel. Individuals of a population grow and reproduce, for all intents and purposes, independently of one another and at the same time. This is by definition a parallel process. It stands to reason, then, that a simulation of this natural process should also be inherently parallel.

Leveraging multiple cores or threads for a single execution is intended to reduce the runtime of that instance while maintaining the same system configuration. This enables a smaller, more manageable hardware configuration to be used. Continuing the earlier SPMD example, a 4-core workstation with 64 GB of memory is able to concurrently run 4 evolutionary scenarios that require at most 16 GB of memory each. Conversely, if the 4 cores were used to evaluate the same evolutionary scenario, and doing so reduced the runtime by a factor of 4, then serially executing the 4 scenarios would still complete in the same total time. Because only a single simulator is running at any given time, using the full resources of the system, the system memory requirement is reduced to that of a single application instance.

Certainly, modern systems do offer a 4-core processor with 64 GB of memory, if not more. However, under SPMD each evolutionary scenario is effectively limited to 16 GB. An increase in scale would require a decrease in the number of parallel instances. As a result, one or more of the cores would sit idle while the rest are working, underutilizing the system as a whole. From this perspective, an SPMD approach is not always a scalable solution.

Unfortunately, relatively easy access to computational resources, combined with not having to modify a simulator, makes SPMD an obvious choice for most researchers. Indeed, cloud-computing services provide many options that enable one to obtain the necessary computational resources if they are not otherwise available. Alternatively, researchers may simply accept the runtime on immediately available resources, even though those resources are being underutilized. In effect, modifying existing open source software is simply not an option for most researchers. As a result, there has been relatively little effort to explore the inherently parallel nature of a forward-time population genetic simulation.

This chapter considers the task parallel opportunities available in a forward-time population genetic simulation. It presents the idea of partitioning a population for parallel execution. Two approaches are described for enabling task level parallelism in a simulator. Finally, performance results are given for the two approaches.

4.1 Task Parallelism

From a software perspective, task level parallel solutions organize work such that it can be distributed across, and simultaneously executed by, a set of parallel processing mechanisms. The Single Program, Multiple Data model of parallelism is an example of task level parallelism. In this model, multiple independent processes of the same application execute in parallel to evaluate a set of input data elements. Although the same program is being simultaneously executed, this does not mean that each program is simultaneously executing the same instruction at the hardware level. Instead, each application follows an execution path relative to its input data. From a broader perspective, this makes SPMD a sub-category of Multiple Instruction, Multiple Data (MIMD) parallelism.

In SPMD, each process is viewed as being completely independent of the others. This means that data is self-contained within the application space of a given process. As described earlier, though, this leads to problems when the application domain is large. It is, therefore, desirable to use multiple processes for a single, shared purpose. This typically results in processes needing to communicate data with one another. Message passing techniques are often employed to communicate data between parallel processes. Depending upon the message communication mechanism, communicating data between processes can incur a noticeable overhead.

Another form of MIMD parallelism uses parallel threads instead of processes. Threads are lightweight processes that exist within the same application space. In other words, a multi-threaded application is a single application that distributes units of work, or tasks, across a set of threads that it maintains. Threads use the application's shared memory space as a more efficient communication medium. What, then, are the tasks that a multi-threaded forward-time population genetic simulator would be able to perform in parallel?

To answer this question one has to revisit the life cycle model from Chapter 3. From one perspective, the models themselves could be considered tasks. One might argue that each genetic model can be assigned to a different thread, and the models can use the shared memory to communicate data between one another. In effect, a producer-consumer model can be followed, where a thread produces the result of one genetic model that another thread will eventually consume.

Unfortunately, this is generally not an efficient use of threading in a forward-time population genetic simulation because of the linear dependency that exists between steps. That is, the reproduction step must complete before phenotype evaluation can occur. This ordered relationship would result in threads being sequentially active, with no work performed in parallel despite having used parallel threads.

Instead, a more efficient use of parallelism would be to observe that a typical forward-time population genetic simulator iteratively builds an offspring population. Each offspring individual is taken through the life cycle model independently of the others. This suggests that producing an offspring represents a unit of work that can be performed in parallel. In effect, if a life cycle model is a task that transforms input data to produce an offspring, then the life cycle model can be mapped over a set of independent input data to produce a set of independent offspring. If a set of independent input data exists, then it can be divided, or partitioned, into independent subsets. Furthermore, the life cycle model can be mapped over each partition to produce an independent set of offspring. The sets of offspring can then be reduced to a single set to form the offspring population.

Mapping the application of a life cycle model across partitions of the input configurations, then reducing the results to form a common population, serves as the basis for achieving a parallel implementation of a forward-time population genetic simulator in this work. From a general flow perspective, each generation begins with several pre-processing steps that prepare the inputs necessary for a life cycle model for each offspring individual. The inputs are then partitioned, and the life cycle model is applied to each partition. The resulting groups of offspring individuals are then reduced to form the offspring population. The next several sections expand upon the idea of partitioning the offspring population and how to map the steps of a life cycle model.

4.1.1 Partitioning

Conceptually, partitioning is straightforward. Given a collection of elements, group the elements to form sub-collections that satisfy a set of criteria. The grouping criteria can be as general as having groups of a certain size, or as specific as organizing elements based upon a common attribute.

As previously mentioned, producing a single offspring from the necessary inputs represents the base unit of work in a forward-time population genetic simulation. Each offspring is produced independently of the others, suggesting that offspring can be generated in parallel. Therefore, if the goal is to produce an offspring population, then partitioning the expected offspring population enables it to be produced in parallel.

In this work, partitioning generally refers to selecting contiguous sub-matrices of the binary adjacency matrix used to define the genome graph. It may also include slicing lists into sub-lists of varying sizes. Both of these operations reduce to defining a list of non-overlapping index ranges relative to the source data structure. Furthermore, as the original data structures are not modified by partitioning, the reduction step becomes as trivial as deleting the index ranges.
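A sketch of this range-based partitioning follows (hypothetical helper name): the index space [0, n) is divided into at most p contiguous, non-overlapping ranges whose sizes differ by at most one.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Split [0, n) into at most p contiguous, non-overlapping [begin, end)
    // ranges; the first n % p ranges carry one extra element.
    std::vector<std::pair<std::size_t, std::size_t>>
    partition_range(std::size_t n, std::size_t p) {
        std::vector<std::pair<std::size_t, std::size_t>> ranges;
        if (p == 0) return ranges;
        for (std::size_t i = 0, begin = 0; i < p && begin < n; ++i) {
            const std::size_t len = n / p + (i < n % p ? 1 : 0);
            ranges.emplace_back(begin, begin + len);
            begin += len;
        }
        return ranges;
    }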

There is a question as to how best to utilize partitioning in practice. Since the offspring population is of known size after the selection step, is it better to apply a static partitioning scheme for the application of the entire life cycle? Or is it better to dynamically partition during each step of the life cycle model?

4.1.2 Batch Model

This model of execution iteratively executes each step of the life cycle model. It assumes that each step will dynamically partition its input relative to the task to be performed. Once partitioned, the task is mapped across each of the partitions, and the outputs are reduced. In effect, a batch of tasks is executed in parallel, with synchronization occurring between each step.

Figure 4.1 provides a high-level breakdown of a 2-step execution.

Figure 4.1: Batch Model Concept

Dynamically partitioning at each step inherently assumes that the partitioning step is efficiently achieved. Certainly, dividing an index range into a list of non-overlapping index ranges is a lightweight step. However, repeatedly constructing these lists does add overhead. Similarly, performing a barrier synchronization between each step also introduces an overhead cost. Therefore, from a performance perspective, reducing the frequency of these steps would be advantageous. One way to achieve this is through the use of a pipeline model of execution.

4.1.3 Pipeline Model

This parallel execution model assumes that a single partitioning step is performed. Subsequently, the life cycle model is applied to each partition as one contiguous flow of steps. This avoids the frequent repartitioning and synchronization steps that were performed in the batch model. Figure 4.2 provides a high-level diagram of the concept.

Figure 4.2: Pipeline Model Concept

In some respects, this is the simpler of the two models in that it abstracts away the concept of having to define a partitioning scheme per model. This simplifies the model development process for researchers who develop their own models.
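The operational difference between the two models can be sketched with standard threads (hypothetical step functions and helper names; a simplification of the actual simulator): the batch model joins its thread pool after every step, while the pipeline model partitions once and joins once.

    #include <cstddef>
    #include <thread>
    #include <utility>
    #include <vector>

    using Range = std::pair<std::size_t, std::size_t>;
    using Step  = void (*)(Range);  // one life cycle step applied to an index range

    // Even split of [0, n) into at most p contiguous ranges (see Section 4.1.1).
    static std::vector<Range> split(std::size_t n, std::size_t p) {
        std::vector<Range> r;
        for (std::size_t i = 0, b = 0; i < p && b < n; ++i) {
            const std::size_t len = n / p + (i < n % p ? 1 : 0);
            r.emplace_back(b, b + len);
            b += len;
        }
        return r;
    }

    // Batch model: re-partition, map, and join (a barrier) at every step.
    void run_batch(const std::vector<Step>& steps, std::size_t n, std::size_t p) {
        for (Step step : steps) {
            std::vector<std::thread> pool;
            for (Range r : split(n, p)) pool.emplace_back(step, r);
            for (auto& t : pool) t.join();  // synchronization between steps
        }
    }

    // Pipeline model: partition once; each thread runs every step over its range.
    void run_pipeline(const std::vector<Step>& steps, std::size_t n, std::size_t p) {
        std::vector<std::thread> pool;
        for (Range r : split(n, p))
            pool.emplace_back([&steps, r] { for (Step s : steps) s(r); });
        for (auto& t : pool) t.join();      // single join per life cycle
    }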

4.2 Results

As part of this work, a multi-threaded forward-time population genetic simulator was developed.

The simulator used in these tests is called QTLSimMT and was developed using the Clotho [21] framework. As the name suggests, it is a multi-threaded simulator aimed at studying quantitative trait loci (QTL) under selection. The simulator was put through a series of performance-based tests to evaluate the relative advantage of using a multi-threaded application, as well as how the batch and pipeline models perform relative to one another.

4.2.1 Simulator and Hardware Configuration

There are two primary differences between QTLSimMT and the simulator used in the previous experiments. First, the phenotype evaluation and fitness steps are implemented as described in Chapter 3. Second, the phenotype evaluation and fitness steps are no longer optimized out via a compilation mode. Instead, the neutrality of a mutation is linked to its effect size. An effect size of E[m, t] = 0 means that mutation m is neutral for trait t. The randomly generated effect sizes in the population are controlled by a Bernoulli random variable. As a result, E[m, t] is given by Equation (4.1). The Bernoulli process is configurable; setting p = 0 results in all effect sizes being zero.

X ~ Bernoulli(p), with Pr(X = 1) = p    (4.1a)

E[m, t] = X · N(µ, σ^2)    (4.1b)

A check of the effect size matrix is performed prior to each phenotype evaluation stage to determine whether any of the mutations are non-neutral. That is, if any of the effect sizes is not equal to zero, then the phenotype evaluation matrix multiplication is performed. Otherwise, all phenotypes are trivially set to the same value. This optimization enables the performance difference between the neutral and selected scenarios to be observed.
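A host-side sketch of this effect size scheme follows (hypothetical names): each entry is a Bernoulli draw gating a normal draw, per Equation (4.1), and a simple scan decides whether the full phenotype evaluation is required.

    #include <cstddef>
    #include <random>
    #include <vector>

    // Effect sizes per Equation (4.1): E[m, t] = X * N(mu, sigma^2), X ~ Bernoulli(p).
    std::vector<double> sample_effect_sizes(std::size_t mutations, double p,
                                            double mu, double sigma,
                                            std::mt19937& rng) {
        std::bernoulli_distribution selected(p);
        std::normal_distribution<double> effect(mu, sigma);
        std::vector<double> e(mutations, 0.0);
        for (double& v : e) {
            if (selected(rng)) v = effect(rng);  // p = 0 leaves every effect at zero
        }
        return e;
    }

    // If every effect size is zero, the phenotype matrix multiplication is skipped.
    bool any_non_neutral(const std::vector<double>& e) {
        for (double v : e) if (v != 0.0) return true;
        return false;
    }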

Finally, all simulations were performed on a new workstation. Its configuration is defined in Table 4.1.

    CPU                Intel i7-3930K
    Physical Cores     6 (Hyperthreading enabled)
    RAM                32 GB
    Operating System   CentOS 7.0 (64-bit)

Table 4.1: Hardware Configuration - Workstation #2

4.2.2 Evolutionary Scenarios

The goal of the simulations in this study is the same as that of the previous chapter. The neutral and selected scenarios are again considered to be the baseline evolutionary scenarios. The base configurations for the two scenarios are given in Table 4.2. A smaller genetic scale has been added to the scales used in the previous study. Also, six thread counts (T) are used to evaluate each of the evolutionary scenarios and genetic scales. Table 4.3 provides the scale configurations. Finally, all combinations of these parameter configurations are evaluated using both the batch model and the pipeline model of task parallelism.

                                           Neutral   Selected
    Population Size (N)                    10^4      10^4
    Generations (G)                        10^5      10^5
    Probability of Selected Mutation (p)   0.0       1.0

Table 4.2: Common Population Configuration

                             Small    Medium   Large
    Mutation Rate (µ)        0.001    0.01     0.1
    Recombination Rate (ρ)   0.001    0.01     0.1
    Thread Counts (T)        1, 2, 4, 6, 8, 12

Table 4.3: Evolutionary scales and threading configuration

4.2.3 Small Evolutionary Scale

Figure 4.3 shows the runtimes for the neutral scenario. Although using multiple threads did increase the relative performance of the batch model, the gains were relatively minor. As a result of the evolutionary scale configuration, the potential performance gain from using multiple threads is low; in effect, the amount of work for each thread is small. The costs associated with repartitioning, synchronization, and other overheads incurred by threading make it difficult to significantly reduce the runtime. Conversely, the pipeline model experienced noticeably better performance improvements. This is a result of its reduced threading overheads.

Figure 4.3: QTLSimMT - Small Scale; Neutral

The runtimes for the selected scenario are provided in Figure 4.4. As indicated by the earlier experiments, the phenotype evaluation stage significantly increases runtime. Interestingly, the best performance for both scenarios was achieved with T = 6. This suggests that leveraging hyperthreading did not provide a significant advantage in this situation.

For both evolutionary scenarios, the pipeline model offers better performance. With T = 12 threads, the pipeline model offers better than a 2.1x speedup over the single-threaded execution in the selected scenario and 2.7x in the neutral scenario.

4.2.4 Medium Evolutionary Scale

An order of magnitude increase in the mutation rate from the small scale further increases the performance benefits of a multi-threaded simulation. Even at this scale, the batch mode struggles to show significant runtime improvements relative to its sequential execution (Figure 4.5). Furthermore, its peak performance was achieved at T = 6, with T = 12 performing noticeably worse. The pipeline mode offers a performance profile similar to the one it exhibited in the small scale experiment.

Figure 4.4: QTLSimMT - Small Scale; Selected

Figure 4.5: QTLSimMT - Medium Scale; Neutral

Both execution models exhibit a strikingly similar performance profile when considering the selected scenario at this scale, as shown in Figure 4.6. The batch mode does start out with a slower runtime for the sequential execution. However, as the number of threads is increased, the relative difference in runtimes between the two models decreases from a 19% gap to a 7% gap at T = 12. This equates to less than a 2 minute difference.

In terms of absolute runtime, though, the pipeline model continues to outperform the batch model.

Figure 4.6: QTLSimMT - Medium Scale; Selected

4.2.5 Large Evolutionary Scale

The neutral scenario produced an interesting result at the large scale. A significant runtime spike was encountered when executing sequentially using the pipeline mode. Figure 4.7 shows this event, with the pipeline execution taking 43% longer than the batch. At this scale, the batch model outperforms the pipeline model except when T = 12.

For the selected scenario, the relative runtime profile is very similar to the previous scale. The batch mode is initially 18% slower than the pipeline. This shrinks as the thread count increases.

At T = 12, the gap is only 4%. The pipeline model again outperforms the batch model.

Figure 4.7: QTLSimMT - Large Scale; Neutral

Figure 4.8: QTLSimMT - Large Scale; Selected

Overall, the pipeline model offered consistently better performance than the batch model. When tasked with evaluating selected scenarios, the sequential batch model experiences an 18-23% longer runtime than the sequential pipeline, though this gap shrinks as the number of threads is increased. This is not the case, however, for the neutral scenario.

Chapter 5

Data Parallelism in Forward-Time Population Genetic Simulation

As the name suggests, data parallelism is rooted in distributing vectors of data across a parallel processing environment. Each processing unit of the environment concurrently performs the same operation on its unit of the data space. This style of processing is referred to as Single Instruction, Multiple Data (SIMD).

A Graphics Processing Unit (GPU) is a commonly available device that provides data parallel computation. As the name suggests, its original purpose was to perform the vector processing tasks that traditionally arise in graphics processing. These operations, however, have more general use cases. Since the early 2000s, GPUs have been used in many application domains, spanning from molecular dynamics to computational finance [23].

Although there are several GPU providers available, this work focuses primarily on the use of a CUDA-enabled NVIDIA GPU. It assumes the use of the CUDA API [24], and terminology will be used as it relates to this computing platform and programming interface.

5.1 Graphics Processing Units

A GPU offers thousands of parallel processing cores and excels at performing Single Instruction, Multiple Data (SIMD) operations. Dedicated versions of these processing units exist as separate physical devices within a host workstation. As such, they maintain a memory space separate from that of the host system. The host and device operate in parallel and communicate data with one another via a communication bus. As the host and device offer different types of processing units, systems that offer a GPU are referred to as heterogeneous systems.

Utilization of these devices is often described as a pipeline. First, a portion of a data set is transferred from host memory to device memory. The host then invokes a kernel, or GPU function, on a specific stream of the GPU. A stream is similar to a First In, First Out (FIFO) queue of operations that the GPU is to perform; in other words, when the host invokes a kernel, it places that kernel's execution into an execution queue. The GPU guarantees that all operations within a stream will be executed serially.

Kernel execution is an asynchronous process allowing the host to continue processing while the device operates. The host initiates a synchronization event when it reaches a point in its execution where it expects to retrieve data from the device. Synchronization events can be performed either at the stream or device level.

5.1.1 GPU vs CPU

In recent GPU generations, devices are able to execute tasks from multiple streams concurrently. Thus, a GPU offers a level of MIMD parallelism in addition to being an inherently SIMD device. Similarly, a multi-core CPU offers MIMD parallelism with SIMD capabilities through extended instruction sets. For example, the Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) [25] instructions available on current generation Intel multi-core processors provide SIMD operations. This may suggest that a GPU and a CPU are similar to one another, as both are MIMD devices that offer SIMD modes of execution. However, they are quite different from one another.

There are many technical differences that separate a GPU from a CPU that are beyond the scope of this work. Of interest here are some of the features that differentiate the parallelization techniques and the relative parallel processing scales offered by the two.

For all intents and purposes, the cores of a CPU are independent of one another. This enables the parallel execution of unique tasks. As discussed in the previous chapter, a multi-threaded application may map the application of a single task across available threads. Each thread logically performs the same set of instructions across separate units of data. This would suggest that data parallel execution is being applied. However, because each core is independent, the instructions are not necessarily executed concurrently on all of the cores.

A current generation multi-core CPU has between 2 and 20 separate cores. Hyperthreading technology [26] enables two threads to be scheduled by each core concurrently. Each core is also able to perform SIMD instructions that operate on words of 128, 256, or 512 bits. Conversely, an NVIDIA CUDA-enabled GPU may offer thousands of smaller cores that operate on 32- or 64-bit words. For example, the NVIDIA GeForce GTX 970 used in this work offers 1664 cores. In effect, if a multi-core processor were able to concurrently perform a SIMD operation across all of its threads, then a top-of-the-line current generation processor may operate upon 2560 bytes of data. Conversely, a GPU may operate upon more than 13,312 bytes.

5.1.2 Kernels

The GPU executes functions called kernels. Kernels are invoked with a specific definition of how many threads to use and how those threads are logically arranged. In effect, the developer instructs the GPU on how to execute a given workload. This definition can be expressed in multiple layers of schedulable units of threads. A thread performs a unit of work on a distinct unit of data. Threads are grouped into units called blocks. Finally, blocks are further grouped as units of a grid. Each of these units can be expressed as a 3D space, with the relative scale of each space limited by the physical limits of the GPU.
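For example, a launch configuration covering a matrix of rows × (words_per_row · 32) bits might be expressed as follows (a generic illustration with placeholder names, not the simulator's actual configuration):

    #include <cuda_runtime.h>

    __global__ void touch_matrix() { /* placeholder for a real kernel */ }

    // Arrange threads so each grid row covers one sequence (bit mask vector)
    // and the x-dimension covers the mutations (bits) of that row.
    void launch_over_matrix(unsigned int rows, unsigned int words_per_row) {
        const unsigned int bits = words_per_row * 32;         // mutations per row
        dim3 block(256, 1, 1);                                // 8 warps per block
        dim3 grid((bits + block.x - 1) / block.x, rows, 1);   // blocks_x by rows
        touch_matrix<<<grid, block>>>();
    }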

As a result of the physical design of a CUDA GPU, threads are also grouped into units called warps. Each warp is composed of a set of threads that will concurrently execute the same instruction. In the current generation of CUDA GPUs, each warp is a group of 32 threads.

Although the computation space is defined in multiple layers, this does not mean all threads will execute concurrently. In some respects, the block and grid definitions can be viewed as layers that enable the GPU to plan the future execution of the kernel across the data space. Thus, if the developer has defined the compute space effectively, then the GPU is expected to be able to schedule tasks such that all cores are always kept busy with work.

5.1.3 GPU Challenge

As a result of these vector based use cases, the physical design of the GPU has been optimized for operations where physically aligned cores are assigned units of work upon data that is physically aligned in memory. The preferred memory alignment is to have vector data contiguously aligned, or coalesced, in memory. Strided memory accesses, where data elements are aligned with a uniform gap between them, are performant but less desirable, as they incur an additional memory access overhead.

Accessing random memory locations represents a worst case scenario for a GPU. When a random access pattern is encountered, the memory requests diverge from one another. The GPU attempts to compensate by unrolling the operations, which serializes the execution of the memory requests. Furthermore, the SIMD design prevents cores from proceeding to the next instruction until the previous one has completed. Thus, a serialized memory request effectively blocks all threads of a thread group, which generally results in poor throughput. Divergence, however, is not limited to memory access.

A core function of any computing device is to conditionally perform operations. From a high-level perspective, conditional statements serve to guide the flow of an algorithm down a specific execution path, or branch. On a sequential machine these have relatively low impact, as the code simply jumps to a different segment of memory and continues executing. A SIMD architecture, however, allows only one code segment to be performed by all cores at a time. Therefore, when cores conditionally jump to different code segments, the GPU will serially apply the code segments; that is, some cores will be blocked until their code segment can be executed.

Experience suggests that it is not uncommon to find developers who simply try to move existing code from a sequential machine to a GPU while discounting the importance of data organization or branch divergence. Subsequently, they are disappointed by the relative performance gains achieved when using the GPU. Furthermore, it can be a non-trivial task to translate an existing algorithm to account for data organization and branch divergence. This makes developing for a GPU a challenge for many. Indeed, as part of this work it was necessary to translate the life cycle models to effectively leverage the SIMD architecture.

5.2 Data Parallel Life Cycle Models

The translation of a life cycle model for a data parallel environment can be a non-trivial task. Much of the difficulty is rooted, again, in the decision of how to represent the genome graph. The use of a binary adjacency matrix to represent the genome graph helps to reduce the translation efforts.

This is primarily due to it being a vector based representation. As such, several of the algorithms previously discussed translate well to kernels.

Several challenges were encountered in this work. The first was in the generation of random numbers; the second was in the general flow of the kernels. The following sections describe the challenges that were encountered, as well as the kernel adjustments that were made.

5.2.1 Random Number Generation

There are two general patterns for generating random numbers: inline generation and offline generation. The inline approach performs random number generation as a step within the algorithm itself; indeed, the algorithms presented earlier in this work follow the inline approach. The offline approach treats random number generation as a separate task that generates a pool of random values. Tasks that require random numbers then select values from this shared pool.

Certainly, both approaches can be used in a GPU environment. One recommendation in the NVIDIA cuRAND library [27] documentation is that generating large batches of random numbers will tend to offer better performance. From a forward-time population genetic simulation perspective, this was interpreted to suggest that following an inline approach may lead to additional performance overheads, as the number of random numbers generated at each step of a simulation is small per generation. Therefore, it was decided to follow an offline approach in the data parallel aspect of this work.

As a result of this decision, each of the life cycle models was decomposed into a two-step process. The first step performs the random number generation for a given model; the second step evaluates the model. From a heterogeneous computing perspective, this offline approach also helps to decouple a model from being performed entirely by one type of processing unit.

Random number generation libraries have been available on host systems for decades. These libraries have undergone continuous development and optimization and provide a robust set of random number distributions. However, the use of host libraries requires communicating the random numbers to the device. Although communication is not cheap, its cost can be hidden through the use of asynchronous operations; that is, through the use of different streams, a GPU can be executing a kernel while concurrently copying data on another. As such, for the purposes of this work it was decided to continue to leverage the host for random number generation tasks, despite the offerings of NVIDIA's cuRAND library [27].

An interesting challenge that the offline approach introduces in a data parallel environment is determining how a thread retrieves the random values it needs from the pool. In a sequential solution this is relatively straightforward, as the pool can be treated like a queue, with each task selecting and removing the first value in the queue. In a parallel environment this presents a data race condition, which can be avoided by pre-generating a list of non-overlapping index ranges. This technique is made possible primarily because the maximum size of the offspring population is generally known after selection, and because the number of random events is generated independently of the state of the population.

The non-overlapping index technique can also be leveraged in a data parallel solution. In effect, each parallel thread will select the random values that fall within a specified index range of the pool. This technique is necessarily used in the data parallel versions of both the recombination and the mutation models.
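The convention can be sketched as follows (hypothetical structure and function names): the host fills one pool per model, and each worker consumes only the values within its pre-assigned, non-overlapping index range, so no locking is required.

    #include <cstddef>
    #include <random>
    #include <vector>

    struct PoolSlice {
        const double* values;   // borrowed view into the shared pool
        std::size_t count;      // number of values reserved for this worker
    };

    // Pre-generate the pool on the host; events per offspring are independent
    // of the population state, so per-worker counts are known up front.
    std::vector<double> fill_pool(std::size_t total, std::mt19937& rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<double> pool(total);
        for (double& v : pool) v = u(rng);
        return pool;
    }

    // Non-overlapping slices: worker i reads pool[offsets[i], offsets[i + 1]).
    PoolSlice slice_for(const std::vector<double>& pool,
                        const std::vector<std::size_t>& offsets, std::size_t i) {
        return { pool.data() + offsets[i], offsets[i + 1] - offsets[i] };
    }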

5.2.2 Recombination

Two modifications were made to the GPU version of the recombination model. The first modification alters the classification method to avoid branch divergence. The second decomposes the recombination into two methods: the first builds a crossover matrix for the offspring generation; the second uses the crossover matrix and the results of the selection process to produce an offspring haploid genome.

Classification Method

Recall that the classification step is used to determine how many recombination events precede a given mutation within an individual. The count determines which haploid genome should contribute its state to the resulting haploid sequence.

In the sequential algorithm, it was first assumed that a list of recombination events, ordered by their genetic location, would be provided as input. The classification process would then simply search the event list to find the position at which the given mutation could be inserted. The search may be implemented as either a binary search or an early terminating linear scan. Unfortunately, both of these counting methods would introduce branch divergence in a SIMD algorithm.

The branch divergence can be avoided, though, by simply scanning through the entire event list for each mutation. This also relaxes the constraint that the recombination event list be in genetically sorted order. Certainly, scanning the entire event list is algorithmically less efficient, resulting in O(mΓ) work, whereas the original version is O(m log Γ). However, the penalty for branch divergence is expected to be greater than the overhead incurred by linearly scanning a list of elements, especially if the list is only tens, or even hundreds, of elements in size.
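In code, the branch-free classifier becomes a full scan that accumulates a comparison result instead of exiting early (a sketch; the parity remark in the final comment reflects the usual crossover convention rather than a detail stated here):

    // Count recombination events that precede a mutation's genetic position.
    // Every thread scans the full event list, so all threads in a warp execute
    // the same instructions regardless of their data (no early exit, no search).
    __host__ __device__
    unsigned int classify(double mutation_pos,
                          const double* event_pos, unsigned int event_count) {
        unsigned int preceding = 0;
        for (unsigned int i = 0; i < event_count; ++i) {
            preceding += (event_pos[i] < mutation_pos) ? 1u : 0u;  // branch-free
        }
        return preceding;  // parity typically selects the contributing strand
    }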

Algorithm decomposition

The decomposition of the recombination model is intended to unroll the nested looping logic of Algorithm 3 for a data parallel environment. Recall that the outer loop is used to iterate over two input bit vectors, selecting a block from each vector. The inner loop iterates over the set bits of the symmetric difference of the current bit blocks. Each set bit of the symmetric difference corresponds to a mutation that is to be classified. The result of the classification is used to select which input block will provide its state to the output block.

By simple observation, the bit mask that is built in each iteration of the inner loop only depends upon the mutation that corresponds to that bit position and the set of recombination events. It is independent of the state of any specific sequence data. Indeed, the symmetric difference of two blocks is used to determine which mutations need to be classified. However, this is primarily done as an optimization for sequential execution. This optimization is valuable to perform as it reduces the amount of bit walking and classification steps performed per generation.

The worst case, though, is when every bit of the symmetric difference is set. This would amount to an individual being heterozygous for every mutation in the population. While this is unlikely to occur in nature, it may arise in a simulation. In that case, every mutation needs to be classified, and having evaluated the symmetric difference provides no benefit.

The more general form of the inner loop is a simple loop over each of the mutations corresponding to bits in the current block. Furthermore, the outer loop iterates over every block of mutations. It follows, then, that these two loops could be trivially linearized into a single loop that iterates over all mutations, classifying each relative to the set of recombination events and building a bit mask vector. This type of simple iteration is trivially performed in a SIMD environment.

As a result, the first method decomposed from the recombination model builds a matrix of bit mask vectors in this manner. For the purposes of this work, this matrix is referred to as a crossover matrix. The first step of this method is for the host to generate a list of recombination events expected to occur for each individual of the offspring population.

The second step invokes a kernel that applies the classifier across all of the mutations relative to the recombination events. Each thread of the kernel is responsible for classifying one mutation; in effect, each thread generates a single bit of the resulting bit mask. The bits produced by each warp of threads are coalesced to form a single bit mask block that is stored in the crossover matrix. A row of blocks is used to evaluate a single bit mask vector, and the kernel is initialized with one block row for each offspring haploid genome, or index range defined in the first step.

The second decomposed method recombines the input haploid genomes of a parent using the bit masks defined in the crossover matrix. The first step is to perform selection on the parent population. The second step invokes a kernel where each thread reads a bit block from each parental haploid genome, and a bit mask from the crossover matrix. The bit blocks are then reduced to a single bit block using the same masking logic defined in the original algorithm.
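The companion kernel then reduces to a pure bitwise merge (a sketch with hypothetical names, assuming the selection step has already gathered the two parental strands for each offspring row):

    // out = bits taken from parental strand a where the mask bit is 0,
    //       and from strand b where the mask bit is 1.
    __global__ void recombine(const unsigned int* parent_a,   // [rows][words_per_row]
                              const unsigned int* parent_b,
                              const unsigned int* crossover,
                              unsigned int* offspring,
                              unsigned int total_words) {
        const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < total_words) {
            const unsigned int m = crossover[idx];
            offspring[idx] = (parent_a[idx] & ~m) | (parent_b[idx] & m);
        }
    }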

Decomposing recombination in this manner provides several advantages. First, the kernels for each method are simpler than a single kernel that more closely resembles the sequential algorithm would be. The simplicity of the kernels enables them to take full advantage of the computational power of the GPU. Finally, this separation enables the generation of the crossover matrix on the device to be overlapped with the selection process on the host.

5.2.3 Orchestration

By leveraging both the host and device for specific tasks in this work, it becomes necessary to effectively orchestrate the communication between the two processing elements. Intuitively, overlapping host task and kernel executions helps to reduce the overall runtime of a simulation. Although the life cycle is an inherently linear process, there are several opportunities where tasks can be overlapped.

The general approach is to overlap the second step of one model with the first step of the next. In effect, while a kernel executes on the GPU, the host is free to generate random numbers for the next step. For example, as mentioned in the previous section, while the crossover matrix is being evaluated the host is able to perform the selection model. Similarly, while the recombine kernel is being performed, the host can generate new mutations for the mutation model to consume.

There are also opportunities to overlap the execution of two kernels through the use of different streams. For instance, once the mutation model has completed, the state of the genome graph is effectively static for the remainder of the life cycle model. The genome graph is directly used in both the phenotype evaluation step and the fixation step. These steps, however, are independent of one another and do not change the state of the genome graph. Therefore, they can be trivially executed on different streams.

There are three device level synchronization steps performed within each generation. The first occurs before the recombine kernel is invoked; this is done to ensure that the crossover matrix has been evaluated and that the selected pairs have been communicated to the device. The second synchronization occurs after the mutation model has completed; this ensures that the genome graph is in a static state for the remainder of the generation. The final device synchronization is performed at the end of the generation.
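The per-generation flow might be arranged as follows (a sketch; the kernels and host functions are placeholders and error checking is omitted):

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for the real models (hypothetical).
    __global__ void k_crossover() {}
    __global__ void k_recombine() {}
    __global__ void k_mutate() {}
    __global__ void k_phenotype() {}
    __global__ void k_fixation() {}

    void host_select_parent_pairs() { /* CPU selection model */ }
    void host_generate_new_mutations() { /* CPU mutation generation */ }

    // One generation of the host/device overlap described above (syncs #1-#3).
    void run_generation(cudaStream_t s0, cudaStream_t s1) {
        k_crossover<<<64, 256, 0, s0>>>();
        host_select_parent_pairs();       // host works while the kernel executes
        cudaDeviceSynchronize();          // sync #1: crossover matrix and pairs ready

        k_recombine<<<64, 256, 0, s0>>>();
        host_generate_new_mutations();    // overlapped with the recombine kernel
        k_mutate<<<64, 256, 0, s0>>>();
        cudaDeviceSynchronize();          // sync #2: genome graph is now static

        // Independent readers of the static graph run on different streams.
        k_phenotype<<<64, 256, 0, s0>>>();
        k_fixation<<<64, 256, 0, s1>>>();
        cudaDeviceSynchronize();          // sync #3: end of the generation
    }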

5.3 Results

QTLSimCUDA is the CUDA-enabled sibling simulator to QTLSimMT. It leverages CUDA-enabled devices to handle genetic models that benefit from a data parallel architecture. Both simulators were evaluated using the same base workstation described in Table 4.1. Table 5.1 provides the specifications for the GPU used in this work.

5.3.1 Evolutionary Scenario

The evolutionary scales provided in Table 4.3 were repeated using QTLSimCUDA. The additional evolutionary scale listed in Table 5.2 was tested using both QTLSimMT and QTLSimCUDA. For this scale, QTLSimMT was limited to the T = 12 configuration. This was done primarily because T = 12 enabled the best performance at the previous scale for the multi-threaded simulator, and the order of magnitude increase in mutation rate was going to significantly increase the runtime.

    GPU                  NVIDIA GeForce GTX 970
    GPU Memory           4 GB
    Compute Capability   5.2
    CUDA API version     7.5
    Cores                1664

Table 5.1: GPU specification

                             Extra-large
    Population Size (N)      10^4
    Generations (G)          10^5
    Thread Count (T)         12
    Mutation Rate (µ)        1.0
    Recombination Rate (ρ)   1.0

Table 5.2: Extra Large evolutionary scale

Unlike QTLSimMT and the previous simulators, QTLSimCUDA does not perform any optimization to handle the neutral scenarios. As a result, it always performs a matrix multiplication during the phenotype evaluation step, even when it amounts to a multiplication by a zero matrix. It is therefore expected that QTLSimCUDA may not outperform QTLSimMT for neutral scenarios, especially at the smaller scales.

5.3.2 Small Scale Simulation

QTLSimCUDA does not outperform QTLSimMT in the neutral scenario as indicated by the results in Figure 5.1. In fact, QTLSimCUDA takes more than double the runtime of QTLSimMT. This was expected to occur because QTLSimCUDA is not optimized for the neutral scenario. As a result, the neutral and selected scenarios are expected to perform similarly to one another, which they do. The selected scenario performs slightly faster because there are fewer total mutations in the population as a result of fitness differences within the population.

Although the sequential execution of QTLSimMT is slower than QTLSimCUDA for the selected scenario, using even one additional thread enables the multi-threaded solution to outperform the GPU solution. At this scale, a multi-threaded solution is generally more performant than its GPU counterpart.

Figure 5.1: QTLSimCUDA - Small scale simulation performance results

5.3.3 Medium Scale Simulation

At this scale, the neutral scenario is again more efficiently simulated using a multi-threaded solution, as shown in Figure 5.2. However, the selected scenario offers a different story. In all cases, QTLSimCUDA outperformed QTLSimMT. It offered a 6.1x speedup over the sequential execution, but was only 54% more efficient than the T = 12 execution.

Figure 5.2: QTLSimCUDA - Medium scale simulation performance results

5.3.4 Large Scale Simulation

The large scale simulation scenario is the first at which QTLSimCUDA outperforms QTLSimMT in the neutral scenario, although this is limited to the sequential execution of QTLSimMT. The use of shared memory and the neutral scenario optimization still enable the multi-threaded executions to outperform the GPU version.

QTLSimCUDA, again, offers a significant performance advantage over QTLSimMT in the selected scenario. Compared to the sequential execution of QTLSimMT, QTLSimCUDA offers a 14.4x speedup. Even when QTLSimMT is configured with T = 12, its GPU counterpart is able to offer a 2.8x speedup, as shown by Figure 5.3.

Figure 5.3: QTLSimCUDA - Large scale simulation performance results

5.3.5 Extra Large Scale Simulation

Even at this final scale, QTLSimMT configured with T = 12 is able to offer a 3% performance advantage over QTLSimCUDA in the neutral scenario (Figure 5.4). At this scale, this equates to a 20 minute difference in runtime. Again, the selected scenario is best performed on the GPU as it offers a 4.3x speedup over its multi-threaded sibling. This is a 24.2 hour difference in runtime.

Figure 5.4: QTLSimCUDA - Extra large scale simulation performance results

Chapter 6

Limit for use

This work has focused on the in silico representation of a population in a population genetic simulation. Specifically, it considers a binary adjacency matrix as the base data structure to represent the population genome graph that results from the associations between a population of individual genomes and a set of elements. The initial observation for the use of this data structure is that the population genome graph is a simple association graph. That is, edges simply indicate whether or not two vertices are linked with one another. As such, a single bit is sufficient to provide this information, and taking advantage of this observation may enable a level of space savings.

As discussed in Chapter 3, this compact representation does help to reduce the memory footprint of a simulation. This can enable larger evolutionary scenarios to be considered without requiring more robust hardware environments. However, the compact representation does introduce some additional costs into the performance equation. For example, bit walking the sequence of set bits in each vector requires an additional step to first identify a set bit within a block of bits.

The compact adjacency matrix representation generally results from having considered the representation of a genome from a mutation based perspective. That is, the primary element of interest in the simulation is the set of mutations that are present within a population. As demonstrated, this results in a sufficiently dense graph that enables the adjacency matrix to be of benefit. However, there are evolutionary scenarios that will result in a sparse graph.

6.1 Allele based simulations

This work has attempted to view a genome as an abstract set of elements, with individuals of the population being described by their combination of elements. By abstracting the definition of the element, one enables the set of elements to be defined relative to the simulation scenario. For instance, when simulating an infinite site model [9] one might consider the element to be a mutation, or polymorphism. Conversely, an infinite allele model [10] might consider each element to be an allele, or a variant form of a specific locus. Although these models are related to one another [28], leveraging additional structural information for a locus can help to reduce the scale of the population genome graph.

The infinite allele mutation model views the genome as a finite set of loci. Each time a locus is mutated, a new allele is added to its set of alleles. In effect, this model allows the set of elements to be grouped according to common structural information. Moreover, it enables one to iteratively enumerate the unique combinations of mutations that occur in a population. This amounts to collapsing multiple edges of the genome graph into a single edge.

Figure 6.1: Locus graph

Figure 6.1 provides a general outline of how a locus-based simulator may represent the population genome graph. A global set of loci is defined relative to a reference genome. Instead of tracking each mutation independently of one another, mutations are accumulated to define a unique allele. As mutations occur within a given locus, new alleles are generated and added to the set of alleles for that locus. Since the set of loci is fixed, and each locus can exist in only one state per sequence, an individual's genome can be represented using fixed-length adjacency vectors, where each element is a reference to a specific allele of a given locus. Indeed, this is the base sequence representation offered by the simuPOP [6] simulation framework.

This approach tends to view the population genome graph as being sparse. Recall, the density of a graph refers to the number of edges of the graph, |E|, relative to the maximum number of edges for the graph, and is given by Equation (3.2). The incorporation of additional structural information enables the number of edges to be treated as a constant, with a single edge per locus. The vertex space dynamically grows as new alleles are added to the population. In effect, as the number of alleles increases, the density of the graph decreases.

In a sparse graph with a constant number of edges, it is more efficient to represent an edge by using a reference to the adjacent vertex. Again, this is achieved by enumerating the possible vertex space and using a simple integer to serve as the vertex reference, and the bit length of the integer depends upon the expected maximum size of the vertex space. In a locus-based population genome graph, this would be the expected maximum number of alleles for each locus.

In [10], Kimura and Crow showed that, in the absence of selection, the number of alleles for a locus, n, is approximately given by Equation (2.2), where Ne is the effective population size and u is the allelic mutation rate. Thus, a reference can be defined as an integer with at least Bp = ⌈log2(n)⌉ + p bits, where p is a non-negative number of padding bits. Therefore, if each locus is expected to have n alleles, then each adjacency vector is expected to have at least L · B0 bits, where L is the number of loci and B0 = ⌈log2(n)⌉. How does this compare with the expected space requirements for the binary adjacency matrix described in this work?
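As a small worked sketch of this sizing (hypothetical parameter values; I assume here that Equation (2.2) takes the classic Kimura and Crow [10] form for the effective number of alleles, n ≈ 4·Ne·u + 1):

    #include <cmath>
    #include <cstdio>

    // Bits per locus reference: Bp = ceil(log2(n)) + p padding bits.
    unsigned int bits_per_reference(double n, unsigned int padding) {
        return static_cast<unsigned int>(std::ceil(std::log2(n))) + padding;
    }

    int main() {
        const double Ne = 1e4, u = 1e-3;       // hypothetical configuration
        const double n = 4.0 * Ne * u + 1.0;   // assumed Equation (2.2): ~41 alleles
        const unsigned int L = 1000;           // number of loci (illustrative)

        const unsigned int B0 = bits_per_reference(n, 0);  // ~6 bits per locus
        std::printf("n = %.1f alleles/locus, B0 = %u bits, vector = %u bits\n",
                    n, B0, L * B0);
        return 0;
    }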

Another way to interpret the Bp bits for each locus is as a set of simple edges to multiple mutation vertices. That is, each bit can be used to indicate an association with a specific mutation, as has been done throughout this work. If the number of alleles for a locus, n, can be expressed as a unique combination of B0 mutations, the two graph representations can occupy equivalent memory footprints. There may be some variability as a result of the amount of padding used and whether un-mutated loci are necessarily represented in the graph.

In general, a dense locus based graph is expected to occur when there are few alleles per locus. As shown in [10], this occurs in an infinite allele mutation model when the inverse of the allelic mutation rate is significantly greater than the effective population size, Ne << u^{-1}. This provides a general limit for the effectiveness of the approach described in this work when considering a locus based simulation.

Chapter 7

Conclusion

This work focused on the design of scalable, high-performance forward-time population genetic simulators. It considered the general biological system that a simulator intends to mimic and the computational challenges that system presents. The first challenge is one of scale. A population is comprised of many individuals, and each individual possesses genetic material that makes them unique within the population. The genetic material of each individual contains billions of elements, making a direct in silico translation infeasible as the basis for a scalable simulator.

Instead, an individual's genetic material is reducible to a set of mutations, each of which occurs in one form or another elsewhere in the population. The set of mutations that occur within a population is significantly smaller than any genome. From an abstract perspective, a population is a set of individuals that have a set of mutations. This formulation is a graph. In effect, the scale challenge reduces to finding an in silico representation for a dynamically changing graph. Here, the graph used to represent the genetic material of all individuals in a population is referred to as the genome graph.

It is expected that each individual will have only a select subset of the mutations found in the population. Traditionally, this tends to suggest that an adjacency list representation of the relationships between individuals and mutations would provide a desirable base data structure for the genome graph. However, the average number of mutations per sequence in the population is expected to be great enough to form a sufficiently dense genome graph.

However, only simple relationships exist between individuals and mutations. That is, either the individual has the mutation or they do not. As such, a single bit is sufficient to represent an edge in the graph. If the expected number of mutations per individual is a significant enough portion of the total mutations in the population, then a binary adjacency vector can offer a reduced memory footprint for an individual. Subsequently, a binary adjacency matrix can be used to represent the population.
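A minimal sketch of such a binary adjacency vector follows; it packs one bit per mutation vertex into 64-bit words. The type and member names are hypothetical rather than those of the simulator developed in this work.

#include <cstddef>
#include <cstdint>
#include <vector>

// One row of the binary adjacency matrix: bit j is set when this sequence
// carries mutation j.  Set/test are constant-time bit operations, and
// whole-row set operations reduce to bitwise AND/OR over the packed words.
struct BinarySequence {
    std::vector<std::uint64_t> words;

    explicit BinarySequence(std::size_t num_mutations)
        : words((num_mutations + 63) / 64, 0) {}

    void add(std::size_t j)       { words[j / 64] |= (std::uint64_t{1} << (j % 64)); }
    bool has(std::size_t j) const { return (words[j / 64] >> (j % 64)) & 1u; }
};

Stacking one such row per sequence yields the binary adjacency matrix, which is why population-level operations map naturally onto wide bitwise instructions.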

This work explored the use of the binary adjacency matrix representation in forward-time population genetic simulation. The initial work was to establish a baseline for comparison. As such, a simulator based upon the binary adjacency matrix was produced. A memory comparison was conducted to evaluate its expected and observed memory usage relative to the adjacency list implementations provided by two open source forward-time population genetic simulation frameworks.

It was demonstrated that the binary adjacency matrix was able to offer a 4.5x savings in memory, although this was largely a result of several optimizations being applied. Removing the optimizations resulted in the binary adjacency matrix offering a 3x reduction relative to a comparable adjacency list approach.

The use of a binary adjacency matrix representation offered advantages and disadvantages across the various genetic models encountered in the life cycle of a population. As a result, new implementations of some common genetic models were developed to better leverage the base data structure. These were employed in the simulator used to perform the memory comparison. This simulator was used again to conduct a runtime performance comparison with simulators based upon an adjacency list representation. The binary adjacency matrix representation noticeably outperformed the others, offering at least a 6.1x speedup over its nearest competitor in a large-scale simulation.

This initial work was conducted under the constraint that a simulation should run efficiently on a single processor core. The next step was to relax the single-core constraint and explore the parallelism that is readily available in a forward-time population genetic simulation. Parallelism was explored from both task-parallel and data-parallel perspectives.

First, a task-parallel solution, QTLSimMT, was developed. It is capable of performing a simulation using either a batch or a pipeline mode of execution. These modes distribute partitions of work across the available threads, incurring different amounts of overhead as a result of their respective use cases. In general, the pipeline mode offered better performance as the number of threads increased.

Second, a data-parallel solution, QTLSimCUDA, was also developed. This solution uses a CUDA-enabled GPU to perform the data-parallel operations. The development of this simulator required some modification of the existing genetic models as a result of the SIMD architecture. In addition, some of the optimizations present in QTLSimMT were removed from QTLSimCUDA. Although QTLSimCUDA suffered in the base evolutionary scenarios as a result of not employing these optimizations, it excelled at the larger scales that involved more computation.

Despite the demonstrated space and time savings achieved through a binary adjacency matrix representation of the genome graph, its use as a general representation in forward-time population genetic simulation is limited. This limit is related to the density of the genome graph being simulated. An allele-based simulation may generally produce a sparse graph because its edge space is constant, the result of a fixed set of loci mapping to an expanding set of alleles.

In summary, this work has taken a novel approach to representing the genome graph of a population. It has shown the benefits of the representation in terms of both memory utilization and general runtime performance in a sequential environment, and it extended that study to task- and data-parallel solutions for forward-time population genetic simulation.

Bibliography

[1] Jared C. Roach, Gustavo Glusman, Arian F. A. Smit, Chad D. Huff, Robert Hubley, Paul T. Shannon, Lee Rowen, Krishna P. Pant, Nathan Goodman, Michael Bamshad, Jay Shendure, Radoje Drmanac, Lynn B. Jorde, Leroy Hood, and David J. Galas. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science, 328(5978):636–639, 2010.

[2] Antonio Carvajal-Rodríguez. Simulation of genes and genomes forward in time. Current Genomics, 11(1):58–61, March 2010.

[3] Bo Peng, Huann-Sheng Chen, Leah E. Mechanic, Ben Racine, John Clarke, Lauren Clarke, Elizabeth Gillanders, and Eric J. Feuer. Genetic simulation resources: a website for the registration and discovery of genetic data simulators. Bioinformatics, 29(8):1101–1102, 2013.

[4] Sean Hoban, Giorgio Bertorelle, and Oscar E. Gaggiotti. Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics, February 2012.

[5] Xiguo Yuan, David J. Miller, Junying Zhang, David Herrington, and Yue Wang. An overview of population genetic data simulation. Journal of Computational Biology, 19(1):42–54, January 2012. doi:10.1089/cmb.2010.0188.

[6] Bo Peng and Marek Kimmel. simuPOP: a forward-time population genetics simulation environment. Bioinformatics, 21(18):3686–3687, 2005.

[7] K. R. Thornton. A C++ template library for efficient forward-time population genetic simulation of large populations. arXiv e-prints, January 2014.

[8] Frédéric Guillaume and Jacques Rougemont. Nemo: an evolutionary and population genetics programming framework. Bioinformatics, 22(20):2556–2557, 2006.

[9] M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.

[10] Motoo Kimura and James F. Crow. The number of alleles that can be maintained in a finite population. Genetics, 49:725–738, April 1964.

[11] M. Kimura. Evolutionary rate at the molecular level. Nature, 217:624–626, February 1968.

[12] Motoo Kimura. The neutral theory of molecular evolution. Cambridge University Press, Cambridge; New York, 1983.

[13] Andreas Wagner. Neutralism and selectionism: a network-based reconciliation. Nature Reviews Genetics, 9(12):965–974, 2008.

[14] Fumio Tajima. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595, November 1989.

[15] Beth L. Dumont and Bret A. Payseur. Evolution of the genomic rate of recombination in mammals. Evolution, 62(2):276–294, 2008.

[16] Astrid Dempfle, André Scherag, Rebecca Hein, Lars Beckmann, Jenny Chang-Claude, and Helmut Schäfer. Gene-environment interactions for complex traits: definitions, methodological requirements and challenges. European Journal of Human Genetics, 16(10):1164–1172, 2008.

[17] Tesfaye M. Baye, Tilahun Abebe, and Russell A. Wilke. Genotype–environment interactions and their translational implications. Personalized Medicine, 8(1):59–70, 2011.

[18] Russell Merris. Graph theory. John Wiley, New York, 2001.

[19] K. Vaidyanathan, L. Chai, W. Huang, and D. K. Panda. Efficient asynchronous memory copy operations on multi-core systems and I/OAT. In 2007 IEEE International Conference on Cluster Computing, pages 159–168, September 2007.

[20] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7(2):256–276, 1975.

[21] Patrick P. Putnam, Philip A. Wilsey, and Ge Zhang. Clotho: addressing the scalability of forward time population genetic simulation. BMC Bioinformatics, June 2015.

[22] Richard R. Hudson. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.

[23] Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pages 127–136, New York, NY, USA, 2013. ACM.

[24] NVIDIA. CUDA C Programming Guide, September 2015.

[25] Intel®. Intel® Architecture Instruction Set Extensions and Future Features Programming Reference, October 2017.

[26] William Magro, Paul Petersen, and Sanjiv Shah. Hyper-threading technology: Impact on compute-intensive workloads. Intel Technology Journal, 6(1), 2002.

[27] NVIDIA. CURAND Library, September 2015.

[28] Fumio Tajima. Infinite-allele model and infinite-site model in population genetics. Journal of Genetics, 75(1):27–31, 1996.