APPLICATION OF THE MEDIATOR DESIGN PATTERN

TO MONTE CARLO SIMULATION IN GENETIC EPIDEMIOLOGY

by

KEVIN C. CARTIER

Submitted in partial fulfillment of the requirements

For the degree of Master of Science

Thesis Adviser: Courtney Gray-McGuire, Ph.D.

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

August, 2008

Contents

1 Introduction 1

1.1 Genetic Epidemiology ...... 1

1.1.1 Scope of Inquiry in Genetic Epidemiology ...... 2

1.1.2 Data Structures and Formats ...... 4

1.1.3 Models ...... 8

1.1.4 Observational Data ...... 11

1.2 Software Design ...... 12

1.2.1 Object-Oriented Design ...... 14

1.2.2 Software ...... 16

1.2.2.1 Creational Patterns ...... 17

1.2.2.2 Structural Patterns ...... 18

1.2.2.3 Behavioral Patterns ...... 18

1.2.3 The Mediator Pattern ...... 20

1.3 Simulation ...... 23

1.3.1 Computers and Randomness ...... 24

1.3.2 Monte Carlo Simulation ...... 25

1.3.3 Experiment by Simulation ...... 26

1.3.4 Simulation in Statistical Genetics ...... 27

1.4 Two Prevailing Methods ...... 29

1.4.1 Agent-Based Design ...... 30

1.4.2 Structure-Based Design ...... 31

1.5 Summary ...... 31

2 Background and Literature Review 34

2.1 Genotype Simulation ...... 34

2.1.1 Gene Dropping ...... 34

2.1.2 Haplotype Simulation ...... 39

2.2 Phenotype Simulation ...... 40

2.2.1 Simulating a Continuous Trait from Genotype ...... 40

2.2.2 Simulating a Binary Trait from Genotype ...... 43

2.3 Simulating Population Clustering Structures ...... 43

2.4 Simulating Recombination ...... 44

2.5 Estimating Statistical Power ...... 45

2.5.1 Statistical Testing and Power ...... 45

2.5.2 Estimating Power to Detect Linkage ...... 45

2.6 Program Survey ...... 50

2.6.1 QU-GENE ...... 50

2.6.2 METASIM ...... 51

2.6.3 GAW 10: Simulated Family Data ...... 52

2.6.4 GAW 11: Simulated Multifactorial Data ...... 52

2.6.5 GAW 12: Simulated Multifactorial Data ...... 53

2.6.6 GAW 13: Simulated Longitudinal Data ...... 54

2.6.7 SAIL ...... 54

2.6.8 SNAPPERS ...... 55

2.6.9 SIMIBD ...... 56

2.6.10 GENOOM ...... 57

2.6.11 Genetic Simulation Library (GSL) ...... 57

2.6.12 POPSIM ...... 59

2.7 Summary ...... 60

3 Thesis 62

3.1 Thesis Statement ...... 62

3.2 Specific Aims ...... 62

3.2.1 Aim 1: Design and Implementation of the Core Simulation Program ...... 63

3.2.2 Aim 2: Design and Implementation of the User Interface ...... 64

3.2.3 Aim 3: Design and Implementation of the Output Interface ...... 64

3.2.4 Aim 4: Validation against Existing Data Sets ...... 65

4 Software Design 66

4.1 Program Architecture ...... 66

4.1.1 The Essential Message Types ...... 68

4.1.2 The Message Cycle ...... 70

4.1.3 The Primary Control Loop ...... 71

4.1.4 Default Object Behavior ...... 73

4.2 The User Interface ...... 74

4.3 Simulation Parameters ...... 74

4.3.1 Genome Parameters ...... 76

4.3.2 General Operating Parameters ...... 77

4.3.3 Population Parameters ...... 79

4.3.4 Individual Parameters ...... 80

4.3.5 Trait/Phenotype Parameters ...... 81

4.3.6 Genomic/Genotypic Parameters ...... 82

4.4 Integration with S.A.G.E. Libraries ...... 83

5 Methods 84

5.1 Simulation Methods ...... 84

5.1.1 General Parameters ...... 84

5.1.2 Genome Simulation ...... 84

5.1.3 Population Simulation ...... 85

5.1.3.1 The Founding Generation ...... 85

5.1.3.2 Pairing & Procreation ...... 86

5.1.4 The Admixture Proportion Vector ...... 87

5.1.5 Individuals ...... 89

5.1.5.1 Individual Genotypes ...... 89

5.1.5.2 Individual Phenotypes ...... 91

5.1.6 Ascertainmen