
APPLICATION OF THE MEDIATOR DESIGN PATTERN TO MONTE CARLO SIMULATION IN GENETIC EPIDEMIOLOGY

by
KEVIN C. CARTIER

Submitted in partial fulfillment of the requirements
For the degree of Master of Science

Thesis Adviser: Courtney Gray-McGuire, Ph.D.

Department of Epidemiology and Biostatistics
CASE WESTERN RESERVE UNIVERSITY

August, 2008


Contents

1 Introduction
  1.1 Genetic Epidemiology
    1.1.1 Scope of Inquiry in Genetic Epidemiology
    1.1.2 Data Structures and Formats
    1.1.3 Models
    1.1.4 Observational Data
  1.2 Software Design
    1.2.1 Object-Oriented Design
    1.2.2 Software Design Patterns
      1.2.2.1 Creational Patterns
      1.2.2.2 Structural Patterns
      1.2.2.3 Behavioral Patterns
    1.2.3 The Mediator Pattern
  1.3 Simulation
    1.3.1 Computers and Randomness
    1.3.2 Monte Carlo Simulation
    1.3.3 Experiment by Simulation
    1.3.4 Simulation in Statistical Genetics
  1.4 Two Prevailing Methods
    1.4.1 Agent-Based Design
    1.4.2 Structure-Based Design
  1.5 Summary
2 Background and Literature Review
  2.1 Genotype Simulation
    2.1.1 Gene Dropping
    2.1.2 Haplotype Simulation
  2.2 Phenotype Simulation
    2.2.1 Simulating a Continuous Trait from Genotype
    2.2.2 Simulating a Binary Trait from Genotype
  2.3 Simulating Population Clustering Structures
  2.4 Simulating Recombination
  2.5 Estimating Statistical Power
    2.5.1 Statistical Testing and Power
    2.5.2 Estimating Power to Detect Linkage
  2.6 Program Survey
    2.6.1 QU-GENE
    2.6.2 METASIM
    2.6.3 GAW 10: Simulated Family Data
    2.6.4 GAW 11: Simulated Multifactoral Data
    2.6.5 GAW 12: Simulated Multifactoral Data
    2.6.6 GAW 13: Simulated Longitudinal Data
    2.6.7 SAIL
    2.6.8 SNAPPERS
    2.6.9 SIMIBD
    2.6.10 GENOOM
    2.6.11 Genetic Simulation Library (GSL)
    2.6.12 POPSIM
  2.7 Summary
3 Thesis
  3.1 Thesis Statement
  3.2 Specific Aims
    3.2.1 Aim 1: Design and Implementation of the Core Simulation Program
    3.2.2 Aim 2: Design and Implementation of the User Interface
    3.2.3 Aim 3: Design and Implementation of the Output Interface
    3.2.4 Aim 4: Validation against Existing Data Sets
4 Software Design
  4.1 Program Architecture
    4.1.1 The Essential Message Types
    4.1.2 The Message Cycle
    4.1.3 The Primary Control Loop
    4.1.4 Default Object Behavior
  4.2 The User Interface
  4.3 Simulation Parameters
    4.3.1 Genome Parameters
    4.3.2 General Operating Parameters
    4.3.3 Population Parameters
    4.3.4 Individual Parameters
    4.3.5 Trait/Phenotype Parameters
    4.3.6 Genomic/Genotypic Parameters
  4.4 Integration with S.A.G.E. Libraries
5 Methods
  5.1 Simulation Methods
    5.1.1 General Parameters
    5.1.2 Genome Simulation
    5.1.3 Population Simulation
      5.1.3.1 The Founding Generation
      5.1.3.2 Pairing & Procreation
    5.1.4 The Admixture Proportion Vector
    5.1.5 Individuals
      5.1.5.1 Individual Genotypes
      5.1.5.2 Individual Phenotypes
    5.1.6 Ascertainment
  5.2 Software Validation
    5.2.1 The GAW12 Phenotype Model
    5.2.2 Models for Linkage Analysis
6 Results & Discussion
  6.1 gSim Replicates
  6.2 Comparative Data Set Distributions
  6.3 Validation Results
7 Conclusion


List of Tables

1 gSim Summary Statistics
2 GAW12 Summary Statistics


List of Figures

1 The Scope of Genetic Epidemiology
2 Example Data for Genetic Epidemiology
3 Example Pedigree Diagram
4 Example of Relational Database Format for Pedigree Data
5 The Process of Scientific Inquiry
6 The Context of Software Development
7 Object-Oriented Software Design
8 Interface Complexity
9 Gene-Dropping Example
10 Pedigree with Affected Sibling Pair (ASP)
11 Example Pedigree
12 Partitions of Software Design
13 gSim Class Relationships
14 gSim Architecture
15 Primary Control Loop
16 Object Behavior
17 gSim Genome Description File Format
18 S.A.G.E. Genome Description File Format
19 Admixed Populations
20 The Admixture Proportion Vector
21 The GAW12 Phenotype Model (Almasy et al., 2001)
22 The Phenotype used for gSim Validation
23 Example gSim Pedigree Data
24 Frequency Distribution of Q3 Mean Values across gSim Replicates
25 Frequency Distribution of Q3 Mean Values across GAW12 Replicates
26 A Typical gSim-Generated Replicate
27 Multi-marker H-E Linkage Analysis of Q3 on Chromosome 2 (gSim replicate)
28 Multi-marker H-E Linkage Analysis of Q3 on Chromosome 17 (GAW12 replicate)
29 Summary of gSim Data Analysis (Replicates 1 - 10)
30 Model_1 Parameter Estimates for gSim Replicates
31 Model_1 log(P-value) for gSim Replicates
32 gSim vs. GAW12 Comparison of Replicates via Mosely et al.


Application of the Mediator Design Pattern to Monte Carlo Simulation in Genetic Epidemiology

Abstract

by
KEVIN C. CARTIER

Genetic epidemiology relies on simulated data to support development of theory and methods. Simulated data are designed to reflect, as accurately as possible, the true phenotypic and genotypic distributions of individuals sampled from different types of relationship clusters. Common simulation methods can be classified into two groups: agent-based methods, in which individuals are simulated, one at a time, according to the rules of Mendelian inheritance and other assumptions, and structure-based methods, in which aggregates of individuals exhibiting properties of interest are simulated, with genotypic information inferred conditionally on both phenotype and structure. A previously untried software design is proposed to support both agent-based and structure-based simulation with equal facility. The mediator design pattern is applied to simulation design, and is shown (1) to reduce the complexities arising from the potentially huge number of communication channels required between autonomous agents and (2) to provide an efficient mechanism by which higher-level system objects may override default behaviors of lower-level objects.


DEDICATION

This work is dedicated to my wife Mary, whose gracious and loving care for our children exemplifies the word dedication. And as far as I can tell, she is also the reason the sun rises and sets every day.

I would also like to commemorate my father Charles, who died earlier this year. Foremost among the many life-lessons he taught me were the values of language, organization, persistence and clarity of thought, all of which are (I hope) evident within these pages.

    Nothing in the world can take the place of persistence. Talent will not; nothing is more common than unsuccessful men with talent. Genius will not; unrewarded genius is almost a proverb. Education will not; the world is full of educated derelicts. Persistence and determination alone are omnipotent.

    Calvin Coolidge


ACKNOWLEDGEMENTS

I would like to thank the members of my thesis advisory committee for their time and effort in reviewing drafts of the manuscript and for the many constructive suggestions they provided. The list of students who have at one time or another been advised or taught by Dr. Robert Elston is very long by now, and I feel privileged to be included in that list.

I am especially grateful to Dr. Courtney Gray-McGuire, whose consistently pragmatic advice and guidance were always on target, and on more than one occasion prevented me from needlessly widening the scope of the project. There is a long-standing in-joke about computer programmers who “shoot themselves in the foot” due to overly elaborate design choices, and if I have managed to avoid doing so on this project then much of the credit must go to Dr. Gray-McGuire for keeping me properly focused throughout the course of development.


1 Introduction

As a field of scientific research, genetic epidemiology relies heavily on simulated data, for both generation and testing of hypotheses. Historically, simulated data have been produced by computer programs in a wide variety of software design and implementation schemes. It is therefore reasonable to ask whether or not existing simulation designs and implementations are sufficient for the current and future needs of researchers in the field. Further, in areas of existing design that are found to be lacking, we would like to identify a previously untried design with the potential to correct those inadequacies.

The question of optimal simulation design for genetic epidemiology is herein addressed by first examining the role of simulation within the overall process of scientific inquiry. Before evaluating any actual or idealized simulation designs, we shall first review the context in which simulated data are generated and analyzed.
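
The two claims made in the abstract for the mediator pattern can be illustrated with a small sketch. It is not taken from gSim: the class names Agent and Mediator, the message name "choose_mate", and the helper functions are all hypothetical and were chosen only for illustration. The sketch assumes that, without a mediator, every agent may need a channel to every other agent, so n agents require n(n-1)/2 channels, whereas with a mediator each agent needs only one.

```python
from typing import Callable, Dict


def direct_channels(n: int) -> int:
    """Pairwise links needed if every agent talks to every other agent."""
    return n * (n - 1) // 2


def mediated_channels(n: int) -> int:
    """Links needed if every agent talks only to a single mediator."""
    return n


class Agent:
    """Hypothetical simulated individual with a built-in default behavior."""

    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, message: str) -> str:
        return f"{self.name}: default handling of '{message}'"


class Mediator:
    """Hypothetical hub that routes messages and may override defaults."""

    def __init__(self) -> None:
        self.agents: Dict[str, Agent] = {}
        self.overrides: Dict[str, Callable[[Agent], str]] = {}

    def register(self, agent: Agent) -> None:
        self.agents[agent.name] = agent

    def override(self, message: str, handler: Callable[[Agent], str]) -> None:
        self.overrides[message] = handler

    def send(self, recipient: str, message: str) -> str:
        agent = self.agents[recipient]
        handler = self.overrides.get(message)
        # A higher-level override wins; otherwise the agent's default runs.
        if handler is not None:
            return handler(agent)
        return agent.handle(message)


mediator = Mediator()
for name in ("ind1", "ind2", "ind3"):
    mediator.register(Agent(name))

# Same message, before and after a higher-level object installs an override.
default_reply = mediator.send("ind1", "choose_mate")
mediator.override("choose_mate", lambda a: f"{a.name}: mate assigned by population")
overridden_reply = mediator.send("ind1", "choose_mate")

print(direct_channels(1000), mediated_channels(1000))  # 499500 1000
print(default_reply)
print(overridden_reply)
```

In this sketch the agents never refer to one another directly, so adding an agent adds one registration rather than a new set of pairwise links, and the override table is one simple way in which a higher-level object could replace a lower-level default without modifying the agent itself.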