A Thesis

entitled

SNPs and Indels Analysis in Human Genome using Computer Simulation and

Sequencing Data

by

Sharmistha Chakrabortty

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science Degree in Biomedical Sciences:

Bioinformatics, Proteomics and Genomics

______Dr. Alexei Fedorov, Committee Chair

______Dr. Robert Blumenthal, Committee Member

______Dr. Sadik Khuder, Committee Member

______Dr. Amanda Bryant-Friedrich, Dean College of Graduate Studies

The University of Toledo

August 2017

Copyright 2017, Sharmistha Chakrabortty

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

SNPs and Indels Analysis in Human Genome using Computer Simulation and

Sequencing Data

by

Sharmistha Chakrabortty

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Sciences: Bioinformatics, Proteomics and Genomics

The University of Toledo

August 2017

Genetic variations are the heritable changes in DNA caused by mutations and can be present in both coding and non-coding regions of the DNA. They provide great resources for the evolution of an organism in response to environmental and biological changes. Analysis of these variants (such as Single Nucleotide Polymorphisms (SNPs), Indels, and other structural variants like Copy Number Variations (CNV)) thus has a wide range of potential applications. These include identification of causative variants and the genes for genetic diseases, personalized genomics, population and evolutionary genetics, and forensic biology. This study presents two such applications of human variant analysis (particularly the analysis of SNPs and Indels). In the first chapter, SNPs were analyzed to understand the correlation between recombination rate and genetic diversity in the human genome, using a computational modeling program. A simulated human population was used to study the effect of various population-level factors, such as natural selective forces, the type of mutation, etc., on this correlation. In the second chapter, Next Generation Sequencing data (in this case Whole Exome Sequencing) and associated computational variant analysis tools and software were used to analyze both SNPs and Indels in human genomes to find a lead candidate genetic variant responsible for Inherited Retinal Dystrophy in a family.


I dedicate this work firstly to Lord Almighty for bestowing his kind blessings onto me at every stage of my life, and to my parents, Shri Arun Kumar Chakrabortty and Smt. Sikha Chakrabortty, and my brother, Dr. Sudipto Kumar Chakrabortty, who have supported me all my life and encouraged me to chase my dreams, no matter how far-fetched and difficult they may seem. Finally, I would also like to dedicate this work to all my teachers and co-workers in the past who have inspired me and ignited my mind with curiosity and thirst for knowledge.

Acknowledgements

First and foremost, I would like to acknowledge the immense contribution of my parents, Shri Arun Kumar Chakrabortty and Smt. Sikha Chakrabortty, to the successful completion of my research, for it is they who kept me going despite numerous difficulties and always inspired me to reach my goals. They sacrificed their present to secure my future. Without the unwavering support and continuous encouragement of my brother, Dr. Sudipto Kumar Chakrabortty, this work would never have seen the light of day. I would like to take this opportunity to thank my advisor, Dr. Alexei Fedorov, for his immense patience and constant motivation while I took my baby steps towards the huge ocean of scientific knowledge. I will be forever indebted to him for his irreplaceable ideas of critical analysis and his vital strategies for dealing with seemingly insurmountable bioinformatics and algorithmic challenges. I am also deeply obliged to my teachers and committee members, Dr. Robert Blumenthal, Dr. Sadik Khuder, and Dr. Robert Trumbly, for their invaluable professional and personal lessons for a successful bioinformatics career; without their constant support, this degree would not have come to fruition. I also owe gratitude to my coworkers Rajib Dutta, Patrick Brennan, and Basil Khuder for their inspiring ideas, their constant assistance while we worked as a team, and for fostering strong bonds of friendship and camaraderie. I would also like to thank Jo Anne Gray and all my colleagues for helping me at different stages of the graduate program.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Tables

List of Figures

List of Abbreviations

List of Symbols

1 Chapter 1. Correlation of recombination rate with genetic diversity in human genome

1.1 Synopsis

1.2 Introduction

1.2.1 Recombination rate an important determinant of genetic diversity

1.2.2 Recombination increases genetic diversity by reducing the effect of two main selective forces: Genetic Hitchhiking and Background Selection

1.2.3 Recombination rate is positively correlated with genetic diversity in natural populations

1.3 Materials and Methods

1.3.1 GEMA computational modelling program

1.3.2 Modes of GEMA program

1.3.3 Fitness calculation

1.3.4 Parameters used for GEMA modelling

1.3.4.1 Recombination rate

1.3.4.2 Modes of functionality

1.3.4.3 Number of offspring

1.3.4.4 Population size

1.3.4.5 Gene size and gene length

1.3.4.6 Mutation rate

1.3.4.7 Distribution of Selection Coefficient in the population

1.4 Results

1.4.1 GEMA program under saturated mode

1.4.1.1 GEMA in dominant mode of gene functionality

1.4.1.2 GEMA in codominant mode of gene functionality

1.4.1.3 GEMA in recessive mode of gene functionality

1.4.2 GEMA program under unsaturated mode

1.4.2.1 GEMA in dominant mode of gene functionality

1.4.2.2 GEMA in codominant mode of gene functionality

1.4.2.3 GEMA in recessive mode of gene functionality

1.4.3 GEMA under no selection pressure

1.5 Summary of conclusion

2 Chapter 2. Identification of rare genetic variant for Retinal Dystrophy in a family

2.1 Synopsis

2.2 Introduction

2.3 Materials and Methods

2.3.1 Filtering against 1000 Genome Phase 1 and Phase 3

2.3.2 Filtering based upon Genotype

2.3.3 Variant Analysis

2.3.3.1 Variant analysis using IGV

2.3.3.2 Variant analysis using database and literature survey

2.3.4 Confirmation of unknown novel variant

2.4 Results

2.5 Summary of conclusion

References

A Appendix A

B Appendix B

List of Tables

A.1 Results for GEMA experiments at saturated mode

A.2 Results for GEMA experiments at unsaturated mode

B.1 Potential Indels selected by variant analysis of whole-exome sequencing data

List of Figures

1-1 Outline of GEMA computer modeling program

1-2 Distribution of Selection Coefficient in virtual populations

1-3 GEMA in dominant mode of gene functionality under saturated mode

1-4 GEMA in codominant mode of gene functionality under saturated mode

1-5 GEMA in recessive mode of gene functionality under saturated mode

1-6 GEMA in dominant mode of gene functionality under unsaturated mode

1-7 GEMA in codominant mode of gene functionality under unsaturated mode

1-8 GEMA in recessive mode of gene functionality under unsaturated mode

1-9 GEMA programs under no selection pressure

2-1 Workflow of WGS/WES computational data analysis for identification of candidate variants of genetic diseases

2-2 Pedigree of a family with autosomal recessive Retinal Dystrophy

2-3 Identification of potential variants using Integrative Genomics Viewer (IGV)

2-4 Identification of potential variants of retina-related diseases using retina-specific database RETINA SEARCH – TIGEM

2-5 Flowchart of processing and filtration steps for selection of potential candidate variants of retinal dystrophy from raw whole exome sequencing fastq files

2-6 Selected candidate Indel at RP1L1 gene

2-7 Results of Human Gene Mutation Database for RP1L1 gene mutations

List of Abbreviations

BS ...... Background Selection

BWA ...... Burrows-Wheeler Aligner

CNV ...... Copy Number Variation

DCDC4B ...... Doublecortin Domain Containing 4B

GATK ...... Genome Analysis Toolkit

GEMA ...... Genome Evolution Matrix Algorithm

GH ...... Genetic Hitchhiking

HGMD ...... Human Gene Mutation Database

IGV ...... Integrative Genomics Viewer

IRD ...... Inherited Retinal Diseases

MAF ...... Minor Allele Frequency

Mb ...... Megabase pair

NGS ...... Next Generation Sequencing

RP1 ...... Retinitis Pigmentosa 1

RP1L1 ...... Retinitis Pigmentosa 1 Like 1

SNP ...... Single Nucleotide Polymorphism

SV ...... Structural Variant

VCF ...... Variant Call Format

WGS ...... Whole Genome Sequencing

WES ...... Whole Exome Sequencing

List of Symbols

α ...... Number of offspring

µ ...... Number of novel mutations per gamete

R ...... Recombination rate

S ...... Selection Coefficient

H ...... Mode of gene expression

N ...... Population size

Ne ...... Effective population size

Chapter 1

Correlation of recombination rate with genetic diversity in the human genome

1.1 Synopsis

The Genome Evolution Matrix Algorithm (GEMA) computer modeling program was used to simulate a natural human population in order to understand the correlation between recombination rate and genetic diversity. In this study, we also investigated the role of the two main evolutionary selective forces, Genetic Hitchhiking and Background Selection, in this correlation. In addition, we evaluated the effect of other population-level factors on the genetic diversity and fitness of the population: (1) the role of different mutation types (neutral, deleterious, and beneficial) via the Selection Coefficient S; (2) the effect of natural selection pressure, applied by choosing the N most fit offspring from the pseudo-couples; and (3) the effect of three different modes of gene functioning, namely dominant, codominant, and recessive. Our results from GEMA modeling suggest that recombination rate is positively correlated with genetic diversity in a population. They are thus in accordance with observations from natural human populations and from model organisms such as Drosophila and C. elegans, which also show a strong positive correlation between recombination rate and genetic diversity.

1.2 Introduction

1.2.1 Recombination rate an important determinant of genetic diversity

Genetic diversity of a population refers to the total genetic differences, or polymorphism, among individuals of that population. It represents a repository of traits possible for that organism and thus plays a major role in shaping the phenotype and evolution of that organism in response to environmental changes. Interestingly, different organisms have different levels of genetic diversity. For example, approximately 3% of the genome of Drosophila simulans shows variability (Begun 2007, Lack 2015), whereas only 0.1% of the human genome shows variability (McVean 2005, 2015). Genetic diversity has also been found to vary among different chromosomes, and among different loci within the same chromosome, in species as varied as plants (Tenaillon 2001, Nordborg 2005), fungi (Doniger 2008), and animals (Sachidanandam 2001, Wong 2004, Begun 2007). Many determinants of genetic diversity have been recognized, such as demography or population history (such as bottlenecks and migration), effective population size Ne, mating system (such as outcrossing and selfing), life history (such as lifespan, fecundity, and propagule size), rate of mutation, rate of recombination, gene density, linkage, and the natural selection process (Galtier 2016). Among all of these determinants, the rate of recombination plays a central role in regulating genome-wide genetic diversity because all the other major determinants mentioned above are either upstream or downstream regulators of recombination rate in an organism (Galtier 2016).

For example, genetic diversity is highly dependent upon the rate of mutation. Mutations mainly arise from errors in the DNA replication process and from spontaneous DNA damage caused by mutagens. It has been observed that the rate of mutation is not consistent across the genome (Hodgkinson 2011) or among species (Lynch 2010). Interestingly, recombination is involved in DNA repair in both cases, that is, after replication errors as well as after DNA damage caused by mutagenesis events (Matthew 2005). Because of this interconnection between mutation and recombination, there is a highly controversial theory about the mutagenic effect of recombination. There is therefore a chance that the rate of recombination is related to the rate of mutation and thus affects total genetic diversity (Matthew 2005).

Moreover, effective population size (Ne) positively influences genetic diversity: the higher the effective population size, the higher the expected genetic diversity (Kimura 1983). Ne also depends upon the life history of an organism (Leffler 2012, Romiguier 2014), demographic events (Bjørnstad 2001, Alcala 2014, Sun 2015), and the mating system in different species (Jarne 1995, Charlesworth 2001, Glémin 2006, Dolgin 2008, Dey 2013, Slotte 2013, Glémin 2014, Burgarella 2015, Thomas 2015, Hartfield 2016). The mating system also greatly affects the recombination rate of the genome in different species (asexual organisms have a very low rate of recombination compared to outcrossing organisms) and thus regulates genetic diversity (Stebbins 1957, Maynard Smith 1978, Judson 1996, Mark Welch 2000, Simon 2003, Igic 2013, McDonald 2016).

Additionally, recombination rate directly affects genetic diversity by breaking the linkage disequilibrium between mutations, and thus reduces the effect of natural selection forces (Corbett-Detig 2015). Therefore, the higher the recombination rate of the genome, the lower the rate of fixation and removal of mutations from the population by genetic drift (as fewer mutations will be in linkage with each other), which results in an increase in the total genetic diversity of the organism (Tajima 1990, Cutter 2013).

Moreover, genetic drift is itself negatively correlated with effective population size, i.e., the larger the population, the lower the rate of fixation or removal of mutations by genetic drift (Kimura 1983, Charlesworth 2009). Hence, all of the above-mentioned major determinants of genetic diversity are highly interlinked with each other and with recombination rate, resulting in fine regulation of the total genetic diversity of a population (Cutter 2013).

1.2.2 Recombination increases genetic diversity by reducing the effect of two main selective forces: Genetic Hitchhiking and Background Selection

Natural selection forces, which play a pivotal role in directing the evolution of an organism by fixing and removing selected alleles from a population, are of two main types: Background Selection and Genetic Hitchhiking (Tajima 1990, Dolgin 2008). Background Selection selectively removes deleterious mutations, and all the mutations linked with them, from a population (Charlesworth 1993, Charlesworth 1994), while Genetic Hitchhiking is responsible for selectively fixing a beneficial mutation, along with the mutations linked to it, in a population (Maynard Smith 1974). Thus, both of these selective forces reduce the diversity of the population, as both remove a number of neutral alleles at nearby sites along with the selected mutated alleles. However, the effect of these selective forces is inversely proportional to the recombination rate of the genome: by breaking the linkage between mutated alleles, recombination limits the physical distance over which the forces can act and hence helps maintain the genetic diversity of the genome (Maynard Smith 1974, Kaplan 1989, Begun 1992, Hudson 1995). Beneficial mutations, along with linked mutations, are fixed at a very high rate; they are therefore responsible for a reduction in genetic diversity in a very short time and are often referred to as Hard Sweeps (Maynard Smith 1974, Kaplan 1989). Slightly deleterious mutations, in contrast, are often more pervasive in the genome and in linkage with a larger number of neutral mutations, and are usually removed at a much slower rate; they result in a less pronounced reduction in genetic diversity and are often referred to as Soft Sweeps (Charlesworth 1994, Comeron 2014). However, the respective roles and effects of these two pervasive selective forces in reducing genetic diversity across the genomes of various species are still an open question, and more work needs to be done to understand this correlation completely and to determine to what extent these forces account for the observed differences in diversity among species (Enard 2014, Corbett-Detig 2015, Coop 2016).

1.2.3 Recombination rate is positively correlated with genetic diversity in natural populations

A positive correlation of recombination rate with genetic diversity was first reported in D. melanogaster (Begun 1992, Campos 2014). Later, this correlation was confirmed in other species such as humans (Nachman 2001, Lercher 2002, Lohmueller 2011), chimpanzees (Jensen 2016), birds (Huynh 2010, Mugal 2013), yeast (Moses 2011, Rattray 2015), C. elegans (Payseur 2003, Cutter 2010), plants (Stephan 1998, Tenaillon 2001, Flowers 2012), and viruses (Renzette 2016). However, the strength of this correlation was found to vary among species. For example, recombination rate was found to be most highly correlated with genetic diversity among invertebrates, followed by vertebrates, herbaceous plants, and woody plants, respectively (Corbett-Detig 2015). Even within each of these groups, the strength of this correlation was found to differ considerably. For example, among mammals, humans and chimpanzees were found to have a strong correlation between recombination and genetic diversity, but in mice this correlation was found to be very weak (Takahashi 2004). The reasons for this difference in the strength of the correlation among species, and the contribution of other factors that might influence it, remain highly controversial and largely unknown.

In order to understand this correlation of recombination rate with genetic diversity, and how population-level determinants such as selection pressure, population size, gene length, the natural selection process, and the effect of mutation affect it, in this study we used a computer modeling program called Genome Evolution Matrix Algorithm (GEMA) to simulate a natural population and all these population-level factors (Qiu et al., 2014). We also attempted to delineate and investigate the effect of the two main natural selection processes, Genetic Hitchhiking and Background Selection, on this correlation and their respective effects on the reduction in genetic diversity.

1.3 Materials and Methods

1.3.1 GEMA computational modeling program

The Genome Evolution Matrix Algorithm (GEMA) computer modeling program was used to simulate the natural human population, to understand the population-level factors involved, and to evaluate the role of different evolutionary forces in creating as well as in maintaining genetic diversity. This program, written in both Perl and Java, was developed by a former Ph.D. student of our lab. The methodology of this program is described in detail at http://bpg.utoledo.edu/~afedorov/lab/GEMA.html (Qiu et al., 2014).

In brief, the GEMA modeling program begins with a genetically identical population of size N (set as 250 in this study) with an equal number of males and females. Random novel genomic mutations (µ = 20 per gamete) were created in each individual according to the distribution of Selection Coefficients (given as input for each simulation experiment and representing the distribution of mutations by their effect in the population). Meiotic recombination was carried out to create gametes according to the recombination rate R per gamete (given as input). Random gametes were then combined to produce α offspring per pseudo-couple (given as input), and the N most fit offspring were then selected to create the next generation. This process was repeated for thousands of generations until the total genetic diversity (measured by the total number of SNPs) in the population reached a plateau. The total number of SNPs, the number of generations, and the overall fitness of the population were then recorded and used to assess the correlation between recombination rate and genetic diversity, along with all other parameters.
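The generational loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GEMA source (which is written in Perl and Java): the function names, the dictionary representation of haplotypes, the purely additive fitness, the example distribution of S, and the small default parameters are all hypothetical choices made for brevity.

```python
import bisect
import random


def draw_s(rng):
    # Assumed example distribution: 90 % neutral, 9 % S = -1, 1 % S = -100.
    return rng.choices([0, -1, -100], weights=[90, 9, 1])[0]


def make_gamete(parent, genome_len, r, mu, rng):
    """Recombine the two parental haplotypes at r random points, then add mu new mutations."""
    hap_a, hap_b = parent
    if rng.random() < 0.5:  # choose at random which haplotype the gamete starts from
        hap_a, hap_b = hap_b, hap_a
    cuts = sorted(rng.randrange(genome_len) for _ in range(r))
    gamete = {}
    for index, hap in enumerate((hap_a, hap_b)):
        for pos, s in hap.items():
            # The parity of the number of crossovers before pos decides the source haplotype.
            if bisect.bisect_right(cuts, pos) % 2 == index:
                gamete[pos] = s
    for _ in range(mu):
        gamete[rng.randrange(genome_len)] = draw_s(rng)
    return gamete


def fitness(individual):
    # Simplifying assumption: fitness is the sum of S over both haplotypes.
    return sum(individual[0].values()) + sum(individual[1].values())


def count_snps(population):
    """Count positions that are mutated in some, but not all, haplotypes."""
    counts = {}
    for individual in population:
        for hap in individual:
            for pos in hap:
                counts[pos] = counts.get(pos, 0) + 1
    total_haplotypes = 2 * len(population)
    return sum(1 for c in counts.values() if c < total_haplotypes)


def next_generation(population, genome_len, r, mu, alpha, rng):
    n = len(population)
    shuffled = population[:]
    rng.shuffle(shuffled)  # random pseudo-couples
    offspring = []
    for i in range(0, n - 1, 2):
        mother, father = shuffled[i], shuffled[i + 1]
        for _ in range(alpha):
            offspring.append((make_gamete(mother, genome_len, r, mu, rng),
                              make_gamete(father, genome_len, r, mu, rng)))
    offspring.sort(key=fitness, reverse=True)
    return offspring[:n]  # keep the N fittest offspring


def simulate(n=20, genome_len=10_000, r=2, mu=2, alpha=3, generations=30, seed=0):
    rng = random.Random(seed)
    population = [({}, {}) for _ in range(n)]  # genetically identical start
    history = []
    for _ in range(generations):
        population = next_generation(population, genome_len, r, mu, alpha, rng)
        history.append(count_snps(population))
    return population, history
```

Calling `simulate()` returns the final population and the number of SNPs after each generation; plotting that history for several values of `r` is one way to look at the diversity plateau on a toy scale.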

In this study, the Perl version of GEMA was used to study the correlation between recombination rate and genetic diversity in the human genome at the population level. We also investigated the role of the two main evolutionary selective forces, Genetic Hitchhiking and Background Selection, both of which reduce the genetic diversity of a population through linkage. In addition, we evaluated the effect of mutation type (deleterious, neutral, and beneficial) as well as of natural selection pressure on the diversity and fitness of the population. Different sets of parameter values, together with different experiments representing different simulated population scenarios, were used.

Genetically identical starting population of size N (set as 250)

Creation of genomic mutation µ in each individual (µ = 20/gamete) depending upon Selection Coefficient S

Meiotic recombination for gamete creation

Creation of random mating pairs to produce α number of offspring

Fitness calculation of the offspring based upon mode of gene functionality

N fittest offspring selected to replace parental generation

Total number of SNPs, number of generations, and fitness of the population are recorded

Fig 1-1 Outline of GEMA (Genome Evolution Matrix Algorithm) computer modeling program.

GEMA begins with a genetically identical population of size N (set as 250). It creates novel genomic mutations at a rate of µ (µ = 20 per gamete) in each individual according to the distribution of Selection Coefficients assigned for each experiment (given as input by different simulation experiments). Meiotic recombination is carried out to create gametes according to the recombination rate R per gamete (given as input). Random gametes are then combined to produce α offspring per pseudo-couple (given as input), and the N fittest offspring are then selected to become the next generation. This process is repeated for thousands of generations until the total genetic diversity (measured by the total number of SNPs) in the population reaches a plateau. The total number of SNPs, the number of generations, and the overall fitness of the population are then recorded (Qiu et al., 2014).

1.3.2 Two modes of GEMA modeling program

The GEMA modeling program used in our study runs in two different modes, saturated and unsaturated, which differ in how fitness is assigned to mutated alleles. In the saturated mode, a reverse mutation occurring on an already mutated allele reverts the effect of the earlier mutation. For example, if an allele A initially has a selection coefficient of S = -10 and a first mutation from A to G changes S to 0, then a reverse mutation from G back to A changes S from 0 back to -10. Therefore, in this mode the fitness of the population reaches a plateau after a certain number of generations and remains restricted within a certain range. In the unsaturated mode of GEMA, by contrast, a random number generator is used to select both the site of mutation and the selection coefficient of the mutation for each mutational event. The GEMA program therefore does not remember the previous S value of a mutated allele, and a reverse mutation occurring on an already mutated allele does not restore the previous selection coefficient. For example, when nucleotide A with S = -10 is mutated into G with S = 0, a reverse mutation from G back to A can change the value of S from 0 to -20 instead of -10. Thus, in this mode, fitness keeps accumulating without reaching a plateau and does not remain restricted within a certain range (http://bpg.utoledo.edu/~afedorov/lab/GEMA.html).
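The difference between the two modes can be illustrated with a small Python sketch that replays the A to G to A example from the text. The class names and the way S is stored are assumptions for illustration only; how GEMA represents this internally is not described here. In the unsaturated case the "random" draws are scripted (0, then -20) so that the example is reproducible.

```python
class SaturatedSite:
    """Saturated mode (assumed model): S is tied to the nucleotide at the site."""

    def __init__(self, s_by_nucleotide, nucleotide):
        self.s_by_nucleotide = dict(s_by_nucleotide)
        self.nucleotide = nucleotide

    @property
    def s(self):
        return self.s_by_nucleotide[self.nucleotide]

    def mutate(self, new_nucleotide):
        # A back-mutation automatically restores the S of the original nucleotide.
        self.nucleotide = new_nucleotide


class UnsaturatedSite:
    """Unsaturated mode (assumed model): every mutational event draws a fresh S."""

    def __init__(self, nucleotide, s):
        self.nucleotide = nucleotide
        self.s = s

    def mutate(self, new_nucleotide, draw_s):
        # The previous S is not remembered; draw_s() supplies a new value.
        self.nucleotide = new_nucleotide
        self.s = draw_s()


# Saturated mode: A (S = -10) -> G (S = 0) -> A restores S = -10.
saturated = SaturatedSite({"A": -10, "G": 0}, "A")
saturated.mutate("G")
saturated.mutate("A")

# Unsaturated mode: the same two events with scripted draws 0 and -20.
scripted_draws = iter([0, -20])
unsaturated = UnsaturatedSite("A", -10)
unsaturated.mutate("G", lambda: next(scripted_draws))
unsaturated.mutate("A", lambda: next(scripted_draws))

print(saturated.s, unsaturated.s)
```

In a real run the draw would come from a random number generator according to the distribution of S chosen for the experiment.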

1.3.3 Fitness calculation

In GEMA modeling experiments, fitness was calculated for every gene based on all of its SNPs (as in the case of the natural human genome). The sum of the fitness values of all the genes of an individual was then taken as the fitness of that individual. The fitness of each gene was calculated in one of three ways, depending upon the mode of gene functioning, represented by the Dominance Coefficient (H). When genes were assumed to be in codominant mode (H = 0.5), the average of the fitness values of the paternal and maternal alleles was used as the fitness of that gene, because in codominant mode the paternal and maternal alleles interact to exert an effect. When genes were assumed to be in dominant mode (H = 0), the lowest fitness value of the paternal and maternal alleles was taken as the fitness of that gene, because here even one dominant deleterious mutation is sufficient to exert its effect. In the recessive mode of gene functioning (H = 1), the highest fitness value of the paternal and maternal alleles was taken as the fitness of that gene, because here the dominant, or fittest, allele always shows its effect.
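The three cases above can be written as a single expression. The following Python sketch is an assumption-based illustration: the linear interpolation between the lower and higher allele fitness is not stated in the text, but it reproduces the three described cases (H = 0 gives the minimum, H = 0.5 the average, and H = 1 the maximum). The function names are hypothetical.

```python
def gene_fitness(paternal, maternal, h):
    """Fitness of one gene from the fitness of its two alleles and the Dominance Coefficient h."""
    low, high = min(paternal, maternal), max(paternal, maternal)
    # h = 0 -> low (dominant mode), h = 0.5 -> mean (codominant), h = 1 -> high (recessive mode)
    return low + h * (high - low)


def individual_fitness(genes, h):
    """Sum of gene fitness values; genes is a list of (paternal, maternal) allele fitness pairs."""
    return sum(gene_fitness(paternal, maternal, h) for paternal, maternal in genes)


example_genes = [(-10, 0), (2, 4)]
for h in (0, 0.5, 1):
    print(h, individual_fitness(example_genes, h))
```

For the two example genes, the individual fitness rises from the sum of the two minima at H = 0 to the sum of the two maxima at H = 1.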

1.3.4 Parameters used for GEMA modeling program

1.3.4.1 Recombination Rate (R)

Four different recombination rates (1, 6, 48, and 96 per gamete) were tested under all experimental conditions. R = 48 was used to simulate the average recombination rate in humans (http://bpg.utoledo.edu/~afedorov/lab/GEMA.html). R = 1 or 6 was used to simulate recombination cold spots of a human chromosome, and R = 96 was used to simulate recombination hotspots.
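As an illustration of what "R recombination events per gamete" means, the following Python sketch builds one gamete from two parental haplotypes with exactly R crossover points. It assumes that crossover points are distinct and uniformly distributed along the sequence and that the gamete always starts from the first haplotype; neither assumption is taken from the GEMA documentation, and the function name is hypothetical.

```python
import random


def recombine(hap_a, hap_b, r, rng):
    """Return a gamete built from hap_a and hap_b with r crossovers, plus the cut points."""
    length = len(hap_a)
    cuts = sorted(rng.sample(range(1, length), r))  # r distinct interior cut points
    gamete = []
    source, other = hap_a, hap_b
    start = 0
    for cut in cuts + [length]:
        gamete.extend(source[start:cut])
        source, other = other, source  # switch template at every crossover
        start = cut
    return gamete, cuts


rng = random.Random(0)
hap_a = ["a"] * 100
hap_b = ["b"] * 100
for r in (1, 6, 48, 96):
    gamete, cuts = recombine(hap_a, hap_b, r, rng)
    switches = sum(1 for x, y in zip(gamete, gamete[1:]) if x != y)
    print(r, switches)
```

With a higher R, the gamete is assembled from more and shorter parental segments, which is how recombination breaks the linkage between neighboring mutations.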

1.3.4.2 Modes of Dominance Coefficient (H)

The relation between recombination rate and genetic diversity was checked under three modes of gene functioning, defined by the Dominance Coefficient (H). (1) In dominant mode, the dominance coefficient (H) is 0. Here, all genes were assumed to be dominant, so that just one allele is sufficient to express its effect. (2) In codominant mode, the dominance coefficient (H) is 0.5. Here, all genes were assumed to be codominant, so that both alleles need to interact to show their effect. (3) In recessive mode, the dominance coefficient (H) is 1. Here, all genes were assumed to be recessive, so that both alleles need to be the same to show their effect.

1.3.4.3 Number of offspring

The number of offspring per pseudo mating pair is denoted by α. For my simulation experiments, I used five different values of α (2, 3, 5, 10, and 20; given as input) to evaluate the effect of selection pressure on the correlation between recombination rate and genetic diversity. Out of the total number of offspring, the N most fit offspring were selected to form the next generation, so that a constant population size of N was maintained. Therefore, the higher the value of α, the stronger the selection pressure in each generation. In the case of α = 2, by contrast, there is no selection pressure, as all the offspring are used to replace the previous generation.
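The relation between α and selection pressure can be made concrete with a short calculation. The Python sketch below assumes that the N individuals form N/2 mating pairs, which is inferred from the statement that α = 2 leaves no room for selection; the function name is hypothetical.

```python
def selection_summary(n, alpha):
    """Return (total offspring, fraction of offspring kept) for population size n and alpha offspring per pair."""
    pairs = n // 2              # assumed: N individuals form N/2 mating pairs
    offspring = pairs * alpha   # total offspring produced per generation
    kept_fraction = n / offspring
    return offspring, kept_fraction


for alpha in (2, 3, 5, 10, 20):
    offspring, kept = selection_summary(250, alpha)
    print(f"alpha={alpha}: {offspring} offspring, {kept:.0%} survive")
```

Under this assumption, the fraction of offspring that survive into the next generation is 2/α, so it falls from all offspring at α = 2 to one in ten at α = 20.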

1.3.4.4 Population size (N)

Population size N is the total number of virtual individuals in the simulated GEMA population and can be given as input by the user. For our modeling experiments, due to limited computer resources, the population size was set to 250 individuals. The population size was kept constant throughout the experiment for thousands of generations by replacing the parents with exactly the same number of offspring.

1.3.4.5 Gene size and Gene length

In GEMA modeling experiments, all nucleotide sequences were regarded as genes, and no distinction was made between coding and non-coding regions. For all the simulation experiments, the total number of genes was kept constant and set to 599 for each individual. The length of each gene was also kept constant and set to 10,000 nucleotides for all individuals. Thus, the genome of every virtual individual in the simulated GEMA population contains 599 × 10,000 = 5,990,000 nucleotides in total.

1.3.4.6 Rate of mutation

The average number of novel mutations per human genome is known to be 2µ per genome (µ = 20 per gamete) (Kondrashov and Shabalina 2002; Conrad et al. 2011; Li and Durbin 2011). Therefore, the number of random novel mutations generated for each individual in each generation was kept constant and set to 20 mutations per gamete (denoted as µ) to simulate the average frequency of mutations in normal humans. Here, only the first generation contains exactly 20 mutations per gamete, whereas every subsequent generation contains the mutations inherited from previous generations in addition to the novel mutations introduced into its genome.

1.3.4.7 Distribution of Selection Coefficient (S) in the population

Five different simulation experiments (Exp D0, Exp E0, Exp DE, Exp Dn, and Exp En) were used, each representing a different distribution of mutations, or Selection Coefficients, in the simulated population. The percentages in each experiment denote the likelihood that a novel mutation generated for a gamete is neutral, deleterious, or beneficial. These experiments are given as input to the GEMA modeling program to study the influence of neutral, beneficial, and deleterious mutations on total genetic diversity in the human population at different recombination rates and conditions.

Fig 1-2. Distribution of Selection Coefficient (S) in virtual populations.

A) In Exp D0, 90 % of mutations are neutral (S = 0), 9 % are slightly deleterious (S = -1), and 1 % are highly deleterious (S = -100). B) In Exp E0, 90 % of mutations are neutral (S = 0), 9 % are slightly beneficial (S = +1), and 1 % are highly beneficial (S = +100). C) In Exp DE, 80 % of mutations are neutral, 9 % each are slightly deleterious and slightly beneficial, and 1 % each are highly deleterious and highly beneficial. D) In Exp Dn, 90 % of mutations are neutral and 10 % are slightly deleterious. E) In Exp En, 90 % of mutations are neutral and 10 % are slightly beneficial.
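The five distributions in Fig 1-2 can be expressed as weighted choices. The Python sketch below is illustrative only: the dictionary layout and function name are hypothetical, and it assumes that "slightly" deleterious or beneficial means S = -1 or S = +1 in Exp Dn and Exp En, as it does in Exp D0 and Exp E0.

```python
import random

# Percent of novel mutations with each Selection Coefficient S, per experiment.
EXPERIMENTS = {
    "D0": {0: 90, -1: 9, -100: 1},
    "E0": {0: 90, 1: 9, 100: 1},
    "DE": {0: 80, -1: 9, 1: 9, -100: 1, 100: 1},
    "Dn": {0: 90, -1: 10},
    "En": {0: 90, 1: 10},
}


def draw_selection_coefficients(experiment, count, rng):
    """Draw `count` Selection Coefficients for novel mutations in the given experiment."""
    distribution = EXPERIMENTS[experiment]
    values = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=count)


rng = random.Random(42)
# One gamete's worth of novel mutations (mu = 20) under Exp DE.
print(draw_selection_coefficients("DE", 20, rng))
```

Drawing a large number of values and tabulating them is a quick way to check that the sampled proportions approach the percentages listed in the caption.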

1.4 Results

1.4.1 GEMA under saturated mode

All our modeling experiments showed a strong positive correlation between recombination rate and genetic diversity (measured by the total number of SNPs) in the population. Moreover, in all our modeling experiments we found that genetic diversity was highest when the number of offspring per mating pair, α, was equal to 3, followed by α = 5, 10, and 20, respectively. Thus, genetic diversity decreases with an increase in the value of α, that is, in selection pressure. Additionally, we found that the total number of SNPs was always highest for Exp D0 (90 % neutral, 9 % slightly deleterious with S = -1, and 1 % highly deleterious with S = -100), and in most cases it coincided with Exp Dn (containing 90 % neutral and 10 % slightly deleterious mutations). In contrast, all the experiments containing beneficial mutations (Exp E0, DE, and En) coincided with each other and showed a lower total number of SNPs.

1.4.1.1 GEMA in dominant mode of gene functioning

Figure 1-3 represents the effect of the number of offspring, or selection pressure, on the correlation between recombination rate and genetic diversity when genes were assumed to function in dominant mode. Interestingly, in figure 1-3 A, where α is equal to 3, Exp D0 has the highest number of SNPs for recombination rates 1, 6, and 48 per gamete. All other experiments coincided with each other. But at recombination rate 96 per gamete, the total number of SNPs for all the experiments reached almost the same point. Thus, although experiments E0, DE, Dn, and En initially had a low number of SNPs, they showed a much higher increase in the number of SNPs with an increase in recombination rate. This suggests that beneficial mutations are fixed at a much higher rate than deleterious mutations are removed. Thus, we can say that the effect of Genetic Hitchhiking, which is responsible for fixation of beneficial mutations, is more prevalent than that of Background Selection. Moreover, with an increase in α the total number of SNPs for Exp Dn started increasing relative to Exp E0 (where 90 % is S = 0, 9 % is S = +1, and 1 % is S = +100), Exp DE (where 80 % is S = 0, 9 % each is S = -1 and S = +1, and 1 % each is S = -100 and S = +100), and Exp En (where 90 % is S = 0 and 10 % is S = +1). In fig. 1-3 (C) and 1-3 (D), Dn completely coincided with Exp D0. The separation between these two sets of experiments (Exp D0-Dn and Exp DE-E0-En) increased with an increase in α. Therefore, we can say that differences between the experiments lacking and containing beneficial mutations increased with selection pressure. This suggests that selection pressure more strongly influences the rate of fixation of beneficial mutations by Genetic Hitchhiking.

1.4.1.2 GEMA in codominant mode of gene functioning

Figures 1-4 A and B are similar to figures 1-3 A and B, respectively: Exp D0 had the highest number of SNPs, followed by Exp Dn (both lacking beneficial mutations), while the remaining experiments (containing beneficial mutations) mostly coincided with each other. However, unlike figure 1-3 (C) and (D), in the codominant mode the difference between the two sets of experiments, i.e. Exp D0-Dn and Exp DE-E0-En, is much smaller in fig. 1-4 (C) and (D). This might be because when genes act in codominant mode, both alleles interact to cause a phenotypic effect, and thus each is likely to have a less pronounced effect than when genes act in dominant mode. Therefore, the difference between the rates of fixation and removal of mutations by selection forces is smaller than in the dominant mode of gene functioning.

Fig 1-3: GEMA on dominant mode of gene functioning under saturated mode.

X axis represents recombination rate (R) from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 200 thousand. Data points represent the total number of SNPs when R = 1, 6, 48, and 96 per gamete. D0, E0, DE, Dn, and En (denoted by different trendline colors and data point shapes) represent the different simulation experiments. Panels (A), (B), (C), and (D) show the relation between total SNPs and R when the number of offspring α is 3, 5, 10, and 20, respectively.

Fig 1-4: GEMA on the codominant mode of gene functioning under saturated mode.

X axis represents recombination rate (R) from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 200 thousand. Data points represent the total number of SNPs when R = 1, 6, 48, and 96 per gamete. D0, E0, DE, Dn, and En (denoted by different trendline colors and data point shapes) represent the different simulation experiments. Panels (A), (B), (C), and (D) show the relation between total SNPs and R when the number of offspring α is 3, 5, 10, and 20, respectively.

1.4.1.3 GEMA in recessive mode of gene functioning

As in the dominant and codominant modes, in the recessive mode of gene functioning the total number of SNPs was highest when α was equal to 3, followed by 5, 10, and 20, respectively. Thus, the number of SNPs decreases with an increase in α. However, strikingly, in this mode of gene functioning the data values for all five experiments (Exp D0, Dn, E0, DE, and En) coincided with each other at all four recombination rates (i.e. 1, 6, 48, and 96 per gamete). Moreover, there is no difference between the group of experiments containing beneficial mutations (Exp DE, E0, and En) and the group lacking them (Exp D0 and Dn), unlike what was seen in the dominant and codominant modes of gene functioning. This might be because, in the recessive mode of gene functioning, a mutated allele needs to be present in two copies for a phenotypic effect. Therefore, in the absence of two identical mutated alleles, different types of mutations behaved similarly, and thus their rates of fixation or removal were almost the same.

Fig 1-5: GEMA on recessive mode of gene functioning under saturated mode

X axis represents recombination rate (R) from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 200 thousand. Data points represent the total number of SNPs when R = 1, 6, 48, and 96 per gamete. D0, E0, DE, Dn, and En (denoted by different trendline colors and data point shapes) represent the different simulation experiments. Panels (A), (B), (C), and (D) show the relation between total SNPs and R when the number of offspring α is 3, 5, 10, and 20, respectively.

1.4.2 GEMA under unsaturated mode

In the unsaturated mode, the fitness of the population does not saturate after a few generations, and its fitness is not dependent upon the fitness of the previous mutation on the same nucleotide sequence. Results of GEMA under the unsaturated mode are in complete agreement with results from the saturated mode and gave more clarity, with better separation and distinction between the different experiments. Under this mode, we have results for three experiments: Exp D0 (with 9 % slightly deleterious mutations and 1 % highly deleterious mutations along with 90 % neutral mutations), Exp E0 (with 9 % slightly beneficial mutations, 1 % highly beneficial mutations, and 90 % neutral mutations), and Exp DE (with 9 % each of slightly deleterious and slightly beneficial mutations and 1 % each of highly deleterious and highly beneficial mutations, along with 80 % neutral mutations).

1.4.2.1 GEMA in dominant mode of gene functioning

Figure 1-6 represents the correlation between genetic diversity (measured by the total number of SNPs) and recombination rate at different selection pressures (controlled by the number of offspring, α) and at different selection coefficients S (set by the different experiments) under the dominant mode of gene functioning. Similar to the GEMA experiment under the saturated mode for the dominant mode of gene functioning (fig. 1-3), in fig. 1-6 Exp D0 also showed the highest number of SNPs at all the recombination points. In contrast, Exp E0 and DE (with beneficial mutations) almost completely coincided with each other for all four data points. As in the saturated mode, the total number of SNPs is highest for α = 3 (1-6 A), followed by α = 5 (1-6 B), and then α = 10 and 20, which have an almost similar number of SNPs (1-6 C and D). Moreover, as in fig. 1-3, in fig. 1-6 the separation between experiments containing only deleterious mutations and experiments containing beneficial mutations increased with an increase in α. However, because the fitness of the population does not reach a plateau and is not restricted to a certain range in the unsaturated mode, the separation between experiments containing only deleterious mutations (Exp D0) and experiments containing beneficial mutations (Exp E0 and DE) was much higher for all four α values.

Fig 1-6: GEMA on dominant mode of gene functioning under unsaturated mode

X axis represents recombination rate (R) from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 200 thousand. Data points represent the total number of SNPs when R = 1, 6, 48, and 96 per gamete. D0, E0, and DE (denoted by different trendline colors and data point shapes) represent the different simulation experiments. Panels (A), (B), (C), and (D) show the relation between total SNPs and R when the number of offspring α is 3, 5, 10, and 20, respectively.

1.4.2.2 GEMA in codominant mode of gene functioning

As in the dominant mode, in the GEMA results for the codominant mode under the unsaturated mode, Exp D0 (containing only deleterious mutations) was separated from Exp E0 and DE (containing beneficial mutations) and showed the highest number of SNPs for all data points and all α values. Experiments E0 and DE almost completely coincided with each other, as in all other GEMA experiments. As with the GEMA experiments in the dominant mode of gene functioning, the separation between the two sets, Exp D0 and Exp E0-DE, is much better than in the saturated mode. However, the separation between these two sets of experiments is still smaller than in the dominant mode of gene functioning, as was also observed in the saturated mode of GEMA experiments.

1.4.2.3 GEMA in recessive mode of gene functioning

In the recessive unsaturated mode, similar to experiments in the recessive saturated mode, GEMA experiments showed much less separation (though a bit greater than in the saturated condition) between the two sets of experiments, D0 and E0-DE. At recombination rate 96 per gamete, these two sets almost coincided with each other. However, the total number of SNPs still decreased with the increase in α, as in all other experimental conditions. Moreover, it is interesting that in the recessive condition, for all α values, the total number of SNPs in the experiments containing beneficial mutations (Exp E0 and DE) was much lower at recombination rate 6 than at recombination rate 1, but increased again at recombination rate 48. We have not seen such a phenomenon in the saturated mode. This might be because the unsaturated mode gave a much clearer distinction among the experiments.

Fig 1-7: GEMA on the codominant mode of gene functioning under unsaturated mode.

X axis represents recombination rate (R) from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 200 thousand. Data points represent the total number of SNPs when R = 1, 6, 48, and 96 per gamete. D0, E0, and DE (denoted by different trendline colors and data point shapes) represent the different simulation experiments. Panels (A), (B), (C), and (D) show the relation between total SNPs and R when the number of offspring α is 3, 5, 10, and 20, respectively.

Fig 1-8: GEMA on the recessive mode of gene functioning under unsaturated mode.

X axis represents recombination rate (R) from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 200 thousand. Data points represent the total number of SNPs when R = 1, 6, 48, and 96 per gamete. D0, E0, and DE (denoted by different trendline colors and data point shapes) represent the different simulation experiments. Panels (A), (B), (C), and (D) show the relation between total SNPs and R when the number of offspring α is 3, 5, 10, and 20, respectively.

1.4.3 GEMA under no selection pressure

When α is equal to 2, both offspring, irrespective of their fitness, are used to replace their parental generation. Therefore, natural selection pressure, which selects the fittest offspring out of all the available offspring, does not apply in this case (in GEMA experiments, selection pressure acts by selecting the N most fit offspring out of the total number of offspring). Experiments with α = 2 were therefore used as control experiments to evaluate the effects of different degrees of selection pressure on the correlation between genetic diversity and recombination. In this case, we found no difference in the total number of SNPs between the two sets, Exp D0 or Dn (experiments lacking beneficial mutations) and Exp E0 or DE (experiments containing beneficial mutations). Moreover, there was no change in the pattern of correlation between recombination rate and genetic diversity with an increase in recombination rate or with the mode of gene functioning. Furthermore, the total number of SNPs was very high (~ 250 thousand) compared to all other GEMA experiments, and remained constant. This suggests that recombination increases genetic diversity only by limiting the effect of natural selective forces and linkage. In the absence of selection, there is no correlation between recombination rate and genetic diversity. Since we observed the highest total number of SNPs (~ 250 thousand) in the absence of selection pressure, this confirms that selection forces are negatively correlated with genetic diversity. Moreover, since we have not observed any differences in the number of SNPs with respect to the mode of gene functioning, this indicates that selection pressure was responsible for the observed differences in the rate of fixation or removal of different types (deleterious, beneficial, and neutral) of mutations.
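The reason α = 2 serves as a no-selection control can be seen with a short calculation. The following Python sketch is an illustration under a stated assumption, namely that a population of N individuals forms N/2 mating pairs, each producing α offspring, and that the N fittest offspring are retained. The function names are hypothetical and this is not the GEMA code.

```python
def select_offspring(offspring_fitness, population_size):
    """Keep the population_size fittest offspring (truncation selection)."""
    if len(offspring_fitness) < population_size:
        raise ValueError("not enough offspring to replace the parental generation")
    ranked = sorted(offspring_fitness, reverse=True)
    return ranked[:population_size]


def selected_fraction(alpha, population_size):
    """Fraction of all offspring that survive to form the next generation.

    Assumption: population_size / 2 mating pairs, each producing alpha offspring.
    """
    total_offspring = alpha * population_size // 2
    return population_size / total_offspring
```

Under this assumption, selected_fraction(2, N) equals 1.0, so every offspring is kept and fitness plays no role, whereas for α = 3, 5, 10, and 20 the retained fraction falls to about 0.67, 0.4, 0.2, and 0.1, which corresponds to increasing selection pressure.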

Figure 1-9: GEMA under no selection pressure (i.e. α = 2).

X axis represents recombination rate from 1 to 100 per gamete and Y axis represents the total number of SNPs from 0 to 300 thousand. A) shows the correlation between the total number of SNPs and recombination rate (when R = 6 and 48 per gamete) for Exp D0 (with 9 % slightly deleterious mutations and 1 % highly deleterious mutations along with 90 % neutral mutations) and Exp E0 (with 9 % slightly beneficial mutations, 1 % highly beneficial mutations, and 90 % neutral mutations) under the dominant mode of gene functioning (H = 0). B) shows the correlation between the total number of SNPs and recombination rate (when R = 6 and 48 per gamete) for Exp D0 and E0 under the codominant mode of gene functioning (H = 0.5). C) shows the correlation between the total number of SNPs and recombination rate (when R = 6 and 48 per gamete) for Exp D0 and E0 under the recessive mode of gene functioning (H = 1). D) shows the correlation between the total number of SNPs and recombination rate (when R = 1 and 96 per gamete) for Exp DE (with 9 % each of slightly deleterious and slightly beneficial mutations and 1 % each of highly deleterious and highly beneficial mutations, along with 80 % neutral mutations) and Exp Dn (with 10 % slightly deleterious mutations and 90 % neutral mutations) under the dominant mode of gene functioning (H = 0). E) shows the correlation between the total number of SNPs and recombination rate (when R = 1 and 96 per gamete) for Exp DE and Dn under the codominant mode of gene functioning (H = 0.5). F) shows the correlation between the total number of SNPs and recombination rate (when R = 1 and 96 per gamete) for Exp DE and Dn under the recessive mode of gene functioning (H = 1).

1.5 Summary of conclusions

In all our computer simulation experiments, the total number of SNPs increased with an increase in recombination rate. Thus, our modeling experiments suggest that recombination is positively correlated with genetic diversity. Our results are therefore in accordance with numerous observations in natural populations of species such as human, Drosophila, and C. elegans, where a strong positive correlation was found between recombination rate and genetic diversity of the population. However, we observed that this correlation between the total number of SNPs and recombination rate reaches a plateau around recombination rate 96 per gamete. The increase in the total number of SNPs is highest from recombination rate 1 to 6, followed by 6 to 48 per gamete. The change is lowest, or the number remains the same, from recombination rate 48 to 96. Thus, an increase in recombination rate can account for the increase in diversity of the population only up to a certain limit. This may be because recombination mainly increases the diversity of the population by breaking the linkage between SNPs; therefore, it can only influence the mutations which are in linkage disequilibrium with others. In contrast, there can be many more SNPs which are not in linkage disequilibrium. In that case, an increase in recombination rate will not increase the diversity of the population.

Moreover, we have collected our results (total number of SNPs) for four recombination rates (1, 6, 48, and 96 per gamete). Results for more recombination rates, such as between 48 and 96 and beyond 96 (e.g. 112), will give us a more accurate estimate of exactly where this correlation between recombination rate and genetic diversity reaches a plateau in different experimental scenarios. By doing this we will be able to calculate the correlation coefficient between recombination rate and genetic diversity in our modeling experiments.
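Once SNP totals are available for enough recombination rates, the correlation coefficient and the per-interval change mentioned above could be computed along the following lines. This is a minimal Python sketch, assuming that a Pearson coefficient is the intended measure; the function names are hypothetical, and the SNP totals must be supplied from the simulation output, since none are included here.

```python
import math

# Recombination rates per gamete used in the experiments described above.
RECOMBINATION_RATES = [1, 6, 48, 96]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def change_per_interval(rates, snp_totals):
    """Change in total SNPs between consecutive recombination rates."""
    return [
        (rates[i], rates[i + 1], snp_totals[i + 1] - snp_totals[i])
        for i in range(len(rates) - 1)
    ]
```

A plateau would show up as a change close to zero in the last interval returned by change_per_interval, and a coefficient computed over only four points should be interpreted with caution.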

Secondly, we observed that when genes were assumed to function in dominant mode, the difference between the experiments containing beneficial mutations (Exp DE, Exp E0, Exp En) and the experiments lacking them (Exp D0 and Exp Dn) was highest, followed by the codominant mode (except at α = 2, with no selection pressure). In the recessive mode, there was no difference between these two sets of experiments. This was likely because recessive mutations need two copies to show their effect; hence, here it does not matter whether the alleles are beneficial or deleterious. As a result, no difference was found between experiments containing beneficial or deleterious mutations. Moreover, in these experiments, when selection pressure was high (i.e. when α was 10 or 20), the difference between the experiments containing beneficial mutations and those lacking them was even higher. Since we have seen this pattern in both the saturated and unsaturated modes of GEMA modeling, it affirms that this pattern was not due to an error in assigning mutations or in the fitness calculation. Thus, from our modeling results, we can presume that Genetic Hitchhiking, which is mainly responsible for fixation of beneficial mutations and the SNPs linked with them, was acting at a much faster rate in the population (Hard Sweep). As a result, beneficial mutations were fixed in the population much faster due to natural selection. In comparison, Background Selection, which mainly removes deleterious mutations and the SNPs linked with them from the population and thereby decreases its diversity, was acting at a slower rate than Genetic Hitchhiking, as the percentage increase in the number of SNPs was lower with an increase in recombination rate. Overall, there was a higher number of SNPs for the experiments containing deleterious mutations, even at low recombination rates. This observation indicates that deleterious mutations which were neither fixed nor removed were more prevalent in the population, and thereby more responsible for creating diversity, whereas, as stated above, beneficial mutations were fixed faster due to natural selection. These observations from our modeling experiments are in accordance with studies (Lohmueller KE 2011, McVicker G 2009) which reported that deleterious mutations are more prevalent in the natural human population and in other species.

However, Exp DE (80 % neutral, 1 % each highly deleterious and highly beneficial, and 9 % each slightly deleterious and slightly beneficial mutations), which is a combination of Exp D0 (90 % neutral, 9 % slightly deleterious, and 1 % highly deleterious mutations) and Exp E0 (90 % neutral, 9 % slightly beneficial, and 1 % highly beneficial mutations), behaved very similarly to Exp E0 and Exp En (90 % neutral and 10 % slightly beneficial mutations). This was probably because highly deleterious mutations were removed quickly by natural selection, just as beneficial mutations were fixed quickly (hard sweep), and thus reduced diversity. Even though Exp DE contained 9 % slightly deleterious mutations, this did not increase the total number of SNPs much, unlike in Exp D0 and Exp Dn. This may be because the slightly deleterious mutations were mostly in linkage disequilibrium with highly deleterious and beneficial mutations; hence, removal or fixation of the latter removed the former as well and therefore did not result in an increase in diversity or number of SNPs. To confirm this presumption, further simulation experiments are needed, for example with a high percentage of highly deleterious and beneficial mutations in the presence and absence of slightly deleterious and beneficial mutations.

Control experiments with no selection pressure were run by setting α equal to 2 (since all the offspring were used to replace their parents). No difference in the total number of SNPs was found between the two sets of experimental conditions (those containing beneficial mutations and those lacking them). Moreover, there was no change in the pattern of correlation between recombination and genetic diversity with an increase in recombination rate or with the mode of gene functioning. We also found that in all cases total SNPs were reduced with an increase in the number of offspring (i.e. with an increase in selection pressure); thus, the total number of SNPs was highest for α equal to 2, followed by 3, 5, 10, and 20. Therefore, we can presume that the higher the selection pressure, the less diverse the population, irrespective of recombination rate. Moreover, according to our modeling experiments, selection pressure can reduce genetic diversity only up to a certain level; beyond that, various other population- and species-specific factors cause a change in the diversity level.

There are many controversial theories about how recombination rate increases diversity. One of the most accepted theories is that recombination increases the genetic diversity of the population by limiting the effect of natural evolutionary forces (such as Genetic Hitchhiking and Background Selection) on the reduction of genetic diversity through linkage. However, other theories suggest that phenomena such as errors in the DNA double strand repair pathway, biased gene conversion, and genomic features like GC content are the main reasons for this correlation (Matthew T. Webster and Laurence D. Hurst 2011). Therefore, only by distinguishing these factors will we be able to confirm the reason behind this correlation between recombination rate and genetic diversity. Moreover, this correlation exhibits great variability in its strength. For example, yeast show a weak positive correlation, whereas mice show no correlation (Asher D. Cutter 2011). Therefore, in future, we must delve into the genomic and species-level intricacies responsible for this difference in the observed correlation.

Chapter 2

Identification of rare genetic variant for Retinal Dystrophy

2.1 Synopsis

Whole Exome Sequencing, followed by bioinformatic analysis of the sequencing data and extensive variant analysis, was carried out for a family affected by rare autosomal recessive Retinal Dystrophy, to identify the potential candidate variants responsible for this genetic disorder. 100 long pair end raw FASTQ files for each of the three members of the selected family (unaffected mother and two affected daughters) were processed into VCF files containing SNP and Indel information. Extensive filtering steps were used to remove the common variants (with Minor Allele Frequency > 1 %), followed by comprehensive variant annotation based upon Retina or Retina-associated disease databases, a literature survey, and online software and tools to narrow down the candidate list. One potential novel candidate Indel (69 nucleotides long ) on gene RP1L1 was identified, which could be the potential cause of Inherited Retinal Dystrophy in this particular family.

2.2 Introduction

Inherited Retinal Dystrophy (IRD) is a highly complex, genetically heterogeneous group of disorders of the retina, known to be associated with more than 280 genes (RetNet, http://www.sph.uth.tmc.edu/RetNet/ (SP Daiger 1998)) and more than 4000 mutations (Ran X 2014), with an incidence rate of 1 in 2000 – 3000, affecting nearly 2 million people worldwide (Dyonne T Hartong 2006). It results in dysfunction or death of photoreceptors in the retina and can be classified into three broad categories: Retinitis Pigmentosa (RP) and Choroideremia, affecting the periphery of the retina (characterized by night blindness and tunnel vision); ‘Macular’ or ‘Central’ dystrophies, affecting the macula (characterized by abnormal color vision and loss of central vision); and rod-cone dystrophies, affecting the central and peripheral retina (characterized by both central and peripheral vision loss) (Suzanne Broadgate 2017). In most cases IRD is non-syndromic (affecting only the retina); however, it can be syndromic (affecting other organs and tissues, for example when IRD leads to hearing loss) (Werdich XQ 2014).

Because of its multiple inheritance patterns (autosomal recessive, autosomal dominant, or X-linked), highly heterogeneous multigenic nature, the involvement of genetic modifiers, and low prevalence rate, it often results in inter/intrafamilial phenotypic variability (Ebermann I 2010, Chen Q 2014, Zhang Y 2014). As a result of this high allelic and locus heterogeneity, molecular and clinical diagnosis of IRDs is extremely difficult; however, the arrival of NGS technologies has dramatically revolutionized this field of study (Chiang et al. 2015). Currently, whole exome or whole genome sequencing, as well as targeted sequencing of known IRD-related genes, along with molecular diagnostic approaches such as Sanger sequencing and PCR-based approaches, provide the most advanced and accurate strategies for diagnosis and investigation of IRD-related diseases (Mamanova L 2010, Audo I 2012, Neveling K 2012, Lee K 2015).

Until the US $1000 per genome milestone becomes a reality (EC. 2014), Whole Exome Sequencing (WES) is considered the best approach for identification of mutations for rare genetic diseases such as IRD, as it provides the advantages of NGS at a lower cost. Thus, it can be efficiently utilized for sequencing multiple members of a family at high coverage for accurate determination of this rare multigenic disease in a very cost-effective manner (AB. 2011, Majewski J 2011, Rabbani B 2014). It is often advantageous over PCR and panel-based sequencing approaches, as WES allows us to detect novel genes associated with IRD and unknown mutations including deletion and as well as splice variants, which can be specific to an individual, family, or population (De Wilde B 2014, Consugar MB 2015). However, the most difficult part of using NGS-based technology is not the cost but rather proper data management and computational analysis of data in a structured manner for meaningful biological results. Even though a huge number of tools and software packages have recently been developed to ease this process, identifying the correct tools and resources for a particular study is itself a challenging task (Datta S 2010, Li H 2010, Schadt EE 2010, Bao S 2011, Nielsen R 2011, Koboldt DC 2012).

[Flowchart, in order: NGS data in FASTQ format; Quality assessment such as trimming, filtering, etc.; Mapping read alignment with the reference genome; Variant identification such as SNP, CNV, Indel, SV and conversion into VCF files; Variant annotation and further filtering steps; Variant visualization to select potential variant; Lab validation using Sanger sequencing and PCR for confirmation of candidate]

Figure 2-1: Workflow of WGS/WES computational data analysis for identification of candidate variants of genetic diseases (Stephan Pabinger 2013).

WGS or WES data analysis can be divided into five major steps (Figure 2-1, (Stephan Pabinger 2013)) after the FASTQ files containing the sequence information are obtained from their respective NGS platforms, such as ‘Roche 454’, ‘Illumina’, and ‘ABI SOLiD’ (ER. 2008, ML. 2010).

Step 1: Quality assessment of raw reads: in this step, base quality scores and sequence properties are used to filter and trim reads, to avoid base calling errors, poor quality reads, and adaptor contamination. Tools such as ‘NGSQC toolkit’ (Dai M 2010), ‘PRINSEQ’ (Schmieder R 2011), ‘FastQC’ (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc), ‘ContEst’ (Cibulskis K 2011), ‘Galaxy’ (Blankenberg D 2010), ‘htSeqTools’ (Planet E 2012), ‘SolexaQA’ (Cox MP 2010), ‘FASTX-Toolkit’ (http://hannonlab.cshl.edu/fastx_toolkit), ‘PIQA’ (Martínez-Alcántara A 2009), ‘TileQC’ (Dolan PC 2008), and ‘TagCleaner’ (Schmieder R 2010) are commonly used for this purpose.
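To make the idea of quality-based trimming and filtering concrete, the following minimal Python sketch decodes a FASTQ quality string and trims low-quality bases from the 3' end of a read. It assumes Phred+33 encoding, and the thresholds (minimum score 20, minimum length 30) are hypothetical values chosen for illustration. It does not reproduce the algorithm of any of the tools listed above.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def trim_low_quality_tail(sequence, quality, min_score=20, offset=33):
    """Remove bases from the 3' end until a base with score >= min_score is found."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_score:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filter(quality, min_mean=20, min_length=30, offset=33):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    scores = phred_scores(quality, offset)
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean
```

A read would typically be trimmed first and then checked with passes_filter, so that reads shortened too much by trimming are discarded.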

Step 2: After the initial filtering steps, reads are aligned against the chosen reference genome. Different versions of the reference genome are available on the UCSC (University of California, Santa Cruz) or GRC (Genome Reference Consortium) websites (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc, Nielsen R 2011, Raney BJ 2011). Some of the most popular tools used for this purpose are ‘Bowtie/Bowtie2’ (Langmead B 2009, Langmead B 2012), ‘BWA’ (Li H 2009, Li H 2010), ‘MAQ’ (Li H 2008), ‘mrFAST’ (Alkan C 2009), ‘Novoalign’ (http://novocraft.com/), ‘SOAP’ (Li R 2009), ‘SSAHA2’ (Ning Z 2001), ‘Stampy’ (Lunter G 2011), and ‘YOABS’ (VL. 2012).

Step 3: Variant calling or identification: there are four main types of variants: Somatic variations, Germline variations, Copy Number Variations (CNV), and Structural variations (SV). The type of tool used for the identification of these variants depends upon the experimental needs and design (Stephan Pabinger 2013). Some of the most common variant callers for SNPs, Indels, and Somatic variations are ‘CRISP’ (V. 2010), ‘GATK’ (DePristo MA 2011), ‘SAMtools’ (Li H 2009), ‘SNVer’ (Wei Z 2011), ‘VarScan2’ (Koboldt DC 2012), and ‘SomaticSniper’ (for somatic mutations only) (Larson DE 2012). The most common tools for CNV are ‘CNVnator’ (Abyzov A 2011), ‘CONTRA’ (Li J 2012), ‘ExomeCNV’ (Sathirapongsasuti JF 2011), and ‘RDXplorer’ (Yoon S 2009). ‘BreakDancer’ (Chen K 2009), ‘Breakpointer’ (Sun R 2012), ‘CLEVER’ (Marschall T 2012), ‘GASVPro’ (Sindi SS 2012), and ‘SVMerge’ (Wong K 2010) are the most commonly used tools for structural variant identification.

Step 4: Variant annotation and further filtering steps:

A typical WGS/WES experiment produces thousands of variants; it is therefore very important to annotate these variants and select only the functionally relevant ones (which depend upon the particular study) for downstream analysis. The number of variants is so large that manual annotation is not possible, so annotation tools and platforms are used for this purpose. Some of the most common variant annotation tools are ‘ANNOVAR’ (Wang K 2010), ‘AnnTools’ (Makarov V 2012), ‘NGS-SNP’ (Grant JR 2011), ‘SeattleSeq’ (http://snp.gs.washington.edu/SeattleSeqAnnotation), ‘Sequence Variant Analyzer’ (SVA) (Ge D 2011), ‘snpEff’ (Cingolani P 2012), ‘VARIANT’ (Medina I 2012) and ‘Variant Effect Predictor’ (VEP) (McLaren W 2010).

Step 5: Variant visualization and detection of potential candidate variants:

Once the variants are annotated and the functionally relevant variants are selected, it is very important to visually study and interpret the biological implications of the selected variants for further validation and confirmation. Visualization and interpretation tools display the aligned reads carrying the variants in comparison with the reference genome and also provide related information such as mapping quality, coverage, and position within the genomic region (Nielsen CB 2010). These tools can be divided into three types depending upon the purpose they serve: i) finishing tools, which are used for interpretation of selected variants; ii) genome browsers, which can be used to map the selected variants against the reference genome and different annotation tools and resources; iii) comparative viewers, which help to compare sequences and variants across multiple organisms and individuals. The most common web-based visualization resources are ‘Ensembl Genome Browser’ (Spudich GM 2010), ‘UCSC Genome Browser’ (Dreszer TR 2012) and ‘Vega’ (Vertebrate Genome Annotation) Genome Browser (J. 2005). Common standalone visualization tools are ‘Artemis’ (Carver T 2012), ‘Integrative Genomics Viewer’ (IGV) (Thorvaldsdóttir H 2013), ‘Sequence Annotation, Visualization and Analysis Tool’ (Savant) (Fiume M 2010) and ‘Circos’ for visualization of CNV and SV.

In this study, we used the Whole Exome Sequencing (WES) technique to identify potential genetic variants for a rare form of retinal dystrophy inherited in an autosomal recessive manner in a particular family. For our computational analysis of the WES data, we used some of the above-mentioned tools and resources, along with some retina-specific databases, for further annotation and filtering steps.

2.3 Materials and Methods

This project was done in collaboration with my lab member Basil Khuder (MSBS student in the Bioinformatics, Proteomics, and Genomics program), who was mainly responsible for the initial processing and filtering of the raw FASTQ data and their conversion into VCF files. I worked with the VCF files and was responsible for the further filtering of SNPs and Indels, the selection of candidate variants based upon different criteria, and the analysis of candidate variants using existing databases, software/web-based tools and a literature survey.

Whole exome sequencing of three individuals of a family (two affected daughters and the unaffected mother) was done on the Illumina HiSeq platform using the Agilent SureSelect Human All Exon V5 Kit (Rui Chen 2015). Raw FASTQ files containing millions of reads (with an average read depth of 100 nucleotides) of 100 bp (base pair) long paired-end sequences for each individual were provided by our collaborators. These FASTQ files were then processed by my lab member into VCF files using tools such as ‘Burrows-Wheeler Aligner’ (BWA) (Li H., Durbin R. 2009), ‘Picard’ (Alec Wysoker 2013), ‘SAMtools’ (Li H 2009) and ‘Genome Analysis Toolkit’ (GATK) (McKenna A. 2010). The initial raw combined variant file (containing both SNPs and Indels) contained a total of 924,444 variants. The variant calling procedure was then run to retain only the variants present on exons (using exon coordinate data), leaving 72,927 SNPs and 6,060 Indels. These files were then further filtered using hard filters (based upon the quality of reads, etc.), producing the initial VCF files containing a total of 66,999 SNPs and 6,064 Indels, which were then used for further filtering and downstream analysis.
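The exon-based restriction and hard filtering described above can be illustrated with a short Python sketch. This is not the pipeline used in this study (which relied on BWA, Picard, SAMtools and GATK); the function names, the QUAL threshold of 30, and the assumption that exon intervals on a chromosome do not overlap are hypothetical choices made only for illustration.

```python
from bisect import bisect_right


def build_exon_index(exons):
    """Group (chrom, start, end) exon intervals by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in exons:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_exon(index, chrom, pos):
    """Return True if the 1-based position lies inside an exon on chrom.

    Assumes the exon intervals of a chromosome do not overlap.
    """
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def filter_vcf_lines(lines, index, min_qual=30.0):
    """Keep header lines and records that are exonic and pass the QUAL cutoff.

    min_qual is an illustrative threshold, not the value used in the study.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, qual = fields[0], int(fields[1]), fields[5]
        if qual == "." or float(qual) < min_qual:
            continue
        if in_exon(index, chrom, pos):
            kept.append(line)
    return kept
```

In practice, overlapping exon intervals would need to be merged before building the index, and hard filters typically combine several quality annotations rather than QUAL alone.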

[Pedigree diagram. Generation I: mother, father. Generation II: daughters, sons.]

Figure 2-2: Pedigree of a family with autosomal recessive Retinal Dystrophy. The first generation (I) represents the unaffected parents and the second generation (II) represents five children, including three daughters and two sons. Of the five children, only two daughters (filled with blue color) were affected, and neither of the sons was affected by retinal dystrophy. Whole exome sequencing of the unaffected mother and the two affected daughters was carried out on the Illumina HiSeq platform using the Agilent SureSelect Human All Exon V5 Kit. 100 bp long paired-end sequencing FASTQ files for each of the three family members were provided for bioinformatic analysis and identification of causal genetic variants for autosomal recessive retinal dystrophy.

2.3.2 Filtering against 1000 Genomes Phase I and Phase III

Retinal Dystrophy is inherited in this family in an autosomal recessive manner. It is therefore expected to be caused by very rare variants with a minor allele frequency (MAF) of less than 1 %. The 1000 Genomes Project, in contrast, is a public database of common SNPs and Indels (with MAF > 1 %) from a total of 26 populations worldwide. The 1000 Genomes Phase I data contain variants for 1092 individuals from 14 different populations, comprising 38 million SNPs, 1.4 million short insertions and deletions, and 14,000 large deletions. Phase III contains 84.7 million SNPs, 3.6 million Indels and 60,000 structural variants from 2504 individuals representing 26 populations. Thus, in order to narrow down our initial variant list to only the uncommon variants with MAF < 1 %, the Perl script exon_position_comparison2.pl was used to filter both of my initial SNP and Indel VCF files against the 1000 Genomes Phase I and Phase III data and remove all common variants. A few other Perl scripts, such as dot_substitution.pl and SNP_Statistics.pl, were also used during this process.
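The logic of this filtering step can be sketched as follows. The sketch is written in Python rather than Perl and does not reproduce exon_position_comparison2.pl, whose internals are not described here. The record layout, the function names, the matching of variants by chromosome, position, reference and alternate allele, and the use of the alternate allele frequency as a stand-in for MAF are all assumptions.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalize a variant to a comparable key (drops an optional 'chr' prefix)."""
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    return (chrom, int(pos), ref.upper(), alt.upper())


def load_common_variants(reference_records, threshold=0.01):
    """Collect keys of reference-panel variants considered common.

    reference_records: iterable of (chrom, pos, ref, alt, alt_freq) tuples,
    where alt_freq is the alternate allele frequency in the panel.
    """
    return {
        variant_key(chrom, pos, ref, alt)
        for chrom, pos, ref, alt, alt_freq in reference_records
        if alt_freq > threshold
    }


def remove_common(sample_variants, common_keys):
    """Keep only the sample variants (chrom, pos, ref, alt) not listed as common."""
    return [v for v in sample_variants if variant_key(*v) not in common_keys]
```

Variants absent from the reference panel are retained by this sketch, which matches the aim of keeping rare and previously unreported variants.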

2.3.3 Selection of variants based on genotype

Since both daughters in this family are affected by Retinal Dystrophy, we hypothesized that the daughters must share the same genotype for the causative variants. Moreover, since Retinal Dystrophy is inherited in an autosomal recessive manner in this family and the parents are unaffected, we hypothesized that both the mother and the father must be carriers, and thus heterozygous, for the variants responsible for this genetic disorder. It might therefore be a case of compound heterozygosity, where each parent carries a different variant responsible for the disorder, either in the same gene, in different genes, or in a non-coding region, resulting in expression of the disease when these variants come together in their progeny.

The Perl scripts Daughters_SNPs_Comparison.pl and Daughters_Indels_Comparison.pl were used to select only those variants for which the genotypes of the daughters are the same and the mother is either heterozygous for the variant or does not carry the variant allele at all. Based upon these genotypic criteria, we again filtered the VCF files containing only the unknown variants to narrow down our list of candidate variants.
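The genotypic criteria above can be expressed compactly. The following Python sketch only illustrates the stated rule and is not the code of Daughters_SNPs_Comparison.pl or Daughters_Indels_Comparison.pl. It assumes diploid VCF-style genotype strings such as 0/1, treats missing genotypes as failing the filter, and additionally requires that the daughters carry at least one non-reference allele; the function names are hypothetical.

```python
def normalize_gt(gt):
    """Convert '0/1' or '1|0' to a sorted allele tuple; return None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return tuple(sorted(int(a) for a in alleles))


def passes_genotype_filter(daughter1, daughter2, mother):
    """Apply the family-based selection rule to one variant.

    Kept when both daughters share the same variant-carrying genotype and the
    mother is either heterozygous or homozygous for the reference allele.
    """
    d1, d2, m = normalize_gt(daughter1), normalize_gt(daughter2), normalize_gt(mother)
    if d1 is None or d2 is None or m is None:
        return False  # assumption: incomplete genotype data excludes the variant
    daughters_match = d1 == d2 and any(a > 0 for a in d1)
    mother_het = m[0] != m[1]
    mother_ref = all(a == 0 for a in m)
    return daughters_match and (mother_het or mother_ref)
```

Phasing is ignored here (0|1 and 1|0 are treated alike), which is sufficient for this rule because only the allele content of each genotype matters.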

2.3.4 Variant Analysis

2.3.4.1 Variant analysis using Integrative Genomics Viewer

The final VCF files after the rigorous filtering steps contained only 715 Indels and 1530 SNPs. Even after the initial filtering and processing steps during the conversion of the raw FASTQ files into VCF files, there were still many variants located outside the exons, or in the non-coding or untranslated regions of the genes. All such variants were removed manually by viewing and studying the variants using the ‘Integrative Genomics Viewer’ (IGV) (James T. Robinson 2011). However, variants located in a noncoding region but very close to a splice acceptor/donor site, which might affect splicing, or in high linkage disequilibrium with other variants in the coding regions, were retained. This processing step further reduced our list from 715 to 118 Indels.

Each of these 118 Indels was studied manually using IGV to evaluate its biological significance, such as whether it causes a frameshift or is a large deletion or insertion that might change the sequence. This further narrowed down our list of candidate Indels to 46.
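Part of this manual evaluation, namely whether an Indel shifts the reading frame, follows directly from the length difference between the reference and alternate alleles. A minimal Python sketch, assuming VCF-style REF and ALT strings for an Indel that lies entirely within a coding exon (the function name is hypothetical and this was not part of the workflow used here):

```python
def classify_indel(ref, alt):
    """Classify a VCF-style indel by its effect on the reading frame."""
    delta = len(alt) - len(ref)
    if delta == 0:
        return "not an indel"  # same length: a substitution, not an indel
    kind = "insertion" if delta > 0 else "deletion"
    # A length change divisible by 3 adds or removes whole codons.
    frame = "in-frame" if abs(delta) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"
```

A length change that is a multiple of three preserves the reading frame, although such an Indel may still alter protein function; Indels near splice sites need separate consideration.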

[IGV screenshot. Label: selected 69 nucleotide long deletion.]

Figure 2-3: Identification of potential variants using Integrative Genomics Viewer (IGV). The figure shows how IGV could be used to visualize and study variants related to genetic disease for each individual. For example, the image represents the selected candidate Indel in the RP1L1 gene.

2.3.4.2 Variant analysis using database search and literature survey

The retina-specific database ‘RETINA SEARCH – TIGEM’ (http://retina.tigem.it/retina_search.php), a database of genes expressed in the retina, was used to verify whether the genes containing these 46 Indels are expressed in the retina. This database also indicated whether a particular gene is known or predicted to be associated with retina-related diseases such as Retinitis Pigmentosa and photoreceptor diseases, along with the rank and p-value of that prediction. GeneCards was also used to verify mRNA and protein expression in different tissues, as well as other known information about these genes. After this initial database search, only 10 of the 46 genes carrying Indels were found to be expressed in the retina. An extensive literature survey and publicly available databases such as ‘RetNet - Retinal Information Network’ (https://sph.uth.edu/retnet/) and ‘ClinVar’ (https://www.ncbi.nlm.nih.gov/clinvar/) were used to verify whether these 10 genes are known or predicted to be associated with retina-related diseases or functions. Only one of these ten genes was found to be a strong candidate for Retinal Dystrophy in this family. The presence of this particular Indel was then re-verified in the raw FASTQ files of both daughters and the mother for confirmation.

Figure 2-4: Identification of potential variants of retina-specific diseases using retina-specific database RETINA SEARCH – TIGEM. The image shows the genes (expressed in the retina) which are involved in Photoreceptor or retina related disorders along with the p-value for such associations and rank.

2.3.4.3 Confirmation of unknown novel variant

To confirm whether the identified candidate variants for Retinal Dystrophy are novel or previously documented, an extensive literature survey and publicly available databases of known variants, such as ‘SNPnexus’, the ‘Human Gene Mutation Database’ (HGMD) and ‘dbSNP’, were used. dbSNP was first searched using the names of the genes containing our candidate variants. The human reference genome assembly GRCh37 (hg19, February 2009) was used as the reference genome for all our analyses. Since the coordinates of variants can differ between assemblies of the reference genome and depend upon the preprocessing steps and software used, we surveyed all the variants found in these two particular genes and genomic regions. However, we could not find these two candidate variants in dbSNP. Since our novel variants were absent from both the 1000 Genomes data and dbSNP, there was no rs ID for these variants. Databases of published disease-associated mutations were then searched to determine whether these variants had already been reported. For this purpose, mainly two databases and platforms, HGMD and SNPnexus, were used.

The Human Gene Mutation Database (HGMD) (Peter D. Stenson 2014) is a database of known (published) variants reported to be responsible for human inherited diseases, and is available in both a free public version and a professional version. The professional version of this database was used to obtain information about all the disease-associated mutations reported for our gene of interest. As output, HGMD categorizes all the mutations by type (such as missense, splicing variants, etc.) and provides links to the studies that reported them.

For further confirmation and analysis of our novel variants, another SNP analysis platform, SNPnexus (Abu Z Dayem Ullah 2012), was used. It helps to identify functionally relevant SNPs by bringing a wide variety of software tools and databases for SNP analysis under the same platform. Using this platform, novel variants for which only genomic coordinates are available can be mapped to contig and cytogenetic positions and given additional annotation. In our study, this platform was mainly used to determine the physical positions of our candidate variants and whether they overlap with the position of any already reported SNPs.
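The overlap check itself is a simple interval comparison. The sketch below is not SNPnexus code; it assumes VCF-style deletion records, 1-based inclusive coordinates on a single chromosome and a single assembly, and a hypothetical list of known variants given as (identifier, start, end) tuples.

```python
def deleted_interval(pos, ref, alt):
    """Return the 1-based inclusive interval of bases removed by a VCF deletion.

    In VCF, REF and ALT share leading anchor bases, so the deleted bases
    begin immediately after the shared ALT prefix.
    """
    if len(ref) <= len(alt) or not ref.startswith(alt):
        raise ValueError("expected a simple deletion with ALT as a prefix of REF")
    return pos + len(alt), pos + len(ref) - 1


def overlapping_known(interval, known_variants):
    """List identifiers of known variants whose span intersects the interval."""
    start, end = interval
    return [
        vid
        for vid, k_start, k_end in known_variants
        if k_start <= end and k_end >= start
    ]
```

Both the candidate and the known variants must be expressed on the same reference assembly before such a comparison is meaningful, which is why the assembly version is stated above.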

2.4 Results

As described in Materials and Methods, the raw 100 bp long paired-end sequencing FASTQ files for each of the three individuals (the unaffected mother and two affected daughters) of the studied family were processed and filtered using various software to produce VCF files of SNPs and Indels for downstream analysis. Up to this step, the work was carried out by my lab member Basil Khuder. These SNP and Indel VCF files were then filtered to remove common variants (MAF > 1 %) present in the 1000 Genomes Phase I and Phase III data. Variants with the selected genotypes were then retained and used as the final filtered VCF files for further variant analysis. Of the 715 selected Indels, only 118 were found to be present in the coding regions of genes, and only 46 of these 118 Indels were found to cause a change in the reading frame and thus to have the potential to disrupt protein structure and function. Furthermore, of these 46, only 10 Indel-carrying genes were found to be expressed in the retina, and these 10 Indels were selected for further analysis. An extensive literature and database survey was then carried out, which led to the selection of the Indel in RP1L1 as our strongest candidate genetic variant for retinal dystrophy in this particular family.

Raw combined FASTQ files: 924,444 variants

Variant calling based upon exon coordinates: 72,927 SNPs, 6,060 Indels

Filtration and processing into VCF files: 66,999 SNPs, 6,064 Indels

Filtration based upon 1000 Genome Phase I: 6630 SNPs, 2510 Indels

Filtration based upon 1000 Genome Phase III: 3732 SNPs, 1244 Indels

Filtration based on genotype: 1530 SNPs, 715 Indels

Indels inside coding region: 118

Significant Indels by their effect: 46

Indel-carrying genes expressed in retina: 10

Highly potential candidate variant: RP1L1

Figure 2-5: Flowchart of processing and filtration steps for selection of potential candidate variants of retinal dystrophy from raw whole exome sequencing FASTQ files.

2.4.1 RP1L1

Retinitis Pigmentosa 1 Like 1 (RP1L1), also known as Doublecortin Domain Containing 4B (DCDC4B), encodes a retina-specific protein of the doublecortin family. This 2480 amino acid long protein has two N-terminal doublecortin domains, responsible for binding and regulation of microtubule polymerization, and two C-terminal repetitive regions with a high percentage of glycine, glutamine, and glutamic acid residues. RP1L1 and its paralog Retinitis Pigmentosa 1 (RP1) synergistically play an essential role in photosensitivity and in the differentiation of photoreceptor cells, particularly in the organization of the outer segment of rod and cone photoreceptors. As a result, mutations in both RP1L1 and RP1 have been reported to be associated with Occult Macular Dystrophy (OMD) and Retinitis Pigmentosa. The C-terminal repetitive region of the RP1L1 gene contains 1-6 copies of a highly polymorphic 16 amino acid long repeat; because of this, the exact length of the protein varies from individual to individual.

RP1L1 is a 105,839 bp long gene located on the minus strand of chromosome 8. It comprises 4 exons, of which the first exon is noncoding. Our candidate Indel is a deletion of 69 nucleotides in exon 4 (the last and longest exon) of the RP1L1 gene, at chromosomal position 10,465,965. This 69-nucleotide deletion was found to lie in the C-terminal repetitive region containing the 16 amino acid long repeats. Both the mother and the two daughters were found to be heterozygous for this deletion. However, no study has yet reported the biological role of this repetitive region or the implications of copy number variation of these repeats in disease. Presumably, this deletion might result in the loss of copies of the 16 aa long repeat present in the C-terminal repetitive region and might affect the protein's interactions and role in photosensitivity.
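As a simple arithmetic check on the size of this deletion (a worked illustration, not a result reported by the databases used here), a 69-nucleotide deletion corresponds to a whole number of codons, and its length can be compared with the 16 amino acid repeat unit:

```latex
\[
69 \bmod 3 = 0, \qquad
\frac{69\ \text{nt}}{3\ \text{nt/codon}} = 23\ \text{codons}, \qquad
\frac{23\ \text{aa}}{16\ \text{aa/repeat}} \approx 1.44\ \text{repeat units}
\]
```

The deletion therefore removes 23 amino acids without shifting the reading frame, a length equivalent to roughly one and a half repeat units. How the deleted segment aligns with the repeat boundaries cannot be inferred from its length alone.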

Figure 2-6: Selected candidate Indel at RP1L1 gene.

Fig. A shows the exon-intron structure of the RP1L1 gene and the relative position of our selected 69 nucleotide deletion in exon 4. Fig. B shows the 69 nucleotide long deleted sequence and the amino acids encoded by this deleted region.

The professional version of the Human Gene Mutation Database (HGMD), which holds the latest collection of all mutations reported in the literature, showed that a total of 33 mutations have been reported so far for the RP1L1 gene. Of these, only 2 were small deletions, and the rest were missense mutations. The studies linked to the 2 deletions were then checked, and the deleted sequences were found to be different from our candidate Indel. Therefore, the Indel selected by us can be considered novel.

Moreover, a further survey of the RP1L1 gene was carried out to look for any additional variant that might lead to a case of compound heterozygosity. However, no potential candidate variant was found in the coding region of RP1L1. All the SNPs reported for RP1L1 were found to be common SNPs with MAF > 1 % and thus cannot be the cause of this rare genetic disorder in this particular family.

Figure 2-7: Results of Human Gene Mutation Database for RP1L1 gene. The HGMD database was used to obtain the list of all mutations (by type) reported in the literature so far for the RP1L1 gene. A total of 33 mutations have been reported by the HGMD professional version, including 18 missense mutations and 2 small deletions. Neither of the 2 deletions was found to match our selected Indel.

2.5 Summary of conclusions

From our whole exome sequencing of three family members and extensive variant analysis of both Indels and SNPs, we have found one strong candidate Indel for retinal dystrophy in the selected family.

Even though the selected 69 nucleotide long deletion in the RP1L1 gene does not change the reading frame, it falls in the highly polymorphic, 16 amino acid long repetitive region of the protein, which is known to exhibit copy number variation (between 1 and 6 copies) among individuals. The biological effect of this copy number variation is not yet known. However, since this glycine- and glutamine-rich repeat lies in the C-terminal region of RP1L1, which is known to be involved in protein-protein interactions, we can presume that a change in copy number caused by this deletion might affect the interaction with other proteins and might decrease its binding affinity for other cofactors. Moreover, deletions in the RP1L1 gene have already been reported in the literature (Alice E. Davidson 2012, Kaoru Fujinami 2016) to cause Retinitis Pigmentosa in the Japanese population. Therefore, we can presume that the novel deletion identified by us might be a cause of Retinal Dystrophy in this family.

Hence, further wet lab experiments such as Sanger sequencing, PCR, western blot, and mutagenesis experiments need to be performed to confirm the role of the above-mentioned deletion in retinal dystrophy in this family.

Moreover, for this to be a case of compound heterozygosity for retinal dystrophy, there must be another genetic variant associated with the disorder in this family. There is a high chance that another variant is also present in this gene. However, as we could not find any other potential candidate variant in a detailed survey of the RP1L1 gene, it is likely that the other associated variant lies either in another gene or in a non-coding region of the same or a different gene.

A similar bioinformatics analysis of our final filtered list of SNPs must be done to find potential candidate SNP variants causing retinal dystrophy in the studied family.

References

The 1000 Genomes Project Consortium (2015). "A global reference for human genetic variation." Nature 526(7571): 68-74.

(Pei-Wen), J., et al. (2015). "Progress and prospects of next-generation sequencing testing for inherited retinal dystrophy." Expert Rev. Mol. Diagn. 15(10).

AB., S. (2011). "Exome sequencing: a transformative technology." Lancet Neurol. 10(10).

Abyzov A, U. A., Snyder M (2011). "CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing." Genome Res. 21.

Alcala, N. V., S. (2014). "Turnover and accumulation of genetic diversity across large time-scale cycles of isolation and connection of populations." Proc. R. Soc. B 281(20141369).

Alkan C, K. J., Marques-Bonet T (2009). "Personalized copy number and segmental duplication maps using next-generation sequencing." Nat Genet 41.

Audo I, B. K., Leveillard T (2012). "Development and application of a next-generation-sequencing (NGS) approach to detect known and novel gene defects underlying retinal diseases." Orphanet J Rare Dis. 7.

Aya Takahashi, Y.-H. L., and Naruya Saitou (2004). "Genetic Variation Versus Recombination Rate in a Structured Population of Mice." Mol. Biol. Evol. 21: 404–409.

Bao S, J. R., Kwan W (2011). "Evaluation of next generation sequencing software in mapping and assembly." J Hum Genet 56.

Begun, D. J. (2007). "Population genomics: whole genome analysis of polymorphism and divergence in Drosophila simulans." PLoS Biology 5(e310).

Begun, D. J. A., C. F. (1992). "Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster." Nature 356: 519–520.

Bjørnstad, O. N. G., B. T. (2001). "Noisy clockwork: time series analysis of population fluctuations in animals." Science 293: 638–643.

Blankenberg D, G. A., Von Kuster G (2010). "Manipulation of FASTQ data with Galaxy." Bioinformatics 26.

Burgarella, C. (2015). "Molecular evolution of freshwater snails with contrasting mating systems." Mol. Biol. Evol. 32: 2403–2416.

Campos, J. L., Halligan, D. L., Haddrill, P. R. & Charlesworth, B. (2014). "The relation between recombination rate and patterns of molecular evolution and variation in Drosophila melanogaster." Mol. Biol. Evol. 31: 1010–1028.

Carina F Mugal, B. N. a. H. E. (2013). "Genome-wide analysis in chicken reveals that local levels of genetic diversity are mainly governed by the rate of recombination." BMC Genomics 14.

Carver T, H. S., Berriman M (2012). "Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data." Bioinformatics 28.

Charlesworth, B. (1993). "The effect of deleterious mutations on neutral molecular variation." Genetics 134: 1289–1303.

Charlesworth, B. (1994). "The effect of background selection against deleterious mutations on weakly selected linked variants." Genet. Res. 63: 213–227.

Charlesworth, B. (2009). "Effective population size and patterns of molecular evolution and variation." Nat. Rev. Genet. 10: 195–205.

Charlesworth, D. W., S. (2001). "Breeding systems and genome evolution." Curr. Opin. Genet. Dev. 11: 685–690.

Charlesworth (2013). "Background Selection 20 Years on." Journal of Heredity 104(2).

Chen K, W. J., McLellan MD (2009). "BreakDancer: an algorithm for high-resolution mapping of genomic structural variation." Nat. Methods 6.

Chen Q, Z. J., Shen Z, Zhang W, Yang J. (2014). "Whirlin and PDZ domain containing 7 (PDZD7) proteins are both required to form the quaternary protein complex associated with Usher syndrome type 2." J Biol Chem 289(52).

Cibulskis K, M. A., Fennell T (2011). "ContEst: estimating cross-contamination of human samples in next generation sequencing data." Bioinformatics 27.

Cingolani P, P. V., Coon M (2012). "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift." Front. Genet. 3.

Comeron, J. M. (2014). "Background selection as a baseline for nucleotide variation across the Drosophila genome." PLoS Genet. 10(e1004434).

Consugar MB, N.-G. D., Place EM (2015). "Panel-based genetic diagnostic testing for inherited eye diseases is highly accurate and reproducible, and more sensitive for variant detection, than exome sequencing." Genet Med. 17(4): 253-261.

Coop, G. (2016). "Does linked selection explain the narrow range of genetic diversity across species?" bioRxiv.

Corbett-Detig, R. B., Hartl, D. L. & Sackton, T. B. (2015). "Natural selection constrains neutral diversity across a wide range of species." PLoS Biol. 13(e1002112).

Cox MP, P. D., Biggs PJ. (2010). "SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data." BMC Bioinformatics 11.

Cutter AD, P. B. (2013). "Genomic signatures of selection at linked sites: unifying the disparity among species." Nat. Rev. Genet. 14.

Cutter, A. D. C., J. Y. (2010). "Natural selection shapes nucleotide polymorphism across the genome of the nematode Caenorhabditis briggsae." Genome Research 20: 1103–1111.

Cutter, A. D. P., B. A. (2013). "Genomic signatures of selection at linked sites: unifying the disparity among species." Nat. Rev. Genet 14: 262–274.

Dai M, T. R., Maher C (2010). "NGSQC: cross-platform quality analysis pipeline for deep sequencing data." BMC Genomics 11.

Datta S, D. S., Kim S (2010). "Statistical analyses of next generation sequence data: a partial overview." Bioinformatics 3.

De Wilde B, L. S., Dong W (2014). "Target enrichment using parallel nanoliter quantitative PCR amplification." BMC Genomics 15(184).

DePristo MA, B. E., Poplin R (2011). "A framework for variation discovery and genotyping using next-generation DNA sequencing data." Nat Genet 43.

Dey, A., Chan, C. K. W., Thomas, C. G. & Cutter, A. D. (2013). "Molecular hyper diversity defines populations of the nematode Caenorhabditis brenneri." Proc. Natl Acad. Sci. USA 110: 11056–11060.

Dohm JC, L. C., Borodina T (2008). "Substantial biases in ultra-short read data sets from high-throughput DNA sequencing." Nucleic Acids Research 36.

Dolan PC, D. D. (2008). "TileQC: a system for tile-based quality control of Solexa data." BMC Bioinformatics 9.

Dolgin, E. S., Charlesworth, B. & Cutter, A. D. (2008). "Population frequencies of transposable elements in selfing and outcrossing Caenorhabditis nematodes." Genetics Research 90: 317–329.

Doniger, S. W. (2008). "A catalog of neutral and deleterious polymorphism in yeast." PLoS Genet. 4(e1000183).

Dreszer TR, K. D., Zweig AS (2012). "The UCSC Genome Browser database: extensions and updates 2011." Nucleic Acids Research 40.

Dyonne T Hartong, Eliot L Berson, Thaddeus P Dryja (2006). "Retinitis pigmentosa." Lancet 368.

Ebermann I, P. J., Liebau MC (2010). "PDZD7 is a modifier of retinal disease and a contributor to digenic Usher syndrome." J Clin Invest 120(6): 1812-1823.

EC., H. (2014). "Technology: The $1,000 genome." Nature 507(7492): 294-295.

Enard, D., Messer, P. W. & Petrov, D. A. (2014). "Genome-wide signals of positive selection in human evolution." Genome Res. 24: 885–895.

ER., M. (2008). "Next-generation DNA sequencing methods." Annu Rev Genomics Hum Genet 9.

Fiume M, W. V., Brook A (2010). "Savant: genome browser for high-throughput sequencing data." Bioinformatics 26.

Flowers, J. M. (2012). "Natural selection in gene-dense regions shapes the genomic pattern of polymorphism in wild and domesticated rice." Mol. Biol. Evol. 29: 675–687.

Galtier, H. E. a. N. (2016). "Determinants of genetic diversity." Nature Genetics Review 17: 422-433.

Ge D, R. E., Shianna KV (2011). "SVA: software for annotating and visualizing sequenced human genomes." Bioinformatics 27.

Glémin, S., Bazin, E. & Charlesworth, D. (2006). "Impact of mating systems on patterns of sequence polymorphism in flowering plants." Proc. R. Soc. B 273.

Glémin, S. M., A. (2014). "Mating systems and selection efficacy: a test using chloroplastic sequence data in angiosperms." J. Evol. Biol. 27: 1386–1399.

Grant JR, A. A., Liao X (2011). "In-depth annotation of SNPs arising from resequencing projects using NGS-SNP." Bioinformatics 27.

Hartfield, M. (2016). "Evolutionary genetic consequences of facultative sex and outcrossing." J. Evol. Biol. 29: 5–22.

Hodgkinson, A. E.-W., A. (2011). "Variation in the mutation rate across mammalian genomes." Nature Review Genetics 12: 756–766.

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc The Genome Reference Consortium.

Hudson, R. R. a. K., N.L. (1995). "Deleterious background selection with recombination." Genetics 141: 1605–1617.

Igic, B. B., J. W. (2013). "Is self-fertilization an evolutionary dead end?" New Phytol. 198: 386–397.

J., L. (2005). "VEGA, the genome browser with a difference." Brief Bioinformatics 6.

Jarne, P. (1995). "Mating system, bottlenecks and genetic polymorphism in hermaphroditic animals." Genetic Research 65: 193–207.

Jensen, S. P. P. J. D. (2016). "THE IMPACT OF LINKED SELECTION IN CHIMPANZEES: A COMPARATIVE STUDY." Genome Biology and Evolution 10.

Judson, O. P. N., B. B. (1996). "Ancient asexual scandals." Trends Ecol. Evol. 11: 41–46.

Kaplan, N. L. (1989). "The ‘hitchhiking effect’ revisited." Genetics 123: 887–899.

Kimura, M. (1983). "The Neutral Theory of Molecular Evolution." Cambridge Univ. Press.

Koboldt DC, L. D., Chen K (2012). "Massively parallel sequencing approaches for characterization of structural variation." Methods Mol Biol 838.

Koboldt DC, Z. Q., Larson DE (2012). "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing." Genome Research 22.

Lack, J. B. (2015). "The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population." Genetics 199: 1229–1241.

Langmead B, S. S. (2012). "Fast gapped-read alignment with Bowtie 2." Nat. Methods 9.

Langmead B, T. C., Pop M (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome Biology 10.

Larson DE, H. C., Chen K (2012). "SomaticSniper: identification of somatic point mutations in whole genome sequencing data." Bioinformatics 28.

Lee K, G. S. (2015). "Navigating the current landscape of clinical genetic testing for inherited retinal dystrophies." Genet Med. 17(4): 245-252.

Leffler, E. M. (2012). "Revisiting an old riddle: what determines genetic diversity levels within species?" PLoS Biol. 10(e1001388).

Lercher, M. J. H., L. D. (2002). "Human SNP variability and mutation rate are higher in regions of high recombination." Trends Genet. 18: 337–340.

Li H, D. R. (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25.

Li H, D. R. (2010). "Fast and accurate long-read alignment with Burrows-Wheeler transform." Bioinformatics 26.

Li H, H. B., Wysoker A (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25.

Li H, H. N. (2010). "A survey of sequence alignment algorithms for next-generation sequencing." Brief Bioinformatics 11.

Li H, R. J., Durbin R. (2008). "Mapping short DNA sequencing reads and calling variants using mapping quality scores." Genome Res. 18.

Li J, L. R., Amarasinghe KC (2012). "CONTRA: copy number analysis for targeted resequencing." Bioinformatics 28.

Li R, Y. C., Li Y (2009). "SOAP2: an improved ultrafast tool for short read alignment." Bioinformatics 25.

Lohmueller, K. E. (2011). "Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome." PLoS Genet. 7(e1002326).

Lunter G, G. M. (2011). "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads." Genome Res. 21.

Lynch, M. (2010). "Evolution of the mutation rate." Trends Genetics 26: 345–352.

Lynn Y Huynh, D. L. M., James W Thomas (2010). "Contrasting population genetic patterns within the white-throated sparrow genome (Zonotrichia albicollis)." BMC Genetics 11.

Majewski J, S. J., Lalonde E, Montpetit A, Jabado N. (2011). "What can exome sequencing do for you?" J Med Genet. 48(9).

Makarov V, O. G. T., Cai G. (2012). "AnnTools: a comprehensive and versatile annotation toolkit for genomic variants." Bioinformatics 28.

Mamanova L, C. A., Scott CE (2010). "Target-enrichment strategies for next-generation sequencing." Nat Methods. 7(2): 111-118.

Mark Welch, D. B. M., M. (2000). "Evidence for the evolution of Bdelloid rotifers without sexual reproduction or genetic exchange." Science 288: 1211–1215.

Marschall T, C. I., Canzar S (2012). "CLEVER: clique enumerating variant finder." Bioinformatics 28.

Martínez-Alcántara A, B. E., Feng C (2009). "PIQA: pipeline for Illumina G1 genome analyzer data quality assessment." Bioinformatics 25.

Matthew, H. (2005). "How homologous recombination generates a mutable genome." Human Genomics 2.

Maynard Smith, J. (1978). "The Evolution of Sex." Cambridge Univ. Press.

Maynard Smith, J. a. H., J. (1974). "The hitch-hiking effect of a favorable gene." Genet. Res. 23: 23–35.

McDonald, M. J., Rice, D. P. & Desai, M. M. (2016). "Sex speeds adaptation by altering the dynamics of molecular evolution." Nature 531: 233–236.

McLaren W, P. B., Rios D (2010). "Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor." Bioinformatics 26.

McVean, G., Spencer, C. C. A. & Chaix, R. (2005). "Perspectives on human genetic variation from the HapMap project." PLoS Genetics 1(e54).

Medina I, D. M. A., Bleda M (2012). "VARIANT: command line, web service and web interface for fast and accurate functional characterization of variants found by next-generation sequencing." Nucleic Acids Research 40.

ML., M. (2010). "Sequencing technologies - the next generation." Nat Rev Genet 11.

Moses, A. D. C. a. A. M. (2011). "Polymorphism, Divergence, and the Role of Recombination in Saccharomyces cerevisiae Genome Evolution." Mol. Biol. Evol. 28: 1745–1754.

Nachman, M. W. (2001). "Single nucleotide polymorphisms and recombination rate in humans." Trends Genet. 17: 481–485.

Neveling K, C. R., Gilissen C (2012). "Next-generation genetic testing for retinitis pigmentosa." Hum Mutat. 33(6): 963-972.

Nielsen R, P. J., Albrechtsen A. (2011). "Genotype and SNP calling from next-generation sequencing data." Nat Rev Genet 12.

Ning Z, C. A., Mullikin JC. (2001). "SSAHA: a fast search method for large DNA databases." Genome Res. 11.

Nordborg, M. (2005). "The pattern of polymorphism in Arabidopsis thaliana." PLoS Biol. 3(e196).

Payseur, A. D. C. a. B. A. (2003). "Selection at Linked Sites in the Partial Selfer Caenorhabditis elegans." Mol. Biol. Evol. 20: 665–673.

Planet E, A. C.-O., Reina O (2012). "htSeqTools: high-throughput sequencing quality control, processing, and visualization in R." Bioinformatics 28.

Rabbani B, T. M., Mahdieh N. (2014). "The promise of whole-exome sequencing in medical genetics." J Hum Genet. 59(1).

Ran X, C. W., Huang XF (2014). "RetinoGenetics: a comprehensive mutation database for genes related to inherited retinal degeneration." Database (Oxford).

Raney BJ, C. M., Rosenbloom KR (2011). "ENCODE whole-genome data in the UCSC genome browser." Nucleic Acids Research 39.

Rattray, A., Santoyo, G., Shafer, B. & Strathern, J. N. (2015). "Elevated mutation rate during meiosis in Saccharomyces cerevisiae." PLoS Genet. 11(e1004910).

Renzette N, K. T., Jensen JD. (2016). "On the relative roles of background selection and genetic hitchhiking in shaping human cytomegalovirus genetic diversity." Mol. Ecol. 25.

Romiguier, J. (2014). "Comparative population genomics in animals uncovers the determinants of genetic diversity." Nature 515: 261–263.

Russell B. Corbett-Detig, D. L. H., Timothy B. Sackton (2015). "Natural Selection Constrains Neutral Diversity across A Wide Range of Species." PLoS Biol. 13(e1002112).

Sachidanandam, R. (2001). "A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms." Nature 409: 928–933.

Sathirapongsasuti JF, L. H., Horst BAJ (2011). "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV." Bioinformatics 27.

Schadt EE, L. M., Sorenson J (2010). "Computational solutions to large-scale data management and analysis." Nat Rev Genet 11.

Schmieder R, E. R. (2011). "Quality control and preprocessing of metagenomic datasets." Bioinformatics 27.

Schmieder R, L. Y., Rohwer F (2010). "TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets." BMC Bioinformatics 11.

Simon, J. C., Delmotte, F., Rispe, C. & Crease, T. (2003). "Phylogenetic evidence for hybrid origins of asexual lineages in an aphid species." Evolution 57: 1291–1303.

Sindi SS, O. S., Peng L (2012). "An integrative probabilistic model for identification of structural variation in sequencing data." Genome Biol. 13.

Slotte, T. (2013). "The Capsella rubella genome and the genomic consequences of rapid mating system evolution." Nature Genetics 45: 831–835.

SP Daiger, B. R., J Greenberg, A Christoffels, W Hide. (1998). "Data services and software for identifying genes and mutations causing retinal degeneration." Invest. Ophthalmol. Vis. Sci. 39.

Spudich GM, Fernández-Suárez XM (2010). "Touring Ensembl: a practical guide to genome browsing." BMC Genomics 11.

Stebbins, G. L. (1957). "Self fertilization and population variability in the higher plants." Am. Naturalist 91: 41–46.

Stephan Pabinger, A. D., Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R. Speicher, Johannes Zschocke and Zlatko Trajanoski. (2013). "A survey of tools for variant analysis of next-generation genome sequencing data." Briefings in Bioinformatics 15.

Stephan, W. L., C. H. (1998). "DNA polymorphism in Lycopersicon and crossing-over per physical length." Genetics 150: 1585–1593.

Sun, J., Cornelius, S. P., Janssen, J., Gray, K. A. & Motter, A. E. (2015). "Regularity underlies erratic population abundances in marine ecosystems." J. R. Soc. Interface 12(20150235).

Sun R, L. M., Zemojtel T (2012). "Breakpointer: using local mapping artifacts to support sequence breakpoint discovery from single-end reads." Bioinformatics 28.

Suzanne Broadgate, J. Y., Susan M. Downes, Stephanie Halford (2017). "Unravelling the genetics of inherited retinal dystrophies: Past, present and future." Elsevier.

Tajima, F. (1990). "Relationship between DNA polymorphism and fixation time." Genetics 125: 447–454.

Tenaillon, M. I. (2001). "Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.)." Proc. Natl Acad. Sci. USA 98: 9161–9166.

Thomas, C. G. (2015). "Full-genome evolutionary histories of selfing, splitting, and selection in Caenorhabditis." Genome Research 25: 667–678.

Thorvaldsdóttir H, R. J., Mesirov JP. (2013). "Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration." Briefings in Bioinformatics 14.

V., B. (2010). "A statistical method for the detection of variants from next-generation resequencing of DNA pools." Bioinformatics 26.

VL., G. (2012). "YOABS: yet other aligner of biological sequences— an efficient linearly scaling nucleotide aligner." Bioinformatics 28.

Wang K, L. M., Hakonarson H. (2010). "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data." Nucleic Acids Research 38(e164).

Wei Z, W. W., Hu P (2011). "SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data." Nucleic Acids Research 39(e132).

Werdich XQ, P. E., Pierce EA. (2014). "Systemic diseases associated with retinal dystrophies." Semin Ophthalmol 29.

Wong, G. K. S. (2004). "A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms." Nature 432: 717-722.

Wong K, K. T., Stalker J (2010). "Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly." Genome Biology 11.

Yoon S, X. Z., Makarov V (2009). "Sensitive and accurate detection of copy number variants using read depth of coverage." Genome Research 19.

Zhang Y, S. S., Bhattarai S, (2014). "BBS mutations modify phenotypic expression of CEP290-related ciliopathies." Hum Mol Genet. 23(1): 40-51.

Appendix A

Results for GEMA experiments at saturated condition

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=5,R=48,h=0      1200         129K          A=5,R=48,h=0      1080         95K
A=5,R=48,h=0.5    1130         131K          A=5,R=48,h=0.5    1000         100K
A=5,R=48,h=1      1120         111K          A=5,R=48,h=1      930          100K
A=5,R=6,h=0       1290         107K          A=5,R=6,h=0       1840         61K
A=5,R=6,h=0.5     1350         107K          A=5,R=6,h=0.5     1610         70K
A=5,R=6,h=1       1460         92K           A=5,R=6,h=1       1070         83K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=10,R=48,h=0     880          102K          A=10,R=48,h=0     1090         54K
A=10,R=48,h=0.5   770          104K          A=10,R=48,h=0.5   730          77K
A=10,R=48,h=1     800          83K           A=10,R=48,h=1     800          74K
A=10,R=6,h=0      870          86K           A=10,R=6,h=0      1730         36K
A=10,R=6,h=0.5    830          88K           A=10,R=6,h=0.5    1400         52K
A=10,R=6,h=1      950          71K           A=10,R=6,h=1      870          62K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=3,R=48,h=0      940          162K          A=3,R=48,h=0      960          145K
A=3,R=48,h=0.5    1010         162K          A=3,R=48,h=0.5    940          142K
A=3,R=48,h=1      1040         151K          A=3,R=48,h=1      910          144K
A=3,R=6,h=0       1020         156K          A=3,R=6,h=0       1360         106K
A=3,R=6,h=0.5     1020         153K          A=3,R=6,h=0.5     1300         105K
A=3,R=6,h=1       1140         140K          A=3,R=6,h=1       1040         126K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=20,R=48,h=0     660          96K           A=20,R=48,h=0     1890         22K
A=20,R=48,h=0.5   600          92K           A=20,R=48,h=0.5   730          67K
A=20,R=48,h=1     740          74K           A=20,R=48,h=1     670          58K
A=20,R=6,h=0      730          83K           A=20,R=6,h=0      2280         20K
A=20,R=6,h=0.5    710          82K           A=20,R=6,h=0.5    1130         45K
A=20,R=6,h=1      810          59K           A=20,R=6,h=1      760          49K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=5,R=48,h=0      1000         95K           A=5,R=48,h=0      970          111K
A=5,R=48,h=0.5    960          100K          A=5,R=48,h=0.5    790          113K
A=5,R=48,h=1      1030         100K          A=5,R=48,h=1      900          107K
A=5,R=6,h=0       1870         66K           A=5,R=6,h=0       840          93K
A=5,R=6,h=0.5     1490         71K           A=5,R=6,h=0.5     920          92K
A=5,R=6,h=1       1050         84K           A=5,R=6,h=1       970          91K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=2,R=48,H=0      2420         253K          A=2,R=48,H=0      2430         254K
A=2,R=48,H=0.5    2330         252K          A=2,R=48,H=0.5    2500         253K
A=2,R=48,H=1      2330         253K          A=2,R=48,H=1      2520         255K
A=2,R=6,H=0       2440         255K          A=2,R=6,H=0.5     2790         257K
A=2,R=6,H=0.5     2570         257K          A=2,R=6,H=1       2560         253K
A=2,R=6,H=1       2490         249K          A=2,R=48,H=0      2710         254K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=5,R=96,H=0      1050         131K          A=5,R=96,H=0      960          104K
A=5,R=96,H=0.5    1080         130K          A=5,R=96,H=0.5    980          110K
A=5,R=96,H=1      1100         115K          A=5,R=96,H=1      940          108K
A=5,R=1,H=0       2270         71K           A=5,R=1,H=0       3880         41K
A=5,R=1,H=0.5     1740         72K           A=5,R=1,H=0.5     2720         48K
A=5,R=1,H=1       1280         96K           A=5,R=1,H=1       1180         95K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=5,R=96,H=0      1000         102K          A=5,R=96,H=0      990          116K
A=5,R=96,H=0.5    1030         109K          A=5,R=96,H=0.5    960          118K
A=5,R=96,H=1      990          106K          A=5,R=96,H=1      1010         113K
A=5,R=1,H=0       4020         45K           A=5,R=1,H=0       2450         51K
A=5,R=1,H=0.5     3850         47K           A=5,R=1,H=0.5     2100         55K
A=5,R=1,H=1       1090         98K           A=5,R=1,H=1       1260         91K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=3,R=96,H=0      970          164K          A=3,R=96,H=0      1110         152K
A=3,R=96,H=0.5    1020         161K          A=3,R=96,H=0.5    1050         152K
A=3,R=96,H=1      1110         153K          A=3,R=96,H=1      980          151K
A=3,R=1,H=0       1190         129K          A=3,R=1,H=0       3180         74K
A=3,R=1,H=0.5     1200         130K          A=3,R=1,H=0.5     2080         77K
A=3,R=1,H=1       1160         147K          A=3,R=1,H=1       980          132K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=10,R=1,H=0      2400         47K           A=10,R=1,H=0      4800         24K
A=10,R=1,H=0.5    2210         50K           A=10,R=1,H=0.5    3350         35K
A=10,R=1,H=1      1480         68K           A=10,R=1,H=1      1150         70K
A=10,R=96,H=0     900          104K          A=10,R=96,H=0     1400         60K
A=10,R=96,H=0.5   940          108K          A=10,R=96,H=0.5   1120         85K
A=10,R=96,H=1     990          86K           A=10,R=96,H=1     900          79K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=3,R=96,H=0      1670         155K          A=3,R=96,H=0      1710         156K
A=3,R=96,H=0.5    1740         155K          A=3,R=96,H=0.5    1720         157K
A=3,R=96,H=1      1860         157K          A=3,R=96,H=1      1740         157K
A=3,R=1,H=0       6920         73K           A=3,R=1,H=0       6480         74K
A=3,R=1,H=0.5     5440         76K           A=3,R=1,H=0.5     5370         81K
A=3,R=1,H=1       2040         140K          A=3,R=1,H=1       2180         138K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=10,R=1,H=0      4820         27K           A=10,R=1,H=0      1910         45K
A=10,R=1,H=0.5    3160         34K           A=10,R=1,H=0.5    1840         47K
A=10,R=1,H=1      970          73K           A=10,R=1,H=1      1250         64K
A=10,R=96,H=0     1030         62K           A=10,R=96,H=0     840          101K
A=10,R=96,H=0.5   880          84K           A=10,R=96,H=0.5
A=10,R=96,H=1     890          80K           A=10,R=96,H=1     910          88K

Exp En            Generations  Total SNPs    Exp En            Generations  Total SNPs
A=3,R=96,H=0      1470         162K          A=10,R=1,H=0      9920         24K
A=3,R=96,H=0.5    1470         162K          A=10,R=1,H=0.5    5480         34K
A=3,R=96,H=1      1450         161K          A=10,R=1,H=1      1520         74K
A=3,R=1,H=0       7770         72K           A=10,R=96,H=0     2180         54K
A=3,R=1,H=0.5     4940         81K           A=10,R=96,H=0.5   1310         92K
A=3,R=1,H=1       1870         146K          A=10,R=96,H=1     1290         84K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=2,R=1,H=0       2560         249K          A=2,R=1,H=0       2820         257K
A=2,R=1,H=0.5     2530         256K          A=2,R=1,H=0.5     2670         253K
A=2,R=1,H=1       2520         253K          A=2,R=1,H=1       2480         247K
A=2,R=96,H=0      2370         252K          A=2,R=96,H=0      2430         254K
A=2,R=96,H=0.5    2390         253K          A=2,R=96,H=0.5    2500         253K
A=2,R=96,H=1      2370         252K          A=2,R=96,H=1      2790         257K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=3;R=6;H=0       1650         105K          A=3;R=6;H=0       1400         118K
A=3;R=6;H=0.5     1730         105K          A=3;R=6;H=0.5     1390         119K
A=3;R=6;H=1       1350         126K          A=3;R=6;H=1       1310         128K
A=3;R=48;H=0      1180         148K          A=3;R=48;H=0      1160         147K
A=3;R=48;H=0.5    1190         145K          A=3;R=48;H=0.5    1210         148K
A=3;R=48;H=1      1100         147K          A=3;R=48;H=1      1160         148K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=10;R=6;H=0      2850         41K           A=10;R=6;H=0      920          98K
A=10;R=6;H=0.5    2320         50K           A=10;R=6;H=0.5    830          101K
A=10;R=6;H=1      1380         65K           A=10;R=6;H=1      1280         71K
A=10;R=48;H=0     1350         62K           A=10;R=48;H=0     920          98K
A=10;R=48;H=0.5   1150         77K           A=10;R=48;H=0.5   830          101K
A=10;R=48;H=1     1090         72K           A=10;R=48;H=1     860          88K

Exp En            Generations  Total SNPs    Exp En            Generations  Total SNPs
A=10;R=6;H=0      2460         32K           A=3;R=6;H=0       1660         107K
A=10;R=6;H=0.5    1470         50K           A=3;R=6;H=0.5     1460         108K
A=10;R=6;H=1      1500         63K           A=3;R=6;H=1       1200         128K
A=10;R=48;H=0     1170         51K           A=3;R=48;H=0      970          148K
A=10;R=48;H=0.5   1210         81K           A=3;R=48;H=0.5    890          150K
A=10;R=48;H=1     1340         74K           A=3;R=48;H=1      900          148K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=20;R=1;H=0      4200         20K           A=20;R=1;H=0      1680         49K
A=20;R=1;H=0.5    2580         28K           A=20;R=1;H=0.5    1530         49K
A=20;R=1;H=1      860          70K           A=20;R=1;H=1      1470         75K
A=20;R=96;H=0     1960         27K           A=20;R=96;H=0     1010         97K
A=20;R=96;H=0.5   790          74K           A=20;R=96;H=0.5   930          99K
A=20;R=96;H=1     770          63K           A=20;R=96;H=1     1160         76K

Exp DE            Generations  Total SNPs    Exp Dn            Generations  Total SNPs
A=20;R=6;H=0      3820         18K           A=20;R=6;H=0      1160         82K
A=20;R=6;H=0.5    2320         40K           A=20;R=6;H=0.5    1170         84K
A=20;R=6;H=1      1300         49K           A=20;R=6;H=1      1750         61K
A=20;R=48;H=0     2610         31K           A=20;R=48;H=0     1130         96K
A=20;R=48;H=0.5   1010         66K           A=20;R=48;H=0.5   1130         100K
A=20;R=48;H=1     910          55K           A=20;R=48;H=1     1080         73K

Exp En            Generations  Total SNPs    Exp En            Generations  Total SNPs
A=20;R=1;H=0      5320         16K           A=20;R=6;H=0      4820         18K
A=20;R=1;H=0.5    3170         26K           A=20;R=6;H=0.5    1990         43K
A=20;R=1;H=1      860          70K           A=20;R=6;H=1      1050         51K
A=20;R=96;H=0     3830         22K           A=20;R=48;H=0     4230         19K
A=20;R=96;H=0.5   700          79K           A=20;R=48;H=0.5   910          72K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=20;R=1;H=0      2490         48K           A=20;R=1;H=0      3000         16K
A=20;R=1;H=0.5    2580         50K           A=20;R=1;H=0.5    3000         27K
A=20;R=1;H=1      1500         81K           A=20;R=1;H=1      1210         76K
A=20;R=96;H=0     1140         104K          A=20;R=96;H=0     2000         20K
A=20;R=96;H=0.5   1040         101K          A=20;R=96;H=0.5   1210         74K
A=20;R=96;H=1     1150         74K           A=20;R=96;H=1     1210         63K

Exp En            Generations  Total SNPs    Exp En            Generations  Total SNPs
A=5;R=1;H=0       6880         41K           A=5;R=96;H=0      1220         109K
A=5;R=1;H=0.5     4160         50K           A=5;R=96;H=0.5    1150         116K
A=5;R=1;H=1       1480         99K           A=5;R=96;H=1      1130         114K

Results for GEMA experiments at unsaturated condition

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=5;R=6;H=0       900          130K          A=5;R=6;H=0       1500         61K
A=5;R=6;H=0.5     900          128K          A=5;R=6;H=0.5     900          71K
A=5;R=6;H=1       900          101K          A=5;R=6;H=1       900          86K
A=5;R=48;H=0      700          136K          A=5;R=48;H=0      800          92K
A=5;R=48;H=0.5    800          137K          A=5;R=48;H=0.5    800          100K
A=5;R=48;H=1      900          113K          A=5;R=48;H=1      700          97K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=5;R=1;H=0       1900         104K          A=5;R=1;H=0       10000        42K
A=5;R=1;H=0.5     3000         110K          A=5;R=1;H=0.5     2800         50K
A=5;R=1;H=1       2900         104K          A=5;R=1;H=1       2600         100K
A=5;R=96;H=0      1300         148K          A=5;R=96;H=0      1400         104K
A=5;R=96;H=0.5    1300         150K          A=5;R=96;H=0.5    1400         111K
A=5;R=96;H=1      1400         118K          A=5;R=96;H=1      1300         109K

Exp DE            Generations  Total SNPs    Exp DE            Generations  Total SNPs
A=5;R=1;H=0       3000         46K           A=5;R=6;H=0       3900         67K
A=5;R=1;H=0.5     3000         47K           A=5;R=6;H=0.5     3500         71K
A=5;R=1;H=1       1900         97K           A=5;R=6;H=1       2400         86K
A=5;R=96;H=0      1800         107K          A=5;R=48;H=0      1800         100K
A=5;R=96;H=0.5    1700         113K          A=5;R=48;H=0.5    2000         103K
A=5;R=96;H=1      1700         111K          A=5;R=48;H=1      1700         102K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=3;R=1;H=0       2400         134K          A=3;R=1;H=0       3000         75K
A=3;R=1;H=0.5     2100         138K          A=3;R=1;H=0.5     3000         81K
A=3;R=1;H=1       2000         152K          A=3;R=1;H=1       3000         142K
A=3;R=96;H=0      1800         172K          A=3;R=96;H=0      1700         157K
A=3;R=96;H=0.5    1800         171K          A=3;R=96;H=0.5    1700         158K
A=3;R=96;H=1      2100         169K          A=3;R=96;H=1      1700         157K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=3;R=6;H=0       2200         163K          A=3;R=6;H=0       3500         106K
A=3;R=6;H=0.5     2000         161K          A=3;R=6;H=0.5     3100         108K
A=3;R=6;H=1       2000         161K          A=3;R=6;H=1       2500         127K
A=3;R=48;H=0      1900         172K          A=3;R=48;H=0      2100         150K
A=3;R=48;H=0.5    1900         171K          A=3;R=48;H=0.5    2200         150K
A=3;R=48;H=1      1900         168K          A=3;R=48;H=1      2000         149K

Exp DE            Generations  Total SNPs    Exp DE            Generations  Total SNPs
A=3;R=1;H=0       4800         79K           A=3;R=6;H=0       1500         111K
A=3;R=1;H=0.5     5700         79K           A=3;R=6;H=0.5     3500         108K
A=3;R=1;H=1       1700         142K          A=3;R=6;H=1       2600         128K
A=3;R=96;H=0      1600         157K          A=3;R=48;H=0      2000         151K
A=3;R=96;H=0.5    1600         157K          A=3;R=48;H=0.5    2100         148K
A=3;R=96;H=1      1700         157K          A=3;R=48;H=1      2000         150K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=10;R=1;H=0      2000         65K           A=10;R=1;H=0      7300         26K
A=10;R=1;H=0.5    2200         66K           A=10;R=1;H=0.5    4800         41K
A=10;R=1;H=1      2300         64K           A=10;R=1;H=1      1600         77K
A=10;R=96;H=0     1100         116K          A=10;R=96;H=0     1900         65K
A=10;R=96;H=0.5   1100         115K          A=10;R=96;H=0.5   1400         91K
A=10;R=96;H=1     1200         95K           A=10;R=96;H=1     1700         157K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=10;R=6;H=0      1500         102K          A=10;R=6;H=0      5800         40K
A=10;R=6;H=0.5    1500         103K          A=10;R=6;H=0.5    3500         54K
A=10;R=6;H=1      2000         80K           A=10;R=6;H=1      2400         64K
A=10;R=48;H=0     1300         118K          A=10;R=48;H=0     2600         61K
A=10;R=48;H=0.5   1500         119K          A=10;R=48;H=0.5   2000         80K
A=10;R=48;H=1     1800         94K           A=10;R=48;H=1     2100         74K

Exp DE            Generations  Total SNPs    Exp DE            Generations  Total SNPs
A=10;R=1;H=0      5600         29K           A=10;R=6;H=0      5100         42K
A=10;R=1;H=0.5    4400         36K           A=10;R=6;H=0.5    3100         53K
A=10;R=1;H=1      1500         74K           A=10;R=6;H=1      800          64K
A=10;R=96;H=0     1500         66K           A=10;R=48;H=0     2300         61K
A=10;R=96;H=0.5   1200         89K           A=10;R=48;H=0.5   1500         78K
A=10;R=96;H=1     1200         80K           A=10;R=48;H=1     1700         75K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=20;R=1;H=0      1400         73K           A=20;R=1;H=0      6600         18K
A=20;R=1;H=0.5    1400         73K           A=20;R=1;H=0.5    4200         29K
A=20;R=1;H=1      1700         74K           A=20;R=1;H=1      1200         78K
A=20;R=96;H=0     1000         109K          A=20;R=96;H=0     4800         28K
A=20;R=96;H=0.5   1000         108K          A=20;R=96;H=0.5   1200         76K
A=20;R=96;H=1     1200         81K           A=20;R=96;H=1     1300         67K

Exp D0            Generations  Total SNPs    Exp E0            Generations  Total SNPs
A=20;R=6;H=0      700          95K           A=20;R=6;H=0      2700         21K
A=20;R=6;H=0.5    700          96K           A=20;R=6;H=0.5    1500         45K
A=20;R=6;H=1      800          67K           A=20;R=6;H=1      900          52K
A=20;R=48;H=0     700          103K          A=20;R=48;H=0     2300         27K
A=20;R=48;H=0.5   700          102K          A=20;R=48;H=0.5   700          68K
A=20;R=48;H=1     800          80K           A=20;R=48;H=1     800          59K

Exp DE            Generations  Total SNPs    Exp DE            Generations  Total SNPs
A=20;R=1;H=0      4300         20K           A=20;R=6;H=0      2700         26K
A=20;R=1;H=0.5    2300         29K           A=20;R=6;H=0.5    1700         43K
A=20;R=1;H=1      800          68K           A=20;R=6;H=1      1100         53K
A=20;R=96;H=0     1500         35K           A=20;R=48;H=0     2400         33K
A=20;R=96;H=0.5   800          71K           A=20;R=48;H=0.5   900          68K
A=20;R=96;H=1     800          64K           A=20;R=48;H=1     900          58K

Appendix B

Potential Indels selected by variant analysis of whole exome sequencing data

Chromosome Number  Position  Reference Allele  Alternative Allele
chr1  17085999  CCCCG  C  112.73  inside  inside  yes  MST1L gene
chr1  17087541  GGTGCT  G  1699.73  inside  inside  yes  MST1L gene
chr1  115537600  GA  G  367.73  inside  inside  yes  gene is SYCP1
chr1  175129945  TC  T  272.73  inside  inside  yes  no sign of deletion, very poor coverage, gene KIAA0040, not expressed in retina
chr1  175129947  TTCTTCTTG  T  267.73  inside  yes  gene is KIAA0040, very nearby region as above
chr3  75786760  TC  T  628.73  inside  inside  yes  SNP C->T nearby
chr3  100170600  A  ATCCTAGAAGGCATTCTCATGAGGACCAGGAATTCCGATGCCGATCGTCTGACCGTCT  3151.73  inside  inside  UNSURE  SNP A->G nearby
chr3  195512373  G  GGAT  70.73  inside  inside  UNSURE/yes  gene MUC4, no insertion sign, read depth is very low
chr3  195701310  C  CAA  906.73  inside  inside  yes  SNP G->A nearby, SNP C->T nearby, SNP T->C nearby
chr7  5939926  GATTTT  G  1165.73  inside  inside  yes  gene is CCZ1: 76 reads, 32 deletions
chr7  27135316  ATGGTGGTGG  A  2313.73  inside  inside  yes  HOXA1 gene, 41 reads: 29 deletions; if no deletions there is a SNP 2 nt previous
chr7  100612955  A  ACTG  1133.73  inside  inside  yesish, will add an amino acid  MUC12; 17 insertions of 81 reads, associated with 4 SNPs nearby
chr7  151945071  G  GT  3258.73  inside  inside  yes  KMT2C gene, 85 out of 404 reads
chr8  10465965  TCCTTCTGCCTCTGGGGCCTCTACATCTTCTGACTCTGGCTGGGCCTCCCCTTCAGCCTCCTGGGCATCC  T  690.73  inside  inside  yes  long deletion associated with a SNP, RP1L1 gene
chr9  24545215  CAT  C  1697.73  inside  inside  yes  the gene is IZUMO3
chr9  33794797  TGA  T  629.73  inside  inside  yes  associated with SNPs
chr9  33797928  G  GCC  3977.73  inside  inside  yes  associated with insertion, deletion, and SNPs
chr9  33797930  GAC  G  4100.73  inside  inside  yes  associated with insertion, deletion, and SNPs
chr10  129213488  A  AAGTCTGTTTTCATGGTAAGGATGGCCCCACCCCAGAAGGAATTGTCCGTTCCCTTCCTCAATAC  1953.73  inside  inside  yes  Gene = DOCK1 in exon 44
chr10  135438959  AC  A  454.73  inside  inside  yes  Gene = FRG2B in exon 4
chr11  1017035  A  AAT  8119.73  inside  inside  yes  Gene = MUC6 in exon 31
chr11  1017040  GGT  G  7950.73  inside  inside  yes  Gene = MUC6 in exon 31
chr11  1018215  G  GCA  942.73  inside  inside  yes  Gene = MUC6 in exon 31
chr11  1018222  AAT  A  372.73  inside  inside  yes  Gene = MUC6 in exon 31
chr11  6238950  A  AC  89.73  inside  inside  yes  Gene = FAM160A in exon 9
chr11  48367050  T  TAG  4679.73  inside  inside  yes  Gene = OR4C45 in exon 2
chr11  56143255  T  TGA  3235.73  inside  inside  yes  Gene = OR8U8 in exon 1
chr11  56143259  AGT  A  3105.73  inside  inside  yes  Gene = OR8U8 in exon 1
chr11  56143425  TATCA  T  2072.73  inside  inside  yes  Gene = OR8U8 in exon 1
chr11  56143782  T  TGA  1840.73  inside  inside  yes  Gene = OR8U8 in exon 1
chr11  56143784  CAT  C  1711.73  inside  inside  yes  Gene = OR8U8 in exon 1
chr11  64083295  CGGG  C  88.73  inside  inside  yes  Gene = ESRRA in exon 7
chr12  51740415  A  AAG  5033.73  inside exon  inside cds  yes  3 SNPs associated with the deletion in this read. 3 deletions associated with the deletion in the read. Insertion associated with deletion present in read. Present in the exon. Encodes CELA1
chr12  51740416  C  CG  4992.73  inside exon  inside cds  yes  3 SNPs associated with the deletion in this read. 3 deletions associated with the deletion in the read. Insertion associated with deletion present in read. Present in the exon. Encodes CELA1
chr13  25671272  AG  A  1420.73  inside exon  inside cds  yes  4 SNPs associated with deletion. Encodes PABPC3
chr13  25671310  TTATGA  T  1622.73  inside exon  inside cds  yes  4 SNPs associated with deletion on same read. 2 nearby deletions, 1 either side of chromosome location. Encodes PABPC3
chr15  38776733  ATTATGGTGGAAGAGGGGG  A  45.73  inside  inside  yes  low coverage, gene is FAM98B
chr15  78211411  GCCTCCTGCTCTCGGAGCATCT  G  674.73  inside  inside  yes?  gene LOC645752, low coverage, program didn't provide amino acid sequence possibly due to low coverage sequencing
chr15  90294304  C  CG  230.73  inside  inside  yes  SNP two bases after, gene MESP1
chr15  90294306  C  CACGGGGCTCGG  326.73  inside  inside  yes  SNP, gene MESP1
chr16  3119297  C  CG  1138.73  inside  inside  yes  gene IL32
chr16  67236130  CG  C  3137.73  inside  inside  yes  Gene ELMO3
chr16  81242148  GTT  G  1370.73  inside  inside  yes  Gene PKD1L2
chr18  44549025  T  TGC  538.73  inside  inside  yes  KATNAL2 gene
chr2  133014651  A  AC  4058.73  inside  inside  yes  associated A>G SNP, but it's not in the exon
chr4  1388350  G  GTGCCCATGTGGAGTGCCCGCCTGCTCACACA  1221.73  inside  inside  yes  CRIPAK exon1, insertion is in linkage disequilibrium with two SNPs, there is another SNP at the same position but not in linkage disequilibrium