
Basic Principles and Laboratory Analysis of Genetic Variation Jesus Gonzalez-Bosquet and Stephen J


unit 2. biomarkers: practical aspects

chapter 6. Basic principles and laboratory analysis of genetic variation
Jesus Gonzalez-Bosquet and Stephen J. Chanock

Summary

With the draft of the human genome and advances in technology, the approach toward mapping complex diseases and traits has changed. Human genetics has evolved into the study of the genome as a complex structure harbouring clues for multifaceted disease risk with the majority still unknown. The discovery of new candidate regions by genome-wide association studies (GWAS) has changed strategies for the study of genetic predisposition. More genome-wide, “agnostic” approaches, with increasing numbers of participants from high-quality epidemiological studies, are for the first time replicating results in different settings. However, new-found regions (which become the new candidate “genes”) require extensive follow-up and investigation of their functional significance. Understanding the true effect of genetic variation on the risk of complex diseases is paramount. The importance of designing high-quality studies to assess environmental contributions, as well as the interactions between genes and exposures, cannot be stressed enough. This chapter will address the basic issues of genetic variation, including population genetics, as well as analytical platforms and tools needed to investigate the contribution of genetics to human diseases and traits.

Introduction

New advances in microchip technologies and informatics allow investigators to look across the genome agnostically using dense data sets with billions of data points. These developments have transformed the field, moving it away from the pursuit of hypothesis-driven, limited candidate studies to large-scale scans across the genome. Together these developments have spurred a dramatic increase in the discovery of genetic variants associated with or linked to human diseases and traits, many through genome-wide association studies (GWAS) (1). Already over 7400 novel regions of the genome have been associated with more than 75 human diseases or traits in large-scale GWAS (2). Each region now represents a new candidate “region” that harbours putative genes, which will require extensive mapping of the variants to explore the genomic architecture

of the region and its contribution to human diseases and traits. The return to exploring candidate regions differs from the old approach of nominating favoured genes, because it is driven by findings that reach conclusive thresholds based on more rigorous statistical considerations.

While there is ample opportunity to survey thousands of genetic variants, often well chosen and based on an emerging understanding of the structure of genetic variation and its patterns of inheritance, the ability to analyse the interaction between genetic variants and the environment has lagged. This is mainly because the measurement tools for the latter have not undergone the transformative shift observed in assessing genetic variation. The integration of environmental exposure with genetic factors should provide insights into disease mechanisms and outcomes. Eventually these insights will be applied to treatment or preventive measures that are best suited for the individual (known as personalized medicine). Individualization of treatments based on the greatest likelihood for efficacy, while minimizing (or avoiding) deleterious toxicities, represents a long-term goal, but one that is in the distant future. While the opportunity to begin to develop evidence-based individualized therapeutics, also known as pharmacogenomics, is promising, its realization will require a nuanced understanding of the contribution of genetic variation to complex diseases.

This chapter will address the basic issues of genetic variation, including population genetics, as well as analytical platforms and tools needed to investigate the contribution of genetics to human diseases and traits.

The scope of genetic variation

The spectrum of human genetic variation is enormous with respect to both the types of genetic variation and the sheer magnitude of the number of variants in any given genome. Even though two genomes are estimated to differ by less than 0.5%, there are still several million differences; the majority are vestigial, but a small proportion probably contribute to disease risk. The most common type of variation is a single base change, followed by small insertions or deletions in sequence. Progressively larger structural alterations and copy number variants are fewer in absolute number, but perhaps affect more bases (Figure 6.1). So far, available technologies have accelerated the discovery and characterization of diversity in the human genome. In the first wave of annotation, common variants have

Figure 6.1. Genetic variant frequencies and estimated effect size for genetic contribution

been described, many of which are universal to all populations. The ability to ascertain estimates for lower frequency variants is dependent upon the number of subjects surveyed, as well as the population genetic history of the subjects used for discovery. New sequencing technologies, referred to as next-generation sequencing, make it possible to catalogue variants with lower frequencies and will certainly shift the paradigms further. Generally, the interrogation of genetic variation continues to reveal greater complexity in different human populations, which manifests as differences in frequencies of variants.

Single-nucleotide polymorphisms (SNPs)

The most common sequence variation in the genome, the single-nucleotide polymorphism (SNP), is the stable substitution of a single base, which by definition is observed in at least 1% of a population. Though this definition has been useful for cataloguing genetic variation, the advent of next-generation sequencing technology has revealed the sheer breadth of variations in different populations with estimated frequencies well below 1%. Still, for the purpose of current applications of genetic variation, the SNP is the most commonly annotated variant. The minor allele frequency (MAF) is designated for the lower allele frequency observed at a locus in one particular population, but often there can be major differences in estimated MAFs between populations with distinct histories. The literature suggests that there are more than perhaps 15 million SNPs with a MAF greater than 1% (3–5), and 10 million SNPs with a MAF greater than 10% (3,6,7); however, recent large-scale sequencing efforts, such as the 1000 Genomes Project, indicate these estimates are low (http://www.1000genomes.org/). There are estimated to be a greater number of SNPs with lower MAFs and, unlike common SNPs, the majority may be population-specific (Figure 6.2). The majority of common SNPs, with a MAF greater than 15–20%, are widespread in human populations (8,9). Only a small subset of high-frequency SNPs (less than 10%) appear to be found in a single population, again suggesting the universal ancestry of common SNPs (9).

Previously in candidate gene approach studies, SNPs in coding regions were often selected on the basis of an in silico predicted effect, but with little supporting biological evidence. The attempt to classify coding variants, known as coding SNPs (cSNPs), has focused on the predicted effect on the actual coding sequence. The majority of cSNPs

Figure 6.2. Estimated number of SNPs in the human genome in relation to their minor allele frequency (MAF). Source: (5). Reprinted by permission from Macmillan Publishers Ltd: Genetics, copyright (2003).
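As a minimal sketch of how the MAF referred to in Figure 6.2 can be obtained for a single biallelic SNP, the example below counts alleles from genotype counts in one population sample. The function names are hypothetical, and the frequency bands (common above 10%, uncommon between 1% and 10%, rare below 1%) follow the cut-offs quoted in this chapter; other sources draw these boundaries differently.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Return the MAF at a biallelic locus from genotype counts.

    n_aa, n_ab, n_bb are the numbers of subjects with genotypes
    AA, AB and BB in one population sample (hypothetical inputs).
    """
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotyped subjects")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    # The MAF is the frequency of whichever allele is less common.
    return min(freq_a, 1.0 - freq_a)


def frequency_class(maf):
    """Label a variant using the MAF bands quoted in this chapter."""
    if maf > 0.10:
        return "common"
    if maf >= 0.01:
        return "uncommon"
    return "rare"


maf = minor_allele_frequency(60, 30, 10)
print(maf, frequency_class(maf))
```

Because the MAF is estimated within one sample, the same SNP can fall into different bands in populations with distinct histories.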

do not alter the predicted amino acid and are known as synonymous SNPs. However, a subset of variants are predicted to alter the amino acid and are known as non-synonymous coding SNPs. Though this subset was initially of great interest, very few non-synonymous coding SNPs have actually been conclusively associated with human diseases or traits, and even fewer have corroborative biological data to provide plausibility for the association (10,11). Nonetheless, the analysis of synonymous and non-synonymous SNPs has been quite informative for evolutionary studies (12,13).

There has been considerable effort to calculate the effect of a non-synonymous cSNP in conformational protein changes. A proliferation of prediction software has been created (e.g. Protein Data Bank (http://www.rcsb.org/pdb) and Swiss-Model (http://swissmodel.expasy.org//SWISS-MODEL.html)). Though new models and algorithms claim improved reliability for predicting deleterious changes in protein structure (14–16), without corroborative laboratory data the findings are merely in silico observations. Overall, between 50 000 and 250 000 SNPs could be functional, non-synonymous coding variants, or regulators of transcription or splicing (10,11). It is likely that a subset of non-synonymous cSNPs contribute to regulatory differences in expression or genetic pathways (17–19), but most SNPs appear not to be functional and have been maintained on the backbone of an inherited block of DNA through generations. Subsets of SNPs that alter regulation or expression of a gene, called regulatory SNPs (rSNPs), are difficult to predict with high efficiency and most likely will be categorized on the basis of large-scale surveys of cell lines, as well as laboratory data.

Nearly half of the more than 10 million human SNPs in the international public database for SNPs, or dbSNP (http://www.ncbi.nih.gov/SNP/), have been validated with genotyping assays by the SNP Consortium and the International HapMap Project (8,20). Until recently, only a small percentage had been verified by sequencing, but with the advent of the 1000 Genomes Project, nearly all common (MAF >10%) and uncommon (MAF between 1 and 10%) variants should be confirmed by next-generation sequence technology (21,22). In the current build, roughly one sixth of the variants in dbSNP are probably monoallelic, due to errors in either genotyping or, more likely, sequencing (23,24). In general, the reported SNPs have been biased towards high-frequency variants in populations of European ancestry. Currently, the catalogue of uncommon variation, namely SNPs with MAFs under 1%, is incomplete. However, the 1000 Genomes Project is expected to generate a thorough catalogue of variants with greater than 1% MAF. The contribution of uncommon variants (MAF between 1% and 10%) represents an untapped portion of the genomic architecture. It will require either larger studies to provide sufficient power to detect association, or new design strategies to discover and characterize uncommon and rare variants (25,26). Rare or uncommon variants have been shown to be informative in the extremes of mapping human traits, such as with cholesterol levels (27). Rare variants or mutations can explain a proportion of the strong familial component of complex diseases, as well as the classical Mendelian inheritance of single or oligogenic diseases. These highly penetrant disease mutations are catalogued in a public database, the Online Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/omim/).

The correlation of common genetic variants

Most SNPs are not inherited independently but in blocks, resulting in sets of SNPs being transmitted together between generations (4,28,29). These blocks are defined by linkage disequilibrium (LD), which estimates the correlation between SNPs on shared chromosomal segments passed down from ancestral chromosomes. LD is defined as the non-random association of alleles at different loci (30). Initially, each SNP is a single mutation that has taken hold and become fixed in a population, either as a consequence of direct selection or because it is close enough on the chromosome to be included within a block of a shared segment. Individual SNPs that are strongly associated with each other are said to be in LD, although this correlation could be eroded over time by recombination (exchange of genetic material) during meiosis (31). Haplotypes are defined as sets of SNPs, and other genetic polymorphisms (larger in size), on chromosomal segments that are in strong LD.

There are several ways to determine haplotypes from genotypes; this is commonly referred to as resolving phase. The offspring haplotype phase can be determined if the parental genotypes are known or directly with biochemical methods (30). Based on the assumption that haplotypes are randomly joined into genotypes, phasing can be estimated using one of several statistical methods that can account for the ambiguity of unobserved haplotypes (30). Different methods have been

developed to estimate haplotypes from unphased multilocus genotype data in unrelated individuals; the underlying principles are based on models that incorporate either a maximum likelihood (32), parsimony (33), combinational theory (34) or a priori distribution derived from coalescent theory (35). This last method is the basis for the phase reconstruction software PHASE (35,36), which has performed favourably in simulation studies (37), and its modified version designed for larger data sets, fastPHASE (22). Reconstructed haplotypes from unrelated individuals and LD structure have been used to study the genomic association to complex traits (17,29). In fact, some research suggests that haplotypes would be better suited for candidate gene studies because of a perceived statistical advantage over the single-locus LD mapping (38–40), but the recent success with GWAS suggests otherwise.

The concept of LD also permits investigators to look at a set of SNPs and determine proxies for other untested SNPs (or tagSNPs) (41,42). This indirect approach is predicated on finding markers only, relegating the search for causal or functional variants to later work (Figure 6.3). Several approaches optimize the number of surrogate SNPs needed to account for untested variants, such as the “greedy algorithm.” The latter estimates highly correlated LD SNPs, primarily on the basis of the MAF, to create heuristic bins of tagged SNPs. Thus, tagSNPs represent proxies for additional, highly correlated SNPs with comparable allele frequency and distribution in the population of interest. In a sense, tagSNPs are used to mark common haplotypes in the region (Tagger, embedded in Haploview software (http://www.broad.mit.edu/mpg/haploview) and TagZilla (http://tagzilla.nci.nih.gov)) (43). Consequently, the indirect approach of using a limited set of tagSNPs as a proxy of a LD block has emerged as the preferred approach, used by both GWAS and candidate gene studies (44).

Structural polymorphisms

Structural variations in the genome may be either cytologically visible or, more commonly, submicroscopic variants that can range in size from a few base pairs to thousands (45,46). These can include deletions, insertions and duplications collectively known as copy number variations (CNVs), as well as less-frequent inversions and

Figure 6.3. SNP selection strategy. A. SNP selection through haplotype blocks, based on the concept of linkage disequilibrium (LD). D’ is a measure of LD between SNPs, represented in the figure through a heat map from white (low D’) to red (high D’). A haplotype (represented by a dark triangle in the figure) is a set of SNPs in strong LD, or high D’. “TagSNPs” are proxies for other SNPs in the same haplotype. This is the so-called indirect approach (147). B. Selection of SNPs based on r^2, another measure of LD. This method groups SNPs with similar LD (r^2) into ‘bins.’ In the figure each spot represents a SNP, and those with similar r^2 are included in the grey blocks or ‘bins.’ ‘TagSNPs’ are proxies for all the loci included in each ‘bin’ with comparable LD (148).
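The two LD measures named in Figure 6.3 can be illustrated with a short sketch. It assumes two biallelic SNPs with known (phased) haplotype and allele frequencies, uses the standard textbook definitions of D, D’ and r^2, and then applies a simple greedy binning on pairwise r^2. The function names, the input layout and the 0.8 threshold are assumptions made for illustration; this is not the procedure implemented in Tagger or TagZilla.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D' rescales |D| by the largest value it could take given
    # the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def greedy_tag_bins(snps, r2, threshold=0.8):
    """Greedily group SNPs into bins, each represented by one tagSNP.

    r2 maps frozenset({snp1, snp2}) to the pairwise r^2; pairs that
    are absent are treated as r^2 = 0 (an assumption of this sketch).
    """
    untagged = set(snps)
    bins = []

    def covered(s):
        # SNPs still untagged that s would tag at the threshold.
        return {t for t in untagged
                if t == s or r2.get(frozenset((s, t)), 0.0) >= threshold}

    while untagged:
        # Pick the SNP that tags the most remaining SNPs.
        tag = max(sorted(untagged), key=lambda s: len(covered(s)))
        members = covered(tag)
        bins.append((tag, sorted(members)))
        untagged -= members
    return bins


print(ld_stats(0.2, 0.5, 0.2))
```

With allele frequencies of 0.5 and 0.2, a pair of SNPs can have D’ close to 1 while r^2 is only about 0.25, which is why bins built on r^2 tend to group SNPs of comparable allele frequency.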

translocations (Figure 6.4) (47,48). Several of the inversions can be quite large, such as the 3.5 Mb on chromosome 17 seen in as much as 20% of the European population (49). On the other hand, insertions/deletions as small as two base pairs can be observed. Although structural variants in some genomic regions have no obvious phenotypic consequence (50–52), CNVs have been shown to influence gene dosage in select circumstances. Consequently, many have pursued CNVs because of the potential contribution of high estimated effects for complex diseases, either alone or in combination with other factors (53). Some observations, either by the failure to assemble the draft genome sequence or by actual experimentation, estimate the segmental duplicated genomic sequence could involve between 5–10% of the genome (51,54,55). Other clues come from the recognition that a notable number of SNPs failed the quality control metrics in the International HapMap Project; these were later determined to reside in regions now known to be enriched for CNVs (7,45,55–57). Current surveys suggest that CNVs are less common than previously reported (58), and many are infrequent (59). It is also notable that over three fourths of common CNVs are in LD with common SNPs (59).

Coordinated efforts are underway to establish a comprehensive catalogue of CNVs, such as the Database of Genomic Variants (http://projects.tcag.ca/variation/) (46,60) and the Human Genome Structural Variation Project (http://humanparalogy.gs.washington.edu/structuralvariation/). Recently, there have been several international efforts to establish standards for identification, validation and reporting of CNVs (46). The availability of several microarray platforms that can detect quantitative imbalances has accelerated CNV discovery, but there are still substantive technical challenges due to the breadth of polymorphic differences for which analyses are particularly unstable. New emerging algorithms should streamline moderate- to high-throughput, cost-effective methods to scan the genome for CNVs, as well as inversions or translocations based on stable sequence assemblies (59–64). Advances in techniques have improved determination of common CNVs, such as tiling arrays (which cover the genome through partial overlapping (tile-like) sets of fixed oligonucleotides), paired-end sequencing (sequence analysis of both ends of a larger fragment to improve alignment), and new dense SNP genotyping platforms based on probe intensity (e.g. Illumina and Affymetrix).

Figure 6.4. Spectrum of genomic variation. Challenges and standards in integrating surveys of structural variation: The range of genetic variation that must be taken into account when designing and analyzing genotype studies (46). The figure represents the whole spectrum of genomic variation, from the molecular level with DNA sequence variation, exemplified by SNPs, to structural variation, a broad category that includes variations from 2 bp to whole chromosomal variations. The focus of recent genetic studies has been the subgroup in the midrange (with strong highlighting). These forms of variation have been studied with molecular methods to cytogenetic approaches. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics, copyright (2007).
Figure content, by size class (side labels in the figure: sequence variation, structural variation; detection ranges from molecular genetic detection to cytogenetic detection):
• Single nucleotide: base change (substitution); insertion-deletions (“indels”); SNPs, tagSNPs
• 2 bp to 1,000 bp: indels; inversions; di-, tri-, tetranucleotide repeats; VNTRs
• 1 kb to submicroscopic: copy number variants (CNVs); segmental duplications; inversions, translocations; CNV regions (CNVRs); microdeletions, microduplications
• Microscopic to subchromosomal: segmental aneusomy; chromosomal deletions (losses); chromosomal insertions (gains); chromosomal inversions; intrachromosomal translocations; chromosomal abnormality; heteromorphisms; fragile sites
• Whole chromosomal to whole genome: interchromosomal translocations; ring chromosomes, isochromosomes; marker chromosomes; aneusomy

Short tandem repeats (STRs) represent a class of polymorphisms that occur when a pattern of two or more nucleotides are repeated in certain areas of the genome. Previously known as microsatellites, they were frequently employed to conduct linkage studies in potentially informative pedigrees. The patterns can range in length from 2–10 base pairs (usually tetra- or penta-nucleotide repeats) and are typically located in non-coding regions. Since longer repeat sequences can be susceptible to artefactual errors in genotyping accuracy, particularly related to problems of PCR amplification, the industry standard for both genetic analysis and forensic application is 4–5 base pair (bp) repeat units. Shorter repeat sequences (e.g. 2 or 3 bp) tend to suffer from artefacts, such as stutter and preferential amplification (65–67). By genotyping a sufficient number of STR loci, it is possible to generate a unique genetic profile of an individual.

Population genetics

The field of population genetics has advanced rapidly and emerged as central to the investigation of genetics and complex diseases. Overall, the discipline of population genetics seeks to characterize the genetic composition of biological populations, as well as the changes in genetic composition that occur from environmental and migratory factors, including natural selection. To draw conclusions about the likely patterns of genetic variation in actual populations, population geneticists develop abstract mathematical models of gene frequency dynamics and test these conclusions against empirical data. Some of the more robust concepts in population genetic analysis that are applied in disease mapping are discussed below.

Fitness for Hardy–Weinberg proportion

The fitness for Hardy–Weinberg proportion, an important tool for understanding population structure, examines the distribution of the allelic and genotypic frequencies. Though theoretical, it states that if certain assumptions are met, genotype and allele frequencies can be estimated from one generation to the next. The derivation of the Hardy–Weinberg principle for a single locus assumes: a randomly mating population; an infinitely large population, or a population large enough that random fluctuations in allele and genotype frequencies are small; no mutation; no migration; and no fitness differences among genotypes. When all of these assumptions are met, Hardy–Weinberg Equilibrium (HWE) is established and four important conclusions can be drawn: 1) allele frequencies do not change from one generation to the next; 2) genotype frequencies can be inferred from allele frequencies; 3) only one generation is required to go from non-equilibrium to equilibrium; and 4) once the system is in HWE, it stays in HWE (68). Also, if these conditions are met, the genotypic and allelic frequencies of the offspring generation will be related by the following simple equations. For a trait in the population with two alleles (A1 and A2), if the A1 allele frequency in the population is p, and the A2 allele frequency is q = (1-p), then expected genotype proportions (f) under HWP are:

f(A1A1) = p^2, f(A1A2) = 2pq, f(A2A2) = q^2

Random mating, or the absence of a genotypic correlation between mating partners, will generate a distribution of observed genotypes that should not deviate significantly from the expected proportions (Hardy–Weinberg Proportions (HWP)). This is predicated on Mendel’s law of segregation, and, assuming the absence of selection, all parents contribute equal numbers of gametes to the pool. The HWE principles can be applied to family-based and case–control data to detect genotyping error, population stratification and association.

A violation of any of the above assumptions can produce deviation from HWE, which may include mating behaviour, population size and migration patterns. For example,

Table 6.1. Issues for generation of final, publication-grade build of high-density genotype data
• Eliminate samples with low completion rates (< 90%)
• Remove SNP assays with low call rates (< 90%)
• Determination of fitness for Hardy–Weinberg proportion
• Compare expected duplicates
• Investigate unexpected duplicates
• Assess concordance between duplicates
• Search for cryptic relatedness between subjects
• Assessment of population substructure (after filtering 1st degree relatives)
• Determine admixture with STRUCTURE analysis
• Estimate population stratification (principal component analysis)
• Assess genotype calling algorithm
• Validate significant genotype calls with second technology
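Two of the checks listed in Table 6.1, the 90% call-rate filter and the test of fitness for Hardy–Weinberg proportion, can be sketched as follows. The expected counts use the proportions p^2, 2pq and q^2 given in the text. The chi-square goodness-of-fit test with one degree of freedom is an assumption of this sketch, since the chapter does not say which test is applied, and the function names and the use of None for a missing call are hypothetical.

```python
import math


def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(c is not None for c in calls) / len(calls)


def hwp_expected(n_11, n_12, n_22):
    """Expected genotype counts under HWP for alleles A1 and A2."""
    n = n_11 + n_12 + n_22
    p = (2 * n_11 + n_12) / (2 * n)  # A1 allele frequency
    q = 1.0 - p                      # A2 allele frequency
    return n * p * p, 2 * n * p * q, n * q * q


def hwp_chi_square(n_11, n_12, n_22):
    """Chi-square statistic and p-value (1 degree of freedom)."""
    observed = (n_11, n_12, n_22)
    expected = hwp_expected(n_11, n_12, n_22)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# A SNP assay passes the Table 6.1 call-rate filter at 90% or above.
print(call_rate(["AA", "AB", None, "BB", "AB"]) >= 0.90)
print(hwp_chi_square(40, 20, 40))
```

In a case–control study such a test would typically be run on the control genotypes, in line with the text's remark that observed control frequencies are compared with expected control frequencies; an excess of homozygotes, as in the second example, is the pattern the chapter associates with allelic drop-out or inbreeding.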

systematic inbreeding will increase levels of homozygosity across the genome, as will small population sizes (68). Having more than one random mating population in a sample may also cause deviations from HWE, as well as mating with certain genotypes (known as assortative mating), which will increase homozygosity as well. Genetic drift causes allele frequencies to drift from one generation to the next. In many cases, the deviations are also a screen for performance of the genotype technology, because a disproportionate number of heterozygotes or homozygotes could represent systematic errors in genotyping.

One of the most common reasons for not using data in association studies is presumed genotyping error. Many types of errors in genotyping can cause deviations from HWE; therefore tests for both assay specificity and deviation from HWE have been proposed to minimize the genotype error rate and thereby improve data quality (69,70). Deviation from HWE resulting from allelic drop-out, where some alleles are insufficiently amplified, can cause an excess of homozygotes and increase false-negative or false-positive results (71). However, caution should be exercised in association studies before removing data because of HWE deviations. If there is a systematic HWE deviation in both cases and controls, it may be easier to determine a genotyping error if both deviations occurred in the same direction (72). Non-systematic error is more problematic and should trigger a review of standard operating procedures for biospecimen handling, as well as an assessment of all information workflow. If the error is recognized, re-genotyping of the faulty samples might eliminate the problem. The power to detect deviations due to genotyping error under most modes of inheritance has been found to be very small (73). Even the deviation created by neighbouring SNPs, which diminish the performance of genotyping assays, does not produce a large enough deviation from HWE to be detected (74).

In GWAS, it is likely that hundreds if not thousands of markers will deviate from HWE. Understanding why and how HWE testing would help in the process of disease-gene discovery is becoming more important as the number of SNPs included in these studies increases into the hundreds of thousands (75). The control observed genotype frequencies are tested against control expected genotype frequencies to determine if there may be genotyping error (68).

Spectrum of differences in population substructure

The age of GWAS has generated sufficiently large data sets that can determine the degree of differences in underlying population substructure, also known as population stratification. An examination of thousands of markers not in LD permits investigators to assess the extent of admixture and exclude individuals who are outliers for the association analysis.

Classically, population stratification is present when there is a measurable difference in the distribution of alleles between subgroups that have different population histories. There are examples of this in older case–control studies where the cases and controls have been drawn from different populations. It is also possible to have stratification between cases and controls based on differences in exposures, as well as in the distribution of common SNP markers (76). The ability to detect stratification with any marker or set of markers may also vary depending on the allele frequency in each subgroup (68).

In general, an assessment of the underlying structure can be estimated using standard algorithms to identify distinct populations (77). The most commonly used approach is implemented in the STRUCTURE program. This uses multilocus genotype data to examine population structure by attempting to separate subjects into groups (defined as k populations) and determining the distribution of shared alleles.

As the ability to understand population stratification (or differences between cases and controls due to systematic ancestral differences) has improved, several methods have been developed to study and account for these types of systematic study population structures. One approach commonly used for the correction of population stratification is to adjust simultaneously for a fixed number of top-ranked principal components resulting from a principal component analysis (PCA) (76). It is critical to look for underlying subgroups in stratified samples by testing sets of genetic markers not linked to the phenotype, and then adjust for inflation due to stratification (76,78,79). An alternative approach is to use a structured association method in association mapping, permitting case–control analysis in the context of known differences in population structure.

In select circumstances, in which the epidemiologic data suggest major differences between populations, it is possible to conduct mapping by admixture of linkage disequilibrium (MALD). This capitalizes on the concept of admixture, which is the genetic mix

of two or more distinct populations. It relies on the differences in allele frequencies between populations to guide the search to focus on changes in the genome rather than a specific gene(s). So far it has been successful in mapping a key prostate cancer region on 8q24 and a type of end-stage renal disease that is more common in individuals of African American background (80–82).

Selection

Population geneticists often define evolution as a change in a population’s genetic composition over time. The four factors that can bring about such a change are natural selection, mutation, random genetic drift, and migration into or out of the population. More controversial is a possibility of changes in the mating pattern, which some consider not to be part of classical evolutionary change. Natural selection occurs when some genotypic variants in a population enjoy a survival or reproduction advantage over others. Although the concept that natural selection favours the survival of individuals with a fitness advantage now almost seems intuitive, it was largely opposed when introduced by Darwin (83). Under Mendelian inheritance and with random mating, genotype frequencies after one generation do not change; the determinant of whether the allele will spread in the population is the fitness of heterozygotes versus that of wild-type homozygotes.

Mutation is the primary source of genetic variation driving differences within a population and thus preventing homogeneity. Although mutations that occur in the genome are initially thought to be random, the distribution of biologically significant mutations that cause diversity appears to be non-random (84,85). Gene function, gene structure and the roles of genes and gene products in genetic networks can influence whether particular mutations will contribute to advantageous phenotypic changes. Some mutations generate specific phenotypic changes, whereas pleiotropic mutations alter several seemingly unrelated traits. Mutations with pleiotropic effects will rarely change all phenotypic traits in a favourable way, and, in some instances, may even reduce fitness (86). The same mutation in a different genetic background may produce a different phenotypic effect because of interactions between alleles, under the phenomenon called epistasis. Also, populations exposed to repeated environmental changes may present with different genetic changes that produce a range of phenotypes suited to the environmental conditions, namely plasticity.

Initially, when the environment favours a phenotype that is largely different from the average one in a population, mutations that cause this phenotypic change towards the new optimum are favoured (called strength of selection). Population size and history also influence genetic evolution. A small population size can accentuate the effects of random sampling of alleles, so-called genetic drift. In small populations, genetic drift will allow deleterious alleles to occasionally increase in frequency (84).

Random genetic drift refers to the chance fluctuations in gene frequency that arise in finite populations; it can be thought of as a type of “sampling error.” In many evolutionary models, the population is assumed to be infinite or very large to avoid chance fluctuations. This assumption is often not realistic, and species with historically low effective population sizes, such as humans, show evidence for reduced variability and effectiveness of selection in comparison with other species (87,88). In the era of multispecies comparisons of genome sequences and GWAS, it is critical to assess the evolutionary role of genetic drift and its interactions with mutation, migration, recombination and selection. Therefore, population size plays a central part in modern studies of molecular evolution and variation (88).

One of the most influential variables for human genetic variation is geographic location, with genetic differentiation between populations increasing with geographic distance and diversity decreasing with distance from Africa. Populations of African ancestry have the greatest diversity, resulting in shorter segments of LD (89–93). Modern population genetics estimates that the ancestral human population originated in Africa and radiated outward to other continental locations, both within Africa and elsewhere.

Alleles under positive selection can increase in prevalence in a population and leave distinctive signatures, or patterns of genetic variation, in DNA sequences. These can be identified by comparison with the background distribution of genetic variation, primarily evolved under neutrality (94). In some cases, these signatures, or differences in allele frequencies between populations, reflect major regional selective pressures like infectious diseases (e.g. malaria), environmental stresses (e.g. temperature), or dietary factors (e.g. milk consumption) (13,95,96).

When immigrants with a different genetic makeup enter a new population, the population’s genetic composition will be altered. The evolutionary importance of

migration stems from the fact that many species are composed of several distinct subpopulations, largely isolated from each other but connected by occasional migration. Migration between subpopulations gives rise to gene flow, limiting the extent to which subpopulations can diverge from each other genetically.

Laboratory analysis of human genetic variation

Genotype analysis

Genotyping is used to interrogate specific, unique loci in the genome following DNA amplification by the polymerase chain reaction (PCR). One of the challenges of genotype analysis is that each allele in the genome must be assayed individually, unlike surveys of gene expression that can use a common signature (the polyA tail) to capture a high percentage of mRNA at once. An assay must be robust and reproducible in exceeding a sufficient threshold for detection. Even though amplification protocols are highly reliable, error can be introduced in SNP detection, particularly if there are neighbouring SNPs that alter allele-specific binding of probes or if the local genomic sequence is enriched for guanine-cytosine (GC) content (Figure 6.5) (97,98). The presence of duplicates of part of the sequence (CNV), either in the segment amplified or neighbouring SNPs, can undermine the fidelity of the assay, sometimes biasing allele calling (55). Because the assay amplifies the local sequence surrounding the SNP of interest, redundant sequences are amplified, either locally or elsewhere in the genome, and the fidelity of the polymorphisms between these different segments is undermined, as was observed in the International HapMap Project (45,56).

Initially, restriction fragment length polymorphism (RFLP) assays were used to identify patterns of DNA broken into pieces by restriction enzymes. The size of the fragments was used to develop a footprint of the region of interest (99). RFLP analysis is laborious and error-prone, and thus has been largely abandoned for probe intensity and microchip technologies that can be easily scaled and reliably performed. Examples of these are differential hybridization, primer extension, ligation reactions and allele-specific probe cleavage, all of which interrogate one SNP

Figure 6.5. Fidelity of the genotyping assay: error could be introduced in SNP detection. For example, the presence of a neighboring SNP under both TaqMan® (TM) probes (left panel) may alter allele-specific binding and bias the allele call (right panel).

at a time. Occasionally, RFLPs are required to interrogate a region with high degrees of paralogy.

Low-density genotyping

The most commonly used technique for single SNP assays is the TaqMan® SNP genotyping assay. It is a PCR-based assay designed to interrogate a single SNP that uses two locus-specific PCR primers and two allele-specific labelled probes (100). The 5′ exonuclease property of Taq polymerase is exploited for detection of base-matching at a specific site. Attached to the 5′ end of each probe is an allele-specific reporter dye: each allele has a corresponding dye, which provides a benchmark for the ratio of the dyes as a reflection of the allele distributions. On the 3′ end of each probe is a single universal quencher dye, which prevents the excitation and emission of the reporter dyes. During PCR amplification, the two PCR primers anneal to the template DNA. The detection probes anneal specifically to the complementary sequence between the forward and reverse primer sites. During the elongation step of each cycle, the Taq polymerase comes in contact from the 5′ end with the reporter dye. Capitalising on the exonuclease property of Taq polymerase, the reporter dye is released from the probe and the fluorescence is released (i.e. no longer quenched by the quencher dye). In addition, the probe itself is digested by the Taq polymerase. After multiple cycles of PCR (that reach saturation for copying both alleles), fluorescence is detected for the two reporter dyes using an ABI 7900HT Sequence Detection System.

Careful attention must be paid to the unique flanking sequences to avoid overlap with adjacent, neighbouring SNPs or insertion/deletions. The throughput is moderate for single-plex TaqMan, but new miniaturization technologies have improved the efficiency of moderate-scale genotyping studies using either the Fluidigm® or BioTrove platforms (101,102).

Multiplexing has increased the technical capacity to interrogate large, predetermined, fixed sets of SNPs. The cost of high-density SNP platforms and the necessity for large-scale follow-up studies have incentivized the development of methodologies for selective replication efforts. The technologies that have been developed for these replication studies are based on direct oligonucleotide hybridization with probe fluorescence detection, the single-base sequencing method, or chip-based mass spectrometry (i.e. based on matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF)) (103). Matrix-assisted laser desorption/ionization (MALDI) enables analysis of biomolecules by ionization usually triggered by a laser beam. A matrix is used to protect the biomolecule from destruction; it can be multiplexed to perform roughly 30 SNP assays at one time.

High-density SNP detection

The first generation of custom bead-array technology by Illumina® enables custom detection of more than 1500 SNPs with excellent performance, and analysis of high-quality DNA generated by whole-genome amplification assays (104,105). This system combines high multiplexing in a multisample array format, well suited for custom genotype analysis of samples. Though best used with native DNA, it can analyse whole-genome amplified DNA, but at a price of distortions of heterozygosity for roughly 5% of the SNPs.

The newer system of Illumina, known as the Infinium® Assay, features single-tube preparation of DNA followed by whole-genome amplification before genotyping thousands of unique SNPs. Hybridization to bead-bound 50mer oligomers is followed by single-base extension, which incorporates a labelled nucleotide for assay detection. This technology can be used to design custom sets of SNPs (between 7600 and 60 000 bead types) with high efficiency (106). It is the backbone of the fixed content chips, which have increased in size and coverage of the common SNPs in the genome. This began with the HumanHap300 and its complementary HumanHap240, through to the HumanHap500, HumanHap610 and HumanHap660w. The Infinium HD (high-density) series followed with the Human1M-Duo BeadChips, which have over 10⁶ SNPs to be genotyped, primarily chosen as tagSNPs from HapMap II (8). The increasing content of the chips also provides an opportunity to detect a larger subset of the common CNVs. However, algorithms for detection of CNVs continue to evolve and should improve in the coming years.

The Affymetrix microchip system is based on an assay known as the whole-genome sampling analysis (WGSA) for highly multiplexed SNP genotyping (107). This method amplifies the human genome with a single primer amplification reaction using restriction enzyme-digested, adaptor-ligated human genomic DNA. After fragmentation, sequential labelling and hybridization of the targets is required before analysing the fragments on a microchip. The initial GeneChip® Human Mapping 500K Array spaced SNP markers by physical proximity, but the new Genome-Wide Human SNP Array 6.0 provides a denser set of SNPs

(over 900 000), as well as probes that monitor common CNVs across the genome. The distribution of restriction enzyme sites in select regions of the genome does not permit assays across the full genome, limiting the coverage somewhat. The primary debate over the choice of platforms is the coverage of known SNPs in HapMap Stage 2: the SNPs selected for the Illumina platform have been primarily chosen according to the aggressive tag strategy, whereas the first-generation Affymetrix chips provided spaced coverage based on the physical map of the genome, but with higher density. The coverage of the latter has improved.

Methodological issues in GWAS genotyping

High-throughput genotyping facilities require sophisticated robotics for efficient laboratory flow and sample handling, as well as dedicated computational hardware and software able to effectively process both the quantity and complexity of the data. Despite the fact that new technologies and platforms have decreased the nominal price per genotype assayed, the pricing must also take into account the need to study duplicates and samples that must be redone due to technical inadequacies determined in the quality control assessment (see below).

Since replication is a central requirement to protect against the flurry of false-positives observed in GWAS, follow-up studies are needed to verify the results and thus justify the considerable effort required to investigate novel regions. To this end, custom panels are needed to explore regions at the same time that loci are analysed over sufficiently large data sets, so that genome-wide significance can be conclusively established (106,108). Normally, custom panels are more expensive and usually created for a single study (109). In this regard, scalability to meet the requirements of validation studies represents one of the biggest challenges in the design of these studies (110).

Important components of the optimization process include both a Laboratory Information Management System (LIMS) and robotic automation that accurately track and handle samples for efficient workflow management. Because of the high cost of these platforms, the hardware used for sample processing, and the software integrating both, there is little flexibility in choosing individual SNPs to be included within the already-designed, commercially available whole-genome scans.

Two high-density genotyping platforms, Affymetrix and Illumina®, achieve calling capabilities of between 500 000 and 2.5 million SNPs, as well as probe content to interrogate CNVs. Both platforms need 400–800 ng of total high-quality DNA (usually at 50 ng/μl) for the assay, but because of the dead-space of the robotics (which can be 35% of the required amount for the assay) over 1 μg is required. Issues common to both platforms are the difficulties in assaying SNPs that reside close together (within 60 or fewer nucleotides), which, as previously mentioned, is inherent in this type of genotyping detection. Denser sets of SNPs on commercial platforms have increased coverage, but not always for all populations. Coverage based on the HapMap II set of SNPs with minor allele frequencies greater than 5% is one of the main factors driving the choice of platform (8,43). Figure 6.6 illustrates the minimum LD for any SNP assay assessed by the coefficient of correlation, r² (a measure of LD), for 2-SNP comparison. The closer the value is to 1, the stronger the correlation, and if the value is estimated to be 1.0, then both loci segregate together. New approaches are being developed to account for the complexity of LD patterns in distinct populations, such as multimarker strategies that have been proposed for analysing more complicated loci (111,112).

Sequence analysis

Until recently, DNA sequence analysis by capillary electrophoresis has been the platform of choice for medium- and small-scale projects, displacing the Sanger sequencing protocols that used gels or polymers as separation media for the fluorescently labelled DNA fragments (113). The advent of the 96-capillary 3730/3730xl DNA Analyser (Applied Biosystems) was the central catalyst in the generation of the first draft sequence of the human genome (114).

Dideoxy sequencing is based on the principle of terminating DNA synthesis by incorporation of the dideoxy nucleotide terminator on the complementary strand of DNA fragments. The generated library of various length fragments can be assembled to read the specific DNA sequence. Sequencing-by-synthesis is based upon the principle of pyrophosphate release by nucleotide incorporation along the complementary strand of DNA to the varied-length template. As with dideoxy sequencing, the library of generated fragment lengths can be assembled into a specific DNA sequence. An amplification step by PCR is required, and thus has an intrinsic error below 0.3% (small but predictable) (115).

Efficient removal of unincorporated dye terminators is necessary before running samples on a capillary electrophoresis in

Figure 6.6. Genotyping platforms coverage of HapMap II SNPs. SNP coverage is plotted against LD measured by r², or coefficient of correlation, for SNP-SNP comparison. Panels: A. HapMap CEU population: CEPH (Utah residents with ancestry from northern and western Europe); B. HapMap YRI population: Yoruba in Ibadan, Nigeria; C. HapMap JPT population: Japanese in Tokyo, Japan, and CHB population: Han Chinese in Beijing, China.
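The r² statistic plotted in Figure 6.6 can be illustrated with a minimal Python sketch. It assumes two biallelic SNPs with known allele and haplotype frequencies; the function names, and the r² ≥ 0.8 threshold used to call a SNP "covered", are illustrative assumptions and are not taken from the chapter or from any platform's documentation.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Return r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


def coverage(best_r2_per_snp, threshold=0.8):
    """Fraction of reference SNPs whose best r^2 with any assayed SNP meets the threshold.

    The default threshold of 0.8 is an assumption made for this example.
    """
    values = list(best_r2_per_snp)
    if not values:
        raise ValueError("no SNPs supplied")
    return sum(1 for r2 in values if r2 >= threshold) / len(values)
```

With these definitions, two SNPs whose alleles always travel together (for example `ld_r_squared(0.3, 0.3, 0.3)`) give a value of 1, and two independent SNPs (`ld_r_squared(0.25, 0.5, 0.5)`) give 0, matching the interpretation of r² given in the text.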

which an electrical field is applied. This allows negatively-charged DNA fragments to move through the polymer towards the positive electrode. Standard software collects raw data files and translates the collected colour data images into consecutive nucleotide base calls.

Next-generation DNA sequencing

Next-generation sequencers have been developed to process millions of sequence reads in parallel rather than in batches of 96 at a time, setting them apart from conventional capillary-based sequencing. These techniques provide high speed and high throughput from amplified single DNA fragments, avoiding the need for cloning of DNA fragments. Therefore, with minimal input of DNA, the sequencer produces libraries of shorter reads of 35–400 bp, depending on the platform, compared to those of capillary sequencers (650–800 bp). A limiting factor is the elevated cost of generating the sequence with high throughput. There is a need to develop software applications and more efficient computer algorithms to analyse the increasing amount of data generated by these systems (113). Because of their novelty, the accuracy and associated quality of sequencing reads must be further validated, but the high number of reads provides increased coverage of each base position (25). The major challenge of next-generation sequencing is the informatics of the dense data sets, which requires archiving and storing dense data sets that must be assembled to determine accurate reads. In this regard, error rates for next-generation sequencing runs and assembly constitute a new set

of problems, particularly since the quantum increase in data makes their inspection more daunting.

The Roche/454 GS-FLX technology works on the principle of pyrosequencing, which uses pyrophosphate molecules released on nucleotide incorporation by DNA polymerase to fuel a downstream set of reactions that ultimately produces light from the cleavage of oxyluciferin by luciferase (116). The DNA strands of the library are amplified en masse by emulsion PCR (117) on the surfaces of hundreds of thousands of agarose beads. Each agarose bead surface contains up to 1 million copies of the original annealed DNA fragment to produce a detectable signal from the sequencing reaction. Imaging of the light flashes from luciferase activity records which templates are adding that particular nucleotide; the light emitted is directly proportional to the amount of a particular nucleotide incorporated. The current 454 instrument, the GS-FLX, produces an average read length of 400 bp per sample (per bead), with a combined throughput of ~100–150 Mb of sequence data per run. By contrast, a single ABI 3730 programmed to sequence 24 × 96 well plates per day produces ~440 kb of sequence data in 7 hours, with an average read length of 650 bp per sample (25).

The Illumina Genome Analyser is based on the concept of sequencing by synthesis (Solexa® Sequencing technology) to produce sequence reads of 35–150 bp from tens of millions of surface-amplified DNA fragments simultaneously (118). A mixture of single-stranded, adaptor oligo-ligated DNA fragments is incubated and amplified with four differentially-labelled fluorescent nucleotides. Each base incorporation cycle is followed by an imaging step that identifies it and by a chemical step that removes the fluorescent group. At the end of the sequencing run (~4 days), the sequence of each cluster is computed and subjected to quality control. A typical run yields ~40–50 million such sequences.

The Applied Biosystems SOLiD sequencer uses a unique sequencing process catalysed by DNA ligase. A SOLiD (Sequencing by Oligo Ligation and Detection) run requires days, and produces 3–4 Gb of sequence data with an average read length of approximately 50 bp (119). The specific process couples oligo adaptor-linked DNA fragments with 1-μm magnetic beads that are decorated with complementary oligos, and amplifies each bead-DNA complex by emulsion PCR. SOLiD sequencing by ligation first anneals a universal sequencing primer, then goes through subsequent ligation of the appropriate labelled 8mer, followed by detection at each cycle by fluorescent readout. The unique attribute of this system is that an extra quality check of read accuracy is enabled that facilitates the discrimination of base calling errors from true polymorphisms or insertion/deletion (indel) events, the so-called “2 base encoding” (25).

The third generation of sequencing technologies is in development and should be available in the coming years. For example, single molecule sequencing is based on novel chemistry that enables direct measurement of billions of strands of DNA. The detection system measures incorporated bases on individual strands and thus avoids the requirement of amplification, which is subject to biases and errors.

Applications of high-throughput DNA sequencing

A major focus of this new technology is to rapidly and comprehensively catalogue human genetic variation, particularly common and uncommon genetic polymorphisms (e.g. SNPs and insertion/deletions). Since GWAS have relied on the genotyping of common alleles to discover novel associations with disease risks (120), follow-up of regions of association identified by GWAS is important to characterize common and uncommon variants which might be better markers (or even candidates) for further functional studies. Since GWAS point to new candidate regions, the detailed fine mapping of a region necessitates the generation of a comprehensive set of common and uncommon variants. Already there are select examples of next-generation sequencing analysis applied to regions to determine new variants for follow-up association testing, such as for regions 8q24, associated with prostate and colon cancer, and 10q11.2 (containing the MSMB gene), associated with prostate cancer (121,122). Eventually the 1000 Genomes Project should provide a suitable map to begin to choose variants in a region of interest. Characterizing all common variants before a fine-mapping process has two major benefits: all common genetic variants are represented using a tagSNP approach; and the correlations among all genetic variants are known, which provides advantages in functional variant detection (121,122).

Since large-scale sequencing across the genome is still several years away, attention has focused on targeted sequencing of regions of high interest, such as those defined by GWAS or linkage studies, and, more recently, the opportunity to

sequence across the exome (e.g. more than 180 000 known exons in the genome). Several different technologies have been developed to capture target sequence, either through liquid phase (e.g. biotinylated solution capture probes with long range or micro-droplet solution technique), or tiled arrays that contain probes that enrich for capture of DNA for sequence analysis. Each of the next-generation sequencing technologies has been successfully used with one or more target capture technologies. For instance, using the NimbleGen solution-based capture technique, the KLK3 locus, recently identified as a signal for prostate cancer and prostate-specific antigen levels, was resequenced to comprehensively catalogue all common variants for follow-up genotype and functional analyses (123–125). Recently, sequencing across the exome after enrichment with tiled arrays has successfully been used to identify high-penetrant mutations in the coding regions in individuals with Mendelian disorders (126). Exome sequencing represents the first step towards examining the portion of the genome that is easily interpretable, namely changes in coding structure. It requires careful annotation and analytical structures, however, to sift through the thousands of rare variants in unique individuals.

The sequencing of the first human genomes has underscored the challenge of unraveling the physical map, particularly in some regions of great redundancy and/or complexity; moreover, it illustrates the daunting problem of assembly (127,128). More genomes need to be sequenced to establish a reliable reference standard for the analysis of human genomic variations. The current reference is an amalgam of several genomes, so unravelling variation is particularly difficult. Two new developments should address this issue: sequencing with greater coverage, which diminishes the false-positives and -negatives of sequence determination, and an increase in read length, which will permit phasing of genomes.

The ambitious effort from an international research consortium, namely the 1000 Genomes Project, “…will involve sequencing the genomes of at least a thousand people from around the world and to create the most detailed and medically useful picture to date of human genetic variation” (http://www.1000genomes.org/). The goal is to create a detailed map of human genetic variation relevant at or above the level of a frequency of 0.5–1% across the genome (113). By optimizing technology, costs will continue to fall, enabling greater scope of study at a lower price. Reduction to affordable levels, targeted for the US$1000 range for an entire human genome sequence, offers the promise of personal genomics. There are still formidable barriers, however, with respect to informatics, storage and the ethical and social dilemmas posed by such analyses.

Next-generation sequencing technologies have already been applied to complementary fields of investigation in genetics. The intent has been to characterize a complex sample with a mixture of nucleic acids through their sequence without prior knowledge of it, in contrast to the probe hybridization used by the original SAGE technique (129,130). Thus, it is possible to characterize the sequence of mRNAs, methylated DNA, DNA or RNA regions bound by certain proteins, and other DNA or RNA regions involved in gene expression and regulation (113). Recent examples are its application to transcriptome profiling in stem cells (119); to whole transcriptome shotgun sequencing, or RNA-Seq, to study alternative splicing in human cells (131); and to the identification of mammalian DNA sequences bound by transcription factors in vivo, by combining chromatin immunoprecipitation (ChIP) with parallel sequencing (ChIP-Seq) (132).

The Human Microbiome Project (HMP) (http://nihroadmap.nih.gov/hmp/) integrates genomics and metagenomics in an effort to characterize the genome sequences of organisms inhabiting a common environment (133). By understanding genomics, metagenomics and their relations, the HMP seeks to determine whether individuals share a core human microbiome and whether changes in the human microbiome can be correlated with changes in human health (134).

Quality control in the laboratory

The advent of new technologies and workflow paradigms required for high-throughput genotyping and sequencing has changed the nature of laboratory work in genetics. The bulk of the work has been shifted to high-throughput analyses, where so much data is processed in such a short time that the older shibboleths of quality control have been shed for more efficient approaches, which seek to identify potential errors in a high-volume workflow.

The efficient and meticulous sample handling process must begin at the moment of receipt of germline DNA for genotyping or sequencing. Close coordination between the laboratory performing the extraction and the biorepository storing the DNA samples is optimal and protects against handling and biorepository errors, an underappreciated problem.

Standard operating procedures (SOPs) for the process should be created and reviewed regularly for improvements and quality control purposes.

DNA quantification is not an exact science. Due to technical and workflow issues, it is actually quite difficult to reproducibly quantify DNA (135). Several different techniques can be used to measure DNA, but each one has limitations and, in some workflows, different applications of preparing for low- or high-throughput genetic analyses. Quantification methods should be chosen for specific genotype/sequence platforms. The most commonly used techniques are spectrophotometric measurement of DNA optical density, PicoGreen (Turner BioSystems) analysis, the NanoDrop spectrophotometer (NanoDrop Technologies), or real-time PCR analysis using a standardized TaqMan™ assay (136). Real-time PCR can provide a preliminary test for sample quality as it relates to robust analysis in a high-throughput laboratory, but performance still must be gauged with specific technologies. Spectrophotometry and the PicoGreen assay measure total DNA present, regardless of source or quality, whereas a real-time PCR assay measures the total amplifiable human DNA. DNA quantitation by real-time PCR is particularly helpful for assessing the contribution of non-human DNA to samples collected from buccal swabs, cytobrush samples or other non-blood sources. Minor but real differences between techniques reflect dissimilarities in the ratio of single- and double-stranded DNA, critical for analysis using diverse technologies.

Because of the high volume of activities in high-throughput genotyping/sequencing facilities, unique genetic profiles of samples can be useful for quality assessment and control in the workflow. Many laboratories have incorporated into the upfront analysis a set of SNPs or a forensic panel of 15 short tandem repeats and amelogenin, also known as the AmpFLSTR Identifiler assay (Applied Biosystems). The fingerprinting can be helpful to sleuth problems and identify contaminated samples before costly analysis. Furthermore, the results can serve as a proxy for the viability of the DNA and its success on the high-performance genotyping or sequencing technologies. Certainly, high failure rates indicate poor performance. The profiles can be used to match known duplicates and identify unexpected duplicates, which in turn stimulates close inspection of both biorepository issues and workflow in the laboratory (e.g. errors with plates or reagents).

For the conduct of many molecular studies, sample availability has been a limiting factor. Naturally, there has been intense interest in the whole-genome amplification (WGA) technology to provide sufficient amounts of DNA for analysis. Varying results reflect not only differences in the protocols and reagents, but also the samples themselves. The quality of DNA used to amplify across the genome affects the success and fidelity of the process. WGA can generate large quantities of DNA for genotype assays, but approximately 5% of the genome is not faithfully reproduced, particularly regions with high GC content or near telomeres. Thus, the results of analyses of these regions should be cautiously interpreted. While the temptation to use WGA DNA in GWAS is great, the results so far have not been encouraging. Currently, there are two approaches that have been commercially optimized. These include a type of multiple displacement amplification (MDA) with the high-performance bacteriophage φ29 DNA polymerase, which uses degenerate hexamers, or generation of libraries of 200–2000 bp fragments created by random chemical cleavage of genomic DNA. Ligation of adaptor sequences to both ends and PCR amplification is required. Quantities can vary greatly based on input DNA, but under optimal conditions an enrichment of 10 000-fold can be expected.

The rolling circle amplification (RCA) technique is an enzymatic process mediated by DNA polymerases. Long single-stranded DNA molecules are synthesized on a short circular template by using a single DNA primer. RCAs generate a large-scale DNA template with the advantage of not requiring a thermal cycling instrument (137,138). Differential success has been observed with whole blood, dried blood, buccal cell swabs, cultured cells and buffy coat cells. Intriguingly, WGA of water control specimens generates a small, monoallelic signal, which can be called as a single allele, thus underscoring the value of rigorous controls (139). Still, more laboratories have chosen MDA for whole-genome amplification (140).

The utility of duplicates drawn from the same sample remains a central theme of laboratory quality control, but with the advent of high-throughput laboratories the purpose has shifted slightly. Duplicate testing is still useful to detect problems with sample quality, prior storage, and informatic issues in sample management. In some cases, it can also reveal rare individuals enrolled in more than one study. Reproducibility of assays is key, and with the whole-genome-scan chips surpasses 99.8% concordance.
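As an illustration of the duplicate concordance described above, the following Python sketch compares two genotype runs of the same sample. It is a minimal example under stated assumptions: genotypes are represented as two-letter strings such as "AG", a missing call is represented as None, and the function name is hypothetical rather than part of any genotyping software.

```python
def duplicate_concordance(calls_a, calls_b):
    """Return the fraction of SNPs with identical genotype calls in two runs.

    calls_a, calls_b: equal-length sequences of genotype strings (e.g. "AG"),
    with None marking a missing call. SNPs missing in either run are skipped.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both runs must cover the same SNPs in the same order")
    compared = 0
    agree = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # a missing call affects completion, not concordance
        compared += 1
        # Sort the alleles so that "AG" and "GA" count as the same genotype.
        if sorted(a) == sorted(b):
            agree += 1
    if compared == 0:
        raise ValueError("no SNPs were called in both runs")
    return agree / compared


run_1 = ["AA", "AG", "GG", None, "CT"]
run_2 = ["AA", "GA", "GG", "CC", "CC"]
rate = duplicate_concordance(run_1, run_2)
```

In this toy example four SNPs are called in both runs and three agree, so `rate` is 0.75. In practice the value would be compared against the concordance expected for the platform, such as the 99.8% quoted for whole-genome-scan chips.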

Errors in genotyping, mainly due to loss of one of the heterozygous alleles, occur in well below 1% of samples; therefore, when the rate creeps above 1%, close inspection of the process should be undertaken. If SOPs are followed closely, completion rates should be greater than 95% for most studies, but may be slightly lower depending on the quality of genomic DNA. Completion rates below 90% should raise substantive concern about technical or analytical problems. In GWAS, it is recommended that a second technology, such as TaqMan or sequencing, be used to verify accuracy and establish concordance (120). Testing for fit to Hardy–Weinberg proportions (Hardy–Weinberg equilibrium (HWE) testing) can catch major genotype errors, but should probably not be used as a stringent threshold for excluding SNPs from analysis.

Bioinformatics

Large-scale genotyping and sequence analysis have shifted the burden of informatics towards high-performance tools that manage the computational and bioinformatic workflow needed to manipulate high-density data sets. The required tasks of archiving, analysis and access are destined to grow exponentially as studies are designed with increasing numbers of participants and larger and more complex sets of variants to be interrogated. Accordingly, the efficiency of the laboratory flow is based on a high-throughput pipeline for both genetic analysis and informatic handling of the data sets. Major steps in the process include the choice of markers and platforms together with a sophisticated quality control process. Highly trained personnel are needed to effectively coordinate the flow of information. Central to the success of a laboratory is the functioning of a Laboratory Information Management System (LIMS), which is required to track samples, assays, reagents, equipment, robotics and processes through the entire workflow. The LIMS captures the movement of information from receipt of samples through the analytical steps and into the quality control regime required to provide a final, stable data set, linking the results of experimental data to in silico information via relational databases. Annotation of the genome is needed to provide clear points of reference for the genomic coordinates of the genotype and sequence assays. Careful quality control and quality assurance checks within the LIMS software, particularly with real-time monitoring, are needed to maintain assay reproducibility and reliable data flow.

The increasing number of loci explored by new platforms, as well as the quantum increases in study size, has forced major changes in laboratory data storage and management. Laboratory systems should be able to routinely process, monitor and assess quality control of large amounts of data (10^6–10^9 data points) generated by these studies. The increasing need for processing power mandates the use of scalable computational systems capable of parallel computing, with software applications specially designed for this multiprocessor environment and readily upgradable.

Suites of publicly available tools (e.g. PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml) (141) and Genotype Library and Utilities (GLU) (http://cgf.nci.nih.gov/glu/docs)) have been developed for archiving and management of dense data sets, such as those encountered in GWAS. PLINK, now in version 1.06, is a free, open-source whole-genome association suite that focuses on the analysis of large-scale genotype/phenotype data, but lacks support for study design and planning, genotype generation, or CNV calling. Its integration with Haploview (http://www.broadinstitute.org/haploview/haploview) allows visualization, annotation and representation of some of the results. GLU (version 1.0) is also a suite created to manage, analyse and report high-throughput SNP genotype data (http://code.google.com/p/glu-genetics). GLU was created to address the need for new and scalable computational approaches, as well as storage, management, quality control and genetic analysis. It is a framework and software package designed with a set of powerful tools that can scale to effectively handle trillions of genotypes. The integration of GLU with a robust and fast SNP tagging tool, like TagZilla, increases its functionality and allows LD estimation and computation of MAF, HWE, and proportions (http://tagzilla.nci.nih.gov/).

Conclusions and future directions

Knowledge acquired from the draft of the human genome and its annotation, and advances in technology, have changed the approach towards mapping complex diseases and traits. Once oriented to the study of candidate genes and/or mutations, human genetics has evolved into the study of the genome as a complex structure harbouring clues for multifaceted disease risk; some known, but the majority unknown. The discovery of new candidate regions by GWAS has forced rethinking previous strategies for the study of genetic predisposition. More agnostic approaches, genome-wide, with increasing numbers of participants from high-quality epidemiological studies are, for the first time, replicating results in different settings. But new-found candidate regions lead to extensive follow-up and confirmation of their functional significance. Understanding the true effect of genetic variability on the risk of complex diseases is paramount, but also important is the design of high-quality studies to assess environmental contributions, as well as the interactions between genes and exposures.

While accurate measures of environmental factors must be addressed, increased efforts are also needed in the study of the biological relevance of the regions already discovered. To date, there are few examples in which a biological functional basis has been associated with a candidate region discovered via GWAS. Also, the gap between new-found genomic regions and their biological interpretation could become greater with the introduction of new resequencing technology, which is capable of interrogating larger numbers of less frequent loci. New challenges arise with new technologies. High-throughput resequencing must standardise its technical protocols, quality control, calling algorithm and interpretation. Only deep resequencing of high numbers of individuals will create quality databases capable of testing rare variants in the population. Until these steps are readily available for new technologies, broad implementation will not be possible.

The new approach to the genomic study of complex diseases has resulted in a more ambitious “team” science, in which resources and study populations are pooled to identify novel genetic markers (Cf. Figure 6.1). In this regard, GWAS study thousands of the most common genetic variants across the genome (SNPs) without any prior hypothesis or conception, in what is being defined as an agnostic manner. This initial phase requires adequately powered follow-up studies for replication, which is central to the search for moderate- to high-frequency low-penetrance variants associated with human diseases and traits (120,142). Teams of scientists with specific responsibilities in each step of the process are necessary to ensure quality control and stable analytical results as part of the effort to map complex human diseases and traits.

Previously, family linkage studies were used to identify rare genetic variants in high-penetrance susceptibility genes (143,144), but they failed to be informative on more common genetic variants with low to moderate effect (145). With the advent of next-generation sequencing technologies and the discovery of many rare and uncommon variants, family studies will be required to assist in defining the most notable variants for follow-up studies. In this regard, family studies should prove invaluable in mapping many complex diseases, as well as the highly penetrant Mendelian disorders.

Based on the preliminary data published as a result of GWAS, it is not currently possible to draw final conclusions concerning the valid risk assessment of complex diseases. Education of both the public and scientific media is necessary to effect a rational approach towards implementing any risk reduction policies. These new challenges for public health officials will require careful attention to the ethical, moral and social implications of dense genomic data sets to assure the public, and the participants in the current studies, that confidentiality is protected (26).

Consortial efforts to describe human variation have focused on the description and characterization of three continental populations pursued by the International HapMap Project (http://www.hapmap.org). But using GWAS, other consortia and interest groups have focused on a more disease-specific approach that has resulted in the discovery of over 200 novel loci associated with human diseases/traits (2,8,20,57). Though the majority of the association studies to date have used high-throughput genotyping technology, new programs in comprehensive resequencing analysis would unveil an even greater catalogue of uncommon variants (http://www.1000genomes.org/).

In concert with the assessment of germline genetic variation, other programs are underway to characterize functional annotation through gene expression analysis. The ENCODE Project (ENCyclopedia Of DNA Elements) seeks to define functional elements (http://www.genome.gov/10005107) (25), and The Cancer Genome Atlas (TCGA) examines both somatic and germline alterations in select cancers (146). Together, these new developments promise to accelerate the discovery and characterization of novel genomic mechanisms in human diseases and traits.

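As a worked illustration of the Hardy–Weinberg check discussed in the quality control part of this chapter, the sketch below computes the minor allele frequency (MAF) and a one-degree-of-freedom chi-square test of Hardy–Weinberg proportions from the three genotype counts of a biallelic SNP. It is a minimal sketch, not the method used by PLINK, GLU or TagZilla: the function name and return layout are hypothetical, and the simple chi-square form is assumed here although exact tests are often preferred for small counts. In line with the text, a small p-value is best read as a prompt to inspect the assay rather than as an automatic exclusion rule.

```python
import math
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float, float]:
    """Return (MAF, chi-square statistic, p-value) for one biallelic SNP.

    n_aa, n_ab, n_bb are the observed counts of the two homozygous
    genotypes and the heterozygous genotype.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    maf = min(p, q)
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, chi2, p_value
```

For example, counts of 25, 50 and 25 match Hardy–Weinberg proportions exactly and give a statistic of zero, whereas a complete absence of heterozygotes (50, 0, 50) gives a very large statistic, the pattern expected when one allele of heterozygous samples is systematically lost.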
References

1. Hunter DJ, Khoury MJ, Drazen JM (2008). Letting the genome out of the bottle–will we get our wish? N Engl J Med, 358:105–107. doi:10.1056/NEJMp0708162 PMID:18184955
2. Manolio TA, Brooks LD, Collins FS (2008). A HapMap harvest of insights into the genetics of common disease. J Clin Invest, 118:1590–1605. doi:10.1172/JCI34772 PMID:18451988
3. Kruglyak L, Nickerson DA (2001). Variation is the spice of life. Nat Genet, 27:234–236. doi:10.1038/85776 PMID:11242096
4. Reich DE, Cargill M, Bolk S et al. (2001). Linkage disequilibrium in the human genome. Nature, 411:199–204. doi:10.1038/35075590 PMID:11346797
5. Reich DE, Gabriel SB, Altshuler D (2003). Quality and completeness of SNP databases. Nat Genet, 33:457–458. doi:10.1038/ng1133 PMID:12652301
6. Lander ES, Linton LM, Birren B et al.; International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409:860–921. doi:10.1038/35057062 PMID:11237011
7. Venter JC, Adams MD, Myers EW et al. (2001). The sequence of the human genome. Science, 291:1304–1351. doi:10.1126/science.1058040 PMID:11181995
8. Frazer KA, Ballinger DG, Cox DR et al.; International HapMap Consortium (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature, 449:851–861. doi:10.1038/nature06258 PMID:17943122
9. Hinds DA, Stuve LL, Nilsen GB et al. (2005). Whole-genome patterns of common DNA variation in three human populations. Science, 307:1072–1079. doi:10.1126/science.1105436 PMID:15718463
10. Chanock SJ (2001). Candidate genes and single nucleotide polymorphisms (SNPs) in the study of human disease. Dis Markers, 17:89–98. PMID:11673655
11. Risch NJ (2000). Searching for genetic determinants in the new millennium. Nature, 405:847–856. doi:10.1038/35015718 PMID:10866211
12. Hughes AL, Packer B, Welch R et al. (2003). Widespread purifying selection at polymorphic sites in human protein-coding loci. Proc Natl Acad Sci USA, 100:15754–15757. doi:10.1073/pnas.2536718100 PMID:14660790
13. Hughes AL, Packer B, Welch R et al. (2005). Effects of natural selection on interpopulation divergence at polymorphic sites in human protein-coding Loci. Genetics, 170:1181–1187. doi:10.1534/genetics.104.037077 PMID:15911586
14. Miklós I, Novák A, Dombai B, Hein J (2008). How reliably can we predict the reliability of protein structure predictions? BMC Bioinformatics, 9:137. doi:10.1186/1471-2105-9-137 PMID:18315874
15. Edwards YJ, Cottage A (2003). Bioinformatics methods to predict protein structure and function. A practical approach. Mol Biotechnol, 23:139–166. doi:10.1385/MB:23:2:139 PMID:12632698
16. Heringa J (2000). Computational methods for protein secondary structure prediction using multiple sequence alignments. Curr Protein Pept Sci, 1:273–301. doi:10.2174/1389203003381324 PMID:12369910
17. Erichsen HC, Chanock SJ (2004). SNPs in cancer research and treatment. Br J Cancer, 90:747–751. doi:10.1038/sj.bjc.6601574 PMID:14970847
18. Cargill M, Altshuler D, Ireland J et al. (1999). Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet, 22:231–238. doi:10.1038/10290 PMID:10391209
19. Stephens JC, Schneider JA, Tanguay DA et al. (2001). Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293:489–493. doi:10.1126/science.1059431 PMID:11452081
20. International HapMap Consortium (2003). The International HapMap Project. Nature, 426:789–796. doi:10.1038/nature02168 PMID:14685227
21. Packer BR, Yeager M, Burdett L et al. (2006). SNP500Cancer: a public resource for sequence validation, assay development, and frequency analysis for genetic variation in candidate genes. Nucleic Acids Res, 34 Database issue;D617–D621. doi:10.1093/nar/gkj151 PMID:16381944
22. Stephens M, Sloan JS, Robertson PD et al. (2006). Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat Genet, 38:375–381. doi:10.1038/ng1746 PMID:16493422
23. Marth G, Schuler G, Yeh R et al. (2003). Sequence variations in the public human genome data reflect a bottlenecked population history. Proc Natl Acad Sci USA, 100:376–381. doi:10.1073/pnas.222673099 PMID:12502794
24. Marth GT, Korf I, Yandell MD et al. (1999). A general approach to single-nucleotide polymorphism discovery. Nat Genet, 23:452–456. doi:10.1038/70570 PMID:10581034
25. Mardis ER (2008). The impact of next-generation sequencing technology on genetics. Trends Genet, 24:133–141. doi:10.1016/j.tig.2007.12.007 PMID:18262675
26. Birney E, Stamatoyannopoulos JA, Dutta A et al.; ENCODE Project Consortium; NISC Comparative Sequencing Program; Baylor College of Medicine Human Genome Sequencing Center; Washington University Genome Sequencing Center; Broad Institute; Children’s Hospital Oakland Research Institute (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816. doi:10.1038/nature05874 PMID:17571346
27. Romeo S, Pennacchio LA, Fu Y et al. (2007). Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet, 39:513–516. doi:10.1038/ng1984 PMID:17322881
28. Bonnen PE, Wang PJ, Kimmel M et al. (2002). Haplotype and linkage disequilibrium architecture for human cancer-associated genes. Genome Res, 12:1846–1853. doi:10.1101/gr.483802 PMID:12466288
29. Sabeti PC, Reich DE, Higgins JM et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419:832–837. doi:10.1038/nature01140 PMID:12397357
30. Slatkin M (2008). Linkage disequilibrium–understanding the evolutionary past and mapping the medical future. Nat Rev Genet, 9:477–485. doi:10.1038/nrg2361 PMID:18427557
31. Orr N, Chanock SJ (2008). Common genetic variation and human disease. Adv Genet, 62:1–32. doi:10.1016/S0065-2660(08)00601-9 PMID:19010252
32. Hill WG (1974). Estimation of linkage disequilibrium in randomly mating populations. Heredity, 33:229–239. doi:10.1038/hdy.1974.89 PMID:4531429
33. Clark AG (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol, 7:111–122. PMID:2108305
34. Eskin E, Halperin E, Karp RM (2003). Efficient reconstruction of haplotype structure via perfect phylogeny. J Bioinform Comput Biol, 1:1–20. doi:10.1142/S0219720003000174 PMID:15290779
35. Stephens M, Smith NJ, Donnelly P (2001). A new statistical method for haplotype reconstruction from population data. Am J Hum Genet, 68:978–989. doi:10.1086/319501 PMID:11254454
36. Stephens M, Donnelly P (2003). A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet, 73:1162–1169. doi:10.1086/379378 PMID:14574645

37. Marchini J, Cutler D, Patterson N et al.; International HapMap Consortium (2006). A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet, 78:437–450. doi:10.1086/500808 PMID:16465620
38. Akey J, Jin L, Xiong M (2001). Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur J Hum Genet, 9:291–300. doi:10.1038/sj.ejhg.5200619 PMID:11313774
39. Schaid DJ (2004). Evaluating associations of haplotypes with traits. Genet Epidemiol, 27:348–364. doi:10.1002/gepi.20037 PMID:15543638
40. Tan Q, Christiansen L, Christensen K et al. (2005). Haplotype association analysis of human disease traits using genotype data of unrelated individuals. Genet Res, 86:223–231. doi:10.1017/S0016672305007792 PMID:16454861
41. Cardon LR, Abecasis GR (2003). Using haplotype blocks to map human complex trait loci. Trends Genet, 19:135–140. doi:10.1016/S0168-9525(03)00022-2 PMID:12615007
42. Johnson GC, Esposito L, Barratt BJ et al. (2001). Haplotype tagging for the identification of common disease genes. Nat Genet, 29:233–237. doi:10.1038/ng1001-233 PMID:11586306
43. Barrett JC, Fry B, Maller J, Daly MJ (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21:263–265. doi:10.1093/bioinformatics/bth457 PMID:15297300
44. Stram DO, Haiman CA, Hirschhorn JN et al. (2003). Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered, 55:27–36. doi:10.1159/000071807 PMID:12890923
45. McCarroll SA, Altshuler DM (2007). Copy-number variation and association studies of human disease. Nat Genet, 39 Suppl;S37–S42. doi:10.1038/ng2080 PMID:17597780
46. Scherer SW, Lee C, Birney E et al. (2007). Challenges and standards in integrating surveys of structural variation. Nat Genet, 39 Suppl;S7–S15. doi:10.1038/ng2093 PMID:17597783
47. Kidd JM, Cooper GM, Donahue WF et al. (2008). Mapping and sequencing of structural variation from eight human genomes. Nature, 453:56–64. doi:10.1038/nature06862 PMID:18451855
48. Feuk L, Carson AR, Scherer SW (2006). Structural variation in the human genome. Nat Rev Genet, 7:85–97. doi:10.1038/nrg1767 PMID:16418744
49. Stefansson H, Helgason A, Thorleifsson G et al. (2005). A common inversion under selection in Europeans. Nat Genet, 37:129–137. doi:10.1038/ng1508 PMID:15654335
50. Sharp AJ, Locke DP, McGrath SD et al. (2005). Segmental duplications and copy-number variation in the human genome. Am J Hum Genet, 77:78–88. doi:10.1086/431652 PMID:15918152
51. Sebat J, Lakshmi B, Troge J et al. (2004). Large-scale copy number polymorphism in the human genome. Science, 305:525–528. doi:10.1126/science.1098918 PMID:15273396
52. Iafrate AJ, Feuk L, Rivera MN et al. (2004). Detection of large-scale variation in the human genome. Nat Genet, 36:949–951. doi:10.1038/ng1416 PMID:15286789
53. Inoue K, Lupski JR (2002). Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet, 3:199–242. doi:10.1146/annurev.genom.3.032802.120023 PMID:12142364
54. Bailey JA, Yavor AM, Massa HF et al. (2001). Segmental duplications: organization and impact within the current human genome project assembly. Genome Res, 11:1005–1017. doi:10.1101/gr.GR-1871R PMID:11381028
55. Bailey JA, Gu Z, Clark RA et al. (2002). Recent segmental duplications in the human genome. Science, 297:1003–1007. doi:10.1126/science.1072047 PMID:12169732
56. Freeman JL, Perry GH, Feuk L et al. (2006). Copy number variation: new insights in genome diversity. Genome Res, 16:949–961. doi:10.1101/gr.3677206 PMID:16809666
57. International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 437:1299–1320. doi:10.1038/nature04226 PMID:16255080
58. Buckley PG, Mantripragada KK, Piotrowski A et al. (2005). Copy-number polymorphisms: mining the tip of an iceberg. Trends Genet, 21:315–317. doi:10.1016/j.tig.2005.04.007 PMID:15922827
59. McCarroll SA, Kuruvilla FG, Korn JM et al. (2008). Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet, 40:1166–1174. doi:10.1038/ng.238 PMID:18776908
60. Khaja R, Zhang J, MacDonald JR et al. (2006). Genome assembly comparison identifies structural variants in the human genome. Nat Genet, 38:1413–1418. doi:10.1038/ng1921 PMID:17115057
61. Istrail S, Sutton GG, Florea L et al. (2004). Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA, 101:1916–1921. doi:10.1073/pnas.0307971100 PMID:14769938
62. Cooper GM, Zerr T, Kidd JM et al. (2008). Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet, 40:1199–1203. doi:10.1038/ng.236 PMID:18776910
63. Korn JM, Kuruvilla FG, McCarroll SA et al. (2008). Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet, 40:1253–1260. doi:10.1038/ng.237 PMID:18776909
64. Barnes C, Plagnol V, Fitzgerald T et al. (2008). A robust statistical method for case-control association testing with copy number variation. Nat Genet, 40:1245–1252. doi:10.1038/ng.206 PMID:18776912
65. Ballantyne KN, van Oorschot RAH, Mitchell RJ (2007). Comparison of two whole genome amplification methods for STR genotyping of LCN and degraded DNA samples. Forensic Sci Int, 166:35–41. doi:10.1016/j.forsciint.2006.03.022 PMID:16687226
66. Goellner GM, Tester D, Thibodeau S et al. (1997). Different mechanisms underlie DNA instability in Huntington disease and colorectal cancer. Am J Hum Genet, 60:879–890. PMID:9106534
67. Roewer L, Krawczak M, Willuweit S et al. (2001). Online reference database of European Y-chromosomal short tandem repeat (STR) haplotypes. Forensic Sci Int, 118:106–113. doi:10.1016/S0379-0738(00)00478-3 PMID:11311820
68. Ryckman K, Williams SM (2008). Calculation and use of the Hardy-Weinberg model in association studies. Curr Protoc Hum Genet, Chapter 1:Unit 1.18.
69. Hosking L, Lumsden S, Lewis K et al. (2004). Detection of genotyping errors by Hardy-Weinberg equilibrium testing. Eur J Hum Genet, 12:395–399. doi:10.1038/sj.ejhg.5201164 PMID:14872201
70. Gomes I, Collins A, Lonjou C et al. (1999). Hardy-Weinberg quality control. Ann Hum Genet, 63:535–538. doi:10.1046/j.1469-1809.1999.6360535.x PMID:11246455
71. Akey JM, Zhang K, Xiong M et al. (2001). The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am J Hum Genet, 68:1447–1456. doi:10.1086/320607 PMID:11359212
72. Wittke-Thompson JK, Pluzhnikov A, Cox NJ (2005). Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet, 76:967–986. doi:10.1086/430507 PMID:15834813
73. Leal SM (2005). Detection of genotyping errors and pseudo-SNPs via deviations from Hardy-Weinberg equilibrium. Genet Epidemiol, 29:204–214. doi:10.1002/gepi.20086 PMID:16080207
74. Cox DG, Kraft P (2006). Quantification of the power of Hardy-Weinberg equilibrium testing to detect genotyping error. Hum Hered, 61:10–14. doi:10.1159/000091787 PMID:16514241
75. Hirschhorn JN (2005). Genetic approaches to studying common diseases and complex traits. Pediatr Res, 57:74R–77R. doi:10.1203/01.PDR.0000159574.98964.87 PMID:15817501
76. Yu K, Wang Z, Li Q et al. (2008). Population substructure and control selection in genome-wide association studies. PLoS One, 3:e2551. doi:10.1371/journal.pone.0002551 PMID:18596976

77. Falush D, Stephens M, Pritchard JK (2003). Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164:1567–1587. PMID:12930761
78. Devlin B, Roeder K (1999). Genomic control for association studies. Biometrics, 55:997–1004. doi:10.1111/j.0006-341X.1999.00997.x PMID:11315092
79. Pritchard JK, Rosenberg NA (1999). Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet, 65:220–228. doi:10.1086/302449 PMID:10364535
80. Freedman ML, Haiman CA, Patterson N et al. (2006). Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad Sci USA, 103:14068–14073. doi:10.1073/pnas.0605832103 PMID:16945910
81. Kao WH, Klag MJ, Meoni LA et al.; Family Investigation of Nephropathy and Diabetes Research Group (2008). MYH9 is associated with nondiabetic end-stage renal disease in African Americans. Nat Genet, 40:1185–1192. doi:10.1038/ng.232 PMID:18794854
82. Kopp JB, Smith MW, Nelson GW et al. (2008). MYH9 is a major-effect risk gene for focal segmental glomerulosclerosis. Nat Genet, 40:1175–1184. doi:10.1038/ng.226 PMID:18794856
83. Hurst LD (2009). Fundamental concepts in genetics: genetics and the understanding of selection. Nat Rev Genet, 10:83–93. doi:10.1038/nrg2506 PMID:19119264
84. Stern DL, Orgogozo V (2009). Is genetic evolution predictable? Science, 323:746–751. doi:10.1126/science.1158997 PMID:19197055
85. Stern DL, Orgogozo V (2008). The loci of evolution: how predictable is genetic evolution? Evolution, 62:2155–2177. doi:10.1111/j.1558-5646.2008.00450.x PMID:18616572
86. Cooper TF, Ostrowski EA, Travisano M (2007). A negative relationship between mutation pleiotropy and fitness effect in yeast. Evolution, 61:1495–1499. doi:10.1111/j.1558-5646.2007.00109.x PMID:17542856
87. Eyre-Walker A, Keightley PD, Smith NGC, Gaffney D (2002). Quantifying the slightly deleterious mutation model of molecular evolution. Mol Biol Evol, 19:2142–2149. PMID:12446806
88. Charlesworth B (2009). Fundamental concepts in genetics: effective population size and patterns of molecular evolution and variation. Nat Rev Genet, 10:195–205. doi:10.1038/nrg2526 PMID:19204717
89. Relethford JH (2004). Global patterns of isolation by distance based on genetic and morphological data. Hum Biol, 76:499–513. doi:10.1353/hub.2004.0060 PMID:15754968
90. Ramachandran S, Deshpande O, Roseman CC et al. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci USA, 102:15942–15947. doi:10.1073/pnas.0507611102 PMID:16243969
91. Li JZ, Absher DM, Tang H et al. (2008). Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104. doi:10.1126/science.1153717 PMID:18292342
92. Rosenberg NA, Pritchard JK, Weber JL et al. (2002). Genetic structure of human populations. Science, 298:2381–2385. doi:10.1126/science.1078311 PMID:12493913
93. Romero IG, Manica A, Goudet J et al. (2009). How accurate is the current picture of human genetic variation? Heredity, 102:120–126. doi:10.1038/hdy.2008.89 PMID:18766200
94. Sabeti PC, Schaffner SF, Fry B et al. (2006). Positive natural selection in the human lineage. Science, 312:1614–1620. doi:10.1126/science.1124309 PMID:16778047
95. Sabeti PC, Varilly P, Fry B et al.; International HapMap Consortium (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449:913–918. doi:10.1038/nature06250 PMID:17943131
96. Tishkoff SA, Reed FA, Ranciaro A et al. (2007). Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet, 39:31–40. doi:10.1038/ng1946 PMID:17159977
97. Pompanon F, Bonin A, Bellemain E, Taberlet P (2005). Genotyping errors: causes, consequences and solutions. Nat Rev Genet, 6:847–859. doi:10.1038/nrg1707 PMID:16304600
98. Packer BR, Yeager M, Staats B et al. (2004). SNP500Cancer: a public resource for sequence validation and assay development for genetic variation in candidate genes. Nucleic Acids Res, 32 Database issue;D528–D532. doi:10.1093/nar/gkh005 PMID:14681474
99. Saiki RK, Scharf S, Faloona F et al. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science, 230:1350–1354. doi:10.1126/science.2999980 PMID:2999980
100. Livak KJ, Marmaro J, Todd JA (1995). Towards fully automated genome-wide polymorphism screening. Nat Genet, 9:341–342. doi:10.1038/ng0495-341 PMID:7795635
101. Brenan CJ (2002). DNA-based molecular lithography for nanoscale fabrication. IEEE Eng Med Biol Mag, 21:164. doi:10.1109/MEMB.2002.1175178 PMID:12613226
102. Frederickson RM (2002). Fluidigm. Biochips get indoor plumbing. Chem Biol, 9:1161–1162. PMID:12445764
103. Sun X, Ding H, Hung K, Guo B (2000). A new MALDI-TOF based mini-sequencing assay for genotyping of SNPS. Nucleic Acids Res, 28:E68. doi:10.1093/nar/28.12.e68 PMID:10871391
104. Cunningham JM, Sellers TA, Schildkraut JM et al. (2008). Performance of amplified DNA in an Illumina GoldenGate BeadArray assay. Cancer Epidemiol Biomarkers Prev, 17:1781–1789. doi:10.1158/1055-9965.EPI-07-2849 PMID:18628432
105. Berthier-Schaad Y, Kao WH, Coresh J et al. (2007). Reliability of high-throughput genotyping of whole genome amplified DNA in SNP genotyping studies. Electrophoresis, 28:2812–2817. doi:10.1002/elps.200600674 PMID:17702060
106. Thomas G, Jacobs KB, Yeager M et al. (2008). Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet, 40:310–315. doi:10.1038/ng.91 PMID:18264096
107. Matsuzaki H, Loi H, Dong S et al. (2004). Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Res, 14:414–425. doi:10.1101/gr.2014904 PMID:14993208
108. Al Olama AA, Kote-Jarai Z, Giles GG et al.; UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons’ Section of Oncology; UK Prostate testing for cancer and Treatment study (ProtecT Study) Collaborators (2009). Multiple loci on 8q24 associated with prostate cancer susceptibility. Nat Genet, 41:1058–1060. doi:10.1038/ng.452 PMID:19767752
109. Barrett JC, Cardon LR (2006). Evaluating coverage of genome-wide association studies. Nat Genet, 38:659–662. doi:10.1038/ng1801 PMID:16715099
110. McCarthy MI, Abecasis GR, Cardon LR et al. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet, 9:356–369. doi:10.1038/nrg2344 PMID:18398418
111. Kim Y, Feng S, Zeng ZB (2008). Measuring and partitioning the high-order linkage disequilibrium by multiple order Markov chains. Genet Epidemiol, 32:301–312. doi:10.1002/gepi.20305 PMID:18330903
112. Schaid DJ (2004). Genetic epidemiology and haplotypes. Genet Epidemiol, 27:317–320. doi:10.1002/gepi.20046 PMID:15543637
113. Ansorge WJ (2009). Next-generation DNA sequencing techniques. N Biotechnol, 25:195–203. doi:10.1016/j.nbt.2008.12.009 PMID:19429539
114. Collins FS, Morgan M, Patrinos A (2003). The Human Genome Project: lessons from large-scale biology. Science, 300:286–290. doi:10.1126/science.1084564 PMID:12690187
115. Sanger F, Nicklen S, Coulson AR (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA, 74:5463–5467. doi:10.1073/pnas.74.12.5463 PMID:271968

116. Margulies M, Egholm M, Altman WE et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437:376–380. PMID:16056220
117. Dressman D, Yan H, Traverso G et al. (2003). Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci USA, 100:8817–8822. doi:10.1073/pnas.1133470100 PMID:12857956
118. Van Tassell CP, Smith TPL, Matukumalli LK et al. (2008). SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods, 5:247–252. doi:10.1038/nmeth.1185 PMID:18297082
119. Cloonan N, Forrest ARR, Kolle G et al. (2008). Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods, 5:613–619. doi:10.1038/nmeth.1223 PMID:18516046
120. Chanock SJ, Manolio T, Boehnke M et al.; NCI-NHGRI Working Group on Replication in Association Studies (2007). Replicating genotype-phenotype associations. Nature, 447:655–660. doi:10.1038/447655a PMID:17554299
121. Yeager M, Deng Z, Boland J et al. (2009). Comprehensive resequence analysis of a 97 kb region of chromosome 10q11.2 containing the MSMB gene associated with prostate cancer. Hum Genet, 126:743–750. doi:10.1007/s00439-009-0723-9 PMID:19644707
122. Yeager M, Xiao N, Hayes RB et al. (2008). Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum Genet, 124:161–170. doi:10.1007/s00439-008-0535-3 PMID:18704501
123. Parikh H, Deng Z, Yeager M et al. (2010). A comprehensive resequence analysis of the KLK15-KLK3-KLK2 locus on chromosome 19q13.33. Hum Genet, 127:91–99. doi:10.1007/s00439-009-0751-5 PMID:19823874
124. Ahn J, Berndt SI, Wacholder S et al. (2008). Variation in KLK genes, prostate-specific antigen and risk of prostate cancer. Nat Genet, 40:1032–1034, author reply 1035–1036. doi:10.1038/ng0908-1032 PMID:19165914
125. Eeles RA, Kote-Jarai Z, Giles GG et al.; UK Genetic Prostate Cancer Study Collaborators; British Association of
127. Levy S, Sutton G, Ng PC et al. (2007). The diploid genome sequence of an individual human. PLoS Biol, 5:e254. doi:10.1371/journal.pbio.0050254 PMID:17803354
128. Wheeler DA, Srinivasan M, Egholm M et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature, 452:872–876. doi:10.1038/nature06884 PMID:18421352
129. Mortazavi A, Williams BA, McCue K et al. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5:621–628. doi:10.1038/nmeth.1226 PMID:18516045
130. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995). Serial analysis of gene expression. Science, 270:484–487. doi:10.1126/science.270.5235.484 PMID:7570003
131. Sultan M, Schulz MH, Richard H et al. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321:956–960. doi:10.1126/science.1160342 PMID:18599741
132. Robertson G, Hirst M, Bainbridge M et al. (2007). Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods, 4:651–657. doi:10.1038/nmeth1068 PMID:17558387
133. Hugenholtz P, Tyson GW (2008). Microbiology: metagenomics. Nature, 455:481–483. doi:10.1038/455481a PMID:18818648
134. Turnbaugh PJ, Ley RE, Hamady M et al. (2007). The human microbiome project. Nature, 449:804–810. doi:10.1038/nature06244 PMID:17943116
135. Bergen AW, Qi Y, Haque KA et al. (2005). Effects of DNA mass on multiple displacement whole genome amplification and genotyping performance. BMC Biotechnol, 5:24. doi:10.1186/1472-6750-5-24 PMID:16168060
136. Haque KA, Pfeiffer RM, Beerman MB et al. (2003). Performance of high-throughput DNA quantification methods. BMC Biotechnol, 3:20. doi:10.1186/1472-6750-3-20 PMID:14583097
137. Fire A, Xu SQ (1995). Rolling replication of short DNA circles. Proc Natl Acad Sci USA, 92:4641–4645. doi:10.1073/pnas.92.10.4641 PMID:7753856
138.
140. Lasken RS (2009). Genomic DNA amplification by the multiple displacement amplification (MDA) method. Biochem Soc Trans, 37:450–453. doi:10.1042/BST0370450 PMID:19290880
141. Purcell S, Neale B, Todd-Brown K et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81:559–575. doi:10.1086/519795 PMID:17701901
142. Hunter DJ, Thomas G, Hoover RN, Chanock SJ (2007). Scanning the horizon: what is the future of genome-wide association studies in accelerating discoveries in cancer etiology and prevention? Cancer Causes Control, 18:479–484. doi:10.1007/s10552-007-0118-y PMID:17440825
143. Hall JM, Lee MK, Newman B et al. (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science, 250:1684–1689. doi:10.1126/science.2270482 PMID:2270482
144. Wooster R, Bignell G, Lancaster J et al. (1995). Identification of the breast cancer susceptibility gene BRCA2. Nature, 378:789–792. doi:10.1038/378789a0 PMID:8524414
145. Stratton MR, Rahman N (2008). The emerging landscape of breast cancer susceptibility. Nat Genet, 40:17–22. doi:10.1038/ng.2007.53 PMID:18163131
146. McLendon R, Friedman A, Bigner D et al.; Cancer Genome Atlas Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455:1061–1068. doi:10.1038/nature07385 PMID:18772890
147. Gabriel SB, Schaffner SF, Nguyen H et al. (2002). The structure of haplotype blocks in the human genome. Science, 296:2225–2229. doi:10.1126/science.1069424 PMID:12029063
148. Carlson CS, Eberle MA, Rieder MJ et al. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet, 74:106–120. doi:10.1086/381000 PMID:14681826
Dean FB, Nelson JR, Giesler TL, Lasken Urological Surgeons’ Section of Oncology; UK RS (2001). Rapid amplification of plasmid and ProtecT Study Collaborators (2008). Multiple phage DNA using Phi 29 DNA polymerase and newly identified loci associated with prostate multiply-primed rolling circle amplification. cancer susceptibility. Nat Genet, 40:316–321. Genome Res, 11:1095–1099.doi:10.1101/ doi:10.1038/ng.90 PMID:18264097 gr.180501 PMID:11381035 126. Ng SB, Turner EH, Robertson PD et 139. Bergen AW, Haque KA, Qi Y et al. al. (2009). Targeted capture and massively (2005). Comparison of yield and genotyping parallel sequencing of 12 human exomes. performance of multiple displacement Nature, 461:272–276.doi:10.1038/nature082 amplification and OmniPlex whole genome 50 PMID:19684571 amplified DNA generated from multiple DNA sources. Hum Mutat, 26:262–270.doi:10. 1002/humu.20213 PMID:16086324
