
NEWS AND VIEWS

© 2007 Nature Publishing Group http://www.nature.com/naturegenetics

Sample size dictates inference dogma

The technical revolution that made the current data boom possible has matured; laboratories now routinely generate millions of genotypes per day. However, more data can be a problem when we do not have the discipline to apply the scientific method. Replication is also a problem when working with effects so small that replication success could not be reasonably expected, even if the marker is associated with the disease.

Larger sample size is the mantra of the hour. Experimentalists and funding agencies no longer scoff at theoretical analyses showing that thousands of cases and controls are needed to demonstrate modest genetic associations (Fig. 1). Rather, risk variants discovered in the recent rash of relatively , successful genome scans settle the importance of large samples by direct demonstration. The apparent exception of complement factor H (CFH) in age-related macular degeneration escapes this problem only by the very unusual magnitude of the effect (with an odds ratio of ∼7)6. Small effect sizes (with odds ratios <1.2) require much larger sample sizes than are readily available for an initial study, considering also the many additional samples needed for multiple, independent replication studies.

Although discerning a genotype is much less difficult today than previously, collecting the samples needed is becoming increasingly difficult and expensive. Controversy begets regulation, and regulatory compliance has become predatory. Permission to perform human subject research is soaked in carefully wrought legalities that subjects generally ignore. This social choice, along with the small effect sizes anticipated, dictates that genetics projects will dedicate an increasing proportion of resources to sample collection, rather than to data generation or analysis. The open question is whether the community of funding agencies, peer reviewers and regulatory bodies will be prepared to encourage and support the assembly of subjects and materials needed to accomplish these goals.

Candidate success

Rather than being discovered initially through a genome-wide scan of hundreds of thousands of genetic markers, which is now the , IL7R is a candidate gene that refused to quit. Certainly, the first paper on IL7R polymorphisms in multiple sclerosis7 would not have satisfied the stringent replication criteria advocated by a recent National Cancer Institute–National Human Genome Research Institute working group8. Nevertheless, the current studies2–4 establishing the IL7R association with multiple sclerosis are an example of persistence paying off. The studies also show that thought and insight are valuable and important, even in an era when high-throughput data generation and brute-force approaches dominate.

The hubbub over genetic association studies, the problem of inconsistent replication and the obvious false-positive associations that accompany high-throughput genotyping technologies obscure the conceptual simplicity of the scientific method. What we cannot replicate cannot be made dogma. As scientists, to accept a relationship, we must find it again through independent, careful evaluation. As the modest effect size for IL7R in multiple sclerosis attests, those doing genetic association studies in complex diseases have been working too long in a statistically underpowered universe.

The selection of candidate genes using multiple sources of data is a strategy termed “genomic convergence”9,10. The application of different experimental approaches provides insight into mechanism and helps specify the genetic model. In fact, genomic convergence is a specific embodiment of the concept of ‘consilience of inductions’, formulated by William Whewell in the 1840s11, which says that valid inductions will be supported by data from many different experimental approaches.

Although they benefit greatly from the impressive acceleration provided by genome-wide scans, virtually all genetic association disease studies ultimately become candidate gene projects. Additional susceptibility in (siblings of HLA-DRB2, IL7R and, probably, IL2RA4) might now stand above the background noise so that they may be seen. We know they must be there.

COMPETING INTERESTS STATEMENT
The author declares no competing financial interests.

1. Kenealy, S.J., Pericak-Vance, M.A. & Haines, J.L. J. Neuroimmunol. 143, 7–12 (2003).
2. Gregory, S.G. et al. Nat. Genet. 39, 1083–1091 (2007).
3. Lundmark, F. et al. Nat. Genet. 39, 1108–1113 (2007).
4. The International Multiple Sclerosis Genetics Consortium. N. Engl. J. Med., published online 29 July 2007 (doi:10.1056/NEJMoa073493).
5. Silman, A.J. et al. Br. J. Rheumatol. 32, 903–907 (1993).
6. Klein, R.J. et al. Science 308, 385–389 (2005).
7. Teutsch, S.M., Booth, D.R., Bennetts, B.H., Heard, R.N. & Stewart, G.J. Eur. J. Hum. Genet. 11, 509–515 (2003).
8. NCI-NHGRI Working Group on Replication in Association Studies et al. Nature 447, 655–660 (2007).
9. Hauser, M.A. et al. Hum. Mol. Genet. 12, 671–676 (2003).
10. Noureddine, M.A. et al. Mov. Disord. 20, 1299–1309 (2005).
11. Whewell, W. in Theory of Scientific Method (ed. Butts, R.) (Hackett Publishing, Indianapolis, Indiana, 1989).
12. Skol, A.D. et al. Nat. Genet. 38, 209–213 (2006).
13. Gordon, D. et al. Hum. Hered. 54, 22–33 (2002).
14. Gordon, D. et al. Pac. Symp. Biocomput. 490–501 (2003).
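The sample-size argument in the commentary above (odds ratios below 1.2 versus an odds ratio of about 7 for CFH) can be illustrated with a rough power calculation. The sketch below is not taken from the article: it uses the standard normal-approximation formula for comparing two proportions, applied to allele counts in cases and controls, and the control allele frequency (0.3), significance threshold (1e-7) and power (0.8) are hypothetical values chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    return odds_ratio * control_freq / (1.0 + control_freq * (odds_ratio - 1.0))


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of cases, assuming an equal number of controls.

    Normal-approximation sample size for a two-sided comparison of two
    proportions, applied to allele counts (two alleles per person).
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_beta = std_normal.inv_cdf(power)
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return ceil(alleles_per_group / 2.0)  # convert alleles to individuals


if __name__ == "__main__":
    # Hypothetical inputs: control allele frequency 0.3, threshold 1e-7.
    for odds_ratio in (1.2, 7.0):
        n = cases_needed(control_freq=0.3, odds_ratio=odds_ratio, alpha=1e-7)
        print(f"OR={odds_ratio}: about {n} cases and {n} controls")
```

Under these assumed inputs the modest odds ratio calls for thousands of cases and as many controls, whereas the very large one needs only a few dozen, which is the contrast the text draws. A real study design would also have to account for the genetic model, genotyping error and the number of markers tested.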

A haplotype map for the laboratory

Richard Mott

Two reports present detailed analyses of the haplotype structure of widely used laboratory mice based on resequencing data from 15 inbred strains. The studies provide the deepest view thus far of the patterns of genetic variation segregating in the inbred lines, and have implications for the design of complex trait mapping studies in mice.

Richard Mott is at the Wellcome Trust Centre for Human Genetics, University of Oxford, Oxfordshire OX3 7BN, UK. e-mail: [email protected]

The mouse is the primary model of human disease. About 90% of human genes have an ortholog in the mouse, and in cases where a human gene is known to be associated with a mendelian disease, the knockout of the mouse ortholog will often produce a similar phenotype. Just as importantly, we gain insights into human multifactorial diseases by examining the phenotypic consequences of naturally occurring genetic variation in the mouse. This variation is captured in the polymorphisms present among inbred mouse strains that, together with the intercrosses, heterogeneous stocks, recombinant inbred lines and other genetic reference populations derived from them, are the workhorses for the dissection of complex traits in the mouse1. Recently, the National Institute of Environmental Health Sciences (NIEHS) contracted Perlegen Sciences to resequence the

1054 VOLUME 39 | NUMBER 9 | SEPTEMBER 2007 | NATURE GENETICS NEWS AND VIEWS


Figure 1 The genetic variation between these three classical inbred strains of mice is explained by the variation observed between wild-derived inbred strains. (a) C57BL/6J (the reference-sequence strain). (b) 129S1/SvImJ. (c) DBA/2J. All photographs were provided by Joyce Peterson at The .

genomes of 15 inbred mouse strains, comprising 11 classical strains and 4 wild-derived strains corresponding to the subspecies Mus musculus domesticus, M. m. musculus and M. m. castaneus and the natural hybrid M. m. molossinus. Frazer et al.2, reporting in Nature, now present the first description and analysis of this data set, which they generated and released publicly last fall. In a related study on page 1100 of this issue, Yang et al.3 present an independent analysis of the same data set. Together, these analyses give fresh insight into the haplotype structure of widely used inbred mouse strains.

Unnatural history and haplotype structure

Mouse inbred strains are grouped into the ‘classical’ strains (Fig. 1), which include C57BL/6J, whose genome has been sequenced as a reference4, that were mostly derived before 1950 and whose detailed ancestry was not recorded, and the ‘wild’ strains, which descended more recently from different subspecies of mice trapped in the wild. Most mouse genetics is performed using the classical strains. The precise number of independent strains available depends on how closely related strains are grouped together, but there are about 40 that are sufficiently distinct to be worth phenotyping systematically (see http://www.jax.org/phenome).

To make full use of the inbred strains, we need a genome-wide haplotype map of the mouse and, ideally, the complete sequence of each strain. Their essentially clonal nature makes such an endeavor very cost-effective, as the information is reusable in an unlimited number of experiments, either directly when inbred strains are phenotyped, or by statistical imputation when crosses are used5.

Because of the unnatural man-made history of the inbred strains, their haplotype structure does not resemble that of humans or of wild mice. Inbred strains are constructed by repeated brother-sister mating over many generations. Inevitably, selection will eliminate recessive lethal alleles, creating loci that are identical by descent (IBD) in all strains, and fixing a limited range of haplotypes elsewhere in the genome. Moreover, the genome of any animal descended from inbred strains will be a mosaic of the haplotypes of the founders. Unlike in humans, where resequencing candidate genes among many individuals uncovers rare variants that may be disease related, the frequency of rare variants in mice descended from these strains should be negligible, limited to the few spontaneous mutations that can accumulate in a small number of generations. In other words, in these mice, the ‘common disease/common variant’ hypothesis is true by construction.

Starting with the report by Wade et al.6, the haplotype structure of the mouse inbred strains has gradually become clearer. It was known that the classical strains showed less genotypic and phenotypic variation compared to the wild strains, suggesting that the former descended from a limited pool of founder genomes. Detailed but localized resequencing of classical strains at several loci7–9 suggested the ancestry of the classical strains could not be explained by a single phylogenetic tree but rather by a mosaic of trees across the genome. However, the definitive relationships between the classical strains, and with the wild-derived strains, were unknown.

Characterizing patterns of variation

The studies by Frazer et al.2 and Yang et al.3 present detailed analyses of the newly generated NIEHS data set, reporting qualitatively similar findings but with some quantitative differences. First, the 8.23 million SNPs identified (92% of which are novel) are only a subset of the SNPs expected to be present, as the array-based technology used limits coverage to the 58% of the C57BL/6J reference genome that is nonrepetitive, and the genotype-calling algorithm is highly conservative. Based on a comparison with SNPs segregating between C57BL/6J and the classical strain A/J from Celera data, Frazer et al.2 estimate that 43% of SNPs segregating between the classical strains have been discovered, in the sense of being observed in at least one of the classical strains. Yang et al.3 predict a similar false-negative rate among the classical strains but estimate that the rate among the wild-derived strains is higher, principally because the chance that an allele is observed at least once depends on its frequency, so alleles that are private to a subspecies are less likely to be observed; they suggest there may be up to 45 million SNPs segregating between all the strains.

Second, variation between the 11 classical strains captures only 41% of the total number of observed SNPs, which is an underestimate of the total variation in the wild-derived strains. Third, the studies estimate that roughly 76% of the genome of each classical strain originated from M. m. domesticus, roughly 5% from M. m. musculus and roughly 3% from M. m. castaneus, with the remainder being of uncertain ancestry, where the wild-derived strains show evidence of introgression. Yang et al.3 suggest that these regions are in fact 67% domesticus, raising the total fraction of the domesticus contribution to just over 90%. Lastly, patterns of ancestry vary, even within the same . On average, at each locus, there are between two and four distinct haplotypes present in the classical strains. When pairs of classical strains are compared, as would be the case in a quantitative trait locus mapping experiment using an F2 intercross, between 36% and 55% of the genome is IBD, where no quantitative trait locus can segregate. Taking all classical strains together, 11% of the genome is IBD.

It is also worth noting that the two studies used different methods to construct genome-wide phylogenetic mosaics. Frazer et al.2 divided the genome into 40,000 segments (ranging from 1 kb to 3 Mb, with an average width of 58 kb) such that the phylogenetic tree relating the changes in the classical strains was constant within each segment, and segment breakpoints were defined whenever any pairwise strain comparison indicated a switch from identity to difference. By contrast, Yang et al.3 first tiled the genome into segments 100 kb wide, then computed the best phylogenetic tree in each segment and finally merged adjacent segments with the same tree.

These conclusions suggest that the several thousand quantitative trait loci thus far mapped in crosses between classical strains of mice are the tip of the iceberg, and many more will be discovered segregating in the wild-derived strains. This makes the construction and widespread use of genetic reference populations that include the wild-derived strains, such as the collaborative cross10, a key to progress. It


also suggests that newly emerging resequencing methods should be used to create the definitive haplotype map of the mouse, including not only SNPs but also copy number variants and genome rearrangements.

COMPETING INTERESTS STATEMENT
The author declares no competing financial interests.

1. Flint, J., Valdar, W., Shifman, S. & Mott, R. Nat. Rev. Genet. 6, 271–286 (2005).
2. Frazer, K.A. et al. Nature, advance online publication 29 July 2007 (doi:10.1038/nature06067).
3. Yang, H., Bell, T.A., Churchill, G.A. & Pardo-Manuel de Villena, F. Nat. Genet. 39, 1100–1107 (2007).
4. Waterston, R.H. et al. Nature 420, 520–562 (2002).
5. Mott, R., Talbot, C.J., Turri, M.G., Collins, A.C. & Flint, J. Proc. Natl. Acad. Sci. USA 97, 12649–12654 (2000).
6. Wade, C.M. et al. Nature 420, 574–578 (2002).
7. Yalcin, B. et al. Proc. Natl. Acad. Sci. USA 101, 9734–9739 (2004).
8. Frazer, K.A. et al. Genome Res. 14, 1493–1500 (2004).
9. Zhang, J. et al. Genome Res. 15, 241–249 (2005).
10. Churchill, G.A. et al. Nat. Genet. 36, 1133–1137 (2004).

An Arabidopsis haplotype map takes root

Edward Buckler & Michael Gore

The report of a haplotype map for the selfing plant Arabidopsis thaliana has uncovered numerous major-effect polymorphisms and rapid linkage disequilibrium decay. This work lays the foundation for genome-wide association studies at near-gene-level resolution in a possessing substantial functional diversity and extensive community resources.

Edward Buckler is in the US Department of Agriculture–Agricultural Research Service, Ithaca, New York 14853, USA, and the Department of Plant Breeding and Genetics, Cornell University, Ithaca, New York 14853, USA. Michael Gore is in the Department of Plant Breeding and Genetics, Cornell University, Ithaca, New York 14853, USA. e-mail: [email protected]

Systematic mutagenesis, the use of large linkage populations and positional cloning are common approaches for dissecting complex traits in plants. In human genetics, recent genome-wide association studies have mapped common complex diseases, but the plant genetics community has been awaiting a haplotype map comparable to the human HapMap to make a similar approach feasible in plants. Clark et al.1 recently reported a haplotype map for the predominantly self-pollinating (‘selfing’) species Arabidopsis thaliana, and in a companion paper in this issue, Kim et al. (page 1151)2 characterize linkage disequilibrium (LD) in these strains. This extensive catalog of polymorphisms1 and detailed examination of LD2 set the stage for high-resolution genome-wide association scans in this model selfing species.

A. thaliana diversity and LD

Laying the foundation for an A. thaliana haplotype map, Clark et al.1 conducted a thorough array resequencing of 20 diverse A. thaliana genomes at single-base resolution. This provided a powerful catalog of genetic diversity, with more than 1 million SNPs and hypervariable regions (50-bp to >10-kb deletions and SNP clusters). More than 100,000 amino acid changes were identified, along with nearly 2,500 polymorphisms that should radically alter transcript or protein structure. Overall, they discovered an average of one polymorphism every 166 bp. For comparison, this polymorphism rate is 11 times that found in a human haplotype map study done with a comparable platform3. This is a tremendous amount of functional variation, which has certainly had a role in the adaptation of A. thaliana to environments throughout the globe.

In a companion study, Kim et al. characterize LD in these A. thaliana collections2. Theory suggests that selfing species should have higher levels of LD across the genome, as there are few opportunities for effective recombination4. In outcrossing species with large population sizes like that of and Drosophila melanogaster, rapid LD decay provides sub-gene-level mapping resolution5. However, the LD structure of selfing species with large population sizes has been an open question. Initial A. thaliana surveys, at the FRI locus, suggested extensive LD (up to 250 kb)6, and a low-density genome scan suggested that LD decayed within 20–40 kb7. Kim et al. now show that LD decays substantially within 10 kb and sometimes within 3–4 kb for a very diverse subset of ecotypes2. So despite being predominantly selfing, A. thaliana must historically have had enough sex and recombination to break down linkage blocks. Kim et al. also found hotspots of LD decay that showed moderate to weak correlations with the level of polymorphism, recombination and GC content, although the mechanisms controlling regions of high and low recombination were not clear2. Larger sample sizes will be necessary to understand these processes, as even repeat content had little correlation, despite it being very important in other plant species. The low degree of LD found in these studies is exciting, as it suggests that diverse A. thaliana lines may provide near-gene-level resolution for association mapping.

The future of association mapping

These studies pave the way for developing a powerful association mapping panel for A. thaliana. In fact, these groups are now developing a tiling array to score an association panel of 1,000 ecotypes. One key question for this future genotyping effort is how many SNPs will be needed to provide markers in high LD with all regions of the genome. Kim et al. convincingly show that 200,000 tag SNPs should be sufficient to cover almost all of the genome2. A second matter is ascertainment bias, as these resequencing arrays were designed from a single sequenced reference genome. With resequencing arrays, multiple adjacent polymorphisms cause a decrease in hybridization signal intensity, thus preventing many of the highly polymorphic regions from being characterized at single-nucleotide resolution. In the current studies, this resulted in only 27% of the total polymorphisms being scored in a given line1. Additionally, although non-allelic structural polymorphisms are probably common8, these have not been comprehensively catalogued or discovered by this approach, which will probably require next-generation sequencing approaches to identify these potentially important variants. These tiling arrays, when combined with the recently developed analysis methods that account for complex population structures9,10, will allow powerful full-genome association scans.

Understanding the phenotypic importance of rare SNPs is also critical for association studies and genetic architecture analyses. In this sample of 20 A. thaliana ecotypes, nearly 50% of the polymorphisms were unique to an ecotype2. Do these rare alleles have phenotypic effects? If these rare SNPs have important phenotypic effects, then alternative mapping strategies may be needed. Currently, the phenotypic
