<<

Vol 461 | 10 September 2009 | doi:10.1038/nature08250 LETTERS

Targeted capture and massively parallel of 12 human

Sarah B. Ng1, Emily H. Turner1, Peggy D. Robertson1, Steven D. Flygare1, Abigail W. Bigham2, Choli Lee1, Tristan Shaffer1, Michelle Wong1, Arindam Bhattacharjee4, Evan E. Eichler1,3, Michael Bamshad2, Deborah A. Nickerson1 & Jay Shendure1

Genome-wide association studies suggest that common genetic in MYH3 (ref. 5). Unpaired, 76 base-pair (bp) reads12 variants explain only a modest fraction of heritable risk for com- from post-enrichment shotgun libraries were aligned to the reference mon diseases, raising the question of whether rare variants account genome13. On average, 6.4 gigabases (Gb) of mappable sequence was for a significant fraction of unexplained heritability1,2. Although generated per individual (20-fold less than whole sequencing DNA sequencing costs have fallen markedly3,theyremainfarfrom with the same platform12), and 49% of reads mapped to targets what is necessary for rare and novel variants to be routinely iden- (Supplementary Table 1). After removing duplicate reads that tified at a genome-wide scale in large cohorts. We have therefore represent potential polymerase chain reaction artefacts14, the average sought to develop second-generation methods for targeted sequen- fold-coverage of each was 513 (Supplementary Fig. 1). On cing of all -coding regions (‘exomes’), to reduce costs while average per exome, 99.7% of targeted bases were covered at least enriching for discovery of highly penetrant variants. Here we report once, and 96.3% (25.6 Mb) were covered sufficiently for variant call- on the targeted capture and massively parallel sequencing of the ing ($83 coverage and Phred-like15 consensus quality $30). This exomes of 12 humans. These include eight HapMap individuals corresponded to 78% of genes having .95% of their coding bases representing three populations4, and four unrelated individuals called (Supplementary Fig. 2 and Supplementary Data 2). The aver- with a rare dominantly inherited disorder, Freeman–Sheldon syn- age pairwise correlation coefficient between individuals for gene-by- drome (FSS)5. We demonstrate the sensitive and specific identifica- gene coverage was 0.87, consistent with systematic bias in coverage tion of rare and common variants in over 300 megabases of coding between individual exomes. sequence. Using FSS as a proof-of-concept, we show that candidate False positives and false negatives are critical issues in genomic genes for Mendelian disorders can be identified by exome sequen- resequencing. We assessed the quality of our exome data in four ways. cing of a small number of unrelated, affected individuals. This First, comparing sequence-based calls for the eight HapMap exomes strategy may be extendable to diseases with more complex genetics to array-based , we observed a high concordance with through larger sample sizes and appropriate weighting of non- both homozygous (99.94%; n 5 219,077) and heterozygous synonymous variants by predicted functional impact. (99.57%; n 5 43,070) genotypes (Table 1). Second, we compared Protein-coding regions constitute ,1% of the or our coding single-nucleotide polymorphism (cSNP) catalogue to ,30 megabases (Mb), split across ,180,000 . A brute-force ,1 Mb of coding sequence determined in each of the eight approach to with conventional technology6 is HapMap individuals by molecular inversion probe (MIP) capture expensive relative to what may be possible with second-generation and direct resequencing16. At coordinates called in both data sets, platforms3. However, the efficient isolation of this fragmentary geno- 99.9% of all cSNPs (n 5 4,620) and 100% of novel cSNPs mic subset is technically challenging7. The enrichment of an exome (n 5 334) identified here were concordant, consistent with a low false by hybridization of shotgun libraries constructed from 140 mgof discovery rate. Third, we compared the NA18507 cSNPs identified genomic DNA to seven microarrays was described previously8.To here to those called by recent of this improve the practicality of hybridization capture, we developed a individual12, and found substantial overlap (Supplementary Fig. 3). protocol to enrich for coding sequences at a genome-wide scale start- The relative numbers of cSNPs called by only one approach, and the ing with 10 mg of DNA and using two microarrays. Our initial target proportions of these represented in dbSNP, indicate that exome was 27.9 Mb of coding sequence defined by CCDS (the NCBI sequencing has equivalent sensitivity for cSNP detection compared Consensus Coding Sequence database)9. This curated set avoids the to whole genome sequencing. Fourth, we compared our data to inclusion of spurious hypothetical genes that contaminate broader cSNPs in high-quality Sanger sequence of single haplotype regions exome definitions10. The target is reduced to 26.6 Mb on exclusion of from fosmid clones of the same HapMap individuals17. Most fosmid- regions that are poorly mapped with our anticipated read length defined cSNPs (38 of 40) were at coordinates with sufficient coverage owing to paralogous sequences elsewhere in the genome in our data for variant calling. Of these, 38 of 38 were correctly (Supplementary Data 1). identified as variant. We captured and sequenced the exomes of eight individuals previ- A comparison of our data to past reports on exonic18 or exomic8 ously characterized by the HapMap4 and Human Genome Structural array-based capture revealed roughly equivalent capture specificity, Variation11 projects. We also analysed four unrelated individuals but greater completeness in terms of coverage and variant calling affected with Freeman–Sheldon syndrome (FSS; Online Mendelian (Supplementary Table 2). These improvements probably arise from Inheritance in Man (OMIM) #193700), also called distal arthro- a combination of greater sequencing depth and differences in array gryposis type 2A, a rare autosomal dominant disorder caused by designs and in experimental conditions for capture. Within the set of

1Department of Genome Sciences, 2Department of Pediatrics, University of Washington, 3Howard Hughes Medical Institute, Seattle, Washington 98195, USA. 4Agilent Technologies, Santa Clara, California 95051, USA. 272 ©2009 Macmillan Publishers Limited. All rights reserved NATURE | Vol 461 | 10 September 2009 LETTERS

Table 1 | Sequence coverage and array-based validation Concordance with Illumina Human1M-Duo calls

Individual Covered $13 Sequence called Homozygous reference Heterozygous Homozygous non-reference NA18507 (YRI) 26,477,161 (99.7%) 25,795,189 (97.1%) 23757/23762 (99.98%) 5553/5583 (99.46%) 3582/3592 (99.72%) NA18517 (YRI) 26,476,761 (99.7%) 25,748,289 (97.0%) 23701/23705 (99.98%) 5575/5601 (99.54%) 3568/3579 (99.69%) NA19129 (YRI) 26,491,035 (99.8%) 25,733,587 (96.9%) 23701/23708 (99.97%) 5482/5510 (99.49%) 3681/3690 (99.76%) NA19240 (YRI) 26,486,481 (99.7%) 25,576,517 (96.3%) 23546/23551 (99.98%) 5600/5634 (99.40%) 3542/3549 (99.80%) NA18555 (CHB) 26,475,665 (99.7%) 25,529,861 (96.1%) 23980/23984 (99.98%) 4877/4893 (99.67%) 3776/3786 (99.74%) NA18956 (JPT) 26,454,942 (99.6%) 25,683,248 (96.7%) 24217/24221 (99.98%) 4890/4910 (99.59%) 3751/3760 (99.76%) NA12156 (CEU) 26,476,155 (99.7%) 25,360,704 (95.5%) 23789/23794 (99.98%) 5493/5514 (99.62%) 3206/3213 (99.78%) NA12878 (CEU) 26,439,953 (99.6%) 25,399,572 (95.6%) 23885/23891 (99.97%) 5413/5425 (99.78%) 3274/3292 (99.45%) FSS10066 (Eur) 26,467,140 (99.7%) 25,546,738 (96.2%) NA NA NA FSS10208 (Eur) 26,461,768 (99.6%) 25,576,256 (96.3%) NA NA NA FSS22194 (Eur) 26,426,401 (99.5%) 25,454,551 (95.9%) NA NA NA FSS24895 (Eur) 26,478,775 (99.7%) 25,602,677 (96.4%) NA NA NA The number of coding bases covered at least 13 and with sufficient coverage to variant call ($83 and consensus quality $30) are listed for each exome, with the fraction of the aggregate target (26.6 Mb) that this represents in parentheses. For the eight HapMap individuals, concordance with array genotyping (Illumina Human1M-Duo) is listed for positions that are homozygous for the reference allele, heterozygous or homozygous for the non-reference allele (according to the array genotype). CEU, CEPH HapMap; CHB, Chinese HapMap; Eur, European–American ancestry (non- HapMap); JPT, Japanese HapMap; YRI, Yoruba HapMap. NA, Not applicable. called positions, the high concordance with heterozygous array- (n 5 47,079), 32% of annotated variants and 86% of novel variants based genotypes (.99%) provides an estimate of our sensitivity for were singleton observations across 24 chromosomes (Fig. 1a). rare variant detection, as rare variants are overwhelmingly expected We also estimated the total number of cSNPs in each individual to be heterozygous. However, sensitivity was limited in that ,4% of relative to the reference genome (Table 2b). As the precise and com- known heterozygous genotypes were at coordinates where there was prehensive definition of the human exome remains incomplete, we insufficient coverage to make a confident call. extrapolated our data to an estimated exome size of exactly 30 Mb. There were 56,240 cSNPs called in one or more individuals, of The results were remarkably consistent by population. As expected, a which 13,347 were novel. On average, 17,272 cSNPs were called per higher number of non-synonymous cSNPs were estimated for the individual, of which 92% were already annotated in a public database Yoruba individuals (average 10,254; n 5 4) than non-Africans (aver- (dbSNP v129) (Table 2a). The proportion of previously annotated age 8,489; n 5 8). More heterozygous cSNPs were estimated for the cSNPs was consistent by population, and higher for European (94%; four Yoruba (average 14,995) than the six European Americans n 5 6) and Asian (93%; n 5 2) than Yoruba (88%; n 5 4) ancestry. (average 11,586) and the two Asians (average 10,631). The ratio of These confirmation rates are ,10% higher than recent whole genome synonymous to non-synonymous cSNPs was 1.2 within any single analyses12,19–22. The most likely explanation is that coding sequences individual, and 1.1 when calculated for a non-redundant list of have historically been more heavily ascertained than noncoding variants identified across all individuals. The difference results from sequences, although other factors such as dbSNP version, prior ascer- the slightly shifted allele frequency distribution of non-synonymous tainment of HapMap individuals and different false discovery rates variants (Fig. 1b). Consistent with expectation23, the trend is more may contribute as well. For the subset of cSNPs at coordinates pronounced for non-synonymous variants predicted to be damaging with sufficient coverage for variant calling in all 12 individuals (by PolyPhen24) (Fig. 1c).

Table 2 | Coding variation across 12 human exomes a Summary statistics for observed cSNPs

Individual cSNP calls Number in dbSNP Percentage in dbSNP Number heterozygous Number homozygous NA18507 (YRI) 19,720 17,577 89.112,896 6,824 NA18517 (YRI) 19,737 17,326 87.813,039 6,698 NA19129 (YRI) 19,761 17,298 87.512,845 6,916 NA19240 (YRI) 19,517 17,168 88.012,866 6,651 NA18555 (CHB) 16,047 14,894 92.89,181 6,866 NA18956 (JPT) 16,011 14,848 92.79,132 6,879 NA12156 (CEU) 16,119 15,250 94.610,179 5,940 NA12878 (CEU) 15,970 15,051 94.29,928 6,042 FSS10066 (Eur) 16,229 15,144 93.310,240 5,989 FSS10208 (Eur) 16,073 15,018 93.49,966 6,107 FSS22194 (Eur) 16,094 15,128 94.010,005 6,089 FSS24895 (Eur) 15,986 15,027 94.09,920 6,066 b Genome-wide cSNP estimates assuming a 30 Mb exome

Individual Estimated total cSNPs Estimated total heterozygous Estimated total homozygous Estimated total synonymous Estimated total non-synonymous NA18507 (YRI) 22,727 14,876 7,851 12,466 10,261 NA18517 (YRI) 22,841 15,135 7,706 12,550 10,291 NA19129 (YRI) 22,907 14,906 8,001 12,693 10,214 NA19240 (YRI) 22,814 15,063 7,751 12,565 10,249 NA18555 (CHB) 18,722 10,677 8,045 10,275 8,447 NA18956 (JPT) 18,523 10,585 7,938 10,072 8,451 NA12156 (CEU) 18,825 11,818 7,007 10,220 8,605 NA12878 (CEU) 18,544 11,455 7,089 10,110 8,434 FSS10066 (Eur) 18,836 11,795 7,041 10,240 8,596 FSS10208 (Eur) 18,591 11,444 7,147 10,075 8,516 FSS22194 (Eur) 18,667 11,539 7,128 10,144 8,523 FSS24895 (Eur) 18,508 11,466 7,042 10,169 8,339 For part a, cSNPs called in each individual, relative to the reference genome, are broken down by the fraction in dbSNP and by genotype. Part b shows extrapolation of observed numbers of cSNPs in each individual to an exactly 30 Mb exome. CEU, CEPH HapMap; CHB, Chinese HapMap; Eur, European–American ancestry (non-HapMap); JPT, Japanese HapMap; YRI, Yoruba HapMap. 273 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 461 | 10 September 2009 a b set (two or more observations), the average number of relatively rare 12,000 12,000 frameshifting indels identified per individual was 8 for non-Africans 10,000 10,000 Annotated Synonymous (n 5 8) and 17 for Yoruba (n 5 4). 8,000 Novel 8,000 Non-synonymous The number of synonymous, missense, nonsense, splice site, fra- 6,000 6,000 meshifting indel and non-frameshifting indel variants observed in

4,000 4,000 each individual (as well as the size of the subsets that are novel and

2,000 2,000 singleton observations) is presented in Supplementary Table 4. Also Number of variants Number of variants

0 0 shown are the average numbers of variants of each class for non- 123456789101112 123456789101112 Number of observations of minor allele Number of observations of minor allele Africans and Yoruba. c d Phenotypes inherited in an apparently Mendelian pattern often lack sufficiently sized pedigrees to pinpoint the causal locus. We 70 100 evaluated whether exome sequencing could be applied to identify 60 Synonymous 90 80 n directly the causative gene underlying a monogenic human disease Benign non-synonymous 3 indels 50 70 n Possibly damaging non-synonymous Non-3 indels (FSS), that is, with neither linkage data nor candidate gene analysis. 60 40 Probably damaging non-synonymous 50 Even in this simple scenario for ‘whole exome/genome genetics’, the 30 40 key challenge that arises immediately is that the large number of 20 30 20 apparently private mutations present by chance in any single human 10 10

0 Average number of indels 0 genome makes it difficult to identify which variant is causal, even 1 23456789101112 123456789101112131415

Fraction of each variant class (%) Number of observations of minor allele Length of coding indels (bases) when only considering non-synonymous variants. This hurdle was overcome recently in the context of hereditary by Figure 1 | Minor allele frequency and coding indel length distributions. restricting focus only to nonsense mutations and also resequencing a, The distribution of minor allele frequencies is shown for previously annotated versus novel cSNPs. b, The distribution of minor allele tumour DNA from the same individual, but this approach greatly frequencies is shown for synonymous versus non-synonymous cSNPs. c, The limits sensitivity and is only relevant to a subset of mechanisms distribution of minor allele frequencies (by proportion, rather than count) is within one disease class28. shown for synonymous cSNPs (n 5 21,201) versus non-synonymous cSNPs To quantify this background of non-causal variants in our exome predicted to be benign (n 5 13,295), possibly damaging (n 5 3,368), or data, we first investigated how many genes had one or more non- 24 probably damaging (n 5 2,227) by PolyPhen . d, The distribution of lengths synonymous cSNPs, splice site disruptions or coding indels in one or of coding indel variants is shown (average numbers per exome). Error bars several FSS exomes (Fig. 2, row 1). Simply requiring that a gene indicate s.d. contain variants in multiple affected individuals was clearly insuf- ficient, as over 2,000 candidate genes remained even after intersecting Nonsense mutations and splice-site disruptions are often assumed four FSS exomes. We then applied filters to remove presumably to be deleterious, but have a broad range of potential fitness 25–27 common variants, as these are unlikely to be causative. Removing effects . Our non-redundant cSNP catalogue included 225 non- dbSNP-catalogued variants from consideration reduced the number sense mutations (112 novel) and 102 splice-site disruptions (49 of candidates considerably (Fig. 2, row 2). Remarkably, the eight novel). Excluding 86 nonsense alleles that are common in this data HapMap exomes provided a filter nearly equivalent to dbSNP set (two or more observations) or in a recent study25 (.5% allele (Fig. 2, row 3). Combining the two catalogues had a synergistic effect frequency), our genome-wide estimate (projected to 30 Mb) for the (Fig. 2, row 4), such that the candidate list could be narrowed to a average number of relatively rare mutations introducing premature single gene (MYH3, identified previously by a candidate gene nonsense codons in an individual genome was 10 for non-Africans approach as causative for FSS5). Specifically, MYH3 is the only (n 5 8) and 20 for Yoruba (n 5 4). However, these are probably gene where: (1) at least one (but not necessarily the same) non- overestimates, given that our catalogue of common nonsense muta- synonymous cSNP, splice-site disruption or coding indel is observed tions remains incomplete. Short insertions and deletions (indels) in coding sequence are in all four individuals with FSS; (2) the mutations are not in dbSNP, likely to be functionally important when they cause frameshifts, nor in the eight HapMap exomes. Taking the predicted deleterious- but are difficult to detect with short reads. We developed and applied ness of individual mutations into account served as an effective filter an approach for identifying indels from our unpaired 76 bp reads. In as well (Fig. 2, row 5), but was not required to identify MYH3. Ranges total, 664 coding indels were called in one or more individuals. On average, 166 coding indels were called per individual, of which 63% Any 3 of 4 FSS24895 FSS24895 were previously annotated in dbSNP (Supplementary Table 3). To FSS24895 FSS10208 FSS10208 FSS24895 FSS10208 FSS10066 FSS10066 assess our sensitivity, we compared our data for NA18507 to data FSS24895 FSS10208 FSS10066 FSS22194 FSS22194 12 published previously . The majority (73%) of their coding indels Non-synonymous cSNP, splice site 4,510 3,284 2,765 2,479 3,768 were also observed in our data (136 of 187). To assess specificity, variant or coding indel (NS/SS/I) we attempted PCR and Sanger sequencing of 28 novel coding indels NS/SS/I not in dbSNP 513 128 71 53 119 chosen at random. Of 21 successful assays, 20 coding indels were NS/SS/I not in eight 799 168 53 21 160 verified and 1 was a false positive. We anticipate that future use of HapMap exomes NS/SS/I neither in dbSNP 360 38 8 1 (MYH3)22 paired-end reads will improve detection of coding indels. nor eight HapMap exomes The shape of the distribution of coding indel lengths was consist- 10,20 affected has at least one… MYH3 ent with other studies as well as across the 12 exomes (Fig. 1d), Number of genes in which each …And predicted to be damaging 160 10 2 1 ( )3 demonstrating a preference for multiples of 3 (‘3n’). Of the 664 coding indels observed here, 65% were 3n in length. The allele fre- Figure 2 | Direct identification of the causal gene for a monogenic disorder quency distribution for novel indels relative to annotated indels was by exome sequencing. Boxes list the number of genes with one or more non- markedly shifted towards rarer variants (Supplementary Fig. 4). synonymous cSNP, splice-site SNP, or coding indel (NS/SS/I) meeting specified filters. Columns show the effect of requiring that one or more NS/ However, the length histograms for novel versus annotated coding SS/I variants be observed in each of one to four affected individuals. Rows indels were similar (Supplementary Fig. 5), reinforcing the notion show the effect of excluding from consideration variants found in dbSNP, that our set of novel coding indels is not excessively contaminated the eight HapMap exomes, or both. Column five models limited genetic with false positives (as these would not be expected to have the heterogeneity or data incompleteness by relaxing criteria such that variants observed 3n bias). Excluding indels that were common in this data need only be observed in any three of four exomes for a gene to qualify. 274 ©2009 Macmillan Publishers Limited. All rights reserved NATURE | Vol 461 | 10 September 2009 LETTERS of candidate list sizes when other permutations of individuals are Illumina. Annotations of cSNPs were based on NCBI and UCSC databases, used are shown in Supplementary Fig. 6. supplemented with PolyPhen Grid Gateway24 predictions for non-synonymous MYH3 was well covered in our data. To assess our sensitivity more SNPs. globally, we calculated the probability that a would have Identification of coding indels. Identification of coding indels involved: (1) gapped alignment of unmapped reads to the genome to generate a set of can- been identified in all four FSS-affected individuals for each gene, didate indels using cross_match; (2) ungapped alignment of all reads to the based on our overall coverage of that gene in each individual reference and alternative alleles for all candidate indels using Maq; and (3) (Supplementary Data 2). The average probability across all genes filtering by coverage and allelic ratio. was 86%. This is probably still an overestimate of sensitivity, as func- Data access. Sequencing reads for HapMap individuals are available from tional noncoding or structural mutations would be missed. It also the NCBI Short Read Archive, accession SRP000910. Variants identified in remains challenging to detect mutations in segmentally duplicated HapMap individuals have been submitted to NCBI dbSNP under the handle regions of the genome with short read sequencing. ‘SEATTLESEQ’. Variants identified in FSS individuals are available to approved Nevertheless, our analysis suggests that direct sequencing of investigators through NCBI dbGaP, accession number phs000204. Individual exomes of small numbers of unrelated individuals (but more than genotypes for variants identified in HapMap individuals, as well as the collapsed CCDS 2008 definition (before masking of coordinates listed in Supplementary one) with a shared monogenic disorder can serve as a genome-wide Data 1), are available at http://krishna.gs.washington.edu/12_exomes. scan for the causative gene. The availability of the eight HapMap exomes was clearly helpful, suggesting that the power of this Full Methods and any associated references are available in the online version of approach will improve as the 1000 Project29 generates a the paper at www.nature.com/nature. catalogue of common variation that is more complete and evenly Received 5 June; accepted 29 June 2009. ascertained than dbSNP. Also, FSS is inherited in an autosomal dom- Published online 16 August 2009. inant pattern so the presence of only one mutant allele is sufficient to cause disease. Applying this strategy to a recessive disease would 1. Cohen, J. C. et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869–872 (2004). probably be easier, because there are far fewer genes in each exome 2. Frazer, K. A., Murray, S. S., Schork, N. J. & Topol, E. J. Human genetic variation and that are homozygous or compound heterozygous for rare non-syn- its contribution to complex traits. Nature Rev. Genet. 10, 241–251 (2009). onymous variants. We also note that modelling of even a modest 3. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnol. 26, degree of genetic heterogeneity or data incompleteness is observed 1135–1145 (2008). 4. The International HapMap Consortium. A haplotype map of the human genome. to have a significant impact on performance (Fig. 2, column offset to Nature 437, 1299–1320 (2005). the right). Moving along the spectrum from rare monogenic disor- 5. Toydemir, R. M. et al. Mutations in embryonic myosin heavy chain (MYH3) cause ders to complex common diseases, it is likely that the increasing Freeman-Sheldon syndrome and Sheldon-Hall syndrome. Nature Genet. 38, extent of genetic heterogeneity will need to be matched by increas- 561–565 (2006). ingly large sample sizes30, and/or more sophisticated weighting of 6. Sjoblom, T. et al. The consensus coding sequences of human breast and colorectal cancers. Science 314, 268–274 (2006). predicted mutational impact. 7. Olson, M. Enrichment of super-sized resequencing targets from the human A clear limitation of exome sequencing is that it does not identify genome. Nature Methods 4, 891–892 (2007). the structural and noncoding variants found by whole genome 8. Hodges, E. et al. Genome-wide in situ capture for selective resequencing. sequencing. At the same time, it allows a given amount of sequencing Nature Genet. 39, 1522–1527 (2007). 9. National Center for Biotechnology Information. Consensus CDS protein set to be extended across at least 20 times as many samples compared to ,http://www.ncbi.nlm.nih.gov/projects/CCDS. (2009). whole genome sequencing. In studies focused on identifying rare 10. Ng, P. C. et al. Genetic variation in an individual human exome. PLoS Genet. 4, variants or somatic mutations with medical relevance, sample size e1000160 (2008). and the interpretability of functional impact may be critical to 11. Kidd, J. M. et al. Mapping and sequencing of from eight human achieving meaningful success. It is in the context of such studies that genomes. Nature 453, 56–64 (2008). 12. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible exome sequencing may be most valuable. terminator chemistry. Nature 456, 53–59 (2008). We demonstrate that targeted capture and massively parallel 13. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling sequencing represents a cost-effective, reproducible and robust strat- variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008). egy for the sensitive and specific identification of variants causing 14. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature protein-coding changes in individual human genomes. The 307 Mb Genet. 40, 722–729 (2008). determined here across 12 individuals is the largest data set reported 15. Ewing, B. & Green, P. Base-calling of automated sequencer traces using Phred. II. so far of human coding sequence ascertained by second-generation Error probabilities. Genome Res. 8, 186–194 (1998). sequencing methods. Finally, our successful demonstration that the 16. Turner, E. H., Lee, C., Ng, S. B. & Shendure, J. Massively parallel exon capture and causative gene for a Mendelian disorder can be identified directly by library-free resequencing across 16 individuals. Nature Methods 6, 315–316 (2009). exome sequencing of several unrelated individuals provides increas- 17. Kidd, J. M. et al. Haplotype sorting using human fosmid clone end-sequence pairs. ing context to the possibility that exome or genome sequencing may Genome Res. 18, 2016–2023 (2008). represent a new approach for identifying gene–disease relationships. 18. Albert, T. J. et al. Direct selection of human genomic loci by microarray hybridization. Nature Methods 4, 903–905 (2007). METHODS SUMMARY 19. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008). DNA samples, targeted capture and massively parallel sequencing. DNA sam- 20. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, ples were obtained from Coriell Repositories (HapMap) or by M.B. (FSS). Each 60–65 (2008). shotgun library was hybridized to two Agilent 244K microarrays for target 21. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, enrichment, followed by washing, elution and additional amplification. The first e254 (2007). array targeted CCDS (2007), while the second was designed against targets 22. Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid poorly captured by the first array plus updates to CCDS in 2008. All sequencing leukaemia genome. Nature 456, 66–72 (2008). was performed on the Illumina GA2 platform. used are listed in 23. Boyko, A. R. et al. Assessing the evolutionary impact of amino acid mutations in Supplementary Table 5. the human genome. PLoS Genet. 4, e1000083 (2008). Read mapping and variant analysis. Reads were mapped to the reference 24. Sunyaev, S. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10, 591–597 (2001). human genome (hg18, downloaded from http://genome.ucsc.edu), initially with 13 25. Yngvadottir, B. et al. A genome-wide survey of the prevalence and evolutionary ELAND (Illumina) for quality recalibration, and then again with Maq . forces acting on human nonsense SNPs. Am. J. Hum. Genet. 84, 224–234 (2009). Sequence calls were also performed by Maq, and filtered to coordinates with 26. Olson, M. V. When less is more: gene loss as an engine of evolutionary change. 15 $83 coverage and a Phred-like consensus quality $30. Sequence calls for Am. J. Hum. Genet. 64, 18–23 (1999). HapMap individuals were compared against Illumina Human1M-Duo 27. Cohen, J. et al. Low LDL cholesterol in individuals of African descent resulting from genotypes. NA18507 SNPs from whole genome data12 were obtained from frequent nonsense mutations in PCSK9. Nature Genet. 37, 161–165 (2005). 275 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 461 | 10 September 2009

28. Jones, S. et al. Exomic sequencing identifies PALB2 as a pancreatic cancer Institute, National Institutes of Health/National Institute of Child Health and susceptibility gene. Science 324, 217 (2009). Human Development, and the Washington Research Foundation. S.B.N. is 29. Siva, N. 1000 Genomes project. Nature Biotechnol. 26, 256 (2008). supported by the Agency for Science, Technology and Research, Singapore. E.H.T. 30. Kryukov, G. V., Shpunt, A., Stamatoyannopoulos, J. A. & Sunyaev, S. R. Power of and A.W.B. are supported by a training fellowship from the National Institutes of deep, all-exon resequencing for discovery of human trait genes. Proc. Natl Acad. Health/National Human Genome Research Institute. E.E.E. is an investigator of the Sci. USA 106, 3871–3876 (2009). Howard Hughes Medical Institute. Supplementary Information is linked to the online version of the paper at Author Contributions The project was conceived and experiments planned by www.nature.com/nature. S.B.N., E.H.T., A.B., E.E.E., M.B., D.A.N. and J.S. Experiments were performed by S.B.N., E.H.T., C.L. and M.W. Algorithm development and data analysis were Acknowledgements For discussions or assistance with genotyping data, we thank performed by S.B.N., P.D.R., S.D.F., A.W.B., T.S., M.B., D.A.N. and J.S. The manuscript P. Green, J. Akey, R. Patwardhan, G. Cooper, J. Kidd, D. Gordon, J. Smith, I. Stanaway was written by S.B.N. and J.S. All aspects of the study were supervised by J.S. and M. Rieder. For assistance with project management, computation, data management and submission, we thank E. Torskey, S. Thompson, T. Amburg, Author Information Reprints and permissions information is available at B. McNally, S. Hearsey, M. Shumway and L. Hillier. For Human1M-Duo genotype www.nature.com/reprints. The authors declare competing financial interests: data on HapMap samples, we thank Illumina. Our work was supported in part by details accompany the full-text HTML version of the paper at www.nature.com/ grants from the National Institutes of Health/National Heart Lung and Blood nature. Correspondence and requests for materials should be addressed to J.S. Institute, the National Institutes of Health/National Human Genome Research ([email protected]) or S.B.N. ([email protected]).

276 ©2009 Macmillan Publishers Limited. All rights reserved doi:10.1038/nature08250

METHODS found in CCDS20080902 that were not in CCDS20070227 and hence not tar- geted by the first array. A subset of arrays used lacked control grids. Genomic DNA samples. Targeted capture was performed on genomic DNA Targeted capture by hybridization to DNA microarrays. Hybridizations to from eight HapMap individuals (four Yoruba (NA18507, NA18517, NA19129 Agilent 244K arrays were performed following manufacturer’s instructions with and NA19240), two East Asians (NA18555 and NA18956) and two European- modifications. For each enrichment, a 520 ml hybridization solution containing Americans (NA12156 and NA12878)) and four European-American individuals 20 mg of the bulk-amplified genomic DNA library, 13 aCGH hybridization affected by Freeman–Sheldon syndrome (FSS10066, FSS10208, FSS22194 and buffer (Agilent), 13 blocking agent (Agilent), 50 mg human CotI DNA FSS24895). Genomic DNA for HapMap individuals was obtained from Coriell (Invitrogen) and 0.92 nmol each of the blocking oligonucleotides SLXA_ Cell Repositories. Genomic DNA for FSS individuals was obtained by M.B. FOR_AMP, SLXA_REV_AMP, SLXA_FOR_AMP_rev and SLXA_REV_ Oligonucleotides and adaptors. All oligonucleotides were synthesized by AMP_rev was incubated at 95 uC for 3 min and then at 37 uC for at least Integrated DNA Technologies and resuspended in nuclease-free water to a stock 30 min. The hybridization solution was then loaded and the hybridization cham- concentration of 100 mM. Sequences are shown in Supplementary Table 5. ber assembled following the manufacturer’s instructions. Incubation was done at Double-stranded library adaptors SLXA_1 and SLXA_2 were prepared to a final 65 uC for at least 66 h with rotation at 20 r.p.m. in a hybridization oven (Agilent). concentration of 50 mM by incubating equimolar amounts of SLXA_1_HI and After hybridization, the slide-gasket sandwich was removed from the chamber SLXA_1_LO together and SLXA_2_HI and SLXA_2_LO together at 95 C for u and placed in a 50 ml conical tube filled with aCGH Wash Buffer 1 (Agilent). The 3 min and then leaving the adaptors to cool to room temperature in the heat slide was separated from the gasket while in the buffer and then washed, first with block. fresh aCGH Wash Buffer 1 at room temperature for 10 min on an orbital shaker m Shotgun library construction. Shotgun libraries were generated from 10 gof (VWR) set on low speed, and then in pre-warmed aCGH Wash Buffer 2 (Agilent) genomic DNA (gDNA) using protocols modified from the standard Illumina at 37 uC for 5 min. Both washes were also done in 50 ml conical tubes. protocol12 . Each library provided sufficient material for hybridization to two A Secure-Seal (SA2260, Grace Bio Labs) was then affixed firmly over the active m 3 microarrays. For each sample, gDNA in 300 l1 Tris-EDTA was first sonicated area of the washed slide and heated briefly according to the manufacturer’s for 30 min using a Bioruptor (Diagenode) set at high, then end-repaired for instructions. One port was sealed with a seal tab and the seal chamber completely 45 min in a 100 ml reaction volume using 13 End-It Buffer, 10 ml dNTP mix filled with approximately 1 ml of hot EB (95 uC). The other port was sealed and and 10 ml ATP as supplied in the End-It DNA End-Repair Kit (Epicentre). The the slide incubated at 95 uC on a heat block. After 5 min, one port was unsealed fragments were then A-tailed for 20 min at 70 uC in a 100 ml reaction volume with and the solution recovered. DNA was purified from the solution using a standard 13 PCR buffer (Applied Biosystems), 1.5 mM MgCl2, 1 mM dATP and 5 U ethanol precipitation. AmpliTaq DNA polymerase (Applied Biosystems). Next, library adaptors Precipitated DNA was resuspended in EB and the entire volume used in a 50 ml SLXA_1 and SLXA_2 were ligated to the A-tailed sample in a 90 ml reaction PCR volume comprising of 13 iTaq SYBR Green Supermix with ROX (Bio-Rad) volume with 13 Quick Ligation Buffer (New England Biolabs) with 5 ml and 0.2 mM each of primers SLXA_FOR_AMP and SLXA_REV_AMP. Thermal Quick T4 DNA Ligase (New England Biolabs) and each adaptor in 103 molar cycling was done in a MiniOpticon Real-time PCR system (Bio-Rad) with the excess of sample. Samples were purified on QIAquick columns (Qiagen) after following programme: 95 uC for 5 min, then 30 cycles of 95 uC for 30 s, 55 uC for each of these four steps and DNA concentration determined on a Nanodrop- 2 min and 72 uC for 2 min. Each sample was monitored and extracted from the 1000 (Thermo Scientific) when necessary. PCR machine when fluorescence began to plateau. Samples were then purified Each sample was subsequently size-selected for fragments of size 150–250 bp on a QIAquick column (Qiagen) and sequenced. using gel electrophoresis on a 6% TBE-polyacrylamide gel (Invitrogen). A gel Sequencing. All sequencing of post-enrichment shotgun libraries was carried slice containing the fragments of interest was then excised and transferred to a out on an Illumina Genome Analyzer II as single-end 76 bp reads, following the siliconized 0.5 ml microcentrifuge tube (Ambion) with a 20 G needle-punched manufacturer’s protocols and using the standard sequencing primer. Image hole in the bottom. This tube was placed in a 1.5 ml siliconized microcentrifuge analysis and base calling was performed by the Genome Analyser Pipeline ver- tube (Ambion), and centrifuged in a tabletop microcentrifuge at 16,110g for sion 1.0 or 1.3 with default parameters, but with no pre-filtering of reads by 5 min to create a gel slurry that was then resuspended in 200 ml13 Tris-EDTA quality. Quality values were recalibrated by alignment to the reference human and incubated at 65 uC for 2 h, with periodic vortexing. This allowed for passive genome with the Eland module. elution of DNA, and the aqueous phase was then separated from gel fragments by Read mapping. The reference human genome used in these analyses was UCSC centrifugation through 0.2 mm NanoSep columns (Pall Life Sciences) and the assembly hg18 (NCBI build 36.1), including unordered sequence DNA recovered using a standard ethanol precipitation. (chrN_random.fa) but not including alternate haplotypes. For each lane, reads Recovered DNA was resuspended in elution buffer (EB; 10 mM Tris-Cl, with calibrated qualities were extracted from the Eland export output. Base pH 8.5, Qiagen) and the entire volume used in a 1 ml bulk PCR reaction volume qualities were rescaled and reads mapped to the human reference genome using with 13 iProof High-Fidelity Master Mix (Bio-Rad) and 0.5 mM each of primers Maq (version 0.7.1)13. Unmapped reads were dumped using the –u option and SLXA_FOR_AMP and SLXA_REV_AMP with the conditions: 98 uC for 30 s, 20 subsequently used for indel mapping. Mapped reads that overlapped target cycles at 98 uC for 30 s, 65 uC for 10 s and 72 uC for 30 s, and finally 72 uC for regions (‘target reads’) were used for all other analyses. 5 min. PCR products were purified across four QIAquick columns (Qiagen) and Target masking. All possible 76-bp reads that overlapped the aggregate target all the eluants pooled. were simulated, mapped using Maq and consensus called using Maq assemble Design of exome capture arrays. We targeted all well-annotated protein-coding with parameters –q 1 –r 0.2 –t 0.9. Target coordinates that had read depth ,76 regions as defined by the CCDS (version 20080902). Coordinates were extracted (that is, half of the expected depth), reflecting a poor ability to have reads from entries with ‘public’ status, and regions with overlapping coordinates were confidently mapped to them (Supplementary Data 1), were removed from con- merged. This resulted in a target with 164,007 discontiguous regions summing to sideration for downstream analyses, leaving a 26,553,795 bp target. 27,931,548 bp. By comparison, coding sequence defined by all of RefSeq (NCBI Variant calling. All reads with a map score .0 from each individual were 36.3) comprises 31.9 Mb (14% larger). Hybridization probes against the target merged and filtered for duplicates such that only the read with the highest were designed primarily such that they were evenly spaced across each region. aggregate base quality at any given start position and orientation was retained. Probes were also constrained (1) to be relatively unique, such that the average Sequence calls were obtained using Maq assemble with parameters –r 0.2 –t 0.9, occurrence of each 15-mer in the probe sequence is less than 1008, (2) to be and only coordinates with at least 83 coverage and an estimated Phred-like between 20 and 60 bases in length, with preference for longer probes, and (3) to consensus quality value of at least 30 were used for downstream variant analyses. have a calculated melting temperature (Tm) #69 uC, with preference for higher Comparison of sequence calls to array genotypes, dbSNP and whole genome Tm values. Tm was calculated by 64.9 1 41 3 (number of G 1 Cs 2 16.4)/length sequencing. For the eight HapMap individuals, sequence calls were compared to of probe. array-based genotyping data (Illumina Human1M-Duo) provided by Illumina. Two arrays (Agilent, 244K format) were designed and used per individual. The We excluded from consideration genotyping assays where all eight individuals first array was common to all individuals, and contained 241,071 probes were called by the arrays as homozygous non-reference as well as the MHC locus designed mainly against the subset of the target that was also found in a previous at chromosome 6:32500001–33300000, as both sets are likely to be error- version of the CCDS (CCDS20070227). For most exomes, the second array was enriched in the genotyping data. We downloaded dbSNP(v129) from ftp:// custom-designed specifically against target regions that had not been adequately ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts on 13 May 2008. represented after capture on the first array and subsequent sequencing. For two Approximately 14.2 million non-redundant coordinates were defined by this individuals (FSS10066, FSS10208), the matching was to a different individual’s file set. For comparison of NA18507 cSNPs to whole genome data, variant lists first-array data. However, this did not seem to have a significant effect on per- were obtained from Illumina12. formance, probably because features capturing poorly on the first array largely Identification of coding indels. Reads for which Maq was unsuccessful in iden- did so consistently. Additionally, all of the second arrays also targeted sequences tifying an ungapped alignment were converted to fasta format and mapped to the

©2009 Macmillan Publishers Limited. All rights reserved doi:10.1038/nature08250

human reference genome with cross_match (v1.080812, http://www.phrap.org), supplied by I. Adzhubey. The server reads files with SNP locations and alleles, using parameters –gap_ext 21–bandwidth 10 2minmatch 20 –maxmatch 24. and produces annotation files available for download. Annotation includes Output options –tags –discrep_lists –alignments –score_hist were also set. dbSNP rs IDs, overlapping-gene accession numbers, SNP function (for example, Alignments with an indel were then filtered for those that: (1) had a score at whether coding missense), conservation scores, HapMap minor-allele frequen- least 40 more than the next best alignment; (2) mapped at least 75 bases of the cies and various protein annotations (sequence, position, amino acid changes read; (3) had no substitutions in addition to the indel; and (4) overlapped a with physicochemical properties and PolyPhen classification). Indels were con- target region. Reads from filtered alignments that mapped to the negative strand sidered annotated by dbSNP if an entry was found with the same allele (or reverse were then reverse-complemented and, together with the rest of the filtered reads, complemented) within 1 bp of the variant position. This was to allow for ambi- re-mapped with cross_match using the same parameters. This was to reduce guities in calling the indel position. ambiguity in called indel positions due to different read orientations. After the Calculation of genome-wide estimates. Extrapolated estimates for the genome- second mapping, alignments were re-filtered using the same criteria (1) to (4). wide number of cSNPs of various classes (Table 2b) were calculated based on the For each sample, a putative indel event was called if at least two filtered reads number of cSNP calls in that individual, the estimated sensitivity for making a covered the same event. A fasta file containing the sequences of all called events variant call in that individual at any given position within the aggregate target 675 bp, as well as the reference sequence at the same positions, was then genera- (based on the fraction of array-based genotypes of that class that were success- ted for each individual. All the reads from each individual were then mapped to fully called; calculated separately for heterozygous and homozygous non-ref- its ‘indel reference’ with Maq using default parameters. Reads that mapped erence variants), and extrapolation to an estimated exome size of exactly multiple times (map score 0) or had redundant start sites were removed, after 30 Mb (that is, multiplying by 30/26.6 5 1.13). A similar approach was taken which the number of reads mapping to either the reference or the non-reference to estimate the genome-wide number of uncommon cSNPs introducing non- allele was counted for each individual and indel. An indel was called if there were sense codons, starting with the number observed in each individual and extra- at least eight non-reference allele reads making up at least 30% of all reads at that polating based on estimated sensitivity for heterozygote detection and an genomic position. Indels were called as heterozygous if non-reference alleles estimated exome size of exactly 30 Mb. were 30–70% of reads at that position, and homozygous non-reference if .70%. Freeman–Sheldon syndrome mutations. For FSS10066, FSS22194 and Variant annotation. For cSNP annotation, we constructed a local server that FSS24895, the identified mutation was a CRT at chromosome 17:10485359, integrates data from NCBI (including dbSNP and Consensus CDS files) and and the corresponding amino acid change was R672H. For FSS10208, the muta- from UCSC Genome . We also generated PolyPhen predictions24 tion was CRT at chromosome 17:10485360, and the corresponding amino acid for all cSNPs identified here, using the PolyPhen Grid Gateway and Perl scripts change was R672C.

©2009 Macmillan Publishers Limited. All rights reserved