<<

© 2014 America, Inc. All rights reserved. The The concept DS of genetic a for need genes. mutant identify to the system selection without compounds chemical new of mutagenicity potential the to one assess allow would sequencing enomics robust forensics include for need detection mutation similar a low-frequency with fields Other sequencing. NGS conventional of rate error background high the to owing assess resistance drug of development including environments, to adapt changing to ability their underlies populations microbial within found sity by conventional approaches. sequencing obtained Similarly, the be divergenetic cannot that resolution of level cells—a of <1% in are present that chemo mutations of detection confident the of requires emergence rapid in resistance and therapy relapse cancer in factor genetic of levels extreme heterogeneity exhibit cancers human that suggest progression and has initiation cancer of driver intrinsic heterogeneity an be to proposed been long genetic example, For mutations. subclonal population. a within cells of fraction a mutations, which are defined as mutations that are present in only subclonal of the analysis limits it mutations, greatly ing inherited ( nucleotides sequenced 100–1,000 per call base erroneous one of on rate: the order error have high a relatively NGS approaches all However, mutations. clonal inherited and to identify nucleotides of of billions field sequence to routine now emerging is It medicine. the personalized feasible made has and genetics ditional tra revolutionized has technology powerful This modern era. the genomic defines (NGS) sequencing DNA Next-generation I preparation and target enrichment, as well as an overview of the data analysis workflow. families arising from both of the two D tag sequence derived from the two original strands.complementary Mutations are scored only if the variant is present in the sample D relies on the ligation of sequencing adapters harboring random yet double-stranded complementary nucleotide sequences to the D wild-type nucleotides, thereby enabling the study of heterogeneous populations and very-low-frequency genetic alterations. Duplex Published online 9 October 2014; corrected online 22 October 2014 (details online); L.A.L. ( Portland State University, Portland, Oregon, USA. 1 Marc J Prindle Scott R Kennedy Duplex Sequencing Detecting ultralow-frequency mutations by 2586 mutations, random and subclonal of study the facilitate thereby To overcome the high error rate of next-generation sequencing and Department of Pathology,of Department University of Washington, Seattle, USA. Table NTRO protocol S There is growing need for technologies capable of resolving resolving of capable technologies for need growing is There can be applied to any double-stranded D

| [email protected] VOL.9 NO.11VOL.9 D 1 S ; ref. ; ref. UCT equencing equencing (D 15,1 NA 10–1 6 of interest. Individually labeled strands are then I evolution , 1 ON 3 3– ). Although this error rate is acceptable for study for acceptable is rate error this ). Although , but this genetic diversity is difficult to directly directly to difficult is diversity genetic this but , 6 1 . Subclonal mutations are probably a major major a probably are mutations Subclonal . 2 ). | , Kawai J Kuong . Recent tumor genome sequencing studies studies sequencing genome tumor Recent . 2014 7– 1 , Michael W Schmitt 9 . However, the study of cancer subclones subclones cancer of study the However, . S ) ) is a next-generation sequencing methodology capable of detecting a single mutation among >1 × 10 | 1 natureprotocols 7 n and 1 , Jiang-Cheng Shen 4 NA Department of ,of Department University of Washington, Seattle, USA. Correspondence should be addressed to strands. Here we provide a detailed protocol for efficient D 1 NA 2 8 , Edward J Fox , as high-accuracy high-accuracy as , sample, but it is ideal for small genomic regions of <1 Mb in size. 2 Department of Medicine,of Department University of Washington, Seattle, USA. 1 4 , paleog , 1 , Rosa-Ana Risques 1 PCR , Brendan F Kohrn - - - - -

-amplified, -amplified, creating sequence ‘families’ that share a common doi:10.1038 pared with each other, and the base identity at each position is is position each at identity base the and other, each with pared DNA from an the of are duplex two individual then com strands 1 Fig. (SSCSs; sequences’ consensus form to ‘single-strand strands two the of each for established is sequence consensus a and grouped ( tag sequences derived from each of the two single parental strands are PCR-amplified to create sequence families that share the same T ( molecule DNA double-stranded original the of strands two of the one to back read sequence every trace to feasible becomes it sequences, nucleotide double-stranded complementary yet dom ran containing adapters DNA with duplex By tagging sequence. DNA unique own its with molecule DNA fragmented each label DNA by double-stranded tags molecular using degenerate bases to among >10 a ability mutation detect unique single the has it and NGS, conventional with compared improvement DNA, racy in of double-stranded sequencing with a >10,000-fold accu unprecedented (DS). DS yields Sequencing Duplex termed we recently developed a highly sensitive methodology sequencing able able Fig. 1 Fig. i. 1 Fig. Illumina HiSeq Illumina MiSeq Life Ion Torrent Pacific Biosciences P ABI ABI SOLiD latform 1 c 9 a 1 1 ). The two complementary consensus sequences derived derived sequences consensus complementary two The ). b . DS takes advantage of the inherent complementarity of of complementarity inherent the of advantage takes .DS ). After adapter ligation, the individually labeled strands strands labeled individually the ligation, adapter ). After ). After sequencing, members of each tag family are are family tag each of members sequencing, After ). /nprot.2014.17 |

NGS NGS platforms and their associated errors 1 & Lawrence A Loeb 3 , Jesse J Salk 0 T Single nucleotide Single nucleotide A-T bias Short deletions, homopolymers G/C deletions he he protocol typically takes 1–3 d. P rimary error type 2 , Eun Hyun Ahn S adapter synthesis, library 3 Department of Biology,of Department 1 , 4

T he he method 1 . 1 , B

7 ackground sequenced sequenced (%) 0.1 0.1 0.2

16 1 20–2

7 PCR

6

to - - - © 2014 Nature America, Inc. All rights reserved. their utility in error correction. error in utility their by these enzymes, nor is the fidelity of repair perfect, thus limiting these in mutations samples false of number the reduce to shown been or glycosylases dam of DNA addition the of with age Removal polymerization. during pairings base erroneous cause that adducts DNA of prevalence the of error because of form this to sensitive particularly is DNA degraded or thousands of false positives on a typical sequencing run. Damaged 10 between a at frequency sertions misin make construction library in used typically DNA polymerases mutation. subclonal a of appearance the give would ants vari artifactual these molecules, unrelated is in occur to mutation unlikely same the because However, mutation. a as erroneously be scored will they and event, misincorporation original a all contain will copy the of molecules related of subpopulations these When sequenced, molecules. daughter synthesized quently subse to propagated be can occurring synthesis of event round first the during misinsertion a whereby barcoding, stranded and can in single- occur process CirSeq A variant. similar genetic a called incorrectly be would error propagated the cell, single a from derived are reads the all Because reads. the of all to events misincorporation initial any copies’, of propagating ‘copies thus random of in generation the and DNA synthesized results newly the from priming polymerase DNA strand-displacing a with single-cell of conjunction in sequences primer case random of use the the sequencing, In mutations. as scored erroneously be polymerase to likely are they and molecules, DNA daughter the to propagated be by events inherently will amplification of round first the during occurring double-stranded a Misincorporation of strand molecule. single a from derived DNA ing involve sequenc methods these of all three strengths, has unique (CirSeq) and circle (iii) sequencing three main strategies besides DS have emerged: (i) single-cell single-cell (i) emerged: sequencing have DS increased, besides has strategies methods main sequencing three accurate for need the As Alternatives to DS mutations. real as counted not are thus and DNA strands two the of one only in appear will damage DNA from arising or misincorporations polymerase DNA by PCR during introduced a yielding sequence’ ‘duplex (DCS; consensus position, that at perfectly match strands two the if only retained from ref. present at the same position in both strands of the DNA. Figure is adapted first-round PCR errors. True mutations are scored if and only if they are families with tags error, but not first-round PCR errors. Comparing SSCSs from the paired mutations (green dots). Formation of the SSCS removes the first type of mistakes (blue or purple dots); first-round PCR errors (brown dots); true for each tag family. Mutations are of three different types: sequencing grouped together into tag families of form related, PCR products. ( amplification of each strand of a DNA duplex results in two distinct, but results in a unique 12-nt tag sequence on both ends of the molecule. PCR invariant spacer sequence. ( Sequencing adapter, showing the random double-stranded tag and the Figure 1

3 | 31, Overview of Duplex Sequencing. ( 3, 3 © 4,27– 2 2013 Kennedy . However, not all mutagenic lesions are recognized recognized are lesions mutagenic all However,not . 2 α 9 β , molecular barcoding (ii) single-stranded and c ) Reads sharing unique β α b generates a DCS, which eliminates these ) Ligation of the adapters to the sample DNA et al. −4 30 α and 10 and , a β 3 ) Schematic of a Duplex 1 or . each approach Although α β and in vitro in α , and an SSCS is created −6 Fig. Fig. 1 β , which can lead to lead ,can which tag sequences are repair kits has has kits repair c ). Mutations 20,21,2

- - - - - 3

accuracy. In our original publication on DS on publication original our In accuracy. The primary advantage of DS over other approaches is its superior Advantages DS of NGS to produce a given depth of sequencing. At present, the use use the present, At sequencing. of depth given a produce to NGS conventional than capacity sequencing larger much requires DS ance on sequencing multiple PCR duplicates of both DNA strands, reli method’s the to cost.Owing a at comes however,ability this frequencies; low extremely at occur that mutations quantifying DS and is detecting a that sensitive is of uniquely method capable DS of Limitations results). unpublished DNA (M.J.P., degraded and damaged E.J.F.extremely L.A.L., and even in artifacts mutational DS that for is removing cating useful frequency, indi in mutation the change a only we twofold found and samples, and tissue unfixed formalin-fixed paired sequenced By DS, using we have misincorporations. ant to damage-induced resist extremely DS makes errors these correct to strands both from information sequencing of use the DNA strands, on paired position corresponding same the at occur to unlikely are events mutations false of a number high have reported tissues from formalin-fixed DNA sequence to NGS using studies DNA.Numerous damaged chemically from derived mutations artifactual mitigate help can artifacts. PCR of result the be may age with indicating that reactive oxygen species induced studies mutations increase previous that suggests and aging with of theories free-radical with associated inconsistent is commonly finding This mtDNA. most to damage oxidative mutations the in increase age-associated significant no observed we Surprisingly, lifespan. 80-year an over fivefold approximately increases frequency the mutations is 10–100-fold lower than that previously reported, and individuals old and young from tissue brain in mtDNA human of frequency mutation the measuring to DS applied recently we addition, In results). unpublished L.A.L., (E.J.F. DNA and nuclear human in 10 × 5 as low as frequencies mutation measured subsequently 10 × 2.5 as bacteriophage M13 the for frequency mutation the of measurement direct the b a 2 1 1 1 Similarly to our results with mtDNA, we have found that DS DS that found have we mtDNA, with results our to Similarly βα αβ f f amily amily 34– α α β 3 3 Randomized duplextag 3 . We showed that the frequency of mtDNA point point mtDNA of frequency the that showed We. N N N N N N N N N N N N 6 ′ N . Because complementary artifacts from damage damage from artifacts complementary Because . ′ N strands ofduplex PCR duplicatesboth ′ N ′ N ′ N ′ N ′ natureprotocols N ′ α β N β ′ N ′ N ′ −6 N N ′ A CT T GAC mutations per base pair. We have have We pair. base per mutations G 2 2 2 1 T

c | VOL.9 NO.11VOL.9 α α α α α β β β β β β α 1 True mutation 9 , we demonstrated demonstrated ,we protocol consensus sequences consensus sequences Single-strand | (SSCSs) 2014 (DCSs) Duplex α α α α α α β β β β β β |

2587 −8 - - -

© 2014 Nature America, Inc. All rights reserved. tion, end repair, 3 repair, end tion, selec size sonication, by shearing DNA of steps standard basic the follows protocol The protocol. preparation library Illumina for to DS is the standard similar construction library Sequencing the procedureof Overview practical. to fall,become increasingly that we this will anticipate hibitively expensive; however, as continues the cost sequencing of genome targets >1–2 Mbp in size to a high depth of coverage is pro of current NGS with DS technology to sequence large genomes or 2588 tion and, optionally, targeted DNA capture. We have made several protocol ? ! step isoptional, itisstrongly recommended. See 12. (Optional) Radiolabel the samplefrom eachstepand runitonan8Murea, 14%(wt/vol)acrylamide gel ( 11. Divide the adapters into 50- Remove and save1 10. Resuspend the pelletineachtubewith100  9. Remove the supernatant. Drythe tubefor 10mininvertedonapapertowel,and then for 5minupright. tubes for 30minat4°C >10,000 8. Remove the supernatant, add 1mlof 75%(vol/vol)ethanol toeachtubeand mixbyinversion. Immediately centrifuge the 7. Centrifuge the mixture for 30minat4°C>10,000 ? room-temperature 100%ethanol toeachtube;mixthe tubesbyinversion. 6. Divide the solution into aliquots in1.5-ml tubes, 275 5. Add 50 HpyCH4III (5U/ ddH 10× NEBCutSmart buffer DNA from step3 R for quality control purposes;labelas‘cutadapters’. into four 0.2-mlPCRtubes. Incubate the tubesfor 16hat37°Cinathermocycler withaheated lid. After16h,remove and save1 4. Cleavethe extended adapters bymixing the components inthe tablebelowbycarefully pipetting upand down and then dividing ? not work.The strands couldmelt ifthe DNAisoverdried afterprecipitation.  quality control purposes;labelas‘extended adapters’. 3. Ethanol-precipitate the DNAand resuspend itwith200 Klenow exo ddH 10 mMdNTPmix 10× NEBbuffer#2 DNA from step1 R into two0.2-mlPCRtubesand incubating for 1hat37°C. 2. Extend the annealed adapters bymixing the components inthe tablebelowbygently pipetting upand down, and then splitting not automatically coolafterthe heating cycle.  save 1 PCR tube;heat the tubeto95°Cfor 5mininathermocycler withaheated lid. Turn off the machine and leaveitfor 1h.Remove and 1. Anneal the twooligonucleotides (MWS51and MWS55in The adapter synthesis protocol generates asufficient amount of adapters for several hundred samples. Box 1

eagent eagent CAUT

CR CR CR TROUBLES TROUBLES TROUBLES

| 2 2 VOL.9 NO.11VOL.9 O O I I I

T T T

µ I I I I ON CAL CAL CAL l for quality control; labelitas‘annealed adapters’.

µ |

− γ l of 3Msodiumacetate(pH 5.2)and mix.The final volume should be550 (5U/

SynthesisofDSadapters - STEP STEP STEP H H H 32 OOT OOT OOT P-ATP isradioactive. Take appropriate stepstoavoid exposure toradioactive compounds.

µ

| ′ 2014 It isextremely important thatthe oligonucleotides beallowedtocoolslowly. Make sure thatthe thermocycler does Drythe DNApelletfor the exact times specified. Overdrying the DNAcouldleadtostrand melting. It isessential thatthe twostrands never melt. Ifthey do, the complementarity of the tagswillbelost,and DSwill l) dA-tailing, adapter ligation, PCR amplifica PCR ligation, adapter dA-tailing,

µ I I I

l) µ N N N

l for quality control; labelas‘final adapters’.

G G G

|

natureprotocols Volume ( 27.9

µ

l aliquots, and place them at−80°Cfor long-term storage. The final concentration willbe~50 11.6 11.6 27.9 199

g .

m 15 Volume ( 235 50 200 l)

µ m Final concentration 0.5U/ N/A 1mM 1× Variable B l TE l)

ox

low 2 for adetailed protocol. g . Afterresuspension, poolallthe tubestogether for afinal volume of 200 ● . µ µ l pertube(~2tubestotal),and add 675 0.5U/ Final concentration N/A 1× Variable T µ

l of ddH able l T IMI - - - 3 ) bycombining 100 2 N O. We recommend saving 1 factors, including the number of reads devoted to a sample, the the sample, a to devoted reads of number the including factors, on a of number depends amount the PCR input that determined However, DNA optimal. of was we have since attomoles ~40 that used during the PCR step. In the original publication, we specified An additional change in the protocol involves the amount of DNA different synthesis method that we a have require included in adapters the design, new this of part As DNA. sample the of dA-tailing efficient more for allow to adapters sequencing the re-designed have reli we Specifically, its reproducibility. and ability increased substantially have which its publication, since initial protocol the to optimizations and updates ­important G µ l 2d µ µ l. l of each100 µ l of the resuspended adapters for µ l (i.e., 2.5volumes) of µ M oligonucleotide ina0.2-ml Fig. 2 a ) . Although this Boxes

µ 1 µ M. and l. µ l 2 - . © 2014 Nature America, Inc. All rights reserved. SAMtools (BWA) Aligner Burrows-Wheeler the include these data; ing sequenc the process to packages software standard of number a ( data our lyze ana and process to use we that pipeline computational hensive publication. original our in presented genomes, were that plasmids pure small other and mtDNA as such highly small, beyond DS of utility the expands greatly which capture, DNA targeted use to option the included have we Finally, data. of amount final the maximize to amounts input DNA optimal the provide and parameters these between relationship the outline formally We number. copy target and gene and size, target and genome capture, DNA targeted of use coordinates are identified and grouped to and grouped are form coordinates identified ‘tag families’ with After alignment, reads sharing the same tag sequence and genomic to The reads are BWAusing the then genome reference aligned ( ‘tag_to_header.py’ called script custom python the by performed all are steps These discarded. are tag the within bases nine than greater or homopolymers nucleotides ambiguous with Sequences header. read the in stored is that tag 24-nt single a to combined is reads paired-end two the of each on present tag 12-nt the and read, each from removed tationally compu is sequence 5-bp invariant the First, site. ligation the to corresponding sequence by 5-bp followed an sequence, invariant https://gith latest version of the DS software package can be downloaded from (ii) initial alignment; SSCS and assembly; (iii) DCS assembly. The and Tag parsing (i) steps: major three into down broken is flow as well as several custom Python scripts. The computational work R ! 2. ForeachDNAsample, mixthe following components and incubate the mixture at37°Cfor 30min. 1. DiluteeachDNAsamplefrom steps1,3,4and 5from and should beavoided. Bioanalyzer 2100,do not haveenough resolution todetect the 9-bpsizechange thatoccursduring enzymatic restriction (step4) Therefore, the evaluation of the extension step(step3)willbelimited. Technical solutions, such asthe Agilent TapeStation 2200or synthesis steps. Agarose electrophoresis couldalsobeused;however, the fluorescent dyes used require double-stranded DNA. with polyacrylamide electrophoresis provides the most consistent and sensitive waytoevaluate the success of the adapter The following stepsprovide amethod bywhich toevaluate the synthesis of the adapters. We havefound thatthe useof radiolabeling adapters andPAGE  bromophenol blueband reaches three-fourths of the waytothe bottomof the gel. 6. Run5 tetramethylethylenediamine (TEMED).Pour itrapidly into the sequencing gel platesetup,asdescribed bythe manufacturer. 5. Mix100mlof the 8Murea, 14%(wt/vol)polyacrylamide gel mixture with130 4. Boilthe DNA:buffermixture for 1min. 3. Add 20 T4 PNK(10U/ ddH γ 10× NEBpolynucleotide kinase (PNK)buffer DNA from step1 blue marker. Donot runthe gel for toolong toavoid lossof the fragment. See In addition to protocol, a we In the updated include also compre addition Each read obtained from a DS run consists of a 12-nt tag tag 12-nt a of consists run DS a from obtained read Each Box 2 -

eagent CAUT 32

CR P-ATP (6,000Ci/mmol) 2 O I T

I I ON CAL 3

µ 8 ub.com/loeblab/Duplex-Sequencin , Picard and the Genome Analysis Toolkit (GATK) µ l of the final DNA:buffermixture onthe 8Murea, 14%(wt/vol)polyacrylamide gel at60W for 1.5–2horuntil the

| γ l of the denaturing gel-loading buffercontaining sequencing gel dye mixfor atotalvolume of 40

- Optionalqualitycontrol steps foradaptersynthesis usingradiolabeled STEP 32 Fig. Fig. µ P-ATP isradioactive. Take appropriate stepstoavoid exposure. l)

The small fragment thatisremoved byethanol precipitation migrates near the bottomof the bromophenol

2 ). The computational workflow for DS uses uses DS for workflow computational The ).

T IMI Supplementary Fig. 1 Fig. Supplementary

N G g . 1d

Volume ( 0.5 0.25 16.25 1 1

B

ox

m

39

1 l)

, in9 3 3 40

). 7 7

- - - - -

, . , µ

0.5U/ Variable we have made since the initial publication initial the since made have we col, as well as provide details about the changes and improvements tions that should be taken into account for each stage in the proto considera important highlight we will sections, In following the design Experimental genome. reference to the re-aligned then are data DCS essed are filtered out during this step ( replaced by ‘N’. Reads containing a high proportion of Ns (>30%) DCS. Non-matching bases are considered undefined, and they are at each position, with only matching bases being kept to produce a tag a tag with designated are sequences sub-tag 12-nt two the if Specifically, another. one to relative transposed being of virtue by grouped be can reads SSCS of pair a to responding each cor with tags the and sequences, 12-nt two of associated consists read sequence tag 24-nt The ‘DuplexMaker.py’. called script a by compared and together grouped are DNA strands tial the in change ( accuracy method’s appreciable any without yield data reduces this that we but have found agreement, sequence or >70% family per members three than more requiring Wewith have experimented The resulting data are referred to as SSCSs undefined. ( considered are and an ‘N’ by replaced are consensus a form cannot that Positions position. that at sequence same the have members the of 70% at least when is only kept a position of bers are then compared at each sequence position, and the identity requires three members to result in a tag family. The family mem called ‘ConsensusMaker.py’. script a python By the script default, l of ddH Next, the two related SSCS reads corresponding to two the ini corresponding reads Next, SSCS two the related Final concentration β α in read 2. The paired SSCS strand reads are then compared 2 0.4 N/A 1× O. Figure α β µ µ in read 1 is compared with the sequence having the having sequence the in with 1 read is compared l of ammonium persulfate and 75 M µ l 3 for anexample gel. Supplementary Fig. 3 Fig. Supplementary natureprotocols Supplementary Supplementary Fig. 4 µ

l. α | VOL.9 NO.11VOL.9 and and Supplementary Fig. 2 1 ). 9 . µ β l of , then a sequence ,a sequence then protocol

| ). The proc 2014

|

2589 ). ------

© 2014 Nature America, Inc. All rights reserved. methods, which involve simultaneous fragmentation of the the of ‘Tagmentation’ fragmentation simultaneous optimization. involve which require methods, probably would but DS, with nebuliza compatible be both should as digestion such enzymatic or tion methods Other below. Setup Equipment in given are parameters shearing Specific DNA. nuclear longer for and times DNA) viral or mtDNA as (such genomes small for times shorter with genome, the of size the on depending varied is time The shearing ultrasonicator. acoustic a using tion Covaris fragmentation. DNA precipitations. ethanol DNA during the overdrying or urea as such agents chaotropic of use °C, 80 the above heating tions that could result in the of melting two DNA such strands, as performing DNA isolation, it is imperative to avoid any manipula When capture. targeted require not do that mtDNA, or plasmid ture and 100–300 ng of DNA for smaller genomes, such as purified steps. Typically, 1–3 account for expected sample loss at the enzymatic and purification of the DNA throughout the initial sample preparation steps and to with start excess amounts of DNA to allow for easy quantification low.is extremely However, it feasible, when is most convenient to DS for needed DNA starting of amount theoretical the so range, used in the PCR amplification step is typically in the low-attomole requirements. Sample Sequencing. The blue numbers correspond to the steps in the PROCEDURE. Figure 2 2590 sample DNA and ligation of sequencing adapters by a transposase protocol

| VOL.9 NO.11VOL.9 Index reference genome w/BW

| Schematic of the basic computational workflow for Duplex A | 2014 55 µ g of g DNAof is required for DNA targeted cap We physically shear the DNA by sonica by DNA the shear physically We The amount of fragmented DNA actually DNA actually fragmented of The amount | natureprotocols Local realignment/end Extract duplextags (tag_to_header.py) Realign DCSreads Mutation detection clipping (GATK) Sequence DNA Sort andfilter and analysis (paired end) Align reads DCSs 63 and64 47 or53 65–67 68–71 73–75 57 56 (DuplexMaker.py) MakeDCS (ConsensusMaker.py) Make SSCSs Sort SSCSs Sort reads 58 and59 62 61 60 - - - -

then, we have determined enzymatic T-tailing to be consider be to T-tailing enzymatic determined have we then, a 3 DS of tion 3 and repair end to subjected is DNA sheared the protocol, construction library dA-tailing. and repair End speed. greater and recovery higher of benefits additional the has which beads, XP Ampure with selection size we perform problems, these mitigate To transillumination. UV during damage DNA of introduction the increases gels of use the addition, In DS. with use its cluding pre strands, DNA two the of melting in result can selection size gel-based However, gel. the from range purification fragment desired and the of excision with PAGE, use preparation library selection. Size strands. complementary between comparison the perform and DNA molecules unique to identify impossible is it barcodes, molecular with marked not are strands DNA the bar Because codes. molecular with incompatible invariant is an that of sequence use transposon make they because DS currently with are incompatible kit), prep sample DNA Nextera Illumina’s (e.g., correspond to the designations in the adapters at each step on the synthesis. Lane and gel band designations b = 56 nt; c = 83 nt; d = 75 nt; e = 48 nt; and f = 9 nt. ( and lane 4: step 10 (final adapters). Band sizes are as follows: a = 58 nt; adapters); lane 2: step 3 (extended adapters); lane 3: step 4 (cut adapters); steps in the synthesis protocol described in step of the adapter synthesis. Lane numbers correspond to the following synthesis. ( Figure 3 ( form double-stranded a to it converting thereby sequence, tag degenerate the copy to single- used is 12-nt polymerase a DNA A sequence. tag contains randomized which stranded of one oligonucleotides, two synthesis. Adapter preparations. library NGS conventional with consistent more protocol the makes and ( yield data final the increases neously and section synthesis’ ‘Adapter 3 a have to adapters the re-engineered reads. We final of have number reduced a substantially in results ( A-tailing than efficient less ably ′ dT overhang on DNA the fragmented sample. However, since

| Quality control of the sequencing adapters at each step of a 1 ) ) Representative 14% (wt/vol) polyacrylamide gel for each 9 , we used A-tailed adapters and enzymatically added added enzymatically and adapters , A-tailed we used Many protocols for next-generation sequencing sequencing next-generation for protocols Many ′ a -end dA-tailing. Notably, in our initial descrip initial Notably, our in dA-tailing. -end b a DS adapters are constructed by annealing annealing by constructed are adapters DS 1 c a 2 Fig. Fig. d f a e 3 Similar to the standard Illumina Illumina standard the to Similar a 3 d a e . ). The extended product contains contains product extended The ). 4 Boxes Boxes b Supplementary Fig. 5 Fig. Supplementary 4 3 2 1 e d a e d b c a B ′ ox dT overhang ( overhang dT 1 Supplementary Table 1 Table Supplementary and and 1 . . Lane 1: step 1 (annealed T T A 2 f ), which simulta which ), b ) ) Schematic of Fig. 1 Fig. ), which which ), a , see

) - - - - - © 2014 Nature America, Inc. All rights reserved. ( six and corresponds to ~40 raw reads being required to form one DCS read. The maximum efficiency of DCS formations occurs at a peak family size of final DCSs that are formed for every read originally dedicated to a sample. different lane fractions. ( Figure 5 improving data yield. Because the number of reads varies between appreciably without capacity sequencing wastes size) family tag large (i.e., sequence tag same the having reads many versely,too con be calculated; cannot sequence a size), consensus family tag small (i.e., sequence tag same the sharing reads few too are there If formed. DCSs of number final the influences strongly which size), family tag (i.e., sequence tag same the share that reads ing sequenc of the number that dictate variables adjustable primary tion a to of lane dedicated sample, sequencing a are the particular frac the with in PCR, the used along DNA of fragments number The molecule. DNA multiple double-stranded a of creating strand each of of copies purpose the for PCR-amplified is library amplification. PCR by size family tag Setting with ( beads XP Ampure purification by removed then are dimers adapter and ers 20-fold molar excess use of adapter relative to DNA.then Unligated adapt and 2100 Bioanalyzer or 2200 TapeStation Agilent an on library DNA preligated the in con molecules molar DNA of the centration determine we dimers, adapter of presence the minimize To sequencing. DNA with interferes dimers these of adapter presence The PCR. during amplify preferentially size, can adapters result in dimers, adapter which, owing to their small excess of use the and ligation, to inefficient lead adapters too few ligation. Adapter avoided. be should and requirement ‘phasing’ platform’s the of limitations technical of because sequencer the Illumina with is incompatible <12 of length However,tag a library. a in molecule every of ling exceedsignificantly the degeneracy needed to ensure unique labe random per nucleotides adapter (24 nt per final ligated molecule) Twelve sequenced. be to library fragment DNA the on overhang ( adapter final a in 3 results site this of Cleavage sequence. tag the of downstream site restriction HpyCH4III an not increase the final depth of coverage. sequencing data from the two sequencing runs can be combined and analyzed. Importantly, further sequencing of a sample with a approximatelylarge peak six familymembers size maximizeswill the final number of DCS reads. Samples that exhibit a small peak family size can be sequenced again muchand thePCR rawinput. ( Figure 4 number of DCSs. reached. A peak family size a b

) ) The total number of DCSs increases until a peak family size of ~16 is Fraction of total reads 0.02 0.04 0.06 0.08 0.10 0

| | 5 1 Representative tag family size distributions. ( Optimal peak family size. Replicates of the same sample at Tag familysize(no.ofreadsperfamily) Figs. 1a 1a Figs. 10 c ) Tag family size distribution that is too large because of too little PCR input. We have found that a family size that is centered around The DNA:adapter ratio is a crucial variable, as variable, a is crucial ratio DNA:adapter The 15 Supplementary Fig. 6 Fig. Supplementary a ) ) Plot compares peak family size to the number of 20 and > 16 16 does not result in an increase in the final 25 3 b ), which can be ligated to the 3 the to ligated be can ), which 30 ′ dT overhang on the end of the the of end the on overhang dT 35 40 ). b Fraction of total reads 0.05 0.10 0.15 0.20 0.25 0.30 a The sequencing sequencing The ) Optimal family size distribution. ( 0 3 1 Tag familysize(no.ofreadsperfamily) ′ dA dA 7 5 ------different tag families and occur in a distribution ( distribution a in occur and families tag different as different tags) as a function of tag family size (the ‘PE_reads. (the size family tag of function a as tags) different as the same size (e.g., tag families can have of the same number of reads families tag to belonging reads of proportion the plotting By library. the in present DNA molecules the of efficiencies fication ampli different of result the is it and amplification, PCR during generated under a given set of conditions. This distribution occurs ‘peak family size’ as our preferred metric to refer to tag family sizes number of raw of number reads to produce a DCS ( single smallest the is, requiring that DS: of efficiency the maximizes six of size family peak a that determined have we size, family peak the for values different with samples of On analysis the reads. of basis of the proportion highest the containing >1 tag size the family as size family peak define formally We example). an as a centered at broader distribution the peak family size (see and region, tag the in errors sequencing of result the probably is one, which of size family a at tag peak tag a solitary with of sizes family distribution a observe typically we data), these section provides PROCEDURE the of 60 Step in generated file tagstats’ 1 9 b a Number of DCS reads Number of DCS reads / number of

2E+04 4E+04 6E+04 8E+04 1E+05 1E+05 total reads 0.002 0.004 0.006 0.008 0.010 0.012 0.014 1 1 0 b 0 6 4 2 0 ) Tag family size distribution that is too small because of too 0 1 3 8 6 4 2 5 c Fraction of total reads natureprotocols 0.01 0.02 0.03 0.04 0.05 0.06 0 2 1 Peak familysize 5 9 Peak familysize Tag familysize(no.ofreadsperfamily) 1 8 8 9 10 9 0

| VOL.9 NO.11VOL.9 125 12 12 163 Fig. Fig. 5 protocol 14 14 Fig. Fig. 201 a | 2014 ). Although 16 4 16 ), we use use ), we 239 Fig. 4 |

18 2591 18 277 a -

© 2014 Nature America, Inc. All rights reserved. in a depth of 2,500–5,000. of depth a in results lane HiSeq a of 25% for mtDNA input of attomoles 200 constant, size family peak the keep to needed fraction lane and input PCR between relationship linear the Given 500–1,000. of depth a in results typically this mtDNA, human 16.5-kb the For lane). a HiSeq2500 of peak capacity sequencing ~5% (i.e., to a sample optimal reads an paired-end million produces eight devoting DNA when six of of size family amol 40 of input PCR a that found have we DNA, plasmid or mtDNA purified as such six. of size peak family appropriate the obtain to PCR the in used be should that in to referred be can value this fraction, lane per-sample the determining After optimal. are ratio SSCS:DCS and size family peak the that vided pro be obtained will that the depth of a estimate rough provides raw of number Notably,formula ~40. to this DCS the single a form to needed raise reads to tends this and DCS, a form to fail that present typically are reads extra several quality, reads high raw of are all not because However, six). around of average an and ten, four with between ratios SSCS:DCS obtain (we typically DCS single a form to needed typically SSCSs of number the and six) (i.e., size family peak optimal the of to product the responds cor roughly value This DCS. single a form to needed read-pairs that approximately corresponds to the pseudo-constant average number of unprocessed determined empirically an The is 40 analysis). of our value in (76 bases in length read postanalysis the G fraction), lane (i.e., Where of number sample: a to devote to the reads estimate to used be can formula following The increases. target genome the of size the as decrease will sample a of depth sequencing final the par fraction, In lane given a coverage. for ticular, of depth desired the and region targeted the of size the by influenced be will sample) particular a to devoted and family sizes and substantial decreases in DS efficiency ( in DS efficiency decreases substantial and sizes family peak suboptimal to lead fraction lane and input PCR to changes ( six of to size fraction family lane peak a dedicated maintain the and PCR for DNA of input amounts the both increase proportionally to necessary is it DS, in target, genome more DCSs are or needed. To the increase genome number reads of obtained larger a for depth same the achieve obtain. we that DCSs of number final the maximizing while error, measurement or ting pipet of because size family peak in tolerance some for allows a for aim six and between 12 size peak family members. ranging This range we PCR, for amounts input DNA on deciding when consume sequencing capacity without and yielding further DCS data.redundant Therefore, are reads additional members all 16 nearly family, and six between ranging ( sizes family peak for yield continues to increase, albeit at data a decreasing that overall efficiency,noted be should it optimal, is six of size family peak a 2592 Fig. 5 Fig. protocol is the genome or genome target size in base pairs and and pairs base in size target genome or genome the is For samples that do not make use of targeted capture, capture, targeted of use make not do that samples For to or target, given a for depth sequencing greater Toachieve 5

| a VOL.9 NO.11VOL.9 b ). The choice of lane fraction (i.e., the number of reads reads of number the (i.e., fraction lane of choice The ). N ). As the peak family size approaches 16 reads per tag tag per reads 16 approaches size family peak the As ). is the number of paired-end reads devoted to a sample sample a to devoted reads paired-end of number the is Table | 2014 D 2 to determine the amount of target DNA target of amount the determine to is the desired average depth of coverage, coverage, of depth average desired the is | natureprotocols

N = 40 R DG Table 2 ). Nonproportional Nonproportional ). Figs. 4b Figs. R is is , c - - - -

ence ence the input of amount DNA. The value of influ markedly can and genome nontarget per copies of sands genes, ribosomal or ignored. However, in the case of multicopy targets such as mtDNA the genome. For of single-copy genes, portion nontarget the of copy per target the of copies of T ( fragments Where formula: following the by number and the size copy of the genomic relative target, and its it on can be estimated dependent is PCR the in used DNA total of amount DNA. The total of amount the than less much generally of the amount to specifically refers value this sample, noncapture a to contrast reads paired-end lane). a However,of HiSeq2500 capacity (i.e., ~5% sequencing in million ~8 for used is DNA target of amol 4 T the Notably, 20. and 15 cycles between occurs typically which reactions one or two cycles before the SYBR Green signal plateaus, by as and for remove qPCR mtDNA, the the reaction we monitor DNA, such input of fmol <10 requiring samples For cycles. PCR of number the monitor carefully we ( Tocomplication, this avoid products higher-molecular-weight nonspecific, of in can result the appearance phase the exponential beyond cycles ( fraction lane increasing DNA input amounts should also be proportionally increased with and target target the the DNA inputs, nontarget the to to DNA. Similar nontarget specific are that primers using (qPCR) PCR quantitative with copy analysis number by performing estimated able able 80 80 × 10 72 × 10 64 × 10 56 × 10 48 × 10 40 × 10 32 × 10 24 × 10 16 × 10 8 × 10 R is the size of the target DNA in base pairs and and pairs base DNA in target the of size the is When performing targeted capture, a PCR input amount of of amount input PCR a capture, targeted performing When The The number PCR of cycles is variable. an Additional important eads persample 2 2 6 m 6 6 6 6 6 6 6 6 6 | is the desired amount of of amount desired the is

DNA DNA input amounts for PCR. Table target 2 ), DNA in the captured is sample, which being N G can be on the order of hundreds or thou or hundreds of order the on be can is the size of total genome in base pairs, pairs, base in genome total of size the is A To ttomoles of D (with capture) Table ta D l 40 36 32 28 24 20 16 12 N 2 NA 8 4 ). will have a value of 1 and can be = target mG NT NA

DNA in attomoles of of attomoles in DNA N A (with nocapture) ttomoles of D can be reasonably N is the number number the is 400 360 320 280 240 200 160 120 80 40 Fig. 6a Fig. NA , b

). - -

© 2014 Nature America, Inc. All rights reserved. loid genome, composes only 0.0007% of the total DNA. Thus, a Thus, DNA. total the of 0.0007% only composes genome, loid 20-kb a DNA, in genomic locus is in However,copy which present a per hap single DNA. targeted the of purity 100% nearly in result will enrichment 10,000-fold a Therefore, cell. human a in DNA of mass the total copy of number,its high ~0.2% composes to owing genome, mitochondrial the example, For DNA. target the of enrichment 10,000-fold a in results typically capture that is reason The regions. low-copy-number small, targeting when inefficient is However,capture mtDNA. as such targets number original DNA lost. is the of complementarity com the and PCR, single-stranded during its ponents into melted is DNA the duplex because DS targeted with compatible not are methods enrichment note, PCR-based Of are optimization. to some likely they require and technologies, these evaluated not have However, we well. as work probably would manufacturers other by enrichment offered methods Hybridization-based success. good with flies, and mice humans, including organisms, of number a from mtDNA and nuclear both against sets capture several targeted used custom-designed and designed have We system. enrichment target is interest often Wedesirable. have the used SureSelectXT Agilent capture. DNA Targeted ( cycle number optimal the to 2100 determine then on be quantified an Agilent TapeStation 2200 or Bioanalyzer fully removing a small aliquot from the reaction. The aliquots can care and cycles two every thermocycler the pausing by products needed to maximize yield while avoiding higher-molecular-weight tem or genomic target, we recommend testing the number of cycles respectively. Whenlater, up first setting DS for sys a or new experimental earlier cycle one stopped be should reactions the PCR, the in used DNA of amount the in decrease or increase twofold every for Importantly, eight. and six cycles between stopped be found that have samples we requiring PCR, ~600 during fmol of used total be input DNAshould should that cycles of number the determining for benchmark a As unreliable. be can signal Green SYBR the that found have we DNA, nuclear of capture targeted monitored. carefully be should and PCR the in used DNA of amount the on dependent be will point plateau specific ( Step 47 showing higher molecular species resulting fromFigure too 6 many PCR cycles. See Experimental design for details on determining the number of PCR cycles. c a ) Electropherogram of the postligation at Step 36; note that the double peaks are normal. The peaks can vary in size and intensity without affecting the final results.

Capture is readily performed Capture on is large readily performed targets, or on high-copy- If a sample requires >10 fmol of DNA, such as when performing Sample intensity (FU) 100 200 300 400 500 600 0

| Example Agilent TapeStation 2200 electropherograms. (

25 Lower

50

100

Targeted capture of the sequences of of sequences the of capture Targeted 200 300 333 400 500 700 1,000 1,500 Upper Size (bp) b

Fig. 6a Sample intensity (FU) 100 200 300 400 500 600 0 , b a ). ) Electropherogram of an optimal post-PCR sample at Step 47. ( 25 Lower

50 - - - - 100 two primary factors determine overall data yield: the peak family family the peak yield: data overall determine factors two primary from 1 to it range can 10%. in and hands our We that have found reads, raw of number the by reads DCS of number the dividing metrics. Quality sequence. target the to mapping reads of >85% in results typically steps capture geted tar the repeating that found However, have we on-target. being DNA final the of 7% only in result will enrichment 10,000-fold first and last five bases of each read each of bases five last and first the GATK, with by clipping followed re-alignment local forming per by mitigated be can end-repair artifacts the These alignment. during and process occur that errors to due likely are facts 5 the at mutations of frequency an increased however, we observe positions; read possible all out steps. tailing library and tions condi ligation to optimizing to more attention data, obtain with order in remade be to need will library sample the cases, such In PCR. during amplified and ligated successfully was DNA strands despite adequate family sizes, this suggests that only one of the two ( 5–6 of sizes peak above improve appreciably not does it although above, noted as size, family the increasing by overcome readily is issue This strands. DNA both from form not might members three least at with families then small, too are sizes family peak If reasons. several for occur can ten, with an average and of ~6. An four excessively high (i.e., ratio >14–15) between ratios SSCS:DCS obtain Wetypically DCS. a and form SSCS can find its that partner every and would indicate ratio is the other key factor. A ratio of two is the ideal theoretically protocol the again on the same biological sample will not performing work. The SSCS:DCS replicate; technical a be must sample resequenced the Importantly, run(s). previous the with data the in the information using peak family size can be obtained by resequencing the same sample optimal an then desired, is what than smaller is that size family section; amplification’ PCR by size family tag ‘Setting the (see previously discussed as to six, be adjusted optimally it to should and a sample, dedicated by the amount of DNA used for the PCR and the fraction of a lane determined is size family peak The ratio. SSCS:DCS the and size 200 Finally, mutations should theoretically be distributed through Finally,be distributed theoretically should mutations 300 400 346 500 700 1,000 1,500 Upper Size (bp) The overall efficiency of DS is calculated by by calculated is DS of efficiency overall The c natureprotocols Sample intensity (FU) 100 200 300 400 500 600 Fig. 5 Fig. 0 Supplementary Table 2 Supplementary a ). Should a sample exhibit a peak peak a exhibit sample a Should ). 25

′ Lower and 3 and Fig. 5 Fig. 3

9 50 . b ) Electropherogram from ′ ends of reads. These arti These reads. of ends

b | VOL.9 NO.11VOL.9

). If a poor ratio is seen seen is ratio poor a ).If 100

200 182 300 protocol

and combining and combining 400

500 504 | 2014 700 681 1,000 1,500 Upper

|

2593 Size (bp) - - - - -

© 2014 Nature America, Inc. All rights reserved. Xs represent the sampleindexing sequence. We recommend using the Illumina TruSeq index sequences availableonthe Illumina website. T • • • • • • • • • • • • • • • • • • • • • • • • • REAGENTS M 2594 able able N MWS21 MWS20 MWS13 MWS55 MWS51 protocol optional and is needed only if targeted capture isperformed optional only andisneeded if SureSelectXT reagent kit, HSQ, 16(Agilent, cat. no. item G9611A)This is evaluatednot been andtheir compatibility isunknown. other targeted capture probably products will work; however, have they uses enrichment target section The Agilent SureSelectXT. useof The  Agilent SureSelectXT (protocol enrichment target Set version 1.6; Agilent) (Agilent, cat. no. 5067-4626) DNA forhigh-sensitivity theBioanalyzer 2100may used be cat. no. 5067-5363and5067-5364). Alternatively, an Agilent Agilent D1KTapeStation DNA high-sensitivity kit(Agilent, SYBR Green (10,000×; Life Technologies, cat. no. S-7563) cat. no. buffer and10mMdNTPmix) 5×fidelity KK2502; supplied with KAPA HiFi HotStart DNA (1U/ml; KAPA polymerase Biosystems, buffer) 10×Ultrapure ligation supplied with (600,000U/ml;Ultrapure T4ligase Enzymatics, cat. no. L6030-HC-L; buffer) NEBNext dA-tailing module(NEB, cat. no. E6053L; includes reaction buffer) NEBNext end-repair module(NEB, cat. no. E6050L; includes reaction for sequences Sequencing adapter PCRoligonucleotide (IDT; primers see  DS adapter oligonucleotides (IDT; see buffer gelloading (Life Technologies,Denaturing cat. no. AM8546G) neurotoxin; care. itwith handle solution; Bio-Rad, cat. no. 161-0154) Acrylamide (Supplied acrylamide/bis-acrylamide asa19:130%(wt/vol) Urea (Sigma-Aldrich, cat. no. U6504) it. handling ! γ acetateSodium (Sigma-Aldrich, cat. no. S2889) ! Ethanol (100%, 200proof; Sigma-Aldrich, cat. no. E7023) dNTPs, 2.5mMeach (Promega, cat. no. U1511) HpyCH4III (NEB, cat. no. R0618; includes CutSmart buffer) T4 polynucleotide kinase(NEB, cat. no. M0201; includes reaction buffer) Nuclease-free water (Life Technologies, cat. no. AM9937) TE 10× TBEelectrophoresis buffer (Sigma-Aldrich, cat. no. T4415) cat. no. M0212L, includes NEBbuffer #2) Klenow (3 fragment may used be Alternatively, Agencourt RNAclean XP(BeckmanCoulter, cat. no. A63987) Agencourt AMPure (BeckmanCoulter, XPbeads cat. no. A63880) conditions.denaturing DNA interest of ATER -

CAUT CAUT ame 32

CR CR low P-ATP (6,000Ci/mmol; MPBiomedicals, cat. no. 0135001U01)

| I I (Affymetrix, cat. no. 75793) VOL.9 NO.11VOL.9 T T 3 3 I I I I I ON ON CAL CAL ALS |

Sequences of the adapter and PCR oligonucleotides. Ethanol isflammable. it Keep away from flames. γ The use of target enrichment target isoptional for thisprotocol. useof The Oligonucleotides must PAGE be purified. - 5 5 5 5 5 32 ′ ′ ′ ′ ′ -CAAGCAGAAGACGGCATACGAGATXXXXXXGTGACTGGAGTTCAGACGTGTGC-3 -GTGACTGGAGTTCAGACGTGTGC-3 -AATGATACGGCGACCACCGAG-3 -TCTTCTACAGTCANNNNNNNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3 -AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3 P-ATP isradioactive. Take appropriate care when 

CR | ′ → 2014 I T 5 I ′ CAL exo |

− It under non-strand- must purified be natureprotocols ; New Biolabs (NEB) England !

Table CAUT I 3 ON for sequences) ′ Acrylamide isaknown ′ S equence

Table

3

• • • • • • • • • • • • • • • EQUIPMENT • • • • • • • • Pipette tips theadapters re-designing of sequencing have platforms evaluated notbeen andwould require thisprotocol,incompatible with isnotrecommended. Non-Illumina clusters relative to and, theotherplatforms Illumina not although this protocol. MiSeq The reduced usesasubstantially number of NextSeq500, HiSeq1000 andHiSeq2000 are compatible all with HiSeq2500Illumina andassociated equipment (Illumina): Bioanalyzer 2100(Agilent, cat. no. G2938C) Agilent TapeStation 2200(Agilent, cat. no. G2964AA)or Agilent (Covaris,Sonication tubes cat. no. 520045) Covaris Sonicator S220(Covaris) Heating plate stir Heating block (37°C) Microcentrifuge (Eppendorf, cat. no. 5424000.410) qPCR-compatible caps(Bio-Rad, PCRstrip eight-well cat. no. TCS0803) cat. no. TLS0851) qPCR-compatible (0.2ml; tubes Bio-Rad, PCRstrip eight-well qPCR thermocycler (Bio-Rad) Microcentrifuge (1.7ml, tubes Eppendorf) Vacuum centrifuge cat. no. 12331D) DynaMag-96 plate separator side (Life magnetic Technologies, Eight- with v0.1.3. with Distance software pa in thisprotocol are to known v1.107 compatible be with Picard software package ( are to known v2.4-9 compatible be with ( Genome Analysis Toolkit (GATK) software package inthisprotocolused are v0.7.5 compatible with Pysam software package ( protocol are v1.62. compatible with softwareBioPython package ( 3.x Python with protocol have testedv2.7.x. been with are They notcurrently compatible softwarePython package ( compatiblenot be other softwareSAMtools; to theuseof manipulate sam/bamfilesmay this protocol with distributed for usewith designed hasbeen the protocol to isknown versions work with SAMtools software package ( compatiblenot be protocol BWA; for usewith designed hasbeen different may aligners toknown versions work with BWA software package ( http: //www.broadinstitute well PCR strip tubes (0.2ml; tubes BioExpress,well PCRstrip cat. no. T-3035-1) ′ ′ ′ ckage: inthis protocol used thescripts are compatible http:// http://pi http://code.googl (Step (Step 38A) targeted capture with (Steps 38B and 51) or without and 51 to generate amplified tag families, PCR primers used at PROCEDURE Steps 38 in Annealed adapters prepared as described Duplex sequencing adapter oligos. http://python.org .org/gatk B http://samtoo ≤ http://biopython.or ox bio-bwa.sourceforge. 0.6.2. this software The with distributed 1 card.sourceforge.net and used in PROCEDURE Step 30 / ): the scripts used inthisprotocol used ): thescripts ls.sourceforge.net e.com/p/pysam / ≤ ): the scripts used inthis used ): thescripts P 0.1.18. software The urpose g / ): the scripts used inthis used ): thescripts net

/ / ): the scripts used used ): thescripts ): theprotocol is /

): the scripts ): thescripts / ):

© 2014 Nature America, Inc. All rights reserved. sample and resuspend. 12|  11| 10| 30 satroom temperature. 9| 8| beads are outof solution. Visually confirmthatthe beadshave moved tothe side of the tubesand thatthe solution isclear. 7| 6| and down. The 2:1ratio isusedtoensure the maximal retention of allDNA. 5| 4|  3| be 4 °C, and the water bath should be degassed for 30 min before shearing. bubbles, gently tap or shake to bring the solution to the bottom of the well. In addition, the water bath temperature should  2| we typically startwith200–500ng of DNA.Fortargeted capture, wetypically startwith1–3 1| arbitrary number of DNAsamples.  S PROCE in separate microcentrifuge 100 to tubes afinal concentration of Sequencing adapter oligonucleotides up to 6months. thesolutionand divide into 50- SYBR Green, 12.5× ethanol. solution This stored canbe atroom temperature for upto 1month. Ethanol solution, 75%(vol/vol) (20–25 °C). autoclaving. solution This stored canbe indefinitely at room temperature ddH ddH itin90mlof acetateSodium solution, 3M the volume up to 1 liter with ddH slowly stirring with mild heat using a stir plate. After the urea is dissolved, bring 466 ml of acrylamide and 100 ml of 10× TBE. Allow the urea to dissolve by Urea 8 M, polyacrylamide gel mixture 14% (wt/vol) REAGENT SETUP onication of D

CR CR CR CR 2 Remove the PCRtubesfrom the magnetic plateseparator and add 40 Allowthe beadstodry for 5minatroom temperature until there are no visibletraces of ethanol. Aspirate outthe ethanol and discard it.RepeatSteps9and 10once more for atotalof twowashes. O to adjust thefinal volume to 100ml. Sterilize thesolution by Carefully dispense 200 Aspirate the supernatant from eachtubeand discard it.Keep the tubesonthe magnetic plateseparator. Place PCRtubesonto the DynaMag-96 side magnetic plateseparator (orequivalent) for atleast2minoruntil allthe Incubate the mixture for 5minatroom temperature. Add 130 Split the 130 Vortex the room-temperature AMPure XPbeadmixture toresuspend any magnetic particles that may havesettled. Transfer the DNAtoaCovaris sonication tubeand shear DNAusing the settings outlined inEquipment Setup. For eachlibrary, dilutethe DNAinto afinal volume of 130 I I I I T T T T D I I I I URE CAL CAL CAL CAL

The following procedure ispresented for asingle sample;however, the protocol canbe scaledupfor an STEP STEP STEP µ 2 l of AMPure XPbeadstoeachPCRtube(2:1bead:sampleratio) and mixwellbyvortexing orbypipetting up O. Adjust 1MHCl, thepHto 5.2with andthen add

NA Dilute 1 Overdrying or underdrying the beads reduces yield. The beads must be at room temperature before use. A cold bead mixture significantly reduces yield. Ensure that there are no air bubbles in the bottom of the tube after loading the sample. In the case of air µ

l of sonicated DNAinto 2×65- ●

T µ

IMI l of 10,000×SYBRGreen in799 l of Add 24.6 g of sodiumacetateAdd anddissolve 24.6gof µ

2 l aliquots; store thealiquots at−20°Cfor Add 25 ml of ddH Add 25mlof O. Store it at 4 °C for up to 1–2 months. N µ G l of room-temperature 75%(vol/vol)ethanol toeachPCRtubeand incubate the tubesfor

1h Dissolve theoligos inTEor ddH

2 O to 75 ml of 100% O to 75mlof Add 480 g of urea to µ l of ddH l of µ µ M each. l aliquots in0.2-mlPCRtubes.

2 O

2

O,

µ 30 min before shearing. temperature should be 4 °C, and the water bath should be degassed for intensity, 5; cycles/burst, 100; and time, 20 s × 3. or plasmid DNA, the following settings should be used: duty cycle, 10%; burst, 200; and time, 20 s × 6. For small genomes, such as mtDNA, viral DNA to ~300 bp (range 100–500 bp): duty cycle, 10%; intensity, 5; cycles/ Covaris S220 Sonicator EQUIPMENT SETUP are they notatroom properly notfunction temperature. if will to tothe beads warm room temperature before use. AMPure beads XPmagnetic use. Primers stored canbe atthistemperature for 6–12months. 20 in TEor ddH Sequencing adapter PCRprimers Oligonucleotides stored canbe atthistemperature for 6–12months. Immediately freeze theoligonucleotidesuse. further at−20°Cuntil l of TE µ M each. Immediately freeze theoligonucleotides further at−20°Cuntil low . Forsamplesthatdo not usetargeted capture, µ 2 l of ddH O inseparate microcentrifuge to tubes afinal concentration of

2 The following settings are used to shear nuclear O toonlyone of the twotubesof each natureprotocols

The beads are beads storedThe normally at4°C. Allow

Dissolve each respective PCRprimer µ g of DNA.

 | VOL.9 NO.11VOL.9

CR  I T

CR I CAL I protocol T I The water-bath CAL | 2014 The beads beads The

|

2595

© 2014 Nature America, Inc. All rights reserved. Incubate the mixture ina thermocycler for 30minat37°C. 22| 3  and then transfer the supernatant toanew 0.2-mlPCRtube. 21| beads are resuspended. Let the beadmixture sitfor atleast2min. 20| 19| supernatant. Keep the beads. onto the magnetic plate separator for atleast2minoruntil allthe beadsare outof solution, and discard the 140- 18| beads are outof solution, and then transfer the supernatant (should be85 17| selects against large (>500bp),poorlysheared DNAfragments. 16| Incubate the mixture inathermocycler for 30minat20°C. 15| and then remove and discard the supernatant. 24| in size. Incubate the tube for 5minatroom temperature. 23| E  is clear. Transfer the supernatant toanew 0.2-mlPCRtube. Discard the beads. 14| mixture sitfor atleast2min. 13| 2596 DNA DNA from Step 21 R NEBNext end-repair enzyme Mix 10× NEBNext end-repair buffer DNA from Step 14 R 10× 10× NEBNext dA-tailing buffer NEBNext Klenow protocol nd nd repair of sonicated D ′

eagent eagent PAUSE PAUSE A PlaceA-tailed DNAsampleonto the magnetic plateseparator for atleast2min oruntil allthe beadsare outof solution, Add 60 Combine the components inthe tablebelowina0.2-mlPCRtubeand mixthem carefully bypipetting upand down. Placethe PCRtubebackonto the magnetic plate separator for atleast2minoruntil allthe beadsare outof solution, Remove the PCRtubefrom the magnet and add 42 Repeatthe double 75%(vol/vol)ethanol washoutlined inSteps9–11. Add 55 Placethe tubecontaining the end-repaired DNAonto the magnetic plateseparator for atleast2minoruntil allthe Add 35 Combine the components inthe tablebelowina0.2-mlPCRtubeand mixthem carefully bypipetting upand down. Placethe tubecontaining the combined samplebackonthe magnetic plateseparator for 2minoruntil the supernatant Carefully transfer the resuspended beadmixture tothe second samplePCRtubeand resuspend the beads. Letthe bead -tailing -tailing of blunt-ended D

| VOL.9 NO.11VOL.9

PO PO µ µ µ I I NT NT l of AMPure XPbeadstothe sampleand incubate the tubeatroom temperature for 5min.Placethe PCRtube l of AMPure XPbeads(0.7:1bead:sampleratio) and incubate the tubeatroom temperature for 5min.This l of AMPure XPbeads(1.2:1 bead:sampleratio) and mix. The beadratio selects against DNAfragments <150bp exo− Samples can be stored at −20 °C for several weeks. Samples can be stored at −20 °C for up to several weeks. | 2014 | natureprotocols NA and size selection NA Volume ( library Volume ( 42 3 5 ● m 40 l)

5 5 T IMI m l) N ● Final concentration G

1 h T IMI Final concentration µ Variable l of ddH N G 1× 1 h Variable N/A 1× 2 O toeachsampleand gently pipetteupand down until the µ l) toanew 0.2-mlPCRtube. Discard the beads. µ l

© 2014 Nature America, Inc. All rights reserved.  the magnetic separator for 2 min or until the supernatant is clear. Transfer35| the 30 by drying at 37 °C is also important, as ethanol inhibits PCR. that form during ligation. Failure to do so will adversely affect the downstream PCR steps. Removal of any residual ethanol  37 °Cinaheat blockwiththe tubecapopened. resuspended inthe ethanol byvortexing before being placedonthe magnetic separator. Drythe beadsfor 5minat 34| beads are outof solution. Remove and discard the supernatant. 33| Incubate the tubefor 5min atroom temperature. 32| 31| DNA concentration of your sample, butthe totalvolume should be60 As noted inStep29,the volumes of adapters and waterthatare usedwillneed tobeadjustedonthe basisof the 30| ligation reaction. Ifnecessary, dilutethe adapters inddH described in 29| L ? Agilent TapeStation 2200orAgilent Bioanalyzer 2100,according tothe manufacturer’s instructions. 28|  then transfer the 37 27| for atleast2min. 26| 25| Total volume Ultrapure T4 ligase (600 U/ ddH Duplex Sequencing adapters (50 10× Ultrapure ligation buffer DNA from Step 27 R igation igation of D

TROUBLES

eagent PAUSE PAUSE CR Remove samples from the magnetic separator and add 30 Repeatthe double 75%(vol/vol)ethanol washes outlined inSteps9–11,withthe exception thatthe beads befully Placethe PCRtubecontaining the ligated sampleonto the magnetic plateseparator for atleast2minoruntil allthe Add 60 Incubate the tubefor 15minat25°Cinathermocycler (no heated lid). Mixthe following components, inorder, ina0.2-mlPCRtube, and mixthem carefully bypipetting upand down. Byusing the amount of sampleDNAdetermined inStep28,calculatethe amount of DSadapters (synthesized as Remove 2 Placethe samplebackonto the magnetic plateseparator for atleast2minoruntil allthe beadsare outof solution, and Remove the tubefrom the magnetic plateseparator and resuspend the beadsin37 Repeatthe double 75%(vol/vol)ethanol washoutlined inSteps9–11. 2 I O T I CAL

PO PO µ

H I I B STEP NT NT l of AMPure beads(1.2×bead:sampleratio) and mix.Thisratio selectsagainst DNAfragments <150bpinsize. S OOT oxes µ adapters to sample D Samples can be stored at −20 °C for several weeks. Samples can be stored at −20 °C for several weeks. l from the A-tailed library and quantify the amount of DNAusing the high-sensitivity kitsfor either the Resuspending the magnetic beads during the ethanol washes is essential for removing adapter dimers I N 1 G µ and l of eluent toanew 0.2-mlPCRtube. µ 2 l) ) thatwillbeneeded for a20:1molar excess relative tothe totalDNAconcentration inthe final µ M) NA Volume (

● Variable Variable

T IMI 60 35 6 6 N m G l) 1 h 2 O toaneasilypipetteablevolume. Final concentration µ l of TE Variable Variable 60 60 U/ N/A 1× low µ µ to each sample. After 2 min, place the samples back on l l. µ l of supernatant to a new 0.2-ml PCR tube. natureprotocols µ l of ddH 2 O. Letthe beadmixture sit

| VOL.9 NO.11VOL.9 protocol | 2014

|

2597

© 2014 Nature America, Inc. All rights reserved. ( targeted capture, follow option B. plasmids thatdo not require targeted capture, follow option A.To generate duplicate families of genomic targets thatrequire 38| portion of the genome. (Refertothe INTRODUCTION and T where of totalDNAper8million reads. The formula for estimating the totalDNAinputis we use~2fmol of totalDNAfor the same number of reads. Fornuclear DNAcapture (20kbtarget size),weuse~600fmol For noncapture experiments, weuse40amol of inputfor every8million paired-end reads. Forcapture of mtDNA, 37| ? 2100. The ligation reaction canresult inmultiple peaks;wequantify allDNAseenbetween200and 900bp( 36| PCR 2598 protocol isthe sizeof the target DNAinbasepairsand A

TROUBLES (ii) ) Generation of duplicatefamilies withouttargeted capture (i) Generate amplifiedtag families byqPCR. To generate tag families for small homogeneous genomes such asmtDNAor Determine the number of attomoles of inputDNArequired for PCRonthe basisof target sizeand lane fraction. Quantify the molar amount of adapter-ligated DNAlibrary from Step35onanAgilent TapeStation 2200orBioanalyzer amplification amplification  

determining the number of PCRcycles. higher-molecular-weight products appearing ( | VOL.9 NO.11VOL.9 sequences. desired inputamount of the sample, butthe totalvolume should be50 The volumes of DNAand waterthatare usedwillneed tobeadjustedonthe basis of the DNAconcentration and the Mix the components inthe tablebelowinaqPCR-compatiblePCRtube, and mix carefully bypipetting upand down. Incubate the mixture ina qPCRthermocycler asfollows: m

2–variable 1 C Total volume KAPA HiFi DNA polymerase (1 U/ DNA (From Step 37) 12.5× SYBR Green 20 20 10 mM dNTP mix 5× KAPA fidelity buffer ddH R CR isthe desired amount of target DNAinattomoles of fragments, ycle eagent µ µ I 2 T M M Primer MWS21 M Primer MWS13 O I H CAL OOT |

2014 STEP I N ● G

T The number of PCRcyclesisacrucial variable; adifference of asingle cycle canresult innonspecific, | 95 95 °C for 4 min IMI

98 98 °C for 20 s natureprotocols Denature N G 1 h µ l) 60 60 °C for 20 s A Volume ( nneal — N Variable Variable isthe number of copies of the target percopyof the nontarget Fig. 1.5 10 50 1 1 1 1 To 6 m b l) ta T ). See‘PCRAmplification’ inthe INTRODUCTION for details on able D l 72 72 °C for 15 s NA Final concentration 2 E xtend for further details.) = — mG NT 0.02 0.02 U/ Variable 0.3 0.3 mM 0.4 0.4 0.4 0.25× N/A 1× µ µ G M M isthe sizeof totalgenome inbasepairs, µ l µ l. See T able 3 for MWS13and MWS21primer Fig. 6

c ).

© 2014 Nature America, Inc. All rights reserved. Let the bead mixture sitfor atleast2min. 45| 44| and then remove and discard the supernatant. 43| PCR products (  by vortexing. 42| remove and discard the supernatant. 41| selects against DNAfragments <200bpinsize. Incubate the mixture for 5minatroom temperature. 40| 39| ( B (iii) Incubate the mixture inaqPCRthermocycler asfollows:

(ii) ) Generation of duplicatefamilies for targeted D CR (i) Remove the microcentrifuge tubefrom the magnetic plateseparator and resuspend the beadsin20 RepeatSteps42and 43.Letthe beadsdry for 5min. Placethe microcentrifuge tubeonto amagnetic separator for atleast2minoruntil allthe beadsare outof solution, Remove the beadsfrom the magnetic plateand resuspend them in200 Placethe PCRsampleonto amagnetic separator for atleast2minoruntil allthe beadsare outof solution, and then Add 1.0volumes of AMPure XPbeadstothe pooled PCRs(typically 50 Transfer and poolallthe PCRsfrom asingle sampleinto asingle 1.7-mlmicrocentrifuge tube. I up and down (see total inputDNA). PCR oneachsamplealiquot provides enough PCRproduct for targeted capture (eachPCRwillcontain one-third of the and details ondetermining the number of PCRcycles. nonspecific, higher-molecular-weight products appearing (   T For eachPCR,mixthe components inthe tablesbelowinaqPCR-compatiblePCRtubeand mixcarefully bypipetting Dilute the DNAfrom Step37toafinal volume of 30 2–Variable 1 C KAPA HiFi DNA polymerase (1 U/ DNA (from Step 38B(i)) 12.5× SYBR Green 20 20 10 mM dNTP mix 5× KAPA fidelity buffer ddH R I CAL

ycle eagent CR CR µ µ T 2 M M primer MWS20 M primer MWS13 able O I I

T T STEP I I S CAL CAL upplementary Fig.6 2 Failure tocompletelyresuspend the beadsduring thisstepcanleadtothe retention of nonspecific ). We havefound thatsplitting the inputDNAinto three equal aliquots and performing aseparate

STEP STEP 95 95 °C for 4 min 98 98 °C for 20 s The number of PCRcyclesisacrucial variable; adifference of asingle cyclecanresult in It isimportant thatthe nonindexed primer MWS20beusedand not MWS21. T Denature able 3 for MWS13and MWS20primer sequences): µ ). l) 60 60 °C for 20 s Volume ( A nneal — 24.5 1.5 10 10 1 1 1 1 m NA l) capture µ 72 72 °C for 15 s l withddH Final concentration E xtend Fig. — 0.02 0.02 U/ Variable 0.3 0.3 mM 0.4 0.4 0.4 0.25× N/A 1× 6 µ µ b 2 M M O (see‘PCRAmplification’ in the INTRODUCTION ). See‘PCRAmplification’ inthe INTRODUCTION for µ l µ µ l of beadsfor every50 l of room-temperature 75%(vol/vol)ethanol natureprotocols µ l of PCR).The beadratio

| VOL.9 NO.11VOL.9 µ l of ddH protocol 2 |

O. 2014

|

2599

© 2014 Nature America, Inc. All rights reserved. Bioanalyzer 2100 and submitfor sequencing. 53|  52| or two cycles before the signal plateaus. The SYBR Green signal typically begins to plateau between cycles 15 and 22. higher-molecular-weight products appearing. We monitor the reaction by the SYBR Green signal, and we remove the tube one  51| maintain sample diversity.  PCR tubeand mixcarefully bypipetting upand down (see 50| in 30 Agilent SureSelectXT protocol withallreagent volumes reduced bythree-quarters. The final capture DNAshould beeluted The lowervolume capture greatly reduces cost,withno adverseeffectsnoted. The 0.25×capture follows the standard we routinely perform the targeted capture at0.25×volume reduction, relative tothose recommended bythe manufacturer. 49| ? Resuspend itin1.7 48| ( ? If targeted capture is being performed, continue to 47| Step 48; otherwise, the sample can be submitted for sequencing at this point.  transfer the 20 46| 2600 ? 2–30 1 C KAPA HiFi DNA polymerase (1 U/ Captured DNA (from Step 49) 12.5× SYBR Green 20 20 10 mM dNTP mix 5× KAPA fidelity buffer ddH R protocol O

TROUBLES TROUBLES TROUBLES

ycle eagent ptional) ptional) PAUSE PAUSE CR CR Quantify the final bead-purified PCR products between 200 and 900 bp using an Agilent TapeStation 2200 or Bioanalyzer 2100. Quantify the final bead-purified PCR products between 200 and 900bpusing an Agilent TapeStation 2200or PurifyeachcompletedPCRby repeating Steps40–44. Incubate the mixture in a qPCRthermocycler asfollows: PCR-amplifythe 30 Perform targeted DNAcapture according tothe Agilent SureSelectXT instructions (version 1.6).Alternatively, Remove 200ng of purifiedPCRproduct from Step47and lyophilizeitcompletelyusing avacuumcentrifuge. Placethe sampleonto the magnetic separator for atleast2minoruntil allthe beadsare outof solution, and then µ µ

2 | M M primer MWS21 M primer MWS13 I I O VOL.9 NO.11VOL.9 µ T T l of ddH I I CAL CAL

PO PO 95 95 °C for 4 min 98 98 °C for 20 s T

H H H I I argeted D STEP STEP Denature NT NT OOT OOT OOT 2 µ Samples can be stored at −20 °C for several weeks. Samples can be stored at −20 °C for several weeks. O. | l of eluent toanew 0.2-mlPCRtube. 2014 The number of PCR cycles is a crucial variable; a difference of a single cycle can result in nonspecific, For DNA prepared using target capture, it is essential to use all the DNA from the targeted capture to I I I N N N µ G G G l of ddH | NA natureprotocols µ l of captured samplebymixing the components in the tablebelowinaqPCR-compatible capture 60 60 °C for 20 s 2 µ O. l) A nneal — ●

T Volume ( IMI N 30 10 1 1 1 1 1 5 G 24 h m 72 72 °C for 15 s l) E xtend — Final concentration T 0.02 0.02 U/ able Variable 0.2 0.2 mM 0.4 0.4 0.4 0.25× N/A 1× µ µ 3 M M for MWS13and MWS21primer sequences): µ l

© 2014 Nature America, Inc. All rights reserved. 61| A detailed flowchart schematic of the program can be found in 60| B 59| 58| and 57| separate files based on their index sequence before tag parsing.  in This command results intwofastq files(one for each read) thatare ready tobealigned. Aflowchartschematic canbe found the following command: 56| B aligners, and their with compatibility the downstream steps is not known.  ( 55| reference genome from the University of California, Santa Cruz(UCSC)Genome Browser canbeused. ftp:/ we usev37of the human reference genome from the 1,000Genomes Project. Thisreference genome canbeobtained at 54| B http://bio-bwa.so ioinformatics ioinformatics processing: making ioinformatic ioinformatic processing: parsing and filtering duplex tags ioinformatic processing: prepare a reference genome for use with

S CR CR Sortthe SSCSreads: Runthe Python program ‘ConsensusMaker.py’ tocollapsePCRduplicates into SSCS: Convertto.bamformat and sortbyposition: Make asingle paired-end .samfile: Align eachread tothe reference genome: Concatenate the 12-nt tagsequences from the paired reads and evaluate for tagquality using ‘tag_to_header.py’ using The reference genome must beindexed before being usedwithBWA. Followthe instructions onthe website First,download and decompress the genome of interest. Forthe human nuclear and mitochondrial genomes, upplementary Figure 1 samtools view-buSSCS.bam |samtoolssort-SSCS.sort read_type dpm--filtosn max 1000 tagfile PE_reads.tagcounts--outfileSSCS.bam --min3-- python ConsensusMaker.py--infilePE_reads.sort.bam -- samtools view-SbuPE_reads.sam|sort-PE_reads.sort read_2.aln read_1.fq.smiread_2.fq.smi>PE_reads.sam bwa sampe–sreference_genome.fastaread_1.aln bwa alnreference_genome.fastaread_2.fq.smi>read_2.aln bwa alnreference_genome.fastaread_1.fq.smi>read_1.aln read_2.fq.smi –-barcode_length12–-spacer_length5 read_2.fq --outfile1read_1.fq.smi--outfile2 python tag_to_header.py--infile1read_1.fq--infile2 /ftp-trace.ncbi.nih. I I T T I I CAL CAL

STEP STEP

--cutoff 0.7--Ncutoff0.3--readlength 84-- If samples are multiplexed within the same sequencing lane, then these samples should be sorted into The bioinformatics steps outlined in this protocol are specific to BWA. We have not used other sequencing urceforge.net/bwa.shtm gov/1000genomes/ftp/ . SSCS s s l ) using the recommended options relevant tothe genome of interest. ●

T technical/reference/ IMI N G variable ● S upplementary upplementary Figure 2

T IMI human_g1k_v37.fasta. N B G W variable A and SA Mtools Mtools natureprotocols . g ● z . Alternatively, the hg19

T IMI N G variable

| VOL.9 NO.11VOL.9 protocol

| 2014

|

2601 © 2014 Nature America, Inc. All rights reserved. An example command is as follows: 70| 69| Picard tools. Anexample command isprovided below: 68| 67| 66| B 65| 64| and 63| A A detailed flowchart schematic of the program can be found in 62| B 2602 protocol ioinformatics ioinformatics processing: prepare files for analysis ioinformatics ioinformatics processing: making D Perform localre-alignment of the reads using GATK. Firstidentify the genome targets for localre-alignment. Index the final sortedDCS.bamfile: Add readgroups fieldto the header of the final DCS.bamfilewith Picard toallow forcompatibilitywith the GATKusing Filteroutunmapped reads from the final DCS.bamfile: Index the final sorted DCS .bamfile: Convertto.bamformat and sortbyposition: Make apaired-end .samfile for the DCS data: Align eachDCS.fastq tothe reference genome: Construct DCSsfrom SSCSsusing ‘DuplexMaker.py’.

DCS_PE.filt.readgroups.intervals DCS_PE.filt.readgroups.bam -o RealignerTargetCreator -Rreference_genome.fasta-I java -Xmx2g-jarGenomeAnalysisTK.jar -T samtools indexDCS_PE.filt.readgroups.bam RGLB=UW RGPL=IlluminaRGPU=ATATATRGSM=default INPUT=DCS.filt.bam OUTPUT=DCS_PE.filt.readgroups.bam java -jar–Xmx2gAddOrReplaceReadGroups.jar DCS_PE.filt.bam samtools view-F4-bDCS_PE.aln.sort.bam> samtools indexDCS_PE.aln.sort.bam DCS_PE.aln.sort samtools view-SbuDCS_PE.aln.sam|sort– DCS_data_read_2.fq >DCS_PE.aln.sam DCS_data_read_2.aln DCS_data_read_1.fq bwa sampe–sreference_genome.fastaDCS_data_read_1.aln DCS_data_read_2.aln bwa alnreference_genome.fastaDCS_data_read_2.fq> DCS_data_read_1.aln bwa alnreference_genome.fastaDCS_data_read_1.fq> DCS_data --Ncutoff0.3--read_length84 python DuplexMaker.py--infileSSCS.sort.bam--outfile | VOL.9 NO.11VOL.9 | 2014 | natureprotocols CS s s ●

T IMI N G variable ●

T IMI S N upplementary upplementary Figure 3 G variable .

© 2014 Nature America, Inc. All rights reserved. reference genome: chromosome, position, wild-typebaseidentity, depth, totalnumber of mutations observed, number 75| total mutation frequency. file liststhe number and frequency of mutations withWilsonconfidence intervals foreachtype of mutation, aswell required toscore aposition, and -Cindicates thatonlymutations present ataclonality of <1%are counted. The output only counting eachtypeof mutation ateachposition once. The argument -dspecifiesthataminimum depth of 100is 74| 73| ? on the tagstats’ file created in Step 60. Plot the family size (column 1) on the 72| analysis protocol thatwecurrently use.  B enzymatic stepsusedduring sequencing library preparation.  read isprovided: 71| so thisstepishighly recommended. Inaddition, localre-alignment must bedone afterthe lastalignment step.  This command is followed by the actual local re-alignment. An example command is as follows: nonreference insertions, and number of nonreference deletions. of nonreference Ts, number of nonreference Cs, number of nonreference Gs, number of nonreference As, number of

ioinformatics ioinformatics analysis: quality metrics and basic mutation analysis TROUBLES This program outputsatab-delimited filecontaining the following information for each genomic position inthe The optional argument -uinvokes uniquemutation counting; ifitisleftout,allmutations are counted, instead of

CR CR CR Locatethe genomic position of eachmutation: Count the number of uniquemutations present inthe final DCSsequences and calculatetheir frequencies: Make apileupfilefrom the final DCS readsusing the following example command: Perform end-trimming of DCSreads. Anexample command thattrims fivebasesfrom boththe 3

Plot Plot the tag family size distribution (see final.bam.pileup.mutpos -d100–C.01 python mut-position.py -iDCS-final.bam.pileup-oDCS- 0.01 -u>DCS-final.bam.pileup.countmuts cat DCS-final.bam.pileup|pythonCountMuts.py -d100-c0-C final.bam.pileup reference_genome.fasta DCS-final.bam>DCS- samtools mpileup-B–A-d500000-f 5,80-84" --clipRepresentationSOFTCLIP_BASES final.bam -Rreference_genome.fasta--cyclesToTrim"1- DCS_PE_reads.filt.readgroups.realign.bam -oDCS- java -Xmx2g-jarGenomeAnalysisTK.jar-TClipReads-I DCS_PE.filt.readgroups.realign.bam -targetIntervals DCS_PE.filt.readgroups.intervals -o - java -Xmx2g-jarGenomeAnalysisTK.jar-TIndelRealigner I I I R reference_genome.fasta-IDCS_PE.filt.readgroups.bam T T T y I I I axis. CAL CAL CAL The processed DCS.bamfilescanbeusedinany downstream analysis thatis desired. We present the basic data

H STEP STEP OOT Failure toclipthe ends of reads increases the occurrence of false positivesthatoccurduring the Alignment increases the occurrence of false mutations thatoccuratthe 3 I N G Fig. Fig. 4 as an example). The information for this plot is found in the ‘PE_reads. x axis and the fraction of total reads (column 3) ●

T IMI N G variable natureprotocols ′ and 5 ′ ends of the reads, ′ and 5

| VOL.9 NO.11VOL.9 ′ ends of each protocol |

2014

|

2603 © 2014 Nature America, Inc. All rights reserved. T Troubleshooting advice canbefound in ? 2604 able able 47, 47, 53 28, 36 Step 12 B and 6 steps 3 B S 72 48 protocol

TROUBLES tep ox ox

1 1 | VOL.9 NO.11VOL.9 , , 4 4 |

Troubleshooting table. No DNA is detectable with AMPure XP beads are present after cleaning nonspecific PCR products Dimerized adapter or during quantification No DNA is detectable of synthesized (bands in lane 4 Adapters are not properly addition of ethanol Sample becomes cloudy upon P size size is too large obtained and the peak family Not enough depth is size is too small obtained and the peak family Not enough depth is perform targeted capture Not enough DNA is present to products are present large-molecular-weight during quantitation or roblem H Fig. Fig. 3 OOT | 2014 I a N do not match) G | natureprotocols

to to too many PCR cycles DNA concatemerization owing efficiently removed Adapter dimers are not library preparation Not enough DNA 1s used during during the washes is too low Ethanol concentration used still present in lane 4) precipitation ( present after final ethanol Restriction fragment is still ( Adapters are incompletely cut ( Adapters are not fully extended buffer salts Precipitation of enzymes and P during PCR amplification Not enough DNA is produced during the washes is too low Ethanol concentration used initial PCR Too little DNA is used during the fraction was used initial PCR or too small a lane Too much DNA is used during the T Fig. Fig. 2 Fig. 3 ossible reason able a a 4 , , lane 3 does not match) , lane 2 does not match) . Fig. Fig. 3 a , , band f

Remake the 75% (vol/vol) ethanol and start over outlined in step 5 of Perform ethanol precipitation again using the protocol start the synthesis over again overnight at 37 °C. If necessary, replace enzyme and Add an additional 2 perform the extension reaction again Replace reagents, re-anneal oligonucleotides and preparation. Continue with the protocol Precipitate does not affect downstream steps or library S DNA DNA is needed in the PCR obtained peak family size is 24, 24/6 = 4, 4× more amount in the PCR proportionally. For example, if the is the optimal peak family size. Then increase the DNA not the depth. Divide your peak family size by six, which Further resequencing will only increase the family size, of molecules was too low, the depth will be too low. DNA molecules used as input in the PCR. If the number DCS reads, which is directly proportional to the number of Depth is directly proportional to the number of fraction needed to obtain a peak family size of six See pool the raw sequencing reads prior to data analysis. Sequence the sample again (do not re-do the PCR) and the DNA (Steps 38B–47) obtain additional DNA, pool all the reactions, and purify Perform additional PCRs under the same conditions to Remake the 75% (vol/vol) ethanol and repeat PCR at the determined cycle outlined in the PROCEDURE and remove the samples (see to confirm that the expected product is present the unpurified PCR on a TapeStation or Bioanalyzer Run a titration of the PCR cycle number. Quantify bead clean-up (Steps 40–44; Bring the volume up to 50 of input DNA Repeat the library preparation with a larger amount olution S Fig. Fig. 6 upplementary upplementary a for an example). Rerun the PCR steps as µ T B able able 2 l l of HpyCH4III and incubate ox 1 µ to determine the lane l l with ddH S upplementary upplementary Fig. 6b 2 O O and repeat the

)

© 2014 Nature America, Inc. All rights reserved. 8. 7. 6. 5. 4. 3. 2. 1. com/reprints/index.htm at online available is information permissions and Reprints interests. CO protocol. S.R.K., M.W.S., J.J.S., B.F.K. and L.A.L. wrote the paper. S.R.K., M.W.S., E.J.F., M.J.P., E.H.A., J.-C.S., K.J.K. and R.-A.R. optimized the original protocol. S.R.K., M.W.S. and B.F.K. developed the data analysis software. AUT Approaches to Aging Training grant (NIA T32-AG000057). NCI R01-CA102029 to L.A.L. S.R.K. was further supported by the Genetic Institute (NCI): NIA P01-AG001751, NCI P01-CA77852, NCI R01-CA160674 and grants from the National Institute on Aging (NIA) and the National Cancer for this research was provided by the following US National Institutes of Health protocol. We also thank A. Herr for helpful comments and discussion. Support protocol as written and for helping to clarify several important steps of the A online of version the pape Note: Any and Information Source Data Supplementary files are available in the events, and itresults inafurther 90–99 and processing steps high frequency of G PCR errors thathavebeenpropagated toalltagfamily members. The mutational spectrumof the SSCStypically shows a positives, which account for >99%of allartifacts. However, this stepisunable toremove artifacts arising from first-round conventional NGSmethods. We havefound that SSCSformation iscapableof removing almost allsequencer-derived false results). We consistently observea>50,000-fold reduction inthe apparent mutation frequency inthese samples, relative to brain tissue We haveapplied DStoanumber of biological systems, including the M13mp2phagemid ANT B B Steps 54–75, bioinformatic processing: variable and dependent on computational resources Steps 48–53 (optional), targeted DNA capture: 24 h Steps 36–47, PCR amplification: 1 h Steps 29–35, ligation of DS adapters to sample DNA: 1 h Steps 22–28, 3 Steps 15–21, end repair of sonicated DNA and size selection: 1 h Steps 1–14, sonication of DNA: 1 h ● ckno ox ox

M T H PET (2014). glioma.recurrent of evolution therapy-driven (2013). cancer.colorectal in responsechemotherapy genes.cancer-associated new for myeloma. multiple in profiles Nature (2012). sequencing.multiregion by revealed change.malignant of basis Resour. Ecol. Mol. Johnson, B.E. Johnson, A. Kreso, M.S. Lawrence, N. Bolli, N. Navin, Gerlinger,M. a as replication DNA in Errors N. Battula, & C.F.Springgate, L.A., Loeb, sequencers. next-generationDNA to guide FieldT.C. Glenn, IMI I OR 2 1 C w , , optional quality control steps for adapter synthesis using radiolabeled adapters and PAGE: 1 d , synthesis of DS adapters: 2 d I I

N le CONTR PATE N G G FI

dgm G 472 et al. et NANC et al. et et al. et 3 D I ents , 90–94 (2011). 90–94 , 3 BUT and human nuclear DNAfrom bothfrozen and formalin-fixed samples(M.J.P., E.J.F. and L.A.L.;unpublished et al. et RESULTS Heterogeneity of genomic evolution and mutational and evolution genomic ofHeterogeneity et al. et ′ Variable clonal repopulation dynamics influence dynamicsrepopulation clonal Variable Tumour evolution inferred by single-cell sequencing. single-cell by inferred evolution Tumour I et al. et dA-tailing of end-repaired DNA: 1 h I

AL We thank D. Crispin and A. Lawrence for testing the ONS

11 Intratumor heterogeneity and branched evolution branched andheterogeneity Intratumor l r I . Mutational analysis reveals the origin and origin the reveals analysis Mutational . NTERESTS , 759–769 (2011). 759–769 , Mutational heterogeneity in cancer and the search the and cancer in heterogeneityMutational →

S.R.K., S.R.K., M.W.S., J.J.S. and E.J.F. developed the 1 9 T mutations, which isthe result of fixation of oxidative damage occurring during the DNApurification . Formation of the DCSreads consistently removes these artifacts, aswellthose from other damage Cancer Res. Cancer Nat. Commun. Nat.

The authors declare no competing financial Nature N. Engl. J. Med. J. Engl. N.

34

499 , 2311–2321 (1974). 2311–2321 ,

5 Science , 214–218 (2013). 214–218 , Science , 2997 (2014). 2997 , + % reduction inmutational artifacts.

339

366 343 http://www.nature. , 543–548 , , 883–892 , , 189–193 ,

24. 23. 22. 21. 20. 19. 18. 17. 16. 15. 14. 13. 12. 11. 10. 9.

molecular identifiers.molecular (2011). ID. primer a using gene protease HIV-1 the of sequencing deep and sampling Accurate (2007). products with batch-stamps and barcodes. and batch-stamps with products Methods Nat. reads. sequence short derived locally of assembly tag-directed Parallel, sequencing. parallel massively with mutations rare of quantification and Detection sequencing. Res. Acids Nucleic carcinogens. of fingerprint mutational the detecting for method sequencing. deep arbitrary perspectives. and requirements,strategies DNA. ancient in deamination cytosine by induced artifacts reveal amplifications multiple from sequences (2009). 342–350 directions.biology molecular future analyses: (2010). 1154–1159 evolution. bacterial during rates mutation (2008). 987–993 in evolution and variationgenome studies.genetic facilitates strains (2005). of family gene cancer.in heterogeneity Kivioja, T.Kivioja, R. Swanstrom, & J.A. Anderson, J., Roach, C.D., Jones,C.B., Jabara, PCR Encoding C.D. Laird, & R.S. Hansen, Stoger,R., McCloskey,M.L., Shendure,J. & C. Lee, R.P.,Turner,Patwardhan,E.H., J.B., Hiatt, B.Vogelstein, K.W.& Kinzler, N., Papadopoulos, J., Wu, I., Kinde, M.W.Schmitt, A. Besaratinia, C.A. Carlson, DNA: ancient of Next-generationsequencingHofreiter, M. & M. Knapp, DNA S. Paabo, & Haeseler,A. von D., Jaenicke,V.,Serre,Hofreiter, M., DNA forensicfrom evidenceExtracting A. Daal, van & B.Budowle, Loh, E., Salk, J.J. & Loeb, L.A. Optimization of DNA polymerase DNA of Optimization L.A. Loeb, & J.J. Salk, E., Loh, K.E. Holt, A. Srivatsan, Z. Liu, genetic of Implications L.A. Loeb, & M.J. M.W.,Prindle, Schmitt, et al. et et al. et et al. et Proc. Natl. Acad. Sci. USA Sci. Acad. Natl. Proc. USA Sci. Acad. Natl. Proc Patterns of diversifying selection in the phytotoxin-like scr74 phytotoxin-like the in selection diversifying of Patterns

et al. et et al. et 7 et al. et , 119–122 (2010). 119–122 , Phytophthora infestans Phytophthora et al. et High-throughput sequencing provides insights into insightsprovides sequencing High-throughput Counting absolute numbers of molecules using unique using molecules of numbers absolute Counting

40 Decoding cell lineage from acquired mutations using mutations acquired from lineage cell Decoding High-precision, whole-genome sequencing of laboratory of sequencingwhole-genome High-precision, Proc. Natl. Acad. Sci. USA Sci. Acad. Natl. Proc. Detection of ultra-rare mutations by next-generationby mutationsultra-rare of Detection A high-throughput next-generationsequencing-basedhigh-throughput A , e116 (2012). e116 , Nat. Methods Nat. natureprotocols Ann. NY Acad. Sci. Acad. NY Ann. 1 9 Nat. Methods Nat. , human mtDNAfrom frozen Nucleic Acids Res. Acids Nucleic

PLoS Genet. PLoS 9 Salmonella typhi Salmonella , 72–74 (2012). 72–74 ,

.

109 108 Mol. Biol. Evol. Biol. Mol.

9 , 14508–14513 (2012). 14508–14513 , , 78–80 (2012). 78–80 , Genes Proc. Natl. Acad. Sci. USA Sci. Acad. Natl. Proc. , 9530–9535 (2011). 9530–9535 , Biochem. Genet. Biochem.

1267

| VOL.9 NO.11VOL.9 108

Biotechniques 4

29

, e1000139 (2008). e1000139 , 1 , 110–116 (2012). 110–116 , , 20166–20171 , , 227–243 (2010). 227–243 , , 4793–4799 (2001). 4793–4799 , . Nat. Genet. Nat.

protocol 22 , 659–672 ,

45 | 46 2014 , 761–767 , , 339–340, ,

40

| ,

107 2605

, © 2014 Nature America, Inc. All rights reserved. 32. 31. 30. 29. 28. 27. 26. 25. 2606 protocol

orders of magnitude using circle sequencing. circle using magnitude of orders (2014). 686–690 sequencing. population through revealed virus RNA an of sequencing.genome nucleus sperm. and activity recombination of analysis Science cell. human single a of variations copy-number single-nucleotideand single-moleculebarcodes. minimizessequence-dependent amplificationandbias optimized with noise sequencing. next-generationto application with molecules template PCR counting 6 DNA. UV-damaged of amplification STR multiplex to mix repair PreCR the of applicationforensic for protocol optimized 110 Diegoli, T.M., Farr, M., Cromartie, C., Coble, M.D. & Bille, T.W.Bille,An & M.D. Coble, Cromartie,C., M., Farr,T.M.,Diegoli, D.I. Lou, landscapes fitness and Mutational R. Andino, & Brodsky,L. A., Acevedo, Y.Wang, single-cellGenome-wide S.R. Quake, & B.Behr, H.C., Fan, J., Wang, of detectionGenome-wide X.S. Xie, & A.R. Chapman, S., Lu, C., Zong, Shiroguchi,T.Z.,Jia,K.,Sims, P.A. Xie,Digitalsequencing X.S.RNA& for method C.P.A Lichtenstein, & Brenner,S. R.J., Osborne, J.A., Casbon,

, 498–503 (2012). 498–503 , | VOL.9 NO.11VOL.9 , 19872–19877 (2013). 19872–19877 ,

Cell 338 et al. et et al. et

150 , 1622–1626 (2012). 1622–1626 , Nucleic Acids Res. Acids Nucleic Clonal evolution in breast cancer revealed by single- by revealed cancer breast in evolution Clonal High-throughput DNA sequencing errors are reduced by reduced are errors sequencing DNA High-throughput , 402–412 (2012). 402–412 , | 2014 | Proc.USAAcad.Natl.Sci. natureprotocols Nature

39 , e81 (2011). e81 ,

512 de novo de , 155–160 (2014). 155–160 , Proc. Natl. Acad. Sci. USA Sci. Acad. Natl. Proc. mutation rates in human in rates mutation Forensic Sci. Int. Genet. Int. Sci. Forensic

109 , 1347–1352(2012)., Nature

505 ,

35. 34. 33. 40. 39. 38. 37. 36.

tissues, input amount and tumor heterogeneity. heterogeneity. tumor and amount tumor input (FFPE) tissues, embedded fixed-paraffin formaldehyde settings: cancer Res. Acids Nucleic specimens. cancer breast formalin-fixed of sequence genome whole (2013). e1003794 damage.oxidative with inconsistent are that mutations mitochondrialsomatic in increase age-related an reveals sequencing (2011). data. sequencing next-generationDNA using (2010). 1297–1303 data. sequencing next-generationDNA analyzing for Bioinformatics transform. Wheeler Diagn. Mol. J. specimens. tissue fresh-frozen and formalin-fixed from data sequence (2011). 68 Kerick, M. M. Kerick, S.E. Yost, Ultra-sensitive L.A. Loeb, M.W.& Schmitt, J.J., Salk, Kennedy,S.R., DePristo, M.A. DePristo, A. McKenna, H. Li, Burrows– with alignmentshort-read accurate and Fast R. Durbin, & H. Li, Spencer,D.H. et al. et et al. et et al. et The Sequence Alignment/Map format and SAMtools. and formatAlignment/Map Sequence The et al. et

et al. et 25 15 et al. et Identification of high-confidence somatic mutations in mutations somatic high-confidence Identification of Targeted high throughput sequencing in clinical clinical in sequencing throughput high Targeted

, 2078–2079 (2009). 2078–2079 , , 623–633 (2013). 623–633 , 40 The genome analysis toolkit: a MapReduce frameworkMapReduce a toolkit: analysis genome The Comparison of clinical targeted next-generationtargeted clinical of Comparison Bioinformatics A framework for variation discovery and genotyping and discovery variation for framework A , e107 (2012). e107 ,

25 , 1754–1760 (2009). 1754–1760 , Nat. Genet. Nat. BMC Med. Med. BMC Genome Res. Genome

PLoS Genet. PLoS 43 , 491–498 ,

20

, 9 ,

4 , CORRIGENDA Corrigendum: Detecting ultralow-frequency mutations by Duplex Sequencing Scott R Kennedy, Michael W Schmitt, Edward J Fox, Brendan F Kohrn, Jesse J Salk, Eun Hyun Ahn, Marc J Prindle, Kawai J Kuong, Jiang-Cheng Shen, Rosa-Ana Risques & Lawrence A Loeb Nat. Protoc. 9, 2586–2606 (2014); doi:10.1038/nprot.2014.170; corrected online 22 October 2014 In the version of this article initially published online, the sequence for the MWS21 oligonucleotide was incorrectly listed in Table 3 as 5′-CCAGCAGAAGACGGCATACGAGATXXXXXXGTGACTGGAGTTCAGACGTGTGC-3′. The second nucleotide in the sequence should be A, not C, and the correct sequence is 5′-CAAGCAGAAGACGGCATACGAGATXXXXXXGTGACTGGAGTTCAGACGTGTGC-3′. The error has been corrected in the PDF and HTML versions of this article.

nature protocols