-converted duplexes for the strand-specific detection and quantification of rare

Austin K. Mattox^a, Yuxuan Wang^a, Simeon Springer^a, Joshua D. Cohen^a, Srinivasan Yegnasubramanian^b, William G. Nelson^b, Kenneth W. Kinzler^a, Bert Vogelstein^a,c,1, and Nickolas Papadopoulos^a,b,d,1

^aLudwig Center, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD 21128; ^bDepartment of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD 21128; ^cThe Howard Hughes Medical Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD 21128; and ^dDepartment of Pathology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD 21128

Contributed by Bert Vogelstein, March 13, 2017 (sent for review January 30, 2017; reviewed by Carlos Caldas and Jean-Pierre Issa)

The identification of mutations that are present at low frequencies in clinical samples is an essential component of precision medicine. The development of molecular barcoding for next-generation sequencing has greatly enhanced the sensitivity of detecting such mutations by massively parallel sequencing. However, further improvements in specificity would be useful for a variety of applications. We herein describe a technology (BiSeqS) that can increase the specificity of sequencing by at least two orders of magnitude over and above that achieved with molecular barcoding and can be applied to any massively parallel sequencing instrument. BiSeqS employs bisulfite treatment to distinguish the two strands of molecularly barcoded DNA; its specificity arises from the requirement for the same mutation to be identified in both strands. Because no library preparation is required, the technology permits very efficient use of the template DNA as well as sequence reads, which are nearly all confined to the amplicons of interest. Such efficiency is critical for clinical samples, such as plasma, in which only tiny amounts of DNA are often available. We show here that BiSeqS can be applied to evaluate transversions, as well as small insertions or deletions, and can reliably detect one mutation among >10,000 wild-type molecules.

next-generation sequencing | bisulfite | strand-specificity | polymerase chain reaction | mutation

Significance
The detection of rare mutations in clinical samples is essential to the screening, diagnosis, and treatment of cancer. Although next-generation sequencing has greatly enhanced the sensitivity of detecting mutations, the relatively high error rate of these platforms limits their overall clinical utility. The elimination of sequencing artifacts could facilitate the detection of early-stage cancers and provide improved treatment recommendations tailored to the genetic profile of a tumor. Here, we report the development of BiSeqS, a bisulfite conversion-based sequencing approach that allows for the strand-specific detection and quantification of rare mutations. We demonstrate that BiSeqS eliminates nearly all sequencing artifacts in three common types of mutations and thereby considerably increases the signal-to-noise ratio for diagnostic analyses.

Author contributions: A.K.M., S.Y., W.G.N., K.W.K., B.V., and N.P. designed research; A.K.M. and B.V. performed research; A.K.M., K.W.K., and N.P. contributed new reagents/analytic tools; A.K.M., Y.W., S.S., J.D.C., K.W.K., B.V., and N.P. analyzed data; and A.K.M., B.V., and N.P. wrote the paper.

Reviewers: C.C., University of Cambridge; and J.I., Fels Institute for Cancer and Molecular Biology, Temple University.

Conflict of interest statement: N.P., K.W.K., and B.V. have no conflicts of interest with respect to the new technology described in this manuscript, as defined by the Johns Hopkins University policy on conflict of interest. N.P., K.W.K., and B.V. are founders of Personal Genome Diagnostics, Inc. and PapGene, Inc. K.W.K. and B.V. are members of the Scientific Advisory Board of Sysmex-Inostics. B.V. is also a member of the Scientific Advisory Boards of Morphotek and Exelixis GP. These companies and others have licensed technologies from Johns Hopkins University; N.P., K.W.K., and B.V. are the inventors of some of these technologies and receive equity or royalties from their licenses. The terms of these arrangements are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.

Data deposition: The sequences reported in this paper have been deposited in the European Genome-Phenome Archive (EGA) database (accession no. EGAS00001002406; https://www.ebi.ac.uk/ega/home).

1To whom correspondence may be addressed. Email: [email protected] or npapado1@jhmi.edu.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1701382114/-/DCSupplemental.

Extensive knowledge of the genetic alterations that underlie cancer is now available, opening new opportunities for the management of patients (1–3). Some of the most important of these opportunities involve “liquid biopsies,” that is, the evaluation of blood and other bodily fluids for mutant DNA template molecules that are released from tumor cells into such fluids. Although the potential value of liquid biopsies was recognized more than two decades ago (4–6), more recent advances in sequencing technology have made this approach practical. For example, it has recently been shown that liquid biopsies of blood can detect minimal amounts of disease in patients with early-stage colorectal cancers, thereby providing evidence that could substantially affect their survival (7). Other studies have shown that circulating tumor DNA (ctDNA) can be detected in the blood of patients with other malignancies, as well as in other bodily fluids such as pancreatic cysts, Pap smears, and saliva (8–16).

The vast majority of current technologies for detecting rare mutations use digital approaches, where each template molecule is assessed, one by one, to determine whether it is wild type or mutant (17). The digitalization can be performed in wells (17), in tiny droplets formed by emulsification or microfluidics (18, 19), or in clusters (20). The most comprehensive of these approaches employs massively parallel sequencing to simultaneously analyze the entire sequences of hundreds of millions of individually amplified template molecules (21). However, all of the currently available sequencing instruments have relatively high error rates, limiting sensitivity at many positions to one mutant among 100 wild-type (WT) template molecules, even with DNA templates that are of optimal quality (21). The DNA quality of clinical samples is often far less than optimal, compounding the problem. Sensitivity can be increased by pretreating the DNA to remove damaged bases before sequencing (22, 23) and by bioinformatics and statistical methods to enhance base calls after sequencing (24, 25). Although useful for a variety of purposes, the sensitivity obtainable with these improvements is generally not sufficiently high for the most challenging applications, such as liquid biopsies, which can require detection of one mutant molecule among thousands of WT molecules (9).

Another important way to improve sensitivity is with the use of “molecular barcodes,” in which each template is covalently linked to unique identifying sequences (UIDs). Molecular barcodes were originally used to count individual template molecules (26) but were subsequently incorporated into a powerful approach, termed SafeSeqS, for error reduction (27). After incorporation of the UIDs, subsequent amplification steps produce multiple copies

www.pnas.org/cgi/doi/10.1073/pnas.1701382114 PNAS | May 2, 2017 | vol. 114 | no. 18 | 4733–4738

of each UID-linked template. Each of the daughter molecules produced by amplification contains the same UID, forming a UID family. For a mutation to be considered bona fide (termed a supermutant), every member of the UID family must have the identical sequence at each queried position (27).

There are two general ways to assign molecular barcodes to template DNA molecules. One uses a set of locus-specific primers to PCR-amplify genomic loci, while the other uses adapters ligated before amplification to create a whole genome library. The PCR method uses primers containing a stretch of random (N) bases to distinguish each individual template molecule (exogenous barcodes) (27, 28). The advantage of this approach is that it is applicable to very small amounts of DNA, and virtually the only sequences amplified are the desired ones, reducing the amount of sequencing needed to evaluate a specific mutation. The disadvantage is that errors introduced into one strand during the UID-incorporation cycles will create supermutants. This method will therefore still eliminate errors made during sequencing but not errors made during the initial cycles of PCR. The ligation method either employs random sequences in the adapters used for ligation (27–29) or uses the ends of the randomly sheared template DNA to which the adapters are ligated as “endogenous UIDs” (27, 30). Although errors are still introduced during the PCR steps with the ligation approach, its advantage is that both strands can be identified from the sequencing data (duplex sequencing). The probability that the identical, complementary mutation is introduced into both strands is low (the square of the probability of the mutation appearing in only one strand). The disadvantage of this approach is that it requires library preparation and capture of the sequences to be queried, neither of which is highly efficient.

We here describe an approach that incorporates advantages of both the PCR- and ligation-based approaches described above. This approach takes advantage of the fact that bisulfite treatment can efficiently convert dC bases in DNA to U bases. This conversion makes the two strands of DNA distinguishable and was previously used to distinguish RNA transcripts copied from each of the two possible template strands of DNA (31). Bisulfite conversion has also been extensively used to distinguish methylated C-residues, which do not get converted to T bases, from unmethylated C bases, thereby illuminating epigenetic changes (32). It has also been shown that dC bases can be partially converted to T bases so that each individual template DNA molecule can be distinguished from others by its unique pattern of C to T changes, thereby creating an intrinsic barcode similar to what can be achieved with externally added UIDs (33). In this work, we show that DNA in which all C bases have been fully converted to T bases can be used as PCR templates with specially designed primers linked to exogenous barcodes. This allows individual mutations to be assessed on both strands (duplex sequencing) in a reliable manner, without creation of libraries and with a relatively small number of sequencing reads.

Results
BiSeqS Workflow. The principal feature of BiSeqS is the simultaneous detection of a mutation on both the plus and minus strands of DNA templates that were bisulfite treated and molecularly barcoded. We refer to the reference sequence as defined by UCSC as the plus (+) strand and its reverse complement as the minus (–) strand. Three simple experimental steps (bisulfite conversion, molecular barcoding, and sample barcoding) are required before a specialized bioinformatics analysis of the sequencing data, as described below (Fig. 1 and Fig. S1 A and B).

Fig. 1. Overview of BiSeqS methodology. Bisulfite conversion creates C > T transitions at unique positions in each strand. Amplification of the (+) and (–) strands with strand-specific primers allows for targeted amplification and addition of molecular barcodes. Analysis of both strands allows for PCR errors generated in the first PCR cycle to be drastically reduced, as it is highly unlikely a complementary mutation will be generated at the same genomic position on both strands. The conversion and amplification of the wild-type sequence is presented in A, and the conversion and amplification of an A > C transversion is presented in B.

Step i: Bisulfite conversion. Incubation of DNA with sodium bisulfite at elevated temperatures and low pH deaminates cytosine to form 5,6-dihydrocytosine-6-sulfonate (34). Subsequent hydrolytic deamination at high pH removes the sulfonate, resulting in uracil (35). Many modifications of this basic reaction have been described and used largely to differentiate between cytosine and 5-methylcytosine (5-mC), the latter of which is not susceptible to bisulfite conversion.

In addition to converting C to U, bisulfite treatment denatures DNA and can degrade it. Although this degradation is not limiting for standard applications of bisulfite treatment, it is critical for applications involving mutation detection in clinical samples that are already degraded before conversion (36–38). In the current study, we evaluated many ways to convert DNA and purify the converted strands. The best results were obtained with the reagents, conditions, and incubation times described in Materials and Methods. As shown in Fig. S2, treatment under these conditions did not inhibit the amplification of PCR products up to 285 bp in size. Sequencing of these products revealed that, on average, >99.8% of the C bases were converted to T bases on both strands (excluding C bases at 5′-CpG sites, which can be resistant to bisulfite conversion because they are either methylated or hydroxymethylated).

Step ii: Molecular barcoding. The goal of bisulfite treatment is to create a code for distinguishing the two strands of DNA. This doubles the number of templates that need to be molecularly barcoded, requiring specialized steps compared with those used for standard DNA amplification. First, four primers must be designed to amplify each region of interest, two primers for each strand. Second, the primers must be complementary to the converted form of the DNA, accentuating the importance of full conversion; otherwise, some template molecules will not be amplified because they will not be perfectly complementary to the primers. Third, bisulfite treatment under the conditions we used converts virtually all nonmodified C residues to T, lowering the melting temperature of both the primer annealing sites and the amplicon in general. Because both strands must be amplified equivalently and in the same reaction, the primers must be chosen so that the same PCR cycling conditions can be used for amplifying both strands in a highly specific manner. For regions in which there is already a low C:G base pair content, the primers have to be long enough to allow specific amplification under relatively high-temperature annealing conditions. This proved difficult without yielding large amounts of primer dimers, and to overcome these challenges, several primer designs were evaluated. Eventually, variations in primer length, position, composition, and C:G content allowed for

specific and robust amplification of both strands of every target region attempted.

Another issue confronting amplification of bisulfite-converted DNA is that many polymerases will not efficiently copy DNA that contains uracil bases. We tested seven commercially available polymerases and various reaction conditions to optimize efficiency of template use and uniformity of amplification of both strands when four primers were used (Dataset S1). Although a combination of AMPIGene Hot Start Taq Polymerase and iTAQ Polymerase amplified the greatest number of template molecules, their lack of 3′ → 5′ exonuclease activity proved limiting for specificity in that the number of errors during PCR was unacceptably high. Ultimately, we chose Phusion U Hot Start Polymerase, a polymerase that exhibits 3′ → 5′ exonuclease activity, as the enzyme to amplify uracil-containing templates with the highest specificity while maintaining sensitivity.

Step iii: Sample barcoding. Part of the power of massively parallel sequencing instruments is that they can be used to analyze many samples at once. To enable this capacity for BiSeqS, we incorporated a sample barcode PCR step following the purification of the molecularly barcoded PCR products (Fig. S1, step iii). Moreover, the converted sample DNA was divided into two to six wells of the PCR plate before the molecular barcoding step. Each well was then assigned a different sample barcode. This distribution served two purposes. First, with concentrated DNA templates, it could provide independent replication of mutations with small mutant allele fractions. Second, with dilute DNA templates, as are often present in clinical samples such as plasma (9), urine (39), and CSF (12), it provides the opportunity to test more template molecules, increasing the chance of identifying mutant templates.

BiSeqS Data Processing Pipeline. High-quality base calls were aligned to the bisulfite-converted reference sequence, and the aligned data were organized into tables for each sample, where each observed mutation in each strand of each well was listed in a separate row. The columns in this table included the number of reads, UIDs, and supermutants for each mutation (see Datasets S2 and S3 for examples). Supermutants were defined as mutations in a UID family in which >90% of the family members contained that mutation. For example, if all three members of a UID family contained the same mutation, it was considered a supermutant. The supermutant allele fraction was defined as the number of supermutants divided by the number of UIDs in an individual well.

Individual mutations in the plus and minus strands were compared to determine whether the identical supermutant was found in both strands. If the mutation was found in both strands, the supermutant allele fractions in each strand were compared. The supermutant allele fractions on each strand provide an additional level of specificity because these fractions are expected to be similar if a mutant base pair existed in the template DNA before conversion and amplification. Given that mutations arising during PCR are relatively rare, it would be even rarer for the same mutation to arise at the identical position in both strands. This is especially true after conversion, when the two strands contain markedly different nucleotide contexts. If the supermutant allele fractions in each strand differed by <10-fold, then the mutation was considered to be a superduper mutant (SDM). The SDM allelic fraction was defined as the number of SDMs divided by the number of UIDs in the strand that contained the fewest UIDs. For example, if the number of SDMs was 10, and the numbers of UIDs in the plus and minus strands were 10,000 and 20,000, respectively, then the SDM allelic fraction would be 0.1% (i.e., 10 of 10,000).

Special features of the analysis of mutations in converted DNA include the following. A transition from C > T noted in the sequencing could have resulted from a single base substitution mutation that changed a C:G base pair to a T:A base pair or from bisulfite conversion of a C to a T on one strand. In light of this ambiguity, C to T mutations cannot be considered supermutants in the strand containing the C, although a supermutant would still be evident at that position in the strand containing the G. There are a total of six possible single base substitutions in duplex DNA: a C:G base pair can be mutated to an A:T, G:C, or T:A base pair, and an A:T base pair can be mutated to a C:G, G:C, or T:A base pair. Of these six single base pair substitutions, all result in supermutants on at least one strand and four result in supermutants on both strands (i.e., SDMs). In addition, transitions that create a CpG dinucleotide in which the C is methylated can be assessed on both strands. All insertions or deletions within the amplified sequences can form SDMs. Methylation also introduces complexity, as methylated or hydroxymethylated C bases are not converted to U bases by bisulfite treatment. The BiSeqS pipeline takes this into account when it analyzes the data by not assuming that any particular C is methylated or unmethylated (or that every unmethylated C is converted to T by bisulfite treatment). Instead, it considers the possible effects of conversion and methylation and only labels a mutation as a supermutant or SDM if there is no ambiguity. A list of all possible single base substitutions on either strand, within a triplet context and with the mutated base in the middle, is provided in Dataset S4. For each single base substitution, the capacity of BiSeqS to identify SDMs is also provided in this table. In general terms, all transversions, all insertions and deletions, and a small subset of transitions can be unambiguously scored as SDMs (Dataset S4). Because the power of BiSeqS lies in SDMs, only mutations that are interpretable in both strands are considered below.

BiSeqS Increases the Specificity of Mutation Calling. We selected eight amplicons within prototypic cancer driver genes to assess BiSeqS performance. For each of the eight amplicons, two forward primers and two reverse primers for each strand were synthesized and tested using the principles described above and in Materials and Methods. For all amplicons, at least one primer pair for each strand was found capable of specifically amplifying the intended strand with high efficiency, as judged by polyacrylamide gel analysis (Fig. S2). The sequences of these primers are listed in Dataset S5.

For each of the eight amplicons, we compared the specificity of BiSeqS to that of conventional next-generation sequencing (NGS) and molecular barcode-assisted sequencing (i.e., SafeSeqS). We considered only those potential mutations that could be discerned in both strands, as described above. There were a total of 608 bp within these amplicons, yielding a total of 1,550 possible single base substitutions (SBSs). Of these 1,550 potential SBSs, 1,252 (80.8%) were scorable as SDMs; the remainder were transitions that were not scorable for the reasons noted above. There were also many possible indels at each position that could have been observed in the sequencing data, all scorable as SDMs.

In the actual experiment, we could distinguish the strand used as template in the sequencing instrument because of the bisulfite conversion. In light of this, there were actually 2,504 mutations (2× the number of mutations noted in the previous paragraph) that could be scored for both conventional and molecular barcode-assisted sequencing. Of these 2,504 potential SBSs, 1,865 (74.5% of the total possible mutations) were actually observed upon conventional sequencing (25), highlighting the relatively large number of errors observed unless error correction by SafeSeqS or BiSeqS is applied (Dataset S6). There was no discernible difference between the two strands with respect to the number of mutations observed, with 907 and 958 mutations observed on the plus and minus strands, respectively. There were also 298 small insertions or deletions observed by conventional NGS.

Application of the molecular barcoding approach to these data considerably reduced the number of mutations, as evident by comparison of Fig. S3 A and B (note that the y axis scale was reduced by two orders of magnitude in Fig. S3B). The most relevant measure of this reduction is the comparison of the mutant allele frequencies (MAFs) before and after molecular barcoding was

applied. Before molecular barcoding was applied, the median MAF of the SBS in the plus strand was 0.0233% (average, 0.0720%; 95% CI, 0.0627–0.0813%; Fig. 2 A–C and Dataset S6). It was similar in the minus strand: median of 0.0185%, average of 0.0751%, 95% CI 0.0643–0.0859%. As shown in Fig. 2B, after molecular barcoding, the MAF in the plus strand was reduced by eightfold, to a median of 0.0000%, average of 0.0091% (95% CI of 0.0062–0.0119%; P < 10⁻¹², paired two-tailed Student’s t test). Note that the MAF after molecular barcoding is a measure of supermutant allele frequency (SMAF) but is labeled MAF in Fig. 2B for simplicity. The MAF of the minus strand was reduced by ninefold by molecular barcoding (median of 0.0000%, average of 0.0080%, 95% CI of 0.0047–0.0113%; P < 10⁻¹², paired two-tailed Student’s t test). The magnitude of the reductions achieved by SafeSeqS was in accordance with expectations from experiments on native DNA that had not been treated with bisulfite (27).

Application of BiSeqS to these data resulted in a further striking reduction in errors. Only four SDMs were observed over all eight amplicons sequenced, as opposed to 1,865 and 163 mutations without and with molecular barcoding, respectively (Fig. S3; note that the y axis of Fig. S3C has been reduced by another order of magnitude compared with Fig. S3B). This was reflected in the MAFs, as shown in Fig. 2C, which were reduced by 1,217-fold through BiSeqS compared with NGS and 141-fold compared with molecular barcoding (median of 0.0000%, average of 0.0001%, 95% CI of 0.0000–0.0001%; P < 10⁻¹², paired two-tailed Student’s t test).

BiSeqS also reduced errors at indels; there were 364 mutants, 11 supermutants, and zero SDMs observed in the eight amplicons (Dataset S6 and Figs. S4 and S5). The MAFs were thereby reduced from an average of 0.0041% with NGS to 0.0011% with molecular barcoding to 0.0000% with BiSeqS (P < 1.2 × 10⁻⁶ for NGS compared with molecular barcoding for the plus strand, P < 7.5 × 10⁻⁴ for NGS compared with molecular barcoding for the minus strand, and P < 1.3 × 10⁻² for molecular barcoding compared with BiSeqS).

Sensitivity of BiSeqS. Massively parallel sequencing allows billions of amplicons to be assessed simultaneously, resulting in theoretical sensitivities of one mutation among >1 billion WT templates for any base within an amplicon. The actual sensitivities in clinical samples are limited only by the amount of input DNA and the specificity. In many types of liquid biopsies, such as those from plasma, pancreatic cysts, CSF, and urine, the total DNA available is often <33 ng (7, 9, 12, 39). A sensitivity of 0.01% is therefore adequate for detecting the one or two mutant molecules that may exist among the ∼10,000 templates contained in 33 ng of human DNA in such samples. The reliability of this detection is limited by the biological and technical specificities, where the queried mutation must be found at far lower frequencies in the normal control samples used for comparison with the tumor. Although the biological issues that might lead to mutations in normal samples cannot be circumvented (40), technical issues can be addressed and overcome through methodological advances such as BiSeqS.

To address the sensitivity of BiSeqS, we evaluated tumor samples containing 10 double-stranded mutations (20 mutations if each strand is counted separately) within the eight amplicons described above (Dataset S7). The proportion of mutations in each of the tumor samples was defined through NGS. We used the DNA from these tumors to create the scenario characteristic of liquid biopsies, wherein a small amount of DNA from neoplastic cells is mixed with a much larger amount of DNA from normal cells in the patient. More specifically, we diluted this tumor DNA with normal leukocytes to achieve minor allele fractions of 0.02% and 0.20% and then used bisulfite treatment to convert the mixtures. We determined the mutant allele fractions of each of the tumor-derived mutations when analyzed with standard NGS, with molecular barcodes, or with BiSeqS, in all cases holding the input DNA to 5,000 template molecules per well, and performing each experiment in six wells. We found that each of the three methods of analysis yielded mutant allele fractions that were similar to those expected from the dilutions (examples in Fig. 3 and all data in Dataset S7). This experiment demonstrated that the efficiency of each of the steps in BiSeqS, from bisulfite conversion through the amplification and sequencing steps, was high.

Although the efficiency of amplification was therefore always high enough to detect the mutant templates, the MAFs of the normal controls limited the interpretation of the sequencing data.
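
The template arithmetic behind these numbers can be checked with a short calculation. The sketch below is illustrative only: it assumes a haploid human genome mass of about 3.3 pg (a commonly used approximation that is not stated in the text), and the function names are hypothetical. It converts a DNA mass to an approximate template count and estimates the average number of mutant templates expected at the dilutions described above.

```python
# Assumption (not from the text): one haploid human genome weighs ~3.3 pg.
PG_PER_HAPLOID_GENOME = 3.3


def template_count(nanograms):
    """Approximate number of haploid genome templates in a DNA mass (ng)."""
    return nanograms * 1000.0 / PG_PER_HAPLOID_GENOME


def expected_mutants(templates, mutant_fraction):
    """Average number of mutant templates at a given mutant allele fraction."""
    return templates * mutant_fraction


# 33 ng of human DNA corresponds to roughly 10,000 templates.
templates_in_33_ng = template_count(33)

# Dilution experiment from the text: 5,000 templates per well, six wells.
templates_per_well = 5000
wells = 6
total_templates = templates_per_well * wells

for fraction in (0.0002, 0.002):  # 0.02% and 0.20%
    per_well = expected_mutants(templates_per_well, fraction)
    overall = expected_mutants(total_templates, fraction)
    print(f"fraction {fraction:.2%}: ~{per_well:.0f} mutant(s) per well, "
          f"~{overall:.0f} across {wells} wells")
```

Under these assumptions, the 0.02% dilution places on average about one mutant template in each well, which is why splitting a sample across several wells increases the chance of observing a mutant molecule at all.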

[Fig. 2 plots. Panels: A, mutations; B, supermutants; C, SDMs. y axis: MAF (0.00% to 2.50%). x axis: positions within amplicons NRAS_001_014, PIK3R1_570_581, PTEN_100_128, PTEN_213_231, PTEN_248_267, PTEN_304_328, RNF43_099_125, and TP53_367_381_S. Series: (+) strand, (–) strand, both strands.]

Fig. 2. BiSeqS drastically reduces the MAF of single base substitution mutations across amplified loci. (A) MAF of mutations per position across all amplicons. (B) MAF of supermutants per position across all amplicons. (C) MAF of SDMs per position across all amplicons.
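
The three quantities plotted in Fig. 2 follow from the definitions given in the data processing pipeline. The Python sketch below is one possible reading of those rules, not the authors' pipeline: the function names and the input layout (one `(uid, has_mutation)` pair per read, for a single mutation on a single strand) are assumptions made for illustration. Only the >90% family threshold, the <10-fold strand criterion, and the fewest-UIDs denominator are taken from the text.

```python
from collections import defaultdict

SUPERMUTANT_THRESHOLD = 0.90     # >90% of UID family members (from the text)
MAX_STRAND_FOLD_DIFFERENCE = 10  # strand fractions must differ <10-fold (from the text)


def supermutant_uids(reads):
    """Group reads into UID families and return (supermutant UIDs, UID count).

    `reads` is an iterable of (uid, has_mutation) pairs for one mutation
    on one strand (hypothetical input layout).
    """
    families = defaultdict(list)
    for uid, has_mutation in reads:
        families[uid].append(bool(has_mutation))
    supermutants = {
        uid for uid, calls in families.items()
        if sum(calls) / len(calls) > SUPERMUTANT_THRESHOLD
    }
    return supermutants, len(families)


def is_sdm(plus_fraction, minus_fraction):
    """True if a supermutant is present on both strands at similar fractions."""
    if plus_fraction <= 0 or minus_fraction <= 0:
        return False  # the supermutant must be found on both strands
    ratio = max(plus_fraction, minus_fraction) / min(plus_fraction, minus_fraction)
    return ratio < MAX_STRAND_FOLD_DIFFERENCE


def sdm_allelic_fraction(n_sdms, plus_uids, minus_uids):
    """SDMs divided by the UID count of the strand with the fewest UIDs."""
    return n_sdms / min(plus_uids, minus_uids)


# Worked example from the text: 10 SDMs, 10,000 (+) UIDs, 20,000 (-) UIDs.
fraction = sdm_allelic_fraction(10, 10_000, 20_000)
print(f"SDM allelic fraction: {fraction:.1%}")
```

A UID family in which only half of the reads carry the change is discarded at the first step, and a supermutant seen on only one strand never reaches the SDM stage, which is where the additional specificity over molecular barcoding alone arises.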

We considered a mutant call to be a true mutation when the signal-to-noise ratio (SNR), defined as the MAF in the tumor specimen divided by the MAF in normal cells, was >10. We averaged the MAF in both strands for this calculation when considering standard NGS or molecular barcode-assisted NGS. Fig. 3 and Fig. S6 show the detected MAFs for dilutions of 0.20% and 0.02%. Standard NGS yielded SNRs > 10 for only two of the eight mutations at a neoplastic cell content of 0.20% and one out of the three mutations at neoplastic cell contents of 0.02%. Molecular barcoding yielded SNRs > 10 for 7 of the 10 mutations at these neoplastic cell contents. In contrast, BiSeqS yielded SNR > 10 for all 10 mutations at all tested neoplastic cell fractions (Fig. 3, Fig. S6, and Dataset S7). Representative SNR plots of the MAF for mutations in NRAS and TP53 are shown in Fig. S7 A and B, respectively.

BiSeqS Simultaneously Detects Methylation Status on Both Strands. Cytosine bases in 5′-CpG dinucleotides that are methylated are protected from conversion to uracil during bisulfite treatment, allowing BiSeqS to detect the methylation status of the plus and minus strands simultaneously. Although not the primary purpose of BiSeqS, this discrimination could prove useful for the analysis of methylation that occurs at low levels, either for basic research or clinical purposes. Although bisulfite treatment and specially designed primers have often been used to evaluate methylation in the past for a variety of clinical purposes (41–43), the combination of molecular barcoding with simultaneous amplification of both strands provides unprecedented sensitivity in this type of analysis.

To demonstrate the ability of BiSeqS to discriminate the methylation status on both strands simultaneously, we evaluated a region of the TP53 gene that contains a known methylated CpG at hg19 position 7,572,973–4. Greater than 90% of the UIDs on both strands were found to be methylated at the C at the plus strand of position 7,572,973 and the C opposite the G on the minus strand at position 7,572,974. Greater than 99.8% of the C residues that were not at 5′-CpG dinucleotides within this amplicon were found to be converted to Ts, providing an essential control for interpreting the extent of methylation. We then searched for evidence of double-stranded methylation within all eight amplicons evaluated in this study in normal white blood cells (WBCs). There were two 5′-CpG residues within the 608 bp that could be evaluated. Of these, we found that both CpGs were methylated on both strands, with the fraction of methylated alleles ranging from 92.10% to 96.10% (Dataset S8).

[Figure plot. y axis: MAF.]

Discussion
The results described above show that BiSeqS can accurately quantify rare mutations in a highly sensitive and specific manner. We envision that its major use will be in the surveillance of patients with cancer whose primary tumors have been sequenced. It has already been shown that liquid biopsies can be used for this purpose and can accurately identify patients who are in clinical remission but are destined to recur (7, 11, 44). Many such patients, particularly when their residual burden of disease is small and therefore most likely to be cured by adjuvant therapy (45), have only one or two mutant DNA molecules in 10 mL of plasma. In such situations, a technique like BiSeqS, which can efficiently use all template molecules while maintaining high specificity, could prove particularly useful.

A disadvantage of BiSeqS is that it cannot be applied to most transition mutations because of the ambiguities caused by the bisulfite conversion of C to U, which mimics such transitions. Although one strand is still susceptible to BiSeqS, the power of the technology lies in its ability to detect mutations in both strands, so it offers no advantage over molecular barcoding for such mutations. For example, KRAS codons 12, 13, and 61 are commonly mutated by single base substitutions in colon, rectal, and pancreatic adenocarcinomas (46). BiSeqS can be used to quantify KRAS mutations in 38.7%, 43.4%, and 47.6% of these cancers, respectively (47). Across all cancers and TP53 mutations cataloged in the IARC database, approximately 44% of all mutations (i.e., SBS and indels) are amenable to BiSeqS analysis (IARC TP53 Database, R18). Additionally, bisulfite treatment can result in conversion of methylated C bases to U in rare instances, depending upon the incubation time and reagent concentration (48). The protocol used for BiSeqS employs reduced incubation temperatures that appear to minimize this possibility (48), but sequence heterogeneity at methylated CpG sites may raise background, and such sites are not preferred for mutation evaluation.

However, for liquid biopsies in surveillance, limitations inherent to a single gene are not a major issue because several different mutations, including transversions and indels, are generally observed upon genome-wide sequencing of cancers (1–3), and any identified mutation could in principle be applied to this clinical scenario. Based on a recent study of 3,281 cancer samples, it was evident that most cancers have at least one driver gene mutation that should be amenable to BiSeqS analysis (49). It is also worth noting that passenger gene mutations that are clonal can also be useful for diagnostic evaluation (50). Because there are at least 10-fold as many passenger mutations as driver gene mutations in nearly all cancers, it is likely that the vast majority of cancers will have several somatic mutations that could be assessed by BiSeqS. For example, in a study of SBSs, insertions, and deletions detected in breast cancer, we calculate that 62.1% of mutations would be amenable to BiSeqS analysis (51). Because breast cancers nearly always contain >25 clonal substitutions, virtually all breast cancers would have many mutations amenable to BiSeqS analysis.

BiSeqS can complement screening for other genomic alterations, such as structural variants (SVs), for rare allele detection and mon-

0.15% itoring (52). SVs provide exquisitely specific markers for cancer that can be used for liquid biopsies (9, 50). Simple polymerase errors do 0.10% not produce SVs, providing advantages over single base substitutions (Mutaons, Supermutants, or SDMs) Supermutants, (Mutaons, 0.05% as diagnostic targets. On the other hand, there are disadvantages to the use of SVs as diagnostic markers. First, SV detection requires 0.00% NGS Molecular BiSeqS NGS Molecular BiSeqS NGS Molecular BiSeqS NGS Molecular BiSeqS of tumors, rather than targeted sequenc- Barcoding Barcoding Barcoding Barcoding 02.0 % %02.0 0 %02. 0.0 %2 ing of tumors, for their initial detection; the latter is currently much T>G G> >T T> less expensive than the former. Second, and more importantly, SVs PTEN_248_267 RNF43_099_125 TP53_367_381_S TP53_367_381_S are “private”—that is, generally confined to one or a small number of Control MAF Sample MAF patients. To be used as a tumor marker, primers that specifically amplify the translocation junction must be designed and tested on the Fig. 3. BiSeqS maintains the sensitivity inherent to PCR-based molecular ’ barcoding. Mutant DNA was spiked into normal DNA at a 0.20% or 0.02% patient s tumor to ensure that the SV is somatic and the amplicon is target MAF, and the sequencing data were evaluated by standard NGS, specific. Although this approach is feasible in a research setting, it is molecular barcoding, and BiSeqS. not easily practicable in large-scale settings. In contrast, single base

substitutions and indels in driver genes are observed in numerous independent tumors, and a small set of “off-the-shelf” primers can be used to assess most patients. For example, we estimate that >98% of patients with colorectal cancer have mutations detectable through amplification with one of 130 predesigned primer pairs.

In the future, it is possible that chemical treatments of DNA that convert A:T base pairs (rather than C:G base pairs) to other base pairs could substitute for bisulfite when transition mutations must be analyzed. Another avenue for future research is multiplexing, permitting mutations in a variety of amplicons to be assessed simultaneously in screening scenarios. This multiplexing is more difficult than normal because two amplicons must be designed for each region of interest while achieving homogeneous efficiency of every amplicon in all regions of interest.

Mattox et al. PNAS | May 2, 2017 | vol. 114 | no. 18 | 4737

Materials and Methods

Detailed materials and methods are available in SI Materials and Methods. Briefly, DNA from macrodissected formalin-fixed paraffin-embedded (FFPE) tumor sections was extracted and bisulfite treated with an EZ DNA Methylation Kit (Zymo Research, cat. no. D5001). Custom primers containing a unique identifier (UID) and amplicon-specific sequence were used to amplify both strands of DNA, and the resulting products were sequenced on an Illumina MiSeq instrument. To characterize the specificity of BiSeqS, DNA isolated from one normal tissue was bisulfite-treated and processed through the BiSeqS pipeline to query for single base substitutions and indels. To characterize the sensitivity of BiSeqS, macrodissected tumor samples with known MAFs were diluted with the DNA from normal WBCs to obtain final neoplastic cell contents ranging from 0.02% to 0.20%, bisulfite-treated, and processed through the BiSeqS pipeline. All tissues were obtained from consenting patients at the Johns Hopkins Hospital with the approval of the Johns Hopkins Institutional Review Board.

ACKNOWLEDGMENTS. We thank Margaret Hoang, Surojit Sur, Nick Wyhs, Wyatt McMahon, and Ming Zhang for their helpful comments on the project and manuscript as well as Lisa Dobbyn, Janine Ptak, Joy Schaefer, Natalie Silliman, and Maria Papoli for their expert technical assistance. This work was supported by The Virginia and D.K. Ludwig Fund for Cancer Research, the Lustgarten Foundation for Pancreatic Cancer Research, and National Institutes of Health Grants P50-CA62924, CA 06973, and GM 07309. All sequencing was performed at the Sol Goldman Sequencing Facility at Johns Hopkins.
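
The SNR criterion used in the dilution experiments (MAF in the tumor specimen divided by MAF in normal cells, with a call accepted when the ratio exceeds 10) can be illustrated with a minimal Python sketch. The function names, the handling of a zero background, and the numeric values in the usage lines are assumptions made for illustration only; they are not taken from the BiSeqS pipeline.

```python
from statistics import mean

SNR_THRESHOLD = 10.0  # a call is accepted as a true mutation when SNR > 10


def strand_averaged_maf(plus_maf, minus_maf):
    """Average the MAFs measured on the plus and minus strands.

    The text applies this averaging to standard NGS and to
    molecular barcode-assisted NGS.
    """
    return mean([plus_maf, minus_maf])


def signal_to_noise(tumor_maf, normal_maf):
    """SNR = MAF in the tumor specimen / MAF in normal cells."""
    if normal_maf == 0:
        # Assumption: with no background signal the ratio is treated as
        # unbounded when any tumor signal exists, and as zero otherwise.
        return float("inf") if tumor_maf > 0 else 0.0
    return tumor_maf / normal_maf


def is_true_mutation(tumor_maf, normal_maf, threshold=SNR_THRESHOLD):
    """Apply the SNR > threshold rule to a single candidate mutation."""
    return signal_to_noise(tumor_maf, normal_maf) > threshold


# Hypothetical per-strand MAFs, written as fractions (0.002 = 0.20%).
tumor = strand_averaged_maf(0.0021, 0.0019)
normal = strand_averaged_maf(0.00008, 0.00012)
print(signal_to_noise(tumor, normal), is_true_mutation(tumor, normal))
```

Keeping the threshold as a parameter makes it straightforward to examine how the number of accepted calls would change under a stricter or looser cutoff.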

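The restriction of BiSeqS to non-transition substitutions, noted in the Discussion, can be reproduced with a small in silico model. The Python sketch below assumes complete conversion of every C to U (read as T after amplification) and no methylation, which is a simplification; the function names are hypothetical and do not come from the BiSeqS software.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def bisulfite(base):
    """Model complete conversion of an unmethylated C, read as T after PCR."""
    return "T" if base == "C" else base


def distinguishable_strands(ref, alt):
    """Return the strands on which ref and alt still differ after conversion.

    ref and alt are given as plus-strand bases; the minus strand carries
    their complements, which are converted independently.
    """
    strands = []
    if bisulfite(ref) != bisulfite(alt):
        strands.append("plus")
    if bisulfite(COMPLEMENT[ref]) != bisulfite(COMPLEMENT[alt]):
        strands.append("minus")
    return strands


def amenable_to_both_strands(ref, alt):
    """True when the substitution remains visible on both strands."""
    return distinguishable_strands(ref, alt) == ["plus", "minus"]


substitutions = [(r, a) for r in BASES for a in BASES if r != a]
amenable = [f"{r}>{a}" for r, a in substitutions if amenable_to_both_strands(r, a)]

print(distinguishable_strands("C", "T"))  # ['minus']
print(distinguishable_strands("T", "G"))  # ['plus', 'minus']
print(amenable)
```

Under these assumptions the four transitions (C>T, T>C, G>A, and A>G) remain visible on only one strand, whereas all eight transversions remain visible on both. A methylated C, which resists conversion, would need separate handling.
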
1. Garraway LA, Lander ES (2013) Lessons from the cancer genome. Cell 153:17–37.
2. Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458:719–724.
3. Vogelstein B, et al. (2013) Cancer genome landscapes. Science 339:1546–1558.
4. Sidransky D, et al. (1992) Identification of ras oncogene mutations in the stool of patients with curable colorectal tumors. Science 256:102–105.
5. Sidransky D, et al. (1991) Identification of p53 gene mutations in bladder cancers and urine samples. Science 252:706–709.
6. Hruban RH, van der Riet P, Erozan YS, Sidransky D (1994) Brief report: Molecular biology and the early detection of carcinoma of the bladder–The case of Hubert H. Humphrey. N Engl J Med 330:1276–1278.
7. Tie J, et al. (2016) Circulating tumor DNA analysis detects minimal residual disease and predicts recurrence in patients with stage II colon cancer. Sci Transl Med 8:346ra92.
8. Dawson SJ, et al. (2013) Analysis of circulating tumor DNA to monitor metastatic breast cancer. N Engl J Med 368:1199–1209.
9. Bettegowda C, et al. (2014) Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci Transl Med 6:224ra24.
10. Kinde I, et al. (2013) Evaluation of DNA from the Papanicolaou test to detect ovarian and endometrial cancers. Sci Transl Med 5:167ra4.
11. Wang Y, et al. (2015) Detection of somatic mutations and HPV in the saliva and plasma of patients with head and neck squamous cell carcinomas. Sci Transl Med 7:293ra104.
12. Wang Y, et al. (2015) Detection of tumor-derived DNA in cerebrospinal fluid of patients with primary tumors of the brain and spinal cord. Proc Natl Acad Sci USA 112:9704–9709.
13. Wang Y, et al. (2016) Diagnostic potential of tumor DNA from ovarian cyst fluid. eLife 5:5.
14. Springer S, et al. (2015) A combination of molecular markers and clinical features improve the classification of pancreatic cysts. Gastroenterology 149:1501–1510.
15. Forshew T, et al. (2012) Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Sci Transl Med 4:136ra68.
16. De Mattos-Arruda L, Caldas C (2016) Cell-free circulating tumour DNA as a liquid biopsy in breast cancer. Mol Oncol 10:464–474.
17. Vogelstein B, Kinzler KW (1999) Digital PCR. Proc Natl Acad Sci USA 96:9236–9241.
18. Dressman D, Yan H, Traverso G, Kinzler KW, Vogelstein B (2003) Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci USA 100:8817–8822.
19. Margulies M, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380.
20. Mitra RD, Church GM (1999) In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res 27:e34.
21. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145.
22. Do H, Dobrovic A (2012) Dramatic reduction of sequence artefacts from DNA isolated from formalin-fixed cancer biopsies by treatment with uracil-DNA glycosylase. Oncotarget 3:546–558.
23. Do H, Wong SQ, Li J, Dobrovic A (2013) Reducing sequence artifacts in amplicon-based massively parallel sequencing of formalin-fixed paraffin-embedded DNA by enzymatic depletion of uracil-containing templates. Clin Chem 59:1376–1383.
24. Bratman SV, Newman AM, Alizadeh AA, Diehn M (2015) Potential clinical utility of ultrasensitive circulating tumor DNA detection with CAPP-Seq. Expert Rev Mol Diagn 15:715–719.
25. Bokulich NA, et al. (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods 10:57–59.
26. Sykes PJ, et al. (1992) Quantitation of targets for PCR by use of limiting dilution. Biotechniques 13:444–449.
27. Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B (2011) Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA 108:9530–9535.
28. Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP (2011) A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res 39:e81.
29. Schmitt MW, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA 109:14508–14513.
30. Hoang ML, et al. (2016) Genome-wide quantification of rare somatic mutations in normal human tissues using massively parallel sequencing. Proc Natl Acad Sci USA 113:9846–9851.
31. He Y, Vogelstein B, Velculescu VE, Papadopoulos N, Kinzler KW (2008) The antisense transcriptomes of human cells. Science 322:1855–1857.
32. Frommer M, et al. (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 89:1827–1831.
33. Levy D, Wigler M (2014) Facilitated sequence counting and assembly by template mutagenesis. Proc Natl Acad Sci USA 111:E4632–E4637.
34. Hayatsu H, Wataya Y, Kai K, Iida S (1970) Reaction of sodium bisulfite with uracil, cytosine, and their derivatives. Biochemistry 9:2858–2865.
35. Clark SJ, Statham A, Stirzaker C, Molloy PL, Frommer M (2006) DNA methylation: Bisulphite modification and analysis. Nat Protoc 1:2353–2364.
36. Li M, et al. (2009) Sensitive digital quantification of DNA methylation in clinical samples. Nat Biotechnol 27:858–863.
37. Lewis F, Maughan NJ, Smith V, Hillan K, Quirke P (2001) Unlocking the archive–Gene expression in paraffin-embedded tissue. J Pathol 195:66–71.
38. Koch I, et al. (2006) Real-time quantitative RT-PCR shows variable, assay-dependent sensitivity to formalin fixation: Implications for direct comparison of transcript levels in paraffin-embedded tissues. Diagn Mol Pathol 15:149–156.
39. Kinde I, et al. (2013) TERT mutations occur early in urothelial neoplasia and are biomarkers of early disease and disease recurrence in urine. Cancer Res 73:7162–7167.
40. Krimmel JD, et al. (2016) Ultra-deep sequencing detects ovarian cancer cells in peritoneal fluid and reveals somatic TP53 mutations in noncancerous tissues. Proc Natl Acad Sci USA 113:6005–6010.
41. Chung W, et al. (2011) Detection of bladder cancer using novel DNA methylation biomarkers in urine sediments. Cancer Epidemiol Biomarkers Prev 20:1483–1491.
42. Taby R, Issa JP (2010) Cancer epigenetics. CA Cancer J Clin 60:376–392.
43. Issa JP (2012) DNA methylation as a clinical marker in oncology. J Clin Oncol 30:2566–2568.
44. Harris FR, et al. (2016) Quantification of somatic chromosomal rearrangements in circulating cell-free DNA from ovarian cancers. Sci Rep 6:29831.
45. Bozic I, et al. (2013) Evolutionary dynamics of cancer in response to targeted combination therapy. eLife 2:e00747.
46. Fearon ER, Vogelstein B (1990) A genetic model for colorectal tumorigenesis. Cell 61:759–767.
47. Prior IA, Lewis PD, Mattos C (2012) A comprehensive survey of Ras mutations in cancer. Cancer Res 72:2457–2467.
48. Shiraishi M, Hayatsu H (2004) High-speed conversion of cytosine to uracil in bisulfite genomic sequencing analysis of DNA methylation. DNA Res 11:409–415.
49. Kandoth C, et al. (2013) Mutational landscape and significance across 12 major cancer types. Nature 502:333–339.
50. Leary RJ, et al. (2012) Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Sci Transl Med 4:162ra154.
51. Wood LD, et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science 318:1108–1113.
52. Macintyre G, Ylstra B, Brenton JD (2016) Sequencing structural variants in cancer for precision therapeutics. Trends Genet 32:530–542.
