View metadata, citation and similar papers at core.ac.uk brought to you by CORE
provided by Cold Spring Harbor Laboratory Institutional Repository Nucleic Acids Research, 2017 1 doi: 10.1093/nar/gkx1054 Partial bisulfite conversion for unique template sequencing Vijay Kumar, Julie Rosenbaum, Zihua Wang, Talitha Forcier, Michael Ronemus, Michael Wigler and Dan Levy*
Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724 USA
Received May 25, 2017; Revised October 10, 2017; Editorial Decision October 16, 2017; Accepted October 18, 2017
ABSTRACT tinguishable from PCR and sequencing error. Finally, dis- tortion during PCR amplification makes for an unreliable We introduce a new protocol, mutational sequenc- estimate of count in RNA expression and DNA copy num- ing or muSeq, which uses sodium bisulfite to ran- ber. domly deaminate unmethylated cytosines at a fixed There are a host of new technologies designed to ad- and tunable rate. The muSeq protocol marks each dress the shortcomings of short-read sequencing, utilizing initial template molecule with a unique mutation sig- various strategies of limiting dilution and tagging (2–7). In nature that is present in every copy of the template, this paper, we present a fundamentally different approach and in every fragmented copy of a copy. In the se- that embeds a unique mutational tag within the sequence of quenced read data, this signature is observed as a each template molecule, and discuss the merits of a diverse unique pattern of C-to-T or G-to-A nucleotide con- set of tools for enhanced short-read sequencing. Previously, versions. Clustering reads with the same conversion we described the theoretical practicality of such a method (8). By marking each initial template molecule with a ran- pattern enables accurate count and long-range as- dom mutation pattern, all subsequent copies of the origi- sembly of initial template molecules from short-read nal molecule will carry the same pattern. Thus overlapping sequence data. We explore count and low-error se- copies from the same initial template can be joined if they quencing by profiling 135 000 restriction fragments have near identical patterns that far exceed chance agree- in a PstI representation, demonstrating that muSeq ment. With sufficient coverage, this property enables the improves copy number inference and significantly long-range assembly of each mutated template molecule. reduces sporadic sequencer error. We explore long- Such information is also useful for problems of haplotype range assembly in the context of cDNA, generating phasing and measuring repeat lengths. As with all template contiguous transcript clusters greater than 3,000 bp tagging methods, this method also allows accurate counting in length. The muSeq assemblies reveal transcrip- and low-error sequencing. tional diversity not observable from short-read data We demonstrate a protocol and informatics that realize the theory: marking each initial template with a demonstra- alone. bly unique mutational pattern and reconstructing identity, assembly, and count from noisy real-world data. We call INTRODUCTION this method mutational sequencing or muSeq. We use par- tial sodium bisulfite conversion to mark double-stranded Long-read sequencing platforms such as PacBio and Ox- template DNA molecules or first-strand cDNAs. The bisul- ford Nanopore are costly and error-prone, but provide the fite reaction deaminates unmethylated cytosines, and is typ- long-range information required for high quality assem- ically used for studying cytosine methylation patterns in the blies (1). Short-read sequencers are relatively inexpensive genome (9). For that application, the deamination reaction and have excellent precision; however, the reads lengths are is run to completion, converting nearly every unmethylated sufficient only for simpler assemblies. The specific problems cytosine to uracil. For randomly marking templates, how- with short-read sequencers are readily enumerated. When- ever, we require partial conversion. By adjusting the time ever the distinguishing variants in the template molecules and temperature of a step in the bisulfite reaction, we can are more than one read length apart, multiple distinct as- reliably control the rate of conversion. Reflecting the binary semblies are equally consistent with the read data. This pre- nature of the conversion, we refer to cytosines in this con- vents resolving haplotypes, observing transcript isoforms, text as ‘bits.’ and assembling complex repetitive regions. Although se- quence fidelity is good, low-frequency variants are not dis-
*To whom correspondence should be addressed. Tel: +1 516 367 8377; Fax: +1 516 367 8381; Email: [email protected]