Partial Bisulfite Conversion for Unique Template Sequencing
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Cold Spring Harbor Laboratory Institutional Repository Nucleic Acids Research, 2017 1 doi: 10.1093/nar/gkx1054 Partial bisulfite conversion for unique template sequencing Vijay Kumar, Julie Rosenbaum, Zihua Wang, Talitha Forcier, Michael Ronemus, Michael Wigler and Dan Levy* Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724 USA Received May 25, 2017; Revised October 10, 2017; Editorial Decision October 16, 2017; Accepted October 18, 2017 ABSTRACT tinguishable from PCR and sequencing error. Finally, dis- tortion during PCR amplification makes for an unreliable We introduce a new protocol, mutational sequenc- estimate of count in RNA expression and DNA copy num- ing or muSeq, which uses sodium bisulfite to ran- ber. domly deaminate unmethylated cytosines at a fixed There are a host of new technologies designed to ad- and tunable rate. The muSeq protocol marks each dress the shortcomings of short-read sequencing, utilizing initial template molecule with a unique mutation sig- various strategies of limiting dilution and tagging (2–7). In nature that is present in every copy of the template, this paper, we present a fundamentally different approach and in every fragmented copy of a copy. In the se- that embeds a unique mutational tag within the sequence of quenced read data, this signature is observed as a each template molecule, and discuss the merits of a diverse unique pattern of C-to-T or G-to-A nucleotide con- set of tools for enhanced short-read sequencing. Previously, versions. Clustering reads with the same conversion we described the theoretical practicality of such a method (8). By marking each initial template molecule with a ran- pattern enables accurate count and long-range as- dom mutation pattern, all subsequent copies of the origi- sembly of initial template molecules from short-read nal molecule will carry the same pattern. Thus overlapping sequence data. We explore count and low-error se- copies from the same initial template can be joined if they quencing by profiling 135 000 restriction fragments have near identical patterns that far exceed chance agree- in a PstI representation, demonstrating that muSeq ment. With sufficient coverage, this property enables the improves copy number inference and significantly long-range assembly of each mutated template molecule. reduces sporadic sequencer error. We explore long- Such information is also useful for problems of haplotype range assembly in the context of cDNA, generating phasing and measuring repeat lengths. As with all template contiguous transcript clusters greater than 3,000 bp tagging methods, this method also allows accurate counting in length. The muSeq assemblies reveal transcrip- and low-error sequencing. tional diversity not observable from short-read data We demonstrate a protocol and informatics that realize the theory: marking each initial template with a demonstra- alone. bly unique mutational pattern and reconstructing identity, assembly, and count from noisy real-world data. We call INTRODUCTION this method mutational sequencing or muSeq. We use par- tial sodium bisulfite conversion to mark double-stranded Long-read sequencing platforms such as PacBio and Ox- template DNA molecules or first-strand cDNAs. The bisul- ford Nanopore are costly and error-prone, but provide the fite reaction deaminates unmethylated cytosines, and is typ- long-range information required for high quality assem- ically used for studying cytosine methylation patterns in the blies (1). Short-read sequencers are relatively inexpensive genome (9). For that application, the deamination reaction and have excellent precision; however, the reads lengths are is run to completion, converting nearly every unmethylated sufficient only for simpler assemblies. The specific problems cytosine to uracil. For randomly marking templates, how- with short-read sequencers are readily enumerated. When- ever, we require partial conversion. By adjusting the time ever the distinguishing variants in the template molecules and temperature of a step in the bisulfite reaction, we can are more than one read length apart, multiple distinct as- reliably control the rate of conversion. Reflecting the binary semblies are equally consistent with the read data. This pre- nature of the conversion, we refer to cytosines in this con- vents resolving haplotypes, observing transcript isoforms, text as ‘bits.’ and assembling complex repetitive regions. Although se- quence fidelity is good, low-frequency variants are not dis- *To whom correspondence should be addressed. Tel: +1 516 367 8377; Fax: +1 516 367 8381; Email: [email protected] C The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkx1054/4637579 by Cold Spring Harbor Laboratory user on 01 December 2017 2 Nucleic Acids Research, 2017 To test the operating characteristics of the muSeq proto- read 1 and all G to A in read 2 (Figure 1F). These modified col, we conducted two series of experiments. In the first we reads are then mapped by Bowtie2 (11) to two modified ver- studied the application of muSeq to a genomic representa- sions of the reference genome (hg38 assembly): one with all tion (Figure 1) focusing our attention on ∼135,000 restric- C converted to T (hg38 CT), and one with all G converted tion fragments of the proper size and uniqueness. Our ex- to A (hg38 GA). Selection of the best mapping determines periments show that the rate of deamination is independent the strand of origin (Figure 1G). By referencing the original of position and uncorrelated within the template. We ob- read pair, we determine the conversion pattern. serve that fragment counts are linear with copy number and From the reference genome, there are 162,353 expected that allele ratios follow the expected binomial distribution. PstI fragments between 150 and 400 bp in length. We con- We determine that the method does not contribute any mea- vert these fragments in silico, both C-to-T and G-to-A, and surable sequence error. In the second series of experiments, mapthemtohg38CT and hg38 GA. Those in which both we applied the muSeq protocol to cDNA derived from re- top and bottom strands map unambiguously (MAPQ ≥ verse transcribed poly(A)+ cellular RNA. Applying a simple 40) comprise 135,262 high quality representation fragments algorithm, we clustered sequence reads into longer consen- (HQRFs). We further consider only those reads that map sus templates. The resulting templates comprised thousands with high quality alignments to HQRFs. These reads ac- of unique transcripts and compare favorably to reference count for about 50% of the raw sequence. transcript assemblies. Analyses of the data demonstrate the Read pairs are binned by restriction fragment and the ability to reconstruct splicing patterns at the level of indi- RT or RB of the initial template (Figure 1H).Eachofthe vidual transcripts. 270,000 bins is analyzed separately to determine the set of initial template conversion patterns. While many read MATERIALS AND METHODS pairs from the same initial template fragment bear identi- cal conversion patterns and sequence, sequencing and PCR Representations errors are sufficiently frequent to require methods for infer- Genomic DNAs were extracted from whole blood, cleaved ring their consensus, or common ancestral template. Conse- with PstI, end-repaired and ligated to custom Illumina quently, we extract all bits from each read pair (Figure 1H), sequencing primers (Figure 1A and B). The primers are establishing a bit string where 0 indicates that a position rendered bisulfite conversion-resistant by substituting 5- is unconverted and 1 indicates that a position is converted methyl-cytosine (5mC) for cytosine during oligo synthe- by sodium bisulfite. To cluster read pairs, we use transitive sis. The complete conversion protocol uses the MethylEasy propagation, an algorithm we developed to find an optimal Xceed Rapid DNA Bisulphite Modification Kit Mix (Hu- clustering (see supplement for details). Given a model for man Genetic Signatures/Clontech) according to standard base-calling error and a model for conversion rates, tran- instructions. The partial conversion protocol (Figure 1C) sitive propagation identifies a clustering solution that opti- uses the same kit and instructions, but we reduce the tem- mizes pairwise probabilities of belonging to the same cluster perature and time during incubation with the combined (a = b) or not (a = b) under the condition that belonging to Reagents 1 and 2 (step 5 in the instructions). We started the same cluster is a transitive relation. with 75 ng of input DNA for each reaction and carried out both complete (45 min, 80◦C) and partial conversion ◦ cDNA at 3, 6 and 9 min at 73 C. After conversion, we sampled 4% from each converted sample and PCR-amplified (Fig- We extracted total RNA from 3 million fibroblasts from a ure 1D) using Illumina P5 and P7 sequencing adapters (for line derived from the same donor as the whole blood sample. the complete conversion library, we sampled 40%). The re- We sampled 3.3% of the RNA for conversion to cDNA by sulting libraries were sequenced (Figure 1E) on an Illumina reverse transcriptase (100 U; SMARTScribe reverse tran- MiSeq (∼17 million paired-end reads per sample). We also scriptase; Clontech), employing custom oligo d(T) primers sampled 2% from the 6 min conversion, then amplified for and template switch primers, each with a sample tag and five linear rounds with just one primer (P7); we then com- random barcode. We made two such samples with distinct pleted the PCR as above, sequencing the resulting libraries pairs of sample tags.