Nanopore Sequencing of Single-Cell Transcriptomes with Sccolor-Seq
Total Page:16
File Type:pdf, Size:1020Kb
BRIEF COMMUNICATION https://doi.org/10.1038/s41587-021-00965-w Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq Martin Philpott1,2, Jonathan Watson3, Anjan Thakurta2,4,5, Tom Brown Jr 3, Tom Brown Sr 3,6, Udo Oppermann 1,2,7 ✉ and Adam P. Cribbs 1,2 ✉ Here we describe single-cell corrected long-read sequencing sequencing error rate of up to 10% (Fig. 1c). In single-cell sequenc- (scCOLOR-seq), which enables error correction of barcode ing, UMIs allow the deduplication of single transcripts that are and unique molecular identifier oligonucleotide sequences detected multiple times in sequencing, arising from PCR copies pro- and permits standalone cDNA nanopore sequencing of single duced during library preparation. The directional network-based cells. Barcodes and unique molecular identifiers are synthe- method first proposed by UMI-tools6 was modified to correct for sized using dimeric nucleotide building blocks that allow error UMI sequence duplication (Fig. 1d and Supplementary Fig. 3). detection. We illustrate the use of the method for evaluating Using simulated data, we show that the dimer-synthesized UMIs barcode assignment accuracy, differential isoform usage in can be used to deduplicate UMIs even with a sequencing error rate myeloma cell lines, and fusion transcript detection in a sar- above 10% (Fig. 1e,f and Supplementary Fig. 4). coma cell line. scCOLOR-seq was validated using human HEK293T and mouse Long-read sequencing technologies such as PacBio 3T3 single-cell Drop-seq libraries from approximately 1,200 cells (at single-molecule real-time (SMRT) sequencing1 or Oxford Nanopore a 50:50 mouse:human cell ratio), followed by Illumina short-read sequencing2 enable the sequencing of full-length transcripts. The sequencing. Overall, 68% of all reads show complete dinucleotide application of PacBio SMRT to single-cell sequencing has been block complementarity across the full barcode sequence. This hindered by a low sequencing capacity (four million reads per flow suggests that the theoretical base-calling accuracy for the bar- cell), which means that a single run can report on only 40–133 cells code should be 98.4%, which aligns with the reported accuracy of at a comparable read depth to short-read approaches, or on thou- Illumina sequencing. The dimer-correction approach was evaluated sands of cells by sacrificing quantitative information3. Zeng et al.4 by measuring the proportion of human, mouse and mixed spe- have improved the platform for single-cell sequencing workflows by cies cells identified following increased edit distances (that is, the concatenating multiple full-length cDNAs into a single insert. This Levenshtein distance) between the error-sequenced and the accu- approach led to a return of 10 million reads from a single SMRT cell, rately sequenced barcodes (Supplementary Fig. 5). An edit distance providing eight times more data output than the standard PacBio of 4 was found to result in accurate assignment of both mouse and sequencing protocol. Alternatively, nanopore sequencing provides human reads (Supplementary Fig. 6), enabling the recovery of an up to 250 million reads per PromethION flow cell, but its main draw- extra 8% of total reads. Although further reads could be recov- back is its high error rate compared with both PacBio long-read and ered using an edit distance of 5, this was obtained at the expense of Illumina short-read sequencing (5–15% compared with less than increased numbers of mixed species cells (Supplementary Fig. 5i). 1%)5. To overcome the low base-calling accuracy of Oxford Nanopore Application of scCOLOR-seq to nanopore sequencing identified sequencing, here we describe single-cell corrected long-read the presence of a poly(A) sequence in 40% (range, 24–62%) of all sequencing (scCOLOR-seq), in which the barcode and unique nanopore sequencing reads and detected 12.9% (range, 9–15%) of molecular identifier (UMI) regions of the oligonucleotide-barcoded these reads with dual nucleotide complementarity across the full RNA-capture microbeads are synthesized using homodimeric nucle- barcode sequence (Supplementary Fig. 7). This suggests that the oside phosphoramidite building blocks (Supplementary Fig. 1), theoretical base-calling accuracy of single-cell nanopore sequenc- which provides a means for sequencing-error detection and correc- ing should be 91.8%. However, barcodes were observed to often tion of the barcode and UMI (Fig. 1a). contain more than one error per barcode, which has the effect of To correctly assign barcodes to cells, a computational strategy reducing the overall measurable base-calling accuracy to 86% was developed in which barcodes were identified in a two-pass (Supplementary Fig. 8). Naive collapsing of barcodes and UMIs assignment method (Fig. 1b and Supplementary Fig. 2). First, bar- sequenced into single-base sequences without error correction led codes without errors were identified on the basis of nucleotide pair to only 81 recovered cells (Supplementary Fig. 9). The dimer cor- complementarity across the full length of the barcode. Next, these rection approach using an edit distance of 6 led to the recovery of accurate barcodes were used as a guide to correct the remaining 54% (range, 43%–68%) of barcodes containing sequencing errors erroneous barcodes. Using simulated data, we show that this strat- (Fig. 2a,b and Supplementary Fig. 7c). Increasing the edit distance to egy is capable of correcting erroneous barcodes with a high sequenc- 7 increased recovery to 82% (range, 79.8%–83.6%), at the expense of ing error rate, with 96% of barcodes recovered with a barcode slightly increased numbers of mixed cells (Supplementary Fig. 10). 1Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, National Institute of Health Research Oxford Biomedical Research Unit (BRU), University of Oxford, Oxford, UK. 2Oxford Centre for Translational Myeloma Research University of Oxford, Oxford, UK. 3ATDBio, Oxford, UK. 4Radcliffe Department of Medicine, Oxford University, Oxford, UK. 5Translational Medicine, Bristol Myers Squibb, Summit, NJ, USA. 6Chemistry Research Laboratory, Department of Chemistry, University of Oxford, Oxford, UK. 7Centre for Medicines Discovery, University of Oxford, Oxford, UK. ✉e-mail: [email protected]; [email protected] NaturE BIotECHnoloGY | www.nature.com/naturebiotechnology BRIEF COMMUNICATION NATURE BIOTECHNOLOGY a Barcode and UMI synthesized using homodimeric blocks of reverse phosphoramidites Mononucleotide Block nucleotide PCR handle Barcode UMI Poly(A) site Bead Original barcode and UMI: Sequenced read: Error Error b c Barcode blocks 100 Barcode have a mismatch 75 Guide 50 Barcode blocks agree across the 25 full length Cell barcodes correctly Barcodes recovered (%) 0 assigned –3 –2 –1 log10-transformed sequencing error rate d e f UMI UMI block mismatch. 0.20 UMI split and collapsed 7,500 into single nucleotides 0.15 5,000 UMI nucleotide 0.10 blocks agree Difference 2,500 across the full length. Directional 0.05 UMI collapsed into Coefficient of variation UMI-tools count (corrected versus truth) 0 single correction nucleotides –3 –2 –1 –3 –2 –1 log10-transformed log10-transformed sequencing error rate sequencing error rate No correction Directional Directional paired Fig. 1 | Developing a strategy to error-correct barcode and UMI sequences from droplet-based sequencing. a, Schematic bead and oligonucleotide structure using dimer blocks of nucleotides for Buc-seq. b, Cell barcode-assignment strategy. c, UMI deduplication strategy. d, Simulated data showing the number of barcodes recovered with increasing simulated sequencing error rates. e,f, Simulated data showing the difference and coefficient of variation between the deduplicated UMIs and the ground truth. Correction of the UMI counts was performed using a basic directional network-based approach after accounting for sequencing errors within homodimeric blocks of nucleotides. However, filtering on the basis of at least the presence of 200 fea- within these cells. We next searched for differentially regulated tures per cell removed a substantial proportion of mixed cells transcripts between cell types and clusters. In this experiment, (Supplementary Fig. 11). Cells were then projected into two cell-type-specific usage was observed for 359 genes and 416 dif- dimensions using uniform manifold approximation and projec- ferentially expressed isoforms. Differential transcript usage was tion (UMAP) and a clear separation of the mouse and human cell particularly apparent for the marker CD74 (Fig. 2e–h), which is a populations was observed, with 1,077 cells recovered using an edit potential therapeutic target in multiple myeloma7. Furthermore, in distance of 6 and 1,064 cells recovered using an edit distance of 7 agreement with the literature and the biology of plasma cells8, a dif- (Supplementary Fig. 10). ferential expression of both immunoglobulin κ and λ light-chain scCOLOR-seq was applied to a mixture (1:1:1 ratio) of human isoform use between the different myeloma cell lines was observed NCI-H929, JJN3 and DF15 myeloma cell lines and approximately (Supplementary Figs. 16, 17). 500 cells were sequenced using a MinION flow cell and 1,200 cells Long read sequencing permits the measurement of fusion tran- sequenced using a PromethION flow cell. After filtering with a scripts that are often key drivers of tumor development. To illus- minimum of 200 features per cell (Supplementary Figs. 12, 13), we trate the principle, Ewing’s sarcoma was selected, which harbors show that nanopore sequencing can resolve the different myeloma the t(11:22)(q24:q212) translocation that generates EWS-FLI, a cell types at both the gene level (Fig. 2c and Supplementary Fig. 14) fusion between EWSR1 (Ewing’s sarcoma breakpoint region 1) and and the transcript level (Fig. 2d and Supplementary Fig. 14b–f). the ETS transcription factor FLI1 (Friend leukemia integration 1) There was also a good correlation between Oxford Nanopore and genes9. The EWS–FLI protein regulates the expression of numerous Illumina gene counts per cell (R = 0.67) and the number of UMIs target genes that promote cancer survival and drug resistance10,11.