Neumann et al. BMC Bioinformatics (2019) 20:258 https://doi.org/10.1186/s12859-019-2849-7

RESEARCH ARTICLE  Open Access

Quantification of experimentally induced nucleotide conversions in high-throughput sequencing datasets

Tobias Neumann1*, Veronika A. Herzog2, Matthias Muhar1, Arndt von Haeseler3,4, Johannes Zuber1,5, Stefan L. Ameres2 and Philipp Rescheneder3*

Abstract

Background: Methods to read out naturally occurring or experimentally introduced nucleic acid modifications are emerging as powerful tools to study dynamic cellular processes. The recovery, quantification and interpretation of such events in high-throughput sequencing datasets demands specialized bioinformatics approaches.

Results: Here, we present Digital Unmasking of Nucleotide conversions in K-mers (DUNK), a data analysis pipeline enabling the quantification of nucleotide conversions in high-throughput sequencing datasets. We demonstrate, using experimentally generated and simulated datasets, that DUNK allows constant mapping rates irrespective of nucleotide-conversion rates, promotes the recovery of multimapping reads and employs Single Nucleotide Polymorphism (SNP) masking to uncouple true SNPs from nucleotide conversions, facilitating a robust and sensitive quantification of nucleotide conversions. As a first application, we implement this strategy as SLAM-DUNK for the analysis of SLAMseq profiles, in which 4-thiouridine-labeled transcripts are detected based on T > C conversions. SLAM-DUNK provides both raw counts of nucleotide-conversion-containing reads as well as a base-content- and read-coverage-normalized approach for estimating the fraction of labeled transcripts as readout.

Conclusion: Beyond providing a readily accessible tool for analyzing SLAMseq and related time-resolved RNA sequencing methods (TimeLapse-seq, TUC-seq), DUNK establishes a broadly applicable strategy for quantifying nucleotide conversions.

Keywords: Mapping, Epitranscriptomics, Next generation sequencing, High-throughput sequencing

Background

Mismatches in reads yielded from standard sequencing protocols such as genome sequencing and RNA-Seq originate either from genetic variations or sequencing errors and are typically ignored by standard mapping approaches. Beyond these standard applications, a growing number of profiling techniques harnesses nucleotide conversions to monitor naturally occurring or experimentally introduced DNA or RNA modifications. For example, bisulfite-sequencing (BS-Seq) identifies non-methylated cytosines from cytosine-to-thymine (C > T) conversions [1]. Similarly, photoactivatable ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) enables the identification of protein-RNA interactions by qualitative assessment of thymine-to-cytosine (T > C) conversions [2]. Most recently, emerging sequencing technologies further expanded the potential readout of nucleotide conversions in high-throughput sequencing datasets by employing chemoselective modifications to modified nucleotides in RNA species, resulting in specific nucleotide conversions upon reverse transcription and sequencing [3]. Among these, thiol (SH)-linked alkylation for the metabolic sequencing of RNA (SLAMseq) is a novel sequencing protocol enabling quantitative measurements of RNA kinetics within living cells, which can be applied to

* Correspondence: [email protected]; [email protected]
1 Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC), 1030 Vienna, Austria
3 Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

determine RNA stabilities [4] and transcription-factor-dependent transcriptional outputs [5] in vitro, or, when combined with the cell-type-specific expression of uracil phosphoribosyltransferase, to assess cell-type-specific transcriptomes in vivo (SLAM-ITseq) [6]. SLAMseq employs metabolic RNA labeling with 4-thiouridine (4SU), which is readily incorporated into newly synthesized transcripts. After RNA isolation, chemical nucleotide-analog derivatization specifically modifies thiol-containing residues, which leads to specific misincorporation of guanine (G) instead of adenine (A) when the reverse transcriptase encounters an alkylated 4SU residue during RNA-to-cDNA conversion. The resulting T > C conversion can be read out by high-throughput sequencing.

Identifying nucleotide conversions in high-throughput sequencing data comes with two major challenges: First, depending on nucleotide-conversion rates, reads will contain a high proportion of mismatches with respect to a reference genome, causing common aligners to misalign them to an incorrect genomic position or to fail aligning them at all [7]. Secondly, Single Nucleotide Polymorphisms (SNPs) in the genome will lead to an overestimation of nucleotide conversions if not appropriately separated from experimentally introduced genuine nucleotide conversions. Moreover, depending on the nucleotide-conversion efficiency and the number of available conversion sites, high sequencing depth is required to reliably detect nucleotide conversions at lower frequencies. Therefore, selective amplification of transcript regions, such as 3′ end mRNA sequencing (QuantSeq [8]), reduces library complexity, ensuring high local coverage and allowing increased multiplexing of samples. In addition, QuantSeq specifically recovers only mature (polyadenylated) mRNAs and allows the identification of transcript 3′ ends. However, the 3′ terminal regions of transcripts sequenced by QuantSeq (typically 250 bp; hereafter called 3′ intervals) largely overlap with 3′ untranslated regions (UTRs), which are generally of less sequence complexity than coding sequences [9], resulting in an increased number of multi-mapping reads, i.e. reads mapping equally well to several genomic regions. Finally, besides the exact position of nucleotide conversions in the reads, SLAMseq down-stream analysis requires quantifications of overall conversion rates robust against variation in coverage and base composition in genomic intervals, e.g. 3′ intervals.

Here we introduce Digital Unmasking of Nucleotide-conversions in k-mers (DUNK), a data analysis method for the robust and reproducible recovery of nucleotide conversions in high-throughput sequencing datasets. DUNK solves the main challenges generated by nucleotide conversions in high-throughput sequencing experiments: It facilitates the accurate alignment of reads with many mismatches and the unbiased estimation of nucleotide-conversion rates, taking into account SNPs that may feign nucleotide conversions. As an application of DUNK, we introduce SLAM-DUNK, a SLAMseq-specific pipeline that takes additional complications of the SLAMseq approach into account. SLAM-DUNK addresses the increased number of multi-mapping reads in low-complexity regions frequently occurring in 3′ end sequencing data sets and provides a robust and unbiased quantification of nucleotide conversions in genomic intervals such as 3′ intervals. SLAM-DUNK enables researchers to analyze SLAMseq data from raw reads to fully normalized nucleotide-conversion quantifications without expert bioinformatics knowledge. Moreover, SLAM-DUNK provides a comprehensive analysis of the input data, including visualization, summary statistics and other relevant information on the data processing. To allow scientists to assess feasibility and accuracy of nucleotide-conversion-based measurements for genes and/or organisms of interest in silico, SLAM-DUNK comes with a SLAMseq simulation module enabling optimization of experimental parameters such as sequencing depth and sample numbers. We supply this fully encapsulated and easy-to-install software package via BioConda, the Python Package Index, Docker Hub and GitHub (see http://t-neumann.github.io/slamdunk) as well as a MultiQC (http://multiqc.info) plugin to make SLAMseq data analysis and integration available to bench scientists.

Results

Digital unmasking of nucleotide-conversions in k-mers

DUNK addresses the challenges of distinguishing nucleotide conversions from sequencing error and genuine SNPs in high-throughput sequencing datasets by executing four main steps (Fig. 1): First, a nucleotide-conversion-aware read mapping algorithm facilitates the alignment of reads (k-mers) with elevated numbers of mismatches (Fig. 1a). Second, to provide robust nucleotide-conversion readouts in repetitive or low-complexity regions such as 3′ UTRs, DUNK optionally employs a recovery strategy for multi-mapping reads. Instead of discarding all multi-mapping reads, DUNK only discards reads that map equally well to two different 3′ intervals. Reads with multiple alignments to the same 3′ interval, or to a single 3′ interval and a region of the genome that is not part of a 3′ interval, are kept (Fig. 1b). Third, DUNK identifies Single-Nucleotide Polymorphisms (SNPs) to mask false-positive nucleotide conversions at SNP positions (Fig. 1c). Finally, the high-quality nucleotide-conversion signal is deconvoluted from sequencing error and used to compute conversion frequencies for all 3′ intervals, taking into account read coverage and base content of the interval (Fig. 1d).

In the following, we demonstrate the performance and validity of each analysis step by applying DUNK to several published and simulated datasets.
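As a toy illustration of this workflow, the sketch below runs steps 2 to 4 on already-mapped reads; the data structures and names are hypothetical, not the actual SLAM-DUNK implementation, and step 1 (conversion-aware mapping) is assumed to have produced the alignments.

```python
# Toy sketch of DUNK's steps 2-4; data structures and names are
# illustrative only, not the SLAM-DUNK API.

def dunk_quantify(alignments, snp_positions):
    """alignments: one dict per read, listing the 3' interval hit by each
    equally good alignment (None = outside any 3' interval) and the genomic
    positions of the read's candidate T>C conversions."""
    counts = {}
    for aln in alignments:
        # Step 2: multimapper recovery - keep the read only if all of its
        # interval hits agree on a single 3' interval.
        hits = {iv for iv in aln["intervals"] if iv is not None}
        if len(hits) != 1:
            continue
        interval = hits.pop()
        # Step 3: SNP masking - drop conversions at called SNP positions.
        conversions = [p for p in aln["conversions"] if p not in snp_positions]
        # Step 4: count conversion-containing reads per 3' interval.
        tc, total = counts.get(interval, (0, 0))
        counts[interval] = (tc + (1 if conversions else 0), total + 1)
    return counts

# Step 1 (conversion-aware mapping) is assumed to have produced these
# alignments, including reads carrying several T>C mismatches:
alignments = [
    {"intervals": ["geneA"], "conversions": [105]},
    {"intervals": ["geneA", None], "conversions": []},       # rescued multimapper
    {"intervals": ["geneA", "geneB"], "conversions": [203]}, # ambiguous: dropped
    {"intervals": ["geneB"], "conversions": [310]},          # masked as SNP below
]
print(dunk_quantify(alignments, snp_positions={310}))
# {'geneA': (1, 2), 'geneB': (0, 1)}
```

The ambiguous read hitting two different 3′ intervals is discarded, while the read with one interval hit plus an off-interval alignment is rescued, mirroring the rule described above.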

Fig. 1 Digital Unmasking of Nucleotide-conversions in k-mers. Legend: possible base outcomes for a given nucleotide conversion: match with reference (white), nucleotide conversion scored as mismatch (red), nucleotide conversion scored with nucleotide-conversion-aware scoring (blue), low-quality nucleotide conversion (black) and filtered nucleotide conversion (opaque). a Naïve nucleotide-conversion processing and quantification vs DUNK: the naïve read mapper (left) maps 11 reads (grey) to the reference genome and discards five reads (light grey) that comprise many converted nucleotides (red). The DUNK mapper (right) maps all 16 reads. b DUNK processes multi-mapping reads (R5, R6, R7, left) such that the ones (R3, R6) that can be unambiguously assigned to a 3′ interval are identified and assigned to that region; R5 and R7 cannot be assigned to a 3′ interval and are removed from downstream analyses. R2 is discarded due to general low alignment quality. c False-positive nucleotide conversions originating from Single-Nucleotide Polymorphisms are masked. d High-quality nucleotide conversions are quantified, normalizing for coverage and base content

Nucleotide-conversion aware mapping improves nucleotide-conversion quantification

Correct alignment of reads to a reference genome is a central task of most high-throughput sequencing analyses. To identify the optimal alignment between a read and the reference genome, mapping algorithms employ a scoring function that includes penalties for mismatches and gaps. The penalties are aimed to reflect the probability to observe a mismatch or a gap. In standard high-throughput sequencing experiments, one assumes one mismatch penalty independent of the type of nucleotide mismatch (standard scoring). In contrast, SLAMseq or similar protocols produce datasets where a specific nucleotide conversion occurs more frequently than all others. To account for this, DUNK uses a conversion-aware scoring scheme (see Table 1). For example, SLAM-DUNK does not penalize a T (reference) > C (read) mismatch.

Table 1 Columns represent the reference nucleotide, rows the read nucleotide. If a C occurs in the read and a T in the reference, the score is equal to zero. The other possible mismatches receive a score of −15. A match receives a score of 10

              Reference genome
              A      T      G      C
Read    A    10    −15    −15    −15
        T   −15     10    −15    −15
        G   −15    −15     10    −15
        C   −15      0    −15     10

We used simulated SLAMseq data with conversion rates of 0% (no conversions), 2.4 and 7% (conversion rates observed in mouse embryonic stem cell (mESC) SLAMseq data [4] and HeLa SLAMseq data (unpublished) upon saturated 4SU-labeling conditions), and an excessive conversion rate of 15% (see Table 2) to evaluate the scoring scheme displayed in Table 1. For each simulated dataset, we compared the inferred nucleotide-conversion sites using either the standard scoring or the conversion-aware scoring scheme to the simulated "true" conversions and calculated the median of the relative errors [%] from the simulated truth (see Methods). For a conversion rate of 0%, both scoring schemes showed a median error of < 0.1% (Fig. 2a, Additional file 1: Figure S1). Of note, the mean error of the standard scoring scheme is lower than for the conversion-aware scoring scheme (0.288 vs 0.297 nucleotide conversions), thus favoring standard scoring for datasets without experimentally introduced nucleotide conversions. For a conversion rate of 2.4%, the standard and the conversion-aware scoring scheme showed an error of 4.5 and 2.3%, respectively. Increasing the conversion rate to 7% further increased the error of the standard scoring to 5%. In contrast, the error of the SLAM-DUNK scoring function stayed at 2.3%. Thus, conversion-aware scoring reduced the median conversion quantification error by 49–54% when compared to the standard scoring scheme.

DUNK correctly maps reads independently of their nucleotide-conversion rate

Mismatches due to SNPs or sequencing errors are one of the central challenges of read mapping tools. Typical RNA-Seq datasets show a SNP rate between 0.1 and 1.0% and a sequencing error of up to 1%. Protocols employing chemically induced nucleotide conversions produce datasets with a broad range of mismatch frequencies. While nucleotide-conversion-free (unlabeled) reads show the same number of mismatches as RNA-Seq reads, nucleotide-conversion-containing (labeled) reads contain additional mismatches, depending on the nucleotide-conversion rate of the experiment and the number of nucleotides that can be converted in a read.

To assess the effect of nucleotide-conversion rate on read mapping, we randomly selected 1000 genomic 3′ intervals of expressed transcripts extracted from a published mESC 3′ end annotation and simulated two datasets of labeled reads with nucleotide-conversion rates of 2.4 and 7% (see Table 2). Next, SLAM-DUNK mapped the simulated data to the mouse genome and we computed the number of reads mapped to the correct 3′ interval per dataset. Figure 2b shows that for a read length of 50 bp and a nucleotide-conversion rate of 2.4%, the mapping rate (91%) is not significantly different when compared to a dataset of unlabeled reads. Increasing the nucleotide-conversion rate to 7% caused a moderate drop of the correctly mapped reads to 88%. This drop can be rectified by increasing the read length to 100 or 150 bp, where the mapping rates are at least 96% for nucleotide-conversion rates as large as 15% (Fig. 2b). While we observe a substantial drop in the percentage of correctly mapped reads for higher conversion rates (> 15%) for shorter reads (50 bp), SLAM-DUNK's mapping rate for longer reads (100 and 150 bp) remained above 88% for datasets with up to 15 and 30% conversion rates, respectively, demonstrating that SLAM-DUNK maps reads with and without nucleotide conversions equally well even for high conversion frequencies.

To confirm this finding in real data, we used SLAM-DUNK to map 21 published (7 time points with three replicates each) SLAMseq datasets [4] from a pulse-chase time course in mESCs (see Table 3) with estimated conversion rates of 2.4%. Due to the biological nature of the experiment, we expect that the SLAMseq data from the first time point (onset of 4SU wash-out/chase) contain the highest number of labeled reads, while the data from the last time point have virtually no labeled reads. Figure 2c shows the expected positive correlation (Spearman's rho: 0.565, p-value: 0.004) between the fraction of mapped reads and the time points if a conversion-unaware mapper is used (NextGenMap with default

Table 2 Simulated datasets and their corresponding analyses in this study

3′ intervals | Conversion rate [%] | Read length [bp] | Coverage | Labeled transcripts | Analysis | Figure
1 k mESC expressed (3′ intervals of 1000 randomly selected transcripts expressed in mESC) | 0, 2.4, 7, 15, 30, 60 | 50, 100, 150 | 100x | 100% | Nucleotide-conversion aware read mapping | 2a, c, S1
1 k mESC expressed | 0, 2.4, 7 | 50, 100, 150 | 100x | 100% | Multimapper recovery strategy evaluation | 3b
22,281 (mESC) | 0, 2.4, 7 | 100 | 60x | 100% | SNP masking evaluation | 4b
1 k mESC expressed | 2.4, 7 | 50, 100, 150 | 100x | 50% | Evaluation of T > C read sensitivity / specificity | 5a, S4
18 example genes (mESC) | 2.4, 7 | 50, 100, 150 | 200x | 50% | Comparison of labeled fraction of transcript estimation methods | 5c
1 k mESC expressed | 2.4, 7 | 50, 100, 150 | 25x–200x in 25x intervals | 0–100% | Evaluation of labeled fraction of transcript estimation | 5d, S6

values). Next, we repeated the analysis using SLAM-DUNK. Despite the varying number of labeled reads in these datasets, we observed a constant fraction of 60–70% mapped reads across all samples (Fig. 2c) and did not observe a significant correlation between the time point and the number of mapped reads (Spearman's rho: 0.105, p-value: 0.625). Thus, DUNK maps reads independent of the nucleotide-conversion rate also in experimentally generated data.

Multi-mapper recovery increases the number of genes accessible for 3′ end sequencing analysis

Genomic low-complexity regions and repeats pose major challenges for read aligners and are one of the main sources of error in sequencing data analysis. Therefore, multi-mapping reads are often discarded to reduce misleading signals originating from mismapped reads: As most transcripts are long enough to span sufficiently long unique regions of the genome, the overall effect of discarding all multi-mapping reads on expression analysis is tolerable (mean mouse (GRCm38) RefSeq transcript length: 4195 bp). By only sequencing the ~ 250 nucleotides at the 3′ end of a transcript, 3′ end sequencing increases throughput and avoids normalizations accounting for varying gene length. As a consequence, 3′ end sequencing typically only covers 3′ UTR regions, which are generally of less complexity than the coding sequence of transcripts [9] (Additional file 1: Figure S2a). Therefore, 3′ end sequencing produces a high percentage (up to 25% in 50 bp mESC samples) of multi-mapping reads. Excluding these reads can result in a massive loss of signal. The core pluripotency factor Oct4 is an example [10]: Although Oct4 is highly expressed in mESCs, it showed almost no mapped reads in the 3′ end sequencing mESC samples when discarding multi-mapping reads (Additional file 1: Figure S3a). The high fraction of multi-mapping reads is due to a sub-sequence of length 340 bp occurring in the Oct4 3′ UTR and an intronic region of Rfwd2.

To assess the influence of low complexity of 3′ UTRs on the read count in 3′ end sequencing, we computed the mappability scores [11] for each 3′ UTR. A high mappability score (ranging from 0.0 to 1.0) of a k-mer in a 3′ UTR indicates uniqueness of that k-mer. Next, we computed for each 3′ UTR the %-uniqueness, that is, the percentage of its sequence with a mappability score of 1. The 3′ UTRs were subsequently categorized in 5% bins according to their %-uniqueness. For each bin we then compared read counts of corresponding 3′ intervals (3 x 4SU 0 h samples, see Table 3) with the read counts of their corresponding transcripts from an RNA-Seq dataset [4]. Figure 3a shows the increase in correlation as the %-uniqueness increases. If multi-mappers are included, the correlation is stronger compared to counting only unique-mappers. Thus, the recovery strategy for multi-mappers as described above efficiently and correctly recovers reads in low-complexity regions such as 3′ UTRs. Notably, the overall correlation was consistently above 0.7 for all 3′ intervals with more than 10% of unique sequence.

To further evaluate the performance of the multi-mapper recovery approach, we resorted to simulated SLAMseq datasets: We quantified the percentages of reads mapped to their correct 3′ interval (as known from the simulation) and the number of reads mapped to a wrong 3′ interval, again using nucleotide-conversion rates of 0.0, 2.4 and 7.0% and read lengths of 50, 100 and 150 bp (see Table 2): The multi-mapper recovery approach increases the number of correctly mapped reads between 1 and 7%, with only a minor increase of < 0.03% incorrectly mapped reads (Fig. 3b).

Next, we analysed experimentally generated 3′ end sequencing data (see Table 3) in the nucleotide-conversion-free mESC sample. For each 3′ interval, we compared read counts with and without multi-mapper recovery (Fig. 3c). When including multimappers, 82% of the 19,592 3′ intervals changed the number of mapped reads


Fig. 2 Nucleotide-conversion aware read mapping: a Evaluation of nucleotide-conversion-aware scoring vs naïve scoring during read mapping: median error [%] of true vs recovered nucleotide conversions for simulated data with 100 bp read length and increasing nucleotide-conversion rates at 100x coverage. b Number of reads correctly assigned to their 3′ interval of origin for typically encountered nucleotide-conversion rates of 0.0, 2.4 and 7.0% as well as excessive conversion rates of 15, 30 and 60%. c Percentages of retained reads and linear regression with 95% CI bands after mapping 21 mouse ES cell pulse-chase time course samples with increasing nucleotide-conversion content for standard mapping and DUNK
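The conversion-aware scoring scheme of Table 1, evaluated in Fig. 2a, can be sketched as a small scoring function. This is a simplified, ungapped illustration, not the mapper's full gapped aligner.

```python
# Conversion-aware alignment scores from Table 1: a match scores 10, a
# mismatch -15, except reference T read as C (a potential conversion),
# which scores 0 and is therefore not penalized.
MATCH, MISMATCH, CONVERSION = 10, -15, 0

def score(ref_base, read_base):
    if ref_base == read_base:
        return MATCH
    if ref_base == "T" and read_base == "C":
        return CONVERSION
    return MISMATCH

def alignment_score(reference, read):
    # Ungapped, column-wise score of an aligned read; the real mapper
    # additionally handles gaps, which are omitted here for brevity.
    return sum(score(r, q) for r, q in zip(reference, read))

print(alignment_score("GATTACA", "GATTACA"))  # 70: all matches
print(alignment_score("GATTACA", "GACTACA"))  # 60: one T>C conversion
print(alignment_score("GATTACA", "GAGTACA"))  # 45: one generic mismatch
```

A T > C conversion thus lowers the score far less than a generic mismatch, which is what keeps heavily labeled reads mappable.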

Table 3 Real SLAMseq datasets and their corresponding analyses in this study

Samples | Description | Analysis | Figure
GSM2666819–GSM2666839 | Chase-timecourse samples at 0, 0.5, 1, 3, 6, 12 and 24 h with 3 replicates at each time point | Percentages of retained reads after mapping with standard and nucleotide-conversion aware scoring with DUNK | 2b
GSM2666816 | Single no-4SU 0 h replicate | Multimapper recovery count scatter vs unique mappers | 3c, S2b,c, S3
GSM2666816–GSM2666818 | 3 no-4SU 0 h replicates | Multimapper recovery correlation with RNAseq | 3a
GSM2666816–GSM2666821 | 0 h no-4SU samples and 0 h chase samples with 3 replicates each | Evaluation of SNP calling and masking | 4a, c-d
GSM2666816–GSM2666821, GSM2666828–GSM2666837 | 0 h no-4SU, 0, 3, 6, 12 and 24 h chase samples (3 replicates each) | Evaluation of QC diagnostics | 6
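The %-uniqueness binning behind the mappability analysis (Fig. 3a) can be sketched as follows; the per-base mappability scores below are toy values, and the 5% bin width follows the text.

```python
# Sketch of the %-uniqueness computation: the fraction of positions in a
# 3' UTR whose k-mer mappability score equals 1.0, assigned to 5% bins.

def percent_unique(mappability_scores):
    # %-uniqueness: percentage of positions whose k-mer is genome-unique.
    unique = sum(1 for m in mappability_scores if m == 1.0)
    return 100.0 * unique / len(mappability_scores)

def uniqueness_bin(pct, width=5):
    # Bin index 0 covers [0, 5), ..., index 19 covers [95, 100].
    return min(int(pct // width), 100 // width - 1)

utr_scores = [1.0, 1.0, 0.5, 1.0, 0.25, 1.0, 1.0, 1.0]  # toy 3' UTR
pct = percent_unique(utr_scores)
print(pct, uniqueness_bin(pct))  # 75.0 15
```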

Fig. 3 Multimapper recovery strategy in low complexity regions: a Correlation of mESC −4SU SLAMseq vs mESC RNA-seq samples (3 replicates each) for unique-mapping reads vs multi-mapping recovery strategy. Spearman mean correlation of all vs all samples is shown for genes with RNAseq tpm > 0 on the y-axis for increasing cutoffs for the percentage of unique bp in the corresponding 3′ UTR. Error bars are indicated in black. b Percentages of reads mapped to the correct (left panel) or wrong (right panel) 3′ interval for nucleotide-conversion rates of 0, 2.4 and 7% and 50, 100 and 150 bp read length, respectively, when recovering multimappers or using uniquely mapping reads only. c Scatterplot of unique vs multi-mapping read counts (log2) of ~ 20,000 3′ intervals colored by relative error cutoff of 5% for genes with > 0 unique and multi-mapping read counts

by less than 5%. However, for many of the 18% remaining 3′ intervals the number of mapped reads was highly increased with the multi-mapper-assignment strategy. We found that these intervals show a significantly lower associated 3′ UTR mappability score, confirming that our multi-mapper assignment strategy specifically targets intervals with low mappability (Additional file 1: Figure S2b, c). Figure 3c also shows the significant increase of the Oct4 read counts when multi-mappers are included (3 x no-4SU samples, mean unique-mapper CPM 2.9 vs mean multimapper CPM 1841.1, mean RNA-seq TPM 1673.1,

Additional file 1: Figure S3b) and scores in the top 0.2% of the read count distribution. Simulation confirmed that these are indeed reads originating from the Oct4 locus: without multi-mapper assignment only 3% of simulated reads were correctly mapped to Oct4, while all reads were correctly mapped when applying multi-mapper recovery.

Masking single nucleotide polymorphisms improves nucleotide-conversion quantification

Genuine SNPs influence nucleotide-conversion quantification, as reads covering a T > C SNP are misinterpreted as nucleotide-conversion-containing reads. Therefore, DUNK performs SNP calling on the mapped reads to identify genuine SNPs and mask their respective positions in the genome. DUNK considers every position in the genome a genuine SNP position if the fraction of reads carrying an alternative base among all reads exceeds a certain threshold (hereafter called variant fraction).

To identify an optimal threshold, we benchmarked variant fractions ranging from 0 to 1 in increments of 0.1 in three nucleotide-conversion-free mESC QuantSeq datasets (see Table 3). As a ground truth for the benchmark we used a genuine SNP dataset that was generated by genome sequencing of the same cell line. We found that for variant fractions between 0 and 0.8, DUNK's SNP calling identifies between 93 and 97% of the SNPs that are present in the truth set (sensitivity) (Fig. 4a, −4SU). Note that the mESCs used in this study were derived from haploid mESCs [12]. Therefore, SNPs are expected to be fully penetrant across the reads at the respective genomic position. For variant fractions higher than 0.8, sensitivity quickly drops below 85% consistently for all samples. In contrast, the number of identified SNPs that are not present in the truth set (false positive rate) rapidly decreases for increasing variant fractions and starts to level out around 0.8 for most samples. To assess the influence of nucleotide conversions on SNP calling, we repeated the experiment with three mESC samples containing high numbers of nucleotide conversions (24 h of 4SU treatment). While we did not observe a striking difference in sensitivity between unlabeled and highly labeled replicates, the false-positive rates were larger for low variant fractions, suggesting that nucleotide conversions might be misinterpreted as SNPs when using a low variant-fraction threshold. Judging from the ROC curves, we found a variant fraction of 0.8 to be a good tradeoff between sensitivity and false positive rate, with an average of 94.2% sensitivity and a mean false-positive rate of 16.8%.

To demonstrate the impact of masking SNPs before quantifying nucleotide conversions, we simulated SLAMseq data (Table 2): For each 3′ interval, we computed the difference between the number of simulated and detected nucleotide conversions and normalized it by the number of simulated conversions (relative errors), once with and once without SNP masking (Fig. 4b). The relative error when applying SNP masking was significantly reduced compared to datasets without SNP masking: With a 2.4% conversion rate, the median relative error dropped from 53 to 0.07%, and for a conversion rate of 7% from 17 to 0.002%.

To investigate the effect of SNP masking in real data, we correlated the number of identified nucleotide conversions and the number of genuine T > C SNPs in 3′ intervals. To this end, we ranked all 3′ intervals from the three labeled mESC samples (24 h 4SU labeling) by their number of T > C containing reads and inspected the distribution of 3′ intervals that contain a genuine T > C SNP within that ranking (Fig. 4c and d, one replicate shown). In all three replicates, we observed a strong enrichment (p-values < 0.01, 0.02 and 0.06) of SNPs in 3′ intervals with higher numbers of T > C reads (Fig. 4c, one replicate shown). Since T > C SNPs are not assumed to be associated with T > C conversions, we expect them to be evenly distributed across all 3′ intervals if properly separated from nucleotide conversions. Indeed, applying SNP masking rendered the enrichment of SNPs in 3′ intervals with higher numbers of T > C containing reads not significant (p-values 0.56, 0.6 and 0.92) in all replicates (Fig. 4d, one replicate shown).

SLAM-DUNK: quantifying nucleotide conversions in SLAMseq datasets

The main readout of a SLAMseq experiment is the number of 4SU-labeled transcripts, hereafter called labeled transcripts, for a given gene in a given sample. However, labeled transcripts cannot be observed directly, but only by counting the number of reads showing converted nucleotides. To this end, SLAM-DUNK provides exact quantifications of T > C read counts for all 3′ intervals in a sample. To validate SLAM-DUNK's ability to detect T > C reads, we applied SLAM-DUNK to simulated mESC datasets (for details see Table 2) and quantified the percentage of correctly identified T > C reads, i.e. the fraction stemming from a labeled transcript (sensitivity). Moreover, we computed the percentage of reads stemming from unlabeled transcripts (specificity). For a perfect simulation, where all reads that originated from labeled transcripts contained a T > C conversion, SLAM-DUNK showed a sensitivity > 95% and a specificity of > 99%, independent of read length and conversion rate (Additional file 1: Figure S4). However, in real datasets not all reads that stem from a labeled transcript contain T > C conversions. To showcase the effect of read length and conversion rate on the ability of SLAMseq to detect the presence of labeled transcripts, we performed a more realistic simulation where the number of T > C conversions per read follows a binomial distribution (allowing for 0 T > C conversions per read).
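The per-read conversion model just described (a binomial number of T > C conversions over a read's thymines, allowing zero conversions) can be sketched as below; the thymine count per read is an assumed toy parameter.

```python
import random

def tc_conversions_per_read(n_thymines, conversion_rate, rng):
    # Binomial draw: each thymine in a labeled read converts independently,
    # so a labeled read may carry zero T>C conversions.
    return sum(1 for _ in range(n_thymines) if rng.random() < conversion_rate)

rng = random.Random(42)
# Assumed toy parameters: ~12 thymines per 50 bp read, 2.4% conversion rate.
reads = [tc_conversions_per_read(12, 0.024, rng) for _ in range(10000)]
detectable = sum(1 for c in reads if c > 0) / len(reads)
print(detectable)  # close to the analytic value 1 - (1 - 0.024)**12 ≈ 0.25
```

With these toy numbers, roughly three quarters of labeled reads carry no conversion at all, which is why sensitivity drops so sharply for short reads and low conversion rates.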


Fig. 4 Single-Nucleotide Polymorphism masking: a ROC curves for three unlabeled mESC replicates (−4SU) vs three labeled replicates (+4SU) across variant fractions from 0 to 1 in steps of 0.1. b Log10 relative errors of simulated T > C vs recovered T > C conversions for naïve (red) and SNP-masked (blue) datasets for nucleotide-conversion rates of 2.4 and 7%. c Barcode plot of 3′ intervals ranked by their T > C read count including SNP-induced T > C conversions. Black bars indicate 3′ intervals containing genuine SNPs. d Barcode plot of 3′ intervals ranked by their T > C read count ignoring SNP-masked T > C conversions
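The variant-fraction rule for calling and masking genuine SNPs can be sketched as follows; counts and positions are illustrative, and the 0.8 default reflects the ROC tradeoff reported above.

```python
def is_genuine_snp(alt_reads, total_reads, variant_fraction=0.8):
    # A genomic position is called a genuine SNP when the fraction of reads
    # carrying the alternative base exceeds the variant-fraction threshold;
    # 0.8 was the sensitivity/false-positive tradeoff chosen from Fig. 4a.
    return total_reads > 0 and alt_reads / total_reads > variant_fraction

def mask_conversions(conversion_positions, snp_positions):
    # T>C conversions coinciding with called SNP positions are excluded
    # from the nucleotide-conversion counts.
    return [p for p in conversion_positions if p not in snp_positions]

print(is_genuine_snp(9, 10))   # True: 90% alternative base
print(is_genuine_snp(3, 10))   # False: likely a conversion or error
print(mask_conversions([101, 257, 499], {257}))  # [101, 499]
```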

As expected, specificity was unaffected by this change (Fig. 5a). However, sensitivity changed drastically depending on the read length and T > C conversion rate. While we observed a sensitivity of 94% for 150 bp reads and a conversion rate of 7%, with a read length of 50 bp and a 2.4% conversion rate it drops to 23%. Based on these findings we next computed the probability of detecting at least one T > C read for a 3′ interval, given the fraction of labeled and unlabeled transcripts for that gene (labeled transcript fraction), for different sequencing depths, read lengths and conversion rates (see Methods) (Fig. 5b, Additional file 1: Figure S5). Counterintuitively, shorter read lengths are superior to longer read lengths for detecting at least one read originating from a labeled transcript, especially for low fractions of labeled transcripts. While 26 X coverage is required for 150 bp reads to detect a read from a labeled transcript present at a fraction of 0.1 and a conversion rate of 2.4%, only 22 X coverage is required for 50 bp reads (Additional file 1: Table S1). This suggests that the higher number of short reads contributes more to the probability of detecting reads from a labeled transcript than the higher probability of observing a T > C conversion in longer reads. Increasing the conversion rate to 7% reduces the required coverage by ~ 50% across fractions of labeled transcripts, again with 50 bp read lengths profiting most from the increase. In general, for higher labeled transcript fractions such as 1.0, the detection probability converges for all read lengths to a coverage of 2–3 X and 1 X for conversion rates of 2.4 and 7%, respectively (Additional file 1: Figure S5). Although these results are a best-case approximation, they can serve as a guideline on how much coverage
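A best-case sketch of this detection-probability calculation, assuming independent reads and a fixed thymine count per read (both assumptions of this illustration; the exact method is described in the paper's Methods):

```python
def p_read_shows_conversion(n_thymines, conversion_rate):
    # Probability that a single labeled read carries >= 1 T>C conversion,
    # with conversions independent per thymine (binomial model).
    return 1.0 - (1.0 - conversion_rate) ** n_thymines

def p_detect_labeled(coverage, labeled_fraction, n_thymines, conversion_rate):
    # Best-case probability of seeing >= 1 T>C read among `coverage` reads:
    # each read is labeled with probability `labeled_fraction` and, if
    # labeled, detectable with the per-read probability above.
    q = labeled_fraction * p_read_shows_conversion(n_thymines, conversion_rate)
    return 1.0 - (1.0 - q) ** coverage

# Illustrative parameters (the thymine count per read is an assumption):
print(round(p_detect_labeled(22, 0.1, 12, 0.024), 3))
```

Varying coverage, labeled fraction and conversion rate in this function reproduces the qualitative behavior described above: detection probability rises quickly with coverage and saturates near 1 for fully labeled genes.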


is required when designing a SLAMseq experiment that relies on T > C read counts to detect labeled transcripts.

Fig. 5 Quantification of nucleotide-conversions: a Sensitivity and specificity of SLAM-DUNK on simulated labeled reads vs recovered T > C containing reads for read lengths of 50, 100 and 150 bp and nucleotide-conversion rates of 2.4 and 7%. b Heatmap of the probability of detecting at least one read originating from a labeled transcript for a given fraction of labeled transcripts and coverage for a conversion rate of 2.4% and a read length of 50 bp. The white color code marks the 0.95 probability boundary. c Distribution of relative errors of read-based and SLAM-DUNK's T-content normalized fraction of labeled transcript estimates for 18 genes with various T-content for 1000 simulated replicates each. d Distribution of relative errors of SLAM-DUNK's T-content normalized fraction of labeled transcript estimates for 1000 genes with T > C conversion rates of 2.4 and 7% and sequencing depths from 25 to 200X

While estimating the number of labeled transcripts from T > C read counts is sufficient for experiments comparing the same genes in different conditions and performing differential gene expression-like analyses, it does not account for different abundances of total transcripts when comparing different genes. To address this problem, the number of labeled transcripts for a specific gene must be normalized by the total number of transcripts present for that gene. We will call this the fraction of labeled transcripts. A straightforward approach to estimate the fraction of labeled transcripts is to compare the number of labeled reads to the total number of sequenced reads for a given gene (see Methods). However, this approach does not account for the number of uridines in the 3′ interval. Reads originating from a U-rich transcript or a T-rich part of the corresponding genomic 3′ interval have a higher probability of showing a T > C conversion. Therefore, T > C read counts are influenced by the base composition of the transcript and the coverage pattern. Thus, the fraction of

labeled transcripts will be overestimated for T-rich and underestimated for T-poor 3′ intervals. To normalize for the base composition, SLAM-DUNK implements a T-content and read coverage normalized approach for estimating the fractions of labeled transcripts (see Methods). To evaluate both approaches we picked 18 example genes with varying T-content in their 3′ intervals, 3′ interval length and mappability (see Additional file 1: Table S2 for the full list), simulated 1000 SLAMseq datasets (see Table 2) for each gene and compared the recovered fraction of labeled transcripts with the simulated truth (Fig. 5c). On average the read-count based method showed a mean relative error of 15%. In contrast, SLAM-DUNK's T-content normalized approach showed a mean relative error of only ~2%. Inspection of the 18 genes revealed high variability in the estimates of the read-count based method. While both methods perform equally well for Tep1, the median error of the other 17 genes varies between 6 and 39% for the read-based method and only between 1 and 4% for SLAM-DUNK. We observed a strong correlation of relative error and T-content using the read-count based method (Pearson's r: 0.41) and only a very weak association when using SLAM-DUNK's T-content normalized approach (Pearson's r: −0.04). Expanding the analysis from 18 to 1000 genes confirmed the result. For the T > C read-based approach, 23% of the 3′ intervals showed a relative error larger than 20%. For SLAM-DUNK's T-content normalized approach it was only 8%.

Important factors for how confidently we can assess the fraction of labeled transcripts of a given gene are the T > C conversion rate, read length and sequencing depth. To assess how much SLAMseq read coverage is required for a given read length, we computed the relative error in fraction of labeled transcripts using SLAM-DUNK's T-content normalized approach for datasets with conversion rates of 2.4 and 7%, read lengths of 50, 100 and 150 bp and sequencing depths of 25 to 200X (Fig. 5d). First, we looked at datasets with a T > C conversion rate of 2.4%. With a read length of 50 bp, SLAM-DUNK underestimated the fractions of labeled transcripts by about 10%. This is caused by multi-mapping reads that cannot be assigned to a single 3′ interval. Increasing the read length to 100 or 150 bp allows SLAM-DUNK to assign more reads uniquely to the genome. Therefore, the median relative error is reduced to 3% for these datasets. Sequencing depth showed no influence on the median relative error. However, it influences the variance of the estimates. With a read length of 100 bp and a coverage of 50X, 18% of the 3′ intervals show a relative error of > 20%. Increasing the coverage to 100X or 150X reduces this number to 6 and 0.8%, respectively. Increasing the T > C conversion rate to 7% improved overall fraction of labeled transcripts estimations noticeably. For 100 bp reads and a coverage of 50X, 100X and 200X the percentage of 3′ intervals with a relative error > 20% is reduced to 3, 0.2 and 0%, respectively. Independent of read length, coverage and T > C conversion rate, the T > C read-based fraction of labeled transcripts estimates performed worse than the SLAM-DUNK estimates (see Additional file 1: Figure S6).

Both fraction of labeled transcripts estimates as well as raw T > C read counts are affected by sequencing error, especially when the T > C conversion rate is low. To mitigate the impact of sequencing error on the respective quantification measures, SLAM-DUNK optionally applies a base-quality filter on conversion calls. As shown in Fig. 6c, this strategy substantially reduces the signal from erroneous sequencing cycles. In addition, SLAM-DUNK allows the quantification of fraction of labeled transcripts estimates as well as raw T > C read counts to be restricted to reads that carry more than one nucleotide conversion. Muhar et al. [5] showed that using this strategy, the contribution of background signal from reads with one T > C conversion was almost completely eradicated when using reads with two T > C conversions. Alternatively, the background signal of a no-4SU control could be subtracted to address sequencing error, as performed by Herzog et al. [4].

Quality control and interpretation of SLAMseq datasets
To facilitate SLAMseq sample interpretation, we implemented several QC modules into SLAM-DUNK on a per-sample basis. To address the need for interpretation of samples in an experimental context, we provide MultiQC support [13] for SLAM-DUNK. SLAM-DUNK's MultiQC module allows inspection of conversion rates, identification of systematic biases and summary statistics across samples.

To demonstrate SLAM-DUNK's QA capabilities, we applied it to 6 representative mESC timecourse datasets with varying expected nucleotide-conversion content (see Table 3). First, we compared the overall nucleotide-conversion rates of all timepoints and observed the expected decrease of T > C nucleotide-conversions in later timepoints (Fig. 6a, one replicate shown). Next, we performed a PCA based on T > C conversion containing reads using all three replicates. We found that replicates cluster together as expected. Furthermore, the 24 h chase and no-4SU samples formed one larger cluster. This can be explained since at 24 h of chase, samples are expected to be T > C conversion free (Fig. 6b).

By inspecting mismatch rates along read positions for two representative samples, we could identify read cycles with increased error rates (Fig. 6c). To reduce read-cycle dependent nucleotide mismatch noise, we implemented a base-quality cutoff for T > C conversion calling in SLAM-DUNK. Applying the base-quality cutoffs


significantly increased overall data quality, mitigating or even eradicating error-prone read positions. Finally, we visualized average T > C conversion rates across the last 250 nucleotides of each transcript to inspect positional T > C conversion biases across the 3′ intervals. We found no conversion bias across the static 250 bp windows except for a dip in T > C conversions ~20 nucleotides upstream of the 3′ end, which is most likely caused by lower genomic T-content, a characteristic feature of mRNA 3′ end sequences (see Additional file 1: Figure S7).

Fig. 6 Integrated quality controls: a Nucleotide-conversion rates of read sets from 6 representative mESC time courses showing a decrease of T > C conversions proportional to their respective chase time. b T > C conversion containing read based PCA of 6 mESC timepoints (3 replicates each). c Distributions of non-T > C mismatches across read positions display spikes in error rates (highlighted in yellow) for a low T > C conversion content (no 4SU) and a high T > C conversion content (12 h chase) sample, which are dampened or eradicated when applying base-quality filtering. d Nucleotide-conversion distribution along 3′ end positions in a static 250 bp window at 3′ UTR ends for the mESC time course, showing characteristic curve shifts according to their presumed T > C conversion content (timepoint) and a strong base conversion bias towards the 3′ end (highlighted in yellow) induced by the generally reduced T-content in the last bases of 3′ UTRs

Discussion and conclusions
We present Digital Unmasking of Nucleotide-conversions in k-mers (DUNK) for mapping and quantifying nucleotide conversions in reads stemming from nucleotide-conversion based sequencing protocols. As a showcase application of DUNK, we applied it to T > C nucleotide-conversion containing datasets as produced by the novel SLAMseq protocol. Using real and simulated datasets, we establish DUNK as a method that allows nucleotide-conversion rate independent read mapping for nucleotide-conversion rates of up to 15% when analyzing 100 bp reads. Since the most informative proportion of the overall signal stems from nucleotide-conversion containing reads, the correct mapping of such reads is crucial and resulted in a reduction of the nucleotide-conversion quantification error by ~50%. DUNK tackles the problem of low-complexity and repetitive sequence content, which is severely aggravated in 3′ end sequencing, by employing a multi-mapper recovery strategy: we demonstrate that DUNK specifically recovers

read mapping signal in 3′ intervals of low mappability that would otherwise be inaccessible to 3′ end sequencing approaches, enhancing the correlation with complementary RNA-seq data. Globally, we recover an additional 1–7% of correctly mapped reads at a negligible cost of wrongly mapped reads. We used genome-sequencing datasets to establish optimized variant calling settings, a crucial step of DUNK to separate true- from false-positive nucleotide-conversions stemming from SNPs. Applying these established settings, we demonstrated the advantage of SNP masking over naïve nucleotide-conversion quantification: it uncouples SNPs from nucleotide-conversion content and results in a more accurate nucleotide-conversion quantification, reducing the median quantification error for SNP-harboring 3′ intervals from 53 to 0.07% and from 17 to 0.002% for conversion rates of 2.4 and 7%, respectively.

We provide the SLAM-DUNK package, an application of DUNK to SLAMseq datasets: SLAM-DUNK provides absolute read counts of T > C conversion containing reads which can directly be used for comparing the same transcripts in different conditions and to perform differential gene expression-like analyses with a sensitivity of 95%. Since absolute read counts between genes are dependent on T-content and sequencing depth, SLAM-DUNK implements a T-content and read coverage normalized approach for estimating the fractions of labeled transcripts. This quantification routine has clear advantages over read-count based fraction of labeled transcripts estimates, reducing the proportion of genes with relative errors > 20% from 23% with the read-based approach to only 8% with SLAM-DUNK's T-content normalized approach. In addition to absolute T > C read counts and fractions of labeled transcript estimates, SLAM-DUNK's modular design also allows plugging in statistical frameworks such as GRAND-SLAM [14], which utilizes a binomial mixture model for estimating proportions of new and old RNA to directly estimate RNA half-lives.

While SLAM-DUNK is a showcase application of DUNK to unpaired, stranded and unspliced (QuantSeq) data, NextGenMap - our mapper of choice - is a general-purpose alignment tool that also facilitates paired-end, unstranded datasets. Therefore, NextGenMap can be readily parametrized to process datasets produced by novel applications such as NASC-seq (https://www.biorxiv.org/content/early/2018/12/17/498667) and scSLAM-seq (https://www.biorxiv.org/content/10.1101/486852v1) with the downstream SLAM-DUNK pipeline. Due to SLAM-DUNK's modular design, one can also entirely swap the alignment tool for other standard RNA-seq aligners as long as they output the BAM tags we introduced for speedy conversion detection (see Additional file 1: Supplementary information).

SLAM-DUNK's simulation framework allows assessing error rates for given parameters such as read length, conversion rate and coverage in silico prior to setting up experiments in vitro. This allows bench scientists to inspect simulation results to check whether they are able to reliably interpret nucleotide-conversion readouts for given genes using a certain experimental setup and annotation. Ensuring scalability as well as feasible resource consumption is vital for processing large multi-sample experiments such as multi-replicate time courses. SLAM-DUNK achieves this with its modular design and efficient implementation, enabling a 21-sample time course experiment to run in under 8 hours with 10 CPU threads on a desktop machine with a peak memory consumption of 10 GB of main memory (see Additional file 1: Supplementary information).

We demonstrated that SLAM-DUNK visualizations and sample aggregation via MultiQC are valuable tools to unravel biases in and characteristics of SLAMseq datasets, thus facilitating rapid and easy quality checks of samples and providing measures to correct for systematic biases in the data. Deployment of SLAM-DUNK on multiple software platforms and through Docker images ensures low-effort installation on heterogeneous computing environments. Verbose and comprehensive output of SLAM-DUNK makes results reproducible, transparent and immediately available to bench scientists and downstream analysis tools.

Methods

Mapping reads with T > C conversions
NextGenMap [7] maps adapter- and poly(A)-trimmed SLAMseq reads to the user-specified reference genome. Briefly, NextGenMap searches for seed words - 13-mers that match between a given read and the reference sequence - using an index data structure. All regions of the reference sequence that exceed a certain seed-word count threshold are candidate mapping regions (CMRs) for the read. Subsequently, NextGenMap identifies the CMR with the highest pairwise Smith-Waterman alignment score as the best mapping position of the read in the genome. If a read has more than one CMR, NextGenMap reports up to 100 locations in the genome. Reads with more than 100 mapping locations are discarded.

We extended NextGenMap's seed-word identification step to allow for a single T > C mismatch in a seed word. Finally, we changed the scoring function of the pairwise sequence alignment algorithms to assign neither a mismatch penalty nor a match score to T > C mismatches. Furthermore, we extended NextGenMap to output additional SAM tags containing all necessary information (e.g. a list of all mismatches in a read) required by subsequent steps of SLAM-DUNK (see Additional file 1: Supplementary information for details).

Filtering reads and multi-mapper assignment
Only read alignments with a sequence identity of at least 95% and a minimum of 50% of the read bases aligned

were kept for the subsequent analysis. Since 3′ end sequencing should generate fragments at the 3′ ends of mRNAs, we discard all read mappings located outside of a user-defined set of 3′ UTR intervals. The remaining multi-mapper reads are processed as follows: for a read that maps to two or more locations of the same 3′ UTR, one location is randomly picked and the others are removed. All reads that map to two or more distinct 3′ UTRs are entirely discarded (see Additional file 1: Figure S8 for details).

SNP masking
SLAM-DUNK uses VarScan 2.4.1 [15] to call SNPs in the set of filtered reads, requiring a minimum coverage of 10X and a minimum alternative allele frequency of 0.8 for all published and sequenced haploid samples. Thus, if VarScan identified a C as a new SNP at a T nucleotide in the genome, this position will not be counted as a T > C conversion in downstream analysis.

For genome-sequencing data, SNPs were called using VarScan 2.4.1 default parameters, only outputting homozygous variant positions. Only SNPs that exceed the minimum coverage of the respective VarScan 2.4.1 runs in the benchmarked 3′ intervals are considered for sensitivity and false-positive rate calculations.

We used an adaptation of the barcodeplot function of the limma package to visualize the distribution of SNPs along 3′ intervals ordered by their number of T > C reads. To make sure the SNP calls are not coverage-biased, we only use the upper quartile of 3′ intervals in terms of read coverage, excluding 3′ intervals not meeting the coverage cutoffs of the variant calling process from this analysis. We produce one plot using unmasked T > C containing reads and a separate plot using SNP-masked T > C containing reads. In addition, we apply the Mann-Whitney U test on both sets to give a measure of how biased the SNP distribution is in unmasked vs SNP-masked 3′ intervals. Ideally, a strong association of T > C containing reads with SNPs in the unmasked data and no association of T > C containing reads with SNPs in the masked data is expected, showing that the SNP calling actually uncoupled T > C conversions and SNPs. These plots allow to visually assess the quality and performance of SNP calling on the data without the presence of actual controls.

Estimating the fraction of labeled transcripts
Let $p_{SU}$ be the unknown fraction of labeled transcripts for a 3′ interval. With $0 \le p_e \le 1$ we denote the efficiency that a transcript contains a 4SU residue, that the 4SU residue is alkylated and that the alkylated 4SU base-pairs with G instead of U during reverse transcription, such that it can be identified as a T > C conversion in high-throughput sequencing. Note, we assume $p_e$ is constant for a given experiment.

T > C read counts based estimator
The probability that a read from a labeled transcript does not show a T > C conversion equals

$$p_0 = (1 - p_e)^t$$

where $t$ is the number of thymidines in the read matching the genomic sequence. Accordingly, the probability to find a read with at least one T > C conversion from a labeled transcript equals

$$p_{T \to C} = 1 - p_0$$

In a sample of $n$ reads given an unknown frequency $p_{SU}$ of labeled transcripts we expect to find

$$E(SU) = p_{SU} \cdot p_{T \to C} \cdot n$$

reads with at least one conversion site. Note, we assume that $t$ is the same for all reads. Based on the observed number of converted reads $R_{T \to C}$ and the number of reads $n$, we get

$$\widehat{p_{SU} \cdot p_{T \to C}} = \frac{R_{T \to C}}{n}$$

In a time course experiment, we can use a time point with a labeling time long enough that all transcripts are labeled ($p_{SU} = 1$) for all 3′ intervals to retrieve $p_{T \to C}$. Since we assume that $p_{T \to C}$ is constant for a given experiment we obtain

$$\widehat{p_{SU}} = \frac{1}{p_{T \to C}} \cdot \frac{R_{T \to C}}{n}$$

for all other time points.

T-content and coverage normalized estimator
Since assuming $t$ to be the same for all reads is an oversimplification, we want to estimate $p_{SU}$ without using T > C read counts by looking at T-positions individually. For a specific T-position $i$ in the 3′ interval, let $X_i$ denote the number of reads that show a conversion and let $c_i$ define the number of reads that cover position $i$. Then

$$\Pr(X_i = k_i \mid p_{SU}, p_e) = \binom{c_i}{k_i} (p_{SU} \cdot p_e)^{k_i} (1 - p_{SU} \cdot p_e)^{c_i - k_i}$$

If $p_e$ is unknown, we can compute the maximum likelihood estimate of the confounded probability $p_{SU} \cdot p_e$ as

$$\widehat{p_{SU} \cdot p_e} = \frac{k_i}{c_i}$$

If the interval contains $n$ Ts, then the maximum likelihood estimate of $p_{SU} \cdot p_e$ equals

$$\widehat{p_{SU} \cdot p_e} = \frac{\sum_{i=1}^{n} k_i}{\sum_{i=1}^{n} c_i}$$

By retrieving $p_e$ from an experiment with $p_{SU} = 1$ we obtain

$$\widehat{p_{SU}} = \frac{1}{p_e} \cdot \frac{\sum_{i=1}^{n} k_i}{\sum_{i=1}^{n} c_i}$$

Computing the probability of detecting at least one T > C read for a 3′ interval
Based on the notation developed above, we can compute the expected number of reads showing a T > C conversion given $p_{SU}$ and $p_{T \to C}$ for a 3′ interval with $n$ sequenced reads as

$$E(SU) = p_{SU} \cdot p_{T \to C} \cdot n$$

When taking into account the empirically determined sensitivity $S$ of SLAM-DUNK, i.e. the probability of detecting a read with a T > C conversion as a labeled read, we compute the probability of detecting at least one labeled read for a 3′ interval given $p_{SU}$ and $p_{T \to C}$ as:

$$p_{DETECT} = 1 - B(0; E(SU), S)$$

Simulating SLAMseq datasets
All datasets were simulated using SLAM-DUNK's simulation module. The input parameters are a reference genome sequence and a BED file containing 3′ interval annotations. First, SLAM-DUNK removes overlapping 3′ intervals in the BED file and assigns a labeled transcript fraction between 0 and 1 (uniformly distributed) to each 3′ interval. Alternatively, the user can either supply a fixed fraction of labeled transcripts that is the same for all genes or add gene-specific fractions to the initial BED file. Next, SLAM-DUNK extracts the 3′ interval sequences from the supplied FASTA file and randomly adds homozygous SNPs, including but not limited to T > C SNPs, based on a user-specified probability (default: 0.1%). For each of the modified 3′ intervals SLAM-DUNK simulates RNA reads using the RNASeqReadSimulator (https://github.com/davidliwei/RNASeqReadSimulator) package. To mimic QuantSeq datasets, only the last 250 bp of the 3′ interval are used for the simulation. Finally, SLAM-DUNK adds T > C conversions to the reads to simulate transcript labeling. The number of T > C conversions for each labeled read is computed using a binomial distribution $B(t, p_e)$ with $t$ the number of Ts in the read and $p_e$ the conversion probability. All simulated reads are stored in a BAM file. The name of each read contains the name of the 3′ interval the read was simulated from and the number of T > C conversions added. Furthermore, SLAM-DUNK provides a T > C count file containing the number of simulated reads and the simulated fraction of labeled transcripts for all 3′ intervals.

Relative error
Let $N_{TRUE}$ be the number of true events and $N_{DETECT}$ the number of detected events. Then the relative error $E_{rel}$ of a detected quantity compared to the known truth is calculated as follows:

$$E_{rel} = \frac{N_{TRUE} - N_{DETECT}}{N_{TRUE}}$$

Mappability assessment
We used the GEM library [11] to calculate 50mer and 100mer mappability tracks for the GRCm38 mouse genome (-e 0). We then used BEDTools [16] coverage to define mappable regions within the 3′ UTR set published by Herzog et al. [4] and to calculate mappability fractions for these 3′ UTRs. The same procedure was applied to RefSeq exons (obtained on May 2, 2016) mapped to Entrez genes to obtain exon mappability fractions (note: these also include 3′ UTRs). Entrez genes were mapped to 3′ UTRs and only 3′ UTRs with a mappability fraction below 90% were analyzed.

Datasets
For in silico validations with simulated datasets, we used the set of mESC 3′ intervals by Herzog et al. [4] as reference. The datasets simulated during this study are listed in Table 2.

For validation, we used real SLAMseq data generated by performing 4SU pulse-chase experiments in mESCs (GEO accession: GSE99970) [4]. The subsets used during this study are listed in Table 3.

For validation of the multimapper recovery strategy, we used RNA-seq data from the same mESC line (GEO accession: GSE99970, samples: GSM2666840-GSM2666842) [4].

Additional information
Supplementary information, figures and tables referenced in this study are provided as a .pdf file, DUNK_SI.

Genomic DNA sequencing of AN3–12 cells
AN3–12 mouse embryonic stem cells [12] were lysed in lysis buffer (10 mM Tris, pH 7.5; 10 mM EDTA, pH 8; 10 mM NaCl; 0.5% N-lauroylsarcosine) and incubated at 60 °C overnight. DNA was ethanol-precipitated using 0.2 M NaCl and resuspended in 1x TE. Isolated gDNA was phenol-chloroform purified followed by ethanol precipitation. 2 μg of purified gDNA was sheared in 130 μl of 1x TE in a microTUBE AFA Fiber Crimp-Cap (6 × 16 mm) using the E220 Focused-ultrasonicator

(Covaris®) with the following settings: 140 W peak incident power, 10% duty factor, 200 cycles per burst, 80 s treatment time. The sheared DNA was bead-purified using Agencourt® AMPure® XP beads (Beckman Coulter) to select fragments between 250 and 500 nt. DNA library preparation was performed using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina® (NEB) and the library was sequenced in paired-end 50 mode on a HiSeq 2500 instrument (Illumina).

Additional file
Additional file 1: Supplementary information, figures and tables. (PDF 6280 kb)

Abbreviations
4SU: 4-thiouridine; CPM: Counts-per-million; DUNK: Digital Unmasking of Nucleotide conversions in K-mers; mESC: mouse embryonic stem cell; PCA: Principal component analysis; SLAMseq: Thiol (SH)-linked alkylation for the metabolic sequencing of RNA; SNP: Single Nucleotide Polymorphism; T > C: Thymine-to-cytosine; TPM: Transcripts-per-million; UTR: Untranslated region

Acknowledgements
The authors thank Pooja Bhat for software testing and bug discovery, Brian Reichholf for helpful discussions on T > C conversion-aware read mapping, Phil Ewels for helping with the MultiQC plugin implementation, and Stefanie Detamble for graphics support.

Funding
This work was supported by the European Research Council to SLA (ERC-StG-338252, ERC-PoC-825710 SLAMseq) and JZ (ERC-StG-336860), and the Austrian Science Fund to SLA (Y-733-B22 START, W-1207-B09, and SFB F43-22) and AvH (W-1207-B09). MM is a recipient of a DOC Fellowship of the Austrian Academy of Sciences. The IMP is generously supported by Boehringer Ingelheim. The funding sources had no role in the design of this study, no role in the collection, execution, analysis and interpretation of the data, and no role in the writing of the manuscript.

Availability of data and materials
The datasets analysed during the current study are available in the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo) under accession GSE99978. The genome-sequencing dataset used and analysed during the current study is available in SRA under accession SRP154182. SLAM-DUNK is available from Bioconda, as a Python package from PyPI, as a Docker image from Docker Hub and also from source (http://t-neumann.github.io/slamdunk) under the GNU AGPL license.

Authors' contributions
TN and PR developed the software and conducted the computational experiments. VAH provided biological samples; VAH and MM provided essential input for labeled fraction estimations and discussions. AvH, JZ and SLA provided essential input for method development. TN, PR and AvH wrote the manuscript with input from all authors. All of the authors have read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
Stefan Ludwig Ameres, Veronika Anna Herzog, Johannes Zuber and Matthias Muhar are inventors on patent application EU17166629.0-1403 submitted by the IMBA that covers methods for the modification and identification of nucleic acids, which have been licensed to Lexogen GmbH.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Research Institute of Molecular Pathology (IMP), Campus-Vienna-Biocenter 1, Vienna BioCenter (VBC), 1030 Vienna, Austria. 2 Institute of Molecular Biotechnology of the Austrian Academy of Sciences (IMBA), Dr. Bohr-Gasse 3, VBC, 1030 Vienna, Austria. 3 Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, VBC, 1030 Vienna, Austria. 4 Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Waehringerstrasse 17, A-1090 Vienna, Austria. 5 Medical University of Vienna, VBC, 1030 Vienna, Austria.

Received: 4 October 2018 Accepted: 25 April 2019

References
1. Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A. 1992;89:1827–31.
2. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, et al. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell. 2010;141:129–41.
3. Li X, Xiong X, Yi C. Epitranscriptome sequencing technologies: decoding RNA modifications. Nat Methods. 2016;14:23–31.
4. Herzog VA, Reichholf B, Neumann T, Rescheneder P, Bhat P, Burkard TR, et al. Thiol-linked alkylation of RNA to assess expression dynamics. Nat Methods. 2017;14:1198–204.
5. Muhar M, Ebert A, Neumann T, Umkehrer C, Jude J, Wieshofer C, et al. SLAM-seq defines direct gene-regulatory functions of the BRD4-MYC axis. Science. 2018;360:800–5.
6. Matsushima W, Herzog VA, Neumann T, Gapp K, Zuber J, Ameres SL, et al. SLAM-ITseq: sequencing cell type-specific transcriptomes without cell sorting. Development. 2018;145:dev164640.
7. Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics. 2013;29:2790–1.
8. Moll P, Ante M, Seitz A, Reda T. QuantSeq 3′ mRNA sequencing for RNA quantification. Nat Methods. 2014;11:972.
9. Mignone F, Gissi C, Liuni S, Pesole G. Untranslated regions of mRNAs. Genome Biol. 2002;3:REVIEWS0004.
10. Young RA. Control of the embryonic stem cell state. Cell. 2011;144:940–54.
11. Derrien T, Estelle J, Marco-Sola S, Knowles DG, Raineri E, Guigó R, et al. Fast computation and applications of genome mappability. PLoS One. 2012;7:e30377.
12. Elling U, Wimmer RA, Leibbrandt A, Burkard T, Michlits G, Leopoldi A, et al. A reversible haploid mouse embryonic stem cell biobank resource for functional genomics. Nature. 2017;550:114–8.
13. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.
14. Jürges C, Dölken L, Erhard F. Dissecting newly transcribed and old RNA using GRAND-SLAM. Bioinformatics. 2018;34:i218–26.
15. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76.
16. Quinlan AR. BEDTools: the Swiss-Army tool for genome feature analysis. Curr Protoc Bioinform. 2014;47:11.12.1–34.