A Comprehensive Database of A-To-I RNA Editing Events in Humans Ernesto Picardi1,2,3,*, Anna Maria D’Erchia1,2, Claudio Lo Giudice1 and Graziano Pesole1,2,3,*
Nucleic Acids Research Advance Access published September 1, 2016
Nucleic Acids Research, 2016
doi: 10.1093/nar/gkw767

REDIportal: a comprehensive database of A-to-I RNA editing events in humans

Ernesto Picardi1,2,3,*, Anna Maria D’Erchia1,2, Claudio Lo Giudice1 and Graziano Pesole1,2,3,*

1Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari, Via Orabona 4, 70126 Bari, Italy, 2Institute of Biomembranes and Bioenergetics, National Research Council, Via Amendola 165/A, 70126 Bari, Italy and 3National Institute of Biostructures and Biosystems (INBB), 00136 Roma, Italy

Received August 01, 2016; Revised August 19, 2016; Accepted August 22, 2016

*To whom correspondence should be addressed. Tel: +39 0805443588; Fax: +39 0805443317; Email: [email protected]
Correspondence may also be addressed to Ernesto Picardi. Tel: +39 0805443308; Fax: +39 0805443317; Email: [email protected]

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

ABSTRACT

RNA editing by A-to-I deamination is the most prominent co-/post-transcriptional modification in humans. It is carried out by ADAR enzymes and contributes to both transcriptomic and proteomic expansion. RNA editing has pivotal cellular effects and its deregulation has been linked to a variety of human disorders including neurological and neurodegenerative diseases and cancer. Despite its biological relevance, many physiological and functional aspects of RNA editing are still elusive. Here, we present REDIportal, available online at http://srv00.recas.ba.infn.it/atlas/, the largest and most comprehensive collection of RNA editing in humans, including more than 4.5 million A-to-I events detected in 55 body sites from thousands of RNAseq experiments. REDIportal embeds the RADAR database and represents the first editing resource designed to answer functional questions, enabling the inspection and browsing of editing levels in a variety of human samples, tissues and body sites. In contrast with previous RNA editing databases, REDIportal comprises its own browser (JBrowse) that allows users to explore A-to-I changes in their genomic context, emphasizing repetitive elements in which RNA editing is prominent.

INTRODUCTION

A growing literature describes RNA editing as an essential co-/post-transcriptional process, whereby a genetic message is modified from the corresponding DNA template by means of substitutions, insertions and/or deletions (1). The deamination of adenosines (As) to inosines (Is) by the family of ADAR enzymes acting on double RNA strands is the most prominent RNA editing event occurring in humans (2). A-to-I changes are pivotal for cellular homeostasis, as attested by the association between RNA editing dysregulation and human disorders such as neurological/neurodegenerative diseases and cancer (3–5). RNA editing by A-to-I modification contributes to transcriptome and proteome expansion (6) and has several functional and regulatory implications, altering codon identity, creating or destroying splice sites and affecting base-pairing interactions in secondary and tertiary RNA structures (7,8).

The advent of high-throughput sequencing technologies has greatly improved the computational detection of RNA editing events at genomic scale (9), revealing its pervasive nature in the human transcriptome. Recently, we profiled RNA editing in six human tissues (brain, lung, muscle, heart, kidney and liver) from three individuals using high coverage directional RNAseq and whole genome sequencing (WGS) data (6). Through this large survey, we identified more than 3 million A-to-I events differently distributed across the six tissues, thus producing the first RNA editing atlas in humans (6). Despite these findings, many functional aspects of RNA editing are still unknown and further investigations are needed to elucidate the dynamic regulation of editing sites. To shed light on potential functional roles of RNA editing, we have developed an ad hoc bioinformatics resource named REDIportal, comprising the largest non-redundant collection of RNA editing events across 55 human body sites grouped in 30 tissues.

Currently, RNA editing events are annotated in three main databases: DARNED (http://darned.ucc.ie/) (10), RADAR (http://rnaedit.com/) (11) and REDIdb (http://srv00.recas.ba.infn.it/py_script/REDIdb/) (12,13). While the last is devoted to organellar RNA editing, DARNED provides information on A-to-I changes for human, mouse and fruit fly. However, it has not been updated since 2013 and does not provide editing level information. RADAR, instead, annotates A-to-I events in human, mouse and fly like DARNED and incorporates editing levels for 38% of stored positions, since it is based on a limited number of RNAseq samples, mainly from LCL cell lines that may not be optimal for RNA editing studies (6).

In contrast, REDIportal includes more than 4.5 million A-to-I changes obtained by merging RNA editing positions from our Inosinome ATLAS (6) and the RADAR database (11). This large and non-redundant collection of RNA editing sites has been employed to interrogate more than 2,500 GTEx RNAseq experiments from 55 body sites of 147 individuals for which WGS data are available. REDIportal allows the study of the dynamic regulation of RNA editing, contributing to elucidate its biological roles in physiological and pathological conditions. REDIportal also annotates a plethora of additional information and embeds a specific genome browser (JBrowse) to explore RNA editing events in their genomic context.

Our portal has been conceived to collect RNA editing events/levels from a huge amount of RNAseq data in order to create the largest repository and bioinformatics infrastructure for RNA editing.

Hereafter, we describe the main REDIportal features, including database architecture and content as well as source data for calling A-to-I events.

RNAseq DATA COLLECTION

RNA editing data stored in REDIportal derive from 2,660 RNAseq experiments in 150 human individuals. Of these, 2,642 RNAseq experiments originate from the Genotype-Tissue Expression (GTEx) project, the largest collection of high-throughput genomic data for studying gene expression in different normal tissues obtained from hundreds of individuals. The remaining 18 RNAseq experiments were produced in our lab and used to create the first RNA editing atlas in humans (6). Although the current GTEx release (v6) includes more than 8,500 RNAseq datasets, we selected only the 2,642 experiments for which matched RNAseq and WGS data were available, allowing more reliable RNA editing calls.

RNAseq data used in REDIportal encompass 55 human body sites from 30 different tissues, with an over-representation of brain, skin, blood and esophagus (Table 1). On average, there are 18 RNAseq datasets per individual and 50 million reads per experiment. The majority of RNAseq data derive from unstranded libraries of polyA enriched RNA (2×76 bp), while only 72 experiments are from libraries of total RNA preserving strand orientation (2×100 and 2×150 bp).

GTEx datasets were downloaded from the database of Genotypes and Phenotypes (dbGaP) with accession phs000424.v6.p1 in sra format and converted to standard fastq by means of the fastq-dump program that is part of the [...]

[...] combination of two computational strategies on strand oriented RNAseq reads (6). Initially, RNA editing events were detected using REDItools (15) and stringent filters, especially for positions falling in non-repetitive regions, in which RNA editing detection is challenging (6). Editing candidates not supported by homozygous genomic DNA, obtained by whole genome resequencing of the same individual, were excluded (6). Then, unaligned RNA reads were rescued through the pipeline described in Porath et al. (16) in order to detect RNA editing sites in hyper-edited reads.

Merging ATLAS and RADAR positions yielded a comprehensive and non-redundant RNA editing catalogue comprising 4,668,508 sites. This collection was used to interrogate aligned RNAseq reads from the GTEx project through REDItools, employing a large computational farm at the Italian National Institute for Nuclear Physics (INFN) that includes about 10,000 cores. An ad hoc script was finally applied to add genomic support from GTEx WGS data, in order to exclude SNPs resembling editing events at the transcript level (Figure 1). The number of detected RNA editing events per tissue group, as well as the number of tissue-exclusive A-to-I changes, are reported in Table 1 and graphically displayed in Figure 2A.

RNA editing sites were annotated using the ANNOVAR (17) tool and the following databases: (i) RepeatMasker for repetitive elements; (ii) dbSNP (version 142) for genomic single nucleotide polymorphisms; (iii) Gencode (v19), Refseq and UCSC for gene and transcript annotations; (iv) PhastCons for conservation scores across 46 species and (v) RADAR and DARNED for known A-to-I changes. All repositories except RADAR and DARNED were downloaded from the UCSC genome browser. RADAR and DARNED positions were obtained from the corresponding web sites.

DATABASE CONTENT AND ARCHITECTURE

REDIportal collects 4,668,508 A-to-I editing sites in two main MySQL tables. The first table stores basic information including genomic positions, strand, genes and transcripts, SNP accessions and RepeatMasker elements. This table also comprises a binary string for a fast search of the tissues and body sites in which each RNA editing position has been observed. The second table, instead, includes RNA editing levels per tissue and body site as well as RNAseq and WGS support.
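
The idea of a per-site binary string for tissue lookup can be illustrated with a minimal Python sketch. The text does not specify the actual encoding used in the MySQL table, so everything here is an assumption made for illustration: the six-tissue list, the character order, and the helper names encode_tissues, decode_tissues and seen_in_all are all hypothetical.

```python
# Hypothetical, shortened tissue list used only for illustration;
# the order fixes which character of the string refers to which tissue.
TISSUES = ["Brain", "Lung", "Muscle", "Heart", "Kidney", "Liver"]


def encode_tissues(observed):
    """Build a binary string with one character per tissue ('1' = editing observed)."""
    observed = set(observed)
    unknown = observed - set(TISSUES)
    if unknown:
        raise ValueError(f"unknown tissues: {sorted(unknown)}")
    return "".join("1" if tissue in observed else "0" for tissue in TISSUES)


def decode_tissues(flags):
    """Return the tissues whose character is '1', in TISSUES order."""
    if len(flags) != len(TISSUES):
        raise ValueError("flag string length does not match the tissue list")
    return [tissue for tissue, bit in zip(TISSUES, flags) if bit == "1"]


def seen_in_all(flags, tissues):
    """Check with one bitwise AND whether a site was observed in every requested tissue."""
    mask = int(encode_tissues(tissues), 2)
    return int(flags, 2) & mask == mask


# Example: a site observed in brain and liver only.
site_flags = encode_tissues({"Brain", "Liver"})
print(site_flags)                                   # 100001
print(decode_tissues(site_flags))                   # ['Brain', 'Liver']
print(seen_in_all(site_flags, ["Brain", "Liver"]))  # True
print(seen_in_all(site_flags, ["Brain", "Lung"]))   # False
```

Under these assumptions, a query such as "sites edited in both brain and liver" reduces to a single mask comparison per row instead of a join against a per-tissue table, which is one plausible reason for storing such a string alongside the basic site information.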