Supplementary Material and Methods: The discovery potential of RNA processing profiles
Amadís Pagès (1,2), Ivan Dotu (3), Roderic Guigó (1,2) and Eduardo Eyras (2,4,*)

(1) Centre for Genomic Regulation (CRG), E08003 Barcelona, Spain.
(2) Universitat Pompeu Fabra (UPF), E08003 Barcelona, Spain.
(3) Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), E08003 Barcelona, Spain.
(4) Catalan Institution for Research and Advanced Studies, E08010 Barcelona, Spain.

Implementation

SeRPeNT is implemented in plain C and runs as a standalone tool with no external dependencies, which makes it fast and easy to run. In practice it also tends to be faster than other tools such as BlockClust, even though the SeRPeNT algorithm runs in O(n^2) time whereas the BlockClust processing algorithm runs in linear time. Moreover, to perform an analysis similar to SeRPeNT, the tool blockbuster (Langenberger et al. 2009) must be run prior to BlockClust to generate the profiles, which takes much longer than SeRPeNT.

Data filtering

Typically, sequenced small RNAs are shorter than the read length, so the read contains part of the 3' adapter sequence, sometimes including untemplated nucleotides at its 3'-end. If mismatches are allowed during the mapping step, some of those untemplated nucleotides may be included in the mapped read, resulting in a number of nucleotides at the 3'-end of the profile that generally have an extremely low height compared to the rest of the profile. A profile processing step, consisting of trimming the 3'-end nucleotides that have an abnormally low height, is performed prior to generating the final set of profiles.

Accuracy evaluation

Accuracy was evaluated using a cross-validation approach: at each iteration we removed the labels from some of the known sRNAs in each cluster and applied the SeRPeNT annotation method based on the population of sRNA profiles in the same cluster (Supplementary Figure 1).
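The label-removal step described above can be sketched as follows. This is only an illustration under stated assumptions, not the SeRPeNT implementation: the fraction of labels hidden per cluster (`hide_fraction`) is a hypothetical parameter that the text does not specify, and a simple majority vote over the remaining labeled profiles in the cluster stands in for the actual SeRPeNT annotation method. Profiles whose cluster retains no labeled member are left unannotated (`None`).

```python
import math
import random
from collections import Counter


def mask_and_annotate(clusters, labels, hide_fraction=0.5, seed=0):
    """Hide some known labels in each cluster and re-annotate the hidden profiles.

    clusters: dict mapping cluster id -> list of profile ids
    labels:   dict mapping profile id -> known family label (e.g. "miRNA")
    hide_fraction: hypothetical fraction of known labels hidden per cluster
    Returns a dict mapping each hidden profile id -> predicted label or None.
    """
    rng = random.Random(seed)
    predictions = {}
    for members in clusters.values():
        known = [p for p in members if p in labels]
        if not known:
            continue
        # Hide at least one known label per cluster (assumption).
        n_hide = math.ceil(len(known) * hide_fraction)
        hidden = set(rng.sample(known, n_hide))
        # Labels that remain visible in this cluster.
        visible = Counter(labels[p] for p in known if p not in hidden)
        for p in sorted(hidden):
            # Majority vote is a stand-in for the SeRPeNT annotation method.
            predictions[p] = visible.most_common(1)[0][0] if visible else None
    return predictions
```

Comparing the returned predictions with the hidden labels gives, for each family, the counts of the contingency matrix used to score one fold.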
The accuracy was measured for each sRNA family independently: for each sRNA family (e.g. miRNAs), the labeled cases from that family were considered as the positive class, and all sRNAs from other families as the negative class. For each of the 100 folds, a contingency matrix for the predicted profiles was built for each non-coding RNA family, and precision (positive predictive value or PPV), recall or true positive rate (TPR) and Matthews Correlation Coefficient (MCC) were calculated as follows:

\[ \mathrm{PPV} = \frac{TP}{TP + FP} \]

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively. MCC includes true and false positives and negatives, takes values from -1 to 1, and is considered a balanced measure that can be used even if the classes are of very different sizes. MCC values greater than 0.7 are considered very strong predictions, values ranging from 0.5 to 0.7 strong predictions, and values lower than 0.5 weak or close to random. Negative values are interpreted as anti-correlations. Finally, for each cell line and family of non-coding RNAs, the three metrics were averaged over the 100 folds.

ENCODE short RNA-Seq data download

Total short RNA sequencing (sRNA-Seq) alignment files in BAM format were downloaded from the ENCODE webserver (ENCODE Project Consortium 2012) for cell lines A549, GM12878, HeLa-S3, HepG2, HUVEC, IMR90, K562, MCF-7, NHEK and SK-N-SH. Additionally, sRNA-Seq alignment files from the chromatin, nucleolus, nucleoplasm and cytoplasm compartments of the cell line K562 were also downloaded. Characteristics of each of the downloaded RNA-Seq alignment files are provided in Supplementary Table 2. Detailed information about the cell culture, sequencing and mapping protocols is available on the ENCODE webserver (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq).

Annotation data

GENCODE release 19 (Harrow et al.
2012) main and tRNA annotation files in GTF format were downloaded from the GENCODE project website (http://www.gencodegenes.org/releases/19.html). These files include small non-coding RNAs predicted using sequences from RFAM (Nawrocki et al. 2014), miRBase (Kozomara et al. 2014) and tRNA-scan (Lowe et al. 1997). We first filtered out those features that were not located in autosomal or sex chromosomes. We only kept those features of type "transcript" (3rd column in the GTF files) that had only one of the following biotypes (attribute "transcript_type" in the 9th column of the GTF files): "snRNA", "snoRNA", "miRNA" or "tRNAscan", and discarded 6 features that were annotated as both tRNA and miRNA, finally obtaining a total of 7,177 annotated small non-coding RNAs. For the benchmarking with analogous existing methods we also discarded all the snRNAs and all the snoRNAs whose name did not start with the term SNORD, obtaining a total of 4,124 annotated small non-coding RNAs.

Consistency of sRNA profiles across multiple experiments

Given an sRNA profile P expressed in N cell lines and labeled with M different labels L = {l_1, …, l_M} in the N cell lines, we defined the score H(P) for the sRNA profile P as the normalized entropy of its labels:

\[ H(P) = \frac{-\sum_{i=1}^{M} \frac{|l_i|}{N} \log_2 \frac{|l_i|}{N}}{\log_2(M)} \]

where |l_i| is the number of cell lines where the label l_i has been used, and 0 ≤ H(P) ≤ 1. We used a consistency score cutoff of 0.2 to keep sRNA profiles for further analyses.

Consistency of sRNA profiles within clusters

Let K be a cluster containing N sRNA profiles, and let C = {c_1, …, c_M} be the set of different RNA classes assigned to the profiles in cluster K. The consistency of the sRNA classes within the cluster is measured as the normalized cluster entropy H(K), which is defined as:

\[ H(K) = \frac{-\sum_{i=1}^{M} \frac{|c_i|}{N} \log_2 \frac{|c_i|}{N}}{\log_2(M)} \]

where |c_i| is the number of profiles in K assigned to class c_i, and 0 ≤ H(K) ≤ 1. H(K) measures the purity of the cluster in terms of the known and predicted RNA classes.
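Both consistency scores share the same form, so a single helper can compute either one from a list of labels (one label per cell line for H(P), one class per profile for H(K)). The sketch below assumes that the normalizing term is log2 of the number of distinct labels M, and that the score is taken as 0 when only one label is present, a case the definition leaves undefined; the function name is illustrative.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Normalized entropy of a list of labels, in the range [0, 1].

    labels: one entry per cell line (for H(P)) or per profile (for H(K)).
    """
    counts = Counter(labels)
    n = len(labels)       # N: number of cell lines or profiles
    m = len(counts)       # M: number of distinct labels
    if m <= 1:
        # Assumption: a single label (or no label) means perfect consistency.
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Assumption: normalize by log2(M), the maximum entropy for M labels.
    return entropy / math.log2(m)
```

For instance, a profile labeled "miRNA" in every cell line scores 0, whereas one labeled "miRNA" in half of the cell lines and "tRNA" in the other half scores 1.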
The lower the value of H(K), the higher the purity of the cluster.

Creation of the GTF file from the extended annotation

For each novel sRNA from the extended annotation generated with SeRPeNT over the ENCODE cell lines, we added annotations in the GTF file at the gene, transcript and exon levels, following the standards set by the Gencode Consortium. We used the label "SeRPeNT" in the source field (2nd column), and we only included the tags gene_id, gene_type, gene_name, transcript_id, transcript_type and transcript_name in the attributes field (9th column). The file is available as a supplementary file with the name hg19.serpent.annotation.gtf.

Differential expression analysis

We used DESeq2 (Love et al. 2014) to assess the differential expression of the sRNAs in the extended annotation between 2 control samples and 2 individual knockdown samples for DICER1, DROSHA and XPO5. We considered as differentially expressed those sRNAs with a p-value adjusted for multiple testing lower than 0.05.

Supplementary Tables

Supplementary Table 1. Classification performance comparison. Comparison of the classification performance of BlockClust (Videm et al. 2014), DARIO (Fasold et al. 2011) and SeRPeNT on the GSM769510 (Friedländer et al. 2012) dataset for the prediction of known miRNAs, tRNAs and C/D-box snoRNAs from the Gencode v19 annotation using 100-fold cross-validation. Performance is given in terms of positive predictive value (PPV), also called precision, and true positive rate (TPR), also called recall. Classification performance values for BlockClust and DARIO were taken from Table 5 in (Videm et al. 2014).

              miRNA           tRNA            snoRNA C/D-box
              PPV     TPR     PPV     TPR     PPV     TPR
BlockClust    0.88    0.89    0.95    0.80    0.74    0.39
DARIO         0.85    0.81    0.92    0.88    0.46    0.52
SeRPeNT       0.99    0.99    0.95    0.96    0.75    0.70

Supplementary Table 2. ENCODE Dataset.
(a) Characteristics of the ENCODE benchmarking dataset, composed of nuclear short RNA-Seq alignment files from ten different cell lines, and (b) characteristics of the ENCODE compartments dataset, composed of short RNA-Seq alignment files from four different compartments of K562 cells.

a)
Cell line   Tissue         Karyotype   Replicate   # mapped reads   Read length
A549        Epithelium     Cancer      1           175,033,257      101
                                       2           141,307,675      101
GM12878     Blood          Normal      1            34,303,754       36
                                       2            53,087,115       36
HeLa-S3     Cervix         Cancer      1            29,674,616       36
                                       2            29,943,178       36
HepG2       Liver          Cancer      1            38,337,423       36
                                       2            32,112,643       36
HUVEC       Blood Vessel   Normal      1            43,069,106       36
                                       2            29,365,379       36
IMR90       Lung           Cancer      1           203,347,637      101
                                       2           207,729,440      101
K562        Blood          Cancer      1            37,009,953       36
                                       2            31,479,779       36
MCF-7       Breast         Cancer      1           168,359,421      101
                                       2           158,785,959      101
NHEK        Skin           Normal      1            29,326,978       36
                                       2            35,008,048       36
SK-N-SH     Brain          Cancer      1           213,226,585      101
                                       2            91,852,402      101

b)
Compartment   Replicate   # mapped reads   Read length
Chromatin     1           40,696,120       36
              2           42,195,335       36
Cytosol       1           42,230,824       36
              2           43,451,312       36
Nucleolus     1           37,368,047       36
              2           36,429,188       36
Nucleoplasm   1           45,815,945       36
              2           44,151,588       36

Supplementary Table 3. Accuracy analysis for the identification of known sRNA classes. Precision (PPV), recall or true positive rate (TPR) and Matthews Correlation Coefficient (MCC), averaged across 100 folds, are reported for each cell line and each biotype of non-coding RNAs in the Gencode v19 annotation. The non-predicted (NP) column is the average across 100 folds of the percentage of profiles that cannot be annotated.

            snoRNA               miRNA                tRNA                 snRNA                NP
            PPV    TPR    MCC    PPV    TPR    MCC    PPV    TPR    MCC    PPV    TPR    MCC    Ratio
A549        0.77   0.85   0.72   0.92   0.92   0.9    0.89   0.86   0.8    0.64   0.52   0.52   30.9 %
GM12878     0.81   0.8    0.74   0.93   0.94   0.92   0.9    0.91   0.8    0.64   0.53   0.54   30.7 %
HeLa-S3     0.77   0.77   0.68   0.94   0.9    0.9    0.83   0.86   0.73   0.65   0.53   0.55   34.2 %
HepG2       0.86   0.88   0.8    0.98   0.98   0.98   0.91   0.9    0.81   0.56   0.5    0.49   38.8 %
IMR90       0.71   0.82   0.66   0.95   0.91   0.91   0.88   0.82   0.76   0.54   0.45   0.43   37.1 %
K562        0.66   0.62   0.56   0.87   0.9    0.84   0.9    0.88   0.76   0.62   0.63   0.58   40.6 %
MCF-7       0.68   0.78   0.62   0.9    0.9    0.87   0.83   0.79   0.7    0.54   0.42   0.4    32.3 %
NHEK        0.79   0.84   0.74   0.97   0.97   0.97   0.91   0.94   0.84   0.2    0.11   0.12   25.3 %
SK-N-SH     0.84   0.83   0.77   0.93   0.93   0.91   0.87   0.88   0.79   0.55   0.5    0.47   35.9 %

Supplementary Table 4.