Supplementary Material and Methods

The discovery potential of RNA processing profiles

Amadís Pagès1,2, Ivan Dotu3, Roderic Guigó1,2 and Eduardo Eyras2,4,*

1Centre for Genomic Regulation (CRG), E08003 Barcelona, Spain. 2Universitat Pompeu Fabra (UPF), E08003 Barcelona, Spain. 3Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), E08003 Barcelona, Spain. 4Catalan Institution for Research and Advanced Studies, E08010 Barcelona, Spain

Implementation

SeRPeNT is implemented in plain C and runs as a standalone tool with no external dependencies, which makes it fast and easy to run. In practice it also tends to be faster than other tools such as BlockClust, even though the SeRPeNT algorithm runs in O(n^2) time whereas the BlockClust processing algorithm runs in linear time. Moreover, to perform an analysis similar to that of SeRPeNT, the tool blockbuster (Langenberger et al. 2009) must be run prior to BlockClust to generate the profiles, which takes much longer than SeRPeNT.

Data filtering

Typically, sequenced small RNAs are shorter than the read length, so the read contains part of the 3’ adapter sequence and sometimes also untemplated nucleotides at the 3’ end of the small RNA. If mismatches are allowed during the mapping step, some of those untemplated nucleotides may be included in the mapped read. This results in a number of nucleotides at the 3’ end of the profile that generally have an extremely low height compared to the rest of the profile. A profile processing step, consisting of trimming the 3’-end nucleotides that have an abnormally low height, is therefore performed prior to generating the final set of profiles.
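The trimming step can be illustrated with a minimal Python sketch. The text does not state how an "abnormally low" height is defined, so the relative cutoff `min_fraction` (a fraction of the maximum height) and the function name `trim_three_prime` are hypothetical choices made only for illustration; the profile is assumed to be a list of per-nucleotide heights ordered from 5’ to 3’.

```python
def trim_three_prime(heights, min_fraction=0.05):
    """Drop trailing (3'-end) positions whose height is abnormally low.

    heights: per-nucleotide read counts, ordered 5' -> 3'.
    min_fraction: hypothetical cutoff, expressed as a fraction of the
    maximum height of the profile (not a value taken from the paper).
    """
    if not heights:
        return []
    cutoff = min_fraction * max(heights)
    end = len(heights)
    # Walk back from the 3' end while positions stay below the cutoff.
    while end > 0 and heights[end - 1] < cutoff:
        end -= 1
    return heights[:end]


profile = [10, 50, 100, 90, 80, 2, 1]
trimmed = trim_three_prime(profile)  # the two low 3'-end positions are removed
```

Only the 3’ end is trimmed in this sketch; low positions inside the profile or at the 5’ end are left untouched.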

Accuracy evaluation

The accuracy was evaluated using a cross-fold validation approach in which, at each iteration, we removed the labels from some of the known sRNAs in each cluster and applied the SeRPeNT annotation method based on the population of sRNA profiles in the same cluster (Supplementary Figure 1). The accuracy was measured for each sRNA family independently: for a given family (e.g. miRNAs), the labeled cases from that family were considered the positive class and all sRNAs from other families the negative class. For each of the 100 folds, a contingency matrix for the predicted profiles was built for each non-coding RNA family, and precision (positive predictive value, PPV), recall (true positive rate, TPR) and Matthews correlation coefficient (MCC) were calculated as follows:

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

MCC includes true and false positives and negatives, takes values from -1 to 1, and it is considered as a balanced measure that can be used even if the classes are of very different sizes. MCC values greater than 0.7 are considered as very strong predictions, values ranging from 0.5 to 0.7 as strong predictions, and values lower than 0.5 are considered as weak or close to random. Negative values are interpreted as anti-correlations. Finally, for each cell line and family of non-coding RNAs, the three metrics were averaged over 100 folds.
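As an illustration, the three measures can be computed from the entries of a single contingency matrix as in the following Python sketch. The function name `classification_metrics` and the convention of returning 0 when a denominator is zero are assumptions made for this example; the text does not say how such degenerate folds were handled.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, TPR, MCC) for one family in one fold.

    tp, fp, tn, fn: counts of true positives, false positives,
    true negatives and false negatives.
    Zero denominators yield 0.0 (an assumed convention).
    """
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, tpr, mcc


# Hypothetical counts, used only to show the call.
ppv, tpr, mcc = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

Averaging the returned values over the 100 folds gives the per-family figures of the kind reported in Supplementary Table 3.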

ENCODE short RNA-Seq data download

Total short RNA sequencing (sRNA-Seq) alignment files in BAM format were downloaded from the ENCODE webserver (ENCODE Project Consortium 2012) for cell lines A549, GM12878, HeLa-S3, HepG2, HUVEC, IMR90, K562, MCF-7, NHEK and SK-N-SH. Additionally, sRNA-Seq alignment files from the chromatin, nucleolus, nucleoplasm and cytoplasm compartments of the cell line K562 were downloaded. Characteristics of each of the downloaded RNA-Seq alignment files are provided in Supplementary Table 2. Detailed information about the cell culture, sequencing and mapping protocols is available from the ENCODE webserver (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq).

Annotation data

GENCODE release 19 (Harrow et al. 2012) main and tRNA annotation files in GTF format were downloaded from the GENCODE project website (http://www.gencodegenes.org/releases/19.html). These annotations include small non-coding RNAs predicted using sequences from RFAM (Nawrocki et al. 2014), miRBase (Kozomara et al. 2014) and tRNAscan-SE (Lowe et al. 1997). We first filtered out the features that were not located on autosomal or sex chromosomes. We then kept only the features of type “transcript” (3rd column in the GTF files) that had exactly one of the following biotypes (attribute “transcript_type” in the 9th column of the GTF files): “snRNA”, “snoRNA”, “miRNA” or “tRNAscan”, and discarded 6 features that were annotated as both tRNA and miRNA, obtaining a total of 7,177 annotated small non-coding RNAs. For the benchmarking against analogous existing methods, we also discarded all the snRNAs and all the snoRNAs whose name did not start with the term SNORD, obtaining a total of 4,124 annotated small non-coding RNAs.
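The chromosome, feature-type and biotype filters described above could be sketched in Python as follows. This is an illustration under stated assumptions, not the script used for the analysis: the helper name `keep_feature` is hypothetical, chromosome names are assumed to follow the `chr1` ... `chr22`, `chrX`, `chrY` convention, and the removal of the 6 features annotated as both tRNA and miRNA is not covered.

```python
import re

# Biotypes retained, as listed in the text.
KEPT_BIOTYPES = {"snRNA", "snoRNA", "miRNA", "tRNAscan"}
# Autosomes and sex chromosomes (naming convention is an assumption).
KEPT_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}
_TYPE_RE = re.compile(r'transcript_type "([^"]+)"')


def keep_feature(gtf_line):
    """Return True if a GTF line passes the three filters."""
    if gtf_line.startswith("#"):
        return False  # header or comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return False  # not a well-formed GTF record
    chrom, feature, attributes = fields[0], fields[2], fields[8]
    match = _TYPE_RE.search(attributes)
    return (
        chrom in KEPT_CHROMS
        and feature == "transcript"
        and match is not None
        and match.group(1) in KEPT_BIOTYPES
    )


# Hypothetical record, used only to show the call.
example = 'chr1\tENSEMBL\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "miRNA";'
kept = keep_feature(example)
```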

Consistency of sRNA profiles across multiple experiments

Given an sRNA profile P expressed in N cell lines and labeled with M different labels L = {l1, …, lM} in the N cell lines, we defined the score H(P) for the sRNA profile P as the normalized entropy of its labels:

$$H(P) = \frac{-\sum_{i=1}^{M} \frac{|l_i|}{N} \log_2 \frac{|l_i|}{N}}{\log_2(N)}$$

where |li| is the number of cell lines where the label li has been used, and 0 ≤ H(P) ≤ 1. We used a consistency score cutoff of 0.2 to keep sRNA profiles for further analyses.
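A minimal Python sketch of the consistency score is given below. It assumes that the entropy is normalized by log2(N), where N is the number of cell lines, and that a single-cell-line profile scores 0; the function name `normalized_label_entropy` is hypothetical. Reading the cutoff as "keep profiles with a score of at most 0.2" is also an assumption.

```python
import math
from collections import Counter


def normalized_label_entropy(labels):
    """Normalized entropy of the labels given to one profile.

    labels: one label per cell line in which the profile is expressed.
    Returns a value between 0 (same label everywhere) and 1.
    """
    n = len(labels)
    if n < 2:
        return 0.0  # a single observation cannot be inconsistent
    counts = Counter(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


# Hypothetical profile labeled in 9 cell lines, with one discordant label.
score = normalized_label_entropy(["miRNA"] * 8 + ["snoRNA"])
is_consistent = score <= 0.2
```

Under these assumptions, a profile labeled identically in eight of nine cell lines still passes the 0.2 cutoff, whereas an even split between two labels does not.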

Consistency of sRNA profiles within clusters

Let K be a cluster containing N sRNA profiles, and let C = {c1, …, cM} be the set of different RNA classes assigned to the profiles in cluster K. The consistency of the sRNA classes within the cluster is measured as the normalized cluster entropy H(K), which is defined as:

$$H(K) = \frac{-\sum_{i=1}^{M} \frac{|c_i|}{N} \log_2 \frac{|c_i|}{N}}{\log_2(N)}$$

where |ci| is the number of profiles in K assigned to class ci, and 0 ≤ H(K) ≤ 1. H(K) measures the purity of the cluster in terms of the known and predicted RNA classes. The lower the value of H(K), the higher the purity of the cluster.

Creation of the GTF file from the extended annotation

For each novel sRNA from the extended annotation generated with SeRPeNT over the ENCODE cell lines, we added annotations to the GTF file at the gene, transcript and exon levels, following the standards set by the Gencode Consortium. We used the label “SeRPeNT” in the source field (2nd column), and we included only the tags gene_id, gene_type, gene_name, transcript_id, transcript_type and transcript_name in the attributes field (9th column). The file is available as a supplementary file with the name hg19.serpent.annotation.gtf.
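The layout of these records can be illustrated with a short Python sketch that emits the gene, transcript and exon lines for one sRNA. The function name `gtf_records`, the identifier `SRPNT_0001`, the coordinates and the reuse of a single identifier for the gene and transcript tags are hypothetical; only the source label and the six attribute tags come from the description above.

```python
def gtf_records(chrom, start, end, strand, identifier, biotype):
    """Build gene, transcript and exon GTF lines for one novel sRNA."""
    attributes = (
        f'gene_id "{identifier}"; gene_type "{biotype}"; '
        f'gene_name "{identifier}"; transcript_id "{identifier}"; '
        f'transcript_type "{biotype}"; transcript_name "{identifier}";'
    )
    records = []
    for feature in ("gene", "transcript", "exon"):
        # Columns: seqname, source, feature, start, end, score,
        # strand, frame, attributes.
        fields = [chrom, "SeRPeNT", feature, str(start), str(end),
                  ".", strand, ".", attributes]
        records.append("\t".join(fields))
    return records


# Hypothetical sRNA, used only to show the output format.
lines = gtf_records("chr1", 1000, 1072, "+", "SRPNT_0001", "miRNA")
```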

Differential expression analysis

We used DESeq2 (Love et al. 2014) to assess the differential expression of the sRNAs in the extended annotation between 2 control samples and 2 individual knockdown samples for each of DICER1, DROSHA and XPO5. We considered as differentially expressed those sRNAs with a p-value adjusted for multiple testing lower than 0.05.
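DESeq2 adjusts p-values with the Benjamini-Hochberg procedure by default. The Python sketch below illustrates that adjustment and the 0.05 cutoff on a handful of hypothetical p-values; it is not a substitute for DESeq2, does not reproduce its independent filtering step, and the function name `benjamini_hochberg` is an assumption of this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest to the smallest p-value so that the
    # adjusted values are monotone and never exceed 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four sRNAs.
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
differentially_expressed = [p < 0.05 for p in padj]
```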

Supplementary Tables

Supplementary Table 1. Classification performance comparison. Comparison of the classification performance of BlockClust (Videm et al. 2014), DARIO (Fasold et al. 2011) and SeRPeNT on the GSM769510 dataset (Friedländer et al. 2012) for the prediction of known miRNAs, tRNAs and C/D-box snoRNAs from the Gencode v19 annotation, using cross-fold validation with 100 folds. Performance is given in terms of positive predictive value (PPV), also called precision, and true positive rate (TPR), also called recall. Classification performance values for BlockClust and DARIO were taken from Table 5 in (Videm et al. 2014).

              miRNA          tRNA           snoRNA C/D-box
              PPV    TPR     PPV    TPR     PPV    TPR
BlockClust    0.88   0.89    0.95   0.80    0.74   0.39
DARIO         0.85   0.81    0.92   0.88    0.46   0.52
SeRPeNT       0.99   0.99    0.95   0.96    0.75   0.70

Supplementary Table 2. ENCODE Dataset. (a) Characteristics of the ENCODE benchmarking dataset, composed of nuclear short RNA-Seq alignment files from ten different cell lines, and (b) characteristics of the ENCODE compartments dataset, composed of short RNA-Seq alignment files from four different compartments of the K562 cell line.

a)

Cell line   Tissue         Karyotype   Replicate   # mapped reads   Read length
A549        Epithelium     Cancer      1           175,033,257      101
                                       2           141,307,675      101
GM12878     Blood          Normal      1           34,303,754       36
                                       2           53,087,115       36
HeLa-S3     Cervix         Cancer      1           29,674,616       36
                                       2           29,943,178       36
HepG2       Liver          Cancer      1           38,337,423       36
                                       2           32,112,643       36
HUVEC       Blood Vessel   Normal      1           43,069,106       36
                                       2           29,365,379       36
IMR90       Lung           Cancer      1           203,347,637      101
                                       2           207,729,440      101
K562        Blood          Cancer      1           37,009,953       36
                                       2           31,479,779       36
MCF-7       Breast         Cancer      1           168,359,421      101
                                       2           158,785,959      101
NHEK        Skin           Normal      1           29,326,978       36
                                       2           35,008,048       36
SK-N-SH     Brain          Cancer      1           213,226,585      101
                                       2           91,852,402       101

b)

Compartment   Replicate   # mapped reads   Read length
Chromatin     1           40,696,120       36
              2           42,195,335       36
Cytosol       1           42,230,824       36
              2           43,451,312       36
Nucleolus     1           37,368,047       36
              2           36,429,188       36
Nucleoplasm   1           45,815,945       36
              2           44,151,588       36

Supplementary Table 3. Accuracy analysis for the identification of known sRNA classes. Averages across 100 folds of precision (PPV), recall or true positive rate (TPR) and Matthews correlation coefficient (MCC) are reported for each cell line and each biotype of non-coding RNAs in the Gencode v19 annotation. The non-predicted (NP) column is the average across 100 folds of the percentage of profiles that could not be annotated.

          snoRNA            miRNA             tRNA              snRNA             NP
Cell line PPV   TPR   MCC   PPV   TPR   MCC   PPV   TPR   MCC   PPV   TPR   MCC   Ratio
A549      0.77  0.85  0.72  0.92  0.92  0.9   0.89  0.86  0.8   0.64  0.52  0.52  30.9 %
GM12878   0.81  0.8   0.74  0.93  0.94  0.92  0.9   0.91  0.8   0.64  0.53  0.54  30.7 %
HeLa-S3   0.77  0.77  0.68  0.94  0.9   0.9   0.83  0.86  0.73  0.65  0.53  0.55  34.2 %
HepG2     0.86  0.88  0.8   0.98  0.98  0.98  0.91  0.9   0.81  0.56  0.5   0.49  38.8 %
IMR90     0.71  0.82  0.66  0.95  0.91  0.91  0.88  0.82  0.76  0.54  0.45  0.43  37.1 %
K562      0.66  0.62  0.56  0.87  0.9   0.84  0.9   0.88  0.76  0.62  0.63  0.58  40.6 %
MCF-7     0.68  0.78  0.62  0.9   0.9   0.87  0.83  0.79  0.7   0.54  0.42  0.4   32.3 %
NHEK      0.79  0.84  0.74  0.97  0.97  0.97  0.91  0.94  0.84  0.2   0.11  0.12  25.3 %
SK-N-SH   0.84  0.83  0.77  0.93  0.93  0.91  0.87  0.88  0.79  0.55  0.5   0.47  35.9 %

Supplementary Table 4. ENCODE cell lines dataset results. Summary of the number of sRNA profiles and clusters identified in each of the ENCODE cell lines. The total number of profiles is reported, as well as the number and percentage of known and predicted profiles separated by non-coding RNA biotype. Known profiles are profiles that overlap with a feature annotated in the Gencode v19 annotation. Predicted profiles are those that do not overlap any feature in the annotation but are labeled by SeRPeNT, while unknown profiles are those that lack a prediction.

                      A549          GM12878       HeLa-S3       HepG2         IMR90         K562          MCF-7         NHEK          SK-N-SH
Known    snoRNA       252 (13.45%)  184 (17.05%)  193 (19.05%)  191 (20.25%)  247 (13.32%)  83 (10.11%)   239 (12.00%)  173 (18.25%)  235 (14.17%)
Known    tRNA         453 (24.19%)  401 (37.16%)  345 (34.06%)  380 (40.30%)  451 (24.33%)  376 (45.80%)  446 (22.39%)  394 (41.56%)  415 (25.02%)
Known    miRNA        205 (10.95%)  128 (11.86%)  135 (13.33%)  93 (9.86%)    218 (11.76%)  126 (15.35%)  191 (9.59%)   108 (11.39%)  182 (10.97%)
Known    snRNA        111 (5.93%)   52 (4.82%)    51 (5.03%)    45 (4.77%)    118 (6.36%)   36 (4.38%)    123 (6.17%)   45 (4.75%)    84 (5.06%)
Labeled  snoRNA       275 (14.68%)  29 (2.69%)    34 (3.36%)    35 (3.71%)    169 (9.12%)   15 (1.83%)    336 (16.87%)  21 (2.22%)    160 (9.64%)
Labeled  tRNA         71 (3.79%)    50 (4.63%)    31 (3.06%)    48 (5.09%)    83 (4.48%)    28 (3.41%)    117 (5.87%)   48 (5.06%)    76 (4.58%)
Labeled  miRNA        30 (1.60%)    16 (1.48%)    16 (1.58%)    6 (0.64%)     38 (2.05%)    12 (1.46%)    41 (2.06%)    5 (0.53%)     16 (0.96%)
Labeled  snRNA        145 (7.74%)   14 (1.30%)    9 (0.89%)     3 (0.32%)     38 (2.05%)    3 (0.37%)     43 (2.16%)    5 (0.53%)     74 (4.46%)
Unknown               331 (17.67%)  205 (19.00%)  199 (19.64%)  142 (15.06%)  492 (26.54%)  142 (17.30%)  456 (22.89%)  149 (15.72%)  417 (25.14%)
# profiles            1873          1079          1013          943           1854          821           1992          948           1659
# clusters            287           194           201           198           303           168           314           187           294
# singletons          516           389           377           290           690           331           592           297           506

Supplementary Table 5. ENCODE cell compartments dataset results. Summary of the number of profiles and clusters identified in each of the ENCODE cell compartments for K562 cell line. Total number of profiles are reported as well as number and percentage of annotated and unknown profiles. Percentages are calculated within each compartment. Singletons were not considered in further analyses.

                                  Chromatin      Cytosol        Nucleolus      Nucleoplasm
Extended annotation   snoRNA      268 (23.14%)   117 (12.20%)   289 (38.23%)   240 (4.83%)
Extended annotation   miRNA       37 (3.20%)     142 (14.81%)   28 (3.70%)     31 (0.62%)
Extended annotation   tRNA        175 (15.11%)   431 (44.94%)   164 (21.69%)   299 (6.02%)
Extended annotation   snRNA       108 (9.33%)    61 (6.36%)     64 (8.47%)     69 (1.39%)
Extended annotation   cuRNAs      8 (0.69%)      10 (1.04%)     6 (0.79%)      10 (0.20%)
Not in the extended annotation    562 (48.53%)   198 (20.65%)   205 (27.12%)   4319 (86.94%)
# profiles                        1158           959            756            4968
# clusters                        239            218            115            739
# singletons                      370            327            303            1546

Supplementary Table 6. Mixed clusters of miRNAs and other ncRNAs. Summary of the clusters where a tRNA or a snoRNA annotated in the Gencode v19 annotation clusters with at least two miRNAs annotated in the Gencode v19 annotation.

tRNA / snoRNA             miRNAs                                                              Cluster   Cell line
SCARNA3, tRNA-Ile-GAT     MIR1307-201, MIR33B-201, MIR301B-201                                26        A549
SNORD116-5, SNORD116-7    MIR589-201, MIR1296-201, MIR125B1-201                               28        A549
SNORA57                   MIR106B-201, MIRLET7I-201, MIR98-202                                88        GM12878
tRNA-Glu-GAA              MIR425-201, MIR17HG-205, MIR221-201                                 53        HepG2
SNORD14C                  MIRLET7G-201, MIR744-001                                            126       IMR90
tRNA-Gly-CCC              MIR1296-201, MIR3180-5-201, hsa-mir-3180-3.2-201,                   5         K562
                          Hsa-mir-3180-4.1-201, hsa-mir-3180-3.1-201, hsa-mir-3180-3.3-201
SNORD26                   MIR425-201, MIR423-201, MIR362-201, MIR660-201                      18        K562
SNORD60                   MIR532-201, MIR545-201, MIR503HG-201                                19        K562
tRNA-Ala-AGC              MIR4802-201, MIR3913-2-201                                          21        MCF-7
tRNA-Ile-GAT              MIR30E-201, MIR590-201, MIR126-201, MIR326-201, MIR3187-201,        26        MCF-7
                          MIR502-202
SNORA3                    MIR219-1-201, MIR192-201, MIR17HG-202, MIR497HG-201,                38        MCF-7
                          MIR222-201, MIR503-201
SCARNA15/ACA45            MIR27B-201, TRI-TAT2-3                                              107       NHEK
tRNA-Leu-AAG              MIRLET7F1-201, MIRLET7A3-201                                        198       SK-N-SH
tRNA-Ile-GAT              MIR30E-201, MIR652-201                                              88        SK-N-SH

Supplementary Figures

Supplementary Figure 1. Accuracy benchmarking pipeline. Pipeline of the cross-fold validation method for the accuracy benchmarking of the prediction of novel non-coding RNAs.

Supplementary Figure 2. Empirical assessment of the fold-change threshold. Distribution of fold-change values for the sRNA profiles showing a significant change in pairwise distances (seagreen) and for the non-significant profiles (red), using a Mann-Whitney U test with a p-value threshold of 0.01, between pairs of ENCODE cell lines. A fold-change threshold of 2.5 separates the significant profiles from the non-significant ones in all pairs of cell lines.

Supplementary Figure 3. Differential processing of miRNA arms. Results of the SeRPeNT diffproc analysis of 24 miRNAs (x axis) to test differences in 5’arm-to-3’arm expression ratio and arm-switching in normal versus tumour tissues. The left-hand y axis shows the log fold-change calculated by the SeRPeNT diffproc analysis, with the miRNAs ranked by this value. We indicate which miRNAs were measured in (Li et al. 2012) to have a difference in expression between 5’ and 3’ arms (blue) or not (red). The right-hand y axis separates the cases according to whether they were labeled as differentially or not differentially processed by SeRPeNT. Diamonds indicate which miRNAs were observed to have arm-switching in (Li et al. 2012).

Supplementary Figure 4. Differential processing comparison. Number of differentially processed profiles assessed by RPA and SeRPeNT between pairs of ENCODE cell lines. Numbers represent profiles that show expression in all the cell lines compared. Cell lines sequenced with reads of length 36 bp (AG04450, BJ, GM12878, HeLa-S3, H1-hESC and HepG2) were analyzed independently from the cell lines sequenced with reads of length 101 bp (A549, MCF-7 and SK-N-SH). Differentially processed profiles assessed by RPA were taken from Supplementary File 2 in (Pundhir et al. 2015). As RPA does not perform a reproducibility test, we dropped this filter from our results to make the analyses comparable and only required that profiles be expressed in both replicates. The overlap between the two methods is only moderate, with SeRPeNT detecting in general more differentially expressed events and RPA often recovering a subset of the SeRPeNT predictions.

Supplementary Figure 5. Extended annotation summary. Distribution of the sRNAs from the extended annotation in terms of the number of ENCODE cell lines in which they show expression.

Supplementary Figure 6. Differential processing of tRNAs in cell compartments. Representation of the read profiles for tRNA-Lys, tRNA-His and tRNA-Leu showing abundant processing of the 3’-half in the cytosol compared to the chromatin compartment. The plot represents the number of reads per nucleotide on the same scale for each compartment.

Supplementary Figure 7. DROSHA dependence of the extended sRNA annotation. Differential expression analysis of the sRNAs from the extended annotation between DROSHA knockout cells and controls in HCT116 cells. Volcano plots are shown independently for (a) miRNAs, (b) snoRNAs, (c) snRNAs and (d) tRNAs. Differentially expressed sRNAs are indicated in blue. The rest are indicated in red.

Supplementary Figure 8. XPO5 dependence of the extended sRNA annotation. Differential expression analysis of the sRNAs from the extended annotation between XPO5 knockout cells and controls in HCT116 cells. Volcano plots are shown independently for (a) miRNAs, (b) snoRNAs, (c) snRNAs and (d) tRNAs.

Supplementary Figure 9. Dependence of the cuRNAs on the miRNA biogenesis pathway. Differential expression analysis of the cuRNAs from the extended annotation in (a) HCT116 DICER1 knockout cells, (b) HCT116 DROSHA knockout cells and (c) HCT116 XPO5 knockout cells, compared to HCT116 control cells.

Supplementary Figure 10. Improved dynamic warping algorithm for the optimal alignment of sRNA-seq read profiles. Pseudocode of the algorithm, based on the dynamic time warping algorithm, used in our analyses to compare two read profiles. The function ϕ, when applied to a profile S, returns a random negative value taken from a uniform distribution with mean and standard deviation equal to those of the distribution of read counts from S.

Supplementary Figure 11. Improved density-based clustering algorithm. Pseudocode for the modification of the density-based clustering algorithm used in our analysis, using a Gaussian kernel density estimator. The function optimize_distance is detailed in (Wang et al. 2015).

Supplementary Data

Supplementary Data 1. Novel profiles in ENCODE cell lines. List of predicted profiles found by SeRPeNT among ENCODE cell lines that make up the extended annotation of sRNAs in human. Profiles highlighted in green are new profiles whose prediction is validated by an existing structure in the Rfam database.

Supplementary Data 2. Extended annotation. Extended annotation in GTF format.

Supplementary Data 3. Multiple validation of predicted sRNAs. List of sRNAs in the extended annotation that show evidence of processing by the miRNA machinery, validation of a miRNA-like secondary structure with FOMmiR, or overlap with a long non-coding RNA.

Supplementary Data 4. Differential processing. List of differentially processed profiles in the extended annotation for each pair of K562 cell compartments.

References

Bernhart, S. et al. (2008). RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9, 474-478.

Cunningham et al. (2015). Ensembl 2015. Nucleic Acids Research.

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21. (2013).

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57-74. (2012)

Fasold, M. et al. DARIO: a ncRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research 39 (Web server issue), W112-W117 (2011).

Friedländer, M. R. et al. miRDeep2 accurately identifies known and hundreds of novel microRNA in seven animal clades. Nucleic Acids Research 40, 37-52 (2012).

Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22(9):1760-74. (2012)

Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 42(Database issue):D68-73. (2014)

Li et al. MicroRNA 3’ end nucleotide modification patterns and arm selection preference in liver tissues. BMC Systems Biology 6, S14 (2012).

Love M, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12):550. (2014)

Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25(5):955-64. (1997)

Nawrocki, E. and Eddy, S. Infernal 1.1: 100-fold faster homology searches. Bioinformatics, 29, 2933-2935. (2013)

Nawrocki, E. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Research 43 (Database Issue), D130-D137 (2014).

Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48, 443-453.

Pundhir, S., Poirazi, P. and Gorodkin, J. Emerging applications of read profiles towards the functional annotation of the genome. Front. Genet. 6, article 188 (2015).

Quinlan, A. and Hall, I. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842. (2010)

Reuter, J. S. and Mathews, D. H. RNAstructure: software for RNA secondary structure prediction and analysis. Bioinformatics 11, 129-132 (2010).

Rybak-Wolf A. et al. A variety of Dicer substrates in human and C. elegans. Cell, 159, 1153-67. (2014)

Videm, P. et al. BlockClust: efficient clustering and classification of non-coding RNAs from short read RNA-Seq profiles. Bioinformatics 30, i274-i282 (2014).

Wang, S. et al. (2015) Comment on “Clustering by fast search and find of density peaks”. arXiv:1501.04267v1.