FEELnc program: FlExible Extraction of Long non-coding RNAs

Valentin Wucher, Fabrice Legeai, Christophe Hitte, Thomas Derrien

Whole-transcriptome sequencing (RNA-Seq) has become the de facto standard for cataloguing all RNA populations and monitoring their expression levels in a given cell or tissue sample at a specific time point [1]. Among the transcripts identified in transcriptomic studies, it remains crucial to distinguish between the different classes of RNAs, in particular to separate those that will be translated into proteins (mRNAs) from the emerging class of long non-coding RNAs (lncRNAs). Indeed, lncRNAs have been implicated in many essential functions of the cell machinery [2,3], and identifying them is a first step towards designing experimental protocols suited to validating their functions.

The FEELnc program, which stands for FlExible Extraction of LncRNAs (https://github.com/tderrien/FEELnc), has been designed to improve the annotation and classification of lncRNA versus mRNA transcripts assembled from RNA-Seq data. FEELnc is an alignment-free tool that uses multi k-mer frequencies and a relaxed open reading frame (ORF) annotation as its main computational predictors for characterising transcript sequences. These features are then fed to a machine learning algorithm (random forest [4]) to compute a new coding potential score, which is used to discriminate mRNAs from lncRNAs. In particular, the program can be trained on species-specific annotations and automatically defines the coding potential threshold that maximises classification performance.
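The two kinds of predictors described above can be illustrated with a minimal Python sketch. It is not FEELnc's implementation: the function names, the choice of k values, and the interpretation of a "relaxed" ORF as one that may lack a stop codon are all assumptions made for illustration.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer, counted over overlapping windows."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
            total += 1
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


def longest_orf_length(seq, require_stop=True):
    """Length in nucleotides of the longest ATG-initiated ORF in the three
    forward frames. With require_stop=False, an ORF that runs to the end of
    the transcript without a stop codon is also accepted (the "relaxed"
    definition assumed in this sketch)."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # stop codon included
                start = None
        if start is not None and not require_stop:
            # Open ORF: count whole codons up to the transcript end.
            best = max(best, ((len(seq) - start) // 3) * 3)
    return best


def feature_vector(seq, ks=(1, 2, 3)):
    """Concatenate k-mer frequencies for several k with the ORF coverage."""
    features = []
    for k in ks:
        freqs = kmer_frequencies(seq, k)
        features.extend(freqs[kmer] for kmer in sorted(freqs))
    orf_len = longest_orf_length(seq, require_stop=False)
    features.append(orf_len / len(seq) if seq else 0.0)
    return features
```

Vectors of this kind, computed for known mRNAs and lncRNAs, could then serve as training input for a random forest classifier whose predicted class probability plays the role of a coding potential score.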

FEELnc offers several advantages compared to previously developed methods. First, it is an all-in-one solution, from the filtering of dubious transcripts inherent to transcriptome assemblies to the computation of a new coding potential score. In addition, it allows a precise classification of lncRNAs into standardized classes based on the GENCODE definitions [5], such as intergenic lncRNAs (lincRNAs) and genic lncRNAs, with subclasses (sense/antisense and exonic/intronic). Second, FEELnc benchmarks on multiple organisms (human, mouse, chicken...) most often showed better performance in terms of sensitivity (Sn), specificity (Sp), accuracy, precision and recall than several state-of-the-art programs. Third, FEELnc can annotate species-specific lncRNAs and be used on “non-model” organisms for which no set of lncRNAs is available. To this end, we have developed specific modules that model lncRNA sequences to serve as a learning set for the training. Finally, to our knowledge, FEELnc is the only tool that allows users to provide their own specificity threshold in order to annotate stringent sets of lncRNAs and mRNAs. This in turn defines a third class of ambiguous transcripts (TUCp, for Transcripts of Unknown Coding potential [6]), for which the coding potential score is inconclusive and does not meet the user's criteria.
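The idea of a user-supplied specificity giving rise to three classes can be sketched as follows. This is a hypothetical illustration, not FEELnc's code: the function names, the assumption that scores lie on a scale where higher means "more coding", and the midpoint fallback when the two cutoffs overlap are all choices made for this example.

```python
def stringent_cutoffs(mrna_scores, lnc_scores, min_specificity):
    """Derive two cutoffs from labelled training scores.

    mrna_cut: lowest score at which calling "mRNA" for score >= cut wrongly
              captures at most (1 - min_specificity) of the known lncRNAs.
    lnc_cut:  highest score at which calling "lncRNA" for score <= cut wrongly
              captures at most (1 - min_specificity) of the known mRNAs.
    """
    candidates = sorted(set(mrna_scores) | set(lnc_scores))

    def frac_at_or_above(scores, t):
        return sum(s >= t for s in scores) / len(scores)

    def frac_at_or_below(scores, t):
        return sum(s <= t for s in scores) / len(scores)

    mrna_cut = min(
        (t for t in candidates
         if 1 - frac_at_or_above(lnc_scores, t) >= min_specificity),
        default=float("inf"),   # no cutoff is stringent enough
    )
    lnc_cut = max(
        (t for t in candidates
         if 1 - frac_at_or_below(mrna_scores, t) >= min_specificity),
        default=float("-inf"),  # no cutoff is stringent enough
    )
    if lnc_cut >= mrna_cut:
        # The requested specificity is loose enough that the two zones
        # overlap: fall back to a single cutoff, leaving no ambiguous class.
        mid = (lnc_cut + mrna_cut) / 2
        return mid, mid
    return mrna_cut, lnc_cut


def classify(score, mrna_cut, lnc_cut):
    """Assign a transcript to mRNA, lncRNA, or the ambiguous TUCp class."""
    if score >= mrna_cut:
        return "mRNA"
    if score <= lnc_cut:
        return "lncRNA"
    return "TUCp"
```

With toy training scores `[0.9, 0.8, 0.7, 0.4]` for mRNAs and `[0.1, 0.2, 0.3, 0.6]` for lncRNAs, requiring a specificity of 1.0 yields cutoffs of 0.7 and 0.3, so a transcript scoring 0.5 falls into the TUCp class.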

To conclude, FEELnc is a robust and adaptable tool suitable for filtering, annotating and classifying long non-coding RNAs and messenger RNAs assembled from RNA-Seq data.

References
1. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57–63 (2009).
2. Ponting, C. P., Oliver, P. L. & Reik, W. Evolution and functions of long noncoding RNAs. Cell 136, 629–641 (2009).
3. Derrien, T., Guigo, R. & Johnson, R. The Long Non-Coding RNAs: A New (P)layer in the "Dark Matter". Front Genet 2, 107 (2011).
4. Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
5. Harrow, J. et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res 22, 1760–1774 (2012).
6. Mattick, J. S. & Rinn, J. L. Discovery and annotation of long noncoding RNAs. Nature Structural & Molecular Biology 22, 5–7 (2015).