Structure-based Whole Genome Realignment Reveals Many Novel Non-coding RNAs Sebastian Will,1;2;4;6 Michael Yu,1;2;5;6 and Bonnie Berger1;2;3;7 1 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Tech- nology (MIT), Cambridge, MA 02139, USA 2 Department of Mathematics, MIT, Cambridge, MA 02139, USA 3 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA 4 Present Address: Department of Computer Science, University of Leipzig, 04107 Leipzig, Germany 5 Present Address: Bioinformatics and Systems Biology Graduate Program, University of Californa San Diego, La Jolla, CA 92093, USA 6 These authors contributed equally to this work. 7 Corresponding author; e-mail:
[email protected] Keywords: structural non-coding RNAs, whole genome alignments, de novo prediction of non-coding RNAs 1 Will, et al. Abstract Recent genome-wide computational screens that search for conservation of RNA secondary structure in whole genome alignments (WGAs) have predicted thousands of structural non- coding RNAs (ncRNAs). The sensitivity of such approaches, however, is limited due to their reliance on sequence-based whole-genome aligners, which regularly misalign structural ncR- NAs. This suggests that many more structural ncRNAs may remain undetected. Structure- based alignment, which could increase the sensitivity, has been prohibitive for genome-wide screens due to its extreme computational costs. Breaking this barrier, we present the pipeline REAPR (RE-Alignment for Prediction of structural ncRNA), which efficiently realigns whole genomes based on RNA sequence and structure, thus allowing us to boost the performance of de novo ncRNA predictors, such as RNAz. Key to the pipeline's efficiency is the development of a novel banding technique for multiple RNA alignment.