A Model-Based Approach for Identifying Functional Intergenic Transcribed Regions and Noncoding RNAs John P. Lloyd,†,1 Zing Tsung-Yeh Tsai,2 Rosalie P. Sowers,3 Nicholas L. Panchy,‡,4 and Shin-Han Shiu*,1,4,5 1Department of Plant Biology, Michigan State University, East Lansing, MI 2Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 3Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 4Genetics Program, Michigan State University, East Lansing, MI 5Ecology, Evolutionary Biology, and Behavior Program, Michigan State University, East Lansing, MI †Present address: Departments of Human Genetics and Internal Medicine, University of Michigan, Ann Arbor, MI ‡National Institute for Mathematical and Biological Synthesis, University of Tennessee, Knoxville, TN *Corresponding author: E-mail:
[email protected]. Associate editor: Jun Gojobori Abstract With advances in transcript profiling, the presence of transcriptional activities in intergenic regions has been well established. However, whether intergenic expression reflects transcriptional noise or activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that could be expected if intergenic expression is largely nonfunctional. To further assess the functionality of ITRs, we first built machine learning models using Arabidopsis thaliana as a model that accurately distinguish functional sequences (benchmark protein-coding and RNA genes) and likely nonfunctional ones (pseudogenes and unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models genome-wide, we found that 4,427 ITRs (38%) and 796 annotated ncRNAs (44%) had features significantly similar to benchmark protein-coding or RNA genes and thus were likely parts of functional genes.