Using Deep Learning to Classify Riboswitches with 99
Total Page:16
File Type:pdf, Size:1020Kb
fbioe-08-00808 July 10, 2020 Time: 18:46 # 1 ORIGINAL RESEARCH published: 14 July 2020 doi: 10.3389/fbioe.2020.00808 Riboflow: Using Deep Learning to Classify Riboswitches With ∼99% Accuracy Keshav Aditya R. Premkumar1†, Ramit Bharanikumar2† and Ashok Palaniappan3* 1 MS Program in Computer Science, Department of Computer Science, College of Engineering and Applied Sciences, Stony Brook University, Stony Brook, NY, United States, 2 MS in Bioinformatics, Georgia Institute of Technology, Atlanta, GA, United States, 3 Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, India Riboswitches are cis-regulatory genetic elements that use an aptamer to control gene expression. Specificity to cognate ligand and diversity of such ligands have expanded the functional repertoire of riboswitches to mediate mounting apt responses to sudden metabolic demands and signal changes in environmental conditions. Given their critical role in microbial life, riboswitch characterisation remains a challenging Edited by: computational problem. Here we have addressed the issue with advanced deep Yusuf Akhter, learning frameworks, namely convolutional neural networks (CNN), and bidirectional Babasaheb Bhimrao Ambedkar University, India recurrent neural networks (RNN) with Long Short-Term Memory (LSTM). Using a Reviewed by: comprehensive dataset of 32 ligand classes and a stratified train-validate-test approach, Ahsan Z. Rizvi, we demonstrated the accurate performance of both the deep learning models (CNN and Mewar University, India RNN) relative to conventional hyperparameter-optimized machine learning classifiers Joohyun Kim, Vanderbilt University Medical Center, on all key performance metrics, including the ROC curve analysis. In particular, the United States bidirectional LSTM RNN emerged as the best-performing learning method for identifying *Correspondence: the ligand-specificity of riboswitches with an accuracy >0.99 and macro-averaged Ashok Palaniappan [email protected] F-score of 0.96. An additional attraction is that the deep learning models do not require †These authors have contributed prior feature engineering. A dynamic update functionality is built into the models to factor equally to this work for the constant discovery of new riboswitches, and extend the predictive modeling to new classes. Our work would enable the design of genetic circuits with custom-tuned Specialty section: This article was submitted to riboswitch aptamers that would effect precise translational control in synthetic biology. Computational Genomics, The associated software is available as an open-source Python package and standalone a section of the journal resource for use in genome annotation, synthetic biology, and biotechnology workflows. Frontiers in Bioengineering and Biotechnology Availability: Received: 17 December 2019 Accepted: 23 June 2020 PyPi package: riboflow @ https://pypi.org/project/riboflow Published: 14 July 2020 Repository with Standalone suite of tools: https://github.com/RiboswitchClassifier Citation: Premkumar KAR, Bharanikumar R Language: Python 3.6 with numpy, keras, and tensorflow libraries. and Palaniappan A (2020) Riboflow: Using Deep Learning to Classify License: MIT. Riboswitches With ∼99% Accuracy. Front. Bioeng. Biotechnol. 8:808. Keywords: riboswitch family, synthetic biology, machine learning, convolutional neural network, recurrent neural doi: 10.3389/fbioe.2020.00808 network, hyperparameter optimization, multiclass ROC, clustering Frontiers in Bioengineering and Biotechnology| www.frontiersin.org 1 July 2020| Volume 8| Article 808 fbioe-08-00808 July 10, 2020 Time: 18:46 # 2 Premkumar et al. Deep Learning to Classify Riboswitches INTRODUCTION and di-nucleotide frequencies in a supervised machine learning framework to classify different riboswitch sequences, and Riboswitches are ubiquitous and critical metabolite-sensing gene concluded that the multi-layer perceptron was optimal (Singh expression regulators in bacteria that are capable of folding and Singh, 2016). Their work achieved modest performance into at least two alternative conformations of 50UTR mRNA (F-score of 0.35 on 16 different riboswitch classes). None of secondary structure, which functionally switch gene expression the above methods were shown to generalize effectively to between on and off states (Mandal et al., 2003; Roth and unseen riboswitches. Our remedy was to explore the use of Breaker, 2009; Serganov and Nudler, 2013). The selection of deep learning models for riboswitch classification. Deep networks conformation is dictated by the binding of ligand cognate to are relatively recent neural network-based frameworks that use the aptamer domain of a given riboswitch (Gelfand et al., a type of learning known as representation learning (Bengio 1999; Winkler et al., 2002, 2004). Cognate ligands are key et al., 2013). Convolutional neural networks are one type of metabolites that mediate responses to metabolic or external deep learning, known for hierarchical information extraction. stimuli. Consequent to conformational changes, riboswitches Such architectures with alternating convolutional and pooling weaken transcriptional termination or occlude the ribosome layers have been earlier used to extract structural and functional binding site thereby inhibiting translation initiation of associated information from genome sequences (Alipanahi et al., 2015; genes (Yanofsky, 1981; Mandal and Breaker, 2004). Riboswitches Sønderby et al., 2015; Zhou and Troyanskaya, 2015; Kelley provide an intriguing window into the ’RNA world’ biology et al., 2016). Recurrent neural networks are counterparts to (Stormo and Ji, 2001; Brantl, 2004; Breaker et al., 2006; Strobel CNNs and specialize in extracting features from time-series and Cochrane, 2007) and there is evidence of their wider data (Che et al., 2018). RNNs with Long Short-Term Memory distribution in complex genomes (Sudarsan et al., 2003; Barrick (termed LSTM) incorporate recurrent connections to model and Breaker, 2007; Bocobza and Aharoni, 2014; McCown et al., long-run dependencies in sequential information (Hochreiter 2017). The modular properties of riboswitches have engendered and Schmidhuber, 1997), such as in speech and video (Graves and the possibility of synthetic control of gene expression (Tucker Schmidhuber, 2005). This feature of LSTM RNNs immediately and Breaker, 2005), and combined with the ability to engineer suggests their use in character-level modeling of biological binding to an ad hoc ligand, riboswitches have turned out sequence data (Lipton, 2015; Lo Bosco and Di Gangi, 2017). to be a valuable addition to the synthetic biologist’s toolkit Bidirectional LSTM RNN have been shown to be especially (Wieland and Hartig, 2008; Wittmann and Suess, 2012). In effective, given that they combine the outputs of two LSTM addition to orthogonal gene control they are useful in a variety RNNs, one processing the sequence from left to right, the of applications, notably metabolic engineering (Zhou and Zeng, other one from right to left, together enabling the capture of 2015), biosensor design (Yang et al., 2013; Meyer et al., 2015) dynamic temporal or spatial behavior (Sundermeyer et al., 2014). and genetic electronics (Villa et al., 2018). Riboswitches have Bidirectional LSTM RNNs are a particularly powerful abstraction been used as basic computing units of a complex biocomputation for modeling nucleic acid sequences whose spatial secondary network, where the concentration of the ligand of interest is structure determines function (Lee et al., 2015). Two recent titrated into measurable gene expression (Beisel and Smolke, successes of deep learning methods in RNA biology have been: (i) 2009; Domin et al., 2017). Riboswitches have also been directly prediction of RNA secondary structure (Singh et al., 2019), and used as posttranscriptional and translational checkpoints in (ii) dynamic range improvement in riboswitch devices (Groher genetic circuits (Chang et al., 2012). Their key functional roles et al., 2019). Here we have evaluated the relative merits of in infectious agents but absence in host genomes make them a spectrum of state-of-the-art learning methods for resolving attractive targets for the design of cognate inhibitors (Blount the ligand-specificity of riboswitches from sequence. It is and Breaker, 2006; Deigan and Ferré-D’Amaré, 2011; Wang demonstrated that the deep learning models vastly outperformed et al., 2017). Characterisation of riboswitches would expand other machine learning models with respect to the classification the repertoire of translational control options in synthetic of riboswitches belonging to 32 different families. biology and bioengineering. In turn, this would facilitate the reliable construction of precise genetic circuits. In view of their myriad applications, robust computational methods for the MATERIALS AND METHODS accurate characterisation of novel riboswitch sequences would be of great value. Dataset and Pre-processing Since the discovery of riboswitches (Mironov et al., 2002; We searched the Rfam database of RNA families (Kalvari Nahvi et al., 2002), many computational efforts have been et al., 2018) with the term “Riboswitch AND family” and the advanced for their characterisation, notably Infernal (Nawrocki corresponding hit sequences were obtained in FASTA format and Eddy, 2013), Riboswitch finder (Bengert and Dandekar, from the Rfam ftp server (Rfam v13 accessed on July 6, 2004), RibEx (Abreu-Goodger and Merino, 2005), RiboSW 2019). Each