
ORIGINAL RESEARCH published: 14 July 2020 doi: 10.3389/fbioe.2020.00808

Riboflow: Using Deep Learning to Classify Riboswitches With ∼99% Accuracy

Keshav Aditya R. Premkumar1†, Ramit Bharanikumar2† and Ashok Palaniappan3*

1 MS Program in Computer Science, Department of Computer Science, College of Engineering and Applied Sciences, Stony Brook University, Stony Brook, NY, United States, 2 MS in Bioinformatics, Georgia Institute of Technology, Atlanta, GA, United States, 3 Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, India

Edited by: Yusuf Akhter, Babasaheb Bhimrao Ambedkar University, India
Reviewed by: Ahsan Z. Rizvi, Mewar University, India; Joohyun Kim, Vanderbilt University Medical Center, United States
*Correspondence: Ashok Palaniappan, [email protected]
†These authors have contributed equally to this work
Specialty section: This article was submitted to Computational Genomics, a section of the journal Frontiers in Bioengineering and Biotechnology
Received: 17 December 2019; Accepted: 23 June 2020; Published: 14 July 2020
Citation: Premkumar KAR, Bharanikumar R and Palaniappan A (2020) Riboflow: Using Deep Learning to Classify Riboswitches With ∼99% Accuracy. Front. Bioeng. Biotechnol. 8:808. doi: 10.3389/fbioe.2020.00808

Riboswitches are cis-regulatory genetic elements that use an aptamer to control gene expression. Specificity to the cognate ligand and the diversity of such ligands have expanded the functional repertoire of riboswitches to mediate mounting apt responses to sudden metabolic demands and signal changes in environmental conditions. Given their critical role in microbial life, riboswitch characterisation remains a challenging computational problem. Here we have addressed the issue with advanced deep learning frameworks, namely convolutional neural networks (CNN), and bidirectional recurrent neural networks (RNN) with Long Short-Term Memory (LSTM). Using a comprehensive dataset of 32 ligand classes and a stratified train-validate-test approach, we demonstrated the accurate performance of both the deep learning models (CNN and RNN) relative to conventional hyperparameter-optimized machine learning classifiers on all key performance metrics, including the ROC curve analysis. In particular, the bidirectional LSTM RNN emerged as the best-performing learning method for identifying the ligand-specificity of riboswitches, with an accuracy >0.99 and a macro-averaged F-score of 0.96. An additional attraction is that the deep learning models do not require prior feature engineering. A dynamic update functionality is built into the models to factor for the constant discovery of new riboswitches and extend the predictive modeling to new classes. Our work would enable the design of genetic circuits with custom-tuned riboswitch aptamers that would effect precise translational control in synthetic biology. The associated software is available as an open-source Python package and standalone resource for use in genome annotation, synthetic biology, and biotechnology workflows.

Availability:
PyPi package: riboflow @ https://pypi.org/project/riboflow
Repository with standalone suite of tools: https://github.com/RiboswitchClassifier
Language: Python 3.6 with numpy, keras, and tensorflow libraries.
License: MIT.

Keywords: riboswitch family, synthetic biology, machine learning, convolutional neural network, recurrent neural network, hyperparameter optimization, multiclass ROC, clustering


INTRODUCTION

Riboswitches are ubiquitous and critical metabolite-sensing gene expression regulators in bacteria that are capable of folding into at least two alternative conformations of 5′UTR mRNA secondary structure, which functionally switch gene expression between on and off states (Mandal et al., 2003; Roth and Breaker, 2009; Serganov and Nudler, 2013). The selection of conformation is dictated by the binding of the cognate ligand to the aptamer domain of a given riboswitch (Gelfand et al., 1999; Winkler et al., 2002, 2004). Cognate ligands are key metabolites that mediate responses to metabolic or external stimuli. Consequent to conformational changes, riboswitches weaken transcriptional termination or occlude the ribosome binding site, thereby inhibiting translation initiation of the associated genes (Yanofsky, 1981; Mandal and Breaker, 2004). Riboswitches provide an intriguing window into 'RNA world' biology (Stormo and Ji, 2001; Brantl, 2004; Breaker et al., 2006; Strobel and Cochrane, 2007), and there is evidence of their wider distribution in complex genomes (Sudarsan et al., 2003; Barrick and Breaker, 2007; Bocobza and Aharoni, 2014; McCown et al., 2017).

The modular properties of riboswitches have engendered the possibility of synthetic control of gene expression (Tucker and Breaker, 2005), and combined with the ability to engineer binding to an ad hoc ligand, riboswitches have turned out to be a valuable addition to the synthetic biologist's toolkit (Wieland and Hartig, 2008; Wittmann and Suess, 2012). In addition to orthogonal gene control, they are useful in a variety of applications, notably metabolic engineering (Zhou and Zeng, 2015), biosensor design (Yang et al., 2013; Meyer et al., 2015) and genetic electronics (Villa et al., 2018). Riboswitches have been used as basic computing units of a complex biocomputation network, where the concentration of the ligand of interest is titrated into measurable gene expression (Beisel and Smolke, 2009; Domin et al., 2017). Riboswitches have also been directly used as posttranscriptional and translational checkpoints in genetic circuits (Chang et al., 2012). Their key functional roles in infectious agents, coupled with their absence in host genomes, make them attractive targets for the design of cognate inhibitors (Blount and Breaker, 2006; Deigan and Ferré-D'Amaré, 2011; Wang et al., 2017). Characterisation of riboswitches would expand the repertoire of translational control options in synthetic biology and bioengineering. In turn, this would facilitate the reliable construction of precise genetic circuits. In view of their myriad applications, robust computational methods for the accurate characterisation of novel riboswitch sequences would be of great value.

Since the discovery of riboswitches (Mironov et al., 2002; Nahvi et al., 2002), many computational efforts have been advanced for their characterisation, notably Infernal (Nawrocki and Eddy, 2013), Riboswitch finder (Bengert and Dandekar, 2004), RibEx (Abreu-Goodger and Merino, 2005), RiboSW (Chang et al., 2009) and DRD (Havill et al., 2014), reviewed in Clote (2015) and Antunes et al. (2017). These methods used probabilistic models of known classes, with or without secondary structure information, to infer or predict the riboswitch class. Singh and Singh explored mono-nucleotide and di-nucleotide frequencies as features in a supervised machine learning framework to classify different riboswitch sequences, and concluded that the multi-layer perceptron was optimal (Singh and Singh, 2016). Their work achieved modest performance (F-score of 0.35 on 16 different riboswitch classes). None of the above methods were shown to generalize effectively to unseen riboswitches.

Our remedy was to explore the use of deep learning models for riboswitch classification. Deep networks are relatively recent neural network-based frameworks that use a type of learning known as representation learning (Bengio et al., 2013). Convolutional neural networks are one type of deep learning, known for hierarchical information extraction. Such architectures with alternating convolutional and pooling layers have been earlier used to extract structural and functional information from genome sequences (Alipanahi et al., 2015; Sønderby et al., 2015; Zhou and Troyanskaya, 2015; Kelley et al., 2016). Recurrent neural networks are counterparts to CNNs and specialize in extracting features from time-series data (Che et al., 2018). RNNs with Long Short-Term Memory (termed LSTM) incorporate recurrent connections to model long-run dependencies in sequential information (Hochreiter and Schmidhuber, 1997), such as in speech and video (Graves and Schmidhuber, 2005). This feature of LSTM RNNs immediately suggests their use in character-level modeling of biological sequence data (Lipton, 2015; Lo Bosco and Di Gangi, 2017). Bidirectional LSTM RNNs have been shown to be especially effective, given that they combine the outputs of two LSTM RNNs, one processing the sequence from left to right, the other from right to left, together enabling the capture of dynamic temporal or spatial behavior (Sundermeyer et al., 2014). Bidirectional LSTM RNNs are a particularly powerful abstraction for modeling nucleic acid sequences whose spatial secondary structure determines function (Lee et al., 2015). Two recent successes of deep learning methods in RNA biology have been: (i) prediction of RNA secondary structure (Singh et al., 2019), and (ii) dynamic range improvement in riboswitch devices (Groher et al., 2019). Here we have evaluated the relative merits of a spectrum of state-of-the-art learning methods for resolving the ligand-specificity of riboswitches from sequence. It is demonstrated that the deep learning models vastly outperformed other machine learning models with respect to the classification of riboswitches belonging to 32 different families.

MATERIALS AND METHODS

Dataset and Pre-processing

We searched the Rfam database of RNA families (Kalvari et al., 2018) with the term "Riboswitch AND family" and the corresponding hit sequences were obtained in FASTA format from the Rfam ftp server (Rfam v13, accessed on July 6, 2019). Each riboswitch is represented by the coding strand sequence, with uracil replaced by thymine, thereby conforming to the nucleotide alphabet 'ACGT.' Each sequence was scanned for non-standard letters (i.e., other than the alphabet) and such occurrences were corrected using the character mapping defined in Table 1.


TABLE 1 | Non-standard nucleotide mapping.

S. No.  Original letter  Mapped character  #Occurrences in dataset
1       R                G                 6
2       Y                T                 8
3       K                G                 1
4       S                G                 3
5       W                A                 2
6       H                A                 2

Rare occurrences of non-standard nucleotides in the sequences were converted using this mapping key.

The feature vectors for machine learning were extracted from the sequences. For each sequence, 20 features were computed, comprising four mononucleotide frequencies (A, C, G, T) and 16 dinucleotide frequencies. To address possible skew in distribution, all the frequency features were normalized to zero mean and unit variance. Deep models, namely convolutional neural networks (CNNs) and bidirectional recurrent neural networks with LSTM (hereafter simply referred to as RNNs), are capable of using the sequences directly as the feature space, obviating any need for feature engineering. We used the first 250 bases of the riboswitch sequence as the input, with the proviso that shorter sequences (which is usually the case; Table 2) were padded for the extra spaces. Python scripts used to create the final dataset are available in the repository for this project (https://github.com/RiboswitchClassifier). The dataset consists of the riboswitch sequence, four 1-mer frequencies, 16 2-mer frequencies, and the class, for each instance, which could be appropriately subsetted for training the base and deep models.

Predictive Modeling

The machine learning problem is simply stated as: given the riboswitch sequence, predict the ligand class of the riboswitch. Toward this, a battery of eight supervised machine learning and deep classifiers were studied and evaluated in the present work (Table 3). Classifiers derived from implementations in the Python scikit-learn machine learning library (Pedregosa et al., 2011; www.sklearn.com) are referred to as base models and include the Decision Tree, K-nearest Neighbors, Gaussian Naive Bayes, the ensemble classifiers AdaBoost and Random Forest, and the Multi-layer Perceptron. The deep classifiers, namely the CNN and RNN, derive from implementations in the Python keras library (http://keras.io) on tensorflow (Abadi et al., 2015). Three scripts in the repository, namely baseModels.py, rnnApp.py, and cnnApp.py, implement the base models, RNN, and CNN, respectively. For both the base and deep classifiers, the dataset was split into 0.9:0.1 training:test sets. Multi-class modeling is fraught with overfitting to particular classes (especially pronounced in cases of extreme class skew). To address this issue, two strategies were adopted: (i) the splitting process was stratified on the class, which ensured that each class was proportionately and sufficiently distributed in both the training and test sets, and (ii) hyperparameter optimisation, discussed below. A sketch of the pre-processing and split is shown below.
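The following is a minimal sketch of the pre-processing and stratified split described above, not the repository's actual scripts; the function names and the integer encoding of bases are assumptions made here for illustration, while the Table 1 mapping, the 20 k-mer features, the 250-base cap, and the 0.9:0.1 stratified split follow the text.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Mapping of rare non-standard letters to standard bases (Table 1), applied after U -> T.
NON_STANDARD = {"R": "G", "Y": "T", "K": "G", "S": "G", "W": "A", "H": "A"}
KMERS_1 = ["A", "C", "G", "T"]
KMERS_2 = ["".join(p) for p in product(KMERS_1, repeat=2)]  # 16 dinucleotides
MAX_LEN = 250  # adjustable max_len; longer sequences truncated, shorter ones padded


def clean(seq):
    """Coding-strand sequence -> 'ACGT' alphabet."""
    seq = seq.upper().replace("U", "T")
    return "".join(NON_STANDARD.get(base, base) for base in seq)


def kmer_features(seq):
    """20 features: 4 mononucleotide + 16 dinucleotide frequencies."""
    mono = [seq.count(k) / max(len(seq), 1) for k in KMERS_1]
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    di = [pairs.count(k) / max(len(pairs), 1) for k in KMERS_2]
    return mono + di


def encode(seq, max_len=MAX_LEN):
    """Integer-encode the first max_len bases (0 is reserved for padding)."""
    lookup = {"A": 1, "C": 2, "G": 3, "T": 4}
    ids = [lookup.get(base, 0) for base in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


def build_dataset(sequences, labels):
    """Z-scored k-mer features for base models, encoded sequences for deep models,
    split 0.9:0.1 and stratified on the 32 classes."""
    X_kmer = StandardScaler().fit_transform([kmer_features(clean(s)) for s in sequences])
    X_seq = np.array([encode(clean(s)) for s in sequences])
    return train_test_split(X_kmer, X_seq, np.array(labels),
                            test_size=0.1, stratify=labels, random_state=42)
```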
Hyperparameter Optimisation

Hyperparameter fine-tuning for each classifier was effected by a discrete combinatorial grid search on the hyperparameters associated with that classifier. The grid search was evaluated with 10-fold cross-validation of the training set. This yielded the optimal hyperparameters for each classifier. The scripts for hyperparameter optimisation of the base models are available in the repository. In the case of the deep models, we used a train-validate-test approach to model optimisation with keras/TensorFlow, by setting the 'validation' flag to 0.1.

Evaluation Metrics

The performance of each classifier was evaluated on the test set using receiver operating characteristic (ROC) analysis in addition to standard metrics such as the precision, recall, accuracy and F-score (harmonic mean of precision and recall) (van Rijsbergen, 1975). The ROC curve was obtained by plotting the TP rate vs. the FP rate, i.e., sensitivity vs. (1 – specificity), and the area under the ROC curve (AUROC) could be estimated to rate the model's performance. AUROC represents the probability that a given classifier would rank a randomly chosen positive instance higher than a randomly chosen negative one. ROC analysis is robust to class imbalance, typical of the machine learning problem at hand; however, a multi-class adaptation of the binary ROC is necessary. For each classifier, this is achieved by computing classwise binary AUROC values in a one-vs.-all manner, followed by aggregating the classwise AUROC values into a single multi-class AUROC measure (Yang, 1999; Manning et al., 2008). Aggregation of the classwise AUROC values could be done in at least two ways:

1. Micro-average AUROC, where each instance is given equal weight.
2. Macro-average AUROC, where each class is given equal weight.

In micro-averaging, all the instances from different classes are treated equally to arrive at the final metrics. In particular, the micro-average AUROC is given by the area under the overall TP rate vs. overall FP rate curve. On the other hand, the macro-average of a given metric for a multi-class prediction problem is estimated by the average of the metric for the individual classes. For example, the macro-average AUROC is given by:

Macro-average AUROC = (AUROC_1 + AUROC_2 + ... + AUROC_32) / 32

where AUROC_i is the binary AUROC for the i-th class. It is clear that the micro-average AUROC would be dominated by the larger classes, while the macro-average AUROC is a more balanced measure. Both the micro-average and macro-average AUROC measures were used to evaluate the performance of all our classifiers. A Python script, multiclass.py, available in the repository, generates all the performance metrics and plots.
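The one-vs.-all aggregation described above can be made concrete with scikit-learn. This is a minimal sketch, assuming y_true holds integer class labels and y_score holds a classifier's 32-column matrix of predicted class probabilities; it is not the multiclass.py script itself.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize


def multiclass_auroc(y_true, y_score, n_classes=32):
    """Classwise one-vs.-all AUROCs plus their micro- and macro-averages.

    y_true: (n_samples,) integer labels in [0, n_classes)
    y_score: (n_samples, n_classes) predicted class probabilities
    """
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))

    # One binary AUROC per class (one-vs.-all).
    classwise = [roc_auc_score(y_bin[:, i], y_score[:, i]) for i in range(n_classes)]

    # Macro-average: each class weighted equally.
    macro = float(np.mean(classwise))

    # Micro-average: pool all instance/class decisions into a single ROC curve.
    fpr, tpr, _ = roc_curve(y_bin.ravel(), y_score.ravel())
    micro = auc(fpr, tpr)

    return classwise, micro, macro
```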


TABLE 2 | A summary of the riboswitch dataset used in our study.

Class no. Rfam ID Class Name Class size Avg. length

1  RF00504  Glycine riboswitch  4592  100
2  RF01786  Cyclic di-GMP-II riboswitch  661  86
3  RF01750  ZMP/ZTP riboswitch  1674  92
4  RF00059  Thiamine pyrophosphate riboswitch  12559  110
5  RF01057  SAH riboswitch  832  92
6  RF01725  SAM-I/IV variant riboswitch  793  104
7  RF00162  SAM-1 riboswitch  6027  113
8  RF00174  Cobalamin riboswitch  14212  189
9  RF01055  Molybdenum co-factor riboswitch  1221  134
10  RF01727  SAM-SAH riboswitch  240  50
11  RF01482  AdoCbl riboswitch  182  137
12  RF03057  nhaA-I RNA  559  56
13  RF01734  Fluoride riboswitch  2018  70
14  RF00167  Purine riboswitch  2632  101
15  RF00234  Glucosamine-6-phosphate riboswitch  936  175
16  RF01739  Glutamine riboswitch  1103  64
17  RF03072  raiA RNA  736  219
18  RF03058  sul1 RNA  344  56
19  RF00380  ykoK riboswitch (magnesium sensing)  1059  170
20  RF00168  Lysine riboswitch  2240  180
21  RF03071  DUF1646  265  53
22  RF01689  AdoCbl variant RNA  212  125
23  RF00379  ydaO/yuaA leader  3918  164
24  RF00634  SAM-4 riboswitch  1245  116
25  RF01767  SAM-3 riboswitch  195  90
26  RF00080  yybP-ykoY manganese riboswitch  833  158
27  RF02683  NiCo riboswitch (nickel or cobalt sensing)  235  97
28  RF00442  Guanidine-I riboswitch  902  109
29  RF00522  Pre-queuosine riboswitch-1  533  45
30  RF00050  Flavin mononucleotide riboswitch  4080  142
31  RF01831  THF riboswitch  663  102
32  RF00521  SAM-2 riboswitch  819  78

The dataset includes a mixture of metabolite/ion/cofactor/amino-acid/nucleotide/vitamin/signaling-molecule aptamer ligands. ‘Class no.’ corresponds to the response classes to be learnt in machine learning. The average length of all sequences in a given class is also given.

TABLE 3 | Description of the base model and deep classifiers.

Classifier | Features used | Hyperparameters of interest | ML Library

Decision Tree | 1- and 2-mer frequencies | Maximum features, Minimum sample split, Minimum sample leaf, Random state, Max depth | Sklearn
Gaussian Naïve Bayes | 1- and 2-mer frequencies | Priors | Sklearn
k-Nearest Neighbors | 1- and 2-mer frequencies | Number of neighbors, Leaf size, Weights, Algorithm | Sklearn
Adaptive Boosting | 1- and 2-mer frequencies | Number of estimators, Learning rate, Algorithm | Sklearn
Random Forest | 1- and 2-mer frequencies | Number of estimators, Max depth, Impurity criterion | Sklearn
Multi-layer Perceptron | 1- and 2-mer frequencies | Activation, Solver, Alpha (regularization term), Learning rate, Epochs, Hidden layers, Nodes per hidden layer | Sklearn
CNN | Riboswitch sequence | Number of filters, Kernel size, Activation function, Pooling method, Number of Conv1D layers, Dropout ratio, Optimiser, #Epochs | TensorFlow (Keras)
Bidirectional RNN with LSTM | Riboswitch sequence | Activation function, Number of LSTM nodes, Number of Bidirectional layers, Dropout ratio, Optimiser, #Epochs | TensorFlow (Keras)

Hyperparameters noted for each classifier are meant to be representative. For the deep models, riboswitch sequences longer than 250 nucleotides were truncated to 250 nucleotides; this is an adjustable parameter (max_len) set to much larger than the average length of any riboswitch class. The Python 3 library used for the implementation of each machine learning model is noted.
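To illustrate the grid search with 10-fold cross-validation described under "Hyperparameter Optimisation", a minimal scikit-learn sketch follows. The parameter grids shown are small illustrative subsets of the Table 3 hyperparameters, not the search space used in the repository's optimisation scripts.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative grids over a subset of the Table 3 hyperparameters (not the published space).
grids = {
    "random_forest": (RandomForestClassifier(),
                      {"n_estimators": [100, 300, 500],
                       "max_depth": [None, 20, 40],
                       "criterion": ["gini", "entropy"]}),
    "mlp": (MLPClassifier(max_iter=1000),
            {"hidden_layer_sizes": [(64,), (128, 64)],
             "activation": ["relu", "tanh"],
             "alpha": [1e-4, 1e-3],
             "learning_rate": ["constant", "adaptive"]}),
}


def tune(X_train, y_train):
    """Grid-search each base model with 10-fold CV on the training set."""
    best = {}
    for name, (estimator, grid) in grids.items():
        search = GridSearchCV(estimator, grid, cv=10, scoring="f1_macro", n_jobs=-1)
        search.fit(X_train, y_train)
        best[name] = (search.best_params_, search.best_estimator_)
    return best
```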


Dynamic Extension of the Models

Genome sequencing of diverse, exotic prokaryotes is likely to yield new regulatory surprises mediated by riboswitches (Breaker, 2011). A model that could classify a fixed set of 32 classes remains static in the wake of the exponentially growing number of known genomes. To address the challenge of extending the model to new classes, we have devised a strategy for dynamically updating the model. This process is initiated by feeding the sequences corresponding to the new class(es) to an updater script, which revises the dataset and then trains a new model. The automation of modeling would ensure a user-friendly pipeline for learning any number of riboswitch classes, along with the production of performance metrics of such models. The script dynamic.py, available in the repository, implements the dynamic functionality of our modeling effort.

RESULTS

Our Rfam query retrieved 39 riboswitch class hits; however, seven of these classes had a membership of less than 100 sequences and were filtered out. Subsequently, our dataset consisted of 32 riboswitch classes with a total of 68,520 sequences. A summary of this dataset is presented in Table 2. The largest classes include the cobalamin and thiamine pyrophosphate classes, with >10,000 riboswitches in each, accounting for considerable diversity within classes. Classes with >4,000 members include the Flavin mononucleotide (FMN) and glycine riboswitches. Other notable classes with at least 1,000 members include the lysine, purine, fluoride and glutamine switches. The riboswitch sequences were inspected for the standard alphabet (Table 1), and the final pre-processed comma-separated values (csv) datafile, with each instance containing the sequence, the 20 features and the riboswitch class, was prepared (available in the Datasets folder at https://github.com/RiboswitchClassifier).

Table 3 recapitulates the key properties of the classifier models used in this study. We noticed poor performance of the base models on the test set with default model parameters, which could be traced to persistent overfitting (dominated by the larger classes), despite stratified sampling. Hyperparameter optimisation of the default parameters is one solution to address this problem and was carried out on the base models. The exercise is summarized in Appendix I (Supplementary File S1), which includes the final configuration of the hyperparameter space for all the base and deep classifiers. The optimized CNN and RNN architectures are illustrated in Figure 1. In the CNN, two convolutional layers were used, followed by a pooling layer and a dropout layer before flattening to a fully connected output layer. The RNN employed two bidirectional LSTM units sandwiched by dropout layers before flattening to a fully connected output layer. The number of training epochs necessary for each deep model was determined based on the convergence of the error function (shown in Figure 2).

With the optimized classifiers, the performance of the predictive modeling was evaluated on the unseen testing set. Figure 3 shows the resultant classifier performance by an array of metrics including accuracy and F-score. It is abundantly clear that the deep models vastly outperformed the base classifiers in all metrics across all classes. Figures 4, 5 show the ROC curves along with the micro-averaged and macro-averaged AUROC for the base models and the deep models, respectively. The AUROC is indicative of the quality of the overall model. It is seen that the AUROC is 1.00 for all classes for the RNN. Table 4 summarizes the performance of the classifiers, with the detailed classwise F-score of each classifier available in Supplementary Table S2 and the classwise break-up of the AUROC of all classifiers in Supplementary Table S3.

DISCUSSION

The RNN model marginally (but clearly) outperformed the CNN model, and both of them significantly outperformed all the base models on all key metrics, notably accuracy and F-score. The best-performing among the base models was the Multi-layer Perceptron. It is noteworthy that the AUROC approached unity and near-perfection for both the deep models, especially the bidirectional RNN with LSTM. This implied that the use of k-mer features masked long-range information, whereas the deep models were able to capture such correlations directly from the full sequence. These results affirmed that RNNs could be used to effectively model the interactions in biological sequences.

The F-score (a measure of balanced accuracy) is a more unforgiving metric than AUROC in the case of multi-class problems (Table 4). While the CNN and RNN had macro-averaged F-scores of 0.93 and 0.96, respectively, none of the base models exceeded 0.70, including the multilayer perceptron. Supplementary Table S2 provides the classwise F-scores of all classifiers. All the base models struggled to classify the largest riboswitch classes, namely TPP and Cobalamin. This is a consequence of the diversity of such large riboswitch classes, making the 'outlier' members harder to classify correctly. Both the RNN and CNN map the sequence input to its corresponding riboswitch family. In such a case, the sequence similarity within a family and the sequence dissimilarity across families represent plausible discriminating features that the models are learning. Higher order features such as sequence context and base dependencies dictated by RNA secondary structure also constitute learnable information. Table 5 shows the results of a sequence-based clustering analysis of the riboswitch families using the cd-hit algorithm (Li and Godzik, 2006). It is seen that there are >7,000 and >10,000 singleton clusters in the TPP and Cobalamin classes, respectively. Here we introduce a diversity metric for riboswitch families, defined as the ratio of the number of clusters at 90% sequence identity to the total size of the family. Compared to the overall diversity score of 0.7, both the TPP and Cobalamin classes had above-average diversities (0.71 and 0.82, respectively). However, these diverse classes did not pose any problems for the deep models. On the contrary, the AdoCbl and AdoCbl variant riboswitch classes posed significant learning challenges to both base and deep models. AdoCbl in particular is the smallest riboswitch family considered here, but is also unnaturally diverse (with a score of 0.83).


FIGURE 1 | Deep learning frameworks used in the study. (A), CNN architecture, optimized for two 1-dimensional convolutional layers; and (B), Bidirectional RNN with LSTM, optimized for two bidirectional layers. Two dropout layers are used in the RNN.
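A minimal Keras/TensorFlow sketch of architectures of the kind shown in Figure 1 is given below. The filter counts, LSTM widths, dropout ratios, embedding size, and optimizer are illustrative placeholders rather than the optimized values reported in Appendix I (Supplementary File S1); the integer-encoded 250-base input is assumed from the pre-processing sketch above.

```python
from tensorflow.keras import layers, models

N_CLASSES, MAX_LEN, VOCAB = 32, 250, 5  # 4 bases plus a padding token


def build_cnn():
    """Figure 1A style: two Conv1D layers, pooling and dropout, then a dense softmax."""
    return models.Sequential([
        layers.Embedding(VOCAB, 8, input_length=MAX_LEN),
        layers.Conv1D(64, 5, activation="relu"),   # illustrative filter count / kernel size
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])


def build_bilstm():
    """Figure 1B style: two bidirectional LSTM layers with two dropout layers."""
    return models.Sequential([
        layers.Embedding(VOCAB, 8, input_length=MAX_LEN),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Dropout(0.3),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.3),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])


model = build_bilstm()
# Integer class labels are assumed, hence the sparse categorical cross-entropy loss.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```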


FIGURE 2 | Epoch tuning curves for the CNN (A) and RNN (B). The CNN converges faster with respect to the number of epochs, however, the RNN learns better, as seen with the continuously decreasing loss function.
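Curves such as those in Figure 2 can be produced from the Keras History object. The sketch below assumes a compiled model and integer-encoded training arrays as in the earlier sketches; the epoch count and batch size are illustrative.

```python
import matplotlib.pyplot as plt


def plot_epoch_tuning(model, X_train, y_train, epochs=30, out="epoch_tuning.png"):
    """Fit with a 10% validation split and plot loss vs. epoch (cf. Figure 2)."""
    history = model.fit(X_train, y_train, validation_split=0.1,
                        epochs=epochs, batch_size=128, verbose=0)
    plt.plot(history.history["loss"], label="training loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.savefig(out, dpi=300)
    return history
```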

FIGURE 3 | Standard performance metrics. Clockwise from top left, Accuracy; Precision; F-score; and Recall. The overall precision, recall and F-score were computed by macro-averaging the classwise scores. The deep models emerged as vastly superior alternatives to the base machine learning models on all performance metrics.

This frustrates learning, because the limited number of instances does not adequately represent the class diversity, resulting in class outliers. Consequently, this emerged as the most challenging class for all classifiers, as reflected in the classwise F-scores (Supplementary Table S2). Four of the base classifiers managed a zero F-score, while the CNN and RNN achieved F-scores of 0.38 and 0.67, respectively, by far their lowest for any class. Even here, the consistency of the RNN model is remarkable. The other classes that were notably challenging to the base models but not to the deep models included the Cyclic di-GMP-II, ZMP/ZTP, SAM-I/IV variant, Molybdenum co-factor, Glucosamine-6-phosphate and Guanidine-I riboswitch classes.


FIGURE 4 | AUROC for the base models. (A) Decision Tree, (B) Gaussian NB, (C) kNN, (D) AdaBoost, (E) Random Forest, and (F) Multi-layer Perceptron. Gray lines denote classwise AUROCs of all 32 classes, from which it is clear that not all classes are equally learnt.

Of these, the Glucosamine-6-phosphate, Cyclic di-GMP-II, and Molybdenum co-factor riboswitch classes are among the most diverse riboswitch families, with diversity scores ≥0.80. It is seen that the classification problems arise either with diverse classes or at the extremes of class sizes. Too large a class, and its diversity is challenging; too small, and the learning itself is incomplete and challenged. The deep models – RNN and CNN – consistently performed well across all classes, independent of the size of the class. It could be inferred in this case that using direct features (i.e., sequences) rather than engineered features (i.e., k-mer frequencies) led to more robust models. From Table 5, it is also clear that most of the learnt classes (especially the large ones) are diverse, thereby elevating the classifier performance to robustness against adversarial attacks – that is, changing a few nucleotides in the input sequence would be unlikely to drastically alter the class prediction.


FIGURE 5 | AUROC for the deep models. (A) CNN, and (B) RNN. Gray lines denote classwise AUROCs of all 32 classes. It is clear that the RNN achieves learning perfection at both the macro and classwise levels.

TABLE 4 | Performance metrics for all classifiers.

Model Accuracy Precision Recall F-score Micro AUROC Macro AUROC

Decision Tree  0.54  0.49  0.39  0.42  0.88  0.81
Gaussian Naïve Bayes  0.37  0.39  0.46  0.38  0.90  0.89
K-neighbors  0.67  0.75  0.55  0.61  0.94  0.88
AdaBoost  0.47  0.42  0.36  0.36  0.89  0.92
Random Forest  0.71  0.86  0.58  0.65  0.98  0.97
Multi-layer Perceptron  0.74  0.75  0.67  0.70  0.99  0.98
CNN  0.97  0.98  0.91  0.93  1.00  1.00
RNN  0.99  0.97  0.96  0.96  1.00  1.00

The macro-averaged values of precision, recall and F-score are shown. Micro AUROC, micro-average AUROC; Macro AUROC, macro-average AUROC.
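The summary metrics of Table 4 correspond to macro-averaging in scikit-learn; a minimal sketch, assuming integer-encoded true and predicted labels, is shown below (it is not the repository's multiclass.py).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summary_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F-score (cf. Table 4)."""
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f_score": fscore}
```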

These results might be put in perspective by benchmarking against the existing literature. Guillén-Ramírez and Martínez-Pérez (2018) extended the k-mer features logic and arrived at an optimal combination of 5460 k-mer features. Using a limited dataset of 16 classes, they used state-of-the-art machine learning to achieve accuracies in the high nineties; however, their results did not generalize equally to riboswitches with remote homology. For example, their best-performing classifier (the Multi-layer Perceptron) misclassified 6 out of the 225 instances of the Lysine riboswitch as cobalamin-gated. The source code for the features and modeling used in their work is not readily available for new applications. To make the workflow described in our study easily reproducible and user-friendly, we have developed a Python package, riboflow (https://pypi.org/project/riboflow/), mirroring the best RNN model. In addition to predicting the most probable class of a given riboswitch sequence, riboflow provides an option to predict the complete vector of class probabilities, which could be helpful in disambiguating any class confusion. It would also inform the design of synthetic orthogonal riboswitches for biotechnology applications. The implementation and usage details are provided in Appendix II (Supplementary File S1).

An interesting benchmark is afforded by Riboswitch Scanner (Mukherjee and Sengupta, 2016), which used profile HMMs (Eddy, 2011) of riboswitch classes to detect riboswitches in genomic sequences. While our method addresses inter-class discrimination of riboswitch sequences, Riboswitch Scanner is a web-server that essentially performs riboswitch vs. not-riboswitch discrimination for user-given riboswitch classes. The absence of F-score metrics does not allow for direct comparisons; however, the sensitivity and specificity seemed consistently comparable for most classes, with noticeable variations with respect to the Glycine, THF and SAM-I/IV variant riboswitch classes. It must be indicated that their method was validated with Rfam seed sequences, without consideration for the proliferation of riboswitch sequences. Performance evaluation on limited data could inflate performance estimates and complicate their interpretation. We further note that it is possible to extend our method to the 'riboswitch-or-not' classification problem by calibrating the prediction probability thresholds. In any case, our work enables the ranking of riboswitches using the strength of the predicted probabilities, which would aid the selection of the best riboswitch sequence design.


TABLE 5 | Clustering analysis of riboswitch families at 90% sequence identity.

Rfam ID #Clusters at 90% Diversity #Singleton clusters #Redundant sequences #Clusters in the rest

RF00050  2484  0.61  1866  46  45208
RF00059  8954  0.71  7366  272  38738
RF00080  611  0.73  493  29  47081
RF00162  3736  0.62  2722  141  43956
RF00167  2122  0.81  1818  59  45570
RF00168  1905  0.85  1707  45  45787
RF00174  11620  0.82  10130  289  36077
RF00234  814  0.87  731  11  46878
RF00379  2583  0.66  2044  97  45109
RF00380  441  0.42  265  35  47251
RF00442  600  0.67  449  46  47092
RF00504  3010  0.66  2345  66  44682
RF00521  495  0.61  379  20  47197
RF00522  214  0.40  113  28  47478
RF00634  501  0.40  317  12  47191
RF01055  978  0.80  865  73  46714
RF01057  629  0.76  519  13  47063
RF01482  150  0.83  132  12  47546
RF01689  131  0.62  101  2  47562
RF01725  435  0.55  326  25  47257
RF01727  149  0.62  105  16  47543
RF01734  1616  0.80  1389  127  46076
RF01739  344  0.31  215  56  47348
RF01750  1074  0.64  821  94  46618
RF01767  122  0.63  88  12  47570
RF01786  555  0.84  480  49  47137
RF01831  450  0.68  327  57  47242
RF02683  159  0.68  122  31  47533
RF03057  359  0.64  266  70  47333
RF03058  97  0.28  57  17  47595
RF03071  86  0.33  48  29  47606
RF03072  273  0.37  141  72  47419
ALL  47686  0.70  38871  1951  –

Singleton clusters indicate clusters with only one representative sequence. Redundant sequences are 100% identical to another member in the riboswitch family. The full set of riboswitch sequences (ALL) and ALL minus the riboswitch family of interest were also clustered.
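The diversity metric introduced in the Discussion (number of clusters at 90% identity divided by family size) can be computed from cd-hit-est output. The sketch below is illustrative; the file names and output prefix are assumptions, and it presumes the cd-hit-est binary is installed and on the PATH.

```python
import subprocess


def family_diversity(fasta, family_size, out_prefix="family90"):
    """Cluster one family at 90% identity and return its diversity score.

    Diversity = (#clusters at 90% identity) / (family size), as in Table 5.
    A word size of 8 (-n 8) is appropriate for the -c 0.9 threshold.
    """
    subprocess.run(["cd-hit-est", "-i", fasta, "-o", out_prefix,
                    "-c", "0.9", "-n", "8"], check=True)

    # Parse the .clstr report: '>Cluster N' headers, one line per member below each.
    sizes = []
    with open(out_prefix + ".clstr") as handle:
        for line in handle:
            if line.startswith(">Cluster"):
                sizes.append(0)
            else:
                sizes[-1] += 1

    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes if size == 1)
    return n_clusters / family_size, n_clusters, n_singletons
```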

It must be noted that riboswitches are precisely specific to cognate ligands. For example, the AdoCbl riboswitch would not tolerate a methyl-substituted cobalamin (Nahvi, 2004), nor does the TPP riboswitch interact with thiamine or thiamine monophosphate (Lang et al., 2007). At the same time, these two riboswitches are very diverse in their phylogenomic distribution and actual sequences. The key to effective learning lies in treading a fine line between intra-class diversity and inter-class specificity. It is remarkable that the bidirectional LSTM RNN was able to achieve exactly this tradeoff. The roots of such performance of deep models in general have recently been explained in terms of the lottery ticket hypothesis (Frankle and Carbin, 2019) as well as learning the intrinsic dimension of the problem (Li et al., 2018), here the classification of riboswitches.

To extend the functionality of our work, we have introduced a dynamic component to all our models, both base and deep. With the exponential growth in genome sequencing, the room for riboswitch discovery is enormous (Yu and Breaker, 2020). Our models could accommodate new riboswitch class definitions by way of dataset augmentation, thereby making them general and more robust. This work used the dynamic functionality to extend a preliminary 24-class model to the present 32 classes with sustained performance. The implementation and usage details of the dynamic functionality and other utilities provided in the repository are given in Appendix II (Supplementary File S1). It is noted that the deep learning models could be adapted to new classes and related problems by the technique of transfer learning (Weiss et al., 2016). Addition of new data to existing models presents data quality issues, which remain contentious (Leonelli, 2019), and could be partially addressed using the tools employed in this study to assess the canonical Rfam dataset.

In summary, we present riboflow, a Python package (https://pypi.org/project/riboflow/) as well as a standalone suite of tools, that has been validated and thoroughly tested on 32 riboswitch classes. By using large and complete datasets, the variance of our modeling procedure has been optimized, and this ensures the generality and applicability of our models on new instances/classes without compromise of performance.


riboflow is an off-the-shelf solution that is ready for programmatic incorporation of the RNN model into automatic annotation and design pipelines. All of our trained models are available to the interested user as 'pickled' models from https://github.com/RiboswitchClassifier. Riboswitches are a cornerstone of progress in synthetic biology, representing key checkpoints for translation activation. Our work presents an intuitive, general-purpose, extensible platform for the effortless characterization of new riboswitch sequences and classes, which would accelerate bacterial genome annotation, synthetic biology, and biotechnology, including the rapid design of novel genetic circuits with exquisite specificity. The predicted probabilities of class membership could be used as a proxy of aptamer binding strength with the cognate signaling molecule, and this paves the way for the design of effective riboswitches for any stimulus or set of stimuli. In addition to being indispensable workhorses in synthetic biology, riboswitches represent novel and exciting targets for the development of a new class of antibiotics (Penchovsky et al., 2020), and our work would also help toward the design of riboswitch inhibitors to combat emerging and multi-drug resistant pathogens. In addition, our work opens up the applications of deep learning methods, including advanced relatives like stacked Bi-LSTM and attention models (Vaswani et al., 2017), for addressing related and unrelated problems in the biological realm.

CONCLUSION

We have demonstrated that the CNN and RNN, without needing prior feature extraction, are capable of robust multi-class learning of ligand specificity from riboswitch sequence, with the RNN posting an F-score of ∼0.96. The confidence of classification could be obtained from an inspection of the predicted classwise probabilities. The bidirectional LSTM RNN model has been packaged into riboflow to enable embedding into genome annotation pipelines, genetic circuit-design automation, and biotechnology workflows. The CNN shows the best tradeoff between the time-cost of training the model and overall performance, and could be applied to the task of learning new riboswitch classes using the provided dynamic update option. All the code used in our study is freely available for any use and further improvement by the scientific community, as well as in the larger interest of reproducible research. Our study has highlighted the use of the macro-averaged F-score as a discriminating objective metric of classifier performance on multi-class data. Our work reaffirms the competitive advantages of bidirectional LSTM RNNs over conventional machine learning and hidden Markov profiles in modeling sequence data, and opens up their applications for modeling other non-coding RNA elements.

DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS

AP conceived and designed the work and wrote the manuscript. KP, RB, and AP performed the experiments and analyzed the data. All authors contributed to the article and approved the submitted version.

FUNDING

This work was funded in part by DST-SERB grant EMR/2017/000470 and the SASTRA TRR grant (to AP).

ACKNOWLEDGMENTS

We are grateful to the School of Chemical and Biotechnology, SASTRA Deemed University for infrastructure and computing support. A preliminary version of the manuscript has been released as a pre-print at bioRxiv (Premkumar et al., 2019).

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe.2020.00808/full#supplementary-material

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Available online at: www.tensorflow.org (accessed May 10, 2018).
Abreu-Goodger, C., and Merino, E. (2005). RibEx: a web server for locating riboswitches and other conserved bacterial regulatory elements. Nucleic Acids Res. 33(Suppl. 2), W690–W692.
Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838. doi: 10.1038/nbt.3300
Antunes, D., Jorge, N., Caffarena, E. R., and Passetti, F. (2017). Using RNA sequence and structure for the prediction of riboswitch aptamer: a comprehensive review of available software and tools. Front. Genet. 8:231. doi: 10.3389/fgene.2017.00231
Barrick, J. E., and Breaker, R. R. (2007). The distributions, mechanisms, and structures of metabolite-binding riboswitches. Genome Biol. 8:R239.
Beisel, C. L., and Smolke, C. D. (2009). Design principles for riboswitch function. PLoS Comput. Biol. 5:e1000363. doi: 10.1371/journal.pcbi.1000363
Bengert, P., and Dandekar, T. (2004). Riboswitch finder—a tool for identification of riboswitch RNAs. Nucleic Acids Res. 32(Suppl. 2), W154–W159.
Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intelligence 35, 1798–1828. doi: 10.1109/tpami.2013.50
Blount, K. F., and Breaker, R. R. (2006). Riboswitches as antibacterial drug targets. Nat. Biotechnol. 24, 1558–1564. doi: 10.1038/nbt1268


Bocobza, S. E., and Aharoni, A. (2014). Small molecules that interact with RNA: riboswitch-based gene control and its involvement in metabolic regulation in plants and algae. Plant J. 79, 693–703. doi: 10.1111/tpj.12540
Brantl, S. (2004). Bacterial gene regulation: from transcription attenuation to riboswitches and ribozymes. Trends Microbiol. 12, 473–475. doi: 10.1016/j.tim.2004.09.008
Breaker, R. R. (2011). Prospects for riboswitch discovery and analysis. Mol. Cell 43, 867–879. doi: 10.1016/j.molcel.2011.08.024
Breaker, R. R., Gesteland, R. F., Cech, T. R., and Atkins, J. F. (2006). The RNA World. New York, NY: Cold Spring Harbor Laboratory Press.
Chang, C. L., Lei, Q., Lucks, J. B., Segall-Shapiro, T. H., Wang, D., and Mutalik, V. K. (2012). An adaptor from translational to transcriptional control enables predictable assembly of complex regulation. Nat. Methods 9, 1088–1094. doi: 10.1038/nmeth.2184
Chang, T.-H., Huang, H.-D., Wu, L.-C., Yeh, C.-T., Liu, B.-J., and Horng, J.-T. (2009). Computational identification of riboswitches based on RNA conserved functional sequences and conformations. RNA 15, 1426–1430. doi: 10.1261/rna.1623809
Che, Z., Purushotham, S., Cho, K., Sontag, D., and Liu, Y. (2018). Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8:6085. doi: 10.1038/s41598-018-24271-9
Clote, P. (2015). Computational prediction of riboswitches. Methods Enzymol. 553, 287–312. doi: 10.1016/BS.MIE.2014.10.063
Deigan, K. E., and Ferré-D'Amaré, A. R. (2011). Riboswitches: discovery of drugs that target bacterial gene-regulatory RNAs. Acc. Chem. Res. 44, 1329–1338. doi: 10.1021/ar200039b
Domin, G., Findeiß, S., Wachsmuth, M., Will, S., Stadler, P. F., and Mörl, M. (2017). Applicability of a computational design approach for synthetic riboswitches. Nucleic Acids Res. 45, 4108–4119.
Eddy, S. R. (2011). Accelerated profile HMM searches. PLoS Comput. Biol. 7:e1002195. doi: 10.1371/journal.pcbi.1002195
Frankle, J., and Carbin, M. (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1803.03635 (accessed March 4, 2019).
Gelfand, M. S., Mironov, A. A., Jomantas, J., Kozlov, Y. I., and Perumov, D. A. (1999). A conserved RNA structure element involved in the regulation of bacterial riboflavin synthesis genes. Trends Genet. 15, 439–442. doi: 10.1016/s0168-9525(99)01856-9
Graves, A., and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Net. 18, 602–610. doi: 10.1016/j.neunet.2005.06.042
Groher, A. C., Jager, S., Schneider, C., Groher, F., Hamacher, K., and Suess, B. (2019). Tuning the performance of synthetic riboswitches using machine learning. ACS Synth. Biol. 8, 34–44. doi: 10.1021/acssynbio.8b00207
Guillén-Ramírez, H. A., and Martínez-Pérez, I. M. (2018). Classification of riboswitch sequences using k-mer frequencies. Biosystems 174, 63–76. doi: 10.1016/j.biosystems.2018.09.001
Havill, J. T., Bhatiya, C., Johnson, S. M., Sheets, J. D., and Thompson, J. S. (2014). A new approach for detecting riboswitches in DNA sequences. Bioinformatics 30, 3012–3019. doi: 10.1093/bioinformatics/btu479
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735
Kalvari, I., Argasinska, J., Quinones-Olvera, N., Nawrocki, E. P., Rivas, E., Eddy, S. R., et al. (2018). Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 46, D335–D342. doi: 10.1093/nar/gkx1038
Kelley, D. R., Snoek, J., and Rinn, J. (2016). Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999. doi: 10.1101/gr.200535.115
Lang, K., Rieder, R., and Micura, R. (2007). Ligand-induced folding of the thiM TPP riboswitch investigated by a structure-based fluorescence spectroscopic approach. Nucleic Acids Res. 35, 5370–5378. doi: 10.1093/nar/gkm580
Lee, B., Lee, T., Na, B., and Yoon, S. (2015). DNA-level splice junction prediction using deep recurrent neural networks. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1512.05135 (accessed August 25, 2019).
Leonelli, S. (2019). Philosophy of biology: the challenges of big data biology. eLife 8:e47381. doi: 10.7554/eLife.47381
Li, C., Farkhoor, H., Liu, R., and Yosinski, J. (2018). Measuring the intrinsic dimension of objective landscapes. arXiv [Preprint]. Available online at: https://arxiv.org/pdf/1804.08838.pdf (accessed July 3, 2019).
Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659. doi: 10.1093/bioinformatics/btl158
Lipton, Z. C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1506.00019 (accessed May 29, 2015).
Lo Bosco, G., and Di Gangi, M. (2017). Deep learning architectures for DNA sequence classification. Lecture Notes Comput. Sci. 10147, 162–171. doi: 10.1007/978-3-319-52962-2_14
Mandal, M., Boese, B., Barrick, J. E., Winkler, W. C., and Breaker, R. R. (2003). Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell 113, 577–586. doi: 10.1016/s0092-8674(03)00391-x
Mandal, M., and Breaker, R. R. (2004). Gene regulation by riboswitches. Nat. Rev. Mol. Cell Biol. 5, 451–463.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
McCown, P. J., Corbino, K. A., Stav, S., Sherlock, M. E., and Breaker, R. R. (2017). Riboswitch diversity and distribution. RNA 23, 995–1011. doi: 10.1261/rna.061234.117
Meyer, A., Pellaux, R., Potot, S., Becker, K., Hohmann, H. P., Panke, S., et al. (2015). Optimization of a whole-cell biocatalyst by employing genetically encoded product sensors inside nanolitre reactors. Nat. Chem. 7, 673–678. doi: 10.1038/nchem.2301
Mironov, A. S., Gusarov, I., Rafikov, R., Lopez, L. E., Shatalin, K., Kreneva, R. A., et al. (2002). Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria. Cell 111, 747–756. doi: 10.1016/S0092-8674(02)01134-0
Mukherjee, S., and Sengupta, S. (2016). Riboswitch scanner: an efficient pHMM-based web-server to detect riboswitches in genomic sequences. Bioinformatics 32, 776–778. doi: 10.1093/bioinformatics/btv640
Nahvi, A. (2004). Coenzyme B12 riboswitches are widespread genetic control elements in prokaryotes. Nucleic Acids Res. 32, 143–150. doi: 10.1093/nar/gkh167
Nahvi, A., Sudarsan, N., Ebert, M. S., Zou, X., Brown, K. L., and Breaker, R. R. (2002). Genetic control by a metabolite binding mRNA. Chem. Biol. 9, 1043–1049. doi: 10.1016/S1074-5521(02)00224-7
Nawrocki, E. P., and Eddy, S. R. (2013). Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935. doi: 10.1093/bioinformatics/btt509
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Penchovsky, R., Pavlova, N., and Kaloudas, D. (2020). RSwitch: a novel bioinformatics database on riboswitches as antibacterial drug targets. IEEE/ACM Trans. Comput. Biol. Bioinform. 99:1. doi: 10.1109/TCBB.2020.2983922
Premkumar, K. A. R., Bharanikumar, R., and Palaniappan, A. (2019). Riboflow: using deep learning to classify riboswitches with ∼99% accuracy. bioRxiv [Preprint]. doi: 10.1101/868695
Roth, A., and Breaker, R. R. (2009). The structural and functional diversity of metabolite-binding riboswitches. Annu. Rev. Biochem. 78, 305–334. doi: 10.1146/annurev.biochem.78.070507.135656
Serganov, A., and Nudler, E. (2013). A decade of riboswitches. Cell 152, 17–24. doi: 10.1016/j.cell.2012.12.024
Singh, J., Hanson, J., Paliwal, K., and Zhou, Y. (2019). RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat. Commun. 10:5407.
Singh, S., and Singh, R. (2016). Application of supervised machine learning algorithms for the classification of regulatory RNA riboswitches. Brief. Funct. Genomics 16, 99–105. doi: 10.1093/bfgp/elw005
Sønderby, S. K., Sønderby, C. K., Nielsen, H., and Winther, O. (2015). Convolutional LSTM networks for subcellular localization of proteins. arXiv [Preprint]. doi: 10.1007/978-3-319-21233-3_6
Stormo, G. D., and Ji, Y. (2001). Do mRNAs act as direct sensors of small molecules to control their expression? Proc. Natl. Acad. Sci. U.S.A. 98, 9465–9467. doi: 10.1073/pnas.181334498


Strobel, S. A., and Cochrane, J. C. (2007). RNA catalysis: ribozymes, ribosomes, and riboswitches. Curr. Opin. Chem. Biol. 11, 636–643. doi: 10.1016/j.cbpa.2007.09.010
Sudarsan, N., Barrick, J. E., and Breaker, R. R. (2003). Metabolite-binding RNA domains are present in the genes of eukaryotes. RNA 9, 644–647. doi: 10.1261/rna.5090103
Sundermeyer, M., Alkhouli, T., Wuebker, J., and Ney, H. (2014). "Translation modeling with bidirectional recurrent neural networks," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, 14–25.
Tucker, B. J., and Breaker, R. R. (2005). Riboswitches as versatile gene control elements. Curr. Opin. Struct. Biol. 15, 342–348. doi: 10.1016/j.sbi.2005.05.003
van Rijsbergen, C. J. (1975). Information Retrieval. London: Butterworths.
Vaswani, A., Jones, L., Shazeer, N., Parmar, N., Gomez, A. N., Uszkoreit, J., et al. (2017). "Attention is all you need," in Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA.
Villa, J. K., Su, Y., Contreras, L. M., and Hammond, M. C. (2018). Synthetic biology of small RNAs and riboswitches. Microbiol. Spectr. 6:10.1128/microbiolspec.RWR-0007-2017. doi: 10.1128/microbiolspec.rwr-0007-2017
Wang, H., Mann, P. A., Xiao, L., Gill, C., Galgoci, A. M., Howe, J. A., et al. (2017). Dual-targeting small-molecule inhibitors of the Staphylococcus aureus FMN riboswitch disrupt riboflavin homeostasis in an infectious setting. Cell Chem. Biol. 24, 576–588. doi: 10.1016/j.chembiol.2017.03.014
Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. J. Big Data 3:9.
Wieland, M., and Hartig, J. S. (2008). Artificial riboswitches: synthetic mRNA-based regulators of gene expression. Chembiochem 9, 1873–1878. doi: 10.1002/cbic.200800154
Winkler, W., Nahvi, A., and Breaker, R. R. (2002). Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression. Nature 419, 952–956. doi: 10.1038/nature01145
Winkler, W. C., Nahvi, A., Roth, A., Collins, J. A., and Breaker, R. R. (2004). Control of gene expression by a natural metabolite-responsive ribozyme. Nature 428, 281–286.
Wittmann, A., and Suess, B. (2012). Engineered riboswitches: expanding researchers' toolbox with synthetic RNA regulators. FEBS Lett. 586, 2076–2083. doi: 10.1016/j.febslet.2012.02.038
Yang, J., Seo, S. W., Jang, S., Shin, S. I., Lim, C. H., Roh, T. Y., et al. (2013). Synthetic RNA devices to expedite the evolution of metabolite-producing microbes. Nat. Commun. 4:1413. doi: 10.1038/ncomms2404
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. J. Inf. Retr. 1, 67–88.
Yanofsky, C. (1981). Attenuation in the control of expression of bacterial operons. Nature 289, 751–758. doi: 10.1038/289751a0
Yu, D., and Breaker, R. R. (2020). A bacterial riboswitch class senses xanthine and uric acid to regulate genes associated with purine oxidation. RNA doi: 10.1261/rna.075218.120 [Epub ahead of print].
Zhou, J., and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934. doi: 10.1038/nmeth.3547
Zhou, L. B., and Zeng, A. P. (2015). Engineering a lysine-ON riboswitch for metabolic control of lysine production in Corynebacterium glutamicum. ACS Synth. Biol. 4, 1335–1340. doi: 10.1021/acssynbio.5b00075

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Premkumar, Bharanikumar and Palaniappan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
