Search for Novel Insecticidal Protein Genes in Genomic And

Dvorkina et al. Microbiome (2021) 9:149 https://doi.org/10.1186/s40168-021-01092-z RESEARCH Open Access ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs Tatiana Dvorkina1, Anton Bankevich2, Alexei Sorokin3, Fan Yang4,5, Boahemaa Adu-Oppong4,6, Ryan Williams4, Keith Turner4 and Pavel A. Pevzner2* Abstract Background: Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacteria, Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered through multiple contigs. As a result, existing gene prediction tools (that analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. Methods: Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. Results: We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of greater diversity of IPGs from (meta)genomes. Conclusions: We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes “hidden” in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify even more novel IPGs for phenotypic testing than would previously be inaccessible by traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes. Keywords: Bioinformatics, Gene finding, Bacterial genomics, Metagenomics, Bioinsecticides * Correspondence: [email protected] 2Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA Full list of author information is available at the end of the article © The Author(s). 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Dvorkina et al. Microbiome (2021) 9:149 Page 2 of 14 Introduction which are all limited by the success/failure of the primer Biopesticides are important components of pest manage- selection had only been applied to the discovery of the ment programs that have been successful as an alterna- three-domain Cry genes [48]. tive to conventional chemical pesticides. These Next-generation sequencing opened new possibilities compounds, which are developed from the plant, animal, for IPG discovery as novel Cry and Cyt genes in a newly and bacterial proteins, do not leave harmful residues, are sequenced genome can be found by similarity search non-toxic to humans and the environment, and are against a database of known genes [52]. In particular, more target-specific than conventional pesticides [51]. the similarity search based on Hidden Markov Models These advantages led to a worldwide proliferation of (HMMs) allows one to reveal more diverged Cry genes biopesticides and resulted in a multi-billion-dollar bio- than those found using PCR-based methods. However, pesticide market. since Cry genes are rather variable, their HMMs typic- Insecticidal proteins, representing an important class ally represent only the main sequence domains rather of biopesticides, have been widely used in agriculture. than complete Cry genes. Ye et al. [67] and Zheng et al. Entomopathogenic bacteria, especially strains of the spe- [70] developed the BtToxin_scanner and BtToxin_Dig- cies Bacillus thuringiensis (Bt), produce crystal (Cry) and ger tools that use machine learning techniques to make cytolytic (Cyt) insecticidal proteins and secreted vegeta- the search for Cry genes more sensitive. BtToxin_scan- tive insecticidal proteins (VIPs) that specifically target ner was applied for Cry gene identification in various various insects, including insects from the orders Lepi- studies [13, 19, 57]. doptera, Coleoptera, Hemiptera, and Diptera [51]. In- However, all existing methods for IPG discovery are secticidal proteins are used to control pests of crop limited in their ability to reconstruct complete genes plants by mechanical methods, such as spraying to dis- when their fragments scattered over multiple contigs. perse microbial formulations containing various bacterial Since popular general-purpose gene prediction tools strains onto plant surfaces, and by using genetic trans- GeneMark [7], Prodigal [32], and Glimmer [20], as well formation techniques to produce transgenic plants ex- as their metagenomic versions metaGeneMark [72], pressing insecticidal proteins. Indeed, the development metaProdigal [33], and metaGlimmer [37], analyze indi- of insecticidal transgenic crops has been transformative vidual contigs, they typically predict partial rather than for agriculture. In 2017, 101 million hectares of cropland complete IPGs, a bottleneck in the downstream IPG en- were devoted to their cultivation across the world and gineering efforts in agricultural genomics. the adoption of specific transgenic crops has been asso- Development of a candidate IPG into a commercially ciated with the reduction or elimination of broad- viable toxin is a complex and time-consuming process spectrum synthetic chemical insecticides in those envi- that includes (i) prioritization of novel candidate IPGs ronments [56]. for follow-up synthesis, (ii) synthesis and expression of Although insecticidal proteins from B. thuringiensis selected IPGs for follow-up novel toxin production, and have become an important biopesticide against a wide (iii) testing these novel toxins against various insects. range of insects, their prolonged use leads to rapidly de- ORFograph contributes to the first step of this pipeline veloping toxin resistance [24]. Thus, it is important to by providing additional candidate IPGs whose parts are search for novel insecticidal proteins that are effective in scattered over multiple contigs and thus were not avail- controlling resistant insect populations. Although the able for a follow-up analysis in previous studies. This number of known Cry-encoding genes grew from just 14 new stream of novel IPGs is important not only for agri- 30 years ago [30] to over 700 today, there is a constant cultural genomics but also for biomedicine since some need to identify new insecticidal protein genes (IPGs) to Cry proteins, such as parasporins, preferentially kill can- overcome insecticide resistance. Since B. thuringiensis is cer cells [50]. indigenous to many environments (its strains have been ORFograph searches for novel IPGs in the assembly isolated worldwide from soil, insects, and leaves), gen- graphs (rather than individual contigs) that are generated omic and metagenomic samples containing B. thurin- by modern genome assemblers such as SPAdes [5] and giensis or other entomopathogenic bacterial strains metaSPAdes [49]. Given a read-set, SPAdes and metaS- provide many opportunities for finding novel IPGs [59]. PAdes first construct the de Bruijn graph that consists However, the search for novel IPGs faces computational of nodes (k-mers that appear frequently in reads) and challenges that we describe below. edges connecting these nodes that are labeled by sub- Initially, the Cry-encoding genes were searched for by strings from reads [17]. Since each error in a read cre- PCR-based techniques using primers from their highly ates a bubble in the de Bruijn graph (making this graph conserved regions [6, 10]. The basic PCR step was very complex), SPAdes and metaSPAdes error-correct followed by variations such as E-PCR [35], PCR-RFLP reads and transform the de Bruijn graph into a simpler [29], and PCR-SSCP [39]. Historically, these methods, assembly graph. In the case of an “ideal” assembly graph, Dvorkina et al. Microbiome (2021) 9:149 Page 3 of 14 a genome is spelled by a path that visits all edges of the underexplored metagenomic datasets. ORFograph uses assembly graph. the SPAligner tool for graph-based sequence

Search for Novel Insecticidal Protein Genes in Genomic And

Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

Microbial Gene Identification Using Interpolated Markov Models Steven L

Steven L. Salzberg

By Katherine E. L. Cox a Dissertation Submitted to Johns Hopkins University in Conformity with the Requirements for the Degree O

Improved Microbial Gene Identification with GLIMMER

The Effect of Machine Learning Algorithms on Metagenomics Gene

Genome Sequencing and Annotation

Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering

MCB 1200/1201, Virus Hunting

Machine Learning for Metagenomics: Methods and Tools

Glimmer-MG Release Notes Version 0.1

Information Theory, Graph Theory and Bayesian Statistics Based Improved and Robust Methods in Genome Assembly