A Modern Tool-Suite for Detecting Long Terminal Repeat Retrotransposons De-Novo on the Genomic Scale

bioRxiv preprint doi: https://doi.org/10.1101/448969; this version posted October 22, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Valencia and Girgis SOFTWARE LtrDetector: A modern tool-suite for detecting long terminal repeat retrotransposons de-novo on the genomic scale Joseph D Valencia and Hani Z Girgis* of these regions is the prevalence of transposable elements (TEs), a type of repeated sequence. TEs include class I elements, which replicate using RNA to \copy-and-paste" themselves, and class II elements, Abstract which replicate via a \cut-and-paste" mechanism using Long terminal repeat retrotransposons are the DNA as an intermediate [1]. Barbara McLintock dis- most abundant transposons in plants. They play covered transposons in 1950 while studying the maize important roles in alternative splicing, genome [2]. TEs are common to all eukaryotes, com- recombination, gene regulation, and genomic prising around 45% of the human genome and up to evolution. Large-scale sequencing projects for plant 80% of some plants like maize and wheat [3,4]. genomes are currently underway. Software tools are TEs have several important functions. Bennetzen important for annotating long terminal repeat and Wang highlight the known functions of plant retrotransposons in these newly available genomes. TEs [5]. Transposons are the major factor affecting However, the available tools are not very sensitive the sizes of plant genomes [6{8]. Under stressful con- to known elements and perform inconsistently on ditions, they can rearrange a genome [9{11]. TEs different genomes. Some are hard to install or play roles in relocating genes [12, 13], and generat- obsolete. They may struggle to process large plant ing new genes [14, 15] and new pseudo genes [16, 17]. genomes. None are concurrent or have features to They can contribute to centromere function [18, 19]. support manual review of new elements. To TEs can regulate the expression of nearby genes via overcome these limitations, we developed several mechanisms including: (i) providing regula- LtrDetector, which uses signal-processing tory elements, such as promoters and enhancers, to techniques. LtrDetector is easy to install and use. nearby genes [14,20{22]; (ii) inserting themselves into It is not species specific. It utilizes multi-core genes, then targeting the epigenetic regulatory sys- processors available in personal computers. It is tem [23]; (iii) producing small interfering RNA spe- more sensitive than other tools by 14.4%{50.8% cific to host genes [24{26]; and (iv) generating new mi- while maintaining a low false positive rate on six cro RNA genes modulating host genes [27{29]. Trans- plant genomes. posons have been utilized in cloning plant genes in a technique called transposon tagging [30{32]. They also Keywords: have the potential to become a new frontier in enhanc- Long terminal repeats retrotransposons; Repeats; ing the productivity of crops [33,34]. Software; Signal processing Long Terminal Repeat retrotransposons (LTR-RTs) are a particularly interesting type of class I transposable element related to retroviruses. LTR-RTs are widespread in plants and are considered one of their 1 Introduction primary evolutionary mechanisms [35]. Gonzalez, et al. Formerly considered \junk DNA", the intergenic se- summarizes some of their functions [36]. LTR-RTs can quences of genomes are attracting increased atten- insert adjacent to and inside of genes and promote tion among biologists. A particularly striking feature alternative splicing [37], They play roles in recombination, epigenetic control [38, 39], and other forms of *Correspondence: [email protected] regulation [36]. LTR-RTs have been found with regula- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer tory motifs that promote defense mechanisms in dam- Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA aged plant tissues [40]. They can also serve as genomic Full list of author information is available at the end of the article markers for evolutionary phylogeny [41]. bioRxiv preprint doi: https://doi.org/10.1101/448969; this version posted October 22, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Valencia and Girgis Page 2 of 12 LTR-RTs are named for their characteristic direct to install because it does not have any external de- repeat | typically 200{600 base pair (bp) long [42]. pendencies. It is concurrent by default, taking advan- These direct repeats surround interior coding regions tage of the advanced hardware available on personal (the gag and pol genes). Lerat suggests 5kbp{9kbp as computers. It is not species specific. It is more sen- a size range for LTR-RTs [1]. sitive to known LTR-RTs than the related tools. It Computational tools are extremely important in lo- can process large genomes such as the barley genome. cating repeated sequences, including LTR-RTs. Tools It can produce images to facilitate the manual re- can be roughly divided into knowledge-based tools, view/annotation of the newly located LTR-RTs. which leverage consensus sequence databases to search We have tested LtrDetector on synthetic sequences for repeats, and de-novo tools, which use sequence and multiple genomes. Our results demonstrate that comparison and structural features to search for re- LtrDetector is the best de-novo tool currently available peats without prior knowledge about the sequence [1]. for detecting LTR-RTs. Knowledge-based methods include well-known bioinformatics software such as NCBI BLAST [43] and Re- 2 Results and discussion peatMasker (http://www.repeatmasker.org); they Contributions can be utilized in locating all types of known TEs Our efforts have resulted in the following contribu- including LTR-RTs. However, if the sequence of the tions: repetitive element is unknown, these two tools can- • The LtrDetector software for discovering LTR retro- not find copies in a genome. Several methods for transposons in assembled genomes. LtrDetector locating all types of TEs de-novo have been devel- is available on GitHub (https://github.com/Tulsa oped [44{47]. De-novo tools for detecting LTR-RTs in- BioinformaticsToolsmith/LtrDetector) and in Ad- clude LTR STRUC [48], MGEScan-LTR [49], LTRhar- ditional File 1. vest [50], LTR seq [51], and LTR Finder [52]. Post- • Visualization script to view scores, which should aid processing tools like LTR retriever may help increase in the manual verification of newly found elements the accuracy of de-novo approaches [53]. | available in Additional File 2 and the GitHub These tools face a variety of usability, scalability, and repository. accuracy concerns. For example, LTR STRUC, one of • Novel pipeline to generate ground truth (sequences the pioneering tools for locating LTR-RTs, was devel- of known LTR retrotransposons). The pipeline is oped exclusively for an old version of Windows, making available in Additional File 2 and the GitHub repos- it difficult to use nowadays. Several tools have external itory. dependencies which greatly complicate their installa- • Putative LTR retrotransposons of six plant genomes tion. None of them take advantage of the concurrent (Additional Files 3{9). architecture of modern personal computers. All may be unable to process large plant genomes such as the bar- Evaluation measures ley genome on an ordinary personal computer. Some True Positives (TP) are the detections that overlap tools are highly sensitive to species-specific parame- with an entry in the ground truth. False Negatives ters. All produce false positive predictions and do not (FN) are elements listed in the ground truth but not retrieve all known LTRs. Finally, none of these tools found by a tool. False Positives (FP) are the detec- was designed with post-processing manual review in tions that overlap with an entry in the false positive mind. data set. Mutual overlap is required to be 95%; for ex- Thousands of plant genomes are being sequenced ample, sequences A and B are counted as equivalent if currently and in the near future. The 10KP Project the overlapping segment between A and B constitutes for plant genomes (https://db.cngb.org/10kp/) 95% of both A and B. Because we calculate this over- and the Earth Biogenome Project (https://www. lap on the whole element rather than on its flanking earthbiogenome.org) aim at sequencing a large num- LTRs, it is theoretically possible that this definition of ber of plant genomes. This expansion of genomic data overlap may overlook some slight inaccuracies in the creates an urgent need for modern software tools to aid length of the LTRs. We use the standard measures of in detecting LTR-RTs in the new plant genomes; such sensitivity (Equation1) and precision (Equation 2) to tools should remedy the limitations of the currently assess the performances of LtrDetector and the related available tools. tools. Sensitivity is the ratio (or percentage) of the true To this end, we have developed LtrDetector, which elements found by a tool, whereas precision is the ratio is a software tool for detecting LTR-RTs. LtrDetec- (or percentage) of the true elements identified by a tool tor depends on signal processing techniques. It is easy to the total number of regions predicted by the same bioRxiv preprint doi: https://doi.org/10.1101/448969; this version posted October 22, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Valencia and Girgis Page 3 of 12 Table 1 Results on the X Chromosome of D. melanogaster: We evaluated four de-novo tools on a ground-truth annotation provided by Lerat [1]. Total is the number of proposed LTR-RTs, TP stands for true positives.

A Modern Tool-Suite for Detecting Long Terminal Repeat Retrotransposons De-Novo on the Genomic Scale

Specific Binding of Transposase to Terminal Inverted Repeats Of

Discrete-Length Repeated Sequences in Eukaryotic Genomes (DNA Homology/Nuclease SI/Silk Moth/Sea Urchin/Transposable Element) WILLIAM R

Bioinformatic and Phylogenetic Analyses of Retroelements in Bacteria

The Genomic Structure and Expression of MJD, the Machado-Joseph Disease Gene

Interfering with Retrotransposition by Two Types of CRISPR Effectors

RESEARCH ARTICLES Gene Cluster Statistics with Gene Families

Complete Article

1 Retrotransposons and Pseudogenes Regulate Mrnas and Lncrnas Via the Pirna Pathway 1 in the Germline 2 3 Toshiaki Watanabe*, E

Retroviral Insertions Into a Herpesvirus Are Clustered At

Long-Read Cdna Sequencing Identifies Functional Pseudogenes in the Human Transcriptome Robin-Lee Troskie1, Yohaann Jafrani1, Tim R

Evolution of Orthologous Tandemly Arrayed Gene Clusters

Human Immunodeficiency Virus Type 1 Two-Long Terminal Repeat