<<

G C A T T A C G G C A T

Article TEfinder: A Pipeline for Detecting New Events in Next-Generation Sequencing Data

Vista Sohrab 1 , Cristina López-Díaz 2, Antonio Di Pietro 2 , Li-Jun Ma 1,3 and Dilay Hazal Ayhan 1,3,*

1 Department of Biochemistry and Molecular Biology, University of Massachusetts Amherst, Amherst, MA 01003, USA; [email protected] (V.S.); [email protected] (L.-J.M.) 2 Departamento de Genética, Universidad de Córdoba, 14071 Córdoba, Spain; [email protected] (C.L.-D.); [email protected] (A.D.P.) 3 Molecular and Cellular Biology Graduate Program, University of Massachusetts Amherst, Amherst, MA 01003, USA * Correspondence: [email protected]

Abstract: Transposable elements (TEs) are mobile elements capable of introducing genetic changes rapidly. Their importance has been documented in many biological processes, such as introducing genetic instability, altering patterns of expression, and accelerating . Increasing appreciation of TEs has resulted in a growing number of bioinformatics software to identify insertion events. However, the application of existing tools is limited by either narrow-focused design of the package, too many dependencies on other tools, or prior knowledge required as input files that may not be readily available to all users. Here, we reported a simple pipeline, TEfinder, developed for  the detection of new TE insertions with minimal software and input file dependencies. The external  software requirements are BEDTools, SAMtools, and Picard. Necessary input files include the Citation: Sohrab, V.; López-Díaz, C.; reference genome sequence in FASTA format, an alignment file from paired-end reads, existing TEs Di Pietro, A.; Ma, L.-J.; Ayhan, D.H. in GTF format, and a text file of TE names. We tested TEfinder among several evolving populations TEfinder: A Bioinformatics Pipeline of Fusarium oxysporum generated through a short-term study. Our results demonstrate for Detecting New Transposable that this easy-to-use tool can effectively detect new TE insertion events, making it accessible and Element Insertion Events in practical for TE analysis. Next-Generation Sequencing Data. Genes 2021, 12, 224. https://doi.org/ Keywords: transposable elements; mobile element insertion events; next-generation sequencing 10.3390/genes12020224 (NGS);

Academic Editor: Irina R. Arkhipova Received: 17 December 2020 Accepted: 31 January 2021 Published: 4 February 2021 1. Introduction Transposable elements (TEs) are DNA sequences that move from one genomic location Publisher’s Note: MDPI stays neutral to another and thus impact genome evolution and adaptation [1]. TE transpo- with regard to jurisdictional claims in sition can alter the genomic architecture, introduce structural polymorphisms, disrupt published maps and institutional affil- coding sequences, and affect transcriptional and translational regulation. Additionally, TEs iations. are capable of changing eukaryotic gene expression by providing cis-regulatory elements such as promoters, factor binding sites, and repressive elements [2,3]. Ulti- mately, TEs provide a wide array of genomic diversity, functional impact, and evolutionary consequences that can be of notable interest to population , host interaction, and Copyright: © 2021 by the authors. comparative genomics studies. Licensee MDPI, Basel, Switzerland. TEs comprise a significant portion of the genome of and many other or- This article is an open access article ganisms, due to their mobilization and accumulation throughout evolution [4]. While distributed under the terms and some TEs are no longer active, certain TE families remain mobile and their transposition conditions of the Creative Commons contributes to both at the individual and the population level. TEs play Attribution (CC BY) license (https:// important roles in many biological processes, such as biology [5], neurodegenerative creativecommons.org/licenses/by/ diseases [6], or host-pathogen interactions [7]. Therefore, it is of high interest to identify 4.0/).

Genes 2021, 12, 224. https://doi.org/10.3390/genes12020224 https://www.mdpi.com/journal/genes Genes 2021, 12, 224 2 of 11

transposon insertion polymorphisms (TIPs) for the detection of highly active TE families and the understanding of their contribution to genome dynamics and organism adaptation. Advancements in next-generation sequencing technologies have made in silico dis- covery of transposon insertion events readily accessible. Two features commonly used for the detection of TE insertions are the target site duplication (TSD) and discordantly aligned reads. Upon insertion of certain transposons into a new genomic location, the mechanism of integration results in the duplication of the target sequence at the integration site, which is referred to as the TSD [4]. TSD length varies across superfamilies, and the identification of this structural motif proves useful in determining true transposition events within the genome [8]. Due to the nature of the transposon insertion, the TE sequence should be mapped to existing TE locations within the genome, while the paired-end read partner should be mapped to the unique sequence at the insertion site. This will result in discordant reads, which can be recognized when two paired-end reads are placed in different genomic locations or in a much greater distance that exceeds the expected insert size of the sequencing library. The more discordant reads localized in a genomic region (cluster), the higher the confidence to call it a new insertion site. Although several bioinformatic tools that detect such events have been developed, the broad application of these tools is limited by heavy external software or file dependencies. For instance, ISMapper [9] can report insertion positions of bacterial insertion sequences (ISs) when provided with paired-end short-read sequences, TE sequences as multi-FASTA queries, and the reference genome sequence. However, this requires specific versions of Python 3 and BioPython, which may not be readily available to all users and lead to difficulties in running the software. Mobile Element Locator Tool (MELT) [10], a Java software package originally developed as a part of the 1000 Project [11], discovers, annotates, and mobile element insertions with the only requirement of Bowtie2 [12]. The package is powerful in comprehensive TE analysis in human and chimpanzee genomes. However, it requires gene annotations. For each TE, users need to provide the and multiple files related to this particular transposon. Users have to make sure only “ATGC” characters are present in the consensus FASTA file. Such a high level of external file dependency makes it challenging for users who are interested in TE analysis in non-model that lack well-annotated genomes. Encountering difficulties in applying available TE insertion detection tools, we de- veloped a simple bash bioinformatics pipeline, TEfinder, to detect new insertion events using tools that are commonly embedded in genomics variant calling workflows, including BEDTools [13], SAMtools [14], and Picard [15]. Required input files include the reference genome sequence, TE annotations of the reference genome sequence, alignment of paired- end short-read sequencing data, and a list of TE names of interest. The pipeline reports new insertion events based on the TE annotation of the reference genome sequence. The output file can be in either BED or GTF format that captures all details and can be integrated into the downstream analysis. Here we report the design of the pipeline, testing results of its performance using short-read sequencing data derived from a short-term evolution experiment in the filamentous Fusarium oxysporum f. sp. lycopersici 4287 (Fol4287), as well as a simulation dataset from 2 L of melanogaster [16], and two datasets from the study [17,18].

2. Materials and Methods 2.1. Requirements TEfinder is a bash pipeline for detecting TE insertions using paired-end sequencing data. The overall objective is to identify new TE insertion events in a given sample that are different from ones captured in the reference genome sequence. For this, an assembled genome and pair-end sequencing of the sample is required. Software required to run this tool include BEDTools 2.28.0 or later [13], SAMtools 1.3 or later [14], and Picard 2.0.1 or later [15]. The four user input files are a FASTA file of the reference genome sequence, a file of paired-end read alignments (BAM or SAM), TEs Genes 2021, 12, x FOR PEER REVIEW 3 of 12

Software required to run this tool include BEDTools 2.28.0 or later [13], SAMtools 1.3 or later [14], and Picard 2.0.1 or later [15]. The four user input files are a FASTA file of the Genes 2021, 12, 224 reference genome sequence, a file of paired-end read alignments (BAM or SAM), TEs3 ofpre- 11 sent in reference genome sequence in GTF format, and a list of TE names to be analyzed in text file format. The reference genome sequence and the read alignment files are essen- tial. The TE GTF file can be produced using any annotation tool (see Implementation for presentan example). in reference The list genome of TE sequence names provides in GTF format, users the and option a list of to TE focus names their to beanalysis analyzed on inselected text file TE format. families. The Without reference a genome specification, sequence the and list the can read be alignmenteasily derived files arefrom essential. the TE TheGTF TE file. GTF file can be produced using any annotation tool (see Implementation for an example). The list of TE names provides users the option to focus their analysis on selected Optional arguments have been incorporated for users to customize the tool. TE families. Without a specification, the list can be easily derived from the TE GTF file.  OptionalThe default arguments DNA fragment have been length incorporated or insert size for usersof the to short customize-read sequencing the tool. library is 400 base pairs (bp). For a sequencing library with an insert size of 500, this value • Thecan defaultbe modified DNA by fragment including length “-fis or 500” insert; size of the short-read sequencing library is  400The base default pairs maximum (bp). For distance a sequencing between library reads with for an merging insert size and of forming 500, this clusters value canhas bebeen modified set to 150 by bp. including To decrease “-fis 500”; the value of this parameter by 30 bp, the “-md” argu- • Thement default can be maximum set to 120; distance between reads for merging and forming clusters has  beenThe default set to 150maximum bp. To target decrease site theduplication value of (TSD) this parameter length is 20 by bp. 30 Modifying bp, the “-md” this argument can be set to 120; value can be useful if the TSD lengths of the TEs being analyzed are known leading • The default maximum target site duplication (TSD) length is 20 bp. Modifying this to more targeted TE analysis results. The maximum TSD length can be increased to value can be useful if the TSD lengths of the TEs being analyzed are known leading 30 bp by setting “-k 30”; to more targeted TE analysis results. The maximum TSD length can be increased to  The “-picard” argument should be set to the full path of the picard.jar file. 30 bp by setting “-k 30”;  An additional Java argument relating to Picard’s maximum memory heap size can • The “-picard” argument should be set to the full path of the picard.jar file. be submitted to the pipeline as a fraction of the total memory allocated, leading to • An additional Java argument relating to Picard’s maximum memory heap size can enhancement of the overall runtime. A maximum memory heap size of 25,000 MB is be submitted to the pipeline as a fraction of the total memory allocated, leading to set via “-maxHeapMem 25000”; enhancement of the overall runtime. A maximum memory heap size of 25,000 MB is  Multithreading is supported for the SAMtools commands of the pipeline via the set via “-maxHeapMem 25000”; • Multithreadingthread option. The is supported number of for threads the SAMtools can be commandsset to 4 using of the“-threads pipeline 4” via; the thread  option.A working The directory number of name threads can can be beprovided set to 4 as using such “-threads “-workingdir 4”; TEfinder_Y1” for • Abetter working file organization directory name and candifferentiation be provided amongst as such TEfinder “-workingdir runs; TEfinder_Y1” for  betterAn output file organization name can be and provided differentiation to be appended amongst TEfinderto the default runs; output names for • Aneffective output labeling name canof files, be provided such as “- tooutname be appended Y1”; to the default output names for  effectiveUsers can labeling also specify of files, G suchTF as as the “-outname output Y1”;argument, “-out gtf” (case insensitive), • Userswhich can reports also specifythe TE GTFinsertions as the in output GTF format argument,; “-out gtf” (case insensitive), which  reportsLastly, theif the TE user insertions includes in the GTF optional format; argument “-intermed yes” (case insensitive), • Lastly,all intermediate if the user files includes including the optional the TE-specific argument directories “-intermed are provided yes” (case in insensitive), addition to allthe intermediate output files. files including the TE-specific directories are provided in addition to the output files. 2.2. Implementation 2.2.2.2.1. Implementation Preprocessing 2.2.1. Preprocessing Preparation of the four input files depicted in the top panel of Figure 1A: Preparation of the four input files depicted in the top panel of Figure1A: 1. A FASTA file of the reference genome sequence. No special requirement. 1.2. A BAM FASTA file file of of aligned the reference paired- genomeend reads sequence. to the reference No special genome requirement. sequence. 2. A BAM file of aligned paired-end reads to the reference genome sequence. The sample sequencing reads needed to be aligned to the reference genome sequence usingThe an samplealigner. sequencing In this study, reads Burrows needed– toWheeler be aligned Aligner to the (BWA) reference [19] genome was used. sequence Users usingmay choose an aligner. other In aligners, this study, as Burrows–Wheelerlong as read group Aligner information (BWA) [is19 included] was used. in Usersthe header may choosecreated other by the aligners, aligner. as Different long as readaligners group can information have slightly is includeddifferent inresults. the header created by the aligner. Different aligners can have slightly different results. bwa index reference.fasta bwa mem -R “@RG\tID:sample\tPL:illumina\tLB:LIB\tSM:sample” \ reference.fasta sample_R1.fq sample_R2.fq > sample.bam

The bwa command above specifies read group information using bwa-mem, option The bwa command above specifies read group information using bwa-mem, option “-R”. If Bowtie2 [12] was used, and users needed to select option “–rg-id”. For GEM3- “-R”. If Bowtie2 [12] was used, and users needed to select option “--rg-id”. For GEM3- Mapper [20], option “-r” needed to be specified. If read group information was not included in the initial alignments, Picard [15] has AddOrReplaceReadGroups function to create read group information based on the input BAM file. 3. GTF file of TE annotation in the reference genome sequence. This file captured genomic locations of all TEs that were present in the reference genome sequence. This file could be GFF2/GTF or GFF3. To generate it, a library of TE Genes 2021, 12, x FOR PEER REVIEW 4 of 12

Genes 2021, 12, x FOR PEER REVIEW 4 of 12

Mapper [20], option “-r” needed to be specified. If read group information was not in- cluded in the initial alignments, Picard [15] has AddOrReplaceReadGroups function to createMapper read [20] group, option information “-r” need baseded to beon specified.the input BAMIf read file. group information was not in- 3.cluded GTF in file the of initial TE annotation alignments, in thePicard reference [15] has genome AddOrReplaceReadGroups sequence. function to createThis read file group capture informationd genomic based locations on the of inputall TEs BAM that file.were present in the reference ge- 3. GTF file of TE annotation in the reference genome sequence. Genes 2021, 12, 224 nome sequence. This file could be GFF2/GTF or GFF3. To generate it, a library of TE4 of se- 11 quencesThis of file the capture referenced genomic genome locations had to be of provided. all TEs that For we a wellre present-annotated in the genome, reference us ersge- cnomeould obtainsequence. this This library file fromcould existing be GFF2/GTF databases or GFF3. of known To generate repetitive it, aelements library of such TE se-as Repbasequences of[21] the. If reference such a library genome is not had available, to be provided. users can For compile a well -oneannotated by running genome, a de usnovoers sequencesTEcould family obtain ofidentification thethis referencelibrary software from genome existing such had asdatabases to RepeatModeler be provided. of known For [22] repetitive a, well-annotatedRepeatScou elementst [23] genome, ,such or Re- as userspeatMaskerRepbase could [21] obtain[24]. If such on this the a library library reference is from not genome existingavailable, sequence. databases users can An of compile knownexample one repetitive is byshown running elements below a de to suchnovo use asRepeatScoutTE Repbase family identification [21 to]. discover If such a repetitivesoftware library is suchsequences not available,as RepeatModeler and usersform the can TE[22] compile library, RepeatScou one of the by runningreferencet [23], or a Re-ge- de novonomepeatMasker TEsequence. family [24] identification on the reference software genome such sequence. as RepeatModeler An example [22 is], RepeatScoutshown below [23 to], use or RepeatMaskerRepeatScout to [discover24] on the repetitive reference sequences genome sequence.and form the An TE example library isof shownthe reference below ge- to use RepeatScoutbuild_lmer_table to discover -sequence repetitive reference.fasta sequences and form -freq the TE referenc library ofe.freq the reference nome sequence. genomeRepeatScout sequence. -sequence reference.fasta -output TElib.fa -freq \ reference.freq build_lmer_table -sequence reference.fasta -freq reference.freq RepeatScout -sequence reference.fasta -output TElib.fa -freq \ Thereference.freq most straightforward tool was RepeatMasker [24], which is available on Galaxy [25], and could be used to generate the GTF file based on the provided reference genome The most straightforward tool was RepeatMasker [24], which is available on Galaxy [25], sequenceThe mostand a straightforward common repeat toollibrary wa sfrom RepeatMasker RepBase [21] [24] or, whichthe TE islibrary available created on Galaxyabove. and could be used to generate the GTF file based on the provided reference genome se- The[25] , outputand could from be RepeatMasker used to generate was the to GTF be filteredfile based so onthat the simple provided repeats reference would genome be re- quence and a common repeat library from RepBase [21] or the TE library created above. moved.sequence and a common repeat library from RepBase [21] or the TE library created above. The output from RepeatMasker was to be filtered so that simple repeats would be removed. The output from RepeatMasker was to be filtered so that simple repeats would be re- RepeatMasker -lib TElib.fa -dir workingdir -gff reference.fasta moved. 4. Text file with names of the TEs of interest. 4. TextRepeat fileMasker with names -lib of TElib.fa the TEs of - interest.dir workingdir -gff reference.fasta The TE namesnames providedprovided neededneeded to exactlyexactly matchmatch the namesnames inin thethe GTFGTF entriesentries inin aa single column. The names could not include characters other than letters, digits, “-”, “_”, single4. Text column. file with The names names of could the TEs not ofinclude interest. characters other than letters, digits, “-”, “_”, and “#”. More information about this can be found in the tool manual. and “The#”. MoreTE names information provided about needed this canto exactly be found match in the the tool names manual. in the GTF entries in a A single column. The namesB couldU_9 not include characters other than letters, digits, “-”, “_”, and “#”. More information about this can be found in the tool manual. 1,458 bp

9,600 bp 9,800 bp 10,000 bp 10,200 bp 10,400 bp 10,600 bp 10,800 bp 11,000 bp 1

[0 - 122]

Read alignments (BAM) [0 - 64]

Hornet-small

Discordant reads (BAM)

Grouped discordant reads on plus strand (BED)

Grouped discordant reads on minus strand (BED) Figure 1. (A) Basic workflow of the TEfinder pipeline. TheTE required insertions (BED) input files are in dark purple, while other files that are used to create the input files are in light purple. Arrows show the dependencies of the files. Please check Implementa- Figure 1. (A) Basic workflow of the TEfinder pipeline. The required input files are in dark purple, while other files that are tion for details. The steps of the pipeline are in the light gray box and the outputs are in the green box. (B) Visualization usedFigure to create1. (A) Basic the input workflow files are of in the light TEfinder purple. pipeline. Arrows The show required the dependencies input files ofare the infiles. darkPlease purple, check while Implementation other files that of a new small Hornet (TIR/hAT) insertion event using Integrative Genomics Viewer (IGV). The data were derived from forare details. used to Thecreate steps the ofinput the pipelinefiles are in are light in the purple. light Arrows gray box show and the outputsdependencies are in of the the green files. box. Please (B) check Visualization Implementa- of a tion for details. The steps of the pipeline are in the light gray box and the outputs are in the green box. (B) Visualization new small Hornet (TIR/hAT) insertion event using Integrative Genomics Viewer (IGV). The data were derived from the of a new small Hornet (TIR/hAT) insertion event using Integrative Genomics Viewer (IGV). The data were derived from Y3 population after a short-term evolution experiment using Fol4287. (Top) input read alignments file, (middle) output discordant reads alignment file, and (bottom) intermediate and output BED files are displayed. The aligned reads in the top panel are shown in squished mode with reads mapped to different sequences colored non-gray. The aligned reads in the middle panel are shown in collapsed mode and forward reads are in red while reverse reads are in blue. The histograms above the alignments in the top and middle panels show the aligned read counts. The position of the insertion event reported by TEfinder coincides with the duplicated target site (TSD). Genes 2021, 12, x FOR PEER REVIEW 5 of 12

the Y3 population after a short-term evolution experiment using Fol4287. (Top) input read alignments file, (middle) out- put discordant reads alignment file, and (bottom) intermediate and output BED files are displayed. The aligned reads in the top panel are shown in squished mode with reads mapped to different sequences colored non-gray. The aligned reads in the middle panel are shown in collapsed mode and forward reads are in red while reverse reads are in blue. The histo- Genes 2021grams, 12 ,above 224 the alignments in the top and middle panels show the aligned read counts. The position of the insertion event5 of 11 reported by TEfinder coincides with the duplicated target site (TSD).

2.2.2. Pipeline 2.2.2. Pipeline TEfinder relies on paired-end sequencing and uses information on discordant reads, whichTEfinder are reads relies that on do paired-end not match sequencing the expected and orientation uses information or insert on size. discordant The software reads, whichrequired are for reads the thatpackage do not are matchSAMtools the expected[14], BEDTools orientation [13], and or insert Picard size. [15]. The software requiredA typical for the command package areto run SAMtools TEfinder [14 is:], BEDTools [13], and Picard [15]. A typical command to run TEfinder is:

TEfinder -alignment sample.bam -fa reference.fa -gtf TEs.gtf \ -te List_of_TEs.txt

TheThe referencereference sequencesequence informationinformation anchorsanchors allall downstreamdownstream analysis analysis in in TIPs,TIPs, andand thethe initialinitial stepstep ofof thethe tooltool waswas creatingcreating thethe FASTAFASTA indexindex filefile which which enhancesenhances efficientefficient accessaccess to regions within the FASTA file. The subsequent step of the tool was to sort the input to regions within the FASTA file. The subsequent step of the tool was to sort the input alignment file by coordinates. To avoid multi-mapping and chimeric ambiguities from alignment file by coordinates. To avoid multi-mapping and chimeric ambiguities from affecting our analysis, the tool removed secondary and supplementary alignments from the affecting our analysis, the tool removed secondary and supplementary alignments from user-provided sample alignment file (Figure1A). For each TE, the package went through the user-provided sample alignment file (Figure 1A). For each TE, the package went the following logical steps: through the following logical steps: 1. Identify discordant reads. The program extracted all primary reads mapped to known 1. Identify discordant reads. The program extracted all primary reads mapped to TEs in the reference using BEDTools intersect. Then, alignments of the selected reads known TEs in the reference using BEDTools intersect. Then, alignments of the se- and their pairs were extracted from the input BAM file using the FilterSamReads tool lected reads and their pairs were extracted from the input BAM file using the Filter- of Picard [15]. Among those, reads were selected as discordant if the corresponding SamReads tool of Picard [15]. Among those, reads were selected as discordant if the mate mapped to a different sequence or the read pair had an insert size that exceeded acorresponding threshold of 10 mate times map theped mean to a insert different size (Figuresequence1B, or top the panel). read pair had an insert 2. Groupsize that discordant exceeded reads.a threshold Once of the 10 discordant times the mean read alignments insert size (Figure had been 1B, filtered, top panel). the

2. regionsGroup discordant of clustered reads. reads Once were the identified discordant using read the alignments BEDTools ha merged been so filtered, that reads the alignedregions of to clustered the plus strandreads we andre minusidentified strand using were the grouped BEDTools separately merge so (Figure that reads1B, middlealigned andto the bottom plus panels).strand and In this minus step, strand the reads were had grouped to be overlapping separately or(Figure within 1B, a givenmiddle distance and bottom to be panels). considered In this in the step, same the group.reads had to be overlapping or within a 3. Definegiven distance a TE insertion to be considered site. Each plus-strand in the same group group. was coupled with the nearest minus- 3. strandDefine group.a TE insertion The coupled site. Each regions plus went-strand through group a wa filterings coupled step with to remove the nearest incorrect mi- orientations,nus-strand group. considering The coupled that TE regions sequences went should through bepresent a filtering between step to forward remove and in- reversecorrect orientations, groups and notconsidering the reverse that order. TE sequences Due to the should nature be ofpresent duplication between of thefor- targetward and site reverse upon the groups transposon and not insertion, the reverse TSD order. sequences Due to the may nature have of been duplication present inof boththe target forward site andupon reverse the transposon strands, resultinginsertion, in TSD an overlapsequences between may have forward been andpre- reversesent in clusters.both forward If an overlappingand reverse sitestrands, was smaller resulting than in the an maximumoverlap between TSD length, forward the locationand reverse was reported clusters. as If a an possible overlapping TE insertion site wa site.s smaller If they were than not the overlapping maximum dueTSD tolength, low coveragethe location or otherwas reported reasons, as and a possible the distance TE insertion between site. the If groups they we wasre not smaller over- thanlapping the due given to threshold,low coverage the regionor other in reasons, between and was the reported. distance between the groups 4. Filterwas smaller new insertion than the sites.given Thethreshold, identified the region insertion in between sites were wa filtereds reported. based upon 4. 3Filter criteria: new Occurrence insertion sites. in repeat The regions,identified supporting insertion readsites count, were andfiltered strand based bias. upon If the 3 insertioncriteria: O siteccurrence coincided in repeat with aregions, TE-annotated supporting site in read the count, reference and genome strand bias. sequence, If the itsinsertion filter label site becamecoincide “in_repeat”.d with a TE If-annotated the total supportingsite in the reference reads for thegenome , event wasits filter less label than thebeca cutoffme “in_repeat 10, then the”. If filtering the total process supporting labeled reads this for low the confidence insertion eventevent aswa “weak_evidence”.s less than the cutoff Strand 10, then bias the was filtering critical process for filtering labeled out this insertion low confidence events. event Two thresholdas “weak_evidence values were”. Strand calculated bias wa baseds criti oncal the for forward filtering readout insertion count (F), events. and if Two the reversethreshold read values count we (R)re calculated fell outside based those on boundaries the forward (R read < F0.8 countor R >(F), F1.25 and), thenif the the re- eventverse read likely count was not(R) causedfell outside by TIP those and boundaries received a (R “stand_bias” < F0.8 or R > label. F1.25), Ifthen an insertionthe event sitelikely passed was not all threecaused filtering by TIP criteria, and received then it a was “stand_bias labeled “PASS”,” label. representingIf an insertion high site confidencepassed all three insertion filtering events criteria, (Figure then1). it was labeled “PASS”, representing high con- Detectedfidence insertion TE insertions events events (Figure were 1). reported in BED detail format with 7 columns: (1) Insertion site sequence, (2) start coordinate, (3) end coordinate, (4) TE family, (5) total number of reads supporting the insertion event, (6) unused, (7) additional information regarding the insertion such as the number of forward and reverse reads supporting the event (FR and RR), as well as the insertion region start and end coordinate (InsRegion), and the filtering results (FILTER) obtained from the pipeline. Additionally, discordant pair alignment files from individual TE families were combined to create the “Discor- Genes 2021, 12, 224 6 of 11

dantReads.bam” file in the working directory (Figure1B, middle panel). This file could be used to visualize the events on genome browsers such as Integrative Genomics Viewer (IGV) [26]. If the GTF output option was selected, the GTF file reported the additional information as attributed in the ninth column.

2.3. Testing Dataset and Processing The performance of TEfinder was tested using a model Fusarium oxysporum, a soil- inhabiting ascomycete fungus that causes devastating losses in more than a hundred different crops and disseminated infections in immunocompromised humans [27,28]. One interesting genomic feature of the F. oxysporum species complex is the compartmentalization of its genome, where conserved core regions carry essential house-keeping functions while lineage-specific accessory regions are enriched for TEs and associated with host-specific pathogenicity [28,29]. The pipeline was used to identify new TE insertion events among five populations, Y1– Y5, evolved under laboratory conditions. Briefly, the ancestor strain (WT) of Fol4287, which was previously used to generate the reference genome assembly [30], was subjected to suc- cessive transfers on yeast peptone dextrose agar plates. After 10 passages with 5 indepen- dent biological replicates (Y1–Y5), genomic DNA was extracted from the final mixed popu- lations and sequenced using Illumina HiSeq 2500 platform with 2 × 71 cycles. The whole- genome shotgun sequencing reads are available at NCBI under project PRJNA682786 and datasets SRR13203443, SRR13203444, SRR13203445, SRR13203446, and SRR13203447. To assess the sensitivity, we performed random sampling of the alignments in the Y2 population using Picard [15] and reduced the coverage to 50%, 30%, 20%, and 10% of the initial read coverage and repeated the analysis. A small subset of simulation data from chromosome 2 L of the reference genome sequence (dm3), used for testing TEMP [16] software available on GitHub at (https://github.com/JialiUMassWengLab/TEMP, accessed on 28 January 2021), was also used to evaluate TEfinder. To test TEfinder on identifying new insertion events in 2 Arabidopsis thaliana acces- sions [17], Alst-1 (SRR492202) and Benk-1 (SRR492214) whole-genome paired-end se- quencing data were mapped to the TAIR10 reference genome (ftp://ftp.arabidopsis.org/ home/tair/Genes/TAIR10_genome_release) with BWA [19]. The median sequencing coverage for Alst-1 and Benk-1 were 36 and 40, respectively. TAIR10 TE annotation GTF was obtained using RepeatMasker [24] and the TAIR10 transposon library avail- able at (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_ transposable_elements/, accessed on 28 January 2021). Non-reference insertion events for these accessions, reported in an A. thaliana mobilome analysis [18], were used for benchmarking.

2.4. Experimental Validation Thirteen reported TE insertion events were validated by PCR. Genomic DNA was ex- tracted from mycelia of the F. oxysporum reference strain, evolved populations, or of single spore (SS) isolates obtained from the experimentally evolved lines, using the Cetyltrimethy- lammonium Bromide method [31]. PCR was performed in a thermocycler using the thermostable DNA polymerase of the Expand High Fidelity PCR System (Roche Diagnos- tics, Mannheim, Germany). Each PCR reaction contained 300 nM of each primer, 2.5 mM MgCl2, 0.8 mM dNTP mix, 0.05 U/µL polymerase, and 5–10 ng/µL genomic DNA. PCR cycling conditions were as follows: An initial step of denaturation (5 min, 94 ◦C); 35 cycles of 35 s at 94 ◦C, 35 s at the calculated primer annealing temperature, and 1 min/1.5 kb extension at 72 ◦C (or 68 ◦C for templates larger than 3 kb); and a final extension step of 10 min at 72 ◦C (or 68 ◦C). For each predicted TE insertion event, a pair of specific primers flanking the insertion site was designed. Genes 2021, 12, 224 7 of 11

3. Results 3.1. Data Preparation The Fol4287 reference genome assembly was 53.9 MB in 499 scaffolds [30]. The sequencing reads were mapped to the reference using BWA [19] with >99% mapping and median coverages ranging from 67 to 94 (Table1). RepeatMasker [ 24] was used to identify the known TEs in the genome with a curated TE library [27,28] which included 69 TE families. Approximately 4.5% of the entire genome was identified as repetitive sequence, with 3.98% of the genome comprised of transposable elements. Accessory sequences included 74% of all TEs in the genome [27].

Table 1. Sample sequencing and mapping summary statistics.

Discordant Mate Percent Reads Median Sample Total Reads Mapping Quality ≥ 5 Mapping to Reference Coverage Y1 62,015,365 475,966 99.41 67 Y2 85,688,005 600,267 99.38 94 Y3 83,109,424 578,959 99.44 92 Y4 70,322,907 475,591 99.42 78 Y5 76,034,020 567,823 99.37 83

3.2. Total TE Insertion Events Detected On a high-performance computing cluster with a memory request of 3 cores × 50,000 MB, the trial runs took an average of 4.7 h to complete with the minimum time being 3.6 h across the samples. Three threads were provided for SAMtools multithreading. The maximum heap memory for Java was set to 25,000 MB to enhance the run time when filtering the alignment to only include discordant reads. TEfinder detected 502 to 570 insertion events across whole genome sequencing data from 5 evolved Y1–Y5 populations. Details for each individual sample are summarized in Table2 (Supplementary File). The TIPs had varying allele frequencies depending on the population structure. As expected, many insertion events were detected among complex repetitive regions that captured most transposons. In fact, the number of new TE insertions reported in each sample population was approximately one-third of the insertions detected in known repeat regions. Additionally, there were filtered out events because of the low read count (“weak evidence”). Furthermore, almost half of the detected events had strand bias as estimated by a power function. The constants for this function were determined arbitrarily for the datasets tested in this study. The users could utilize the reported forward and reverse read counts for their data if the need arose.

Table 2. Number of transposable element (TE) insertion events reported by TEfinder in evolved populations of Fol4287.

Weak Strand PCR- Sample All Events In Repeat PASS Evidence Bias Verified Y1 502 397 11 256 55 1/1 Y2 566 449 9 281 63 1/1 Y3 570 455 8 264 60 1/1 Y4 565 446 10 278 60 3/3 Y5 544 423 10 272 64 6/7

The BAM output file format was intended for visual inspection. Figure1B captures a new insertion event of the TE small Hornet (TIR/hAT) that reached fixation in the Y3 population visualized in IGV. The confidence level for calling this new insertion event was high, with a total of 415 supporting reads spanning a 1114-bp insertion region in a dataset with 92× median coverage. Of these supporting discordant reads, 223 reads were grouped in the plus strand cluster and 192 in the minus strand cluster. Since the Genes 2021, 12, 224 8 of 11 Genes 2021, 12, x FOR PEER REVIEW 8 of 11

in the plus TEfinderstrand cluster pipeline anddoes 192 in not the focus minus on strand TSD position cluster. detection,Since the TEfinder the reported pipeline insertions may does not focusnot alwayson TSD coincideposition preciselydetection, with the thereported true location. insertions Nevertheless, may not always the pipeline coincide mapped the precisely withTSD the position true location. of many Nevertheless, events precisely. the Inpipeline the example mapped of Figurethe TSD1B, position two clusters of overlap many eventsat precisely. an 8-base In TSD, the example which coincides of Figure with1B, two the clusters known overlap TSD of at this an 8-base particular TSD, transposon which coincidessuperfamily with the [8 ].known TSD of this particular transposon superfamily [8]. We selected 13 newly detected TIPs in the “PASS” catalog for experimental validation We selected 13 newly detected TIPs in the “PASS” catalog for experimental valida- (Table2). Results of PCR confirmed our prediction in 12 out of 13 cases, resulting in a tion (Table 2). Results of PCR confirmed our prediction in 12 out of 13 cases, resulting in success rate of 92%. The one TIP event that failed PCR amplification only had 22% allele a success rate of 92%. The one TIP event that failed PCR amplification only had 22% allele frequency in that evolved population. The success rate for non-PASS events might be lower. frequency in that evolved population. The success rate for non-PASS events might be Figure2 shows four TIPs by small Hornet in Y1, Y3, Y4, and Y5 populations. The size lower. Figure 2 shows four TIPs by small Hornet in Y1, Y3, Y4, and Y5 populations. The difference in the ancestor and the lines with TIP were about 750 bp, which coincided with size difference in the ancestor and the lines with TIP were about 750 bp, which coincided the size of the small Hornet transposon. with the size of the small Hornet transposon.

Figure 2. ValidationFigure of small2. Validation Hornet transposonof small Hornet insertion transposon events detectedinsertion in events experimentally detected in evolved experimentally populations of Fol4287. evolved populations of Fol4287. PCR was performed with primers flanking the insertion site in the PCR was performed with primers flanking the insertion site in the ancestor strain (WT) and two single spore (SS) isolates in ancestor strain (WT) and two single spore (SS) isolates in populations Y1, Y3, Y4, and Y5, respec- populations Y1, Y3, Y4, and Y5, respectively. GeneRuler 1 kb Plus DNA Ladder shown with indicated sizes in kb to the left. tively. GeneRuler 1 kb Plus DNA Ladder shown with indicated sizes in kb to the left. Specific pri- Specific primersmers designed designed for validation for validation of TE of insertion TE insertio eventsn events are Y1 are forward: Y1 forward: TCCTCCTGGGTTTCTTGTCAC, TCCTCCTGGGTTTCTTGT- Y1 reverse: CTCTTGAAACGTGGTGCAGAC;CAC, Y1 reverse: CTCTTGAAACGTGGTGCAG Y3 and Y4 forward: AGACGGACAAAGAGGTGTGAC,AC; Y3 and Y4 forward: AGACGGACAAA- Y3 and Y4 reverse CTCGA- CACTACCAGGCACTAT;GAGGTGTGAC, Y5 forward: Y3 and GAGTATGCTTCCCGATCCTTG, Y4 reverse CTCGACACTACCAGGCACTAT; Y5 reverse: GACATCCCTCAATCCGCTGAA. Y5 forward: GAG- TATGCTTCCCGATCCTTG, Y5 reverse: GACATCCCTCAATCCGCTGAA. 3.3. Sensitivity and Applicability of TEfinder 3.3. Sensitivity andThe Applicability output BED of fileTEfinder of TEfinder reported an overview of the total number of reads The outputsupporting BED file the of insertion TEfinder event, reported as well an asoverview the number of the of forwardtotal number and reverseof reads read counts. supportingIn the our insertion trial, TEfinder event, as was well able as to the capture number low-frequency of forward and insertion reverse events read withcounts. read evidence In our trial,as TEfinder little as 8was in all able datasets. to capture low-frequency insertion events with read evi- dence as little asTo 8 testin all the datasets. sensitivity of TEfinder, we used one of the evolving samples with read cover- To testage the reduced sensitivity to 0.5, of 0.3, TEfinder, 0.2, and 0.1we of used the originalone of datasetthe evolving (Table3 samples, Figure3 ,withSupplementary read File), coverage reducedcorresponding to 0.5, 0.3, to 0.2, 47× and, 28 ×0.1, 19of× the, and original 9× sequence dataset (Table coverage. 3, Figure When 3, Supple- the read coverage mentary File),reduced corresponding to 47×, 82.5% to 47×, of the28×, previously 19×, and 9× detected sequence events coverage. were captured.When the Atread 9× coverage, coverage reducedonly 38% to of47×, the 82.5% events of were the detected.previously Therefore, detected theevents sensitivity were captured. of TEfinder At depended9× on coverage, onlythe read38% coverageof the events of the were alignments detected. in Therefore, the input the BAM sensitivity file. To ensureof TEfinder effectiveness de- in the pended on detectionthe read coverage of TIPs, we of suggestthe alignments a minimal in the 20× inputsequence BAM coverage,file. To ensure especially effective- when reads are ness in thegenerated detection fromof TIPs, a mixed we suggest population. a minimal 20× sequence coverage, especially when reads are generated from a mixed population. Table 3. Number of TE insertion events reported by TEfinder in Y2 population with reduced coverages. Table 3. Number of TE insertion events reported by TEfinder in Y2 population with reduced cov- erages. Median Read Coverage Reported PASS 94 566 63 Median Read47 Coverage 463 Reported PASS 52 2894 352566 63 39 1947 280463 52 36 9 172 24 28 352 39 19 280 36 9 172 24

Genes 2021, 12, x FOR PEER REVIEW 9 of 11 Genes 2021, 12, 224 9 of 11

Figure 3. PercentPercent of of detected detected TIP TIP events events with with respect respect to to read read coverage coverage in in the the input input alignment alignment file. file. Orange circles depict all reported events while blue trianglestriangles representrepresent PASSPASS events.events.

We used TEfinderTEfinder with default options exceptexcept “-k“-k 30”30” on the subset of simulated data from chromosome 2 2 L L in in D.D. melanogaster melanogaster [16].[16]. There There were were 10 10 simulated simulated insertions insertions and and 8 were8 were by by TEs TEs present present in in chromosome chromosome 2 2 L. L. TEfinder TEfinder identified identified all 8 new insertions. Seven of the eighteight eventsevents werewere in in the the “PASS” “PASS” category category and and one one event event was was labeled labeled as strandas strand biased bi- withased 32with forward 32 forward and 14and reverse 14 reverse read countsread co (Supplementaryunts (Supplementary File). File). Since Since the otherthe other two twosimulated simulated insertion insertion events events were were caused caused by TEs by from TEs otherfrom partsother ofparts the genomeof the genome and absent and absentin the chromosomein the chromosome 2 L TE 2 annotation,L TE annotation, they werethey were not detected not detected by the by TEfinderthe TEfinder pipeline. pipe- line.Furthermore, Furthermore, we testedwe tested TEfinder TEfinder with with data data used used in in the theArabidopsis Arabidopsis thaliana thaliana mobilomemobilome study [17,18] [17,18] that detected 63 non-reference TE insertion events in Alst-1 and and 61 61 in in Benk- Benk- 1 accessions.accessions. WithWith default default options, options, TEfinder TEfinder successfully successfully detected detected 55 and 55 58 and of these 58 of known these knowninstances, instances, respectively. respectively. These testing These results testing demonstrated results demonstrated feasibility feasibility in applying in applying TEfinder TEfinderin , in , plants, andanimals, fungi and with fungi high with sensitivity. high sensitivity.

4. Discussion Here we reported a simple pipeline to detect new TE insertions in related populations when compared compared to to a areference reference genome genome sequence sequence.. TheThe required required input input and output and output file types file aretypes commonly are commonly used in used variant in variant detection detection workfl workflowsows and therefore and therefore it can be it caneasily be imple- easily mentedimplemented in larger in largerpipelines. pipelines. The pipeline The pipeline has a low has RAM a low requirement RAM requirement and takes and a rather takes shorta rather computation short computation time. Testing time. of Testing the pipe ofline the in pipeline evolved in populations evolved populations of the fungus of the F. fungus F. oxysporum demonstrated that this easy-to-use tool could effectively detect new TE oxysporum demonstrated that this easy-to-use tool could effectively detect new TE inser- insertion events, making it accessible and practical to address TE activity-related biological tion events, making it accessible and practical to address TE activity-related biological questions in population genomics, genome evolution, and other applications. Successfully questions in population genomics, genome evolution, and other applications. Successfully detecting TIPs using Drosophila melanogaster and Arabidopsis thaliana data demonstrated detecting TIPs using Drosophila melanogaster and Arabidopsis thaliana data demonstrated feasibility in applying TEfinder in plants, animals, and fungi with high sensitivity. feasibility in applying TEfinder in plants, animals, and fungi with high sensitivity. Paired-end sequence reads are the foundation to read out genetic changes in an Paired-end sequence reads are the foundation to read out genetic changes in an indi- individual or a population when comparing to a reference genome sequence. Therefore, vidual or a population when comparing to a reference genome sequence. Therefore, suf- sufficient sequence coverage and good quality of the sequence reads are important, as for ficient sequence coverage and good quality of the sequence reads are important, as for all all variant calling software. However, even with a small number of read evidence, the variant calling software. However, even with a small number of read evidence, the pipe- pipeline can still capture low-frequency events. The optional parameters can be used to line can still capture low-frequency events. The optional parameters can be used to adjust adjust the strictness of the algorithm. the strictnessOur data of suggest the algorithm. that as long as the reference genome sequence is reasonably assem- bled,Our the qualitydata suggest of the that reference as long genome as the reference assembly genome should sequence not affect is the reasonably performance assem- of bled,the TEfinder the quality pipeline. of the For reference instance, genome the assemblies assembly of should the lineage-specific not affect the genomeperformance regions of theof the TEfinder Fol4287 pipeline. genome For used instance, to test TEfinderthe assemblies were fragmented of the lineage-specific due to high genome repeat content,regions

Genes 2021, 12, 224 10 of 11

while chromosome-level assembly was accomplished for most core regions. Importantly, TEfinder was able to capture the events in both types of regions. TEfinder depends on read placement within a genome. Ambiguity exists in such processes especially when placing a read into a highly repetitive region, which may result in lower confidence level in calling a TIP. Such insertion events, as well as nested TEs, might still be detected and reported by TEfinder with a tag “in repeat”, to differentiate them from TIP events with high confidence. As with other variant calling software, the output files need to be filtered before further analysis. Users can utilize the internal filter tags and reported forward, reverse, and total read count values according to their needs. The output BAM file is also useful to do visual confirmations. One feature missing from the output is the allele frequency of the events. However, the read counts and insertion region positions can be used to estimate allele frequencies. We were able to experimentally verify some of the events. Although we are in the early phase of understanding the functional impact of these TIPs, the ability to detect these events with high confidence enables hypothesis generation and establishment of targeted functional studies. Testing of the TEfinder pipeline in evolved fungal populations further confirmed the effectiveness of this tool in identifying TE insertion events in non- reference eukaryotic genomes.

5. Conclusions TEfinder is a tool for detecting new TE insertions in fungal, , or genomes via paired-end resequencing data. TEfinder has a small number of external software requirements and input files, making it an easy-to-use and accessible tool for the detection of new TE insertions events.

Supplementary Materials: The following are available online at https://www.mdpi.com/2073-442 5/12/2/224/s1; the TEfinder pipeline is available on GitHub at https://github.com/VistaSohrab/ TEfinder, accessed on 28 January 2021 (doi:10.5281/zenodo.4446970). Supplementary File: Zipped TEfinder BED outputs of evolving populations (Y1–Y5), the reduced coverage sample sets of Y2 pop- ulation, and the simulation data with dm3 chromosome 2 L sequence. Author Contributions: Conceptualization, V.S., L.-J.M. and D.H.A.; software, V.S. and D.H.A.; data generation, C.L.-D., D.H.A., L.-J.M. and A.D.P.; writing—original draft preparation, V.S., L.-J.M. and D.H.A.; review and editing, all authors contributed. All authors have read and agreed to the published version of the manuscript. Funding: V.S.was supported by University of Massachusetts Amherst Honors College Research grant. D.H.A. and L.-J.M. were supported by the National Institute of Health (R01EY030150) and Burroughs Welcome Foundation (1014893). C.L.-D. was supported by FEDER Junta de Andalucía (27374-R). The sequencing data were generated through the support of the United States Department of Agriculture, National Institute of Food and Agriculture (Grant awards 2011-35600-30379 and MASR-2009-04374) and the National Science Foundation (IOS-1652641), and by the Spanish Ministerio de Ciencia e Innovación MICINN (PID2019-108045RB-I00). Acknowledgments: We thank the Massachusetts Green High Performance Computing Center for providing high-performance computing capacity for the development and testing of our tool. Conflicts of Interest: The authors declare no conflict of interest.

References 1. Chénais, B.; Caruso, A.; Hiard, S.; Casse, N. The impact of transposable elements on eukaryotic genomes: From increase to genetic adaptation to stressful environments. Gene 2012, 509, 7–15. [CrossRef] 2. Bourque, G.; Burns, K.H.; Gehring, M.; Gorbunova, V.; Seluanov, A.; Hammell, M.; Imbeault, M.; Izsvák, Z.; Levin, H.L.; Macfarlan, T.S.; et al. Ten things you should know about transposable elements. Genome Biol. 2018, 19, 1–12. [CrossRef][PubMed] 3. Huang, C.R.L.; Burns, K.H.; Boeke, J.D. Active transposition in genomes. Annu. Rev. Genet. 2012, 46, 651–675. [CrossRef] 4. Munoz-Lopez, M.; Garcia-Perez, J. DNA Transposons: Nature and Applications in Genomics. Curr. Genom. 2010, 11, 115–128. [CrossRef] 5. Burns, K.H. Transposable elements in cancer. Nat. Rev. Cancer 2017, 17, 415–424. [CrossRef] 6. Jönsson, M.E.; Garza, R.; Johansson, P.A.; Jakobsson, J. Transposable Elements: A Common Feature of Neurodevelopmental and Neurodegenerative Disorders. Trends Genet. 2020, 36, 610–623. [CrossRef] Genes 2021, 12, 224 11 of 11

7. Seidl, M.F.; Thomma, B.P.H.J. Transposable Elements Direct The Coevolution between Plants and Microbes. Trends Genet. 2017, 33, 842–851. [CrossRef] 8. Wicker, T.; Sabot, F.; Hua-Van, A.; Bennetzen, J.L.; Capy, P.; Chalhoub, B.; Flavell, A.; Leroy, P.; Morgante, M.; Panaud, O.; et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 2007, 8, 973–982. [CrossRef] 9. Hawkey, J.; Hamidian, M.; Wick, R.R.; Edwards, D.J.; Billman-Jacobe, H.; Hall, R.M.; Holt, K.E. ISMapper: Identifying insertion sites in bacterial genomes from short read sequence data. BMC Genom. 2015, 16, 667. [CrossRef] 10. Gardner, E.J.; Lam, V.K.; Harris, D.N.; Chuang, N.T.; Scott, E.C.; Pittard, W.S.; Mills, R.E.; Devine, S.E. The Mobile Element Locator Tool (MELT): Population-scale mobile element discovery and biology. Genome Res. 2017, 27, 1916–1929. [CrossRef] 11. 1000 Genomes Project Consortium. A global reference for . Nature 2015, 526, 68–74. [CrossRef] 12. Langmead, B.; Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359. [CrossRef] 13. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842. [CrossRef][PubMed] 14. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R. The /Map format and SAMtools. Bioinformatics 2009, 25, 2078–2079. [CrossRef][PubMed] 15. Picard Toolkit. Available online: http://broadinstitute.github.io/picard (accessed on 28 January 2021). 16. Zhuang, J.; Wang, J.; Theurkauf, W.; Weng, Z. TEMP: A computational method for analyzing transposable element polymorphism in populations. Nucleic Acids Res. 2014, 42, 6826–6838. [CrossRef] 17. Schmitz, R.J.; Schultz, M.D.; Urich, M.A.; Nery, J.R.; Pelizzola, M.; Libiger, O.; Alix, A.; McCosh, R.B.; Chen, H.; Schork, N.J.; et al. Patterns of population epigenomic diversity. Nature 2013, 495, 193–198. [CrossRef] 18. Quadrana, L.; Bortolini Silveira, A.; Mayhew, G.F.; LeBlanc, C.; Martienssen, R.A.; Jeddeloh, J.A.; Colot, V. The Arabidopsis thaliana mobilome and its impact at the species level. Elife 2016, 5.[CrossRef] 19. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25, 1754–1760. [CrossRef] 20. Marco-Sola, S.; Sammeth, M.; Guigó, R.; Ribeca, P. The GEM mapper: Fast, accurate and versatile alignment by filtration. Nat. Methods 2012, 9, 1185–1188. [CrossRef] 21. Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 2015, 6, 11. [CrossRef] 22. Smit, A.F.A.; Hubley, R. RepeatModeler Open-1.0. 2015. Available online: http://www.repeatmasker.org (accessed on 28 January 2021). 23. Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 2005, 21, i351–i358. [CrossRef] 24. Smit, A.F.A.; Hubley, R.; Green, P. RepeatMasker Open-4.0. 2015. Available online: http://www.repeatmasker.org (accessed on 28 January 2021). 25. Afgan, E.; Baker, D.; Batut, B.; van den Beek, M.; Bouvier, D.; Cech,ˇ M.; Chilton, J.; Clements, D.; Coraor, N.; Grüning, B.A.; et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018, 46, W537–W544. [CrossRef] 26. Thorvaldsdottir, H.; Robinson, J.T.; Mesirov, J.P. Integrative Genomics Viewer (IGV): High-performance genomics data visualiza- tion and exploration. Brief. Bioinform. 2013, 14, 178–192. [CrossRef] 27. Ma, L.J.; Van Der Does, H.C.; Borkovich, K.A.; Coleman, J.J.; Daboussi, M.J.; Di Pietro, A.; Dufresne, M.; Freitag, M.; Grabherr, M.; Henrissat, B.; et al. Comparative genomics reveals mobile pathogenicity in Fusarium. Nature 2010, 464, 367–373. [CrossRef] 28. Zhang, Y.; Yang, H.; Turra, D.; Zhou, S.; Ayhan, D.H.; DeIulio, G.A.; Guo, L.; Broz, K.; Wiederhold, N.; Coleman, J.J.; et al. The genome of opportunistic fungal pathogen Fusarium oxysporum carries a unique set of lineage-specific chromosomes. Commun. Biol. 2020, 3, 50. [CrossRef] 29. Kistler, H.C.; Rep, M.; Ma, L.-J. Structural dynamics of Fusarium genomes. In Fusarium, Genomics, Molecular and Cellular Biology; Caister Academic Press: Norfolk, UK, 2013; pp. 31–42. 30. Ayhan, D.H.; López-Díaz, C.; Di Pietro, A.; Ma, L.-J. Improved Assembly of Reference Genome Fusarium oxysporum f. sp. lycopersici Strain Fol4287. Microbiol. Resour. Announc. 2018, 7.[CrossRef] 31. Raeder, U.; Broda, P. Rapid preparation of DNA from filamentous fungi. Lett. Appl. Microbiol. 1985, 1, 17–20. [CrossRef]