Ploidy-Aware Syntenic Orthologous Networks Identified Via Collinearity
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.18.431864; this version posted February 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 pSONIC: Ploidy-aware Syntenic Orthologous Networks Identified via Collinearity 2 Justin L Conover*, Joel Sharbrough†,‡, Jonathan F Wendel* 3 4 * – Department of Ecology, Evolution, and Organismal Biology, Iowa State University, 5 Ames, IA, USA 6 † – Biology Department, Colorado State University, Fort Collins, CO, USA 7 ‡ – Biology Department, New Mexico Institute of Mining and Technology, Socorro, NM, 8 USA 1 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.18.431864; this version posted February 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 9 Running Title: Ploidy-aware syntenic orthologs 10 Key words: orthology, synteny, polyploidy, OrthoFinder, MCScanX 11 12 13 *Corresponding author: Jonathan F Wendel 14 Mailing address: Department of EEOB, 251 Bessey Hall, 2200 Osborn Dr, Ames, IA 15 50011 16 Phone Number: (515) 294-7172 17 Email:[email protected] 18 19 20 ORCID ID: 21 JLC: 0000-0002-3558-6000 22 JS: 0000-0002-3642-1662 23 JFW: 0000-0003-2258-5081 2 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.18.431864; this version posted February 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 24 ABSTRACT 25 With the rapid rise in availaBility of high-quality genomes for closely related species, 26 methods for orthology inference that incorporate synteny are increasingly useful. 27 Polyploidy perturbs the 1:1 expected frequencies of orthologs Between two species, 28 complicating the identification of orthologs. Here we present a method of ortholog 29 inference, Ploidy-aware Syntenic Orthologous Networks Identified via Collinearity 30 (pSONIC). We demonstrate the utility of pSONIC using four species in the cotton triBe 31 (Gossypieae), including one allopolyploid, and place Between 75-90% of genes from 32 each species into nearly 32,000 orthologous groups, 97% of which consist of at most 33 singletons or tandemly duplicated genes -- 58.8% more than comparaBle methods that 34 do not incorporate synteny. We show that 99% of singleton gene groups follow the 35 expected tree topology, and that our ploidy-aware algorithm recovers 97.5% identical 36 groups when compared to splitting the allopolyploid into its two respective suBgenomes, 37 treating each as separate “species”. 38 39 40 INTRODUCTION 41 The recent explosion in high-quality genome assemblies has increased the opportunity 42 to investigate Biological questions using a comparative genomics framework. An 43 essential first step in many applications is inference of a high-confidence set of 44 orthologs in the genomes under study. Methods for inferring orthologs are Broadly 45 Based on sequence similarity, either through the construction of phylogenetic trees or 46 through clustering of sequence similarity scores. There has Been consideraBle progress 3 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.18.431864; this version posted February 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 47 in developing methods that curate a genome-wide set of orthologs for distantly-related 48 genomes (Trachana et al. 2011; Emms and Kelly 2020), prioritizing flexiBility for use on 49 species with fragmented genome assemblies or even transcriptome assemblies, e.g. 50 Inparanoid (O’Brien et al. 2005), OrthoMCL (Li et al. 2003), and OrthoFinder (Emms 51 and Kelly 2015, 2019). As genomes for closely-related species Become more prevalent, 52 however, methods designed for deep-phylogenetic identification are less than optimal, 53 as new methods can leverage conserved gene order across closely-related species (i.e. 54 synteny) as powerful evidence for orthology. Two closely related species have largely 55 collinear genomes, Barring chromosomal rearrangements or small-scale gene loss or 56 gain events (e.g. via transposition) that Break up Blocks of collinear genes (Dehal and 57 Boore 2005). Programs have Been developed to identify these collinear Blocks (e.g. 58 MCScanX (Wang et al. 2012) and CoGe (Lyons et al. 2008)) But these methods are 59 restricted to pairwise comparisons (MCScanX) or comparisons among three genomes 60 (CoGe), and no method for genome-wide detection of orthologs across multiple species 61 has yet incorporated the powerful evidence of orthology provided By synteny. 62 A Biological feature that can complicate orthology inference is whole genome 63 multiplication (polyploidy), which is widespread throughout the tree of life (Van de Peer 64 et al. 2017; Li et al. 2018), especially in plants (Jiao et al. 2011; One Thousand Plant 65 Transcriptomes Initiative 2019). In the case of ancient polyploids, extensive gene 66 deletion and chromosome rearrangement (Wendel 2015) often oBscures the expected 67 number of gene copies that should Be present in a genome and complicates pairwise 68 genome alignments, syntenic Block detection, and gene tree - species tree 69 reconciliation. Differences in ancestral ploidy levels have not Been integrated into 4 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.18.431864; this version posted February 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 70 existing programs for detecting orthologs, although this is essential for oBtaining 71 accurate estimates of orthogroup completeness. 72 Here, we present a new method of ortholog inference, Ploidy-aware Syntenic 73 Orthologous Networks Identified via Collinearity (pSONIC), which uses pairwise 74 collinearity Blocks from multiple species inferred via MCScanX, along with a high- 75 confidence set of singleton orthologs identified through OrthoFinder, to curate a 76 genome-wide set of syntenic orthologs. As part of pSONIC’s inference, we developed a 77 ploidy-aware algorithm to identify collinear Blocks originating from Both speciation and 78 duplication events. To evaluate pSONIC’s performance in a system with a complex 79 history of duplication and speciation, we tested pSONIC on four species in the cotton 80 triBe (Gossypieae), including one allopolyploid, its two closest diploid progenitors, and a 81 phylogenetic outgroup. Our method assigned Between 75-90% of all genes into 82 orthogroups, and when compared to OrthoFinder, identifies 40% more single-copy 83 orthogroups (97% of which exhiBit gene tree topologies consistent with the species 84 relationships within the Gossypieae) and 33% more orthogroups that contain only 85 tandemly duplicated genes from each species. To demonstrate the effectiveness of our 86 ploidy-aware algorithm, we show that, unlike OrthoFinder, splitting the tetraploid 87 genome into its respective genomes has little effect on our final set of orthogroups. 88 89 90 METHODS 91 Required input for pSONIC includes the list of orthogroups inferred from OrthoFinder (- 92 og flag) and the files containing the list of collinearity groups and tandemly duplicated 5 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.18.431864; this version posted February 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 93 genes from MCScanX (default parameters). To reduce the memory requirements of 94 MCScanX and pSONIC, gene names are converted to the style of OrthoFinder (this can 95 Be done using the --translate_gff flag in the pSONIC program). Additionally, an optional 96 file providing the relative degree of ploidy increase for each species can Be used for 97 analyses in which a polyploidy event has occurred along the phylogeny of the species in 98 the analysis. In the example Below from the cotton triBe, we run two analyses to 99 demonstrate this: one in which the suBgenomes of allopolyploid Gossypium hirsutum 100 (2n = 4x) is run normally (i.e. the relative ploidy is 2 compared to all other species in the 101 analysis), and a second in which the genome has Been split into its respective 102 suBgenomes, with each treated as a separate “species”. This feature allows the 103 possiBility of syntenic analysis of genomes with varying complexities of ploidy histories, 104 including those where clear partitioning into suBgenomes is not possiBle. 105 The pSONIC pipeline proceeds in four Basic steps. First, OrthoFinder results are 106 parsed to find “tethers” -- that is, orthogroups in which at least two species have fewer 107 than or equal to the number of “gene sets” expected from relative ploidy levels. Here, 108 we define “gene sets” to include a gene and all immediately neighBoring tandemly 109 duplicated genes (as determined By MCScanX). Using “gene sets” instead of singleton 110 genes dramatically increases the number of tethers that can Be used in steps two and 111 three without creating spurious cases of inferred collinearity (TaBle 1). For any 112 orthogroup in which some, But not all, species contain more gene sets than expected By 113 ploidy, genes from these species are excluded while genes from all other species are 114 included in downstream steps.