The Gene Family-Free Median of Three
Total Page:16
File Type:pdf, Size:1020Kb
Doerr et al. Algorithms Mol Biol (2017) 12:14 DOI 10.1186/s13015-017-0106-z Algorithms for Molecular Biology RESEARCH Open Access The gene family‑free median of three Daniel Doerr1,2* , Metin Balaban1, Pedro Feijão2 and Cedric Chauve3 Abstract Background: The gene family-free framework for comparative genomics aims at providing methods for gene order analysis that do not require prior gene family assignment, but work directly on a sequence similarity graph. We study two problems related to the breakpoint median of three genomes, which asks for the construction of a fourth genome that minimizes the sum of breakpoint distances to the input genomes. Methods: We present a model for constructing a median of three genomes in this family-free setting, based on maximizing an objective function that generalizes the classical breakpoint distance by integrating sequence similarity in the score of a gene adjacency. We study its computational complexity and we describe an integer linear program (ILP) for its exact solution. We further discuss a related problem called family-free adjacencies for k genomes for the special case of k ≤ 3 and present an ILP for its solution. However, for this problem, the computation of exact solutions remains intractable for sufciently large instances. We then proceed to describe a heuristic method, FFAdj-AM, which performs well in practice. Results: The developed methods compute accurate positional orthologs for genomes comparable in size of bacte- rial genomes on simulated data and genomic data acquired from the OMA orthology database. In particular, FFAdj- AM performs equally or better when compared to the well-established gene family prediction tool MultiMSOAR. Conclusions: We study the computational complexity of a new family-free model and present algorithms for its solution. With FFAdj-AM, we propose an appealing alternative to established tools for identifying higher confdence positional orthologs. Keywords: Family-free genome comparison, Positional orthology, Breakpoint median Background neighborhoods to the input gene orders. Comparing Te presented work relates to the branch of research gene orders of distinct species presupposes knowledge that studies the structural organization of genomes of positional- (sometimes also called main-) ortholo- across species. Genome structures are subject to change gies between their constituting genes. Tis is where our caused by large-scale mutations. Such mutations per- approach difers from previous work: Whereas tradi- mute the order or alter the composition of functional, tionally genes are required to form equivalence classes inheritable entities, subsequently called genes, in genome across gene orders such that each genome contains one sequences. Te breakpoint median constitutes a fam- and only one member of each class, our model only ily of well-studied problems that mainly difer through assumes a symmetric and refexive similarity measure. varying karyotypic constraints [1]. A general, uncon- Te tasks of forming one-to-one relationships between strained variant asks to construct a fourth gene order, genes (i.e. computing a matching) and fnding a median called a median, composed of one or more linear or cir- are then combined into a single objective. Our approach cular chromosomes, from three given gene orders, such has the decisive advantage of solving what was formerly a that this median maximizes the sum of conserved gene circularity problem: a median provides valuable insights into positional conservation, yet knowledge of posi- tional orthologies are already a prerequisite of traditional *Correspondence: [email protected]‑bielefeld.de 2 Faculty of Technology and Center for Biotechnology (CeBiTec), breakpoint median problems. Resolving this antilogy, our Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany approach continues a research program outlined in [2] Full list of author information is available at the end of the article © The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Doerr et al. Algorithms Mol Biol (2017) 12:14 Page 2 of 14 (see also [3]) under the name of (gene) family-free gene algorithm. However, further methods have been pro- order comparison. So far, family-free methods have been posed that integrate synteny and sequence conserva- developed for the pairwise comparison of genomes [4–6] tion for inferring orthogroups, see [10]. Our approach and shown to be efective for orthology analysis [7]. difers frst and foremost in its family-free principle (all Te prediction of evolutionary relationships between other methods require a prior gene family assignment). genomic sequences is a long-standing problem in com- Compared to MultiMSOAR, the only other method that putational biology. According to Fitch [8], two genomic can handle more than two genomes with an optimiza- sequences are called homologous if they descended from tion criterion that considers gene order evolution, both a common ancestral sequence. Furthermore, Fitch iden- MultiMSOAR (for three genomes) and FF-Median aim tifes diferent events that give rise to a branching point at computing a maximum weight tripartite matching. in the phylogeny of homologous sequences, leading to However we difer fundamentally from MultiMSOAR the well-established concepts of orthologous genes (who by the full integration of sequence and synteny conserva- descend from their last common ancestor through a spe- tion into the objective function, while MultiMSOAR pro- ciation) and paralogous genes (descending from their ceeds frst by computing pairwise orthology assignments last common ancestor through a duplication) [9]. Until to defne a multipartite graph. quite recently, orthology and paralogy relationships were mostly inferred from sequence similarity. However it is The gene family‑free median of three now well accepted that the syntenic context can carry The family‑free principle valuable evolutionary information, which has lead to the In the gene family-free framework, we are given all- notion of positional orthologs [10], which are orthologs against-all gene similarities through a symmetric and R whose syntenic context was not changed in a duplication refexive similarity measure σ : � × � → ≥0 over the event. universe of genes [2]. We use sequence similarity but Most methods for detecting potential orthologous other similarity measures can ft the previous defni- groups require a prior clustering of the genes of the con- tion. Tis leads to the formalization of the gene similarity sidered genomes into homologous gene families, defned graph [2], i.e. a graph where each vertex corresponds to a as groups of genes assumed to originate from a single gene of the dataset and where each pair of vertices asso- ancestral gene. Yet clustering protein sequences into ciated with genes of distinct genomes are connected by a families is already in itself a difcult problem. In the pre- strictly positively weighted edge according to gene simi- sent work, we describe two methods to infer likely posi- larity measure σ. Ten gene family or homology assign- tional orthologies for a group of three genomes. Te frst ments represent a particular subgroup of gene similarity method solves a new problem we introduce, the gene functions that require transitivity. Independent of the family-free median of three. It generalizes the traditional particular similarity measure σ, relations between genes breakpoint median problem [1]. Our second method imposed by σ are considered as candidates for homology makes use of the frst exact algorithm that solves the assignments. problem family-free adjacencies for k genomes (FF-Adja- cencies) that has been introduced by Braga et al. in [2], Extant genomes, genes and adjacencies for the special case where k ≤ 3. We then discuss the In this work, a genome G is entirely represented by a methods’ abilities to solve the biological question at hand tuple G ≡ (C, A), where C denotes a non-empty set of and study their computational complexity. We show that unique genes, and A is a set of adjacencies. Genes are t h our approach can be used for positional ortholog predic- represented by their extremities, i.e., a gene g ≡ (g , g ) , h t tion in simulated and real data sets of bacterial genomes. g ∈ C, consists of a head g and a tail g . Telomeres are modeled explicitly, as special genes of C(G) with a single a b Related problems extremity, denoted by “◦”. Extremities g1 , g2, a, b ∈{h, t} a b Te FF-Median problem relates to previously studied of any two genes g1, g2 form an adjacency {g1 , g2 } if they gene order evolution problems. It is a generalization of are immediate neighbors in their genome sequence. In the tractable mixed multichromosomal median problem the following, we will conveniently use the notation C(G) introduced in [1], that can indeed be defned as an FF- and A(G) to denote the set of genes and the set of adja- Median problem with a similarity graph composed of cencies of genome G, respectively. We indicate the pres- a b disjoint 3-cliques